WO2017080090A1 - Extraction and comparison method for text of webpage - Google Patents

Extraction and comparison method for text of webpage

Info

Publication number
WO2017080090A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
text
tags
sub
module
Application number
PCT/CN2015/100180
Other languages
French (fr)
Chinese (zh)
Inventor
孙燕群
Original Assignee
孙燕群
Application filed by 孙燕群
Publication of WO2017080090A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562 Bookmark management

Definitions

  • the invention relates to a computer network technology method, in particular to a web page text extraction comparison method.
  • the main web page text extraction methods are as follows: DOM-based web page text extraction method, statistics-based web page text extraction method, block-based web page text extraction method, and other web page text extraction methods.
  • the Document Object Model is a standard interface specification developed by the W3C. Because the DOM nodes are organized based on the tree's hierarchy, after the tree structure is established, the original operations on the web page can be converted into operations through the tree. Although the web page structure can be converted into a DOM tree format according to the standards set by the W3C organization, in fact many web pages do not follow the standard. Therefore, when the DOM method is used, it usually needs a preprocessing module to finally abstract the web page into a DOM tree.
  • the DOM-based web page text extraction method is a DOM-based web page content extraction method, and its original purpose is to improve the PDA application and remove the advertisement content.
  • the DOM method abstracts the content of the web page into corresponding objects and converts them into the form of nodes; then organizes the nodes with the parent-child relationship to form a tree structure.
  • the structure of web pages from the same website on the Internet is mostly the same.
  • for example, the <body> tag of a Yahoo News page is composed of two tags, <iframe> and <div>, so web page templates of this kind can be grouped into one class.
  • clustering similar DOM trees requires calculating their similarity.
  • the procedure for calculating the similarity of two simple DOM trees is: the first step is to judge whether the root nodes of the two trees are the same; if they are not the same, return 0; if they are the same, continue to compare the leaf nodes of the two trees.
  • the second step compares the names and attributes of the leaf nodes of the two DOM trees and returns the number of identical nodes in the two DOM trees.
  • the statistics-based method is mainly used to extract the body of news web pages.
  • the principle of this method is that the web page body information can only be located in a <table> tag node of the web page.
  • the basic steps of the method are as follows: the first step removes the noise of the page and represents the web page as a tree according to its tags; the second step processes each <table> node, removing the HTML tags in the node to obtain a string without any tags; the third step compares the number of characters in each node, and the node with the largest number of characters is usually selected as the body of the web page.
  • the advantage of this method is that it exploits the characteristics of news web pages, generalizes well, is simple to implement, does not need a different template for each web page, does not require sample learning, and has low time complexity.
  • the disadvantage is that the algorithm only applies when all the body text of the web page is placed in a single <table> node, and it performs poorly on web pages whose body is spread over multiple <table> nodes. With the rise of Weibo, light blogs, and similar services, more and more web pages with complex formats and short texts are being produced, and the limitations of this method are becoming more obvious.
  • the problem to be solved by the present invention is to provide a web page text extraction and comparison method based on topic similarity; the results show that the method of the present invention achieves a large improvement in accuracy.
  • the present invention provides a web page text extraction and comparison method, comprising the following steps:
  • Step A: determining whether the web page is a text page based on specific tags of the web page;
  • Step B: identifying parallel web pages;
  • Step C: for a Chinese web page, the body usually contains Chinese punctuation, while the title contains little or none. A threshold on the number of Chinese punctuation marks is therefore set, and the Chinese text in each <p> tag of the web page is judged against it: if the number of Chinese punctuation marks is greater than the given threshold, the text is added to the body. Multiple consecutive <p> tags (one or two other tags may appear between the p tags) are then obtained, and their text is added to the body by the same judgment.
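The punctuation test in Step C can be sketched as follows. This is a minimal illustration, not the patent's implementation: the set of punctuation marks, the default threshold of 2, and the function names are all assumptions, and the merging of consecutive <p> tags separated by one or two other tags is left out.

```python
import re

# Common Chinese punctuation marks; this particular set is an assumption.
CN_PUNCT = set("，。、；：？！“”‘’（）《》")
P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def count_cn_punct(text):
    """Count how many characters of text are Chinese punctuation marks."""
    return sum(1 for ch in text if ch in CN_PUNCT)


def body_paragraphs(html, threshold=2):
    """Return the text of each <p> whose punctuation count exceeds threshold."""
    kept = []
    for inner in P_TAG.findall(html):
        text = ANY_TAG.sub("", inner).strip()  # drop tags nested inside <p>
        if count_cn_punct(text) > threshold:
            kept.append(text)
    return kept
```

A short navigation label such as "首页" contains no punctuation and is dropped, while a full sentence with several commas and full stops is kept.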
  • the step A may further comprise the following sub-steps:
  • Step 1: preprocessing the web page to construct an HTML tree;
  • Step 2: pruning the HTML tree;
  • Step 3: obtaining the web page theme;
  • Step 4: extracting the string content in the block;
  • Step 5: calculating the distance between the subject S and the content y in a block;
  • Step 6: comparing the edit distance L with max(p, q).
  • Step 2 may further include the following sub-step: partitioning the page into blocks according to the <table> tag, and removing leaf nodes that contain neither text nor link information.
  • Step 5 may further include: segmenting Chinese text into words, and using the Levenshtein distance as shown in formula (2) and formula (3):
  • the step B may further include: a feature information extraction sub-step and a support vector machine classification sub-step;
  • the feature information extraction sub-step further includes:
  • the feature information includes webpage HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information;
  • HTML tags are divided into three types, structure tags, format tags, and irrelevant tags, according to their web page layout, display, and link functions:
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.
  • the edit distance between two strings is the minimum number of edit operations required to convert one string into the other;
  • editing operations include replacing one character with another, inserting one character, and deleting one character;
  • the improved edit distance is defined as the minimum operation cost of converting one string, made up of tags of different types, into another string by deletion, insertion, and replacement.
  • the present invention also provides a webpage text extraction and comparison system, comprising the following modules:
  • Module A: used to determine whether a web page is a text page based on specific tags of the web page;
  • Module B: used to identify parallel web pages.
  • the module A may further comprise the following sub-modules:
  • Pre-processing sub-module: used to pre-process the web page and construct an HTML tree;
  • Pruning sub-module: used to prune the HTML tree;
  • Block extraction sub-module: used to extract the string content within a block;
  • Distance calculation sub-module: used to calculate the distance between the subject S and the content y within a block;
  • Distance comparison sub-module: used to compare the edit distance L with max(p, q).
  • the pruning sub-module may be further configured to: partition the page into blocks according to the <table> tag, and remove leaf nodes that contain neither text nor link information.
  • the distance calculation sub-module may be further used to: segment Chinese text into words, with the Levenshtein distance used being as shown in formula (2) and formula (3):
  • the module B may further include the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module;
  • the feature information extraction sub-module is used as follows:
  • the feature information includes web page HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information;
  • HTML tags are divided into three types, structure tags, format tags, and irrelevant tags, according to their web page layout, display, and link functions:
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.
  • the edit distance between two strings is the minimum number of edit operations required to convert one string into the other;
  • editing operations include replacing one character with another, inserting one character, and deleting one character;
  • the improved edit distance is defined as the minimum operation cost of converting one string, made up of tags of different types, into another string by deletion, insertion, and replacement.
  • comparing the conventional web page blocking algorithm with the web page text extraction and comparison method of the present invention, which is based on topic-similarity blocking, the latter has the following advantages:
  • cluster analysis, which is very time consuming, is not required. The entropy of the blocks does not need to be calculated either; the judgment can be made by analyzing the web page itself.
  • the theme referred to in the topic-similarity-based web page text extraction and comparison method of the invention is the title and tags of the web page.
  • the algorithm of the present invention does not calculate the entropy of the content block, and mainly uses the similarity of the topic and the content block as the judgment basis of the extracted block.
  • the main features of a web page are:
  • the web page format has a tree structure.
  • web page tags are usually nested in pairs, so they can be converted into an HTML tree structure; in fact, the DOM-based web page text extraction method also takes advantage of this feature.
  • the tree structure of HTML is constructed in the method of the present invention mainly to cut off useless branches and reduce the amount of calculation.
  • web pages are usually arranged in blocks.
  • each web page basically includes the following blocks: a classification block, a navigation block, a text block, a related link block, and an advertisement information block.
  • because web page tags are usually nested in pairs, tags can be used to partition a web page into blocks.
  • the <table></table> tag pair has a good layout feature, and most web pages now use the <table> tag to lay out the page format finally presented to the user.
  • the web page text extraction method builds on this and uses the <table> tag to parse the web page.
  • the theme and the content are related.
  • web pages usually have a title and a number of tags that give a high-level summary of the body of the page, so the theme actually reflects the characteristics of the body and represents the key content of the page. This was not considered in previous web page extraction methods.
  • the method of the present invention uses the relationship between the subject and the text as an important index for text extraction. In particular, the structure of mobile Internet web pages is increasingly diverse, the length of web page content varies, interfering information such as advertisements is abundant, and short-text content is easily submerged in advertisement information, so considering the similarity between the theme and the web page content is indispensable in web page extraction.
  • the indicator for measuring similarity in the present invention is the edit distance (i.e., the Levenshtein distance).
  • the Levenshtein distance is the minimum number of insertions, deletions, and substitutions required to convert from the original string (a) to the target string (b).
  • the Levenshtein formula is shown in the following equation (1):
  • a, b are strings, i is the length of the string a, and j is the length of the string b.
  • the basic idea of the web page text extraction method based on topic-similarity blocking is as follows: convert the web page into an HTML tree structure; extract the theme of the web page; extract content blocks by using the web page tags; and compute the edit distance (Levenshtein distance) L between the theme and each content block. When the distance L is smaller than the length p of a content block, the block is regarded as web page body content; when the distance L is greater than or equal to the length of the content block, the block is ignored.
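As an illustration of the edit distance described above, the following is a minimal sketch of the textbook dynamic-programming computation of the Levenshtein distance. It follows the verbal definition (insertions, deletions, and substitutions each cost 1); it is not taken from the patent's equation (1), and the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Works on any two sequences, for example strings or lists of words.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty prefix takes i deletions
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or free match if equal
            ))
        prev = cur
    return prev[-1]
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of b rather than with the product of both lengths.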
  • the algorithm of the present invention includes three main steps: constructing an HTML tree, extracting the web page theme, and calculating the similarity between the theme and the blocks;
  • the basic steps of the algorithm are as follows:
  • Step 1: web page preprocessing, constructing an HTML tree. Normalize the web page and finally map it into a tree structure, including the following sub-steps:
  • each start tag corresponds to an end tag, such as <body> with </body> and <head> with </head>.
  • the tags are nested correctly, such as <a><b></b></a>. Only correctly nested tags can be iterated over correctly.
  • Step 2: pruning the HTML tree. Since blocks are segmented according to the <table> tag, some leaf nodes contain neither text nor link information; these useless branches are removed, reducing the amount of computation.
  • Step 3: getting the web page theme. Obtain the content of the page title, the headings at each level <h1> to <h6>, and the <meta> tag. For Chinese, the ICTCLAS word segmentation system proposed by the Chinese Academy of Sciences can be used to segment this text; function words, stop words, and the like are then removed, finally yielding the sequence Stitle containing only content words.
  • Step 4: extracting the string content in the block. First, the leaf nodes of the HTML tree, that is, the subtree corresponding to the innermost <table> tag, are merged into one block; the HTML markup in the block is removed, and the string content y of the block is obtained.
  • Step 5: calculating the distance between the subject S and the content y within a block. For Chinese, the text must first be segmented into words, again using the Chinese Academy of Sciences word segmentation system from Step 3.
  • the Levenshtein distance specifically used in the present invention is as shown in formulas (2) and (3):
  • Step 6: comparing the edit distance L with max(p, q). If L < max(p, q), the block is body information and is extracted; otherwise it is recognized as interference information and ignored. Finally, the body information of the web page is obtained.
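Steps 5 and 6 can be sketched as follows. This is a hedged illustration only: formulas (2) and (3) are not reproduced in this text, so the plain Levenshtein distance over word sequences is swapped in for them; p is taken to be the block length and q the theme length, which is an assumption; and the token lists stand in for the output of a word segmenter.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def select_body_blocks(theme_tokens, blocks):
    """Keep the blocks whose edit distance L to the theme is below max(p, q)."""
    q = len(theme_tokens)  # assumed meaning of q: length of the theme sequence
    body = []
    for tokens in blocks:
        p = len(tokens)  # length of the content block
        L = levenshtein(theme_tokens, tokens)
        if L < max(p, q):  # Step 6: body information, extract it
            body.append(tokens)
        # otherwise the block is treated as interference and ignored
    return body
```

With this rule, a block that shares no aligned words with the theme reaches the maximum possible distance max(p, q) and is discarded, while a block that repeats the theme words scores lower and is kept.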
  • the web page text extraction and comparison method of the present invention further includes the identification of parallel web pages.
  • the parallel web page identification of the invention mainly comprises two parts: feature information extraction and support vector machine classification.
  • the feature information mainly includes web page HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information.
  • HTML tags are divided into three types, structure tags, format tags, and irrelevant tags, according to their different functions in web page layout, display, and linking:
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.;
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.;
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are deleted when calculating structural symmetry.
  • the similarity of the classified HTML tag sequences is calculated using the improved edit distance.
  • the edit distance between two strings is the minimum number of edit operations required to convert one string into the other.
  • the edit operations consist of replacing one character with another, inserting one character, and deleting one character.
  • the improved edit distance is defined as the minimum operation cost of converting one string, made up of tags of different types, into another string by deletion, insertion, and replacement.
  • the cost of the delete operation and the insert operation is 1, the cost of a replacement within a class is 0, and the cost of a replacement between classes is 1.5, that is:
  • the lower right corner element M[A, B] is the improved edit distance of S1 and S2, from which the tag structure information Dt is obtained:
  • the improved edit distance matrix is shown in Table 1.
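The improved edit distance can be sketched as a weighted dynamic program over tag sequences. The costs (1 for deletion and insertion, 0 for a replacement within a class, 1.5 for a replacement between classes) come from the text above; the function names, the treatment of any unlisted tag as irrelevant, and the readings "table" and "dfn" for two garbled entries of the tag lists are assumptions. The normalization that turns M[A, B] into Dt is not reproduced here.

```python
STRUCTURE = {"blockquote", "body", "dir", "div", "dt", "h", "head", "hr", "li",
             "menu", "p", "q", "table", "tbody", "td", "tfoot", "th", "thead",
             "tr", "ul"}
FORMAT = {"abbr", "acronym", "b", "big", "center", "cite", "code", "dfn", "em",
          "font", "i", "pre", "s", "small", "span", "strike", "strong", "style",
          "sub", "sup", "tt", "u"}


def tag_class(tag):
    """Return 'structure', 'format', or None for irrelevant or unknown tags."""
    if tag in STRUCTURE:
        return "structure"
    if tag in FORMAT:
        return "format"
    return None


def improved_edit_distance(s1, s2):
    """Weighted edit distance between two tag sequences; returns M[A][B]."""
    # Irrelevant tags are deleted before structural symmetry is measured.
    s1 = [t for t in s1 if tag_class(t)]
    s2 = [t for t in s2 if tag_class(t)]
    A, B = len(s1), len(s2)
    M = [[0.0] * (B + 1) for _ in range(A + 1)]
    for i in range(1, A + 1):
        M[i][0] = float(i)  # i deletions
    for j in range(1, B + 1):
        M[0][j] = float(j)  # j insertions
    for i in range(1, A + 1):
        for j in range(1, B + 1):
            same = tag_class(s1[i - 1]) == tag_class(s2[j - 1])
            sub = 0.0 if same else 1.5  # in-class vs between-class replacement
            M[i][j] = min(M[i - 1][j] + 1, M[i][j - 1] + 1, M[i - 1][j - 1] + sub)
    return M[A][B]
```

Because replacing a tag with another tag of the same class is free, two pages that use different but functionally equivalent markup (for example div versus table for layout) are treated as structurally identical.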
  • the content surface features refer to information that is directly related to the content but is not lexical, mainly the sentence count, text length, and digital sequence information of the text pair; the features are calculated as follows:
  • the matrix C is used to establish the maximum matching length matrix D of the strings, and the element D[i, j] is calculated as follows:
  • the element D[0,0] of the finally generated matrix D is the maximum matching length Z.
  • the calculated matching relationship matrix C is as shown in Table 2.
  • the web page text extraction and comparison method of the present invention uses the support vector machine (SVM) algorithm for classification.
  • the SVM algorithm is an implementation of statistical learning theory.
  • the SVM is based on the theory of the Vapnik-Chervonenkis dimension and the principle of structural risk minimization.
  • by introducing a kernel function, the sample vectors are mapped to a high-dimensional feature space, and the optimal separating surface is then constructed in that space, yielding a linear optimal decision function.
  • the advantage of SVM is that the kernel function handles the dimensionality problem, so that the computational complexity of the learning algorithm is not directly tied to the sample dimension.
  • sgn[·] is the sign function;
  • the non-negative variable αi is a Lagrange multiplier;
  • b is the offset of the hyperplane.
  • web pages within two levels of the mirrored local path are selected from the preprocessed source-language and target-language documents to form candidate parallel web page pairs.
  • Dt reflects the web page structure information and is extracted from the preprocessed web page; Di, Ds and Dn reflect the web page content information and are extracted from the web page body.
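The symbols listed above match the standard kernel SVM decision function, f(x) = sgn[ sum_i alpha_i * y_i * K(x_i, x) + b ]. The following minimal sketch evaluates that standard form; it is not a reconstruction of the patent's own equation, and the RBF kernel, the gamma value, and all names are assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; the choice of kernel and gamma is an assumption."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def svm_decide(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate sgn[sum_i alpha_i * y_i * K(x_i, x) + b] for one sample x.

    labels are +1 or -1, alphas are the non-negative Lagrange multipliers,
    and b is the hyperplane offset. A score of exactly 0 is mapped to +1.
    """
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score + b >= 0 else -1
```

In the parallel web page setting, x would be the feature vector built from Dt, Di, Ds and Dn for one candidate pair, with +1 meaning the pair is judged parallel.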
  • a method for extracting and comparing web page texts that includes bilingual sentence alignment is also provided.
  • the bilingual sentence alignment step in the method for extracting and comparing web page texts of the present invention is: after the chapter-level bilingual parallel web documents are obtained, their text is extracted and split into sentences to form candidate sentence pairs (Si, Tj), and the candidate sentences are aligned.
  • C and B are {c1, c2, ..., cn} and {b1, b2, ..., bn}, respectively, where ci and bi are words after word segmentation. Assuming that there are K pairs of words that are translations of each other, the similarity of (Si, Tj) is defined as follows:
  • stf(cm, bm) is the number of occurrences of the mutually translated words in the pair of sentences;
  • are the numbers of sentences in the source language Si and the target language Tj, respectively;
  • idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text; two further symbols denote the lengths of the sentences in the source language Si and the target language Tj, respectively;
  • one term is a penalty factor on the alignment mode: different alignment modes are penalized to different degrees to prevent the algorithm from merging too many sentences together; another term is a penalty factor determined by length.
  • comparing the traditional web page blocking algorithm with the web page text extraction and comparison method of the present invention, which is based on topic-similarity blocking, the latter has the following advantages:
  • cluster analysis, which is very time consuming, is not required. The entropy of the blocks does not need to be calculated either; the judgment can be made by analyzing the web page itself.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An extraction and comparison method for the text of a webpage comprises the following steps: step A: determining whether a webpage is a text page according to specific tags of the webpage; and step B: identifying parallel webpages. Step A further comprises the following substeps: step one, preprocessing the webpage and constructing an HTML tree; step two, pruning the HTML tree; step three, acquiring the webpage theme; step four, extracting the character string content in the blocks; step five, calculating the distance between a theme S and content y in a block; and step six, comparing an edit distance L with max(p, q). The webpage text extraction and comparison method has the following advantages: webpages having a short text can be extracted, and the correctness of the selection is not affected by the length of the content. No matter how long the text is, it participates in the calculation and is not ignored. All <table> tags are processed consistently when a webpage with complicated nested <table> tags is processed.

Description

Web page text extraction and comparison method
Technical field
The invention relates to computer network technology, and in particular to a web page text extraction and comparison method.
Background
There are many web page text extraction methods, some of which are designed specifically for review pages or news pages; the present invention, however, addresses text extraction for most general web pages. In general, the main web page text extraction methods at present fall into the following directions: DOM-based methods, statistics-based methods, block-based methods, and other web page text extraction methods.
The Document Object Model (DOM) is a standard interface specification developed by the W3C. Because DOM nodes are organized in a tree hierarchy, once the tree structure has been built, operations on the web page can be converted into operations on the tree. Although web page structures can in principle be converted into DOM trees according to the standards set by the W3C, in practice many web pages do not follow the standard. DOM-based methods therefore usually need a preprocessing module before the web page can finally be abstracted into a DOM tree.
I. DOM-based web page body text extraction
DOM-based web page body text extraction is a DOM-based content extraction approach whose original purpose was to improve PDA applications by removing advertising content. The DOM method first abstracts the content of the web page into corresponding objects, represented as nodes, and then organizes the nodes by parent-child relationships into a tree structure.
Web pages from the same website mostly share the same structure. For example, the <body> tag of a Yahoo news page consists of the two tags <iframe> and <div>, so such page templates can be clustered into one class. Clustering similar DOM trees requires a similarity measure. The similarity of two simple DOM trees is computed in two steps. First, check whether the root nodes of the two trees are the same; if not, return 0; if so, go on to compare the leaf nodes. Second, compare the names and attributes of the leaf nodes of the two DOM trees and return the number of identical nodes.
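The two-step comparison described above can be sketched in Python as follows. This is only an illustration: the `Node` class and the function names are hypothetical, and counting "identical" leaves as a multiset intersection over (name, attributes) pairs is an assumption, since the text does not say how duplicates are counted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    attrs: tuple = ()                      # (name, value) pairs
    children: list = field(default_factory=list)


def leaves(node):
    """Yield every leaf node of the tree rooted at `node`."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def dom_similarity(a, b):
    # Step 1: trees with different roots are treated as dissimilar.
    if a.tag != b.tag:
        return 0
    # Step 2: count leaves that agree in both name and attributes
    # (multiset intersection, an assumption of this sketch).
    leaves_a = Counter((n.tag, n.attrs) for n in leaves(a))
    leaves_b = Counter((n.tag, n.attrs) for n in leaves(b))
    return sum((leaves_a & leaves_b).values())
```

A larger return value indicates that two pages are more likely to share a template.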
II. Statistics-based web page body text extraction
Statistics-based methods are mainly used to extract the body text of news pages. They rest on the assumption that the body text of a web page can only be located in the <table> tag nodes of the page. The basic steps are as follows. First, remove noise from the page and represent the page as a tree according to its tags. Second, process each <table> node by removing the HTML tags inside it, which yields a string without any tags. Third, compare the number of characters of each node; the node with the most characters is usually selected as the body text. The advantages of this approach are that it exploits the characteristics of news pages, generalizes well, is simple to implement, needs neither page-specific templates nor sample learning, and has low time complexity. The disadvantage is that it only works when all the body text is placed in a single <table> node; it performs poorly on pages whose body text is spread over several <table> nodes. With the rise of microblogs, light blogs and similar services, more and more pages with complex formats and short texts are being produced, which makes this limitation more apparent.
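As a rough illustration of the statistics-based approach described above, the following Python sketch strips the tags inside each <table> and returns the longest resulting string. The class and function names are hypothetical, noise removal is omitted, and crediting text only to the innermost open table is an assumption made here so that nested tables are not all dominated by the outermost one.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the tag-free text of each <table> (innermost table wins)."""

    def __init__(self):
        super().__init__()
        self.open_tables = []   # indices of the tables currently open
        self.texts = []         # one list of text fragments per table seen

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.open_tables.append(len(self.texts))
            self.texts.append([])

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()

    def handle_data(self, data):
        if self.open_tables and data.strip():
            self.texts[self.open_tables[-1]].append(data.strip())


def largest_table_text(html):
    parser = TableText()
    parser.feed(html)
    parser.close()
    blocks = ["".join(fragments) for fragments in parser.texts]
    # The table with the most characters is taken as the body text.
    return max(blocks, key=len, default="")
```

The sketch also makes the stated weakness visible: a short body text loses to any longer navigation or advertising table.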
Table of the extraction and comparison results of existing web page body text methods:
Figure PCTCN2015100180-appb-000001
In general, current algorithms for web page body text extraction and web page similarity calculation still target traditional Internet web pages. Neither line of research has seriously considered the new characteristics of mobile Internet page content, which leads to the following shortcomings:
(1) The structure of mobile Internet web pages is increasingly complex and involves more and more emerging techniques, so the limitations of the traditional body text extraction algorithms introduced in Section 2.2 are increasingly apparent.
(2) Because so much web content now consists of short texts, the theoretical basis of some of the text similarity algorithms introduced in Section 2.3 no longer holds; their accuracy drops, and they cannot meet the needs of large-scale data use.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a web page body text extraction and comparison method based on topic-similarity blocks; results show that the method achieves a considerable improvement in accuracy.
To solve the above technical problem, the present invention provides a web page body text extraction and comparison method comprising the following steps:
Step A: determining, based on specific tags of the web page, whether the web page is a body-text page;
Step B: identifying parallel web pages;
Step C: for Chinese web pages, the body text usually contains Chinese punctuation marks, whereas titles contain few or none. A threshold on the number of Chinese punctuation marks is therefore set and applied to the text in each <p> tag of the page: if the number of Chinese punctuation marks in the text exceeds the threshold, the text is added to the body text. The text of several consecutive <p> tags (one or two other tags may lie between the <p> tags) is then obtained and, after passing the same test, is also added to the body text.
Step A may further comprise the following sub-steps:
Step one: preprocessing the web page and constructing an HTML tree;
Step two: pruning the HTML tree;
Step three: obtaining the web page topic;
Step four: extracting the string content of each block;
Step five: calculating the distance between the topic S and the content y of a block;
Step six: comparing the edit distance L with max(p, q).
Step two may further comprise the following sub-step: dividing the page into blocks according to <table> tags, and removing leaf nodes that contain neither text nor link information.
Step five may further comprise: performing word segmentation on Chinese text, and using the Levenshtein distance given by formulas (2) and (3):
Figure PCTCN2015100180-appb-000002
Figure PCTCN2015100180-appb-000003
Step B may further comprise a feature information extraction sub-step and a support vector machine classification sub-step.
The feature information extraction sub-step further comprises:
establishing feature information, which includes the HTML tag structure information of the web page and the content-based text length information, sentence count information and digit sequence information;
dividing HTML tags, according to their layout, display and linking functions in the web page, into three classes: structure tags, format tags and irrelevant tags:
structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when structural symmetry is calculated;
calculating the similarity of the classified HTML tag sequences with an improved edit distance:
the edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
the edit operations are replacing one character with another, inserting a character, and deleting a character;
based on the tag classes, the improved edit distance is defined as the minimum operation cost of transforming the typed tags of one string into the typed tags of another string through deletion, insertion and replacement.
To solve the above technical problem, the present invention further provides a web page body text extraction and comparison system comprising the following modules:
Module A: configured to determine, based on specific tags of the web page, whether the web page is a body-text page;
Module B: configured to identify parallel web pages.
Module A may further comprise the following sub-modules:
a preprocessing sub-module, configured to preprocess the web page and construct an HTML tree;
a pruning sub-module, configured to prune the HTML tree;
a topic acquisition sub-module, configured to obtain the web page topic;
a block extraction sub-module, configured to extract the string content of each block;
a distance calculation sub-module, configured to calculate the distance between the topic S and the content y of a block;
a distance comparison sub-module, configured to compare the edit distance L with max(p, q).
The pruning sub-module may be further configured to divide the page into blocks according to <table> tags and to remove leaf nodes that contain neither text nor link information.
The distance calculation sub-module may be further configured to perform word segmentation on Chinese text and to use the Levenshtein distance given by formulas (2) and (3):
Figure PCTCN2015100180-appb-000004
Figure PCTCN2015100180-appb-000005
Module B may further comprise the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module.
The feature information extraction sub-module is configured to:
establish feature information, which includes the HTML tag structure information of the web page and the content-based text length information, sentence count information and digit sequence information;
divide HTML tags, according to their layout, display and linking functions in the web page, into three classes: structure tags, format tags and irrelevant tags:
structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when structural symmetry is calculated;
calculate the similarity of the classified HTML tag sequences with an improved edit distance:
the edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
the edit operations are replacing one character with another, inserting a character, and deleting a character;
based on the tag classes, the improved edit distance is defined as the minimum operation cost of transforming the typed tags of one string into the typed tags of another string through deletion, insertion and replacement.
The beneficial effects of the present invention are as follows. Compared with traditional web page blocking algorithms and with web page body text extraction methods based on topic-similarity blocks, the web page body text extraction and comparison method of the present invention has the following advantages:
(1) It can extract web pages whose body text is short. The length of the content does not affect the correctness of the selection, because text of any length takes part in the calculation and is never ignored.
(2) It can handle complex web pages with nested <table> tags. Because an HTML tree is built, every <table> tag is guaranteed to be processed consistently.
(3) It reduces the amount of computation. No cluster analysis is needed (clustering is very time-consuming) and the entropy of blocks need not be computed; a decision can be made by analyzing the current web page alone.
(4) It adds a degree of semantic information. Because the semantic relation between the title tags and the body text is used effectively, the extracted body text is more semantically relevant.
Detailed Description
Embodiments of the present invention are described in detail below so that the way the invention applies technical means to solve the technical problem and achieve its technical effects can be fully understood and put into practice.
In the web page body text extraction and comparison method based on topic-similarity blocks, the topic refers to the title and tags of the web page. To prevent short-text blocks on the mobile Internet from being ignored, the algorithm does not compute the entropy of content blocks; it mainly uses the similarity between the topic and each content block as the criterion for extracting blocks. Specifically, it relies on the following characteristics of web pages:
First, web page formats have a tree structure. More and more web pages are built to XML standards, and web page tags usually appear in nested pairs, so a page can be converted into an HTML tree structure; DOM-based body text extraction methods exploit this property as well. In the method of the present invention, the HTML tree is built mainly to cut off useless branches and reduce the amount of computation.
Second, web pages are usually laid out in blocks. Although mobile Internet page formats are complex, in terms of content each page basically consists of the following blocks: category blocks, navigation blocks, body text blocks, related-link blocks, advertisement blocks and so on. Given this property, and because tags usually appear in nested pairs, the page can be divided into blocks by its tags. In practice, because of the widespread use of DIV+CSS and the good layout properties of the <table></table> tag, most web pages today use <table> tags for layout when they are finally presented to the user. The body text extraction method based on topic-similarity blocks relies on this and uses <table> tags to parse the page.
Third, the topic and the content are related. A web page usually has a title and several tags that summarize its body text at a high level, so the topic best reflects the characteristics of the body text and represents the key content of the page. Previous body text extraction methods did not take this into account. The method of the present invention uses the relationship between the topic and the body text as an important indicator for body text extraction. In particular, as the structure of mobile Internet pages becomes more diverse, content lengths vary and advertising noise is abundant, short body text is easily drowned out by advertising; taking the similarity between the topic and the page content into account during extraction is therefore essential. The similarity measure used in the present invention is the edit distance (i.e., the Levenshtein distance), which is the minimum number of insertions, deletions and substitutions required to transform the source string (a) into the target string (b). The Levenshtein formula is given by formula (1):
Figure PCTCN2015100180-appb-000006
Note: a and b are strings, i is the length of string a, and j is the length of string b.
Building on the three points above, the basic idea of the web page body text extraction method based on topic-similarity blocks is as follows: convert the web page into an HTML tree structure; extract the topic of the web page; extract content blocks using the web page tags; and compute the edit distance (Levenshtein distance) L between the topic and each content block. When L is less than the length p of the content block, the block is regarded as body text and is extracted; when L is greater than or equal to the length of the content block, the block is ignored.
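The image of formula (1) is not reproduced in this text. For reference, the standard Levenshtein recurrence, which is consistent with the symbols a, b, i and j defined above, is usually written as follows; that formula (1) has exactly this form is an assumption.

```latex
\operatorname{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]
\min\begin{cases}
\operatorname{lev}_{a,b}(i-1,j)+1\\
\operatorname{lev}_{a,b}(i,j-1)+1\\
\operatorname{lev}_{a,b}(i-1,j-1)+[a_i\neq b_j]
\end{cases} & \text{otherwise,}
\end{cases}
```

where $[a_i\neq b_j]$ is 1 if the i-th element of a differs from the j-th element of b and 0 otherwise. The distance between the two full strings is the value obtained when i and j equal the lengths of a and b.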
In one embodiment, the present invention provides a web page body text extraction and comparison method comprising the following steps:
Step A: determining, based on specific tags of the web page, whether the web page is a body-text page;
Step B: identifying parallel web pages;
Step C: for Chinese web pages, the body text usually contains Chinese punctuation marks, whereas titles contain few or none. A threshold on the number of Chinese punctuation marks is therefore set and applied to the text in each <p> tag of the page: if the number of Chinese punctuation marks in the text exceeds the threshold, the text is added to the body text. The text of several consecutive <p> tags (one or two other tags may lie between the <p> tags) is then obtained and, after passing the same test, is also added to the body text.
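The punctuation test of step C can be sketched in Python as follows. The punctuation set, the default threshold and the function name are assumptions chosen for illustration; the regular-expression handling of <p> tags is deliberately simplistic, and the merging of consecutive <p> tags separated by one or two other tags is not implemented.

```python
import re

# Assumed set of Chinese punctuation marks; the text does not list them.
CN_PUNCT = ",。、;:?!“”‘’()《》…"
P_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def body_paragraphs(html, threshold=2):
    """Return the text of each <p> whose Chinese punctuation count exceeds `threshold`."""
    kept = []
    for inner in P_RE.findall(html):
        text = TAG_RE.sub("", inner).strip()
        if sum(ch in CN_PUNCT for ch in text) > threshold:
            kept.append(text)
    return kept
```

A short navigation label such as "首页" contains no punctuation and is dropped, while a full sentence with several marks is kept.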
Step A may further comprise the following sub-steps:
Step one: preprocessing the web page and constructing an HTML tree;
Step two: pruning the HTML tree;
Step three: obtaining the web page topic;
Step four: extracting the string content of each block;
Step five: calculating the distance between the topic S and the content y of a block;
Step six: comparing the edit distance L with max(p, q).
Step two may further comprise the following sub-step: dividing the page into blocks according to <table> tags, and removing leaf nodes that contain neither text nor link information.
Step five may further comprise: performing word segmentation on Chinese text, and using the Levenshtein distance given by formulas (2) and (3):
Figure PCTCN2015100180-appb-000007
Figure PCTCN2015100180-appb-000008
Step B may further comprise a feature information extraction sub-step and a support vector machine classification sub-step.
The feature information extraction sub-step further comprises:
establishing feature information, which includes the HTML tag structure information of the web page and the content-based text length information, sentence count information and digit sequence information;
dividing HTML tags, according to their layout, display and linking functions in the web page, into three classes: structure tags, format tags and irrelevant tags:
structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when structural symmetry is calculated;
calculating the similarity of the classified HTML tag sequences with an improved edit distance:
the edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
the edit operations are replacing one character with another, inserting a character, and deleting a character;
based on the tag classes, the improved edit distance is defined as the minimum operation cost of transforming the typed tags of one string into the typed tags of another string through deletion, insertion and replacement.
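One way to read the improved edit distance is as a weighted edit distance over sequences of (tag, class) pairs, where replacing a tag by another tag of the same class is cheaper than replacing it by a tag of a different class. The Python sketch below follows that reading. The cost values (0.5 within a class, 1.0 across classes, 1.0 for insertion or deletion) and the function names are assumptions; this passage does not give the actual costs.

```python
def substitution_cost(x, y):
    """Assumed costs: identical tags 0, same class 0.5, different class 1."""
    if x == y:
        return 0.0
    return 0.5 if x[1] == y[1] else 1.0


def tag_edit_distance(a, b, indel=1.0):
    """Weighted edit distance between two lists of (tag, tag_class) pairs."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + indel,                        # delete x
                cur[j - 1] + indel,                     # insert y
                prev[j - 1] + substitution_cost(x, y),  # replace x by y
            ))
        prev = cur
    return prev[-1]
```

Under these assumed costs, two pages whose tag sequences differ only by same-class substitutions (for example <p> versus <td>) score as closer than pages that differ across classes.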
In another embodiment, the present invention further provides a web page body text extraction and comparison system comprising the following modules:
Module A: configured to determine, based on specific tags of the web page, whether the web page is a body-text page;
Module B: configured to identify parallel web pages.
Module A may further comprise the following sub-modules:
a preprocessing sub-module, configured to preprocess the web page and construct an HTML tree;
a pruning sub-module, configured to prune the HTML tree;
a topic acquisition sub-module, configured to obtain the web page topic;
a block extraction sub-module, configured to extract the string content of each block;
a distance calculation sub-module, configured to calculate the distance between the topic S and the content y of a block;
a distance comparison sub-module, configured to compare the edit distance L with max(p, q).
The pruning sub-module may be further configured to divide the page into blocks according to <table> tags and to remove leaf nodes that contain neither text nor link information.
The distance calculation sub-module may be further configured to perform word segmentation on Chinese text and to use the Levenshtein distance given by formulas (2) and (3):
Figure PCTCN2015100180-appb-000009
Figure PCTCN2015100180-appb-000010
Module B may further comprise the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module.
The feature information extraction sub-module is configured to:
establish feature information, which includes the HTML tag structure information of the web page and the content-based text length information, sentence count information and digit sequence information;
divide HTML tags, according to their layout, display and linking functions in the web page, into three classes: structure tags, format tags and irrelevant tags:
structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when structural symmetry is calculated;
calculate the similarity of the classified HTML tag sequences with an improved edit distance:
the edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
the edit operations are replacing one character with another, inserting a character, and deleting a character;
based on the tag classes, the improved edit distance is defined as the minimum operation cost of transforming the typed tags of one string into the typed tags of another string through deletion, insertion and replacement.
In yet another embodiment, following the basic idea of the body text extraction method based on topic-similarity blocks, the algorithm of the present invention comprises three main steps: constructing the HTML tree, extracting the web page topic, and calculating the similarity between the topic and each block. In addition, because web pages are semi-structured, preprocessing is required; and to reduce the amount of computation, the constructed tree is pruned. Specifically, the basic steps of the algorithm are as follows:
Step one: preprocess the web page and construct the HTML tree. The web page is normalized and finally mapped into a tree structure, through the following sub-steps:
(1) Any "<" and ">" that appear outside the <table>-related tags of the page are replaced with &lt; and &gt;, and the end markers that the page lacks because of non-standard markup, such as those of <li> and <hr>, are filled in.
(2) All attribute values of tags in the page are placed in quotation marks, for example <a href="www.hust.edu.cn">.
(3) Tags are matched in pairs, i.e. every start tag has a corresponding end tag, such as <body> with </body> and <head> with </head>.
(4) Tags are nested correctly, for example <a> ... <b> ... </b> ... </a>. Only correctly nested tags can be processed correctly by iteration.
(5) Useless tags such as form and img are removed. Using the normalized tag information, the HTML tree corresponding to the web page is then constructed recursively.
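A minimal Python sketch of step one, using the standard-library `html.parser`, is shown below. It drops the tags named as useless (form, img), treats a few void elements as childless, and closes any tags left open when an enclosing end tag arrives. The `Node` and `TreeBuilder` names are hypothetical, the tree is built with an explicit cursor rather than the recursion mentioned in the text, and the escaping of stray "<" and ">" is not covered.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "input", "meta", "link"}   # elements without an end tag
DROP = {"form", "img"}                         # "useless" tags named in the text


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root          # node currently being filled

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            return                    # skip the tag, keep its contents
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name; elements left
        # unclosed in between (for example <li>) are closed implicitly.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

The returned root is a synthetic `#root` node whose children are the top-level elements of the page.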
Step two: prune the HTML tree. Because blocks are formed according to <table> tags, some leaf nodes contain neither text nor link information; these useless branches are removed to reduce the amount of computation.
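The pruning step can be illustrated with a small recursive function over a dictionary-based tree. The node layout (`tag`, `text`, `children` keys) is hypothetical, treating an <a> element as "link information" is an assumption, and the sketch prunes repeatedly, so a branch that becomes empty after its leaves are removed is removed as well.

```python
def prune(node):
    """Drop subtrees with neither text nor links; return True if `node` is kept."""
    node["children"] = [child for child in node["children"] if prune(child)]
    has_text = bool(node.get("text", "").strip())
    is_link = node["tag"] == "a"      # assumed meaning of "link information"
    return has_text or is_link or bool(node["children"])
```

Calling `prune(root)` modifies the tree in place; the return value only matters for the recursion.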
Step three: obtain the web page topic. The content of the page title, the headings at each level (<h1> to <h6>) and the <meta> tags is obtained. For Chinese text, the ICTCLAS word segmentation system proposed by the Chinese Academy of Sciences can be used to segment this content; function words, stop words and the like are then removed, leaving a sequence Stitle that contains only content words.
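A rough Python sketch of step three follows. It collects the text of <title>, <h1> to <h6> and the `content` attribute of <meta> tags, then tokenizes and filters stop words. Several parts are stand-ins: ICTCLAS is not used, the regular-expression tokenizer (Latin words, single CJK characters) and the tiny stop list are placeholders, and restricting <meta> to `keywords` and `description` is an assumption.

```python
import re
from html.parser import HTMLParser

TOPIC_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
STOP_WORDS = {"the", "a", "of", "and", "的", "了"}   # placeholder stop list


class TopicParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a title or heading element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TOPIC_TAGS:
            self.depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in {"keywords", "description"} and attr.get("content"):
                self.parts.append(attr["content"])

    def handle_endtag(self, tag):
        if tag in TOPIC_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def topic_tokens(html):
    parser = TopicParser()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", " ".join(parser.parts))
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
```

In a real system the tokenizer would be replaced by a proper word segmenter so that Chinese words, not single characters, form the sequence.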
Step four: extract the string content of each block. For the leaf nodes of the HTML tree, that is, the subtrees corresponding to the innermost <table> tags, each subtree is merged into one block; the HTML markup inside the block is removed to obtain the string content Y of the block.
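Step four can be approximated in Python by finding every <table> element that contains no further <table> and stripping the markup inside it. The function name is hypothetical, and the use of regular expressions instead of the HTML tree is a simplification that assumes well-formed, normalized markup. Text that sits directly in an outer table, outside any innermost table, is not captured by this sketch.

```python
import re

# A <table> whose content contains no further "<table": an innermost table.
INNER_TABLE = re.compile(
    r"<table\b[^>]*>((?:(?!<table\b).)*?)</table>", re.I | re.S
)
TAG = re.compile(r"<[^>]+>")


def innermost_blocks(html):
    """Return the tag-free text of each innermost <table>, in document order."""
    blocks = []
    for inner in INNER_TABLE.findall(html):
        text = " ".join(TAG.sub(" ", inner).split())
        if text:
            blocks.append(text)
    return blocks
```

Each returned string plays the role of the block content Y that is compared with the topic in the next step.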
Step five: calculate the distance between the topic S and the content y of a block. Chinese text must first be segmented into words, again using the Chinese Academy of Sciences segmentation system from step three. The Levenshtein distance used in the present invention is given by formulas (2) and (3):
Figure PCTCN2015100180-appb-000011
Figure PCTCN2015100180-appb-000012
Step six: compare the edit distance L with max(p, q). If L < max(p, q), the block contains body text and is extracted; otherwise it is identified as noise and ignored. The body text of the web page is finally obtained.
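Steps five and six can be sketched in Python as follows. The sketch uses the plain Levenshtein distance over token sequences, since formulas (2) and (3) are not reproduced in this text, and it assumes that p and q are the lengths of the block and topic sequences; both points are assumptions, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # replace x by y (free if equal)
            ))
        prev = cur
    return prev[-1]


def is_body_block(topic, block):
    # Assumption: p is the block length and q the topic length, in tokens.
    p, q = len(block), len(topic)
    return levenshtein(topic, block) < max(p, q)
```

Because the distance can never exceed max(p, q), the test succeeds only when the best alignment contains at least one matching token. For example, a block that repeats topic words passes, while a block with no words in common with the topic does not.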
In addition, the webpage text extraction and comparison method of the invention includes the identification of parallel webpages.
Parallel webpage identification consists of two parts: feature information extraction and support vector machine classification.
1. Feature information extraction
The feature information consists mainly of the HTML tag structure of the page and three content-based features: text length, sentence count and number sequence information.
(1) Tag structure features
The main content of bilingual parallel webpages is mutually translated, but the presentation of the pages often differs considerably. To avoid wrongly excluding parallel pages because of differences in form, and to increase the similarity of structural tag alignment between parallel pages, HTML tags are divided by function (page layout, display, links and so on) into three classes: structure tags, format tags and irrelevant tags:
Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are deleted before structural symmetry is computed.
The similarity of the classified HTML tag sequences is computed with an improved edit distance.
The edit distance between two strings is the minimum number of edit operations needed to turn one string into the other, where an edit operation replaces one character with another, inserts a character, or deletes a character. Based on the tag classes, the improved edit distance is defined as the minimum total cost of converting one sequence of classified tags into another through deletions, insertions and substitutions. Deletion and insertion each cost 1, substitution within a class costs 0, and substitution between classes costs 1.5, that is:
Insertion: Ct(t) = 1;
Deletion: Cd(t) = 1;
Substitution:
Figure PCTCN2015100180-appb-000013
For the HTML tag sequences W = [w0, w1, …, wa, …, wA] and Z = [z0, z1, …, zb, …, zB], the improved edit distance matrix M is computed by dynamic programming, with element M[a,b] given by:
Figure PCTCN2015100180-appb-000014
The bottom-right element M[A,B] is the improved edit distance between W and Z, and the tag structure feature Dt is:
Dt = M[A,B] / Max(A+1, B+1)
For example, for the HTML tag sequences W = [div, style, style, div, style, style, p, p, div, div] and Z = [div, table, tr, td, span, span, td, tr, table, div], the improved edit distance matrix is shown in Table 1; the improved edit distance is 3 and the tag structure feature is Dt = 0.3.
Table 1: Improved edit distance matrix M for W and Z
Figure PCTCN2015100180-appb-000015
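The cost model described above can be sketched as follows. This is an illustration under stated assumptions rather than the patent's own implementation: it uses a standard dynamic programming table with an extra row and column for the empty prefix (the patent's matrix M is available only as an image), it treats any tag outside the structure and format lists as irrelevant, and the entries `table` and `dfn` are assumed readings of garbled list items. All function names are hypothetical.

```python
from typing import List, Optional, Sequence

# Tag classes taken from the lists in the text ("table" and "dfn" are assumed readings).
STRUCTURE_TAGS = frozenset(
    "blockquote body dir div dt h head hr li menu p q table tbody td tfoot th thead tr ul".split()
)
FORMAT_TAGS = frozenset(
    "abbr acronym b big center cite code dfn em font i pre s small span strike strong "
    "style sub sup tt u".split()
)


def tag_class(tag: str) -> Optional[str]:
    """Return 'structure', 'format', or None for irrelevant/unknown tags."""
    if tag in STRUCTURE_TAGS:
        return "structure"
    if tag in FORMAT_TAGS:
        return "format"
    return None


def classify(tags: Sequence[str]) -> List[str]:
    """Map tags to classes, dropping irrelevant tags before comparison."""
    classes = (tag_class(t.lower()) for t in tags)
    return [c for c in classes if c is not None]


def improved_edit_distance(w: Sequence[str], z: Sequence[str]) -> float:
    """Edit distance over tag classes: insert/delete cost 1,
    same-class substitution 0, cross-class substitution 1.5."""
    rows, cols = len(w), len(z)
    m = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for a in range(1, rows + 1):
        m[a][0] = float(a)
    for b in range(1, cols + 1):
        m[0][b] = float(b)
    for a in range(1, rows + 1):
        for b in range(1, cols + 1):
            sub = 0.0 if w[a - 1] == z[b - 1] else 1.5
            m[a][b] = min(m[a - 1][b] + 1.0, m[a][b - 1] + 1.0, m[a - 1][b - 1] + sub)
    return m[rows][cols]


def tag_structure_feature(w_tags: Sequence[str], z_tags: Sequence[str]) -> float:
    """Dt: improved edit distance divided by the longer sequence length."""
    w, z = classify(w_tags), classify(z_tags)
    longest = max(len(w), len(z))
    if longest == 0:
        return 0.0  # assumption: two empty sequences count as identical
    return improved_edit_distance(w, z) / longest
```

Applied to the sequences W and Z of the example, the sketch gives a distance of 3 and Dt = 0.3, which agrees with the values stated in the text.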
(2) Content surface features
To reduce dependence on bilingual dictionaries, content surface features are those that relate directly to the content but do not involve word-level translation. They comprise the sentence count, text length and number sequence information of the text pair, computed as follows:
1) Sentence count information Ds:
Ds = Min(SS, ST) / Max(SS, ST)
2) Text length information Di:
Di = |LS - LT| / Max(LS, LT)
3) Number sequence information Dn:
Dn = 1 - Z / Max(m, n)
where m and n are the numbers of numerals appearing in the source-language text and the target-language text respectively, and Z is the maximum matching length, computed in the following steps:
Suppose the number sequences extracted from the source-language and target-language text pair are X = [x1, x2, …, xi, …, xm] and Y = [y1, y2, …, yj, …, yn]. From these an m×n matching matrix C is built, with element C[i,j] given by:
Figure PCTCN2015100180-appb-000016
The matrix C is used to build the maximum matching length matrix D. The elements D[i,j] are computed as follows:
a. The loop runs from right to left and from bottom to top.
b. The element D[i,j] is:
D[i,j] = Max(C[i,j] + D[i+1,j+1], D[i,j+1], D[i+1,j])
The final element D[0,0] of the matrix D is the maximum matching length Z.
To illustrate the computation of the co-occurring number sequence information, take the number sequences X = [4, 5, 34, 5, 2, 45, 8, 12] and Y = [4, 7, 34, 8, 78, 9, 5, 2, 12]. The resulting matching matrix C is shown in Table 2 and the maximum matching matrix D in Table 3. The maximum matching length Z is therefore 5, and the number sequence information is Dn = 1 - 5/9 = 0.44.
Table 2: Matching matrix C for X and Y
Figure PCTCN2015100180-appb-000017
Table 3: Maximum matching matrix D for X and Y
Figure PCTCN2015100180-appb-000018
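The three content surface features can be sketched as below. Several points are assumptions rather than statements from the patent: C[i,j] is taken to be 1 when the two numbers are equal and 0 otherwise (its defining formula is an image), D is padded with a zero row and column so the right-to-left, bottom-to-top loop has boundary values, the recurrence recurses on D (the standard longest-common-subsequence form, which reproduces Z = 5 for the example), and SS, ST, LS, LT are read as the sentence counts and text lengths of the source and target texts. Function names are hypothetical.

```python
from typing import List, Sequence


def matching_matrix(x: Sequence[int], y: Sequence[int]) -> List[List[int]]:
    """C[i][j] = 1 when x[i] equals y[j], else 0 (assumed definition)."""
    return [[1 if xi == yj else 0 for yj in y] for xi in x]


def max_matching_length(x: Sequence[int], y: Sequence[int]) -> int:
    """Z: length of the longest order-preserving match between x and y."""
    m, n = len(x), len(y)
    c = matching_matrix(x, y)
    # One extra zero row and column act as the boundary for i+1 and j+1.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):          # bottom to top
        for j in range(n - 1, -1, -1):      # right to left
            d[i][j] = max(c[i][j] + d[i + 1][j + 1], d[i][j + 1], d[i + 1][j])
    return d[0][0]


def number_sequence_feature(x: Sequence[int], y: Sequence[int]) -> float:
    """Dn = 1 - Z / Max(m, n)."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0  # assumption: no numbers on either side means no mismatch
    return 1.0 - max_matching_length(x, y) / longest


def sentence_count_feature(source_sentences: int, target_sentences: int) -> float:
    """Ds = Min(SS, ST) / Max(SS, ST); both counts are assumed to be positive."""
    return min(source_sentences, target_sentences) / max(source_sentences, target_sentences)


def text_length_feature(source_length: int, target_length: int) -> float:
    """Text length feature: |LS - LT| / Max(LS, LT); lengths assumed positive."""
    return abs(source_length - target_length) / max(source_length, target_length)
```

For the sequences X and Y of the worked example, `max_matching_length` returns 5 and `number_sequence_feature` returns 1 - 5/9, about 0.44.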
The method of the invention uses the support vector machine (SVM) algorithm for classification. SVM is an implementation of statistical learning theory. It is built on the VC dimension (Vapnik-Chervonenkis dimension) theory of statistical learning and the principle of structural risk minimization: by introducing a kernel function, sample vectors are mapped into a high-dimensional feature space, where an optimal separating surface is constructed to obtain a linear optimal decision function. The advantage of SVM is that the kernel function neatly handles the dimensionality problem, so that the computational complexity of the learning algorithm is not directly tied to the sample dimension.
Let {(xi, yi), i = 1, …, S} be the SVM training data set of S data points, where xi ∈ R^n and yi ∈ {-1, 1}. The optimal decision function is:
Figure PCTCN2015100180-appb-000019
where Sgn[·] is the sign function, the non-negative variables αi are Lagrange multipliers, and b is the bias of the hyperplane.
From the preprocessed source-language and target-language documents, pages whose mirrored local paths differ by no more than two levels are paired to form candidate parallel webpage pairs. For each pair, the HTML tag sequence information Dt, text length information Di, sentence count information Ds and number sequence information Dn are computed and together form the feature vector xi ∈ R^n (n = 4) of the SVM classifier. Dt reflects the structure of the page and is extracted from the preprocessed page; Di, Ds and Dn reflect the content of the page and are extracted from its body text.
The SVM is trained on a training set of known parallel and non-parallel webpage pairs and then used to decide whether an unclassified page pair is parallel. An SVM output of yi = 1 means the pair is parallel; yi = -1 means it is not.
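To make the classification step concrete, the sketch below evaluates a kernel SVM decision function of the standard form sgn(Σ αi·yi·K(xi, x) + b) on a four-dimensional feature vector (Dt, Di, Ds, Dn). The RBF kernel, the support vectors, the multipliers and the bias are all made-up values for illustration; the patent does not state which kernel it uses, and in practice these parameters would come from training on labelled page pairs.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Each entry: (alpha_i, y_i, x_i) for one support vector.
SupportVector = Tuple[float, int, Vector]


def rbf_kernel(u: Vector, v: Vector, gamma: float = 1.0) -> float:
    """K(u, v) = exp(-gamma * ||u - v||^2); the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decide(x: Vector, support: List[SupportVector],
               bias: float = 0.0, gamma: float = 1.0) -> int:
    """Return +1 (parallel pair) or -1 (non-parallel pair)."""
    score = sum(alpha * label * rbf_kernel(sv, x, gamma)
                for alpha, label, sv in support) + bias
    return 1 if score >= 0 else -1


# Hypothetical trained parameters; feature order is (Dt, Di, Ds, Dn).
SUPPORT: List[SupportVector] = [
    (1.0, +1, (0.1, 0.1, 0.9, 0.1)),  # looks like a parallel pair
    (1.0, -1, (0.8, 0.7, 0.2, 0.9)),  # looks like a non-parallel pair
]
```

A candidate pair with features such as (0.3, 0.1, 1.0, 0.44) lies much closer to the first support vector than to the second, so `svm_decide` labels it as parallel.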
In a further embodiment of the invention, a webpage text extraction and comparison method that includes bilingual sentence alignment is also provided.
The bilingual sentence alignment step is as follows. After document-level bilingual parallel webpages have been obtained, the body text of each page is extracted and split into sentences, forming sentence pairs (Si, Tj). The candidate aligned sentences C and B are {c1, c2, …, cn} and {b1, b2, …, bn} respectively, where ci and bi are the words obtained after word segmentation. Assuming there are K word pairs that are translations of each other, the similarity of (Si, Tj) is computed as:
Figure PCTCN2015100180-appb-000020
where stf(cm, bm) is the number of times the mutually translated word pair occurs in the sentence pair; |Si| and |Tj| are the numbers of sentences in the source language Si and the target language Tj respectively; idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text;
Figure PCTCN2015100180-appb-000021
and
Figure PCTCN2015100180-appb-000022
are the lengths of the sentences in the source language Si and the target language Tj respectively; Matching(|Si|, |Tj|) is a penalty factor that penalizes different alignment patterns to different degrees, to prevent the algorithm from merging more sentences together; and
Figure PCTCN2015100180-appb-000023
is a penalty factor determined by length.
On the basis of the similarity evaluation function Sim(Si, Tj), dynamic programming is used to find the optimal sentence alignment path and obtain the bilingual parallel corpus.
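The dynamic programming search over alignment paths might look like the sketch below. The patent's Sim formula is available only as an image, so the similarity used here is a toy stand-in (the share of source words whose lexicon translation appears on the target side); the alignment patterns (1-1, 1-2, 2-1), their penalties, the lexicon and all names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sentence = Sequence[str]
Span = Tuple[int, int]  # half-open range of sentence indices

# (source sentences consumed, target sentences consumed) -> hypothetical penalty.
PATTERN_PENALTY: Dict[Tuple[int, int], float] = {(1, 1): 0.0, (1, 2): 0.2, (2, 1): 0.2}

# Toy bilingual lexicon (romanized source words), for illustration only.
LEXICON = {"mao": "cat", "shui": "sleeps", "gou": "dog", "pao": "runs",
           "niao": "bird", "fei": "flies"}


def lexicon_similarity(src: Sequence[Sentence], tgt: Sequence[Sentence]) -> float:
    """Stand-in for Sim: translated-word overlap over the longer side."""
    src_words = [w for s in src for w in s]
    tgt_words = [w for s in tgt for w in s]
    if not src_words or not tgt_words:
        return 0.0
    tgt_set = set(tgt_words)
    matched = sum(1 for w in src_words if LEXICON.get(w) in tgt_set)
    return matched / max(len(src_words), len(tgt_words))


def align(src: Sequence[Sentence], tgt: Sequence[Sentence],
          sim: Callable[[Sequence[Sentence], Sequence[Sentence]], float]
          ) -> List[Tuple[Span, Span]]:
    """Find the highest-scoring alignment path by dynamic programming."""
    n, m = len(src), len(tgt)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg:
                continue  # this state cannot be reached
            for (di, dj), penalty in PATTERN_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                score = best[i][j] + sim(src[i:ni], tgt[j:nj]) - penalty
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j)
    if best[n][m] == neg:
        return []  # no complete alignment exists with these patterns
    path: List[Tuple[Span, Span]] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append(((pi, i), (pj, j)))
        i, j = pi, pj
    path.reverse()
    return path
```

Given three source sentences and two target sentences where the second target sentence covers the last two source sentences, the search pairs the first sentences one to one and then selects a 2-1 pattern despite its penalty, because the alternative paths score lower.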
Compared with traditional webpage blocking algorithms, the topic-similarity block-based text extraction method of the invention has the following advantages:
(1) It can extract pages whose body text is short; the length of the content does not affect the correctness of the selection, because the text takes part in the computation whatever its length and is never ignored.
(2) It can handle complex pages with nested <table> tags. Because an HTML tree is built, every <table> tag is guaranteed to be processed consistently.
(3) It reduces the amount of computation. No cluster analysis is needed (clustering is very time-consuming) and block entropy does not have to be computed; analysing the page itself is enough to reach a decision.
(4) It adds a degree of semantic information. Because the semantic information of the title tags and the body text is used effectively, the extracted body text is more semantically relevant.
The above describes the preferred implementation of this intellectual property and does not limit other forms of implementing the new product and/or new method. Those skilled in the art may use this information and modify the above content to achieve similar implementations. However, all modifications or adaptations based on the invention fall within the reserved rights.

Claims (10)

  1. A webpage text extraction and comparison method, characterized by comprising the following steps:
    Step A: determining, based on specific tags of the webpage, whether the webpage is a body-text page;
    Step B: identifying parallel webpages;
    Step C: for a Chinese webpage, setting a threshold on the number of Chinese punctuation marks, and using the threshold to evaluate the text in each <p> tag of the webpage: if the number of Chinese punctuation marks in the text exceeds the threshold, the text is added to the body text.
  2. The webpage text extraction and comparison method according to claim 1, characterized in that Step A further comprises the following sub-steps:
    Step 1: preprocessing the webpage and constructing an HTML tree;
    Step 2: pruning the HTML tree;
    Step 3: obtaining the webpage topic;
    Step 4: extracting the string content of each block;
    Step 5: computing the distance between the topic S and the content y of a block;
    Step 6: comparing the edit distance L with max(p, q).
  3. The webpage text extraction and comparison method according to claim 1 or 2, characterized in that Step 2 further comprises the following sub-step: segmenting blocks by <table> tags and removing leaf nodes that contain neither text nor link information.
  4. The webpage text extraction and comparison method according to any one of claims 1 to 3, characterized in that Step 5 further comprises: performing word segmentation on Chinese text, and using the Levenshtein distance given by equations (2) and (3):
    Figure PCTCN2015100180-appb-100001
    Figure PCTCN2015100180-appb-100002
  5. The webpage text extraction and comparison method according to any one of claims 1 to 4, characterized in that Step B further comprises a feature information extraction sub-step and a support vector machine classification sub-step;
    the feature information extraction sub-step further comprises:
    establishing feature information: the feature information comprises the HTML tag structure information of the webpage and content-based text length information, sentence count information and number sequence information;
    dividing HTML tags, according to their function in page layout, display and linking, into three classes (structure tags, format tags and irrelevant tags):
    structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
    format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
    irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when structural symmetry is computed;
    computing the similarity of the classified HTML tag sequences with an improved edit distance:
    the edit distance between two strings is the minimum number of edit operations required to convert one string into the other;
    an edit operation replaces one character with another, inserts a character, or deletes a character;
    based on the tag classes, the improved edit distance is defined as the minimum operation cost of converting one sequence of classified tags into another through deletions, insertions and substitutions;
    the webpage text extraction and comparison method further comprises a bilingual sentence alignment step;
    the bilingual sentence alignment step is: after document-level bilingual parallel webpages have been obtained, the body text of each page is extracted and split into sentences, forming sentence pairs (Si, Tj); the candidate aligned sentences C and B are {c1, c2, …, cn} and {b1, b2, …, bn} respectively, where ci and bi are the words obtained after word segmentation; assuming there are K word pairs that are translations of each other, the similarity of (Si, Tj) is computed as:
    Figure PCTCN2015100180-appb-100003
    where stf(cm, bm) is the number of times the mutually translated word pair occurs in the sentence pair;
    |Si| and |Tj| are the numbers of sentences in the source language Si and the target language Tj respectively;
    idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text;
    Figure PCTCN2015100180-appb-100004
    and
    Figure PCTCN2015100180-appb-100005
    are the lengths of the sentences in the source language Si and the target language Tj respectively;
    Matching(|Si|, |Tj|) is a penalty factor that penalizes different alignment patterns to different degrees, to prevent the algorithm from merging more sentences together;
    Figure PCTCN2015100180-appb-100006
    is a penalty factor determined by length;
    on the basis of the similarity evaluation function Sim(Si, Tj), dynamic programming is used to find the optimal sentence alignment path and obtain the bilingual parallel corpus.
  6. A webpage text extraction and comparison system, characterized by comprising the following modules:
    Module A: configured to determine, based on specific tags of the webpage, whether the webpage is a body-text page;
    Module B: configured to identify parallel webpages.
  7. The webpage text extraction and comparison system according to claim 6, characterized in that Module A further comprises the following sub-modules:
    a preprocessing sub-module: configured to preprocess the webpage and construct an HTML tree;
    a pruning sub-module: configured to prune the HTML tree;
    a topic acquisition sub-module: configured to obtain the webpage topic;
    a block extraction sub-module: configured to extract the string content of each block;
    a distance calculation sub-module: configured to compute the distance between the topic S and the content y of a block;
    a distance comparison sub-module: configured to compare the edit distance L with max(p, q).
  8. The webpage text extraction and comparison system according to claim 6 or 7, characterized in that the pruning sub-module is further configured to: segment blocks by <table> tags and remove leaf nodes that contain neither text nor link information.
  9. The webpage text extraction and comparison system according to any one of claims 6 to 8, characterized in that the distance calculation sub-module is further configured to: perform word segmentation on Chinese text, and use the Levenshtein distance given by equations (2) and (3):
    Figure PCTCN2015100180-appb-100007
    Figure PCTCN2015100180-appb-100008
  10. The webpage text extraction and comparison system according to any one of claims 6 to 9, characterized in that Module B further comprises the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module;
    the feature information extraction sub-module is configured for:
    establishing feature information: the feature information comprises the HTML tag structure information of the webpage and content-based text length information, sentence count information and number sequence information;
    dividing HTML tags, according to their function in page layout, display and linking, into three classes (structure tags, format tags and irrelevant tags):
    structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
    format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
    irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when structural symmetry is computed;
    computing the similarity of the classified HTML tag sequences with an improved edit distance:
    the edit distance between two strings is the minimum number of edit operations required to convert one string into the other;
    an edit operation replaces one character with another, inserts a character, or deletes a character;
    based on the tag classes, the improved edit distance is defined as the minimum operation cost of converting one sequence of classified tags into another through deletions, insertions and substitutions;
    the webpage text extraction and comparison system further comprises a bilingual sentence alignment module;
    the bilingual sentence alignment module is configured for: after document-level bilingual parallel webpages have been obtained, extracting the body text of each page and splitting it into sentences, forming sentence pairs (Si, Tj); the candidate aligned sentences C and B are {c1, c2, …, cn} and {b1, b2, …, bn} respectively, where ci and bi are the words obtained after word segmentation; assuming there are K word pairs that are translations of each other, the similarity of (Si, Tj) is computed as:
    Figure PCTCN2015100180-appb-100009
    where stf(cm, bm) is the number of times the mutually translated word pair occurs in the sentence pair;
    |Si| and |Tj| are the numbers of sentences in the source language Si and the target language Tj respectively;
    idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text;
    Figure PCTCN2015100180-appb-100010
    and
    Figure PCTCN2015100180-appb-100011
    are the lengths of the sentences in the source language Si and the target language Tj respectively;
    Matching(|Si|, |Tj|) is a penalty factor that penalizes different alignment patterns to different degrees, to prevent the algorithm from merging more sentences together;
    Figure PCTCN2015100180-appb-100012
    is a penalty factor determined by length;
    on the basis of the similarity evaluation function Sim(Si, Tj), dynamic programming is used to find the optimal sentence alignment path and obtain the bilingual parallel corpus.
PCT/CN2015/100180 2015-11-14 2015-12-31 Extraction and comparison method for text of webpage WO2017080090A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510793525.XA CN106528583A (en) 2015-11-14 2015-11-14 Method for extracting and comparing web page main body
CN201510793525.X 2015-11-14

Publications (1)

Publication Number Publication Date
WO2017080090A1 true WO2017080090A1 (en) 2017-05-18

Family

ID=58348780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/100180 WO2017080090A1 (en) 2015-11-14 2015-12-31 Extraction and comparison method for text of webpage

Country Status (2)

Country Link
CN (1) CN106528583A (en)
WO (1) WO2017080090A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019794A (en) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 Classification method, device, storage medium and the electronic device of textual resources
CN110196968A (en) * 2019-06-06 2019-09-03 北京林业大学 A kind of simplified form of Chinese Character coding mode automatic recognition system and method searched based on specific character string
CN110795933A (en) * 2019-09-30 2020-02-14 奇安信科技集团股份有限公司 Method and device for identifying and processing webpage text
CN110874428A (en) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 Structured data extraction device and method for e-commerce page and readable storage medium
CN111241446A (en) * 2020-01-13 2020-06-05 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for extracting text content of web page
CN111708900A (en) * 2020-06-17 2020-09-25 北京明略软件系统有限公司 Expansion method and expansion device for tag synonym, electronic device and storage medium
CN112101004A (en) * 2020-09-23 2020-12-18 电子科技大学 General webpage character information extraction method based on conditional random field and syntactic analysis
CN112269906A (en) * 2020-10-14 2021-01-26 西安邮电大学 Automatic extraction method and device of webpage text
CN112287254A (en) * 2020-11-23 2021-01-29 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN112668309A (en) * 2020-11-25 2021-04-16 紫光云技术有限公司 Network behavior prediction model fusing compressed DOM tree structure vectors
CN113033220A (en) * 2021-04-15 2021-06-25 沈阳雅译网络技术有限公司 Lavenstein ratio-based method for constructing literary-modern translation system
CN113065086A (en) * 2021-04-23 2021-07-02 深圳壹账通智能科技有限公司 Webpage text extraction method and device, electronic equipment and storage medium
CN113434797A (en) * 2021-06-29 2021-09-24 中国电信集团系统集成有限责任公司 Webpage information extraction method and device
CN113486228A (en) * 2021-07-02 2021-10-08 燕山大学 Internet paper data automatic extraction algorithm based on MD5 ternary tree and improved BIRCH algorithm
CN113569119A (en) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 Multi-modal machine learning-based news webpage text extraction system and method
CN117573959A (en) * 2023-10-17 2024-02-20 北京国科众安科技有限公司 General method for obtaining news text based on web page xpath

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920434B (en) * 2018-06-06 2022-08-30 武汉酷犬数据科技有限公司 Universal webpage theme content extraction method and system
WO2020026366A1 (en) * 2018-07-31 2020-02-06 株式会社 AI Samurai Patent evaluation determination method, patent evaluation determination device, and patent evaluation determination program
CN109543126B (en) * 2018-11-19 2022-04-29 四川长虹电器股份有限公司 Webpage text information extraction method based on block character ratio
CN112214737B (en) * 2020-11-10 2022-06-24 山东比特智能科技股份有限公司 Method, system, device and medium for identifying picture-based fraudulent webpage
CN112528205B (en) * 2020-12-22 2021-10-29 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium
CN112765940B (en) * 2021-01-20 2024-04-19 南京万得资讯科技有限公司 Webpage deduplication method based on theme features and content semantics
CN113449078A (en) * 2021-06-25 2021-09-28 完美世界控股集团有限公司 Similar news identification method, equipment, system and storage medium
CN114239590B (en) * 2021-12-01 2023-09-19 马上消费金融股份有限公司 Data processing method and device
CN115238208A (en) * 2022-06-28 2022-10-25 北京关键科技股份有限公司 Data retrieval method and equipment based on symbolic features

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197849A (en) * 2007-12-21 2008-06-11 腾讯科技(深圳)有限公司 Method and device for commuting internet page into wireless application protocol page
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
EP2562656A1 (en) * 2010-10-14 2013-02-27 JVC KENWOOD Corporation Filtering device and filtering method
CN103064966A (en) * 2012-12-31 2013-04-24 中国科学院计算技术研究所 Method for extracting regular noise from single record web pages

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019794A (en) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 Classification method, device, storage medium and the electronic device of textual resources
CN110019794B (en) * 2017-11-07 2023-04-25 腾讯科技(北京)有限公司 Text resource classification method and device, storage medium and electronic device
CN110196968A (en) * 2019-06-06 2019-09-03 北京林业大学 A kind of simplified form of Chinese Character coding mode automatic recognition system and method searched based on specific character string
CN110196968B (en) * 2019-06-06 2023-04-07 北京林业大学 System and method for automatically identifying simplified Chinese coding mode based on specific character string search
CN110795933A (en) * 2019-09-30 2020-02-14 奇安信科技集团股份有限公司 Method and device for identifying and processing webpage text
CN110795933B (en) * 2019-09-30 2023-10-31 奇安信科技集团股份有限公司 Webpage text recognition processing method and device
CN110874428A (en) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 Structured data extraction device and method for e-commerce page and readable storage medium
CN111241446B (en) * 2020-01-13 2023-10-31 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for extracting text content of web page
CN111241446A (en) * 2020-01-13 2020-06-05 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for extracting text content of web page
CN111708900A (en) * 2020-06-17 2020-09-25 北京明略软件系统有限公司 Expansion method and expansion device for tag synonym, electronic device and storage medium
CN111708900B (en) * 2020-06-17 2023-08-25 北京明略软件系统有限公司 Expansion method and expansion device for tag synonyms, electronic equipment and storage medium
CN112101004A (en) * 2020-09-23 2020-12-18 电子科技大学 General webpage character information extraction method based on conditional random field and syntactic analysis
CN112101004B (en) * 2020-09-23 2023-03-21 电子科技大学 General webpage character information extraction method based on conditional random field and syntactic analysis
CN112269906A (en) * 2020-10-14 2021-01-26 西安邮电大学 Automatic extraction method and device of webpage text
CN112269906B (en) * 2020-10-14 2023-04-14 西安邮电大学 Automatic extraction method and device of webpage text
CN112287254A (en) * 2020-11-23 2021-01-29 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN112287254B (en) * 2020-11-23 2023-10-27 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN112668309B (en) * 2020-11-25 2023-03-07 紫光云技术有限公司 Network behavior prediction method fusing compressed DOM tree structure vectors
CN112668309A (en) * 2020-11-25 2021-04-16 紫光云技术有限公司 Network behavior prediction model fusing compressed DOM tree structure vectors
CN113033220A (en) * 2021-04-15 2021-06-25 沈阳雅译网络技术有限公司 Levenshtein ratio-based method for constructing literary-modern translation system
CN113065086A (en) * 2021-04-23 2021-07-02 深圳壹账通智能科技有限公司 Webpage text extraction method and device, electronic equipment and storage medium
CN113434797A (en) * 2021-06-29 2021-09-24 中国电信集团系统集成有限责任公司 Webpage information extraction method and device
CN113434797B (en) * 2021-06-29 2024-05-31 中电信数智科技有限公司 Webpage information extraction method and device
CN113569119A (en) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 Multi-modal machine learning-based news webpage text extraction system and method
CN113486228A (en) * 2021-07-02 2021-10-08 燕山大学 Internet paper data automatic extraction algorithm based on MD5 ternary tree and improved BIRCH algorithm
CN113486228B (en) * 2021-07-02 2022-05-10 燕山大学 Internet paper data automatic extraction algorithm based on MD5 ternary tree and improved BIRCH algorithm
CN117573959A (en) * 2023-10-17 2024-02-20 北京国科众安科技有限公司 General method for obtaining news text based on web page xpath
CN117573959B (en) * 2023-10-17 2024-04-05 北京国科众安科技有限公司 General method for obtaining news text based on web page xpath

Also Published As

Publication number Publication date
CN106528583A (en) 2017-03-22

Similar Documents

Publication Publication Date Title
WO2017080090A1 (en) Extraction and comparison method for text of webpage
WO2022022045A1 (en) Knowledge graph-based text comparison method and apparatus, device, and storage medium
KR102237702B1 (en) Entity relationship data generating method, apparatus, equipment and storage medium
CN109145260B (en) Automatic text information extraction method
CN110413787B (en) Text clustering method, device, terminal and storage medium
CN101079025B (en) File correlation computing system and method
CN110770735A (en) Transcoding of documents with embedded mathematical expressions
CN112380864B (en) Text triple labeling sample enhancement method based on translation
CN104750820A (en) Filtering method and device for corpuses
CN101114281A (en) Open type document isomorphism engines system
CN111046660B (en) Method and device for identifying text professional terms
CN102779135A (en) Method and device for obtaining cross-linguistic search resources and corresponding search method and device
CN111737623A (en) Webpage information extraction method and related equipment
CN105574066A (en) Web page text extraction and comparison method and system thereof
CN107463571A (en) Web color method
Jain et al. Context sensitive text summarization using k means clustering algorithm
CN112765999A (en) Machine translation bilingual comparison method and system
CN108763192B (en) Entity relation extraction method and device for text processing
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN106372232B (en) Information mining method and device based on artificial intelligence
Zanibbi et al. Math search for the masses: Multimodal search interfaces and appearance-based retrieval
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN117312711A (en) Search engine optimization method and system based on AI analysis
CN111859887A (en) Scientific and technological news automatic writing system based on deep learning
CN105426388A (en) Apparatus for extracting and comparing webpage text

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15908220; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 15908220; Country of ref document: EP; Kind code of ref document: A1)