CN102810097B - Webpage text content extracting method and device - Google Patents

Info

Publication number
CN102810097B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110147583.7A
Other languages
Chinese (zh)
Other versions
CN102810097A (en)
Inventor
朱海军
姜吉发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Autonavi Software Co Ltd
Application filed by Autonavi Software Co Ltd
Priority to CN201110147583.7A
Publication of CN102810097A
Application granted
Publication of CN102810097B

Abstract

The invention discloses a method and a device for extracting webpage text content. The method comprises the following steps: dividing a webpage from which text content needs to be extracted into content blocks; for each divided content block, determining the link text length and the non-link text length of the content block; determining the link text density corresponding to the content block according to the determined link text length and non-link text length; and, when the link text density is not greater than a preset first specified threshold, determining that the content block is text content of the webpage. The technical solution of the invention addresses the low accuracy of webpage text content extraction in the prior art.

Description

Webpage text content extraction method and device
Technical Field
The invention relates to the technical field of internet information processing, in particular to a method and a device for extracting webpage text content.
Background
With the rapid development of internet technology, the information on webpages is becoming ever richer, and people continually seek technologies that can effectively organize and use it. Unlike traditional texts, however, webpages are not neat and clean: they contain a large amount of noise content, such as scripts added to enhance user interactivity, navigation links added to facilitate browsing, and advertisement links added for commercial reasons.
Webpage text extraction refers to removing from a hypertext markup language (HTML) page the information unrelated to the main text, such as the text-link advertisements, pictures and copyright notices found in navigation bars and sidebars, and extracting the text content of the webpage. It is one of the indispensable steps in a search engine.
Prior-art methods for extracting webpage text mainly comprise extraction based on visual features and extraction based on statistics. The two methods are introduced below.
1. Extraction method based on visual features
The method first mines the structure of a webpage based on its visual features, which include fonts, font sizes, background colors, blank areas, position information and the like. The webpage is divided into visual information blocks according to these features, and each visual information block is then judged to be text content of the webpage or not according to visual feature rules.
The visual feature-based extraction method mainly extracts the text content according to the visual features of the webpage, and the visual features sometimes cannot accurately distinguish the boundaries of the text content and the non-text content, so that the extraction accuracy is low. In addition, with the continuous development of network technology, the formats of web pages are more and more abundant, and when text content is extracted for web pages with certain formats, visual features such as fonts, font sizes, background colors, blank areas, position information and the like may not be obtained, or the obtained visual features are inaccurate. Therefore, the accuracy of extracting the text content by adopting the extraction method based on the visual features is lower.
2. Extraction method based on statistics
Generally, the parts of a webpage that seldom change, such as the navigation bar, side advertisements and copyright information, are considered redundant content, that is, noise, while the parts that change frequently are usually the text content of the webpage. A training set containing a large number of webpages can therefore be constructed, the seldom-changing and frequently-changing areas can be counted from the training set, and corresponding webpage templates can then be summarized. When text content is to be extracted, the webpage is compared with the corresponding webpage template derived from the training set, and the text content of the webpage is then extracted.
The statistics-based extraction method extracts text content mainly according to the summarized webpage template. For any individual webpage, however, the distribution of the text content may differ slightly from that of the template, so applying a uniform template yields relatively low extraction accuracy. In addition, as network technology develops, webpage formats become more and more varied, and many webpages cannot be matched to a uniform template. For example, suppose webpage 1 and webpage 2 belong to the same website, webpage 1 being the page before a site redesign and webpage 2 the page after it, and the text content is located differently in the two. The template corresponding to webpage 1 then cannot be applied to webpage 2, and extracting the text content of webpage 2 with that template gives low accuracy.
Therefore, the webpage text extraction technology in the prior art has the problem of low extraction accuracy.
Disclosure of Invention
The embodiment of the invention provides a method and a device for extracting webpage text content, which are used for solving the problem of low accuracy of extracting the webpage text content in the prior art.
The technical scheme of the embodiment of the invention is as follows:
A method for extracting webpage text content comprises the following steps: dividing a webpage from which text content needs to be extracted into content blocks; and performing the following steps for each divided content block: determining the link text length and the non-link text length of the content block; determining the link text density corresponding to the content block according to the determined link text length and non-link text length; and, when the link text density is not greater than a preset first specified threshold, determining the content block to be text content of the webpage.
A web page text content extraction apparatus comprising: the content block dividing unit is used for dividing the webpage of which the text content needs to be extracted into content blocks; a first text length determining unit, configured to determine, for each content block, a link text length and a non-link text length of the content block, respectively; the first link text density determining unit is used for determining the link text density corresponding to the content block according to the link text length and the non-link text length determined by the first text length determining unit; the first link text density judging unit is used for judging whether the link text density determined by the first link text density determining unit is greater than a preset first specified threshold value or not; and the text content determining unit is used for determining the content block as the text content of the webpage when the judgment result of the first link text density judging unit is negative.
According to the technical solution of the embodiment of the invention, for each webpage from which text content needs to be extracted, whether a content block is text content of the webpage is determined according to the proportion of link text length in the content block (that is, the link text density). The larger the proportion of link text length, the less likely the content block is to be text content: if the link text density is greater than the first specified threshold, the content block is determined to be non-text content; otherwise it is determined to be text content. Because the extraction is performed on each webpage individually, it is not affected by differences in webpage form, and because the proportion of link text length in a content block objectively and accurately reflects how likely the block is to be webpage text, the accuracy of webpage text extraction is effectively improved.
Drawings
FIG. 1 is a schematic flow chart of a method for extracting text content of a web page according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a flow chart of a specific implementation of a method for extracting text content of a web page according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a web page text content extracting apparatus according to an embodiment of the present invention.
Detailed Description
The main implementation principle, the specific implementation mode and the corresponding beneficial effects of the technical scheme of the embodiment of the invention are explained in detail in the following with the accompanying drawings.
As shown in fig. 1, which is a flowchart of a method for extracting text content of a web page in an embodiment of the present invention, a specific processing flow is as follows:
Step 11: Divide the webpage from which text content needs to be extracted into content blocks.
A webpage usually describes one or more topics in paragraphs of text. It may also include content such as pictures and links, but such content is not the main body of the webpage and is small in amount compared with the text content.
Dividing the webpage into content blocks means dividing it into a plurality of content blocks according to the container tag pairs in the webpage; that is, the content inside each container tag pair forms one content block. The division comprises the following substeps:
performing normalization preprocessing on the webpage from which text content needs to be extracted;
obtaining each container tag pair in the preprocessed webpage; and
dividing the preprocessed webpage into a plurality of content blocks according to the obtained container tag pairs.
The webpage is normalized so that it conforms to the hypertext markup language (HTML) standard. Normalization preprocessing mainly includes unifying the webpage encoding format, simplifying tags, and deleting code segments unrelated to the text, each of which is described below.
1. Unifying the webpage encoding format
Since websites do not necessarily use the same encoding format, the encoding formats of webpages may differ. To extract text content correctly, webpages encoded in different formats need to be converted to a uniform encoding format. The conversion may be performed by means of, but is not limited to, the charset attribute of the meta tag.
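As an illustration only, the encoding unification could be sketched in Python as follows. The helper name `decode_to_unicode`, the 4096-byte search window and the UTF-8 fallback are assumptions made for this sketch and are not specified by the embodiment.

```python
import re

# Hypothetical pattern: finds a charset declaration inside a <meta> tag.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def decode_to_unicode(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw page bytes to a Python str using the declared charset, if any."""
    match = _META_CHARSET.search(raw[:4096])  # assumption: declaration is near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:  # the page declared an encoding name Python does not know
        return raw.decode(fallback, errors="replace")
```

Once every page is decoded to the same internal representation, the later steps can count characters without regard to the original encoding.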
2. Simplifying tags
The main purpose of simplifying tags is to facilitate the later extraction of text content and to further improve the accuracy of the extraction result. Simplifying tags mainly includes the following:
Different tag attributes play different roles in displaying a webpage; for example, a tag attribute can be set so that some content is not displayed in the webpage. Such tag attributes should be deleted when simplifying tags, so that they do not appear in the extracted text content. For example, the tag attributes in "<td height="29" align="right">" are deleted, reducing the tag to "<td>".
Since HTML tags are not case-sensitive, the case of all tags may be unified when simplifying tags to ease subsequent processing. For example, if tags are unified to upper case, the tag "td" in "<td height="29" align="right">" becomes "TD".
Tags are processed, for example deleted or replaced, according to their tag attributes. For example, "_ATTR_DEL" denotes a deletion flag and "_ATTR_REP" denotes a replacement flag. When simplifying tags, a tag attribute may be handled according to its flag: if the flag corresponding to the tag attribute is "_ATTR_DEL", the tag pair containing the tag attribute and all the content it encloses are deleted; if the flag is "_ATTR_REP", the tag carrying the tag attribute is replaced.
In practical applications, simplifying tags may include further operations, which those skilled in the art may design according to the specific form of each webpage; no limitation is made here.
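A minimal sketch of the first two simplifications (dropping attributes and unifying case) is given below. The function name `simplify_tags` is hypothetical, the regular expression assumes reasonably well-formed markup in which text characters such as "<" are escaped, and the "_ATTR_DEL" / "_ATTR_REP" handling is not covered.

```python
import re

# Matches an opening or closing tag: optional "/", the tag name, then any attributes.
_TAG = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def simplify_tags(html: str) -> str:
    """Drop every tag attribute and convert tag names to upper case."""
    return _TAG.sub(lambda m: f"<{m.group(1)}{m.group(2).upper()}>", html)
```

Under these assumptions, `simplify_tags('<td height="29" align="right">x</td>')` yields `<TD>x</TD>`.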
3. Deleting text-independent code segments
In the HTML code, some code implements page functions and is unrelated to the text content of the webpage, for example CSS code and script code. Such code needs to be deleted during normalization preprocessing. Table 1 lists the tag pairs corresponding to the code that needs to be deleted.
TABLE 1
Start tag    End tag      Remarks
<SCRIPT>     </SCRIPT>    Script code
<STYLE>      </STYLE>     CSS code
<FORM>       </FORM>      Form
The normalization preprocessing is a conventional step before extracting the text of the web page, and those skilled in the art may make adaptive changes and modifications based on the above description, and the embodiment of the present invention is not limited thereto.
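Deleting the tag pairs listed in Table 1 might be sketched as follows. The function name `strip_noise_segments` is hypothetical, and the regular-expression approach is a simplification that assumes these tag pairs are not nested inside themselves.

```python
import re

# Tag pairs from Table 1 whose entire content is discarded.
_NOISE_TAGS = ("script", "style", "form")

def strip_noise_segments(html: str) -> str:
    """Remove <SCRIPT>, <STYLE> and <FORM> tag pairs together with their content."""
    for tag in _NOISE_TAGS:
        pattern = re.compile(
            rf"<{tag}\b[^>]*>.*?</{tag}\s*>", re.IGNORECASE | re.DOTALL
        )
        html = pattern.sub("", html)
    return html
```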
According to the embodiment of the invention, when the text content of the webpage is extracted, the webpage is subjected to the standardized preprocessing, so that the problem that the text content cannot be correctly extracted due to the writing error in an HTML code or the non-standardized code is avoided, and the fault tolerance of the method for extracting the text content of the webpage provided by the embodiment of the invention is stronger.
After the normalization preprocessing is completed, the webpage is divided into content blocks. Specifically, each container tag pair in the preprocessed webpage is obtained first, and the preprocessed webpage is then divided into a plurality of content blocks according to the obtained container tag pairs.
Common container tag pairs include the <TABLE> tag pair, the <TR> tag pair, the <TD> tag pair, the <DIV> tag pair, the <P> tag pair, and so on. The content between the start tag and the end tag of a container tag pair is the content block corresponding to that container tag pair.
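One possible way to obtain content blocks from container tag pairs is sketched below with Python's standard `html.parser` module, which reports tag names in lower case. The `Block` class, the `CONTAINER_TAGS` set and the choice to record nested blocks as children are illustrative assumptions, not part of the described embodiment.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

CONTAINER_TAGS = {"table", "tr", "td", "div", "p"}  # the common container tags

@dataclass
class Block:
    tag: str
    text: str = ""                                 # all text inside this block
    children: list = field(default_factory=list)   # nested content blocks

class BlockSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            block = Block(tag)
            self._stack[-1].children.append(block)
            self._stack.append(block)

    def handle_endtag(self, tag):
        # Close the innermost open block with this tag; unclosed inner blocks go too.
        for depth in range(len(self._stack) - 1, 0, -1):
            if self._stack[depth].tag == tag:
                del self._stack[depth:]
                break

    def handle_data(self, data):
        for block in self._stack[1:]:   # text belongs to every enclosing block
            block.text += data

def split_into_blocks(html: str) -> Block:
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.root
```

The returned `root` is a placeholder whose `children` are the outermost content blocks; each block's own `children` are its nested content blocks.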
According to the embodiment of the invention, the webpage is divided into content blocks according to container tag pairs. Because common webpages contain container tag pairs, this way of dividing content blocks is highly general, is not limited by the webpage format, and is not affected by changes in website version.
Step 12: For each divided content block, determine the link text length and the non-link text length of the content block.
The text in the embodiment of the present invention refers to a character string in the content, and the text length refers to the length of the character string.
In the embodiment of the invention, for each divided content block, when determining the link text length of the content block, the text length of each link in the content block, that is, the number of characters contained in each link, is determined, and then the link text length of the content block is determined according to the text length of each link in the content block.
Specifically, the sum of the text lengths of the links in the content block may be taken, without limitation, as the link text length of the content block; that is, the link text length of the content block is determined as follows:
$$\mathrm{Len}(\mathrm{LinkText})_i = \sum_{j=1}^{n} \mathrm{Len}(\mathrm{LinkText})_{ij}$$
where $\mathrm{Len}(\mathrm{LinkText})_i$ is the link text length of content block $i$;
$n$ is the number of links contained in content block $i$; and
$\mathrm{Len}(\mathrm{LinkText})_{ij}$ is the text length of the $j$-th link in content block $i$, with $1 \le j \le n$.
In the embodiment of the invention, for each divided content block, the non-link text length may be determined by first determining the total text length of the content block and then subtracting the link text length of the content block from it; the result is the non-link text length $\mathrm{Len}(\mathrm{NonLinkText})_i$ of the content block. Alternatively, the sum of the lengths of the non-link texts in the content block may be calculated directly, and so on; the embodiment of the present invention is not limited in this respect.
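The two lengths can be computed in one pass over a content block, as in the sketch below. It assumes that link text is the text inside `<a>` elements and that whitespace is not counted as characters; both are assumptions of this sketch, and `text_lengths` is a hypothetical name.

```python
from html.parser import HTMLParser

class _LengthCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.link_lengths = []   # one entry per link, i.e. Len(LinkText)_ij
        self.total_length = 0    # total text length of the block
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
            if self._in_link == 1:
                self.link_lengths.append(0)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))   # assumption: whitespace is not counted
        self.total_length += n
        if self._in_link:
            self.link_lengths[-1] += n

def text_lengths(block_html: str):
    """Return (link text length, non-link text length, number of links)."""
    counter = _LengthCounter()
    counter.feed(block_html)
    counter.close()
    link_len = sum(counter.link_lengths)
    return link_len, counter.total_length - link_len, len(counter.link_lengths)
```

The non-link length here is obtained by the subtraction route described above: total text length minus link text length.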
Step 13: For each divided content block, determine the link text density corresponding to the content block according to the determined link text length and non-link text length.
The body of a webpage is usually presented as large passages of text. It is generally considered that the link portions of a webpage, such as navigation links and advertisement links, are not text content, and that the non-link portions are the text content of the webpage. The core idea of the embodiment of the invention is to determine whether a content block is text content according to the proportion of link text length in it. Specifically, the larger the proportion of link text length, the less likely the block is to be text content; the smaller the proportion, the more likely.
In this step, the link text density corresponding to the content block (that is, the proportion of link text length) can be determined from the link text length and the non-link text length in many ways. For example, it may be determined from the ratio of the link text length to the non-link text length, or from the ratio of the link text length to the total text length of the content block, and so on. Any measure that reflects the proportion of link text length to some extent may be used; the embodiment of the present invention is not limited in this respect.
In the implementation that determines the link text density from the ratio of the link text length to the non-link text length, that ratio can be used directly as the link text density corresponding to the content block:
$$f_i = \frac{\mathrm{Len}(\mathrm{LinkText})_i}{\mathrm{Len}(\mathrm{NonLinkText})_i}$$
where $f_i$ is the link text density corresponding to content block $i$;
$\mathrm{Len}(\mathrm{LinkText})_i$ is the link text length of content block $i$; and
$\mathrm{Len}(\mathrm{NonLinkText})_i$ is the non-link text length of content block $i$.
To better embody the above principle, when the link text density is determined from the ratio of the link text length to the non-link text length, the value of the link text density may be adjusted by a penalty factor. Specifically, the ratio is multiplied by the penalty factor to obtain the link text density corresponding to the content block.
The penalty factor may be, but is not limited to, the number of links contained in the content block, in which case the link text density corresponding to the content block is:
$$f_i = \frac{\mathrm{Len}(\mathrm{LinkText})_i}{\mathrm{Len}(\mathrm{NonLinkText})_i} \times n$$
where $n$ is the number of links contained in content block $i$, that is, the penalty factor.
Table 2 shows a content block i in a webpage; the content block includes navigation links and copyright information. The link text length of content block i is calculated to be 28, that is, the links in content block i contain 28 characters in total, and the non-link text length of content block i is calculated to be 56. If the penalty factor n is not used to amplify f_i of content block i, then f_i = 0.5, which is less than the first specified threshold of 0.9, so the content block is mistaken for non-link content and extracted as webpage text content. If the penalty factor n is used to amplify f_i, where n = 7, then f_i = 3.5, which is greater than the first specified threshold of 0.9, so the content block is regarded as link content and deleted from the HTML code. The accuracy of webpage text extraction can thus be further improved by adjusting with the penalty factor.
TABLE 2
The penalty factor therefore amplifies the link text density of a content block, which prevents some link content from being wrongly judged to be text content and improves the accuracy of webpage text content extraction.
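The density calculation and the worked example above can be reproduced with the short sketch below. The function name `link_text_density` and the handling of a block with zero non-link text length are assumptions of this sketch; the source does not say how that case is treated.

```python
def link_text_density(link_len: int, non_link_len: int, n_links: int = 1) -> float:
    """f_i = Len(LinkText)_i / Len(NonLinkText)_i * n; n_links=1 means no penalty."""
    if non_link_len == 0:
        # Assumption: an all-link block gets infinite density, an empty block zero.
        return float("inf") if link_len > 0 else 0.0
    return link_len / non_link_len * n_links

# Values from the worked example: 28 link characters, 56 non-link characters, 7 links.
without_penalty = link_text_density(28, 56)      # 0.5, below the 0.9 threshold
with_penalty = link_text_density(28, 56, 7)      # 3.5, above the 0.9 threshold
```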
Step 14: For each divided content block, determine that the content block is text content of the webpage when the link text density corresponding to the content block is not greater than a preset first specified threshold.
In the embodiment of the invention, for each divided content block, after the link text density f_i corresponding to the content block is determined, f_i is compared with the preset first specified threshold. If f_i is greater than the first specified threshold, the content block is regarded as non-text content and is deleted from the HTML code. If f_i is not greater than the first specified threshold, the content block is regarded as text content of the webpage and can be added to the text content result set.
The first specified threshold is generally an empirical value, and a person skilled in the art may set it according to different extraction accuracy requirements; the embodiment of the present invention is not limited in this respect.
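The threshold comparison itself is simple; a sketch is shown below. The function name `select_body_blocks` and the representation of blocks as (name, density) pairs are hypothetical, and the value 0.9 is only the example threshold used in this description, not a recommended setting.

```python
FIRST_THRESHOLD = 0.9  # example value; in practice an empirical setting

def select_body_blocks(blocks, threshold=FIRST_THRESHOLD):
    """Split (name, density) pairs into the text content result set and removed blocks."""
    body, removed = [], []
    for name, density in blocks:
        # A density exactly equal to the threshold is "not greater than" it: text content.
        (removed if density > threshold else body).append(name)
    return body, removed
```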
Among container tag pairs such as the <TABLE>, <TR>, <TD>, <DIV> and <P> tag pairs, tag pairs may be nested. The <P> tag pair is generally used to divide the body into paragraphs; it is a relatively inner tag and rarely contains nested tag pairs, whereas the <TABLE> tag pair often does. For nesting, in the embodiment of the present invention, a tag pair with a relatively large coverage is called an outer tag pair and one with a relatively small coverage is called an inner tag pair. For example, if a <P> tag pair is nested in a <TABLE> tag pair, that is, the content block corresponding to the <P> tag pair is nested in the content block corresponding to the <TABLE> tag pair, then the <P> tag pair is called the inner tag pair, the <TABLE> tag pair is called the outer tag pair, and the content block corresponding to the <P> tag pair is called a nested content block of the content block corresponding to the <TABLE> tag pair, as shown in Table 3.
TABLE 3
Start tag    End tag      Tag pair coverage
<P>          </P>         Inner tag
<SPAN>       </SPAN>
<DIV>        </DIV>
<UL>         </UL>
<TR>         </TR>
<TABLE>      </TABLE>     Outer tag
When tag pairs are nested, in practical applications there may be only one layer of nesting or multiple layers. Webpage text content extraction may proceed from the inner tag pairs to the outer tag pairs, that is, the innermost content blocks are processed first and each layer is then processed from the inside out. It may also proceed from the outer tag pairs to the inner tag pairs, that is, the outermost content blocks are processed first and each layer is then processed from the outside in.
Specific embodiments of sequential processing and reverse-sequential processing are described in detail below for the case of one-layer nested content blocks, and for the case of multiple-layer nested content blocks, the processing principle is consistent with that of the case of one-layer nested content blocks, and details are not described here.
When the sequential processing mode is adopted, for each divided content block, the method further includes, before steps 12 to 14 are performed for the content block: judging whether at least one nested content block is nested in the content block. If not, steps 12 to 14 are performed for the content block. If so, for each nested content block, the link text length and the non-link text length of the nested content block are determined, the link text density corresponding to the nested content block is determined from them, and the nested content block is determined to be non-text content of the webpage when its link text density is greater than the preset first specified threshold; the content of the content block other than the nested content blocks determined to be non-text content is then taken as the content block anew. That is, each nested content block in the content block is judged first; nested content blocks that are non-text content are deleted from the content block, and the content block remaining after the deletion is then judged.
For example, suppose the content block corresponding to a <TABLE> tag pair has nested in it the content blocks corresponding to three <P> tag pairs. For the content block corresponding to the <TABLE> tag pair, it is first judged whether each of the three nested content blocks is text content of the webpage. If the nested content blocks corresponding to two <P> tag pairs are determined to be text content and the one corresponding to the third <P> tag pair is determined to be non-text content, the latter is deleted from the content block. It is then judged whether the content block remaining after the deletion is text content of the webpage: if so, the content block is added to the text content result set; if not, the content block is deleted from the HTML code. The sequential processing mode can effectively prevent text content of the webpage from being filtered out.
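The sequential (inner-to-outer) mode can be sketched recursively as below. The `Node` class, which stores only the lengths of text lying directly in a block outside its nested blocks, the use of the link count as penalty factor, the 0.9 threshold and the zero-length handling are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    link_len: int        # link characters directly in this block, outside nested blocks
    non_link_len: int    # non-link characters directly in this block
    n_links: int         # links directly in this block
    children: list = field(default_factory=list)

def density(link_len, non_link_len, n_links):
    if non_link_len == 0:   # assumption: all-link block is infinite, empty block is 0
        return float("inf") if link_len else 0.0
    return link_len / non_link_len * n_links

def inner_first(node, threshold=0.9):
    """Judge nested blocks first, drop the non-text ones, then judge what remains.

    Returns (kept_names, link_len, non_link_len, n_links) for the pruned block;
    kept_names is empty when the pruned block is itself judged non-text content.
    """
    link, non_link, n = node.link_len, node.non_link_len, node.n_links
    kept = []
    for child in node.children:
        child_kept, c_link, c_non, c_n = inner_first(child, threshold)
        if child_kept:   # nested block judged to be text content: keep its text
            kept += child_kept
            link, non_link, n = link + c_link, non_link + c_non, n + c_n
    if density(link, non_link, n) > threshold:
        return [], 0, 0, 0
    return [node.name] + kept, link, non_link, n

# A <TABLE> block with three nested <P> blocks, the third being link-heavy.
table = Node("table", 0, 0, 0, [
    Node("p1", 0, 120, 0),
    Node("p2", 5, 100, 1),
    Node("p3", 40, 4, 5),
])
kept_names = inner_first(table)[0]   # the link-heavy p3 is dropped before the table is judged
```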
When the reverse order processing mode is adopted, for each divided content block, after determining that the content block is the text content of the web page in step 14, the method further includes: judging whether at least one nested content block is nested in the content block, and if the judgment result is negative, ending the process; if the nested content block is the text content of the webpage, determining the link text length and the non-link text length of the nested content block aiming at each nested content block, determining the link text density corresponding to the nested content block according to the determined link text length and the determined non-link text length, determining the nested content block as the non-text content of the webpage when the link text density corresponding to the nested content block is larger than a preset first specified threshold, then deleting the nested content block from the text content, and determining the content except the nested content blocks determined as the non-text content in the content block as the text content of the webpage. That is to say, it is determined whether the link text density corresponding to the content block is greater than a preset first prescribed threshold value, if so, the content block is deleted from the HTML code, if not, it is determined whether each nested content block is the text content, the nested content blocks determined as the non-text content are deleted from the HTML code, and the deleted content blocks are the text content of the web page. 
For example, nested content blocks corresponding to three < P > tag pairs are nested in content blocks corresponding to the < TABLE > tag pairs, for the content blocks corresponding to the < TABLE > tag pairs, it is first determined whether the link text density of the content blocks corresponding to the < TABLE > tag pairs is greater than a preset first prescribed threshold, if so, the content blocks are deleted from the HTML code, if not, it is then respectively determined whether the nested content blocks corresponding to the three < P > tag pairs are text content of the web page, the nested content blocks determined as non-text content are deleted from the HTML code, and the deleted content blocks are added to the text content result set. The reverse order processing mode can effectively avoid that the link content is judged as the text content of the webpage by mistake.
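For comparison, the reverse-order (outer-to-inner) mode can be sketched as below. As before, the `Node` class, the link-count penalty factor, the 0.9 threshold and the zero-length handling are illustrative assumptions rather than details fixed by the embodiment.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    link_len: int        # link characters directly in this block, outside nested blocks
    non_link_len: int    # non-link characters directly in this block
    n_links: int         # links directly in this block
    children: list = field(default_factory=list)

def density(link_len, non_link_len, n_links):
    if non_link_len == 0:   # assumption: all-link block is infinite, empty block is 0
        return float("inf") if link_len else 0.0
    return link_len / non_link_len * n_links

def full_totals(node):
    """Lengths and link count of a block including everything nested in it."""
    link, non_link, n = node.link_len, node.non_link_len, node.n_links
    for child in node.children:
        c_link, c_non, c_n = full_totals(child)
        link, non_link, n = link + c_link, non_link + c_non, n + c_n
    return link, non_link, n

def outer_first(node, threshold=0.9):
    """Judge the outer block on its full content, then judge each nested block."""
    if density(*full_totals(node)) > threshold:
        return []            # the block and everything nested in it is removed
    kept = [node.name]
    for child in node.children:
        kept += outer_first(child, threshold)
    return kept

# A <TABLE> block with three nested <P> blocks, the third being link-heavy.
table = Node("table", 0, 0, 0, [
    Node("p1", 0, 120, 0),
    Node("p2", 5, 100, 1),
    Node("p3", 20, 4, 2),
])
kept_names = outer_first(table)   # table passes as a whole, then p3 is removed
```

Unlike the sequential mode, an outer block whose overall density exceeds the threshold is discarded without examining its nested blocks.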
In the embodiment of the present invention, when judging whether each divided content block is text content of the web page, nested content blocks are preferably processed in the reverse-order processing mode.
As can be seen from the above processing procedure, in the technical solution of the embodiment of the present invention, for each content block of a web page whose text content needs to be extracted, whether the content block is text content of the web page is determined according to the proportion of link text in the content block (i.e., the link text density). The larger this proportion, the less likely the content block is to be text content: if the link text density is greater than the first specified threshold, the content block is determined to be non-text content; otherwise it is determined to be text content. Because this judgment is made separately for each web page whose text content needs to be extracted, it is not affected by differences in web page layout, and the proportion of link text in a content block objectively and accurately reflects how likely the content block is to be web page text, so the method is more accurate and reasonable than judgments based on other factors.
In another embodiment of the present invention, in order to further improve the accuracy of extracting the text content of the web page, the embodiment of the present invention further includes a high frequency unit filtering step.
The web page text content determined according to the link text density may still include "noise" such as weather forecasts, website sources, click rates, and copyright information. Such redundant information appears with high frequency across web pages, so high-frequency unit filtering processing is applied to the text content determined according to the link text density to filter it out, making the text content extraction result more accurate.
To perform high-frequency unit filtering on the text content determined according to the link text density, the content units are determined first, specifically: each tag in the text content is obtained, and the content between every two adjacent tags is determined to be one content unit. For example, suppose that when the text content is determined according to the link text density, the content block corresponding to a <TABLE> tag pair is determined to be text content, and two <P> tag pairs are nested in the <TABLE> tag pair. The content between the start tag and the end tag of the <TABLE> tag pair can then be divided into 5 content units: the content between the start tag of the <TABLE> tag pair and the start tag of the first <P> tag pair is the first content unit; the content between the start tag and the end tag of the first <P> tag pair is the second content unit; the content between the end tag of the first <P> tag pair and the start tag of the second <P> tag pair is the third content unit; the content between the start tag and the end tag of the second <P> tag pair is the fourth content unit; and the content between the end tag of the second <P> tag pair and the end tag of the <TABLE> tag pair is the fifth content unit.
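The division into content units can be illustrated with a short sketch. The function name `split_content_units` and the simple tag pattern are assumptions made for illustration; the pattern presumes preprocessed HTML in which no ">" character occurs inside a tag.

```python
import re
from typing import List

# Naive tag pattern; assumes simplified tags without ">" inside attribute values.
TAG = re.compile(r"<[^>]+>")


def split_content_units(html_text: str) -> List[str]:
    """Return the content between every two adjacent tags, in document order."""
    pieces = TAG.split(html_text)
    # The first and last pieces lie before the first tag and after the last tag,
    # so they are not "between two adjacent tags" and are dropped.
    return pieces[1:-1]
```

For the <TABLE> block with two nested <P> tag pairs described above, this yields five content units; units may be empty strings when two tags are directly adjacent.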
After dividing the text content determined according to the link text density into content units, respectively executing the following operations for each content unit: matching the content unit with a preset content template library; if the matching is successful, adding 1 to the matching frequency of the matched content template in the content template library; then judging whether the matching frequency is greater than a preset second specified threshold value, if so, determining that the frequency of the content unit is higher, thus determining that the content unit is the non-text content of the webpage, and deleting the content unit from the text content; if the judgment result is negative, the occurrence frequency of the content unit is considered to be low, so that the content unit is determined to be the text content of the webpage, and the process of filtering the high-frequency unit of the content unit is ended.
Preferably, the process of matching the content unit with the preset content template library may be, but is not limited to, the following:
firstly, searching a content template consistent with the content of the content unit in each content template in a content template library, wherein the content template in the content template library is obtained by matching the content unit of at least one webpage in advance; and if the content template with the consistent content is found, the matching is considered to be successful, otherwise, the matching is considered to be failed.
Further, before searching for a content template consistent with the content of the content unit, it may be determined whether a content template is stored in the content template library; if the content template is judged not to be stored, the matching is failed; and if the content templates are judged to be stored, the operation of searching the content templates which are consistent with the content of the content unit in each content template in the content template library is executed. When the content unit is matched with the content template library, the content template library possibly stores content templates, and at the moment, the operation of searching the content template consistent with the content of the content unit in each content template in the content template library can be directly executed; the content template library may not store the content template, for example, in the case of initially establishing the content template library, the matching result of the content unit and the content template library is considered as a matching failure.
If the matching result of the content unit and the content template library is failure, the process of performing high-frequency filtering processing on the content unit can be directly finished; preferably, the content unit may be stored in the content template library as a new content template, and the matching frequency corresponding to the content unit may be set to an initial value, where the initial value of the matching frequency may be, but is not limited to, 1.
In order to save the storage capacity of the content template library, the content template can be encoded by using a preset encoding rule before the content template is stored in the content template library, and the encoded content template is stored in the content template library, so that the storage capacity of the content template library is effectively saved. That is, all the content templates stored in the content template library are encoded, so before the content unit is matched with the content template library, the content unit needs to be encoded by using a preset encoding rule, and then the encoded content unit needs to be matched with the content template library.
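The text does not say which encoding rule is used. Purely as an assumption for illustration, the sketch below uses a fixed-length SHA-1 digest as the "preset encoding rule", so that every stored template occupies the same small amount of space and a content unit is encoded the same way before matching. The function name `encode_unit` is hypothetical.

```python
import hashlib


def encode_unit(text: str) -> str:
    """Encode a content unit or template as a fixed-length digest.

    Assumption: a hash digest stands in for the unspecified preset encoding rule.
    """
    return hashlib.sha1(text.encode("utf-8")).hexdigest()
```

Because identical content always produces the same digest, exact-content matching against the library still works after encoding; a digest cannot be decoded back to the original text, so this choice only suits libraries that never need to display the templates.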
Because the high-frequency unit filtering processing is mainly used for filtering redundant information such as weather forecast, website source, click rate, copyright information and the like, the text length of the redundant information is generally not too long, in order to reduce the storage scale of the content template library, preferably, before the content unit is matched with the content template library, whether the text length of the content unit is greater than a preset third specified threshold value or not can be judged, if the judgment result is yes, the content unit is considered to be the text content, the high-frequency unit filtering processing is not needed, and then the process of performing the high-frequency unit filtering processing on the content unit is finished; if the judgment result is negative, the content unit is considered to be possibly redundant information, and at the moment, the operation of matching the content unit with a preset content template library is executed.
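The matching, counting, and threshold checks described above can be sketched as a small class. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, templates are matched by exact content, the default second threshold of 5 is borrowed from the worked example in this description, and the default third threshold of 30 characters is an arbitrary illustrative value.

```python
from typing import Dict


class TemplateLibrary:
    """Hypothetical content template library with per-template matching frequency."""

    def __init__(self, freq_threshold: int = 5, max_len: int = 30) -> None:
        self.freq_threshold = freq_threshold  # "second specified threshold"
        self.max_len = max_len                # "third specified threshold" (assumed value)
        self.counts: Dict[str, int] = {}      # template content -> matching frequency

    def is_text_content(self, unit: str) -> bool:
        """Return True if the unit is kept as text content, False if it is filtered."""
        if len(unit) > self.max_len:
            return True  # long units are treated as text content and never stored
        if unit not in self.counts:
            self.counts[unit] = 1  # matching failed: store as new template, initial value 1
            return True
        self.counts[unit] += 1  # matching succeeded: add 1 to the matching frequency
        return self.counts[unit] <= self.freq_threshold
```

A unit such as a copyright line is kept the first few times it is seen and is filtered once its matching frequency exceeds the threshold, which is why the library is best warmed up on several pages of the same site.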
In practical applications, searching for a content template consistent with the content of the content unit may be done directly, that is, by directly comparing the content of the content unit with the content of each content template to determine whether they are consistent. Alternatively, when a content unit is stored in the content template library as a content template, the keywords in the content unit may be extracted and stored as the content template, so that each content template in the library holds the corresponding keywords; when a consistent content template is searched for, the keywords of the content unit are extracted and compared with the keywords in each content template in the library to determine whether they are consistent.
Because the contents of the redundant information such as website sources, copyright information and the like are generally fixed and the change of characters is less, a method of directly searching a content template consistent with the contents of the content unit can be preferably adopted.
In HTML code, characters such as "<", ">", and "&" have special meanings: they are reserved characters of the HTML language and cannot be used directly. When such characters need to be displayed, escape sequences must be written in their place. The correspondence between escape sequences and display characters may be, but is not limited to, that shown in Table 4.
Table 4:
Escape sequence | Display character | Escape sequence | Display character | Escape sequence | Display character
&nbsp; | (non-breaking space) | &quot; | " | &ldquo; | “
&times; | × | &copy; | © | &rdquo; | ”
&divide; | ÷ | &reg; | ® | &mdash; | (em dash)
&amp; | & | &trade; | ™ | &#8240; | ‰
Therefore, before the high-frequency unit filtering is performed, the escape sequences in the HTML code need to be restored to the corresponding display characters, that is, each escape sequence in the HTML code is replaced with its corresponding display character.
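Restoring escape sequences can be illustrated with Python's standard library, whose `html.unescape` function converts named and numeric character references to the corresponding characters. The wrapper name `restore_display_characters` is hypothetical, and using this library function is an assumption for illustration rather than the mechanism the embodiment prescribes.

```python
import html


def restore_display_characters(html_text: str) -> str:
    """Replace HTML escape sequences (e.g. &amp;, &copy;, &#8240;) with display characters."""
    return html.unescape(html_text)
```

Running this step before content units are matched means that a copyright line written with &copy; on one page and with the literal © character on another is treated as the same content.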
The following illustrates a preferred flow of the high frequency unit filtering process.
Initially, no content template is stored in the content template library, and when a high-frequency unit filtering process is performed on a first webpage (referred to as a webpage 1, it should be noted that the webpage 1 may be a webpage from which text content is to be extracted, or may be other webpages), the text content of the webpage 1 obtained by using the link text density is first divided into a plurality of content units, for example, into a content unit 1A, a content unit 1B, and a content unit 1C; firstly, high-frequency unit filtering processing is performed on the content unit 1A: and obtaining the text length of the content unit 1A, judging that the text length of the content unit 1A is greater than a third specified threshold, considering that the content unit 1A is the text content at this moment, and ending the flow of performing the high-frequency unit filtering processing on the content unit 1A. High frequency unit filtering processing is next performed for the content unit 1B: the method comprises the steps of obtaining the text length of a content unit 1B, judging that the text length of the content unit 1B is not larger than a third specified threshold, considering that the content unit 1B is possibly redundant information, continuously judging whether a content template is stored in a content template library, directly storing the content unit 1B as a new content template into the content template library because the content template library is in an initial state and does not store any content template, setting the matching frequency of the content template as an initial value 1, and confirming that the content unit 1B is the text content because the matching frequency is smaller than a preset second specified threshold 5. 
High frequency unit filtering processing is next performed for the content unit 1C: obtaining the text length of the content unit 1C, judging that the text length of the content unit 1C is not greater than a third specified threshold, considering that the content unit 1C may be redundant information at this time, continuously judging whether a content template is stored in the content template library, searching for a content template consistent with the content of the content unit 1C because the content template corresponding to the content unit 1B is stored in the content template library, determining that the content unit 1C is the text content if the content template is not found, storing the content unit 1C as a new content template in the content template library, and setting the matching frequency of the content template as an initial value 1.
When the high-frequency unit filtering processing is performed on a second webpage (referred to as a webpage 2, it should be noted that the webpage 2 may be a webpage from which text content is to be extracted, or other webpages), the text content of the webpage 2 obtained by link filtering is divided into a plurality of content units, for example, into a content unit 2A and a content unit 2B; the high-frequency unit filtering processing is performed on the content unit 2A: and obtaining the text length of the content unit 2A, judging that the text length of the content unit 2A is greater than a third specified threshold, considering that the content unit 2A is the text content at this moment, and ending the flow of performing the high-frequency unit filtering processing on the content unit 2A. High frequency unit filtering processing is next performed for the content unit 2B: obtaining the text length of the content unit 2B, judging that the text length of the content unit 2B is not greater than a third specified threshold, considering that the content unit 2B may be redundant information, continuously judging whether a content template is stored in the content template library, because the content template library stores content templates corresponding to the content unit 1B and the content unit 1C, not searching for a content template consistent with the content of the content unit 2B, confirming that the content unit 2B is the text content, storing the content unit 2B as a new content template in the content template library, and setting the matching frequency of the content template as an initial value 1.
High-frequency unit filtering processing is performed on webpages 3 to N-1 according to the above flow. When high-frequency unit filtering processing is performed on the Nth webpage (referred to as webpage N; it should be noted that webpage N may be a webpage from which text content is to be extracted, or another webpage), the text content of webpage N obtained by link filtering is divided into a plurality of content units, for example into a content unit NA and a content unit NB. First, high-frequency unit filtering processing is performed on the content unit NA: the text length of the content unit NA is obtained and judged to be greater than the third specified threshold, so the content unit NA is considered to be text content, and the flow of performing high-frequency unit filtering processing on the content unit NA ends. High-frequency unit filtering processing is next performed on the content unit NB: the text length of the content unit NB is obtained and judged to be not greater than the third specified threshold, so the content unit NB is considered to be possibly redundant information; it is then judged whether a content template is stored in the content template library, and since content templates are stored, a content template consistent with the content of the content unit NB is searched for among them. Because the content of the content unit NB is the same as that of the content unit 1B, a consistent content template (the content template corresponding to the content unit 1B) is found, and its matching frequency is increased by 1. The matching frequency of the found content template is then 10, which is greater than the preset second specified threshold of 5; thus, the content unit NB is confirmed to be redundant information and is deleted from the text content.
Because the numerical values of the click rate and the browsing amount in the redundant information are constantly changed, in order to prevent the constantly changed numerical values of the click rate and the browsing amount from being wrongly judged as the text content when the high-frequency unit filtering is performed, before the determined content unit is matched with the content template library, the numerical values can be further subjected to digital normalization processing, and all digital characters contained in the content unit are converted into unified preset characters.
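The digit normalization can be sketched in a few lines. The function name `normalize_digits` and the choice of "0" as the unified preset character are assumptions for illustration.

```python
import re


def normalize_digits(unit: str, placeholder: str = "0") -> str:
    """Convert every digit character in the content unit into one preset character."""
    return re.sub(r"\d", placeholder, unit)
```

After normalization, two click-count lines that differ only in their numbers (with the same number of digits) become identical strings and therefore match the same content template.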
According to the embodiment of the invention, two filtering models are constructed for extracting the text content of a web page: a link filtering model and a high-frequency unit filtering model. The link filtering model is mainly used for filtering link content that is irrelevant to the text content, such as navigation bar links and advertisement links, and the high-frequency unit filtering model is mainly used for filtering redundant information that appears with high frequency in web pages. When the text content of a web page is extracted, the web page is first preprocessed, and the preprocessed web page can be pictured as being passed through the two filtering models in sequence, first the link filtering model and then the high-frequency unit filtering model; the result after filtering by both models is the text content of the web page.
In order to test the processing effect of link filtering, 100 webpages can be randomly selected from the crawled news webpages, and 100 non-text paragraphs containing link contents are selected from the selected 100 webpages as test corpora, wherein 98% of the link contents can be filtered by the link filtering model, and the remaining 2% of the link contents (such as the link contents shown in table 5) cannot be filtered by the link filtering model.
Table 5:
Although this 2% of the link content is not correctly filtered by the link filtering model, it appears repeatedly in multiple web pages and can therefore be correctly filtered by the high-frequency unit filtering model. Because the embodiment of the invention adopts the link filtering model and the high-frequency unit filtering model to carry out double filtering, redundant information is deleted to the maximum extent, and the accuracy of extracting the text content of the web page is further improved.
More detailed embodiments are given below.
As shown in fig. 2, which is a flowchart of a specific implementation of the method for extracting the text content of the web page in the embodiment of the present invention, the process of extracting the text content of the web page is divided into a preprocessing portion, a link filtering portion and a high frequency unit filtering portion, where steps 21 to 23 are the preprocessing portion, steps 24 to 29 are the link filtering portion, and steps 210 to 218 are the high frequency unit filtering portion, and the specific processing process of the method for extracting the text content of the web page is as follows:
step 21, unifying the web page encoding format of the web page from which the text content needs to be extracted;
step 22, performing tag simplification processing on the web page from which the text content needs to be extracted;
step 23, deleting code segments irrelevant to the text from the web page from which the text content needs to be extracted;
step 24, after the normalized preprocessing is completed, obtaining each container tag pair in the preprocessed web page;
step 25, dividing the preprocessed web page into content blocks according to the obtained container tag pairs;
step 26, determining the length of the link text and the length of the non-link text of each divided content block;
step 27, determining the link text density corresponding to each divided content block according to the determined link text length and non-link text length;
step 28, for each divided content block, respectively determining whether the link text density corresponding to the content block is greater than a preset first prescribed threshold, if yes, going to step 29, if no, going to step 210;
step 29, confirming that the content block is non-text content, and deleting the content block from the HTML code;
step 210, restoring the escape sequence in the HTML code into a corresponding display character;
step 211, determining the content between each tag in the HTML code and the next tag as a content unit;
step 212, converting each digital character contained in each content unit into a unified preset character;
step 213, determining whether the text length of each determined content unit is greater than the preset third specified threshold, if yes, going to step 218, and if no, going to step 214;
step 214, searching a content template consistent with the content of the content unit in each content template in the content template library, if the content template is found, turning to step 216, and if the content template is not found, turning to step 215;
step 215, storing the content unit as a new content template into a content template library, setting the matching frequency corresponding to the content template as an initial value, and then proceeding to step 218;
step 216, adding 1 to the matching frequency corresponding to the content template, and then going to step 217;
step 217, determining whether the current matching frequency corresponding to the content unit is greater than the preset second specified threshold, if yes, going to step 29, and if no, going to step 218;
step 218, determine the content unit to be the text content of the web page, and add the content unit to the text content result set.
Accordingly, an embodiment of the present invention provides an apparatus for extracting text content of a web page, which has a structure as shown in fig. 3, and includes a content block dividing unit 31, a first text length determining unit 32, a first link text density determining unit 33, a first link text density judging unit 34, and a text content determining unit 35, where:
a content block dividing unit 31, configured to divide a web page from which text content needs to be extracted into content blocks;
a first text length determining unit 32, configured to determine, for each content block, a link text length and a non-link text length of the content block, respectively;
a first link text density determining unit 33, configured to determine, according to the link text length and the non-link text length determined by the first text length determining unit 32, a link text density corresponding to the content block;
a first link text density judging unit 34 configured to judge whether the link text density determined by the first link text density determining unit 33 is greater than a preset first prescribed threshold;
and a text content determining unit 35, configured to determine that the content block is the text content of the web page when the judgment result of the first link text density judging unit 34 is negative.
Preferably, the content block dividing unit 31 specifically includes a preprocessing subunit, a tag pair obtaining subunit, and a content block dividing subunit, where:
the preprocessing subunit is used for carrying out standardized preprocessing on the webpage from which the text content needs to be extracted;
the tag pair obtaining subunit is used for obtaining each container tag pair in the web page after the preprocessing subunit performs preprocessing;
and the content block dividing subunit is used for dividing the web page preprocessed by the preprocessing subunit into content blocks according to the container tag pairs obtained by the tag pair obtaining subunit.
Preferably, the first link text density determining unit 33 specifically includes a ratio calculating subunit and a link text density determining subunit, where:
the ratio calculating subunit is used for calculating the ratio of the length of the link text to the length of the non-link text;
and the link text density determining subunit is used for determining the link text density corresponding to the content block according to the ratio calculated by the ratio calculating subunit.
Preferably, the link text density determining subunit is specifically configured to multiply the ratio calculated by the ratio calculating subunit by a penalty factor to obtain the link text density corresponding to the content block.
More preferably, the penalty factor is the number of links included in the content block.
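The density computation performed by these subunits can be sketched as a single function. The function name, the parameter names, and the handling of a block with no non-link text are assumptions for illustration; the ratio and the link-count penalty factor follow the description above.

```python
def link_text_density(link_len: int, non_link_len: int, link_count: int = 1) -> float:
    """Ratio of link text length to non-link text length, times a penalty factor.

    The penalty factor is the number of links in the block. Treating a block
    with links but no non-link text as infinitely dense is an assumption.
    """
    if non_link_len == 0:
        return float("inf") if link_len > 0 else 0.0
    return (link_len / non_link_len) * link_count
```

Multiplying by the number of links pushes navigation bars, which contain many short links, well above the threshold, while a paragraph with one incidental link is penalized very little.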
Preferably, the web page body content extracting apparatus further includes a first nested content block determining unit, a second text length determining unit, a second link text density determining unit, a second link text density judging unit, and a content block deleting unit, wherein:
a first nested content block determining unit, configured to determine, after the text content determining unit 35 determines that the content block is the text content of the web page, whether at least one nested content block is nested in the content block;
a second text length determining unit, configured to determine, for each nested content block, a link text length and a non-link text length of the nested content block, respectively, when the determination result of the first nested content block determining unit is yes;
the second link text density determining unit is used for determining the link text density corresponding to the nested content block according to the link text length and the non-link text length determined by the second text length determining unit;
the second link text density judging unit is used for judging whether the link text density corresponding to the nested content block is larger than a preset first specified threshold value or not;
and the content block deleting unit is used for deleting the nested content block from the text content when the judgment result of the second link text density judging unit is yes.
Preferably, the web page body content extracting apparatus further includes a second nested content block determining unit, a third text length determining unit, a third link text density determining unit, a third link text density judging unit, a non-body content determining unit, and a content determining unit, wherein:
a second nested content block determining unit, configured to determine whether at least one nested content block is nested in the content block, and if the determination result is negative, the first text length determining unit 32 performs, for each content block, determining a linked text length and a non-linked text length of the content block respectively;
a third text length determining unit, configured to determine, for each nested content block, a link text length and a non-link text length of the nested content block, respectively;
the third link text density determining unit is used for determining the link text density corresponding to the nested content block according to the link text length and the non-link text length determined by the third text length determining unit;
the third link text density judging unit is used for judging whether the link text density corresponding to the nested content block is larger than a preset first specified threshold value or not;
the non-text content determining unit is used for determining the nested content block as the non-text content of the webpage when the judgment result of the third link text density judging unit is yes;
and the content determining unit is used for taking the content in the content block other than the nested content blocks determined to be non-text content by the non-text content determining unit as the content block again.
According to the technical scheme of the embodiment of the invention, for each content block of a web page whose text content needs to be extracted, whether the content block is text content of the web page is determined according to the proportion of link text in the content block (namely the link text density). The larger this proportion, the less likely the content block is to be text content: if the link text density is greater than the first specified threshold, the content block is determined to be non-text content, and otherwise it is determined to be text content. Because this judgment is made separately for each web page whose text content needs to be extracted, it is not affected by differences in web page layout, and the proportion of link text in a content block objectively and accurately reflects how likely the content block is to be web page text, so the scheme is more accurate and reasonable than judgments based on other factors.
Preferably, the web page text content extracting apparatus further includes a content unit dividing unit, a content matching unit, a matching frequency processing unit, and a content deleting unit, wherein:
the content unit dividing unit is used for acquiring each tag in the text content and determining the content between every two adjacent tags as a content unit;
the content matching unit is used for respectively matching the content units with a preset content template library aiming at each content unit;
the matching frequency processing unit is used for adding 1 to the matching frequency of the matched content template in the content template library when the content matching unit is successfully matched;
and the content deleting unit is used for judging whether the matching frequency is greater than a preset second specified threshold value, and if so, deleting the content unit from the text content.
Preferably, the content matching unit specifically includes a content template searching subunit and a matching confirmation subunit, where:
the content template searching subunit is used for searching a content template consistent with the content of the content unit in each content template in a content template library, wherein the content template is obtained by matching the content unit of at least one webpage in advance;
and the matching confirmation subunit is used for confirming that the matching is successful when the content template searching subunit searches the content template, and confirming that the matching is failed when the content template searching subunit does not search the content template.
Preferably, the web page text content extracting apparatus further includes a content template judging unit and a matching result confirming unit, wherein:
a content template judging unit, configured to judge whether a content template is stored in the content template library;
and the matching result confirming unit is used for confirming that the matching fails when the judgment result of the content template judging unit is negative, and the content template searching subunit executes to search the content template consistent with the content of the content unit in each content template in the content template library when the judgment result of the content template judging unit is positive.
Preferably, the device for extracting the text content of the web page further comprises a content template storage unit, which is used for storing the content unit as a new content template into the content template library when the matching of the content matching unit fails, and setting the corresponding matching frequency as an initial value.
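The matching, empty-library check, and storage-on-failure behaviour described above can be sketched in Python as follows. The function name `match_unit`, the dictionary-based library, and the initial value of 1 are assumptions; the text only states that the matching frequency is set to "an initial value".

```python
def match_unit(unit, template_freq, initial_value=1):
    """Match one content unit against the template library.

    Returns True when a consistent template is found (its frequency is
    incremented). Returns False when the library is empty or holds no such
    template; the unit is then stored as a new template with the assumed
    initial frequency.
    """
    # An empty library means the matching fails without any search.
    if not template_freq or unit not in template_freq:
        template_freq[unit] = initial_value
        return False
    template_freq[unit] += 1
    return True
```

Under these assumptions the library grows as pages are processed, so a navigation bar that repeats across pages accumulates frequency until it crosses the second threshold.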
Preferably, the web page text content extracting apparatus further includes a text length judging unit and a matching unit, wherein:
the text length judging unit is used for judging whether the text length of the content unit is greater than a preset third specified threshold value before the content matching unit matches the content unit with a preset content template library;
and the matching unit is used for matching each content unit against the preset content template library when the judgment result of the text length judging unit is negative.
Preferably, the web page text content extracting apparatus further includes a character converting unit, configured to convert each digit character included in the content unit into a unified preset character before the content matching unit matches the content unit with a preset content template library.
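The two pre-matching steps, the length check against the third threshold and the digit conversion, might look like the following Python sketch. The placeholder character `#` and the function names are assumptions; the text does not say which preset character is used.

```python
import re

DIGIT_RE = re.compile(r"\d")  # in Python 3 this also matches full-width digits


def normalize_digits(unit, placeholder="#"):
    """Replace every digit with one preset character so that units differing
    only in numbers (dates, counters) map to the same template."""
    return DIGIT_RE.sub(placeholder, unit)


def should_match(unit, third_threshold):
    """Only units no longer than the third threshold are sent to matching;
    longer units are assumed to be genuine text and are skipped."""
    return len(unit) <= third_threshold
```

With digit conversion, a line such as "Posted 2011-06-02 10:35" becomes "Posted ####-##-## ##:##", so timestamps from different pages collapse into a single template.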
Because the embodiment of the invention adopts the link filtering model and the high-frequency unit filtering model to carry out double filtering, redundant information is deleted to the maximum extent, and the accuracy of extracting the text content of the webpage is further improved.
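As a rough illustration of the link filtering model, the Python sketch below computes the link text density of one content block as the ratio of link text length to non-link text length multiplied by a penalty factor equal to the number of links, which is how the claims define it. The class and function names are hypothetical, lengths are counted in characters after trimming whitespace, and the handling of a block with no non-link text (treated as infinitely dense) is an assumption not stated in the text.

```python
from html.parser import HTMLParser


class _LinkTextCounter(HTMLParser):
    """Accumulate link and non-link text lengths for one content block."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside an <a> element
        self.link_len = 0
        self.non_link_len = 0
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self.depth > 0:
            self.link_len += n
        else:
            self.non_link_len += n


def link_text_density(block_html):
    """Ratio of link text to non-link text, multiplied by the link count."""
    counter = _LinkTextCounter()
    counter.feed(block_html)
    counter.close()
    if counter.non_link_len == 0:
        # Assumption: a block made only of link text is maximally dense.
        return float("inf") if counter.link_len else 0.0
    return counter.link_len / counter.non_link_len * counter.link_count


def is_body_block(block_html, first_threshold):
    """A block is kept as text content when its density is not greater
    than the first threshold."""
    return link_text_density(block_html) <= first_threshold
```

Blocks that pass `is_body_block` would then go through the high-frequency unit filter, giving the double filtering described above.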
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (23)

1. A webpage text content extraction method is characterized by comprising the following steps:
dividing a webpage from which text content needs to be extracted into content blocks;
respectively executing the following steps for each divided content block:
determining the length of the link text and the length of the non-link text of the content block; and
determining the link text density corresponding to the content block according to the determined link text length and the determined non-link text length;
when the link text density is not greater than a preset first specified threshold value, determining the content block as the text content of the webpage;
acquiring each tag in the text content, and determining the content between each pair of adjacent tags as a content unit;
for each content unit, performing:
matching the content unit with a preset content template library;
if the matching is successful, adding 1 to the matching frequency of the matched content template in the content template library;
and judging whether the matching frequency is greater than a preset second specified threshold value, if so, deleting the content unit from the text content.
2. The method for extracting the text content of the web page as claimed in claim 1, wherein the dividing the web page from which the text content is to be extracted into content blocks specifically comprises:
carrying out normalized preprocessing on a webpage from which text content needs to be extracted;
obtaining each container tag pair in the preprocessed webpage;
and dividing the preprocessed webpage into a plurality of content blocks according to the obtained container tag pairs.
3. The method for extracting the text content of the web page according to claim 1, wherein the determining the link text density corresponding to the content block according to the determined link text length and the determined non-link text length specifically comprises:
calculating the ratio of the length of the link text to the length of the non-link text;
and determining the link text density corresponding to the content block according to the ratio.
4. The method for extracting the text content of the web page as claimed in claim 3, wherein the determining the link text density corresponding to the content block according to the ratio specifically comprises:
and multiplying the ratio by a penalty factor to obtain the link text density corresponding to the content block.
5. The method of claim 4, wherein the penalty factor is the number of links included in the content block.
6. The method for extracting the text content of the web page according to claim 1, wherein after determining that the content block is the text content of the web page and before acquiring each tag in the text content, the method further comprises:
judging whether at least one nested content block is nested in the content block; if the judgment result is negative, ending;
if the judgment result is yes, respectively executing the following steps for each nested content block:
determining the length of a link text and the length of a non-link text of the nested content block; and
determining the link text density corresponding to the nested content block according to the determined link text length and the determined non-link text length;
and when the link text density corresponding to the nested content block is greater than a preset first specified threshold value, deleting the nested content block from the text content.
7. The web page text content extracting method according to claim 1, wherein before the separately executing for each divided content block, further comprising:
judging whether at least one nested content block is nested in the content block; if the judgment result is negative, executing the steps respectively executed aiming at each divided content block;
if the judgment result is yes, respectively executing the following steps for each nested content block:
determining the length of a link text and the length of a non-link text of the nested content block; and
determining the link text density corresponding to the nested content block according to the determined link text length and the determined non-link text length;
when the link text density corresponding to the nested content block is larger than a preset first specified threshold value, determining that the nested content block is the non-text content of the webpage;
and taking the content except the nested content blocks determined as the non-text content in the content block as the content block again.
8. The method for extracting the text content of the web page according to claim 1, wherein the matching of the content unit with a preset content template library specifically comprises:
searching the content templates in the content template library for a content template consistent with the content of the content unit, wherein the content templates are obtained in advance by matching the content units of at least one webpage;
if such a content template is found, the matching succeeds; otherwise, the matching fails.
9. The method for extracting the text content of the web page according to claim 8, wherein before searching the content template consistent with the content of the content unit in each content template in the content template library, the method further comprises:
judging whether the content template library stores content templates or not;
if the content template is not stored, the matching is failed;
and if the content templates are stored, executing the operation of searching the content templates which are consistent with the content of the content unit in each content template in the content template library.
10. The method for extracting text contents of web pages according to any one of claims 1 to 9, wherein if the matching fails, further comprising:
and taking the content unit as a new content template, storing the content unit into the content template library, and setting the corresponding matching frequency as an initial value.
11. The method for extracting the text content of the web page according to claim 1, wherein before the matching of the content unit with the preset content template library, the method further comprises:
judging whether the text length of the content unit is larger than a preset third specified threshold value or not;
if the judgment result is yes, ending;
and if the judgment result is negative, executing the operation of matching the content unit with a preset content template library.
12. The method for extracting the text content of the web page as claimed in claim 1, wherein before the matching of the content unit with the preset content template library, the method further comprises:
and converting each digit character contained in the content unit into a unified preset character.
13. A web page text content extraction apparatus, comprising:
the content block dividing unit is used for dividing the webpage of which the text content needs to be extracted into content blocks;
a first text length determining unit, configured to determine, for each content block, a link text length and a non-link text length of the content block, respectively;
the first link text density determining unit is used for determining the link text density corresponding to the content block according to the link text length and the non-link text length determined by the first text length determining unit;
the first link text density judging unit is used for judging whether the link text density determined by the first link text density determining unit is greater than a preset first specified threshold value or not;
the text content determining unit is used for determining the content block as the text content of the webpage when the judgment result of the first link text density judging unit is negative;
the content unit dividing unit is used for acquiring each tag in the text content and determining the content between each pair of adjacent tags as a content unit;
the content matching unit is used for matching each content unit against a preset content template library;
the matching frequency processing unit is used for adding 1 to the matching frequency of the matched content template in the content template library when the content matching unit matches successfully;
and the content deleting unit is used for judging whether the matching frequency is greater than a preset second specified threshold value, and if so, deleting the content unit from the text content.
14. The apparatus for extracting text content of a web page according to claim 13, wherein the content block dividing unit specifically includes:
the preprocessing subunit is used for carrying out standardized preprocessing on the webpage from which the text content needs to be extracted;
the tag pair obtaining subunit is used for obtaining each container tag pair in the webpage preprocessed by the preprocessing subunit;
and the content block dividing subunit is used for dividing the webpage preprocessed by the preprocessing subunit into content blocks according to the container tag pairs obtained by the tag pair obtaining subunit.
15. The web page text content extracting apparatus according to claim 13, wherein the first link text density determining unit specifically includes:
the ratio calculating subunit is used for calculating the ratio of the length of the link text to the length of the non-link text;
and the link text density determining subunit is used for determining the link text density corresponding to the content block according to the ratio calculated by the ratio calculating subunit.
16. The web page text content extracting apparatus according to claim 15, wherein the link text density determining subunit is specifically configured to multiply the ratio calculated by the ratio calculating subunit by a penalty factor to obtain the link text density corresponding to the content block.
17. The web page text content extracting apparatus according to claim 13, further comprising:
the first nested content block judging unit is used for judging whether at least one nested content block is nested in the content block after the text content determining unit determines that the content block is the text content of the webpage;
a second text length determining unit, configured to determine, for each nested content block, a link text length and a non-link text length of the nested content block, respectively, when the determination result of the first nested content block determining unit is yes;
the second link text density determining unit is used for determining the link text density corresponding to the nested content block according to the link text length and the non-link text length determined by the second text length determining unit;
the second link text density judging unit is used for judging whether the link text density corresponding to the nested content block is larger than a preset first specified threshold value or not;
and the content block deleting unit is used for deleting the nested content block from the text content when the judgment result of the second link text density judging unit is yes.
18. The web page text content extracting apparatus according to claim 13, further comprising:
a second nested content block judging unit, configured to judge whether at least one nested content block is nested in the content block, wherein if the judgment result is negative, the first text length determining unit determines, for each content block, the link text length and the non-link text length of the content block;
a third text length determining unit, configured to determine, for each nested content block, a link text length and a non-link text length of the nested content block, respectively;
the third link text density determining unit is used for determining the link text density corresponding to the nested content block according to the link text length and the non-link text length determined by the third text length determining unit;
the third link text density judging unit is used for judging whether the link text density corresponding to the nested content block is larger than a preset first specified threshold value or not;
the non-text content determining unit is used for determining the nested content block as the non-text content of the webpage when the judgment result of the third link text density judging unit is yes;
and the content determining unit is used for taking the content of the content block, excluding the nested content blocks determined as non-text content by the non-text content determining unit, as the content block again.
19. The apparatus for extracting text content of a web page according to claim 13, wherein the content matching unit specifically includes:
the content template searching subunit is used for searching the content templates in the content template library for a content template consistent with the content of the content unit, wherein the content templates are obtained in advance by matching the content units of at least one webpage;
and the matching confirmation subunit is used for confirming that the matching succeeds when the content template searching subunit finds such a content template, and confirming that the matching fails when the content template searching subunit does not find one.
20. The web page text content extracting apparatus according to claim 19, further comprising:
a content template judging unit, configured to judge whether a content template is stored in the content template library;
a matching result confirming unit, configured to confirm that the matching fails when the judgment result of the content template judging unit is negative; when the judgment result of the content template judging unit is positive, the content template searching subunit searches the content templates in the content template library for a content template consistent with the content of the content unit.
21. The web page text content extracting apparatus according to any one of claims 13 to 20, further comprising:
and the content template storage unit is used for storing the content unit serving as a new content template into the content template library when the matching fails, and setting the corresponding matching frequency as an initial value.
22. The web page text content extracting apparatus according to claim 13, further comprising:
the text length judging unit is used for judging whether the text length of the content unit is greater than a preset third specified threshold value before the content matching unit matches the content unit with a preset content template library;
and the matching unit is used for matching each content unit against the preset content template library when the judgment result of the text length judging unit is negative.
23. The web page text content extracting apparatus according to claim 13, further comprising:
and the character converting unit is used for converting each digit character contained in the content unit into a unified preset character before the content matching unit matches the content unit with a preset content template library.
CN201110147583.7A 2011-06-02 2011-06-02 Webpage text content extracting method and device Active CN102810097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110147583.7A CN102810097B (en) 2011-06-02 2011-06-02 Webpage text content extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110147583.7A CN102810097B (en) 2011-06-02 2011-06-02 Webpage text content extracting method and device

Publications (2)

Publication Number Publication Date
CN102810097A CN102810097A (en) 2012-12-05
CN102810097B true CN102810097B (en) 2016-03-02

Family

ID=47233804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110147583.7A Active CN102810097B (en) 2011-06-02 2011-06-02 Webpage text content extracting method and device

Country Status (1)

Country Link
CN (1) CN102810097B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064827A (en) * 2013-01-16 2013-04-24 盘古文化传播有限公司 Method and device for extracting webpage content
CN103714176A (en) * 2014-01-08 2014-04-09 同济大学 Webpage text extraction method based on maximum text density
CN103870606B (en) * 2014-04-08 2017-05-10 上海语天信息技术有限公司 Webpage information extracting system and extracting method
CN103927397B (en) * 2014-05-05 2017-02-22 湖北文理学院 Recognition method for Web page link blocks based on block tree
CN103955632B (en) * 2014-05-07 2018-03-06 百度在线网络技术(北京)有限公司 The encryption display methods and device of webpage word
CN104598577B (en) * 2015-01-14 2017-09-15 晶赞广告(上海)有限公司 A kind of extracting method of Web page text
CN106407217B (en) * 2015-07-31 2019-12-24 北京国双科技有限公司 Navigation webpage identification method and device
CN106528504A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Data screening method and device for social application
CN106802899B (en) * 2015-11-26 2020-11-24 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN106855859B (en) * 2015-12-08 2020-11-10 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN105808644A (en) * 2016-02-25 2016-07-27 浪潮软件集团有限公司 Method and device for determining text node
CN107203527B (en) * 2016-03-16 2019-06-28 北大方正集团有限公司 The text extracting method and system of news web page
CN106776886B (en) * 2016-11-29 2019-09-24 中国农业银行股份有限公司 A kind of Webpage body matter abstracting method and device
CN108628817B (en) * 2017-03-15 2022-07-26 腾讯科技(深圳)有限公司 Data processing method and device
CN107391559B (en) * 2017-06-08 2020-06-02 广东工业大学 General forum text extraction algorithm based on block, pattern recognition and line text
CN109033282B (en) * 2018-07-11 2021-07-23 山东邦尼信息科技有限公司 Webpage text extraction method and device based on extraction template
CN110968807A (en) * 2018-09-27 2020-04-07 北京国双科技有限公司 Webpage text extraction method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1696933A (en) * 2005-05-27 2005-11-16 清华大学 Method for automatic picking up conceptual relationship of text based on dynamic programming
CN101937438A (en) * 2009-06-30 2011-01-05 富士通株式会社 Method and device for extracting webpage content

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7035841B2 (en) * 2002-07-18 2006-04-25 Xerox Corporation Method for automatic wrapper repair
CN101950309A (en) * 2010-10-08 2011-01-19 华中师范大学 Subject area-oriented method for recognizing new specialized vocabulary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于分块的网页正文信息提取算法研究;黄文蓓等;《计算机应用》;20070630;第27卷;第24-26页 *

Also Published As

Publication number Publication date
CN102810097A (en) 2012-12-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200424

Address after: 310012 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 102200, No. 8, No., Changsheng Road, Changping District science and Technology Park, Beijing, China. 1-5

Patentee before: AUTONAVI SOFTWARE Co.,Ltd.
