CN103020129A

CN103020129A - Text content extraction method and text content extraction device

Info

Publication number: CN103020129A
Application number: CN2012104699406A
Authority: CN
Inventors: 叶伟
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2012-11-20
Filing date: 2012-11-20
Publication date: 2013-04-03
Anticipated expiration: 2032-11-20
Also published as: CN103020129B; WO2013178193A2; WO2013178193A3

Abstract

The invention discloses a text content extraction method and a text content extraction device. The method comprises: decomposing an input HTML (Hypertext Markup Language) web page into a plurality of modules, determining a position score for each module according to its position in the page layout, and calculating the text length of each module; extracting the link addresses contained in the modules, finding the character content that occurs most frequently in the link addresses, marking link addresses that contain this character content as valid links, and marking link addresses that do not contain it as invalid links; and determining a composite score for each module according to the formula composite score = position score × (text length + text length of the valid links) / text length of the invalid links, and identifying modules whose composite score exceeds a set threshold as content modules. The method effectively removes redundant non-content information from the web page and extracts the actual content of the web page more accurately.

Description

Text content extraction method and device

Technical field

The present invention relates to the field of communication technology, and in particular to a text content extraction method and device.

Background art

With the rapid development of Internet technology, browsing web pages has gradually become the main way people obtain information, and text makes up most of the information on the pages they encounter. Extracting the text of a page effectively is therefore important. If all of the text is extracted indiscriminately, a great deal of unnecessary content is inevitably mixed in, such as advertisements and navigation information. This information is usually heavily repeated and is not what the user is interested in or needs. Moreover, large amounts of repeated and invalid information reduce the accuracy of text clustering and text classification and increase the workload of content retrieval. Because page typesetting and layout vary widely between web pages, dividing a page by module or position alone rarely yields accurate, useful text.

At present, text content is extracted by decomposing the input web page into a plurality of modules and calculating a composite score for each module to determine whether it is a content module. The composite score is calculated as: composite score = position score × text length / link text length. This calculation is still not accurate enough and cannot separate content reliably. How to provide a text extraction method that extracts text content accurately is therefore a technical problem in urgent need of a solution.

Summary of the invention

The invention provides a text content extraction method and device to solve the problem that the text content extraction methods used in the prior art cannot extract text content accurately.

To solve the above problem, the invention adopts the following technical solution:

In one aspect, the invention provides a text content extraction method, comprising:

decomposing an input Hypertext Markup Language (HTML) web page into a plurality of modules, determining a position score for each module according to its position in the page layout, and calculating the text length of each module;

extracting the link addresses contained in each module, finding the character content that occurs most frequently across all link addresses, excluding protocol characters, marking each link address that contains this most frequent character content as a valid link, and marking each link address that does not contain it as an invalid link;

determining a composite score for each module according to composite score of the module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and determining that a module whose composite score exceeds a set threshold is a content module.
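For comparison, the prior-art score from the background section and the score proposed here can be written side by side. The symbols are introduced only for this illustration and do not appear in the original text: $P$ is the position score of a module, $T$ its text length, $L$ the text length of all of its links, $V$ the text length of its valid links, $I$ the text length of its invalid links, and $\theta$ the set threshold.

```latex
S_{\text{prior}} = P \times \frac{T}{L},
\qquad
S = P \times \frac{T + V}{I},
\qquad
\text{content module} \iff S > \theta
```

The text does not say how a module with $I = 0$ is scored, so any handling of that case is an implementation choice.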

Further, in the method of the invention, Table tags or Div tags are used to decompose the input HTML web page into a plurality of modules.

Further, in the method of the invention, if a module obtained by decomposition can be decomposed further without tag mixing occurring, that module is decomposed further.

Further, in the method of the invention, the text length within each link is calculated for all links together when the valid and invalid links are marked; alternatively, the text length within each link contained in a module is calculated separately for each module when its composite score is determined.

Further, in the method of the invention, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text contained in the module according to these HTML tags, and calculating the length of this text to obtain the text length of the module.

Further, in the method of the invention, the link addresses of each module are extracted by means of anchor tags.

In another aspect, the invention also provides a text content extraction device, comprising:

a web page processing unit, configured to decompose an input Hypertext Markup Language (HTML) web page into a plurality of modules, determine a position score for each module according to its position in the page layout, and calculate the text length of each module;

a marking processing unit, configured to extract the link addresses contained in each module, find the character content that occurs most frequently across all link addresses, excluding protocol characters, mark each link address that contains this most frequent character content as a valid link, and mark each link address that does not contain it as an invalid link;

a content extraction unit, configured to determine a composite score for each module according to composite score of the module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and to determine that a module whose composite score exceeds a set threshold is a content module.

Further, in the device of the invention, the web page processing unit is specifically configured to decompose the input HTML web page into a plurality of modules using Table tags or Div tags.

Further, in the device of the invention, the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring and, if so, to decompose that module further.

Further, in the device of the invention, the marking processing unit is also configured to calculate the text length within each link for all links together when the valid and invalid links are marked; alternatively, the content extraction unit is also configured to calculate the text length within each link contained in a module separately for each module when its composite score is determined.

Further, in the device of the invention, the web page processing unit is specifically configured, for each module, to extract the HTML tags of the module, obtain the text contained in the module according to these HTML tags, and calculate the length of this text to obtain the text length of the module.

Further, in the device of the invention, the marking processing unit is specifically configured to extract the link addresses of each module by means of anchor tags.

Compared with the prior art, the invention has the following beneficial effects:

The method and device of the invention use the ratio of the sum of the plain text length and the valid link text length to the invalid link text length. They can therefore extract the content of an HTML web page more accurately and remove redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification and automatic summarization.

Description of drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a text content extraction method provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of a page layout in an embodiment of the invention;

Fig. 3 is a detailed flowchart of the text content extraction method provided by an embodiment of the invention;

Fig. 4 is a structural block diagram of a text content extraction device provided by an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

To solve the problem that the text content extraction methods used in the prior art cannot extract text content accurately, the embodiments of the invention provide a text content extraction method and device.

As shown in Fig. 1, a text content extraction method provided by an embodiment of the invention specifically comprises:

Step S101: decompose the input HTML web page into a plurality of modules, determine a position score for each module according to its position in the page layout, and calculate the text length of each module;

In this step, Table tags or Div tags are preferably used to decompose the input HTML web page into a plurality of modules.

Further, in this step, if a module obtained by decomposition can be decomposed further without tag mixing occurring, that module is decomposed further. Tag mixing is explained as follows. Mainstream page layouts currently fall into two kinds: the page structure is laid out using either <Table> tags or <Div> tags. When page content is edited, however, these two tags may contain one another: a page laid out with <Table> may contain <Div> tags, and a page laid out with <Div> may likewise contain <Table> tags. Tag mixing can also refer to tags that control structure (such as <Table> and <h1>) being used together with tags that control presentation (such as <font> and <b>), which makes correction and data division difficult. Because the invention is concerned with dividing modules, tag mixing here mainly refers to the mixed use of <Table> and <Div> tags.
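The recursive decomposition can be sketched in Python with the standard-library `html.parser` module. This is a minimal illustration under stated assumptions, not the implementation described by the source: the names `Node`, `LayoutTreeBuilder`, `decompose` and `split_into_modules` are hypothetical, the input is assumed to be reasonably well-formed, and "tag mixing" is interpreted as a Table module whose nearest nested layout tag is a Div (or vice versa), in which case decomposition stops at that module.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"div", "table"}  # the two layout tags named in the text


class Node:
    """A layout container (<div> or <table>) and the text found inside it."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # nearest nested layout containers only
        self.text = []      # text fragments anywhere inside this container


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree of layout containers; all other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in LAYOUT_TAGS and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        # Record the text on the current container and all of its ancestors.
        node = self.current
        while node is not None:
            node.text.append(data)
            node = node.parent


def is_mixed(node):
    """Assumed meaning of tag mixing: a nested layout tag differs from its parent."""
    return any(child.tag != node.tag for child in node.children)


def decompose(node):
    """Keep splitting while a module has nested containers and no tag mixing."""
    if node.tag != "root" and (not node.children or is_mixed(node)):
        return [node]
    modules = []
    for child in node.children:
        modules.extend(decompose(child))
    return modules


def split_into_modules(html):
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    return [(m.tag, " ".join("".join(m.text).split())) for m in decompose(builder.root)]
```

In this sketch an outer Div that contains only Divs is split into its inner Divs, while a Table that contains a Div is kept as a single module. Text that sits directly in a container that is later split is not assigned to any module, which a fuller implementation would need to handle.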

Further, in this step, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text contained in the module according to these HTML tags, and calculating the length of this text to obtain the text length of the module.
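One way to obtain the text length of a module is to strip its tags and count the remaining characters. The sketch below uses Python's standard `html.parser`. The names `TextCollector` and `module_text_length` are hypothetical, and three choices are assumptions not stated in the source: length is measured in characters, runs of whitespace are collapsed, and the contents of `script` and `style` elements are ignored. The source also does not say whether anchor text counts toward the module text length; this sketch counts all visible text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text, skipping script and style bodies (an assumption)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def module_text_length(module_html):
    """Return the number of characters of plain text in one module."""
    collector = TextCollector()
    collector.feed(module_html)
    collector.close()
    text = " ".join("".join(collector.parts).split())  # collapse whitespace
    return len(text)
```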

Step S102: extract the link addresses contained in each module, find the character content that occurs most frequently across all link addresses, excluding protocol characters, mark each link address that contains this most frequent character content as a valid link, and mark each link address that does not contain it as an invalid link;

In this step, the link addresses of each module are preferably extracted by means of anchor tags.
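Extracting link addresses from anchor tags can be sketched as follows, again with the standard `html.parser`. The names `AnchorCollector` and `extract_links` are hypothetical. The sketch also records the anchor text of each link, on the assumption that the "text length within a link" used later refers to the visible anchor text; anchors without an `href` attribute are skipped.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor_text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(module_html):
    """Return the list of (link address, anchor text) pairs in one module."""
    collector = AnchorCollector()
    collector.feed(module_html)
    collector.close()
    return collector.links
```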

Step S103: determine a composite score for each module according to composite score of the module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and determine that a module whose composite score exceeds a set threshold is a content module.

The text lengths of the valid links and of the invalid links used in this step are preferably calculated for all links together when the valid and invalid links are marked; they can of course also be calculated separately for each module when its composite score is determined.
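The composite score itself is a one-line computation. The function below is a hedged sketch with a hypothetical name and signature. One detail is an assumption added here: the source does not describe what happens when a module has no invalid links, so the denominator is floored at 1 to avoid division by zero.

```python
def composite_score(position_score, text_length, valid_link_length, invalid_link_length):
    """position score x (text length + valid link length) / invalid link length."""
    # Assumption (not in the source): treat a zero invalid-link length as 1
    # so that modules without invalid links still receive a finite score.
    denominator = max(invalid_link_length, 1)
    return position_score * (text_length + valid_link_length) / denominator
```

For example, a module with position score 0.8, 400 characters of text, 60 characters of valid link text and 20 characters of invalid link text scores 0.8 × 460 / 20 = 18.4.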

To explain the implementation of the method more clearly, the method is further described below with reference to Figs. 2 and 3, as follows:

The text content extraction method provided by the embodiment of the invention is an improved text content extraction method based on HTML features, which divides text content more accurately and reasonably. In the embodiment, the page layout is divided into content modules and non-content modules, as shown in Fig. 2. A content module is the content part of the web page, whereas a non-content module generally displays information such as navigation, banners, copyright notices and/or advertisements. The goal of the scheme is to decompose the HTML web page accurately and extract the content modules from it. Each module obtained by decomposition is given a score according to its position in the page layout: a module at the focus of the user's line of sight scores higher, and other modules score lower. If the invalid link text length of a module is disproportionately large, the module is likely to be advertising or navigation information.

The text content extraction method based on HTML features according to the embodiment specifically comprises:

Step 1: use tags to decompose the input HTML web page into a plurality of modules;

In this step, Table or Div tags are preferably used to decompose the input HTML web page into a plurality of modules. The embodiment decomposes modules using the Table and Div tags for two reasons. First, these are the tags used for page layout. Second, doing so reduces the complexity of parsing the page: other tags such as span and br are not processed, which greatly speeds up parsing and reduces the system resources consumed.

Step 2: if a module obtained in step 1 can be decomposed further and no mixing of Table and Div tags occurs, the module is passed back to step 1 for further decomposition.

Step 3: assign each input module a position score according to its position in the layout. The specific score for each position in the layout is set in advance. The basic principle is that the closer a module is to the position in the layout where the user's attention is concentrated, the higher its weight and the higher its position score.

Step 4: calculate the text length of each module.

Step 5: find the character content that occurs most frequently in the link addresses of all modules. Link addresses that belong to content related to this web page necessarily share some character content, whereas link addresses such as those of advertisements do not contain this shared content. The most frequent character content can therefore be used to distinguish valid link addresses (links to content related to the page) from invalid link addresses (advertisements and other links unrelated to the page content).

In this step, characters common to all URLs (Uniform Resource Locators), such as the protocol characters www and http, are excluded from the count.
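The source does not define precisely what unit of "character content" is counted. As one possible reading, the sketch below treats the host name of each link, with the scheme and a leading `www.` removed, as the character content and picks the host that occurs most often. The names `link_key` and `most_frequent_content` are hypothetical, and skipping relative links (which have no host) is a further assumption.

```python
from collections import Counter
from urllib.parse import urlparse


def link_key(url):
    """Host name without scheme or leading 'www.'; None for relative links."""
    host = urlparse(url).hostname
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host


def most_frequent_content(urls):
    """Return the most common host across all link addresses, or None."""
    counts = Counter(key for key in map(link_key, urls) if key)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With two links to `www.example.com` and one to `ads.tracker.net`, the function returns `example.com`. Other tokenizations, such as counting path segments or longest common substrings, would fit the description equally well.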

Step 6: mark each link address that contains the character content found in step 5 as a valid link, and calculate the text length within each valid link;

Step 7: mark each link address that does not contain the character content found in step 5 as an invalid link, and calculate the text length within each invalid link.
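Marking the links and totalling their text lengths can be sketched as below. The function name `mark_links` is hypothetical. The sketch assumes that each link is given as a (link address, anchor text) pair, that "contains" means a plain substring match on the link address, and that the text length within a link is the character count of its anchor text.

```python
def mark_links(links, key):
    """Split links into valid and invalid and total their anchor-text lengths.

    links: iterable of (href, anchor_text) pairs
    key:   the most frequent character content found across all link addresses
    """
    valid, invalid = [], []
    for href, anchor_text in links:
        # A link is valid when its address contains the most frequent content.
        (valid if key in href else invalid).append((href, anchor_text))
    valid_length = sum(len(text) for _, text in valid)
    invalid_length = sum(len(text) for _, text in invalid)
    return valid, invalid, valid_length, invalid_length
```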

Step 8: calculate the composite score of each module according to composite score of the module = position score of the module × (text length of the module + text length within the valid links) / text length within the invalid links. A module whose composite score is higher than the set threshold is regarded as a content module.

Step 9: compare the composite score from step 8 with the preset threshold (the minimum composite score that a content module should reach). If the score is higher than the threshold, the content of the module is regarded as text content to be extracted.

Based on the principles above, a concrete example is described below with reference to Fig. 3 (the step numbers in this paragraph are those of Fig. 3). After a web page is obtained, it is taken as the input. In step 1, if the input page can be decomposed into a plurality of modules using Table and Div tags, it is decomposed. Step 2 determines whether a module obtained by decomposition can be decomposed further; if so, the flow returns to step 1 to continue decomposing, and otherwise it proceeds to step 3. Step 3 strips all HTML tags from the module passed on by step 2 to obtain plain text and calculates the length of this text. Step 4 uses anchor tags to extract all links and finds the character content that occurs most frequently in the link addresses of all modules. Step 5 calculates the text lengths of the links that contain, and of the links that do not contain, the character content found in step 4, and marks them as valid links and invalid links respectively. Step 6 calculates the composite score of each module using the formula composite score = position score × (text length + valid link text length) / invalid link text length. Modules whose composite score is lower than the threshold are deleted in step 7, and modules whose composite score is higher than the threshold proceed to step 8 for output.
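The scoring and filtering part of this flow can be put together in a compact sketch that operates on modules which have already been decomposed and parsed. Everything here is illustrative: `Module`, `host_of` and `extract_content` are hypothetical names, the position scores and threshold are supplied by the caller, the most frequent character content is taken to be the most common host name, link text length is taken to be anchor text length, and a zero invalid-link length is floored at 1. None of these choices is specified by the source.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Module:
    name: str
    position_score: float
    text_length: int
    links: list = field(default_factory=list)  # (href, anchor_text) pairs


def host_of(url):
    """Host without a leading 'www.'; empty string for relative links."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def extract_content(modules, threshold):
    """Return the names of modules whose composite score exceeds the threshold."""
    # Most frequent character content across the links of all modules.
    hosts = Counter(host_of(h) for m in modules for h, _ in m.links if host_of(h))
    key = hosts.most_common(1)[0][0] if hosts else None

    kept = []
    for m in modules:
        valid = sum(len(t) for h, t in m.links if key and key in h)
        invalid = sum(len(t) for h, t in m.links if not (key and key in h))
        # Assumption: floor the denominator at 1 to avoid division by zero.
        score = m.position_score * (m.text_length + valid) / max(invalid, 1)
        if score > threshold:
            kept.append(m.name)
    return kept
```

Note that in this reading a navigation module whose links all point to the page's own site has no invalid links, so it is separated from content mainly by its lower position score and shorter text rather than by the link ratio.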

In summary, the method of the embodiment of the invention uses the ratio of the sum of the plain text length and the valid link text length to the invalid link text length. It can therefore extract the content of an HTML web page more accurately and remove redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification and automatic summarization.

As shown in Fig. 4, an embodiment of the invention also provides a text content extraction device, which specifically comprises:

a web page processing unit 410, configured to decompose the input HTML web page into a plurality of modules, determine a position score for each module according to its position in the page layout, and calculate the text length of each module;

a marking processing unit 420, configured to extract the link addresses contained in each module, find the character content that occurs most frequently across all link addresses, excluding protocol characters, mark each link address that contains this most frequent character content as a valid link, and mark each link address that does not contain it as an invalid link;

a content extraction unit 430, configured to determine a composite score for each module according to composite score of the module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and to determine that a module whose composite score exceeds a set threshold is a content module.

Based on the above framework, the specific way in which each unit implements its function is described below:

In the embodiment of the invention, the web page processing unit 410 specifically uses Table tags or Div tags to decompose the input HTML web page into a plurality of modules and, for each module, extracts the HTML tags of the module, obtains the text contained in the module according to these HTML tags, and calculates the length of this text to obtain the text length of the module.

Further, the web page processing unit 410 is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring and, if so, to decompose that module further.

In the embodiment of the invention, the marking processing unit 420 is also configured to calculate the text length within each link for all links together when the valid and invalid links are marked; alternatively, the content extraction unit 430 calculates the text length within each link contained in a module separately for each module when its composite score is determined.

Further, in the embodiment of the invention, the marking processing unit 420 is specifically configured to extract the link addresses of each module by means of anchor tags.

In summary, the device of the invention uses the ratio of the sum of the plain text length and the valid link text length to the invalid link text length. It can therefore extract the content of an HTML web page more accurately and remove redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification and automatic summarization.

A person skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover them as well.

Claims

1. A text content extraction method, characterized by comprising:

2. The method of claim 1, characterized in that Table tags or Div tags are used to decompose the input HTML web page into a plurality of modules.

3. The method of claim 2, characterized in that, if a module obtained by decomposition can be decomposed further without tag mixing occurring, that module is decomposed further.

4. The method of claim 1, characterized in that the text length within each link is calculated for all links together when the valid and invalid links are marked; alternatively, the text length within each link contained in a module is calculated separately for each module when its composite score is determined.

5. The method of any one of claims 1 to 4, characterized in that:

calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text contained in the module according to these HTML tags, and calculating the length of this text to obtain the text length of the module;

the link addresses of each module are extracted by means of anchor tags.

6. A text content extraction device, characterized by comprising:

7. The device of claim 6, characterized in that the web page processing unit is specifically configured to decompose the input HTML web page into a plurality of modules using Table tags or Div tags.

8. The device of claim 7, characterized in that the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring and, if so, to decompose that module further.

9. The device of claim 6, characterized in that:

the marking processing unit is also configured to calculate the text length within each link for all links together when the valid and invalid links are marked;

alternatively, the content extraction unit is also configured to calculate the text length within each link contained in a module separately for each module when its composite score is determined.

10. The device of any one of claims 6 to 9, characterized in that:

the web page processing unit is specifically configured, for each module, to extract the HTML tags of the module, obtain the text contained in the module according to these HTML tags, and calculate the length of this text to obtain the text length of the module;

the marking processing unit is specifically configured to extract the link addresses of each module by means of anchor tags.