CN103020129B

CN103020129B - Method and device for extracting text content

Info

Publication number: CN103020129B
Application number: CN201210469940.6A
Authority: CN
Inventors: 叶伟
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2012-11-20
Filing date: 2012-11-20
Publication date: 2015-11-18
Anticipated expiration: 2032-11-20
Also published as: CN103020129A; WO2013178193A2; WO2013178193A3

Abstract

The invention discloses a method and device for extracting text content. The method comprises: decomposing an input HTML web page into multiple modules, determining a position score for each module according to its position in the page layout, and calculating the text length of each module; extracting the link addresses contained in each module, finding the character string that occurs most frequently across all link addresses, marking each link address that contains this string as a valid link, and marking each link address that does not contain it as an invalid link; determining a composite score for each module according to composite score = position score × (text length + text length of valid links) / text length of invalid links, and judging each module whose composite score exceeds a set threshold to be a content module. The method effectively removes redundant non-content information from a web page and extracts the effective content of the page more accurately.

Description

Method and device for extracting text content

Technical field

The present invention relates to the field of communication technology, and in particular to a method and device for extracting text content.

Background art

With the rapid development of Internet technology, browsing web pages has gradually become the main way people obtain information, and text makes up the majority of the page content they encounter. Extracting the text of a page effectively is therefore important. If all of the text is extracted, it inevitably includes much unnecessary content, such as advertising and navigation information; this information is usually heavily repeated and is not what the user is interested in or needs. Large amounts of repeated and invalid information also reduce the accuracy of text clustering and text classification and increase the workload of content retrieval. Moreover, page typesetting and layout vary widely across web pages, so dividing a page by module or by position alone rarely yields the effective text accurately.

At present, text content is extracted by decomposing the input web page into multiple modules and calculating a composite score for each module to decide whether it is a content module. The composite score is calculated as: composite score = position score × text length / link text length. This calculation is still not accurate enough and cannot divide the content precisely. How to provide a text extraction method that extracts text content accurately has therefore become a technical problem in urgent need of a solution.

Summary of the invention

The invention provides a method and device for extracting text content, in order to solve the problem that the text content extraction methods of the prior art cannot extract text content accurately.

To solve the above problem, the invention adopts the following technical solution:

In one aspect, the invention provides a method for extracting text content, comprising:

decomposing an input HyperText Markup Language (HTML) web page into multiple modules, determining the position score of each module according to its position in the page layout, and calculating the text length of each module;

extracting the link addresses contained in each module, finding the character string that, apart from protocol characters, occurs most frequently across all link addresses, marking each link address that contains this most frequent string as a valid link, and marking each link address that does not contain it as an invalid link;

determining the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and judging each module whose composite score exceeds a set threshold to be a content module.
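The scoring rule can be written compactly as follows. The symbols are introduced here only for illustration and do not appear in the original text: for module i, P_i is the position score, T_i the text length, V_i the text length of its valid links, I_i the text length of its invalid links, and θ the set threshold. The numbers in the worked line are made up.

```latex
% Composite score of module i and the content-module test
S_i = P_i \times \frac{T_i + V_i}{I_i},
\qquad
\text{module } i \text{ is a content module} \iff S_i > \theta

% Worked example with illustrative values
% P_i = 1.0,\; T_i = 400,\; V_i = 20,\; I_i = 10
S_i = 1.0 \times \frac{400 + 20}{10} = 42
```

Compared with the prior-art formula quoted in the background section, the text of valid links moves into the numerator and only the text of invalid links remains in the denominator, so a module is no longer penalized for linking to related content.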

Further, in the method of the invention, Table tags or Div tags are used to decompose the input HTML web page into multiple modules.

Further, in the method of the invention, if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the module is decomposed further.

Further, in the method of the invention, the text length of each link is calculated uniformly when the valid and invalid links are marked; alternatively, the text length of each link contained in a module is calculated separately for each module when the composite score of the module is determined.

Further, in the method of the invention, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text information contained in the module according to those HTML tags, and calculating the length of that text information to obtain the text length of the module.

Further, in the method of the invention, the link addresses of each module are extracted by means of anchor tags.

In another aspect, the invention also provides a text content extraction device, comprising:

a web page processing unit, configured to decompose an input HyperText Markup Language (HTML) web page into multiple modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;

a mark processing unit, configured to extract the link addresses contained in each module, find the character string that, apart from protocol characters, occurs most frequently across all link addresses, mark each link address that contains this most frequent string as a valid link, and mark each link address that does not contain it as an invalid link;

a content extraction unit, configured to determine the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and to judge each module whose composite score exceeds a set threshold to be a content module.

Further, in the device of the invention, the web page processing unit is specifically configured to use Table tags or Div tags to decompose the input HTML web page into multiple modules.

Further, in the device of the invention, the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring, and if so, to decompose the module further.

Further, in the device of the invention, the mark processing unit is also configured to calculate the text length of each link uniformly when marking the valid and invalid links; alternatively, the content extraction unit is also configured to calculate, when determining the composite score of each module, the text length of each link contained in that module.

Further, in the device of the invention, the web page processing unit is specifically configured to, for each module, extract the HTML tags of the module, obtain the text information contained in the module according to those HTML tags, and calculate the length of that text information to obtain the text length of the module.

Further, in the device of the invention, the mark processing unit is specifically configured to extract the link addresses of each module by means of anchor tags.

Compared with the prior art, the invention has the following beneficial effects:

The method and device of the invention use the ratio of the sum of the plain-text length and the valid-link text length to the invalid-link text length. They can therefore extract the content of an HTML web page more accurately and eliminate redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification, and automatic summarization.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a method for extracting text content provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of a page layout in an embodiment of the invention;

Fig. 3 is a detailed flow chart of the method for extracting text content provided by an embodiment of the invention;

Fig. 4 is a structural block diagram of a text content extraction device provided by an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

To solve the problem that the text content extraction methods of the prior art cannot extract text content accurately, the embodiments of the invention provide a method and device for extracting text content.

As shown in Fig. 1, a method for extracting text content provided by an embodiment of the invention specifically comprises:

Step S101, decompose the input HTML web page into multiple modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;

In this step, Table tags or Div tags are preferably used to decompose the input HTML web page into multiple modules.

Further, in this step, if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the module is decomposed further. Tag mixing is explained as follows. Mainstream page layouts currently fall into two kinds: the page structure is laid out either with <Table> tags or with <Div> tags. When page content is edited, however, the two tags may contain each other: a page laid out with <Table> may contain <Div> tags, and a page laid out with <Div> may likewise contain <Table> tags. Tag mixing can also refer to tags that control structure (such as <Table> and <h1>) being used together with tags that control presentation (such as <font> and <b>), which makes correction and data division difficult. Because the invention needs to divide the page into modules, tag mixing here mainly refers to the mixed use of <Table> and <Div> tags.
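The decomposition rule can be sketched with Python's standard `html.parser`. This is only one possible reading, not the patent's implementation: the names `Node`, `LayoutTreeBuilder`, and `decompose` are hypothetical, and "tag mixing" is assumed to mean that a Div module directly nests a Table module or vice versa, in which case the outer module is kept whole.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"div", "table"}  # only these tags delimit modules


class Node:
    """One layout element (div/table) with its nested layout elements."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []  # nested div/table nodes
        self.text = []      # text found directly inside this node


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree of div/table elements; all other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            node = Node(tag)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open layout tag.
        if tag in LAYOUT_TAGS and len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text.append(data)


def decompose(node):
    """Return the modules under `node`, splitting until mixing or a leaf."""
    if not node.children:
        return [node]
    # Assumption: mixing = a layout node whose nested layout tag differs.
    mixed = node.tag in LAYOUT_TAGS and any(c.tag != node.tag for c in node.children)
    if mixed:
        return [node]
    modules = []
    for child in node.children:
        modules.extend(decompose(child))
    return modules


builder = LayoutTreeBuilder()
builder.feed("<div><div>A</div><div>B</div></div>"
             "<div><table><tr><td>C</td></tr></table></div>")
modules = decompose(builder.root)
```

In this example the first outer `<div>` is split into its two inner `<div>` modules, while the second outer `<div>` is kept as a single module because it nests a `<table>`. Text sitting directly inside a node that gets split is dropped by this sketch; a fuller version would have to decide where such text belongs.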

Further, in this step, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text information contained in the module according to those HTML tags, and calculating the length of that text information to obtain the text length of the module.
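A minimal sketch of the text-length calculation, assuming the module is available as an HTML string. The names `TextCollector` and `text_length` are hypothetical, and two choices are assumptions not stated in the source: script and style content is skipped, and whitespace is not counted.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the character data of a module, ignoring the tags themselves."""

    SKIPPED = {"script", "style"}  # assumption: code and CSS are not page text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def text_length(module_html):
    """Length of the module's text after tags and whitespace are removed."""
    collector = TextCollector()
    collector.feed(module_html)
    collector.close()
    text = "".join("".join(collector.parts).split())
    return len(text)


length = text_length("<div><p>Hello <b>world</b></p><script>var x=1;</script></div>")
```

Whether whitespace should count toward the text length is a design choice; dropping it keeps indentation in the page source from inflating the score of sparsely filled modules.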

Step S102, extract the link addresses contained in each module, find the character string that, apart from protocol characters, occurs most frequently across all link addresses, mark each link address that contains this most frequent string as a valid link, and mark each link address that does not contain it as an invalid link;

In this step, the link addresses of each module are preferably extracted by means of anchor tags.
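Anchor-tag extraction might look like the sketch below, again using the standard `html.parser`. The class name `AnchorCollector` is hypothetical. It also records the visible text of each anchor, on the assumption that the "text length of a link" used later refers to this anchor text.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from the <a> tags of a module."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchors without an href are not links
                self._href = href
                self._text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


collector = AnchorCollector()
collector.feed(
    '<div><a href="http://news.example.com/a/1.html">Story one</a> '
    '<a name="x">no href</a>'
    '<a href="http://ads.example.net/click">Buy <b>now</b></a></div>'
)
links = collector.links
```

The example URLs are placeholders. Nested anchors, which are invalid HTML, are not handled.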

Step S103, determine the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and judge each module whose composite score exceeds a set threshold to be a content module.

The text length of the valid links and the text length of the invalid links used in this step are preferably calculated uniformly when the valid and invalid links are marked; they can of course also be calculated separately for each module when the composite score of that module is determined.

To set out the implementation of the method more clearly, the method is further described below with reference to Figs. 2 and 3, as follows:

The method provided by the embodiment is an improved HTML-feature-based method for extracting text content, which divides text content more accurately and reasonably. In the embodiment, the page layout is divided into content modules and non-content modules, as shown in Fig. 2. A content module is the content part of the web page, whereas non-content modules are generally used to show information such as navigation, banners, copyright notices, or advertisements. The goal of the scheme is to decompose the HTML web page accurately and extract the content modules from it. Each decomposed module is given a score according to its position in the page layout: a module at the focus of the user's line of sight scores higher, and other modules score lower. If the invalid-link text length is too large relative to the module, the module is probably advertising or navigation information.

The HTML-feature-based method for extracting text content described in the embodiment specifically comprises:

Step 1, use tags to decompose the input HTML web page into multiple modules;

In this step, Table or Div tags are preferably used to decompose the input HTML web page into multiple modules. The embodiment uses these two layout tags for two reasons: on the one hand, they are the tags used for page layout; on the other hand, using them reduces the complexity of parsing the page, because other tags such as span and br are not processed, which greatly speeds up parsing and reduces the system resources it consumes.

Step 2, if a module decomposed in step 1 can be decomposed further and no mixing of Table and Div tags occurs, pass the module back to step 1 for further decomposition.

Step 3, give each input module a position score according to its position in the layout. The score for each position in the layout is set in advance. The basic principle is that the closer a module is to the position in the layout where the user's visual attention concentrates, the higher its weight and therefore its position score.
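Because the source only says that position scores are preset, a lookup table is the simplest possible sketch. Everything concrete below is hypothetical: the region names, the numeric values, and the assumption that each module has already been assigned to a layout region.

```python
# Hypothetical preset scores: regions nearer the reader's visual focus score higher.
POSITION_SCORES = {
    "center": 1.0,
    "left": 0.5,
    "right": 0.5,
    "top": 0.3,
    "bottom": 0.2,
}
DEFAULT_SCORE = 0.1  # assumption: fallback for regions not listed above


def position_score(region: str) -> float:
    """Return the preset score for the layout region a module occupies."""
    return POSITION_SCORES.get(region, DEFAULT_SCORE)
```

How a module is mapped to a region (for example from rendered coordinates or from its order in the document) is left open by the source and is not addressed here.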

Step 4, calculate the text length of each module.

Step 5, find the character string that occurs most frequently in the link addresses of all modules. Link addresses that belong to content related to this web page necessarily share part of their characters, whereas link addresses such as those of advertisements do not contain this shared part. The most frequent character string can therefore be used to distinguish valid link addresses (links to content related to the page) from invalid link addresses (links unrelated to the page content, such as advertisements).

In this step, characters common to all URLs (uniform resource locators), such as the shared protocol characters www and http, are excluded from the statistics.
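The source does not define what unit of "character content" is counted, so the sketch below picks one plausible reading: the host name of each URL with the scheme and a leading `www.` removed. The function names `site_key` and `most_common_key` and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_key(url):
    """Host name of a URL with the scheme and a leading 'www.' stripped."""
    host = urlsplit(url).hostname or ""  # relative links have no host
    return host[4:] if host.startswith("www.") else host


def most_common_key(urls):
    """Most frequent non-empty site key across all link addresses."""
    counts = Counter(key for key in map(site_key, urls) if key)
    return counts.most_common(1)[0][0] if counts else ""


urls = [
    "http://www.example.com/news/1.html",
    "http://example.com/news/2.html",
    "http://ads.example.net/click?id=9",
]
key = most_common_key(urls)
```

Other readings are possible, such as counting the longest common substring or path prefixes. Relative links carry no host and are ignored by this sketch, although on a real page they almost certainly point to the same site.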

Step 6, mark each link address that contains the character string found in step 5 as a valid link, and calculate the text length of each valid link;

Step 7, mark each link address that does not contain the character string found in step 5 as an invalid link, and calculate the text length of each invalid link.
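Steps 6 and 7 can be sketched as a single pass over a module's links. The function name `link_text_lengths` is hypothetical; links are assumed to be given as (href, anchor text) pairs, and the text length of a link is assumed to be the length of its anchor text.

```python
def link_text_lengths(links, key):
    """Split links into valid/invalid by `key` and total their text lengths.

    links: iterable of (href, anchor_text) pairs for one module
    key:   the most frequent character string found across all link addresses
    Returns (valid_length, invalid_length).
    """
    valid_length = 0
    invalid_length = 0
    for href, text in links:
        if key and key in href:
            valid_length += len(text)    # link to related content
        else:
            invalid_length += len(text)  # e.g. advertisement or external link
    return valid_length, invalid_length


links = [
    ("http://www.example.com/a", "Story one"),
    ("http://ads.example.net/c", "Buy now"),
    ("http://example.com/b", "More"),
]
valid_length, invalid_length = link_text_lengths(links, "example.com")
```

A plain substring test is used because the source speaks of link addresses "containing" the string; note that it would also match the string when it appears in a query parameter of an unrelated URL.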

Step 8, obtain the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links) / text length of the invalid links; a module whose composite score is higher than the set threshold is regarded as a content module.

Step 9, judge the composite score from step 8 against the threshold set in advance (that is, the lower limit of the composite score that a content module should reach); if the score is higher than the set threshold, the content of the module is regarded as text content that needs to be extracted.
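Steps 8 and 9 reduce to one formula and one comparison, sketched below. The function names are hypothetical and the numbers in the usage lines are made up. The source does not say what happens when a module contains no invalid-link text, so clamping the denominator to 1 is an assumption added here to avoid division by zero.

```python
def composite_score(position_score, text_length, valid_length, invalid_length):
    """position score x (text length + valid-link length) / invalid-link length."""
    # Assumption: treat an invalid-link length of 0 as 1 to avoid dividing by zero.
    return position_score * (text_length + valid_length) / max(invalid_length, 1)


def is_content_module(score, threshold):
    """A module counts as content when its score exceeds the preset threshold."""
    return score > threshold


article = composite_score(1.0, 400, 20, 10)   # long text, few unrelated links
sidebar = composite_score(0.5, 30, 0, 120)    # short text, many unrelated links
```

With an illustrative threshold of 5, the first module would be kept and the second discarded. A suitable threshold depends on how the position scores are scaled and would need to be tuned on real pages.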

Based on the above principles, a concrete example is described below, as shown in Fig. 3. After a web page is acquired, it is taken as the input. In step 1 of the flow, if the input web page can be decomposed into multiple modules using Table and Div tags, it is decomposed. Step 2 judges whether a decomposed module can be decomposed further; if so, the flow returns to step 1 to continue decomposing, otherwise it proceeds to step 3. Step 3 strips all HTML tags from the module passed on by step 2 to obtain plain text, and calculates the length of that text. Step 4 uses anchor tags to extract all links and finds the character string that occurs most frequently in the link addresses of all modules. Step 5 calculates the text length of the links that contain, and of those that do not contain, the string found in step 4, marking them as valid links and invalid links respectively. Step 6 calculates the composite score of each module using the formula: composite score = position score × (text length + valid-link text length) / invalid-link text length. Modules whose composite score is below the threshold are deleted in step 7, and modules whose composite score is above the threshold pass to step 8 for output.

In summary, the method of the embodiment uses the ratio of the sum of the plain-text length and the valid-link text length to the invalid-link text length. It can therefore extract the content of an HTML web page more accurately and eliminate redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification, and automatic summarization.

As shown in Fig. 4, an embodiment of the invention also provides a text content extraction device, which specifically comprises:

a web page processing unit 410, configured to decompose the input HTML web page into multiple modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;

a mark processing unit 420, configured to extract the link addresses contained in each module, find the character string that, apart from protocol characters, occurs most frequently across all link addresses, mark each link address that contains this most frequent string as a valid link, and mark each link address that does not contain it as an invalid link;

a content extraction unit 430, configured to determine the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and to judge each module whose composite score exceeds a set threshold to be a content module.

Based on the above framework, the specific way each unit implements its function is given below:

In the embodiment, the web page processing unit 410 specifically uses Table tags or Div tags to decompose the input HTML web page into multiple modules; and, for each module, extracts the HTML tags of the module, obtains the text information contained in the module according to those HTML tags, and calculates the length of that text information to obtain the text length of the module.

Further, the web page processing unit 410 is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring, and if so, to decompose the module further.

In the embodiment, the mark processing unit 420 is also configured to calculate the text length of each link uniformly when marking the valid and invalid links; alternatively, the content extraction unit 430 calculates, when determining the composite score of each module, the text length of each link contained in that module.

Further, in the embodiment, the mark processing unit 420 is specifically configured to extract the link addresses of each module by means of anchor tags.

In summary, the device of the invention uses the ratio of the sum of the plain-text length and the valid-link text length to the invalid-link text length. It can therefore extract the content of an HTML web page more accurately and eliminate redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification, and automatic summarization.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims

1. A method for extracting text content, characterized by comprising:

2. The method of claim 1, characterized in that Table tags or Div tags are used to decompose the input HTML web page into multiple modules.

3. The method of claim 2, characterized in that, if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the module is decomposed further.

4. The method of claim 1, characterized in that the text length of each link is calculated uniformly when the valid and invalid links are marked; or the text length of each link contained in a module is calculated separately for each module when the composite score of the module is determined.

5. The method of any one of claims 1 to 4, characterized in that:

calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text information contained in the module according to those HTML tags, and calculating the length of that text information to obtain the text length of the module;

the link addresses of each module are extracted by means of anchor tags.

6. A text content extraction device, characterized by comprising:

7. The device of claim 6, characterized in that the web page processing unit is specifically configured to use Table tags or Div tags to decompose the input HTML web page into multiple modules.

8. The device of claim 7, characterized in that the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring, and if so, to decompose the module further.

9. The device of claim 6, characterized in that:

the mark processing unit is also configured to calculate the text length of each link uniformly when marking the valid and invalid links;

or the content extraction unit is also configured to calculate, when determining the composite score of each module, the text length of each link contained in that module.

10. The device of any one of claims 6 to 9, characterized in that:

the web page processing unit is specifically configured to, for each module, extract the HTML tags of the module, obtain the text information contained in the module according to those HTML tags, and calculate the length of that text information to obtain the text length of the module;

the mark processing unit is specifically configured to extract the link addresses of each module by means of anchor tags.