CN103116591B

CN103116591B - Method and apparatus for extracting forum post content

Info

Publication number: CN103116591B
Application number: CN201110366367.1A
Authority: CN
Inventors: 张涛; 于晓明; 杨建武
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Beijing Founder Electronics Co Ltd
Priority date: 2011-11-17
Filing date: 2011-11-17
Publication date: 2016-04-20
Anticipated expiration: 2031-11-17
Also published as: CN103116591A

Abstract

The invention provides a method for extracting forum post content, comprising: generating an HTML tag tree from the source code of a forum post; merging the tag subtrees of the HTML tag tree whose text ratio is greater than a first threshold to obtain a maximum candidate subtree; screening the maximum candidate subtree to obtain all node clusters having a similar structure; screening, from those node clusters, the node clusters whose text ratio is greater than a second threshold; and extracting the text content of the screened node clusters. The invention also provides an apparatus for extracting forum post content. The invention achieves automatic extraction of forum post content.

Description

Method and apparatus for extracting forum post content

Technical field

The present invention relates to the technical field of Internet information, and in particular to a method and apparatus for extracting forum post content.

Background art

With the spread of Internet applications, online forums have flourished: the number of forum users grows daily and the volume of data is growing explosively. Forums play an important role in the spread of public opinion, so applications such as the retrieval and mining of forum data are becoming increasingly important. Correct extraction of web page data is the basis of all such forum applications.

There are currently two methods for extracting data from web pages. In the first, templates are configured manually and the data is matched with regular expressions. In the second, a template is extracted automatically from sample pages and then used to match the data. The first method consumes a great deal of manpower and demands considerable knowledge from the user. The second method must learn the template offline, and the template can be used only after its quality has been tested. Because both methods need a template to be prepared before extraction can begin, they stop working as soon as the page structure is revised, and a large number of professionals are needed for maintenance, so the cost is high.

Current research on content page extraction concentrates mainly on news and news-like web pages, but the visual layout of forum content pages differs from that of news pages. The structure of a news page is relatively simple and its body text is concentrated, whereas a forum content page has a more complex structural hierarchy, its body content is divided into a main post and replies, and its layout is more dispersed. Extraction methods designed for news-like pages are therefore unsuitable for forum pages.

Summary of the invention

The present invention aims to provide a method and apparatus for extracting forum post content, so as to solve the problem of extracting forum post content.

In an embodiment of the present invention, a method for extracting forum post content is provided, comprising: generating an HTML tag tree from the source code of a forum post; merging the tag subtrees of the HTML tag tree whose text ratio is greater than a first threshold to obtain a maximum candidate subtree; screening the maximum candidate subtree to obtain all node clusters having a similar structure; screening, from those node clusters, the node clusters whose text ratio is greater than a second threshold; and extracting the text content of the screened node clusters.

In an embodiment of the present invention, an apparatus for extracting forum post content is provided, comprising: a tag tree module, configured to generate an HTML tag tree from the source code of a forum post; a maximum candidate subtree module, configured to merge the tag subtrees of the HTML tag tree whose text ratio is greater than a first threshold to obtain a maximum candidate subtree; a node cluster module, configured to screen the maximum candidate subtree to obtain all node clusters having a similar structure; a screening module, configured to screen, from those node clusters, the node clusters whose text ratio is greater than a second threshold; and an extraction module, configured to extract the text content of the screened node clusters.

The method and apparatus of the above embodiments use a tree structure to model forum posts, and thereby achieve automatic extraction of forum post content.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows a flowchart of the method for extracting forum post content according to an embodiment of the present invention;

Fig. 2 shows a flowchart of creating the maximum candidate subtree according to an embodiment of the present invention;

Fig. 3 shows a flowchart of screening node clusters according to an embodiment of the present invention;

Fig. 4 shows a schematic diagram of the apparatus for extracting forum post content according to an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

Fig. 1 shows a flowchart of the method for extracting forum post content according to an embodiment of the present invention, comprising:

Step S10: generate an HTML tag tree from the source code of the forum post;

Step S20: merge the tag subtrees of the HTML tag tree whose text ratio is greater than a first threshold to obtain a maximum candidate subtree; based on the results of repeated experiments, the first threshold is preferably set to 0.8;

Step S30: screen the maximum candidate subtree to obtain all node clusters having a similar structure, which are the individual floor posts (the posts occupying each floor of the thread);

Step S40: screen, from the node clusters, the node clusters whose text ratio is greater than a second threshold; based on the results of repeated experiments, the second threshold is preferably set to 0.9;

Step S50: extract the text content of the screened node clusters.

The structure of a news page is relatively simple and its body text is concentrated, whereas a forum content page has a more complex structural hierarchy, its body content is divided into a main post and replies, and its layout is more dispersed. On a content page, the body text, advertisements, copyright notice and navigation information each occupy their own visually distinct region, so the page can be divided into regions, and within the body text region the proportion of plain text exceeds the proportion of hyperlinks. The HTML tags of a forum content page are generated automatically on the server side by a CGI module from a single template, so the floor posts on a forum content page have similar structures.

In step S10 of this embodiment, an HTML tag tree is generated from the source code of the forum post, which gives the post content a structured form that is convenient for algorithmic processing. Step S20 extracts the HTML tag subtree that contains the body text, which means in other words that the advertisements on the page have been removed. Step S30 screens out all node clusters having a similar structure, namely the individual floor posts, and step S40 then determines which floor posts carry body content.

In this embodiment, step S20 extracts the body text node cluster of the web page, and step S30 extracts the node cluster of each reply within the selected node cluster. The embodiment automates the extraction of forum data and saves the cost of manual maintenance.

Preferably, step S10 comprises: obtaining the source code of the forum post; and generating the HTML tag tree according to the order of the HTML text corresponding to the source code. This preferred embodiment exploits the syntactic structure of HTML, so the tag tree can easily be built from the source code.
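By way of illustration only, step S10 might be sketched in Python as follows. The sketch relies on the standard-library html.parser module; the names Node, TagTreeBuilder, build_tag_tree and VOID_TAGS are hypothetical and do not come from the invention, and representing text as child nodes tagged "#text" is an assumption of the sketch.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the tag tree; text nodes use the tag '#text' (assumption)."""

    def __init__(self, tag, text="", parent=None):
        self.tag = tag
        self.text = text
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TagTreeBuilder(HTMLParser):
    """Builds the tree in document order while the parser reads the source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.stack[-1])
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            Node("#text", text=data.strip(), parent=self.stack[-1])


def build_tag_tree(html_source):
    builder = TagTreeBuilder()
    builder.feed(html_source)
    builder.close()
    return builder.root
```

The explicit stack lets the sketch tolerate unclosed tags, which are common on real forum pages; a production implementation would more likely use a lenient HTML parsing library.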

Preferably, step S10 further comprises: deleting noise from the HTML tag tree, namely head nodes, comment nodes, script nodes, input nodes, form nodes, select nodes, textarea nodes, style nodes and font nodes. These nodes all describe page layout, page scripts and similar information and have no direct relation to the body content. By deleting this noise, this preferred embodiment simplifies later processing.
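The noise deletion could be sketched as below. The Node class, the "#comment" tag name and the function remove_noise are hypothetical. The sketch also assumes that deleting a node removes its entire subtree; the description does not say whether wrapper nodes such as form or font should instead be unwrapped so that their children survive.

```python
from dataclasses import dataclass, field

# Tag names treated as noise, following the list in the description.
NOISE_TAGS = {"head", "#comment", "script", "input", "form", "select",
              "textarea", "style", "font"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def remove_noise(node):
    """Drop every noise node, together with its whole subtree, in place."""
    node.children = [c for c in node.children if c.tag not in NOISE_TAGS]
    for child in node.children:
        remove_noise(child)
    return node
```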

Preferably, the text ratio of a tag subtree is determined as TextRatio = W₁ × (TextCount − LinkCount) / TextCount + W₂ × TextCount / PageTextCount, where 0 < W₂ << W₁ < 1 and W₁ + W₂ = 1; W₁ and W₂ are preset weights, TextCount is the plain text count of the tag subtree, LinkCount is the link count of the tag subtree, and PageTextCount is the plain text count of the forum post.

TextCount can be determined as TextCount = SUM(TextCount_i), where SUM(TextCount_i) denotes the sum of the plain text counts TextCount_i over all nodes i in the tag subtree;

LinkCount can be determined as LinkCount = SUM(LinkCount_i), where SUM(LinkCount_i) denotes the sum of the link counts LinkCount_i over all nodes i in the tag subtree;

PageTextCount can be determined as PageTextCount = SUM(TextCount_i), where SUM(TextCount_i) denotes the sum of the plain text counts TextCount_i over all nodes i in the forum post;

Here, when node i is a terminal node: if it is a plain text node, TextCount_i is set to the number of plain text characters in node i and LinkCount_i = 0;

if it is a link node, TextCount_i = LinkCount_i = 1;

if it is any other node, TextCount_i = LinkCount_i = 0.

When node i is a non-terminal node, TextCount_i is set to the number of plain text characters of node i itself plus the sum of TextCount over all its child nodes, and LinkCount_i is set to the number of links of node i itself plus the sum of LinkCount over all its child nodes.

The above calculation is fairly simple and is easy to implement in a computer program.
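A minimal sketch of the TextRatio calculation is given below, under several assumptions that are not stated in the description: text is held in child nodes tagged "#text", an a element is treated as a terminal link node regardless of its children, the weights default to W₁ = 0.9 and W₂ = 0.1 (the description only requires 0 < W₂ << W₁ < 1 and W₁ + W₂ = 1), and a subtree with no text is given a ratio of 0 to avoid division by zero. The names Node, counts and text_ratio are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def counts(node):
    """Return (TextCount, LinkCount) for the subtree rooted at node."""
    if node.tag == "a":        # link node: one unit of text and one link
        return 1, 1
    if node.tag == "#text":    # plain text node: number of characters
        return len(node.text), 0
    text_count = link_count = 0
    for child in node.children:  # non-terminal node: sum over the children
        t, l = counts(child)
        text_count += t
        link_count += l
    return text_count, link_count


def text_ratio(node, page_text_count, w1=0.9, w2=0.1):
    """TextRatio = W1*(TextCount-LinkCount)/TextCount + W2*TextCount/PageTextCount."""
    text_count, link_count = counts(node)
    if text_count == 0 or page_text_count == 0:
        return 0.0             # assumption: an empty subtree scores zero
    return (w1 * (text_count - link_count) / text_count
            + w2 * text_count / page_text_count)
```

With these assumed weights, a block holding 18 characters of text and one link scores above the preferred first threshold of 0.8, while a navigation block consisting only of links scores close to zero, which is the separation the formula is meant to achieve.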

Fig. 2 shows a flowchart of creating the maximum candidate subtree according to an embodiment of the present invention, comprising:

traversing the HTML tag tree in post-order and calculating the text ratio of each tag subtree; a subtree whose text ratio is greater than the first threshold is added to the candidate subtrees, its sibling nodes are not traversed further, and traversal proceeds directly to the parent node of the candidate subtree;

when there is more than one candidate subtree, merging two candidate subtrees according to one of the following two cases: if candidate subtree A is a subtree of candidate subtree B, A is removed from the candidate subtrees, B is retained, and the traversal step continues; if candidate subtrees A and B lie in different subtrees, their common ancestor node is found, the subtree rooted at that common ancestor node becomes the candidate subtree, the other two subtrees are removed, and the traversal step continues. The candidate subtree obtained when traversal is complete is the maximum candidate subtree;

traversing the maximum candidate subtree to prune it: all leaf nodes whose text ratio is not greater than the first threshold are removed, generating a new maximum candidate subtree.
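The merge-and-prune procedure of Fig. 2 might be sketched as follows. This is a simplification under stated assumptions rather than the procedure itself: each node carries a precomputed ratio attribute, every node is visited (the shortcut of skipping siblings once a candidate is found is omitted), and only nodes that are leaves before pruning are considered for removal. The names Node, ancestors, merge, max_candidate_subtree and prune are hypothetical.

```python
class Node:
    def __init__(self, tag, ratio=0.0, children=()):
        self.tag, self.ratio, self.parent = tag, ratio, None
        self.children = list(children)
        for c in self.children:
            c.parent = self


def ancestors(node):
    """Path from node up to the root, node itself included."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def merge(a, b):
    """Return the containing candidate, or else the lowest common ancestor."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n


def max_candidate_subtree(root, threshold=0.8):
    candidate = None

    def visit(node):               # post-order: children first, then the node
        nonlocal candidate
        for child in node.children:
            visit(child)
        if node.ratio > threshold:
            candidate = node if candidate is None else merge(candidate, node)

    visit(root)
    return candidate


def prune(node, threshold=0.8):
    """Remove leaves whose text ratio does not exceed the threshold."""
    kept = []
    for child in node.children:
        if not child.children and child.ratio <= threshold:
            continue               # drop a low-ratio leaf
        kept.append(prune(child, threshold))
    node.children = kept
    return node
```

Because merge returns the first ancestor of b that is also an ancestor of a, it covers both cases of the description: when one candidate contains the other it returns the containing one, and otherwise it returns their common ancestor.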

Fig. 3 shows a flowchart of screening node clusters according to an embodiment of the present invention, comprising:

traversing the maximum candidate subtree breadth-first and calculating the tag dissimilarity between sibling nodes; two sibling nodes whose tag dissimilarity is less than a third threshold are considered to have a similar structure, where a different third threshold is set for nodes at different levels, the deeper the level the larger the third threshold; nodes having a similar structure are clustered into a subset; based on the results of repeated experiments, the third threshold is preferably set to 0.1;

repeating the traversal step on each clustered subset in turn, until the number of elements in each clustered subset is 1;

selecting the set that contains the largest number of subsets, and taking each subset in that set as one floor post.
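One possible reading of the Fig. 3 procedure is sketched below. The description does not define how tag dissimilarity is computed, so the sketch assumes a measure of 1 minus the difflib.SequenceMatcher similarity of the two subtrees' pre-order tag sequences. It also uses a single threshold for all levels and simply returns the largest group of mutually similar siblings found anywhere in the tree. The names Node, tag_sequence, dissimilarity, cluster_siblings and largest_similar_group are hypothetical.

```python
from collections import deque
from difflib import SequenceMatcher


class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)


def tag_sequence(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def dissimilarity(a, b):
    # Assumed measure: 1 - difflib similarity of the two tag sequences.
    return 1.0 - SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def cluster_siblings(siblings, threshold=0.1):
    """Greedily group siblings whose dissimilarity to a cluster's first member is small."""
    clusters = []
    for node in siblings:
        for cluster in clusters:
            if dissimilarity(cluster[0], node) < threshold:
                cluster.append(node)
                break
        else:
            clusters.append([node])
    return clusters


def largest_similar_group(root, threshold=0.1):
    best = []
    queue = deque([root])
    while queue:                   # breadth-first traversal
        node = queue.popleft()
        for cluster in cluster_siblings(node.children, threshold):
            if len(cluster) > len(best):
                best = cluster
        queue.extend(node.children)
    return best
```

On a page generated from one template, the floor posts share an identical or nearly identical tag sequence, so they fall into one cluster, while headings, navigation lists and advertisements fall into separate ones.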

Fig. 4 shows a schematic diagram of the apparatus for extracting forum post content according to an embodiment of the present invention, comprising:

a tag tree module 10, configured to generate an HTML tag tree from the source code of a forum post;

a maximum candidate subtree module 20, configured to merge the tag subtrees of the HTML tag tree whose text ratio is greater than a first threshold to obtain a maximum candidate subtree;

a node cluster module 30, configured to screen the maximum candidate subtree to obtain all node clusters having a similar structure;

a screening module 40, configured to screen, from the node clusters, the node clusters whose text ratio is greater than a second threshold;

an extraction module 50, configured to extract the text content of the screened node clusters.

Preferably, the maximum candidate subtree module determines the text ratio of a tag subtree as TextRatio = W₁ × (TextCount − LinkCount) / TextCount + W₂ × TextCount / PageTextCount, where 0 < W₂ << W₁ < 1 and W₁ + W₂ = 1; W₁ and W₂ are preset weights, TextCount is the plain text count of the tag subtree, LinkCount is the link count of the tag subtree, and PageTextCount is the plain text count of the forum post.

Preferably, the maximum candidate subtree module comprises:

a traversal module, configured to traverse the HTML tag tree in post-order and calculate the text ratio of each tag subtree; a subtree whose text ratio is greater than the first threshold is added to the candidate subtrees, its sibling nodes are not traversed further, and traversal proceeds directly to the parent node of the candidate subtree;

a merging module, configured to merge two candidate subtrees when there is more than one candidate subtree, according to one of the following two cases:

if candidate subtree A is a subtree of candidate subtree B, A is removed from the candidate subtrees, B is retained, and the traversal step continues;

if candidate subtrees A and B lie in different subtrees, their common ancestor node is found, the subtree rooted at that common ancestor node becomes the candidate subtree, candidate subtrees A and B are removed, and the traversal step continues. The candidate subtree obtained when traversal is complete is the maximum candidate subtree;

a pruning module, configured to traverse the maximum candidate subtree and prune it: all leaf nodes whose text ratio is not greater than the first threshold are removed, generating a new maximum candidate subtree.

Preferably, the node cluster module comprises:

a traversal module, configured to traverse the maximum candidate subtree breadth-first and calculate the tag dissimilarity between sibling nodes; two sibling nodes whose tag dissimilarity is less than a third threshold are considered to have a similar structure, where a different third threshold is set for nodes at different levels, the deeper the level the larger the third threshold; nodes having a similar structure are clustered into a subset;

a loop module, configured to repeat the traversal step on each clustered subset in turn, until the number of elements in each clustered subset is 1.

As can be seen from the above description, the present invention achieves automatic extraction of forum post content.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for extracting forum post content, characterized by comprising:

generating an HTML tag tree from the source code of a forum post;

merging the tag subtrees of said HTML tag tree whose text ratio is greater than a first threshold to obtain a maximum candidate subtree;

screening said maximum candidate subtree to obtain all node clusters having a similar structure;

screening, from the node clusters, the node clusters whose text ratio is greater than a second threshold;

extracting the text content of the node clusters whose text ratio is greater than the second threshold;

wherein merging the tag subtrees of said HTML tag tree whose text ratio is greater than the first threshold to obtain a maximum candidate subtree comprises:

traversing said HTML tag tree in post-order and calculating the text ratio of each tag subtree; a subtree whose text ratio is greater than said first threshold is added to the candidate subtrees, its sibling nodes are not traversed further, and traversal proceeds directly to the parent node of said candidate subtree;

when there is more than one said candidate subtree, merging two said candidate subtrees according to one of the following two cases: if candidate subtree A is a subtree of candidate subtree B, A is removed from said candidate subtrees, B is retained, and said traversal step continues; if candidate subtrees A and B lie in different subtrees, their common ancestor node is found, the subtree rooted at that common ancestor node becomes said candidate subtree, subtrees A and B are removed, and said traversal step continues; said candidate subtree obtained when traversal is complete is said maximum candidate subtree;

traversing said maximum candidate subtree to prune it: all leaf nodes whose text ratio is not greater than said first threshold are removed, generating a new said maximum candidate subtree.

2. The method according to claim 1, characterized in that generating an HTML tag tree from the source code of the forum post comprises:

obtaining the source code of said forum post;

generating said HTML tag tree according to the order of the HTML text corresponding to said source code.

3. The method according to claim 2, characterized in that generating an HTML tag tree from the source code of the forum post further comprises:

deleting the head nodes, comment nodes, script nodes, input nodes, form nodes, select nodes, textarea nodes, style nodes and font nodes in said HTML tag tree.

4. The method according to claim 1, characterized in that the text ratio of a tag subtree is determined as TextRatio = W₁ × (TextCount − LinkCount) / TextCount + W₂ × TextCount / PageTextCount, where 0 < W₂ << W₁ < 1 and W₁ + W₂ = 1; W₁ and W₂ are preset weights, TextCount is the plain text count of said tag subtree, LinkCount is the link count of said tag subtree, and PageTextCount is the plain text count of said forum post.

5. The method according to claim 4, characterized by further comprising:

determining TextCount = SUM(TextCount_i), where SUM(TextCount_i) denotes the sum of the plain text counts TextCount_i over all nodes i in said tag subtree;

determining LinkCount = SUM(LinkCount_i), where SUM(LinkCount_i) denotes the sum of the link counts LinkCount_i over all nodes i in said tag subtree;

determining PageTextCount = SUM(TextCount_i), where SUM(TextCount_i) denotes the sum of the plain text counts TextCount_i over all nodes i in said forum post;

wherein, when node i is a terminal node: if it is a plain text node, TextCount_i is set to the number of plain text characters in node i and LinkCount_i = 0; if it is a link node, TextCount_i = LinkCount_i = 1; if it is any other node, TextCount_i = LinkCount_i = 0;

6. The method according to claim 1, characterized in that screening said maximum candidate subtree to obtain all node clusters having a similar structure comprises:

traversing said maximum candidate subtree breadth-first and calculating the dissimilarity between sibling nodes; two sibling nodes whose dissimilarity is less than a third threshold are considered to have a similar structure, where a different third threshold is set for nodes at different levels, the deeper the level the larger said third threshold; nodes having a similar structure are clustered into a subset;

repeating said traversal step on each clustered subset in turn, until the number of elements in each clustered subset is 1;

7. An apparatus for extracting forum post content, characterized by comprising:

a tag tree module, configured to generate an HTML tag tree from the source code of a forum post;

a maximum candidate subtree module, configured to merge the tag subtrees of said HTML tag tree whose text ratio is greater than a first threshold to obtain a maximum candidate subtree;

a node cluster module, configured to screen said maximum candidate subtree to obtain all node clusters having a similar structure;

a screening module, configured to screen, from said node clusters, the node clusters whose text ratio is greater than a second threshold;

an extraction module, configured to extract the text content of said screened node clusters;

wherein said maximum candidate subtree module comprises:

a traversal module, configured to traverse said HTML tag tree in post-order and calculate the text ratio of each tag subtree; a subtree whose text ratio is greater than said first threshold is added to the candidate subtrees, its sibling nodes are not traversed further, and traversal proceeds directly to the parent node of said candidate subtree;

a merging module, configured to merge two said candidate subtrees when there is more than one said candidate subtree, according to one of the following two cases: if candidate subtree A is a subtree of candidate subtree B, A is removed from said candidate subtrees, B is retained, and said traversal step continues; if candidate subtrees A and B lie in different subtrees, their common ancestor node is found, the subtree rooted at that common ancestor node becomes said candidate subtree, candidate subtrees A and B are removed, and said traversal step continues; said candidate subtree obtained when traversal is complete is said maximum candidate subtree;

a pruning module, configured to traverse said maximum candidate subtree and prune it: all leaf nodes whose text ratio is not greater than said first threshold are removed, generating a new said maximum candidate subtree.

8. The apparatus according to claim 7, characterized in that said maximum candidate subtree module determines the text ratio of a tag subtree as TextRatio = W₁ × (TextCount − LinkCount) / TextCount + W₂ × TextCount / PageTextCount, where 0 < W₂ << W₁ < 1 and W₁ + W₂ = 1; W₁ and W₂ are preset weights, TextCount is the plain text count of said tag subtree, LinkCount is the link count of said tag subtree, and PageTextCount is the plain text count of said forum post.

9. The apparatus according to claim 7, characterized in that said node cluster module comprises:

a traversal module, configured to traverse said maximum candidate subtree breadth-first and calculate the tag dissimilarity between sibling nodes; two sibling nodes whose tag dissimilarity is less than a third threshold are considered to have a similar structure, where a different third threshold is set for nodes at different levels, the deeper the level the larger said third threshold; nodes having a similar structure are clustered into a subset;

a loop module, configured to repeat said traversal step on each clustered subset in turn, until the number of elements in each clustered subset is 1;