Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows a flowchart of a method for extracting forum post content according to an embodiment of the present invention, comprising:
Step S10: generating an HTML tag tree from the source code of the forum post;
Step S20: merging the tag subtrees in the HTML tag tree whose text rate is greater than a first threshold to obtain a maximum alternative subtree; based on the results of repeated experiments, the first threshold is preferably set to 0.8;
Step S30: screening the maximum alternative subtree to obtain all node clusters having similar structure, which are the floor posts (the individual posts of the thread, one per floor);
Step S40: screening, from the node clusters, the node clusters whose text rate is greater than a second threshold; based on the results of repeated experiments, the second threshold is preferably set to 0.9;
Step S50: extracting the text content in the screened node clusters.
The page structure of news pages is relatively simple and their body text is fairly concentrated, whereas forum content pages have a more complex structural hierarchy: the body content is divided into a main post and follow-up posts, and its layout is more dispersed. For a given content page, the body text, advertisements, copyright notices and navigation information each appear visually concentrated in their own regions, and these regions can be separated; within the body-text region, the proportion of plain text exceeds the proportion of hyperlinks. A forum content page is generated automatically on the server side by a CGI module, which produces the HTML tags from a single template, so the floor posts in a forum content page have similar structures.
In step S10 of this embodiment, an HTML tag tree is generated from the source code of the forum post, so that the content of the forum post becomes structured and can conveniently be processed by various algorithms. Step S20 extracts the HTML tag subtree that contains the body text, which means that the advertisements in the web page are removed. Step S30 screens out all node clusters having similar structure, which are the floor posts, and step S40 then determines which floor posts contain body content.
In this embodiment, step S20 extracts the body-text node cluster from the web page, and step S30 extracts each reply-post node cluster from the selected node cluster. This embodiment thus automates the extraction of forum data and saves manual maintenance costs.
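Steps S40 and S50 described above can be illustrated with a minimal Python sketch. It assumes that each node cluster is represented by the root node of one floor post and that its text rate has already been computed and stored on that node; the `Node` class, the `text_ratio` field and the function names are hypothetical and serve only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    text: str = ""                 # the node's own plain text, if any
    text_ratio: float = 0.0        # assumed to be precomputed for the subtree
    children: List["Node"] = field(default_factory=list)


def collect_text(node: Node, out: List[str]) -> None:
    """Append the plain text of node and its descendants in document order."""
    if node.text:
        out.append(node.text)
    for child in node.children:
        collect_text(child, out)


def extract_posts(node_clusters: List[Node], second_threshold: float = 0.9) -> List[str]:
    """S40: keep clusters whose text rate exceeds the second threshold.
    S50: extract the text content of each cluster that was kept."""
    posts = []
    for cluster_root in node_clusters:
        if cluster_root.text_ratio > second_threshold:
            parts: List[str] = []
            collect_text(cluster_root, parts)
            posts.append(" ".join(parts))
    return posts
```

Joining the text fragments with a single space is a choice made for this sketch; the source does not specify how the extracted text is assembled.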
Preferably, step S10 comprises: obtaining the source code of the forum post; and generating the HTML tag tree according to the order of the HTML text in the source code. This preferred embodiment exploits the structural features of the HTML language, so the tag tree can be built easily from the source code.
Preferably, step S10 further comprises: deleting noise from the HTML tag tree, such as head nodes, comment nodes, script nodes, input nodes, form nodes, select nodes, textarea nodes, style nodes and font nodes. These nodes all describe information such as page layout or page scripts and have no direct relation to the body content; by deleting this noise, this preferred embodiment simplifies later processing.
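As an illustration of step S10 with noise removal, the following Python sketch builds a simple tag tree using the standard library's `html.parser.HTMLParser`. It is a sketch under stated assumptions, not the implementation of the embodiment: the `Node` class, the `#root` and `#text` pseudo-tags and the function names are hypothetical, and a noise node is dropped together with all of its descendants, which is only one possible reading of "deleting" a node. Comments are discarded because the parser's default comment handler does nothing.

```python
from html.parser import HTMLParser

# Noise tags named in the description (comments are dropped by the parser itself).
NOISE_TAGS = {"head", "script", "input", "form", "select", "textarea", "style", "font"}
# HTML void elements never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]
        self.skip_tag = None      # name of the noise tag currently being skipped
        self.skip_depth = 0       # nesting depth of that same tag

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in NOISE_TAGS:
            if tag not in VOID_TAGS:
                self.skip_tag, self.skip_depth = tag, 1
            return
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_tag and data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def build_tag_tree(source):
    builder = TagTreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def all_tags(node):
    """Pre-order list of tag names, useful for inspecting the resulting tree."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(all_tags(child))
    return tags
```

Calling `build_tag_tree` on the page source returns the `#root` node, and `all_tags` gives a quick view of which tags survived the filtering.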
Preferably, the text rate of a tag subtree is determined as

$$\mathrm{TextRatio} = W_1 \cdot \frac{\mathrm{TextCount} - \mathrm{LinkCount}}{\mathrm{TextCount}} + W_2 \cdot \frac{\mathrm{TextCount}}{\mathrm{PageTextCount}}$$

where $0 < W_2 \ll W_1 < 1$ and $W_2 + W_1 = 1$; $W_1$ and $W_2$ are preset weights, TextCount denotes the plain-text count of the tag subtree, LinkCount denotes the link count of the tag subtree, and PageTextCount denotes the plain-text count of the forum post.
TextCount may be determined as TextCount = SUM(TextCount_i), where SUM(TextCount_i) denotes the sum of the plain-text counts TextCount_i of all nodes i in the tag subtree;
LinkCount may be determined as LinkCount = SUM(LinkCount_i), where SUM(LinkCount_i) denotes the sum of the link counts LinkCount_i of all nodes i in the tag subtree;
PageTextCount may be determined as PageTextCount = SUM(TextCount_i), where SUM(TextCount_i) denotes the sum of the plain-text counts TextCount_i of all nodes i in the forum post;
Here, when node i is a terminal node: if it is a plain-text node, TextCount_i is set to the plain-text word count of node i and LinkCount_i = 0;
If it is a link node, TextCount_i = LinkCount_i = 1;
If it is any other node, TextCount_i = LinkCount_i = 0;
When node i is a non-terminal node, TextCount_i is set to the plain-text word count of node i itself plus the sum of the TextCount of all its child nodes, and LinkCount_i is set to the link count of node i itself plus the sum of the LinkCount of all its child nodes.
The above computation is relatively simple and is easy to implement in a computer program.
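The counting rules and the text rate formula above can be sketched in Python as follows. Several points are assumptions made for illustration rather than details given in the description: the `Node` class and its `#text` pseudo-tag are hypothetical, the word count is taken as the character count of the text, the weights 0.9 and 0.1 are example values that merely satisfy the stated constraints, the TextCount of a subtree is read as the recursive total at its root, and a subtree without any text is given a text rate of 0 to avoid division by zero.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    tag: str                      # "#text" marks a plain-text node, "a" a link node
    text: str = ""                # the node's own plain text, if any
    children: List["Node"] = field(default_factory=list)


def counts(node: Node) -> Tuple[int, int]:
    """Return (TextCount_i, LinkCount_i) for the subtree rooted at node."""
    if not node.children:                      # terminal node
        if node.tag == "#text":
            return len(node.text), 0           # plain-text node
        if node.tag == "a":
            return 1, 1                        # link node
        return 0, 0                            # any other terminal node
    # Non-terminal node: own counts plus the counts of all child nodes.
    text_count = len(node.text)
    link_count = 1 if node.tag == "a" else 0
    for child in node.children:
        child_text, child_links = counts(child)
        text_count += child_text
        link_count += child_links
    return text_count, link_count


def text_ratio(node: Node, page_text_count: int,
               w1: float = 0.9, w2: float = 0.1) -> float:
    """TextRatio = W1*(TextCount-LinkCount)/TextCount + W2*TextCount/PageTextCount."""
    text_count, link_count = counts(node)
    if text_count == 0 or page_text_count == 0:
        return 0.0                             # assumed convention for empty subtrees
    return (w1 * (text_count - link_count) / text_count
            + w2 * text_count / page_text_count)
```

Here `page_text_count` would be obtained by calling `counts` on the root of the whole page, so that the second term measures how much of the page's text lies inside the subtree.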
Fig. 2 shows a flowchart of creating the maximum alternative subtree according to an embodiment of the present invention, comprising:
Traversing the HTML tag tree in post-order and calculating the text rate of each tag subtree; a subtree whose text rate is greater than the first threshold is added to the candidate subtrees, its sibling nodes are not traversed further, and the traversal proceeds directly to the parent node of the candidate subtree;
When there is more than one candidate subtree, merging two candidate subtrees according to the following two cases: if candidate subtree A is a subtree of candidate subtree B, candidate subtree A is removed from the candidate subtrees, candidate subtree B is retained, and the traversal continues; if candidate subtrees A and B belong to different subtrees, their common ancestor node is found, the subtree rooted at the common ancestor node becomes the candidate subtree, the other two subtrees are removed, and the traversal continues; the candidate subtree obtained when the traversal is complete is the maximum alternative subtree;
Traversing the maximum alternative subtree to prune it: all leaf nodes whose text rate is not greater than the first threshold are removed, producing a new maximum alternative subtree.
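The merge and pruning steps above can be sketched in Python as follows. This is a simplified illustration, not the exact traversal of Fig. 2: it visits every node in post-order instead of skipping the siblings of a candidate, and it folds the candidates together with a common-ancestor lookup, which covers both merge cases (when A lies inside B, their common ancestor is B itself). The `Node` class, its precomputed `ratio` field and the function names are hypothetical.

```python
class Node:
    def __init__(self, name, ratio=0.0, children=None):
        self.name = name
        self.ratio = ratio            # text rate of the subtree, assumed precomputed
        self.parent = None
        self.children = children or []
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Path from node up to the root, including node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def common_ancestor(a, b):
    """Lowest node whose subtree contains both a and b."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    raise ValueError("nodes are not in the same tree")


def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node


def maximum_alternative_subtree(root, first_threshold=0.8):
    """Merge all subtrees whose text rate exceeds the threshold into one subtree."""
    candidate = None
    for node in postorder(root):
        if node.ratio > first_threshold:
            candidate = node if candidate is None else common_ancestor(candidate, node)
    return candidate


def prune(node, first_threshold=0.8):
    """Remove leaf nodes whose text rate is not greater than the threshold."""
    node.children = [c for c in node.children
                     if c.children or c.ratio > first_threshold]
    for child in node.children:
        prune(child, first_threshold)
```

`maximum_alternative_subtree` returns `None` when no subtree exceeds the threshold; the description does not say how that case is handled, so callers of this sketch would need to decide.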
Fig. 3 shows a flowchart of screening node clusters according to an embodiment of the present invention, comprising:
Traversing the maximum alternative subtree breadth-first and calculating the tag difference degree between sibling nodes; two sibling nodes whose tag difference degree is less than a third threshold are considered to have similar structure, where a different third threshold is set for nodes at different levels, the deeper the level the larger the third threshold, and nodes having similar structure are clustered into one subset; based on the results of repeated experiments, the third threshold is preferably set to 0.1;
Repeating the traversal step on each clustered subset in turn, until the number of elements in each clustered subset is 1;
Selecting the set that contains the largest number of subsets, and taking each subset in that set as a floor post.
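The screening of node clusters can be illustrated with the Python sketch below. The description does not define how the tag difference degree is computed, so the measure used here (one minus the `difflib.SequenceMatcher` similarity of the two subtrees' tag sequences) is purely an assumption, as are the greedy grouping of siblings, the `depth_step` parameter used to raise the threshold with depth, and the simplification of picking the largest group of similar siblings found anywhere in the tree.

```python
from collections import deque
from difflib import SequenceMatcher


class Node:
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []


def tag_sequence(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def tag_difference(a, b):
    # Assumed measure: 0.0 for identical tag sequences, 1.0 for disjoint ones.
    return 1.0 - SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def cluster_siblings(siblings, threshold):
    """Greedily group siblings whose difference to a group's first member is small."""
    clusters = []
    for node in siblings:
        for cluster in clusters:
            if tag_difference(cluster[0], node) < threshold:
                cluster.append(node)
                break
        else:
            clusters.append([node])
    return clusters


def find_floor_posts(root, base_threshold=0.1, depth_step=0.0):
    """Breadth-first search for the largest group of structurally similar siblings."""
    best = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        threshold = base_threshold + depth_step * depth   # larger at deeper levels
        for cluster in cluster_siblings(node.children, threshold):
            if len(cluster) > len(best):
                best = cluster
        queue.extend((child, depth + 1) for child in node.children)
    return best
```

Each node in the returned list stands for one floor post. With `depth_step` left at 0.0 the same threshold of 0.1 is used at every level; a positive value gives deeper levels a larger threshold, as the description suggests.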
Fig. 4 shows a schematic diagram of a device for extracting forum post content according to an embodiment of the present invention, comprising:
A tag tree module 10, configured to generate an HTML tag tree from the source code of the forum post;
A maximum alternative subtree module 20, configured to merge the tag subtrees in the HTML tag tree whose text rate is greater than the first threshold to obtain a maximum alternative subtree;
A node cluster module 30, configured to screen the maximum alternative subtree to obtain all node clusters having similar structure;
A screening module 40, configured to screen, from the node clusters, the node clusters whose text rate is greater than the second threshold;
An extraction module 50, configured to extract the text content in the screened node clusters.
Preferably, the maximum alternative subtree module determines the text rate of a tag subtree as

$$\mathrm{TextRatio} = W_1 \cdot \frac{\mathrm{TextCount} - \mathrm{LinkCount}}{\mathrm{TextCount}} + W_2 \cdot \frac{\mathrm{TextCount}}{\mathrm{PageTextCount}}$$

where $0 < W_2 \ll W_1 < 1$ and $W_2 + W_1 = 1$; $W_1$ and $W_2$ are preset weights, TextCount denotes the plain-text count of the tag subtree, LinkCount denotes the link count of the tag subtree, and PageTextCount denotes the plain-text count of the forum post.
Preferably, the maximum alternative subtree module comprises:
A traversal module, configured to traverse the HTML tag tree in post-order and calculate the text rate of each tag subtree; a subtree whose text rate is greater than the first threshold is added to the candidate subtrees, its sibling nodes are not traversed further, and the traversal proceeds directly to the parent node of the candidate subtree;
A merging module, configured to merge two candidate subtrees when there is more than one candidate subtree, according to the following two cases:
If candidate subtree A is a subtree of candidate subtree B, candidate subtree A is removed from the candidate subtrees, candidate subtree B is retained, and the traversal continues;
If candidate subtrees A and B belong to different subtrees, their common ancestor node is found, the subtree rooted at the common ancestor node becomes the candidate subtree, candidate subtrees A and B are removed, and the traversal continues; the candidate subtree obtained when the traversal is complete is the maximum alternative subtree;
A pruning module, configured to traverse the maximum alternative subtree and prune it: all leaf nodes whose text rate is not greater than the first threshold are removed, producing a new maximum alternative subtree.
Preferably, the node cluster module comprises:
A traversal module, configured to traverse the maximum alternative subtree breadth-first and calculate the tag difference degree between sibling nodes; two sibling nodes whose tag difference degree is less than a third threshold are considered to have similar structure, where a different third threshold is set for nodes at different levels, the deeper the level the larger the third threshold, and nodes having similar structure are clustered into one subset;
A loop module, configured to repeat the traversal step on each clustered subset in turn, until the number of elements in each clustered subset is 1;
The set that contains the largest number of subsets is selected, and each subset in that set is taken as a floor post.
As can be seen from the above description, the present invention achieves automatic extraction of forum post content.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.