Summary of the Invention
The technical problem to be solved by the invention is to provide a method for counting the effective content of a document, overcoming the defect of the prior art that invalid content is included in the statistics.
To solve the above technical problem, the invention provides a method for counting the effective content of a document, comprising the following steps:
Step I, document preprocessing: filter out the noise components in the document;
Step II: calculate the similarity of the data and classify the data according to the similarity value;
Step III: calculate the effective part count of the data set in each class;
Step IV: obtain the final effective part count by accumulating the effective part counts of all classes.
Filtering out the noise components in the document means removing, from each data record, the HTML tags, URL link addresses, punctuation marks and spaces that are unrelated to the document content.
Preferably, step II comprises the following steps:
I. First load all data into a set G, then sort set G in descending order of text length, so that the longest record comes first and the shortest comes last.
II. Take one data record D out of set G, save it in a classification set L1, and delete D from set G.
III. Calculate, in turn, the similarity between D and each other record GD in set G. When the similarity is greater than or equal to a preset text similarity threshold, also store GD in set L1, save the minimum edit count S1 for D->GD, and delete GD from set G.
IV. Repeat steps II and III to form classification sets L2, ..., Ln.
Preferably, calculating the similarity of the data comprises the following steps:
calculating the minimum edit count between two data records with the edit distance algorithm;
calculating the similarity of the two data records from that edit count.
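The text does not state how the edit count is converted into a similarity value. The following minimal sketch assumes one common normalization, similarity = 1 - edit count / length of the longer record; both the formula and the name similarity_from_edits are illustrative assumptions, not taken from the invention.

```python
def similarity_from_edits(edit_count: int, len_a: int, len_b: int) -> float:
    """Map a minimum edit count to a similarity score in [0, 1].

    Assumption (not stated in the source): the edit count is normalized
    by the length of the longer of the two records.
    """
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0  # two empty records are treated as identical
    return 1.0 - edit_count / longest
```

Under this assumption, two records of lengths 6 and 4 with a minimum edit count of 2 have a similarity of about 0.67, which would then be compared against the preset text similarity threshold.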
Calculating the effective part count of the data set in each class comprises the following steps:
3.1. Traverse set L1 in a loop. Using the first data record LD1 as the reference object, take out the second data object LD2 and the minimum edit count S1 stored in the LD2 object, and calculate the effective content count L1A1 of these two records: L1A1 = S1 + (text length of the LD2 object);
3.2. In the manner of 3.1, take out the third record LD3 through LDn in turn to obtain L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1:
L1A = (L1A1 + L1A2 + ... + L1A(n-1)) - (text length of the LD1 object) * (length of set L1 - 1);
3.3. Repeat steps 3.1 to 3.2 to calculate, in turn, the effective content counts L2A, ..., LnA corresponding to classification sets L2, ..., Ln;
3.4. The effective content count WA of the unmatchable set W is the sum of the text lengths of all objects in that set.
The invention automatically filters out the noise components in a document, then calculates the similarity between data records and classifies the data according to the similarity value, then counts the effective part of the data set in each class in turn, and finally sums the data of all classes to obtain the overall effective part count. The invention automatically avoids counting duplicate content more than once, so the effective part is counted with high accuracy; and because no manual processing is needed, counting is efficient. The method is suitable for wide adoption.
Detailed Description of the Embodiments
With reference to Fig. 1, the invention mainly comprises the following steps:
Step 1, document preprocessing: remove the noise content from the document.
To improve module efficiency and counting accuracy, the relevant content in the document is filtered before the module executes: HTML tags, URL link addresses, punctuation marks, spaces and other noise content unrelated to the document content are removed from each data record. Strictly speaking, such content is not part of the effective content of the document, so it need not be counted in the final result.
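The filtering described in Step 1 can be sketched as follows. This is a minimal illustration only: the regular expressions, the function name strip_noise, and the choice to treat every Unicode punctuation character as noise are assumptions rather than part of the invention.

```python
import re
import unicodedata

# Hypothetical patterns; real documents may need stricter HTML handling.
_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_noise(record: str) -> str:
    """Remove HTML tags, URLs, punctuation and whitespace from one record."""
    # Replace tags with a space first so a URL is not glued to adjacent text.
    text = _TAG_RE.sub(" ", record)
    text = _URL_RE.sub("", text)
    # Drop whitespace and every character in a Unicode punctuation category.
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )
```

For example, strip_noise('<p>Hello, world!</p> see http://example.com/a?b=1 now.') returns 'Helloworldseenow'. Tags are removed before URLs so that a link address inside an attribute disappears together with its tag.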
Step 2, data classification: records of the same kind are grouped into one class by calculating the similarity of the data.
2.1 First load all data into set G, then sort set G in descending order of text length, so that the longest record comes first and the shortest comes last.
2.2 Take one data record D out of set G, save it in classification set L1, and delete D from set G.
2.3 Using the edit distance algorithm, obtain in turn the minimum edit count between D and each other record GD in set G, and derive the similarity of the two texts D and GD from that minimum edit count. When the similarity is greater than or equal to the preset text similarity threshold, also store GD in set L1, save the minimum edit count S1 for D->GD, and delete GD from set G.
2.4 Repeat steps 2.2 and 2.3, storing the new data in new classification sets L2, ..., Ln.
2.5 Go through classification sets L1, ..., Ln and take out the sets whose length is 1. The data in these sets could not be matched with any other record, so all of it is taken out and saved in the unmatchable set W.
At this point the classification of the data is complete, yielding classification sets L1, ..., Ln and the unmatchable set W.
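Steps 2.1 to 2.5 can be sketched as below. The sketch assumes that similarity is computed as 1 - edit count / length of the longer record, which the text does not specify; the function names and the (text, edit count) tuple layout are likewise hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for replace, insert and delete."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def classify(records: List[str], threshold: float
             ) -> Tuple[List[List[Tuple[str, int]]], List[str]]:
    """Return (classification sets L1..Ln, unmatchable set W)."""
    g = sorted(records, key=len, reverse=True)  # 2.1: longest record first
    classes: List[List[Tuple[str, int]]] = []
    unmatched: List[str] = []
    while g:
        d = g.pop(0)                  # 2.2: take D out of G
        group = [(d, 0)]              # the reference record has edit count 0
        remaining = []
        for gd in g:                  # 2.3: compare D with every other GD
            s = edit_distance(d, gd)
            longest = max(len(d), len(gd))
            # Assumed similarity formula, not given in the source.
            sim = 1.0 if longest == 0 else 1.0 - s / longest
            if sim >= threshold:
                group.append((gd, s))  # store GD with its edit count
            else:
                remaining.append(gd)
        g = remaining                 # 2.4: repeat on what is left
        if len(group) == 1:           # 2.5: singleton sets go to W
            unmatched.append(d)
        else:
            classes.append(group)
    return classes, unmatched
```

With the illustrative input classify(["abcdef", "abcd", "xyz", "abcde"], 0.6), the three records sharing the prefix "abcd" form one class with stored edit counts 0, 1 and 2, and "xyz" ends up in the unmatchable set.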
The edit distance mentioned above is the minimum number of edit operations needed to convert one of two character strings into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. To reduce algorithmic complexity, the invention sets the weights of replacement, insertion and deletion all to 1. The edit distance algorithm comprises the following steps:
Step (1): let n be the length of character string s ('newest most hot best') and m be the length of character string t ('newest most hot'), and construct the two-dimensional array d[n+1, m+1] shown in Table 1 below.
Table 1
Step (2): initialize the two-dimensional array d[n+1, m+1].
Fill in the values of d[0, m+1] and d[n+1, 0] in sequentially increasing order, as shown in Table 2.
Table 2
Step (3): taking cell A in Table 2 as an example, set cell d[1,1] to the minimum of the following three values:
a. the value of the cell immediately above, plus 1: d[1,0] + 1;
b. the value of the cell immediately to the left, plus 1: d[0,1] + 1;
c. the value of the cell diagonally above and to the left, plus cost: d[0,0] + cost (cost indicates whether the two characters at the same position are equal);
With the values currently in the table, a is 2 and b is 2. For c, because d[0,1] is equal to d[1,0], cost is 0 (otherwise it would be 1). The three values a, b, c are therefore (2, 2, 0); the minimum is 0, so the value at A is 0.
Step (4): following the rule of step (3), calculate in turn the values at B, C, D and every other empty cell of the array. The value in the last cell of the array d[n+1, m+1] is then the minimum edit distance; here the minimum edit distance for 'newest most hot' -> 'newest most hot best' is 2. The completed array is shown in Table 3.
Table 3
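Steps (1) to (4) correspond to the standard dynamic-programming table for edit distance with unit weights. The sketch below builds the full table so that each cell can be inspected; the function name and the sample strings "abcdef" and "abcd" are illustrative stand-ins, chosen only so that the longer string adds two characters to the shorter one.

```python
from typing import List


def edit_distance_table(s: str, t: str) -> List[List[int]]:
    """Build the (n+1) x (m+1) edit distance table for strings s and t."""
    n, m = len(s), len(t)
    # Step (1): construct the two-dimensional array d[n+1][m+1].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Step (2): fill the first column and first row with 0, 1, 2, ...
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # Steps (3) and (4): each cell is the minimum of three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # neighbour in the previous row
                d[i][j - 1] + 1,         # neighbour in the previous column
                d[i - 1][j - 1] + cost,  # diagonal neighbour plus cost
            )
    return d


table = edit_distance_table("abcdef", "abcd")
distance = table[-1][-1]  # last cell holds the minimum edit distance
```

Here table[1][1] is 0 because the first characters match, and distance is 2. With zero-based indexing the result sits at d[n][m], the last cell of the (n+1) by (m+1) array.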
Step 3, effective content statistics for the classification sets.
3.1 Traverse set L1 in a loop. Using the first data record LD1 as the reference object, take out the second data object LD2 and the minimum edit count S1 stored in the LD2 object, and calculate the effective content count L1A1 of these two records: L1A1 = S1 + (text length of the LD2 object).
3.2 In the manner of 3.1, take out the third record LD3 through LDn in turn to obtain L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1:
L1A = (L1A1 + L1A2 + ... + L1A(n-1)) - (text length of the LD1 object) * (length of set L1 - 1).
3.3 Repeat 3.1 to 3.2 to calculate L2A, ..., LnA in turn. The effective content count WA of the unmatchable set W is the sum of the text lengths of all objects in that set.
Step 4: the effective content count LS finally obtained for the current document is:
LS = L1A + L2A + ... + LnA + WA.
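The formulas of Steps 3 and 4 can be transcribed literally as in the sketch below. It assumes each classification set is a list of (text, minimum edit count) pairs whose first entry is the reference record LD1; that layout and all function names are hypothetical.

```python
from typing import List, Tuple

ClassSet = List[Tuple[str, int]]  # (record text, stored minimum edit count)


def class_effective_count(group: ClassSet) -> int:
    """L1A = (L1A1 + ... + L1A(n-1)) - len(LD1) * (len(L1) - 1)."""
    reference_text, _ = group[0]
    # L1Ak = Sk + text length of the k-th non-reference record.
    pair_counts = [edits + len(text) for text, edits in group[1:]]
    return sum(pair_counts) - len(reference_text) * (len(group) - 1)


def unmatched_effective_count(w: List[str]) -> int:
    """WA: sum of the text lengths of all records in the unmatchable set W."""
    return sum(len(text) for text in w)


def document_effective_count(classes: List[ClassSet], w: List[str]) -> int:
    """LS = L1A + L2A + ... + LnA + WA."""
    return sum(class_effective_count(g) for g in classes) \
        + unmatched_effective_count(w)
```

The functions follow the formulas exactly as written above and add no interpretation of their own; the inputs are expected to come from the classification of Step 2.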
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall be covered by the claims of the invention.