CN104750668B

CN104750668B - A method for counting the effective content of a table

Info

Publication number: CN104750668B
Application number: CN201510141995.8A
Authority: CN
Inventors: 江潮; 贺建华; 蒋汉华
Original assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Current assignee: Language network (Wuhan) Information Technology Co., Ltd.
Priority date: 2015-03-27
Filing date: 2015-03-27
Publication date: 2017-10-17
Anticipated expiration: 2035-03-27
Also published as: CN104750668A

Abstract

The present invention relates to the field of data mining applications in computing, and more particularly to content statistics for electronic spreadsheets. The invention automatically filters out the noise in a document, then computes the similarity between pairs of records and classifies the records according to their similarity values. It then counts the effective content of the record set in each category in turn, and finally sums the counts of all categories to obtain the overall effective content of the document. The invention automatically avoids counting duplicated content more than once, so the effective-content count is highly accurate; because no manual processing is needed, counting is also efficient. The method is well suited to wide adoption.

Description

A method for counting the effective content of a table

Technical field

The present invention relates to the field of data mining applications in computing, and more particularly to content statistics for electronic spreadsheets.

Background art

At present, the content of an electronic spreadsheet document is counted with the statistics function built into Excel. The figure obtained this way is simply the total of everything in the Excel document, and not all of that content is effective: it is interspersed with noise such as HTML code, URL link addresses, and punctuation marks, and with portions repeated between different records in the same column. The content count produced by the existing approach is therefore far larger than the effective part of the document and cannot meet the need to count the effective content of a table. For example, when an Excel file is the source text for a translation job, the noise should not be included in the translation word count, and neither should the portions repeated between different records in the same column. To achieve this today, someone has to identify and remove the noise by hand and eliminate the repeated portions of records in the same column. As the volume of Excel data grows, the cost of this manual work rises, efficiency falls, errors become more likely, and the accuracy of the final count declines.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for counting the effective content of a table that overcomes the defect of the prior art, whose counts include invalid content.

To solve the above technical problem, the present invention provides a method for counting the effective content of a table, comprising the following steps:

Step I: preprocess the document, filtering out the noise in it;

Step II: compute the similarity between records and classify the records according to the similarity values;

Step III: compute the effective content count of the record set in each category;

Step IV: sum the effective content counts of all categories to obtain the final effective content count.

Filtering out the noise in the document means removing, from each record, the HTML tags, URL link addresses, punctuation marks, and spaces that are unrelated to the document content.

Preferably, Step II comprises the following steps:

I. First load all records into a set G, then sort G in descending order of text length, so that the longest record comes first and the shortest comes last.

II. Take a record D out of G, save it in category set L1, and delete D from G.

III. Compute, in turn, the similarity between D and each other record GD in G. When the similarity is greater than or equal to a preset text similarity threshold, also save GD in L1, record the minimum edit count S1 for D->GD, and delete GD from G.

IV. Repeat steps II and III to form category sets L2, ..., Ln.

Preferably, computing the similarity between records comprises the following steps:

compute the minimum number of edits between the two records using an edit distance algorithm;

compute the similarity of the two records from that number of edits.

Computing the effective content count of the record set in each category comprises the following steps:

3.1 Traverse set L1, using the first record LD1 as the reference object. Take out the second record object LD2 and the minimum edit count S1 stored with LD2, and compute the effective content of these two records, L1A1: L1A1 = S1 + (text length of LD2);

3.2 In the same way as 3.1, take out the third record LD3 and so on up to LDn, obtaining L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1:

L1A = (L1A1 + L1A2 + ... + L1A(n-1)) - (text length of LD1) * (length of set L1 - 1);

3.3 Repeat steps 3.1 to 3.2 to compute, in turn, the effective content counts L2A, ..., LnA of category sets L2, ..., Ln;

3.4 The effective content count WA of the unmatched set W is the sum of the text lengths of all objects in that set.

The invention automatically filters out the noise in a document, then computes the similarity between pairs of records and classifies the records according to their similarity values. It then counts the effective content of the record set in each category in turn, and finally sums the counts of all categories to obtain the overall effective content. The invention automatically avoids counting duplicated content more than once, so the effective-content count is highly accurate; because no manual processing is needed, counting is also efficient. The method is well suited to wide adoption.

Brief description of the drawings

The technical solution of the present invention is described in further detail below with reference to the accompanying drawing and a specific embodiment.

Fig. 1 is a flow chart of a specific embodiment of the invention.

Embodiment

With reference to Fig. 1, the invention mainly comprises the following steps:

Step 1: document preprocessing. Remove the noise content from the document.

To improve the efficiency of the module and the accuracy of the count, the content of the document is filtered before the module runs. Noise such as HTML tags, URL link addresses, punctuation marks, and spaces that is unrelated to the document content is removed from each record. Strictly speaking, this content is not part of the effective content of the document, so it does not need to be included in the final result.
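The filtering in Step 1 could be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the regular expressions, the punctuation list, and the function name `strip_noise` are all assumptions, because the text names the noise categories but not the exact matching rules.

```python
import re
import string

# Assumed patterns: the text lists the noise categories only.
TAG_RE = re.compile(r"<[^>]+>")                              # HTML tags
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)  # URL link addresses
# ASCII punctuation plus a few common CJK marks (a non-exhaustive, assumed list).
PUNCT = string.punctuation + "，。！？；：、“”‘’（）《》【】…"
PUNCT_RE = re.compile("[" + re.escape(PUNCT) + "]")
SPACE_RE = re.compile(r"\s+")


def strip_noise(cell: str) -> str:
    """Return the cell text with tags, URLs, punctuation and whitespace removed."""
    text = TAG_RE.sub("", cell)    # drop tags first, so URLs inside attributes go with them
    text = URL_RE.sub("", text)    # then drop bare URLs left in the text
    text = PUNCT_RE.sub("", text)  # then punctuation
    return SPACE_RE.sub("", text)  # finally all whitespace


if __name__ == "__main__":
    print(strip_noise("<p>Hello, <b>world</b>!</p> http://example.com/x"))
```

Tags are removed before URLs so that a link address appearing inside a tag attribute disappears together with the tag.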

Step 2: record classification. Records of the same kind are grouped into one category by computing the similarity between records.

2.1 First load all records into a set G, then sort G in descending order of text length, so that the longest record comes first and the shortest comes last.

2.2 Take a record D out of G, save it in category set L1, and delete D from G.

2.3 For each other record GD in G in turn, use the edit distance algorithm to obtain the minimum edit count between D and GD, and derive the similarity of the two texts from that count. When the similarity is greater than or equal to the preset text similarity threshold, also save GD in L1, record the minimum edit count S1 for D->GD, and delete GD from G.

2.4 Repeat steps 2.2 and 2.3, saving the new records in new category sets L2, ..., Ln.

2.5 Go through category sets L1, ..., Ln and pick out the sets whose length is 1. The records in these sets could not be matched with any other record; take them all out and save them in the unmatched set W.

Classification is now complete, yielding the category sets L1, ..., Ln and the unmatched set W.
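Steps 2.1 to 2.5 can be sketched as a greedy grouping loop. The text does not give the formula that turns an edit count into a similarity, so this sketch assumes 1 - distance / (length of the longer text). The 0.6 threshold, the function names, and the (text, edit count) tuple layout are likewise hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit cost for replace, insert and delete."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(dist: int, s: str, t: str) -> float:
    # Assumed formula: the text only says similarity is derived from the edit count.
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - dist / longest


def classify(records, threshold=0.6):
    """Return (groups, unmatched); each group is a list of (text, edits_from_first)."""
    g = sorted(records, key=len, reverse=True)    # 2.1: longest record first
    groups, unmatched = [], []
    while g:
        d = g.pop(0)                              # 2.2: take D out of G
        group, rest = [(d, 0)], []
        for gd in g:                              # 2.3: compare D with every other record
            dist = edit_distance(d, gd)
            if similarity(dist, d, gd) >= threshold:
                group.append((gd, dist))          # keep GD with its edit count
            else:
                rest.append(gd)
        g = rest                                  # 2.4: repeat on what is left in G
        if len(group) == 1:                       # 2.5: singleton sets go to W
            unmatched.append(d)
        else:
            groups.append(group)
    return groups, unmatched


if __name__ == "__main__":
    print(classify(["abcdef", "abcd", "xyz", "abcdeg"]))
```

Because each record is removed from G as soon as it joins a category, every record ends up in exactly one category set or in W.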

The edit distance mentioned above is the minimum number of edit operations needed to convert one of two character strings into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. To reduce algorithmic complexity, the present invention sets the weights of replacement, insertion, and deletion all to 1. The edit distance algorithm comprises the following steps:

Step (1): let n be the length of string s ('newest most hot best') and m be the length of string t ('newest most hot'), and construct the two-dimensional array d[n+1, m+1] shown in Table 1 below.

Table 1

Step (2): initialize the two-dimensional array d[n+1, m+1].

Fill the first row and first column of the array, d[0, m+1] and d[n+1, 0], with sequentially increasing values, as shown in Table 2.

Table 2

Step (3): taking cell A in Table 2 as an example, set cell d[1,1] to the minimum of the following values:

a. the value directly above the cell, plus 1: d[1,0]+1;

b. the value directly to the left of the cell, plus 1: d[0,1]+1;

c. the value diagonally above and to the left of the cell, plus cost: d[0,0]+cost (cost indicates whether the two characters at the corresponding positions are equal).

With the values currently in the table, a is 2 and b is 2. For c, the two characters being compared are equal, so cost is 0 (it would be 1 otherwise) and c is 0. The three values a, b, c are therefore (2, 2, 0); the minimum is 0, so the value at A is 0.

Step (4): following the rule of step (3), compute the values at B, C, D and the remaining empty cells of the array in turn. The final value in d[n+1, m+1] is the minimum edit distance, so the minimum edit distance for 'newest most hot' -> 'newest most hot best' is 2. The completed array is shown in Table 3.

Table 3
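Steps (1) to (4) correspond to the standard dynamic-programming computation of the Levenshtein distance. The sketch below builds the full (n+1) by (m+1) array so that the filled array can be inspected. The function name is hypothetical, and the ASCII strings are placeholders chosen so that t is s with its last two characters removed, which gives a distance of 2 as in the example.

```python
def edit_distance_table(s: str, t: str):
    """Return the filled (len(s)+1) x (len(t)+1) edit distance array."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]    # step (1): construct the array
    for i in range(n + 1):                       # step (2): first column 0..n
        d[i][0] = i
    for j in range(m + 1):                       # step (2): first row 0..m
        d[0][j] = j
    for i in range(1, n + 1):                    # steps (3) and (4): fill every cell
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # neighbour above: one deletion
                d[i][j - 1] + 1,         # neighbour to the left: one insertion
                d[i - 1][j - 1] + cost,  # diagonal neighbour: match or replacement
            )
    return d


if __name__ == "__main__":
    table = edit_distance_table("abcdef", "abcd")
    for row in table:
        print(row)
    print(table[-1][-1])  # 2
```

The bottom-right cell holds the distance. Cell [1][1] is 0 here because the first characters of the two strings match, which mirrors the computation described for cell A.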

Step 3: count the effective content of the category sets.

3.1 Traverse set L1, using the first record LD1 as the reference object. Take out the second record object LD2 and the minimum edit count S1 stored with LD2, and compute the effective content of these two records, L1A1: L1A1 = S1 + (text length of LD2).

3.2 In the same way as 3.1, take out the third record LD3 and so on up to LDn, obtaining L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1:

L1A = (L1A1 + L1A2 + ... + L1A(n-1)) - (text length of LD1) * (length of set L1 - 1).

3.3 Repeat steps 3.1 to 3.2 to compute L2A, ..., LnA in turn. The effective content count WA of the unmatched set W is the sum of the text lengths of all objects in that set.

Step 4: the effective content count LS finally obtained for the current document is:

LS = L1A + L2A + ... + LnA + WA.
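The arithmetic of Steps 3 and 4 can be transcribed directly. The sketch below follows the formulas exactly as stated in the text and does not try to interpret them further. The function names and the input layout are assumptions: each category set is taken to be a list of (text, edit count) pairs whose first entry is the reference record LD1, and W is a plain list of texts.

```python
def group_effective_count(group):
    """Apply the stated formula for one category set.

    group: list of (text, edits_from_first); group[0] is the reference LD1.
    """
    ld1_len = len(group[0][0])
    # L1A1 ... L1A(n-1): stored edit count plus the text length of each compared record
    pair_counts = [edits + len(text) for text, edits in group[1:]]
    # L1A = (L1A1 + ... + L1A(n-1)) - len(LD1) * (length of set - 1)
    return sum(pair_counts) - ld1_len * (len(group) - 1)


def total_effective_count(groups, unmatched):
    """LS = L1A + L2A + ... + LnA + WA."""
    wa = sum(len(text) for text in unmatched)    # WA: sum of text lengths in W
    return sum(group_effective_count(g) for g in groups) + wa


if __name__ == "__main__":
    l1 = [("abcdef", 0), ("abcdeg", 1), ("abcd", 2)]
    print(group_effective_count(l1))             # (1 + 6) + (2 + 4) - 6 * 2 = 1
    print(total_effective_count([l1], ["xyz"]))  # 1 + 3 = 4
```

As written, the formula subtracts the length of LD1 once per compared record, so each compared record LDi contributes S + len(LDi) - len(LD1) to the total for its set.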

Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications and substitutions are covered by the claims of the present invention.

Claims

1. A method for counting the effective content of a table, characterized by comprising the following steps:

Step I: preprocess the document, filtering out the noise in it;

Step IV: sum the effective content counts of all categories to obtain the total effective content count;

3.1 Traverse set L1, using the first record LD1 as the reference object. Take out the second record object LD2 and the minimum edit count S1 stored with LD2, and compute the effective content of these two records, L1A1: L1A1 = S1 + (text length of LD2);

3.2 In the same way as step 3.1, take out the third record LD3 and so on up to LDn, obtaining L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1,

3.3 Repeat steps 3.1 to 3.2 to compute, in turn, the effective content counts L2A, ..., LnA of category sets L2, ..., Ln;

2. the method for the effective content of statistical table according to claim 1, it is characterised in that described to filter out in document Noise components are, with the incoherent html labels of document content, url link addresses, punctuation mark and sky in removing per pen data Lattice.

3. the method for the effective content of statistical table according to claim 1, it is characterised in that the step II includes following Step：

I, load all data into first in set G, Bit-reversed is then carried out to set G according to size text, that is, Make number one length is most long, length is most short to roll into last place；

A pen data D in II, taking-up set G, is saved it in classification set L1, and data D is deleted from set G；

III, the similarities of calculating data D successively with other data GD in set G, when the similarity numerical value is more than or equal in advance During the text similarity threshold value set, then GD is also stored in set L1, and preserve D->GD minimum editor number S1, and Data GD is deleted in set G；

IV, repeat step II, the mode of III, form classification set L2 ..., Ln.

4. the method for the effective content of statistical table according to claim 3, it is characterised in that the calculating data it is similar Degree, comprises the following steps：

Pass through the pen data of editing distance algorithm comparison two minimum editor's number of times；

The similarity of two pen datas is calculated by editor's number of times.