CN104750668B - A method for counting the effective content of a table - Google Patents

A method for counting the effective content of a table

Info

Publication number
CN104750668B
Authority
CN
China
Prior art keywords
data
classification
pen
similarity
effective content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510141995.8A
Other languages
Chinese (zh)
Other versions
CN104750668A (en)
Inventor
江潮
贺建华
蒋汉华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Language network (Wuhan) Information Technology Co., Ltd.
Original Assignee
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201510141995.8A
Publication of CN104750668A
Application granted
Publication of CN104750668B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of computer data-mining applications, and in particular to counting the content of electronic spreadsheets. The invention automatically filters out the noise in a document, calculates the similarity between pairs of records, classifies the records according to that similarity, counts the effective part of the records in each class in turn, and finally sums the counts of all classes to obtain the overall effective content of the document. The invention automatically avoids counting repeated content more than once, so the count of the effective part is highly accurate; it also requires no manual processing, so counting is efficient. The method is suitable for wide adoption.

Description

A method for counting the effective content of a table
Technical field
The present invention relates to the field of computer data-mining applications, and in particular to counting the content of electronic spreadsheets.
Background
At present, the content of an electronic spreadsheet document is counted with the statistics function built into Excel. The figure obtained this way is merely the total of everything in the Excel document, and not all of that content is effective: it is mixed with noise such as fragments of HTML code, URL link addresses and punctuation marks, as well as parts that are repeated between different records in the same column. The content count produced by the existing approach is therefore far larger than the effective part of the document and cannot meet the need to count the effective content of a table. For example, when an Excel file is the source text for a translation job, the noise should not be included in the translation word count, nor should the parts repeated between different records in the same column. To get this right, someone must manually identify and reject the noise and remove the parts that repeat within a column. As the volume of Excel data grows, the cost of this manual intervention rises, efficiency falls, the probability of error increases, and the accuracy of the final figure declines.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for counting the effective content of a table that overcomes the defect of the prior art, in which invalid content is included in the count.
To solve this technical problem, the invention provides a method for counting the effective content of a table, comprising the following steps:
Step I: preprocess the document and filter out the noise in the document;
Step II: calculate the similarity of the records and classify the records according to the similarity value;
Step III: calculate the effective content count of the record set in each class;
Step IV: add up the effective content counts of all classes to obtain the final effective content count.
Filtering out the noise in the document means removing from each record the HTML tags, URL link addresses, punctuation marks and spaces that are unrelated to the document content.
Preferably, step II comprises the following steps:
I. First load all records into set G, then sort set G in descending order of text length, so that the longest record comes first and the shortest comes last.
II. Take one record D out of set G, save it in class set L1, and delete D from set G.
III. Calculate in turn the similarity between D and each other record GD in set G. When the similarity is greater than or equal to a preset text similarity threshold, also store GD in set L1, save the minimum edit count S1 of D->GD, and delete GD from set G.
IV. Repeat steps II and III to form class sets L2, ..., Ln.
Preferably, calculating the similarity of the records comprises the following steps:
calculating the minimum edit count between two records with the edit distance algorithm;
calculating the similarity of the two records from that edit count.
Calculating the effective content count of the record set in each class comprises the following steps:
3.1. Loop over set L1 with the first record LD1 as the reference object. Take out the second record object LD2 and the minimum edit count S1 stored in the LD2 object, and calculate the effective content count L1A1 of these two records: L1A1 = S1 + (text length of LD2);
3.2. In the same way as 3.1, take out the third record LD3 and so on up to LDn, obtaining L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1:
L1A = (L1A1 + L1A2 + ... + L1A(n-1)) - (text length of LD1) * (length of set L1 - 1);
3.3. Repeat steps 3.1 to 3.2 to calculate in turn the effective content counts L2A, ..., LnA of class sets L2, ..., Ln;
3.4. The effective content count WA of the unmatched set W is the sum of the text lengths of all objects in that set.
The invention automatically filters out the noise in a document, calculates the similarity between pairs of records, classifies the records according to that similarity, counts the effective part of the records in each class in turn, and finally sums the counts of all classes to obtain the overall effective content. The invention automatically avoids counting repeated content more than once, so the count of the effective part is highly accurate; it also requires no manual processing, so counting is efficient. The method is suitable for wide adoption.
Brief description of the drawings
The technical solution of the invention is described in further detail below with reference to the accompanying drawing and the specific embodiment.
Fig. 1 is the flow chart of the specific embodiment of the invention.
Embodiment
With reference to Fig. 1, the invention mainly comprises the following steps:
Step 1: preprocess the document and remove the noise content from it.
To improve the efficiency of the module and the accuracy of the count, the content of the document is filtered before the module runs. Noise content such as HTML tags, URL link addresses, punctuation marks and spaces that are unrelated to the document content is removed from each record. Strictly speaking, this content is not part of the effective content of the document, so it does not need to be included in the final result.
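The filtering in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `strip_noise`, the regular expressions and the particular list of punctuation marks are all assumptions, since the text names only the categories of noise (HTML tags, URL link addresses, punctuation marks, spaces).

```python
import re
import string

# Assumed patterns; the source only names the categories of noise.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Whitespace, ASCII punctuation and a few common CJK punctuation marks.
NOISE_CHARS = re.compile(
    r"[\s" + re.escape(string.punctuation) + "，。、；：？！“”‘’（）《》]+"
)


def strip_noise(cell: str) -> str:
    """Return the cell text with tags, URLs, punctuation and spaces removed."""
    text = HTML_TAG.sub(" ", cell)   # drop tags first so URLs inside attributes go too
    text = URL.sub(" ", text)        # drop bare link addresses
    return NOISE_CHARS.sub("", text) # drop punctuation and all whitespace
```

Tags are removed before URLs so that a link address inside an attribute disappears together with its tag.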
Step 2: classify the records, grouping similar records into one class by calculating their similarity.
2.1 First load all records into set G, then sort set G in descending order of text length, so that the longest record comes first and the shortest comes last.
2.2 Take one record D out of set G, save it in class set L1, and delete D from set G.
2.3 Using the edit distance algorithm, compute in turn the minimum edit count between D and each other record GD in set G, and derive the similarity of the two texts D and GD from that minimum edit count. When the similarity is greater than or equal to a preset text similarity threshold, also store GD in set L1, save the minimum edit count S1 of D->GD, and delete GD from set G.
2.4 Repeat steps 2.2 and 2.3, storing the new records in new class sets L2, ..., Ln.
2.5 Go through class sets L1, ..., Ln and pick out the sets whose length is 1. The records in these sets matched nothing; take them all out and save them in the unmatched set W.
Classification is now complete, yielding class sets L1, ..., Ln and the unmatched set W.
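Steps 2.1 to 2.5 can be sketched as below. Two points are assumptions rather than statements from the text: the similarity formula (here 1 minus the edit count divided by the longer text length), which the text leaves unspecified, and the default threshold of 0.6. The names `classify`, `edit_distance` and `similarity_from_edits` are hypothetical.

```python
from typing import List, Tuple


def edit_distance(s: str, t: str) -> int:
    """Minimum number of unit-cost replace/insert/delete operations."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_from_edits(edits: int, s: str, t: str) -> float:
    """Assumed normalisation; the source does not give this formula."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edits / longest


def classify(
    records: List[str], threshold: float = 0.6
) -> Tuple[List[List[Tuple[str, int]]], List[str]]:
    """Return (class sets L1..Ln, unmatched set W).

    Each class is a list of (text, edit count from the reference record),
    with the reference record D first and an edit count of 0.
    """
    g = sorted(records, key=len, reverse=True)       # step 2.1: longest first
    classes: List[List[Tuple[str, int]]] = []
    unmatched: List[str] = []
    while g:
        d = g.pop(0)                                 # step 2.2: take D out of G
        members = [(d, 0)]
        remaining = []
        for gd in g:                                 # step 2.3: compare D with each GD
            edits = edit_distance(d, gd)
            if similarity_from_edits(edits, d, gd) >= threshold:
                members.append((gd, edits))          # keep GD and its edit count
            else:
                remaining.append(gd)
        g = remaining                                # step 2.4: repeat on what is left
        if len(members) == 1:                        # step 2.5: singleton goes to W
            unmatched.append(d)
        else:
            classes.append(members)
    return classes, unmatched
```

Storing the edit count next to each record mirrors the text's instruction to save the minimum edit count of D->GD for use in the later counting step.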
The edit distance mentioned above is the minimum number of edit operations needed to convert one of two character strings into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. To reduce the complexity of the algorithm, the invention sets the weights of replacement, insertion and deletion all to 1. The edit distance algorithm comprises the following steps:
Step (1): let n be the length of character string s ('newest most hot best') and m be the length of character string t ('newest most hot'), and construct the two-dimensional array d[n+1, m+1] shown in Table 1.
Table 1
Step (2): initialise the two-dimensional array d[n+1, m+1];
fill in the values of d[0, m+1] and d[n+1, 0] in increasing order, as shown in Table 2.
Table 2
Step (3): taking cell A in Table 2 as an example, set cell d[1,1] to the minimum of the following:
a. the value of the cell directly above plus 1: d[1,0]+1;
b. the value of the cell directly to the left plus 1: d[0,1]+1;
c. the value of the cell diagonally above and to the left plus cost: d[0,0]+cost (cost indicates whether the two characters at the corresponding positions are equal);
With the values currently in the table, a is 2 and b is 2. For c, cost is 0 when the two characters being compared are equal and 1 otherwise; here they are equal, so cost is 0 and c is 0. The three values a, b, c are therefore (2, 2, 0); the minimum is 0, so the value at A is 0.
Step (4): following the rule of step (3), compute in turn the values at B, C, D and all the other empty cells of the array. The value in the final cell of the array d[n+1, m+1] is then the minimum edit distance; here the minimum edit distance of 'newest most hot' -> 'newest most hot best' is 2. The completed array is shown in Table 3.
Table 3
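Steps (1) to (4) correspond to the standard dynamic-programming edit distance with unit weights. The sketch below returns the whole array so that the first row and column (Table 2) and the completed array (Table 3) can be inspected. The function name `edit_distance_table` and the ASCII strings in the usage note are hypothetical stand-ins, not the strings used in the patent.

```python
from typing import List


def edit_distance_table(s: str, t: str) -> List[List[int]]:
    """Build the (n+1) x (m+1) array d; d[n][m] is the minimum edit distance."""
    n, m = len(s), len(t)
    # Step (1): construct the array.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Step (2): fill the first column and first row with 0, 1, 2, ...
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # Steps (3) and (4): each remaining cell is the minimum of three candidates.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1  # 0 when the characters match
            d[i][j] = min(
                d[i - 1][j] + 1,         # neighbouring cell plus 1 (delete)
                d[i][j - 1] + 1,         # neighbouring cell plus 1 (insert)
                d[i - 1][j - 1] + cost,  # diagonal cell plus cost (replace or keep)
            )
    return d
```

For a six-character string and its four-character prefix, such as `edit_distance_table("abcdef", "abcd")`, the final cell holds 2, which matches the shape of the example in the text (two trailing characters removed).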
Step 3: count the effective content of the class sets.
3.1 Loop over set L1 with the first record LD1 as the reference object. Take out the second record object LD2 and the minimum edit count S1 stored in the LD2 object, and calculate the effective content count L1A1 of these two records: L1A1 = S1 + (text length of LD2).
3.2 In the same way as 3.1, take out the third record LD3 and so on up to LDn, obtaining L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1:
L1A = (L1A1 + L1A2 + ... + L1A(n-1)) - (text length of LD1) * (length of set L1 - 1).
3.3 Repeat 3.1 to 3.2 to calculate L2A, ..., LnA in turn. The effective content count WA of the unmatched set W is the sum of the text lengths of all objects in that set.
Step 4: the final effective content count LS of the current document is:
LS = L1A + L2A + ... + LnA + WA.
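The arithmetic of steps 3 and 4 can be transcribed literally as below. This is only a sketch of the formulas as they are written in the text. It assumes that each class is held as a list of (text, edit count) pairs with the reference record LD1 first, and that "length of set L1" means the number of records in the class including LD1. The function names are hypothetical.

```python
from typing import List, Tuple

ClassSet = List[Tuple[str, int]]  # (record text, edit count from LD1); LD1 first


def class_effective_count(members: ClassSet) -> int:
    """L1A as written in steps 3.1 and 3.2."""
    ld1_text, _ = members[0]
    # L1A1 .. L1A(n-1): stored edit count plus text length of LD2 .. LDn.
    pair_counts = [edits + len(text) for text, edits in members[1:]]
    return sum(pair_counts) - len(ld1_text) * (len(members) - 1)


def unmatched_count(unmatched: List[str]) -> int:
    """WA: sum of the text lengths of the records in the unmatched set W."""
    return sum(len(text) for text in unmatched)


def total_effective_count(classes: List[ClassSet], unmatched: List[str]) -> int:
    """LS = L1A + L2A + ... + LnA + WA (step 4)."""
    return sum(class_effective_count(c) for c in classes) + unmatched_count(unmatched)
```

Keeping the per-class count and the final sum in separate functions follows the split between step 3 and step 4 in the text.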
Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or replaced by equivalents without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the invention.

Claims (4)

1. A method for counting the effective content of a table, characterised in that it comprises the following steps:
Step I: preprocess the document and filter out the noise in the document;
Step II: calculate the similarity of the records and classify the records according to the similarity value;
Step III: calculate the effective content count of the record set in each class;
Step IV: add up the effective content counts of all classes to obtain the total effective content count;
wherein calculating the effective content count of the record set in each class comprises the following steps:
3.1. Loop over set L1 with the first record LD1 as the reference object, take out the second record object LD2 and the minimum edit count S1 stored in the LD2 object, and calculate the effective content count L1A1 of these two records: L1A1 = S1 + (text length of LD2);
3.2. In the same way as step 3.1, take out the third record LD3 and so on up to LDn, obtaining L1A2, ..., L1A(n-1), and finally compute the effective content count L1A of set L1:
L1A = (L1A1 + L1A2 + ... + L1A(n-1)) - (text length of LD1) * (length of set L1 - 1);
3.3. Repeat steps 3.1 to 3.2 to calculate in turn the effective content counts L2A, ..., LnA of class sets L2, ..., Ln;
3.4. The effective content count WA of the unmatched set W is the sum of the text lengths of all objects in that set.
2. The method for counting the effective content of a table according to claim 1, characterised in that filtering out the noise in the document means removing from each record the HTML tags, URL link addresses, punctuation marks and spaces that are unrelated to the document content.
3. The method for counting the effective content of a table according to claim 1, characterised in that step II comprises the following steps:
I. First load all records into set G, then sort set G in descending order of text length, so that the longest record comes first and the shortest comes last;
II. Take one record D out of set G, save it in class set L1, and delete D from set G;
III. Calculate in turn the similarity between D and each other record GD in set G; when the similarity is greater than or equal to a preset text similarity threshold, also store GD in set L1, save the minimum edit count S1 of D->GD, and delete GD from set G;
IV. Repeat steps II and III to form class sets L2, ..., Ln.
4. The method for counting the effective content of a table according to claim 3, characterised in that calculating the similarity of the records comprises the following steps:
calculating the minimum edit count between two records with the edit distance algorithm;
calculating the similarity of the two records from that edit count.
CN201510141995.8A 2015-03-27 2015-03-27 A method for counting the effective content of a table Active CN104750668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510141995.8A CN104750668B (en) 2015-03-27 2015-03-27 A method for counting the effective content of a table

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510141995.8A CN104750668B (en) 2015-03-27 2015-03-27 A method for counting the effective content of a table

Publications (2)

Publication Number Publication Date
CN104750668A CN104750668A (en) 2015-07-01
CN104750668B true CN104750668B (en) 2017-10-17

Family

ID=53590380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510141995.8A Active CN104750668B (en) 2015-03-27 2015-03-27 A method for counting the effective content of a table

Country Status (1)

Country Link
CN (1) CN104750668B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874250B (en) * 2017-02-15 2020-08-25 中车株洲电机有限公司 Automatic operation method and system based on word domain

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
CN104008166A (en) * 2014-05-30 2014-08-27 华东师范大学 Dialogue short text clustering method based on form and semantic similarity

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7610281B2 (en) * 2006-11-29 2009-10-27 Oracle International Corp. Efficient computation of document similarity
CA2782391A1 (en) * 2012-06-29 2013-12-29 The Governors Of The University Of Alberta Methods for matching xml documents

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
CN104008166A (en) * 2014-05-30 2014-08-27 华东师范大学 Dialogue short text clustering method based on form and semantic similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
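A cleaning algorithm for similar duplicate records based on improved edit distance (基于改进编辑距离的相似重复记录清理算法); 叶焕倬 et al.; 《现代图书情报技术》; 2011-08-31; page 83, section 2; page 86, section 4 *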

Also Published As

Publication number Publication date
CN104750668A (en) 2015-07-01

Similar Documents

Publication Publication Date Title
CN110968667B (en) Periodical and literature table extraction method based on text state characteristics
CN101770446B (en) Method and system for identifying form in layout file
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN102567530B (en) Intelligent extraction system and intelligent extraction method for article type web pages
CN103823838B (en) A kind of method of multi-format document typing and comparison
CN107423391B (en) Information extraction method of webpage structured data
CN104199857A (en) Tax document hierarchical classification method based on multi-tag classification
CN106709032A (en) Method and device for extracting structured information from spreadsheet document
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN107229668A (en) A kind of text extracting method based on Keywords matching
CN106446072B (en) The treating method and apparatus of web page contents
CN106407195B (en) Method and system for web page duplication elimination
CN103106245A (en) Method which is used for classifying translation manuscript in automatic fragmentation mode and based on large-scale term corpus
CN106055613A (en) Cleaning method for data classification and training databases based on mixed norm
CN104504151A (en) Public opinion monitoring system of Wechat
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
CN108228787B (en) Method and device for processing information according to multi-level categories
CN110765402A (en) Visual acquisition system and method based on network resources
CN115270723A (en) PDF document splitting method, device, equipment and storage medium
CN101714147A (en) Method for filtering same or similar files
CN104750668B (en) A kind of method of the effective content of statistical table
CN103064966A (en) Method for extracting regular noise from single record web pages
Chu et al. Automatic data extraction of websites using data path matching and alignment
CN103218420A (en) Method and device for extracting page titles
CN112148735A (en) Construction method for structured form data knowledge graph

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: WUHAN TRANSN INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: YULIANWANG (WUHAN) INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20150731

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20150731

Address after: 430074, Optics Valley Software Park, East Lake Development Zone, Wuhan, south of Hubei, South Lake Road, Optics Valley Software Park, 2, six, 5, No. 205

Applicant after: Wuhan Transn Information Technology Co., Ltd.

Address before: 430074, Optics Valley Software Park, East Lake Development Zone, Wuhan, south of Hubei, South Lake Road, Optics Valley Software Park, 2, six, 6, No. 206

Applicant before: Language network (Wuhan) Information Technology Co., Ltd.

GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 430074, Optics Valley Software Park, East Lake Development Zone, Wuhan, south of Hubei, South Lake Road, Optics Valley Software Park, 2, six, 5, No. 205

Patentee after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430074, Optics Valley Software Park, East Lake Development Zone, Wuhan, south of Hubei, South Lake Road, Optics Valley Software Park, 2, six, 5, No. 205

Patentee before: Wuhan Transn Information Technology Co., Ltd.