CN107515908A

CN107515908A - Data processing method and device

Info

Publication number: CN107515908A
Application number: CN201710683566.2A
Authority: CN
Inventors: 汪利鹏; 赵丹; 牟远; 王勇
Original assignee: New Chi Chi (beijing) Technology Services Co Ltd
Current assignee: New Chi Chi (beijing) Technology Services Co Ltd
Priority date: 2017-08-11
Filing date: 2017-08-11
Publication date: 2017-12-26

Abstract

The invention discloses a data processing method and device. The method includes: obtaining to-be-processed spreadsheet data; querying the to-be-processed spreadsheet data for columns that contain non-numeric information and digitizing the columns found; obtaining multiple cluster combinations from the to-be-processed spreadsheet data, where each cluster combination includes at least one cluster field; extracting one cluster combination, querying the corresponding information in the processed spreadsheet data according to that cluster combination, performing cluster analysis on the corresponding information to obtain a specified number of cluster samples, and computing and saving the proportion of the corresponding information that each cluster sample accounts for; computing and saving the proportion that the corresponding information under each cluster field of the cluster combination accounts for in each cluster sample; and returning to the operation of extracting one cluster combination until all preset cluster combinations have been processed. The invention improves data exploration efficiency, statistical efficiency, automated batch-processing capability and information readability.

Description

Data processing method and device

Technical field

Embodiments of the present invention relate to data mining technology, and in particular to a data processing method and device.

Background art

In recent years, with the rapid development of big data technology, mining the value of data has become an indispensable part of business and government management. At present there are generally two approaches to mining the value of data: traditional statistical analysis and the newer machine learning.

Statistical analysis consists of the familiar grouping and aggregation analyses. Its results usually include statistics such as "sum", "difference", "mean", "distribution probability" and "correlation coefficient", and are usually supplied to business decision-makers as statistical reports that serve as the data basis for decisions. Cluster analysis is an unsupervised machine learning algorithm and an exploratory data analysis method; it is generally used to group and classify seemingly unordered objects in order to understand the objects under study better. A clustering result should have high similarity between objects within a group and low similarity between objects in different groups. Statistical analysis is subjective and hard to use for forward-looking analysis, while the results of cluster analysis usually lack quantitative analysis; moreover, for lack of concrete data analysis, clustering results are hard to use directly to guide decisions.

No effective solution to the above problems has yet been proposed.

Summary of the invention

The present invention provides a data processing method and device that improve data exploration efficiency, statistical efficiency, automated batch-processing capability and information readability.

In a first aspect, an embodiment of the present invention provides a data processing method, including:

Obtaining to-be-processed spreadsheet data;

Querying the to-be-processed spreadsheet data for columns that contain non-numeric information, and digitizing the columns found to generate processed spreadsheet data;

Obtaining multiple cluster combinations from the to-be-processed spreadsheet data, where each cluster combination includes at least one cluster field and each cluster field is a column field in the to-be-processed spreadsheet data;

Extracting one cluster combination, querying the corresponding information in the processed spreadsheet data according to the cluster combination, performing cluster analysis on the corresponding information to obtain a specified number of cluster samples, and computing and saving the proportion of the corresponding information that each cluster sample accounts for; computing and saving the proportion that the corresponding information under each cluster field of the cluster combination accounts for in each cluster sample;

Returning to the operation of extracting one cluster combination until all preset cluster combinations have been processed.

In a second aspect, an embodiment of the present invention further provides a data processing device, including:

A to-be-processed spreadsheet data acquisition module, for obtaining to-be-processed spreadsheet data;

A spreadsheet data digitization module, for querying the to-be-processed spreadsheet data for columns that contain non-numeric information and digitizing the columns found to generate processed spreadsheet data;

A cluster combination acquisition module, for obtaining multiple cluster combinations from the to-be-processed spreadsheet data, where each cluster combination includes at least one cluster field and each cluster field is a column field in the to-be-processed spreadsheet data;

A cluster statistical analysis module, for extracting one cluster combination, querying the corresponding information in the processed spreadsheet data according to the cluster combination, performing cluster analysis on the corresponding information to obtain a specified number of cluster samples, and computing and saving the proportion of the corresponding information that each cluster sample accounts for; and for computing and saving the proportion that the corresponding information under each cluster field of the cluster combination accounts for in each cluster sample;

A loop module, for returning to the operation of extracting one cluster combination until all preset cluster combinations have been processed.

By digitizing the non-numeric information in the spreadsheet data, processing multiple cluster combinations in a batch loop during cluster statistical analysis, and performing statistical analysis on the cluster analysis results, the present invention solves three problems: the poor readability of data analysis results, the low efficiency of manually traversing all cluster combinations in cluster analysis, and the lack of quantitative analysis and further statistical analysis of cluster analysis results. It thereby improves data exploration efficiency, statistical efficiency, automated batch-processing capability and information readability.

Brief description of the drawings

Fig. 1a is a flow chart of a data processing method in Embodiment One of the present invention;

Fig. 1b is a schematic diagram of typical original spreadsheet data in Embodiment One of the present invention;

Fig. 1c is a schematic diagram of to-be-processed spreadsheet data in Embodiment One of the present invention;

Fig. 1d is a schematic diagram of cluster combinations in Embodiment One of the present invention;

Fig. 1e is a schematic diagram of a cluster statistical analysis result in Embodiment One of the present invention;

Fig. 1f is a schematic diagram of a cluster statistical analysis result in Embodiment One of the present invention;

Fig. 1g is a schematic diagram of a workbook naming format in Embodiment One of the present invention;

Fig. 1h is a schematic diagram of a second worksheet saved to the first workbook in Embodiment One of the present invention;

Fig. 1i is a schematic diagram of a third worksheet saved to the first workbook in Embodiment One of the present invention;

Fig. 2 is a flow chart of a preferred data processing method in Embodiment Two of the present invention;

Fig. 3 is a structural diagram of a data processing device in Embodiment Three of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

Embodiment one

Fig. 1a is a flow chart of a data processing method provided by Embodiment One of the present invention. This embodiment is applicable to data mining. The method may be performed by a data processing device, which may be implemented in software and/or hardware. The technical solution of this embodiment comprises the following steps:

S110: obtain to-be-processed spreadsheet data.

Here, the to-be-processed spreadsheet data is obtained by preprocessing the original spreadsheet data and is the data basis for the subsequent cluster analysis and statistical analysis. The original spreadsheet data must undergo data preprocessing to form the to-be-processed spreadsheet data. The specific preprocessing steps are as follows:

Step 1: obtain the original spreadsheet data.

By way of example, Fig. 1b shows typical original spreadsheet data. The first row is the header row, which specifies the meaning of each column field; each row below holds the actual raw data of one sample.

Specifically, Fig. 1b records case information. From left to right, the column fields in the header row are crime state, crime locations and regions, lost value, means feature, case state, position and incidence of criminal offenses region. In each row below, the corresponding information is filled in according to the actual situation and the meaning of the respective column field, forming one raw data sample. The N (N ≥ 1) raw data samples formed in this way constitute the original spreadsheet data.

Step 2: delete from the original spreadsheet data the column fields that do not need cluster statistical analysis, together with the information corresponding to those column fields.

By way of example, case characteristics are to be analysed from the original spreadsheet data, so each column field in Fig. 1b is assessed for whether it should undergo cluster statistical analysis. Because case state has little relevance to case characteristics, it is judged not to need the analysis, and the column field and the information under it are deleted from the original spreadsheet data.

Deciding in advance whether each column field undergoes cluster statistical analysis reduces the amount of data processed subsequently and improves data processing efficiency.

Step 3: split the column fields of the original spreadsheet data that contain multiple subfields, generating the to-be-processed spreadsheet data.

By way of example, as shown in Fig. 1c, the case analysis requires the crime time to be the main object of study for specific case characteristic analysis. The crime time column field in Fig. 1b is therefore further split into year, month, day, week, hour and minute fields.

Further splitting column fields that contain rich information makes the resulting cluster statistical analysis more accurate.
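The preprocessing in Steps 2 and 3 can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the function name `preprocess`, the list-of-rows representation and the timestamp format `"%Y-%m-%d %H:%M"` are assumptions, since the source does not state how the crime time is stored.

```python
from datetime import datetime

# Weekday names indexed by datetime.weekday() (Monday == 0); avoids locale-dependent output.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

def preprocess(header, rows, drop_fields, time_field, time_format="%Y-%m-%d %H:%M"):
    """Drop unneeded columns and split one timestamp column into six subfields.

    The timestamp format is an assumption made for this sketch.
    """
    t_idx = header.index(time_field)
    keep = [i for i, name in enumerate(header)
            if name not in drop_fields and name != time_field]
    parts = ["year", "month", "day", "week", "hour", "minute"]
    new_header = [header[i] for i in keep] + [f"{time_field}_{p}" for p in parts]
    new_rows = []
    for row in rows:
        t = datetime.strptime(row[t_idx], time_format)
        split = [t.year, t.month, t.day, WEEKDAYS[t.weekday()], t.hour, t.minute]
        new_rows.append([row[i] for i in keep] + split)
    return new_header, new_rows
```

For instance, calling `preprocess(header, rows, {"case state"}, "crime time")` removes the case state column and replaces the crime time column with the six split columns, which are appended at the end of each row.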

S120: query the to-be-processed spreadsheet data for columns that contain non-numeric information, and digitize the columns found to generate processed spreadsheet data.

Here, non-numeric information includes Chinese character information, null value information and empty string information, and digitization converts non-numeric information into numeric information. The conversion comprises the following steps:

Step 1: group the Chinese character information by field, delete duplicate information within each group, and sort the non-duplicate information.

By way of example, as shown in Fig. 1c, the columns of non-numeric information in the to-be-processed spreadsheet data are first grouped by crime state, means feature, position and crime time_week, and duplicate information within each group is deleted. For example, in the crime time_week field, Tuesday, Thursday and Saturday each appear more than once; the duplicates are deleted and only one entry of each is kept. Then, group by group, all the non-duplicate non-numeric information is sorted; the sorting may be, but is not limited to, in pinyin order of the Chinese characters. For example, sorting the crime time_week field in pinyin order gives Tuesday, Saturday, Sunday, Thursday, Friday, Monday. Likewise, sorting the means feature field in pinyin order gives stealing means, looking for object, preparation means, organizational form.

Step 2: number the non-duplicate information in sorted order, and save the numbers to a second dictionary table variable.

Here, a dictionary is a way of storing data. A dictionary is equivalent to a two-column array whose columns are called keys and items; keys cannot repeat, whereas items can.

By way of example, as shown in Fig. 1c, the column fields crime state, means feature, crime time_week and so on are the keys, and the numbers of the respective columns are the items. Optionally, following the ordering in Step 1: under the crime state field, accomplished offence is numbered 1; under the means feature field, stealing means, looking for object, preparation means and organizational form are numbered 1, 2, 3 and 4 in turn; under the crime time_week field, Tuesday, Saturday, Sunday, Thursday, Friday and Monday are numbered 1, 2, 3, 4, 5 and 6. These numbers are saved to the second dictionary table variable, which records which columns of the to-be-processed spreadsheet data contain Chinese character information and, for those columns, the correspondence between the Chinese character information and the numbers.

Step 3: according to the information in the second dictionary table variable, convert the Chinese character information into the corresponding numbers.

Step 4: convert the null value information and empty string information into a specific number.

By way of example, all null value information and empty string information in the non-numeric information is converted into a fixed number. For example, the null values in the crime state field and the position field in Fig. 1c are converted into the fixed number. Preferably, the fixed number may be set to -1000, -2000 or -10000; no specific limit is imposed, provided it differs from every number already in use. The second dictionary table variable obtained in Step 2 records which columns of the to-be-processed spreadsheet data contain Chinese character information and the correspondence between the Chinese character information and the numbers in those columns. The Chinese character information is therefore converted into the corresponding numbers assigned in Step 2. As shown in Fig. 1c, every occurrence of accomplished offence in the crime state field is converted into 2 according to the numbering in Step 2, and the other Chinese character information is handled in the same way.

Digitizing the non-numeric information in the to-be-processed spreadsheet data avoids extra manual data processing and makes manual handling of the data easier. The non-numeric information is deduplicated group by group, converted directly into the corresponding numbers, and stored in dictionary form. When the results of the subsequent cluster statistical analysis are output, the converted numeric information can be restored from the dictionary to the corresponding Chinese characters or other non-numeric information (such as null values and empty strings), which improves the readability of the final cluster statistical analysis results. In addition, because all data is digitized, the subsequent cluster statistical analysis has the data support it needs.
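Steps 1 to 4 of the digitization can be illustrated with the sketch below. It rests on several assumptions: the names `digitize`, `second_dict` and `NULL_CODE` are hypothetical, rows are plain Python lists, and Python's default `sorted()` stands in for the pinyin ordering mentioned in the text (the standard library has no pinyin collation).

```python
NULL_CODE = -1000  # one of the fixed numbers the text suggests for null / empty-string cells

def _is_null(value):
    return value is None or value == ""

def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def digitize(header, rows):
    """Encode text cells as integers; return (processed rows, second dictionary)."""
    second_dict = {}  # column field -> {text value: number}
    for j, name in enumerate(header):
        # Step 1: collect the distinct text values of the column (a set removes duplicates).
        texts = {row[j] for row in rows
                 if not _is_null(row[j]) and not _is_number(row[j])}
        if texts:
            # Step 2: number the sorted values from 1. Default string order is an
            # assumption standing in for pinyin order.
            second_dict[name] = {t: n for n, t in enumerate(sorted(texts), start=1)}
    processed = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            if _is_null(value):
                out.append(NULL_CODE)                     # Step 4: nulls and empty strings
            elif _is_number(value):
                out.append(value)                         # already numeric
            else:
                out.append(second_dict[header[j]][value])  # Step 3: look up the number
        processed.append(out)
    return processed, second_dict
```

Inverting each inner mapping of `second_dict` gives the lookup needed to restore the original text when results are written out.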

Further, before querying the to-be-processed spreadsheet data for columns containing non-numeric information and digitizing the columns found, the method also includes:

Saving the column fields in the header row of the to-be-processed spreadsheet data, together with the sequence numbers of their columns, to a first dictionary table variable.

By way of example, as shown in Fig. 1c, the column fields crime state, means feature, position and so on are the keys, and the sequence numbers of their columns are the items. The N = 12 column fields crime state, crime locations and regions, lost value, means feature, position, incidence of criminal offenses region, crime time_year, crime time_month, crime time_day, crime time_week, crime time_hour and crime time_minute, together with their column sequence numbers 1, 2, ..., 11, 12, are saved to the first dictionary table variable.

Storing the column field information of the header row in dictionary form serves two purposes. First, it is easy to call later in the program when cluster fields are assembled into cluster combinations; cluster fields are column fields, and specifically the cluster combinations are composed of the sequence numbers of the cluster fields' columns. Second, it is easy to call when the cluster statistical analysis results are output: using the dictionary, the column field information is automatically translated back to the original Chinese column names, so that every column field in the final analysis results carries its original Chinese column name, which improves the readability of the results.
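A minimal sketch of the first dictionary and its two uses, assuming a plain Python dict; the function names `build_first_dict` and `translate_back` are hypothetical and chosen only for illustration.

```python
def build_first_dict(header):
    """Map each header-row column field (key) to its 1-based column sequence number (item)."""
    return {name: idx for idx, name in enumerate(header, start=1)}

def translate_back(first_dict, column_numbers):
    """Translate column sequence numbers back to column field names for output."""
    by_number = {idx: name for name, idx in first_dict.items()}
    return [by_number[n] for n in column_numbers]
```

A cluster combination can then be held internally as a tuple of column numbers such as `(3, 1)` and turned back into readable field names only when results are written.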

S130: obtain multiple cluster combinations from the to-be-processed spreadsheet data, where each cluster combination includes at least one cluster field and each cluster field is a column field in the to-be-processed spreadsheet data.

Obtaining multiple cluster combinations from the to-be-processed spreadsheet data comprises the following steps:

Step 1: obtain the multiple column fields in the first dictionary table variable.

By way of example, the first dictionary table variable contains the N = 12 column fields crime state, crime locations and regions, lost value, means feature, position, incidence of criminal offenses region, crime time_year, crime time_month, crime time_day, crime time_week, crime time_hour and crime time_minute, together with their column sequence numbers 1, 2, ..., 11, 12.

Step 2: combine the multiple column fields in different ways to generate multiple cluster combinations, and save the cluster combinations to a first worksheet.

By way of example, given the N = 12 column fields obtained in Step 1, the program combines the column sequence numbers of those column fields in different ways according to the actual situation. If the number of cluster combinations that can be obtained is T, then 1 ≤ T ≤ 2^N - 1, where 2^N - 1 = C(N,1) + C(N,2) + ... + C(N,N) is the number of non-empty subsets of the N column fields. As shown in Fig. 1d, the T cluster combinations obtained are saved to the first worksheet. For ease of reading, the cluster combinations shown in the table are the column fields automatically translated from the column sequence numbers. The subsequent cluster analysis reads this worksheet and processes all the cluster combinations in a batch loop.

Compared with the prior art, in which cluster combinations are configured manually one at a time and cluster analysis is run on each in turn, this embodiment reads a worksheet generated from multiple cluster combinations and processes all of them in a batch loop, which greatly speeds up data exploration by cluster analysis.
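The upper bound T = 2^N - 1 corresponds to enumerating every non-empty subset of the column sequence numbers. The sketch below shows one way to do that with the standard library; the function name `all_cluster_combinations` is hypothetical, and in practice only a selected subset of these combinations might be kept.

```python
from itertools import combinations

def all_cluster_combinations(column_numbers):
    """Yield every non-empty combination of column numbers (2**N - 1 in total)."""
    for size in range(1, len(column_numbers) + 1):
        yield from combinations(column_numbers, size)
```

With the N = 12 columns of the example, the generator yields 2^12 - 1 = 4095 combinations.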

S140: extract one cluster combination, query the corresponding information in the processed spreadsheet data according to the cluster combination, perform cluster analysis on the corresponding information to obtain a specified number of cluster samples, and compute and save the proportion of the corresponding information that each cluster sample accounts for; compute and save the proportion that the corresponding information under each cluster field of the cluster combination accounts for in each cluster sample.

Here, clustering can be defined as follows. In a data space A, the training sample set X consists of M given training samples, $X = (x_1, x_2, \ldots, x_i, \ldots, x_{M-1}, x_M)$, and each training sample is $x_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{i,N-1}, x_{iN})$, with $i = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, N$, where i indexes the training samples and j indexes the column fields, that is, the cluster fields. The training sample set X is equivalent to an M × N matrix. The goal of clustering is to divide X into K classes on the basis of the similarity between training samples. Similarity is measured by similarity coefficients and distance measures; the distance measures include Euclidean distance, squared Euclidean distance, Manhattan distance, Chebyshev distance, chi-square distance and so on. The smaller the "distance" between samples, the more similar they are; the larger the "correlation coefficient", the more similar they are. Clustering methods include, but are not limited to, K-means, K-medoids, CLARA (Clustering LARge Applications) and FCM.
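For reference, four of the distance measures named above can be written as follows for two samples of equal length. These are the standard textbook definitions, not code taken from the source; the function names are chosen only for this sketch.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))
```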

By way of example, the technical solution of this embodiment uses the K-means++ algorithm to perform cluster analysis on the corresponding information and obtain the specified number of cluster samples, in the following steps:

Step 1: randomly select K items from the corresponding information as cluster centres.

Step 2: assign the remaining items of the corresponding information, by the minimum-distance principle, to the cluster of the nearest cluster centre, obtaining K cluster samples.

Step 3: take the sample mean of each of the K cluster samples as the new cluster centre.

Step 4: return to the operation of assigning the remaining items of the corresponding information, by the minimum-distance principle, to the cluster of the nearest cluster centre to obtain K cluster samples, until the cluster centres no longer change; the current K cluster samples are then taken as the specified cluster samples.
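Steps 1 to 4 can be sketched in plain Python as below. The sketch follows the steps as written, with uniformly random initial centres; K-means++ proper differs in how the initial centres are seeded, and that seeding is not shown here. The function name `kmeans`, the `seed` and `max_iter` parameters and the use of squared Euclidean distance are assumptions of this sketch.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Iterate Steps 1-4 on a list of numeric tuples; return (labels, centres)."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]  # Step 1: random initial centres
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centre (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])))
            for p in points
        ]
        # Step 3: the mean of each cluster becomes its new centre.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[c])  # keep the old centre of an empty cluster
        # Step 4: stop once the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres
```

The `max_iter` cap is a safety limit added for the sketch; the text itself stops only when the centres no longer change.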

By way of example, as shown in Fig. 1e, K = 5 is set and there are M = 2745 training samples. The extracted cluster combination consists of the cluster fields "A = crime time" and "B = crime locations and regions". After processing by the K-means++ algorithm, the partition is: class 1 cluster sample, 525 samples; class 2 cluster sample, 496 samples; class 3 cluster sample, 498 samples; class 4 cluster sample, 472 samples; class 5 cluster sample, 754 samples. The proportion of the training samples that each cluster sample accounts for is then computed, giving 19.12568%, 18.06922%, 18.14208%, 17.1949% and 27.46812% respectively, and these statistics are saved to the second worksheet.

As shown in Fig. 1f, take the class 2 cluster sample, which contains 496 samples. Under cluster field A it contains 11 distinct items of corresponding information, with counts 35, 50, 64, 64, 61, 52, 36, 51, 31, 29 and 23. The proportion of the cluster sample that each item under cluster field A accounts for is then computed, giving 7.056452%, 10.08065%, 12.90323%, 12.90323%, 12.29839%, 10.48387%, 7.258065%, 10.28226%, 6.25%, 5.846774% and 4.637097% respectively. Under cluster field B it contains 2 distinct items of corresponding information, with counts 262 and 234. The proportion of the cluster sample that each item under cluster field B accounts for is computed as 52.82258% and 47.17742% respectively, and these statistics are saved to the third worksheet.
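The two kinds of proportion described above can be reproduced with a short sketch. The function names are hypothetical; `labels` is assumed to hold one cluster label per training sample and `values` the value of one cluster field per training sample.

```python
from collections import Counter

def cluster_shares(labels):
    """Share of all samples falling into each cluster (the second-worksheet figures)."""
    total = len(labels)
    return {c: n / total for c, n in sorted(Counter(labels).items())}

def value_shares_in_cluster(values, labels, cluster):
    """Share of each distinct field value within one cluster (the third-worksheet figures)."""
    inside = [v for v, lab in zip(values, labels) if lab == cluster]
    return {v: n / len(inside) for v, n in Counter(inside).items()}
```

With the counts from the example, 525 of 2745 samples gives 525 / 2745 ≈ 19.12568%, and 262 of the 496 samples in class 2 gives 262 / 496 ≈ 52.82258%.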

For ease of reading, the second and third worksheets show the converted numeric information restored to the corresponding Chinese characters or other non-numeric information (such as null values and empty strings).

In the prior art, cluster analysis results require a further, separate statistical analysis. This embodiment provides a method that obtains the cluster analysis and statistical analysis results at the same time, combining the two and thereby improving data analysis efficiency.

S150: return to the operation of extracting one cluster combination until all preset cluster combinations have been processed.

By digitizing the non-numeric information in the spreadsheet data, processing multiple cluster combinations in a batch loop during cluster statistical analysis, and performing statistical analysis on the cluster analysis results, the technical solution of this embodiment solves three problems: the poor readability of data analysis results, the low efficiency of manually traversing all cluster combinations in cluster analysis, and the lack of quantitative analysis and further statistical analysis of cluster analysis results. It thereby improves data exploration efficiency, statistical efficiency, automated batch-processing capability and information readability.

On the basis of the above technical solution, after the proportion that the corresponding information under each cluster field of the cluster combination accounts for in each cluster sample has been computed and saved to the third worksheet, the method also includes the following step:

Saving the second worksheet and the third worksheet to a first workbook named after the cluster fields.

The naming format of the first workbook is: cluster field 1_cluster field 2_..._cluster field N_number of clusters.

By way of example, as shown in Fig. 1g, the workbook that stores the cluster statistical analysis results is named according to this format, for example crime time_crime locations and regions_5.

With this naming format, no two workbooks produced while all cluster combinations are processed in a batch loop share a name, and the contents of each workbook are clear from its name.
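The naming format amounts to joining the cluster field names and the number of clusters with underscores. A minimal sketch, with a hypothetical function name and without any file extension (the source does not specify one):

```python
def workbook_name(cluster_fields, k):
    """Build 'field 1_field 2_..._field N_K' from the cluster field names and cluster count."""
    return "_".join(list(cluster_fields) + [str(k)])
```

For the example in the text, `workbook_name(["crime time", "crime locations and regions"], 5)` returns `"crime time_crime locations and regions_5"`.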

By way of example, as shown in Fig. 1h, the extracted cluster combination consists of the cluster fields "A = crime time" and "B = crime locations and regions". The worksheet Sheet1 is the second worksheet and stores the proportion of the training samples that each cluster sample accounts for. As shown in Fig. 1i, the worksheet "class 2" is the third worksheet and stores the proportions that the corresponding information under cluster fields A and B accounts for in the class 2 cluster sample. The worksheets Sheet1 and "class 2" are saved to the workbook named "crime time_crime locations and regions_5".

The cluster statistical analysis workbooks generated from the multiple cluster combinations, as shown in Fig. 1g, are produced in the specified save directory. The whole process is automatic batch processing, so the user can quickly and conveniently review the contents of all the cluster statistical analysis workbooks at once, which greatly improves data analysis efficiency.

In addition, because both input and output are spreadsheet files, compatibility and readability are both very good.

Embodiment two

Fig. 2 is a flow chart of a preferred data processing method provided by Embodiment Two of the present invention. This embodiment is applicable to data mining. The method may be performed by a data processing device, which may be implemented in software and/or hardware. The technical solution of this embodiment comprises the following steps:

S210: obtain original spreadsheet data and preprocess it to generate to-be-processed spreadsheet data.

Preprocessing the raw data decides in advance whether each column field undergoes cluster statistical analysis, which reduces the amount of data processed subsequently and improves data processing efficiency. Further splitting column fields that contain rich information makes the resulting cluster statistical analysis more accurate.

S220: save the column fields in the header row of the to-be-processed spreadsheet data, together with the sequence numbers of their columns, to a first dictionary table variable.

S230: query the to-be-processed spreadsheet data for columns that contain non-numeric information, replace null value information and empty string information with -1000, and convert Chinese character information into numbers according to the second dictionary table variable.

S240: obtain the multiple column fields in the first dictionary table variable, combine them in different ways to generate multiple cluster combinations, and save the cluster combinations to a first worksheet.

For ease of reading, the cluster combinations shown in the table are the column fields automatically translated from the column sequence numbers. The subsequent cluster analysis reads this worksheet and processes all the cluster combinations in a batch loop.

S250: extract one cluster combination, query the corresponding information in the processed spreadsheet data according to the cluster combination, and perform cluster analysis on the corresponding information to obtain a specified number of cluster samples.

S260: compute the proportion of the corresponding information that each cluster sample accounts for, and save it to a second worksheet.

S270: compute the proportion that the corresponding information under each cluster field of the cluster combination accounts for in each cluster sample, and save it to a third worksheet.

S280: save the second worksheet and the third worksheet to a first workbook named after the cluster fields.

With this naming format, no two workbooks produced while all cluster combinations are processed in a batch loop share a name, and the contents of each workbook are clear from its name. The whole process is automatic batch processing, so the user can quickly and conveniently review the contents of all the cluster statistical analysis workbooks at once, which greatly improves data analysis efficiency. In addition, because both input and output are spreadsheet files, compatibility and readability are both very good.

S290: return to the operation of extracting one cluster combination until all preset cluster combinations have been processed.

Embodiment three

Fig. 3 is a schematic structural diagram of a data processing device provided by Embodiment Three of the present invention. The device is structured as follows:

A pending spreadsheet data acquisition module 310, configured to obtain pending spreadsheet data.

Exemplarily, the pending spreadsheet data acquisition module 310 is specifically configured to:

Obtain original spreadsheet data.

Delete from the original spreadsheet data the column fields that do not require cluster statistical analysis, together with the information corresponding to those column fields.

Split the column fields in the original spreadsheet data that contain multiple subfields, generating the pending spreadsheet data.

Deciding in advance whether each column field should undergo cluster statistical analysis reduces the volume of subsequent data processing and improves data processing efficiency. Further splitting the column fields that carry rich information makes the resulting cluster statistical analysis more accurate.
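The preprocessing performed by module 310 can be sketched as below. The sketch assumes each spreadsheet row is held as a dictionary keyed by column field, and that a composite field is split on a fixed separator; the separator, the field names, and the `preprocess` helper are illustrative assumptions rather than details given in this document.

```python
def preprocess(rows, drop_fields, split_fields):
    """Drop unneeded column fields and split composite ones.

    rows:         list of dicts, one per spreadsheet row
    drop_fields:  set of column fields excluded from cluster statistical analysis
    split_fields: {field: (separator, [subfield names])} for composite fields
    """
    processed = []
    for row in rows:
        new_row = {}
        for field, value in row.items():
            if field in drop_fields:
                continue
            if field in split_fields:
                separator, subfields = split_fields[field]
                parts = str(value).split(separator)
                # Pad with empty strings when a row has fewer parts than subfields.
                parts += [""] * (len(subfields) - len(parts))
                for name, part in zip(subfields, parts):
                    new_row[name] = part
            else:
                new_row[field] = value
        processed.append(new_row)
    return processed


# Hypothetical original spreadsheet data.
raw = [
    {"id": 1, "address": "Beijing-Haidian", "score": 80},
    {"id": 2, "address": "Shanghai", "score": 65},
]
pending = preprocess(
    raw,
    drop_fields={"id"},
    split_fields={"address": ("-", ["city", "district"])},
)
```

Missing subfields are left as empty strings here so that the later digitization step can treat them like any other empty string.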

A spreadsheet data digitization module 320, configured to query the columns in the pending spreadsheet data that contain non-numeric information and digitize the queried columns, generating processed spreadsheet data.

The non-numeric information includes Chinese character information, null values, and empty strings.

Exemplarily, the spreadsheet data digitization module 320 is specifically configured to:

Group the Chinese character information by field, delete duplicate information within each group, and sort the non-duplicate information.

Number the non-duplicate information according to the sort order, and save the numeric codes to a second dictionary table variable.

Convert the Chinese character information into the corresponding numeric codes according to the information in the second dictionary table variable.

Convert the null values and empty strings into a specific number.
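The digitization steps above can be sketched as follows. The sketch makes several assumptions that the text does not fix: the second dictionary table variable is a nested mapping from column field to text to code, codes start at 1, sorting uses Python's default string order, and the specific number for null values and empty strings is the -1000 given in step S230. The `decode` helper illustrates the reverse lookup used when results are written out; all names are hypothetical.

```python
MISSING_CODE = -1000  # value used in step S230 for null values and empty strings


def build_second_dictionary(rows, fields):
    """Per field: deduplicate the text values, sort them, and number them from 1."""
    second_dictionary = {}
    for field in fields:
        distinct = sorted(
            {row.get(field) for row in rows
             if isinstance(row.get(field), str) and row.get(field) != ""}
        )
        second_dictionary[field] = {
            text: number for number, text in enumerate(distinct, start=1)
        }
    return second_dictionary


def digitize(rows, second_dictionary):
    """Replace text with its code and missing values with MISSING_CODE."""
    result = []
    for row in rows:
        new_row = {}
        for field, value in row.items():
            if value is None or value == "":
                new_row[field] = MISSING_CODE
            elif field in second_dictionary and value in second_dictionary[field]:
                new_row[field] = second_dictionary[field][value]
            else:
                new_row[field] = value  # already numeric
        result.append(new_row)
    return result


def decode(field, number, second_dictionary):
    """Map a code back to the original text when writing results."""
    if number == MISSING_CODE:
        return None
    reverse = {n: text for text, n in second_dictionary.get(field, {}).items()}
    return reverse.get(number, number)


# Hypothetical pending spreadsheet data.
rows = [
    {"city": "北京", "level": "高", "score": 90},
    {"city": "上海", "level": "", "score": None},
    {"city": "北京", "level": "低", "score": 70},
]
second_dictionary = build_second_dictionary(rows, ["city", "level"])
digitized = digitize(rows, second_dictionary)
```

A sentinel far outside the range of the assigned codes keeps missing values from being confused with a real category, although it still takes part in distance calculations during clustering.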

Digitizing the non-numeric information in the pending spreadsheet data removes the need for additional manual data processing and lowers the difficulty of handling the data by hand. The non-numeric information is deduplicated by group, converted directly into the corresponding numeric codes, and stored in the corresponding dictionary. When the results of the subsequent cluster statistical analysis are output, the dictionary allows these numeric codes to be restored to the corresponding Chinese characters or other non-numeric information (such as null values and empty strings), which improves the readability of the final cluster statistical analysis results. In addition, because all of the data is digitized, the subsequent cluster statistical analysis has the data support it needs.

The device further includes a first dictionary table generation module 300 which, before the spreadsheet data digitization module queries the columns in the pending spreadsheet data that contain non-numeric information and digitizes the queried columns, is specifically configured to: save the column fields in the header row of the pending spreadsheet data, together with the sequence numbers of their corresponding columns, to a first dictionary table variable.

A clustering combination acquisition module 330, configured to obtain multiple clustering combinations according to the pending spreadsheet data, wherein each clustering combination includes at least one clustering field, and each clustering field is a column field in the pending spreadsheet data.

Exemplarily, the clustering combination acquisition module 330 is specifically configured to:

Obtain the multiple column fields in the first dictionary table variable.

Combine the multiple column fields in different ways to generate multiple clustering combinations, and save the clustering combinations to a first worksheet.

A cluster statistical analysis module 340, configured to: extract one clustering combination; query the corresponding information in the processed spreadsheet data according to the clustering combination; perform cluster analysis on the corresponding information to obtain the specified multiple cluster samples; compute and save the proportion of the corresponding information that each cluster sample accounts for; and, for the corresponding information under each clustering field in the clustering combination, compute and save its proportion within each cluster sample.

Exemplarily, the cluster statistical analysis module 340 is specifically configured to:

Randomly select K items from the corresponding information as cluster centers.

Assign the remaining items of the corresponding information to the cluster of the nearest cluster center according to the minimum-distance principle, obtaining K cluster samples.

Take the sample mean of each of the K cluster samples as the new cluster center.

Return to the operation of assigning the remaining items of the corresponding information to the cluster of the nearest cluster center according to the minimum-distance principle and obtaining K cluster samples, until the cluster centers no longer change, and take the current K cluster samples as the specified multiple cluster samples.
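The iterative procedure above is the standard K-means scheme, and a minimal pure-Python sketch is given here for illustration. It assumes squared Euclidean distance as the distance measure, a seeded random generator for reproducibility, an iteration cap as a safety net, and that an empty cluster keeps its previous center; none of these details is specified in the text, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iterations=100):
    """Cluster `points` (a list of numeric tuples) into k cluster samples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly select K items as cluster centers
    labels = []
    for _ in range(max_iterations):  # iteration cap is an added safeguard
        clusters = [[] for _ in range(k)]
        labels = []
        for point in points:
            # Minimum-distance principle: assign to the nearest center.
            nearest = min(range(k), key=lambda i: squared_distance(point, centers[i]))
            clusters[nearest].append(point)
            labels.append(nearest)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                # Sample mean of the cluster becomes the new center.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep old center (assumption)
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return labels, centers


# Hypothetical digitized records with two clustering fields.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
```

The returned labels give the cluster sample of each record, which is the input needed for the proportion statistics saved to the second and third worksheets.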

Exemplarily, the cluster statistical analysis module 340 is further configured to:

Save the computed proportion of the corresponding information that each cluster sample accounts for to a second worksheet; and save the computed proportion, within each cluster sample, of the corresponding information under each clustering field in the clustering combination to a third worksheet.

A loop module 350, configured to return to the operation of extracting one clustering combination until all preset clustering combinations have been processed.

On the basis of the above technical solution, the cluster statistical analysis module 340 is further configured to:

After the proportion, within each cluster sample, of the corresponding information under each clustering field in the clustering combination has been computed and saved to the third worksheet, save the second worksheet and the third worksheet into a first workbook named after the clustering fields.

Because workbooks are named in this form, no two workbooks share a name while all clustering combinations are batch-processed in a loop, and the contents of each workbook can be understood from its name. Cluster statistical analysis workbooks generated from the multiple clustering combinations are produced under the specified save directory. The whole process runs as an automatic batch, so the user can quickly and conveniently review the contents of all cluster statistical analysis workbooks at once, which greatly improves data analysis efficiency.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the appended claims.

Claims

1. A data processing method, characterized by comprising:

obtaining pending spreadsheet data;

querying the columns in the pending spreadsheet data that contain non-numeric information, and digitizing the queried columns to generate processed spreadsheet data;

obtaining multiple clustering combinations according to the pending spreadsheet data, wherein each clustering combination includes at least one clustering field, and each clustering field is a column field in the pending spreadsheet data;

extracting one clustering combination, querying the corresponding information in the processed spreadsheet data according to the clustering combination, performing cluster analysis on the corresponding information to obtain the specified multiple cluster samples, and computing and saving the proportion of the corresponding information that each cluster sample accounts for; computing and saving, for the corresponding information under each clustering field in the clustering combination, its proportion within each cluster sample;

returning to the operation of extracting one clustering combination until all preset clustering combinations have been processed.
2. The method according to claim 1, characterized in that obtaining pending spreadsheet data comprises:

obtaining original spreadsheet data;

deleting from the original spreadsheet data the column fields that do not require cluster statistical analysis, together with the information corresponding to those column fields;

splitting the column fields in the original spreadsheet data that contain multiple subfields, generating the pending spreadsheet data.
3. The method according to claim 1, characterized in that, before querying the columns in the pending spreadsheet data that contain non-numeric information and digitizing the queried columns, the method further comprises:

saving the column fields in the header row of the pending spreadsheet data, together with the sequence numbers of their corresponding columns, to a first dictionary table variable;

obtaining multiple clustering combinations according to the pending spreadsheet data comprises:

obtaining the multiple column fields in the first dictionary table variable;

combining the multiple column fields in different ways to generate multiple clustering combinations, and saving the clustering combinations to a first worksheet.
4. The method according to claim 1, characterized in that the non-numeric information includes Chinese character information, null values, and empty strings;

querying the columns in the pending spreadsheet data that contain non-numeric information, and digitizing the queried columns to generate processed spreadsheet data, comprises:

grouping the Chinese character information by field, deleting duplicate information within each group, and sorting the non-duplicate information;

numbering the non-duplicate information according to the sort order, and saving the numeric codes to a second dictionary table variable;

converting the Chinese character information into the corresponding numeric codes according to the information in the second dictionary table variable;

converting the null values and empty strings into a specific number.
5. The method according to claim 1, characterized in that performing cluster analysis on the corresponding information to obtain the specified multiple cluster samples comprises:

randomly selecting K items from the corresponding information as cluster centers;

assigning the remaining items of the corresponding information to the cluster of the nearest cluster center according to the minimum-distance principle, obtaining K cluster samples;

taking the sample mean of each of the K cluster samples as the new cluster center;

returning to the operation of assigning the remaining items of the corresponding information to the cluster of the nearest cluster center according to the minimum-distance principle and obtaining K cluster samples, until the cluster centers no longer change, and taking the current K cluster samples as the specified multiple cluster samples.
6. The method according to claim 1, characterized in that the computed proportion of the corresponding information that each cluster sample accounts for is saved to a second worksheet, and the computed proportion, within each cluster sample, of the corresponding information under each clustering field in the clustering combination is saved to a third worksheet;

after the proportion, within each cluster sample, of the corresponding information under each clustering field in the clustering combination has been computed and saved to the third worksheet, the method further comprises:

saving the second worksheet and the third worksheet into a first workbook named after the clustering fields.
7. A data processing device, characterized by comprising:

a pending spreadsheet data acquisition module, configured to obtain pending spreadsheet data;

a spreadsheet data digitization module, configured to query the columns in the pending spreadsheet data that contain non-numeric information and digitize the queried columns, generating processed spreadsheet data;

a clustering combination acquisition module, configured to obtain multiple clustering combinations according to the pending spreadsheet data, wherein each clustering combination includes at least one clustering field, and each clustering field is a column field in the pending spreadsheet data;

a cluster statistical analysis module, configured to extract one clustering combination, query the corresponding information in the processed spreadsheet data according to the clustering combination, perform cluster analysis on the corresponding information to obtain the specified multiple cluster samples, and compute and save the proportion of the corresponding information that each cluster sample accounts for; and to compute and save, for the corresponding information under each clustering field in the clustering combination, its proportion within each cluster sample;

a loop module, configured to return to the operation of extracting one clustering combination until all preset clustering combinations have been processed.
8. The device according to claim 7, characterized in that the pending spreadsheet data acquisition module is configured to:

obtain original spreadsheet data;

delete from the original spreadsheet data the column fields that do not require cluster statistical analysis, together with the information corresponding to those column fields;

split the column fields in the original spreadsheet data that contain multiple subfields, generating the pending spreadsheet data.
9. The device according to claim 7, characterized by further comprising a first dictionary table variable generating module, configured to, before the spreadsheet data digitization module queries the columns in the pending spreadsheet data that contain non-numeric information and digitizes the queried columns,

save the column fields in the header row of the pending spreadsheet data, together with the sequence numbers of their corresponding columns, to a first dictionary table variable;

the clustering combination acquisition module is configured to:

obtain the multiple column fields in the first dictionary table variable;

combine the multiple column fields in different ways to generate multiple clustering combinations, and save the clustering combinations to a first worksheet.
10. The device according to claim 7, characterized in that the non-numeric information includes Chinese character information, null values, and empty strings;

the spreadsheet data digitization module is configured to:

group the Chinese character information by field, delete duplicate information within each group, and sort the non-duplicate information;

number the non-duplicate information according to the sort order, and save the numeric codes to a second dictionary table variable;

convert the Chinese character information into the corresponding numeric codes according to the information in the second dictionary table variable;

convert the null values and empty strings into a specific number.
11. The device according to claim 7, characterized in that the cluster statistical analysis module is configured to:

randomly select K items from the corresponding information as cluster centers;

assign the remaining items of the corresponding information to the cluster of the nearest cluster center according to the minimum-distance principle, obtaining K cluster samples;

take the sample mean of each of the K cluster samples as the new cluster center;

return to the operation of assigning the remaining items of the corresponding information to the cluster of the nearest cluster center according to the minimum-distance principle and obtaining K cluster samples, until the cluster centers no longer change, and take the current K cluster samples as the specified multiple cluster samples.
12. The device according to claim 7, characterized in that the cluster statistical analysis module is configured to:

save the computed proportion of the corresponding information that each cluster sample accounts for to a second worksheet; and save the computed proportion, within each cluster sample, of the corresponding information under each clustering field in the clustering combination to a third worksheet;

the cluster statistical analysis module is further configured to: after the proportion, within each cluster sample, of the corresponding information under each clustering field in the clustering combination has been computed and saved to the third worksheet, save the second worksheet and the third worksheet into a first workbook named after the clustering fields.