CN107273493A - A data optimization and rapid sampling method in a big data environment - Google Patents
A data optimization and rapid sampling method in a big data environment
- Publication number
- CN107273493A CN107273493A CN201710452151.4A CN201710452151A CN107273493A CN 107273493 A CN107273493 A CN 107273493A CN 201710452151 A CN201710452151 A CN 201710452151A CN 107273493 A CN107273493 A CN 107273493A
- Authority
- CN
- China
- Prior art keywords
- data
- sub
- data block
- curve
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Abstract
The present invention relates to a data optimization and rapid sampling method in a big data environment, comprising: (1) deploying a large data set in a cloud environment; (2) dividing the large data set into several sub data sets according to numerical attribute and filtering out the sub data sets in numeric form; (3) selecting a sub data set that needs to be analyzed, judging whether its data distribution is close to a normal distribution or a Poisson distribution, and then applying the normal sampling algorithm or the Poisson sampling algorithm proposed by the present invention to rapidly extract data blocks from the sub data set, several of which are sampled for analysis. A sample data block rapidly extracted by the normal partitioning algorithm or the Poisson partitioning algorithm inherits attributes of the sub data set such as the mean and variance, which ensures high consistency between the extracted blocks and the sub data set and makes the blocks representative, so that only the sampled data blocks need to be analyzed. This sampling approach greatly shortens the data analysis time and improves data analysis efficiency.
Description
Technical field
The present invention relates to the field of big data analysis, and more particularly to a data optimization and rapid sampling method in a big data environment.
Background technology
The surge of medical and e-commerce applications has generated huge volumes of data and ushered in the era of "big data". Unlike traditional large data sets, the term "big data" refers not only to the size of the data but also to the speed at which it is generated. Current data mining and analysis technology therefore faces the challenge of processing high-volume data in a short time, and the arrival of the big data era has pushed researchers to look for optimized solutions for massive data, particularly for online business and medical data. In the prior art, a common way of handling the volume of a large data set is matrix decomposition: the data columns are reduced (dimensionality reduction), the large data set is then split using an enhanced UV decomposition method, and the blocks obtained after separation are analyzed. For a very large data set, however, matrix decomposition cannot reduce the volume of the data and does not improve processing efficiency. A review of a large number of domestic and foreign publications, practical applications, and patent documents found no existing technology or application with the same principle as that proposed by the present invention.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data optimization and rapid sampling method in a big data environment that can reduce the data volume and improve data processing efficiency.
The technical solution adopted by the present invention is a data optimization and rapid sampling method in a big data environment, comprising the following steps:
(1) Data preprocessing: deploy the large data set in a cloud environment and divide it into several columns of sub data sets according to numerical attribute, i.e., data possessing the same numerical attribute are placed in the same column sub data set; the sub data sets take two forms, numeric sub data sets and textual sub data sets, and the numeric sub data sets are filtered out of the large data set;
(2) Select a sub data set that needs to be analyzed from the filtered sub data sets, create a storage path for it under the local system, save the sub data set, and fit a curve to it;
(3) Judge the curve obtained by fitting the sub data set: if the shape of the curve approximates a normal distribution curve, perform step (4); if it approximates a Poisson distribution curve, perform step (5);
(4) Set the number of data blocks into which the sub data set needs to be split, split the sub data set into several data blocks using the normal-distribution splitting method, sample several of the blocks, and perform step (9);
(5) Shift the fitted curve approximating the Poisson distribution upward along the ordinate by K units to obtain curve A, then shift the original curve downward along the ordinate by K units to obtain curve B; the region between curve A and curve B forms a standard region;
(6) Set the number of data blocks into which the sub data set needs to be split and, according to this number, split the data in the sub data set evenly, i.e., the amount of data contained in each resulting block equals the total amount of data in the sub data set divided by the number of blocks;
(7) Extract a data block E from the blocks produced by the split and fit a curve to it; check whether the fitted curve falls inside the standard region formed between curve A and curve B: if it does, perform step (9); if it does not, perform step (8);
(8) Select an arbitrary sample datum located in one of the blocks other than block E; if adding this sample datum to block E makes the curve fitted to block E lie inside the standard region, include the data point in block E; if it does not, continue selecting sample data and adding them to block E until the fitted curve of block E lies inside the standard region, then perform step (9);
(9) Perform data analysis on the resulting data blocks.
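Steps (1)-(9) can be sketched minimally in Python on synthetic data. The patent does not specify how the curve-fitting judgment of step (3) is made, so a simple mean-versus-variance heuristic stands in for it (for a Poisson sample the two nearly coincide); all names and thresholds here are illustrative assumptions, not the patent's own implementation.

```python
import random
import statistics

def classify_distribution(data):
    # Crude stand-in for the curve-fitting judgment in step (3): a
    # Poisson-like sample has mean close to variance, while a clearly
    # different variance suggests another shape. The 0.25 threshold is
    # an assumption; the patent's fitting test is not specified.
    mu = statistics.mean(data)
    var = statistics.pvariance(data)
    return "poisson" if abs(var - mu) / max(mu, 1e-9) < 0.25 else "normal"

def split_evenly(data, m):
    # Step (6): each block holds (total amount of data) / (number of blocks).
    size = len(data) // m
    return [data[i * size:(i + 1) * size] for i in range(m)]

random.seed(0)
# Synthetic numeric sub data set: Poisson-like counts with mean about 10.
counts = [sum(random.random() < 0.01 for _ in range(1000)) for _ in range(5000)]

shape = classify_distribution(counts)   # step (3)
blocks = split_evenly(counts, 5)        # step (6)
sample_block = blocks[2]                # step (7): draw one block

# Step (9): the sampled block inherits the mean of the whole sub data set.
print(shape, round(statistics.mean(counts), 2), round(statistics.mean(sample_block), 2))
```

The printed means illustrate the consistency claim: analyzing only the sampled block gives nearly the same summary statistics as the full sub data set.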
The beneficial effects of the invention are as follows: a sub data set possessing the attribute that needs to be analyzed is filtered out of the massive data, the sub data set is then optimized, and finally a few small data blocks are chosen for data analysis. This greatly reduces the amount of data and shortens the data analysis time. If the fitted curve of the sub data set approximates a normal distribution, then the data blocks produced by splitting the sub data set with the normal-distribution algorithm also conform to a normal distribution, so analyzing only blocks sampled from the split is enough to derive the information contained in the parent sub data set and the massive data, improving both the efficiency and the accuracy of data analysis. If the fitted curve of the sub data set approximates a Poisson distribution, then after the sub data set is split evenly, a block extracted from the split conforms to the Poisson distribution once optimized, so again analyzing blocks sampled from the split suffices to derive the information contained in the parent sub data set and the massive data, improving data analysis efficiency.
Preferably, the large data set is denoted A[1, k] and a numeric sub data set within it is denoted S[a, b]. Assume some datum x ∈ [a, b] in the sub data set satisfies f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard deviation; then the sub data set satisfies S[a, b] ~ N(μ, σ²), i.e., it can be concluded that the sub data set follows a normal distribution.
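As a quick numerical check of the density above, the μ and σ values here are illustrative (close to the data set 1 experiment later in the description), and the code only verifies that f behaves as a probability density, integrating to 1.

```python
import math

def normal_pdf(x, mu, sigma):
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)**2 / (2 * sigma**2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 486.5, 212.3   # illustrative values, close to data set 1 below

# Midpoint-rule integral of f over mu +/- 6 sigma; should be close to 1.
a, b, n = mu - 6 * sigma, mu + 6 * sigma, 10_000
dx = (b - a) / n
total = sum(normal_pdf(a + (i + 0.5) * dx, mu, sigma) * dx for i in range(n))
print(round(total, 4))
```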
Preferably, the position of the sub data set is denoted S(pl, pr), where pl and pr represent its positions, and |c − p| is less than Δ, where p denotes the location of the greatest value and Δ denotes the maximum range by which the pointer may adjust either side of a data block set during partitioning. A(p) denotes the largest possible value in the sub data set S[c − Δ, c + Δ], with c = (pl + pr)/2 = pl + B/2 and B = k/m, where c represents the middle position of the sub data set and m represents the number of blocks in the split. Since A(p) ∈ S, assuming S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2, and |c − p| ≤ Δ, it can be concluded that S[a, b] is a sub data set comprising a series of data blocks that each follow a normal distribution, where m_r is a dynamically adjusted value of the parameter m: after m is initialized, its value changes as Δ adjusts the partition, and the remaining region during splitting is adjusted by m_r, with x_i denoting the position of the frequency maximum of the interval selected in the dynamically adjusted data set of the adjacent interval on the left, minus the position of the adjacent interval above it on the left.
Preferably, the normal-distribution splitting method in step (4) is:
First, set some specific initialization values: let the overall sub data set be S[1, n]; let Δ denote the maximum range by which the pointer may adjust either side of the data block set being split; let m denote the number of blocks in the split, so that B = n/m is the length of each block; let LP denote the maximum of the points that are ineligible as cut points; let pl be the leftmost position of the data block set currently being split and pr its rightmost position, i.e., pr = B; let c be the center of the current data block set, i.e., c = B/2; let cl = c − Δ and cr = c + Δ, i.e., cl is the block center moved left by Δ and cr is the block center moved right by Δ; and set p = MaxPoint(S, cl, cr, LP), where p denotes the position of the point closest to the cut point in the block currently being split and LP denotes the maximum of the points closest to the cut point in that block.
Second, begin the split, starting from the default first data block. While pr satisfies pr ≤ n, if |p − c| > Δ, then set LP = [LP, p] and p = MaxPoint(S, cl, cr, LP), i.e., determine the point in the block closest to the cut point, then adjust cl and cr around this position to locate the cut point. Once the block center has been moved left or right and the cut point determined, split the next block; the parameter values of the block to be split are those obtained after adjusting the previously split block, and this continues until the m-th block has been split.
Splitting with the normal-distribution algorithm yields data blocks whose data still conform to a normal distribution, supports quick retrieval of the split blocks, and improves the inclusiveness of large data set analysis.
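The two-part procedure above leaves MaxPoint and the pointer bookkeeping underspecified, so the following sketch is an approximation under stated assumptions: nominal cuts are placed every n/m positions, and each cut may shift by at most Δ positions toward the largest nearby value (a plain argmax over the window stands in for MaxPoint, and the LP exclusion list is omitted).

```python
import random

def segment_with_tolerance(data, m, delta):
    # Nominal cut every B = n // m positions; each cut may move at most
    # `delta` positions, landing on the largest value in its window
    # (a simplified stand-in for the patent's MaxPoint search).
    n = len(data)
    B = n // m
    cuts = [0]
    for i in range(1, m):
        c = i * B                                   # nominal cut position
        lo = max(cuts[-1] + 1, c - delta)
        hi = min(n - 1, c + delta)
        p = max(range(lo, hi + 1), key=lambda j: data[j])
        cuts.append(p)
    cuts.append(n)
    return [data[cuts[i]:cuts[i + 1]] for i in range(m)]

random.seed(1)
values = sorted(random.gauss(0.0, 1.0) for _ in range(1500))
blocks = segment_with_tolerance(values, 5, delta=20)
print([len(b) for b in blocks])   # every block stays within delta of n/m items
```

Because each cut can drift by at most Δ, block sizes stay close to the nominal B = n/m, which is what keeps the blocks comparable to the parent set.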
Preferably, the number of numeric sub data sets is at least one column, because only by adding more numeric data can the precision of big data analysis be improved.
Brief description of the drawings
Fig. 1 is the parameter comparison for the normal distribution of data set 1 in an embodiment of the present invention;
Fig. 2 is the parameter comparison for the normal distribution of data set 2 in an embodiment of the present invention;
Embodiment
The invention is further described below with reference to the drawings and embodiments, so that those skilled in the art can implement it according to the description; the scope of the present invention is not limited to the embodiments.
The present invention relates to a data optimization method in a big data environment, comprising the following steps:
(1) Data preprocessing: deploy the large data set in a cloud environment and divide it into several columns of sub data sets according to numerical attribute, i.e., data possessing the same numerical attribute are placed in the same column sub data set; the sub data sets take two forms, numeric sub data sets and textual sub data sets, and the numeric sub data sets are filtered out of the large data set;
(2) Select a sub data set that needs to be analyzed from the filtered sub data sets, create a storage path for it under the local system, save the sub data set, and fit a curve to it;
(3) Judge the curve obtained by fitting the sub data set: if the shape of the curve is closer to a normal distribution curve, perform step (4); if it is closer to a Poisson distribution curve, perform step (5);
(4) Set the number of data blocks into which the sub data set needs to be split, split the sub data set into several data blocks using the normal-distribution splitting method, sample several of the blocks, and perform step (9);
(5) Shift the fitted curve approximating the Poisson distribution upward along the ordinate by K units to obtain curve A, then shift the original curve downward along the ordinate by K units to obtain curve B; the region between curve A and curve B forms a standard region;
(6) Set the number of data blocks into which the sub data set needs to be split and, according to this number, split the data in the sub data set evenly, i.e., the amount of data contained in each resulting block equals the total amount of data in the sub data set divided by the number of blocks;
(7) Extract a data block E from the blocks produced by the split and fit a curve to it; check whether the fitted curve falls inside the standard region between curve A and curve B: if it does, perform step (9); if it does not, perform step (8);
(8) Select an arbitrary sample datum located in one of the blocks other than block E; if adding this sample datum to block E makes the curve fitted to block E lie inside the standard region, the data point is exactly the point needed; if it does not, continue selecting sample data and adding them to block E until the fitted curve of block E lies inside the standard region, then perform step (9);
(9) Perform data analysis on the resulting data blocks.
The large data set is deployed in a cloud environment. The distributed cloud computing environment allows data to be analyzed and processed under the local system: sub data sets can be distributed across different locations of any cloud architecture, and a single cloud node then processes the sub data sets under its local system. In addition, sub data set splitting is suited to parallel processing in the cloud environment, increasing the flexibility of memory use.
The numerical attribute of each sub data set can be age, duration of illness, birth weight, body mass index of a population, the population size of breast cancer patients, home address, date, sex, and so on. Categories such as age, date, sex, and home address belong to textual sub data sets, while digital categories such as duration of illness, birth weight, body mass index of a population, and the population size of breast cancer patients belong to numeric sub data sets. The latter need to be filtered out of the large data set before analysis and processing; a prerequisite for applying the data optimization method of the present invention is that the large data set contains at least one column of numeric sub data set. A curve is fitted to the filtered sub data set to see whether its shape is closer to a normal distribution or a Poisson distribution. If it is close to a normal distribution, the sub data set is split automatically using the normal distribution so as to create small, representative data blocks; the resulting blocks also conform to a normal distribution, and their mean and variance approximate those of the parent sub data set, so the information contained in the parent sub data set can be derived directly by analyzing some of the blocks. If it is close to a Poisson distribution, the sub data set is split evenly, a data block is extracted from the split, and after optimization the block also conforms to the Poisson distribution, i.e., its mean and variance closely approximate those of the sub data set, so the information contained in the parent sub data set can likewise be derived directly by analyzing that block.
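The standard-region test of steps (5)-(8) can be sketched as follows. Since the patent does not fix how the curves are fitted or how K is chosen, the sketch uses relative value frequencies as the "curve" and an assumed K, checking whether a block's curve lies between the parent curve shifted up by K (curve A) and down by K (curve B).

```python
import random
from collections import Counter

def freq_curve(data):
    # Relative frequency of each value: a discrete stand-in for the
    # fitted curve of a block or sub data set.
    n = len(data)
    return {value: c / n for value, c in Counter(data).items()}

def within_band(block, reference, K):
    # Curve A = reference curve + K, curve B = reference curve - K;
    # the block passes if its curve stays between them everywhere.
    ref = freq_curve(reference)
    blk = freq_curve(block)
    keys = set(ref) | set(blk)
    return all(abs(blk.get(k, 0.0) - ref.get(k, 0.0)) <= K for k in keys)

random.seed(2)
# Poisson-like sub data set (mean about 5) and one of five equal blocks.
population = [sum(random.random() < 0.005 for _ in range(1000)) for _ in range(6000)]
block_E = population[:1200]

print(within_band(block_E, population, K=0.08))    # a fair block stays inside the band
print(within_band([0] * 1200, population, K=0.08)) # a degenerate block does not
```

A block that fails the test would, per step (8), absorb sample data from the other blocks until its curve re-enters the band.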
Preferably, the large data set is denoted A[1, k] and a numeric sub data set within it is denoted S[a, b]. Assume some datum x ∈ [a, b] in the sub data set satisfies f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard deviation; then the sub data set satisfies S[a, b] ~ N(μ, σ²), i.e., it can be concluded that the sub data set follows a normal distribution.
Preferably, the position of the sub data set is denoted S(pl, pr), where pl and pr represent its positions, and |c − p| is less than Δ, where p denotes the location of the greatest value and Δ denotes the maximum range by which the pointer may adjust either side of a data block set during partitioning. A(p) denotes the largest possible value in the sub data set S[c − Δ, c + Δ], with c = (pl + pr)/2 = pl + B/2 and B = k/m, where c represents the middle position of the sub data set and m represents the number of blocks in the split. Since A(p) ∈ S, assuming S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2, and |c − p| ≤ Δ, it can be concluded that S[a, b] is a sub data set comprising a series of data blocks that each follow a normal distribution, where m_r is a dynamically adjusted value of the parameter m: after m is initialized, its value changes as Δ adjusts the partition, and the remaining region during splitting is adjusted by m_r, with x_i denoting the position of the frequency maximum of the interval selected in the dynamically adjusted data set of the adjacent interval on the left, minus the position of the adjacent interval above it on the left.
Preferably, the normal-distribution splitting method in step (4) is:
First, set some specific initialization values: let the overall sub data set be S[1, n]; let Δ denote the maximum range by which the pointer may adjust either side of the data block set being split; let m denote the number of blocks in the split, so that B = n/m is the length of each block; let LP denote the maximum of the points that are ineligible as cut points; let pl be the leftmost position of the data block set currently being split and pr its rightmost position, i.e., pr = B; let c be the center of the current data block set, i.e., c = B/2; let cl = c − Δ and cr = c + Δ, i.e., cl is the block center moved left by Δ and cr is the block center moved right by Δ; and set p = MaxPoint(S, cl, cr, LP), where p denotes the position of the point closest to the cut point in the block currently being split and LP denotes the maximum of the points closest to the cut point in that block.
Second, begin the split, starting from the default first data block. While pr satisfies pr ≤ n, if |p − c| > Δ, then set LP = [LP, p] and p = MaxPoint(S, cl, cr, LP), i.e., determine the point in the block closest to the cut point, then adjust cl and cr around this position to locate the cut point. Once the block center has been moved left or right and the cut point determined, split the next block; the parameter values of the block to be split are those obtained after adjusting the previously split block, and this continues until the m-th block has been split.
Preferably, the number of numeric sub data sets is at least one column, because only by adding more numeric data can the precision of big data analysis be improved.
Listed below are two medical data files that conform to a normal distribution:
1) Diabetes data source: the data set comes from ten years (1999-2008) of clinical care data from 130 U.S. hospitals and integrated delivery networks, comprising 101,767 records with 50 numerical attributes, of which 4,521 records were included.
2) Health data source: this data source consists of hospital records containing patient information and their online resources; the records include the patient's marital status, employment status, length of hospital stay, and so on.
We ran a series of experiments based on these medical data sources; the data files are the diabetes data set and the health data set.
Based on the two data sources, the experimental results are as follows:
Data set 1: μ denotes the mean and σ the standard deviation; this data set is the "diabetes attribute".
Data set 2: μ denotes the mean and σ the standard deviation; this data set is the "health attribute: balance".
After the two medical data sets were automatically split by the normal distribution method, the results are described as follows:
To verify the inclusiveness of the data blocks produced by the normal-distribution splitting method, we obtained blocks by sampling the split for the two data sources and then compared the μ and σ values of the blocks with those of the raw data sets.
In data set 1, the data set is divided into 15 single data blocks; the μ and σ values of the raw data set are compared with those of a partitioned data block in the following table:
Parameter | Data set 1 | Partitioned data block
μ | 486.5324 | 486.552
σ | 212.3062 | 212.1967
Set size | 101766 | 6784.4
The μ and σ values of the partitioned data block are very close to those of data set 1: the difference in μ between data set 1 and the partitioned block is 0.0196, a deviation of about 0.004% for the partitioned block relative to data set 1, and the difference in σ between data set 1 and the partitioned block is 0.1095, a change of about 0.05% for the partitioned block relative to data set 1.
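The relative deviations can be recomputed directly from the table above as a quick check of the comparison just described:

```python
mu_full, mu_block = 486.5324, 486.552         # mean values from the table
sigma_full, sigma_block = 212.3062, 212.1967  # standard deviations from the table

mu_rel = abs(mu_block - mu_full) / mu_full * 100
sigma_rel = abs(sigma_block - sigma_full) / sigma_full * 100
print(f"mean differs by {mu_rel:.4f}%, standard deviation by {sigma_rel:.4f}%")
```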
In data set 2, the data set is divided into 5 single data blocks, and the comparison between the complete data set and a partitioned data block is listed in the following table:
The μ and σ values of the partitioned data block are very close to those of data set 2: the difference in μ between data set 2 and the partitioned block is 4.48, a deviation of about 0.0031% for the partitioned block relative to data set 2, and the difference in σ between data set 2 and the partitioned block is 54.04, a change of about 0.0719% for the partitioned block relative to data set 2.
The probability of the Poisson distribution can be expressed as P(X = k) = λᵏ · e^(−λ) / k!, where k denotes the number of times a certain event occurs in an interval of time, λ is the average number of occurrences per time interval, e is Euler's number, k! is the factorial of k, and P denotes the probability of the event occurring.
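The probability above can be computed in log space to avoid overflow for the large λ values reported below; λ = 99.4 here is illustrative, close to the experimental λ in the table that follows.

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = lam**k * exp(-lam) / k!, evaluated via logarithms so that
    # large lam (about 99 in the experiments) does not overflow a float.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 99.4
ks = range(0, 400)
total = sum(poisson_pmf(k, lam) for k in ks)     # probabilities sum to 1
mean = sum(k * poisson_pmf(k, lam) for k in ks)  # mean equals lam
print(round(total, 6), round(mean, 2))
```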
Sampling data that conform to the Poisson distribution, the comparison between the mean and standard deviation of the resulting data blocks and the mean μ and standard deviation σ of the raw data set is as follows:
Data source | μ | σ | λ |
Raw data set (1000 data) | 100.02889 | 99.095595367901 | 99.441096530586 |
Data block 1 (2000 data) | 99.476941747573 | 99.587138219901 | 99.57710545218 |
Data block 2 (1250 data) | 99.460146053449 | 99.18890265114 | 99.238174168741 |
Data block 3 (1000 data) | 99.508640776699 | 99.276333103959 | 99.314157031216 |
Data block 4 (625 data) | 99.415456674473 | 98.511236501051 | 98.841110280882 |
The results show that the μ and σ values of the data blocks are very close to those of the raw data set, and that the larger the block volume, the better the result (i.e., the closer its μ and σ values are to those of the raw data set): when data block 1 contains 2000 data, the similarity between the raw data set and data block 1 is about 99.45%; when data block 4 contains 625 data, the similarity between the raw data set and data block 4 is about 99.39%. Thus, compared with the raw data set, a larger sampled block comes closer to the information of the raw data set.
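The similarity figures just quoted follow from the table above if similarity is taken as 100 · (1 − |μ_block − μ_raw| / μ_raw); the patent does not define the ratio explicitly, so this interpretation is an assumption.

```python
mu_raw = 100.02889                       # raw data set mean, from the table above
block_means = {                          # block means, from the same table
    "block 1 (2000 data)": 99.476941747573,
    "block 4 (625 data)": 99.415456674473,
}

def similarity(mu_block, mu_ref):
    # Similarity ratio as 100 * (1 - relative mean difference); this
    # reproduces the 99.45% and 99.39% figures quoted in the text.
    return 100 * (1 - abs(mu_block - mu_ref) / mu_ref)

for name, mu in block_means.items():
    print(f"{name}: {similarity(mu, mu_raw):.2f}% similar to the raw set")
```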
The experimental results demonstrate that the two algorithms, for the normal distribution and the Poisson distribution, approach the raw data set with an average similarity of more than 99% in mean (expectation) and variance.
Claims (6)
1. A data optimization and rapid sampling method in a big data environment, characterized by comprising the following steps:
(1), data prediction, large data sets are deployed in cloud environment, are divided into large data sets according to numerical attribute some
Row Sub Data Set, that is, the data for possessing identical numerical attribute are included into same row Sub Data Set, the form of the Sub Data Set
Including numeric form Sub Data Set and textual form Sub Data Set, the Sub Data Set of numeric form is filtered out from large data sets
Come;
(2) some Sub Data Set for needing to be analyzed, is selected from some Sub Data Sets screened, is being locally
A storing path is set up under system, the Sub Data Set is preserved, and curve matching is done to the Sub Data Set;
(3), the Sub Data Set is fitted to obtained curve to be judged, if the distribution form approximate normal distribution of the curve is bent
Line, performs step (4);If the approximate Poisson distribution curve of the distribution form of the curve, then perform step (5);
(4), the data number of blocks for needing to be split to the Sub Data Set is set, using the dividing method of normal distribution to the son
Data set carries out segmentation and draws some data blocks, some data block of being sampled from some data blocks, performs step (9);
(5), the direction for primitive curve towards the ordinate that will be fitted approximate Poisson distribution moves up K unit and obtains curve A, then nearly
Move down K unit like the direction of primitive curve towards the ordinate of Poisson distribution and obtain curve B, the region between curve A and curve B
Form a standard area;
(6), the data number of blocks for needing to be split to the Sub Data Set is set, according to data number of blocks come to the Sub Data Set
In data count according to amount averagely split, i.e., each split data volume=Sub Data Set that obtained data block is included
Total amount of data/data number of blocks;
(7) some data block E, is extracted in some data blocks drawn from segmentation, and the data block is carried out curve fitting,
See whether the curve that data block fitting is obtained falls in the standard area formed between curve A and curve B, if data block
Curve falls in standard area, then be carried out step (9);If the curve of data block is not fallen within standard area, just hold
Row step (8);
(8) sample data, is arbitrarily selected, the sample data is located in other data blocks in addition to data block E, if should
Sample data is added in data block E, the curve that data block E is fitted can be made to be located in standard area, then just by the data
Point is included in data block E;If the sample data, which is added to the curve that in data block E data block E can not be fitted, is located at standard
In region, continue to selection sample data and be put into data block E, be until data block E matched curve is located in standard area
Only, and step (9) is performed;
(9) data analysis, is carried out to the obtained data block.
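The distribution test of step (3) and the standard-region check of steps (5) and (7) can be sketched in Python. This is a minimal illustration, not the patented implementation: the claim only says the curves are "approximately" normal or Poisson, so the log-likelihood criterion and all function names here are assumptions.

```python
import numpy as np
from math import lgamma, log, pi

def classify_distribution(data):
    """Step (3) sketch: decide whether the data look more normal or more
    Poisson by comparing log-likelihoods (an assumed criterion)."""
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), max(data.std(), 1e-9)
    ll_norm = np.sum(-0.5 * np.log(2 * pi * sigma ** 2)
                     - (data - mu) ** 2 / (2 * sigma ** 2))
    lam = max(mu, 1e-9)
    k = np.round(np.clip(data, 0, None)).astype(int)
    ll_pois = np.sum(k * log(lam) - lam
                     - np.array([lgamma(int(v) + 1) for v in k]))
    return "normal" if ll_norm >= ll_pois else "poisson"

def standard_region(curve, K):
    """Step (5): shift the fitted curve up and down the ordinate by K
    units; the band between curve B (= curve - K) and curve A (= curve + K)
    is the standard region."""
    return curve - K, curve + K

def block_in_region(block_curve, lower, upper):
    """Step (7): a data block passes if its fitted curve lies inside
    the standard region."""
    block_curve = np.asarray(block_curve)
    return bool(np.all((block_curve >= lower) & (block_curve <= upper)))
```

In step (8), points from other blocks would then be added to a failing block E until `block_in_region` returns true for E's refitted curve.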
2. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the large data set is denoted A[1, k], and one of its numeric-form sub-datasets is denoted S[a, b]; suppose each datum x ∈ [a, b] in the sub-dataset satisfies
f(x) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²)),
where μ is the mean and σ is the standard deviation; then the sub-dataset satisfies S[a, b] ~ N(μ, σ²),
which shows that the sub-dataset follows a normal distribution.
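The Google Patents page drops the formula images in claim 2, so the density above is reconstructed from the named parameters μ and σ. A quick numerical sanity check of that reconstruction (function name is illustrative):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), the distribution claim 2 tests against.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Over S[a, b] = [mu - 4*sigma, mu + 4*sigma] the density should
# integrate to roughly 1, consistent with S[a, b] ~ N(mu, sigma^2).
mu, sigma = 0.0, 1.0
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 20001)
area = float(np.sum(normal_pdf(x, mu, sigma)) * (x[1] - x[0]))
```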
3. The data optimization and rapid sampling method in a big data environment according to claim 2, characterized in that: the position of the sub-dataset is written S(pl, pr), where pl and pr denote its left and right positions, and the absolute value |c − p| is less than Δ, where p denotes the position of the greatest value, Δ denotes the maximum range by which the pointer may be adjusted on either side of a data block set during partitioning, and A(p) denotes the largest possible value in the sub-dataset S[c − Δ, c + Δ]; c = (pl + pr)/2 = pl + B/2 and B = k/m, where c denotes the centre position of the sub-dataset and m the number of data blocks of the split. Since A(p) ∈ S, assume that S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2 and |c − p| ≤ Δ; it then follows that S[a, b] is a sub-dataset comprising a column of normally distributed data blocks, where m_r is the dynamically adjusted value of the parameter m: after m is initialized, its value changes as Δ adjusts the blocks, and the remaining region in the splitting process is adjusted by m_r, computed from the quantities x_i, where x_i denotes the position of the frequency maximum of the interval selected in the dynamically adjusted data set of the left adjacent interval, minus the position of the adjacent interval to its left.
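The normality evidence used in claim 3 reduces to two small checks: the centre c = (pl + pr)/2 of the sub-dataset, and the requirement that the position p of the maximum lie within Δ of that centre. A sketch with illustrative function names:

```python
def block_centre(pl, pr):
    """Centre of a sub-dataset S(pl, pr): c = (pl + pr) / 2."""
    return (pl + pr) / 2

def peak_near_centre(pl, pr, p, delta):
    """Claim-3 condition |c - p| <= delta: for a normally distributed
    sub-dataset, the position p of the greatest value should sit within
    delta of the centre c."""
    return abs(block_centre(pl, pr) - p) <= delta
```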
4. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the normal-distribution splitting method in step (4) is:
First, set the specific initial values: let the overall sub-dataset be S[1, n]; let Δ denote the maximum range by which the pointer may be adjusted on either side of the data block set being split; let m denote the number of data blocks of the split, so that B = n/m is the length of each data block; let LP hold the maxima of the ineligible cut points; let pl be the leftmost position of the data block set currently being split and pr its rightmost position, i.e. pr = B; let c be the centre of the data block set currently being split, i.e. c = B/2; let cl = c − Δ and cr = c + Δ, i.e. cl is the block centre moved left by Δ and cr is the block centre moved right by Δ; set p = MaxPoint(S, cl, cr, LP), where p denotes the point position closest to the cut point within the data block currently being split and LP the maxima of the points closest to the cut point in that block.
Second, begin splitting from the default first data block: while pr ≤ n, if |p − c| > Δ, then set LP = [LP, p] and p = MaxPoint(S, cl, cr, LP); that is, the point position closest to the cut point is determined within the data block, cl and cr are then adjusted near this position to locate the cut point, and the cut point is fixed after the block centre has been moved left or right; the next data block is then split, its parameter values being those adjusted from the previously split block, until the m-th data block has been split.
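The splitting procedure above can be sketched as follows. This is a simplified stand-in, not the claimed algorithm verbatim: each nominal boundary at a multiple of B = n // m is snapped to the largest frequency within ±Δ of it, which collapses MaxPoint and the cl/cr adjustment into one step and omits the LP rejection list.

```python
import numpy as np

def split_normal(hist, m, delta):
    """Claim-4 sketch: hist[i] is the frequency at position i of S[1, n].
    Returns m+1 cut positions bounding the m data blocks."""
    n = len(hist)
    B = n // m                                   # nominal block length
    cuts = [0]
    for i in range(1, m):
        c = i * B                                # nominal cut point (centre)
        cl, cr = max(c - delta, 0), min(c + delta, n - 1)
        # Snap the boundary to the largest value in the +/-delta window,
        # standing in for p = MaxPoint(S, cl, cr, LP).
        cuts.append(cl + int(np.argmax(hist[cl:cr + 1])))
    cuts.append(n)
    return cuts
```

Each subsequent boundary starts from the parameters of the previously placed one, as the claim requires, since the loop advances the nominal centre block by block.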
5. The data optimization and rapid sampling method in a big data environment according to any one of claims 1 to 4, characterized in that: the number of numeric-form sub-datasets is at least one column.
6. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the file formats supported for the large data set file include CSV, XLS, and TXT.
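For the CSV case, the screening of numeric-form column sub-datasets in claim 1, step (1) can be sketched with the standard library (XLS and TXT would need other readers; the function name is illustrative):

```python
import csv
import io

def numeric_columns(text):
    """Return the headers of the columns whose every value parses as a
    number, i.e. the numeric-form column sub-datasets of claim 1."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    def is_num(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    return [h for j, h in enumerate(header)
            if all(is_num(r[j]) for r in body)]
```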
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710452151.4A CN107273493B (en) | 2017-06-15 | 2017-06-15 | Data optimization and rapid sampling method under big data environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107273493A true CN107273493A (en) | 2017-10-20 |
CN107273493B CN107273493B (en) | 2020-08-25 |
Family
ID=60066298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710452151.4A Expired - Fee Related CN107273493B (en) | 2017-06-15 | 2017-06-15 | Data optimization and rapid sampling method under big data environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107273493B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110317935A1 (en) * | 2010-06-25 | 2011-12-29 | Fujitsu Limited | Image processing device, method thereof, and a computer readable non transitory storage medium storing an image processing program |
CN104636496A (en) * | 2015-03-04 | 2015-05-20 | 重庆理工大学 | Hybrid clustering recommendation method based on Gaussian distribution and distance similarity |
EP2954308A4 (en) * | 2013-02-08 | 2016-02-10 | Services Petroliers Schlumberger | Apparatus and methodology for measuring properties of microporous material at multiple scales |
CN106599798A (en) * | 2016-11-25 | 2017-04-26 | 南京蓝泰交通设施有限责任公司 | Face recognition method facing face recognition training method of big data processing |
Non-Patent Citations (2)
Title |
---|
CLÉCIO S. FERREIRA: "Nonlinear regression models under skew scale mixtures of normal distributions", 《STATISTICAL METHODOLOGY》 *
夏奇思 (Xia Qisi): "Research on a Rough-Set Massive-Data Segmentation Algorithm Based on Attribute Reduction", 《Computer Technology and Development》 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399413A (en) * | 2019-07-04 | 2019-11-01 | 博彦科技股份有限公司 | Sampling of data method, apparatus, storage medium and processor |
US11204931B1 (en) | 2020-11-19 | 2021-12-21 | International Business Machines Corporation | Query continuous data based on batch fitting |
CN117421354A (en) * | 2023-12-19 | 2024-01-19 | 国家卫星海洋应用中心 | Satellite remote sensing big data set statistical method, device and equipment |
CN117421354B (en) * | 2023-12-19 | 2024-03-19 | 国家卫星海洋应用中心 | Satellite remote sensing big data set statistical method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107273493B (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107731269B (en) | Disease coding method and system based on original diagnosis data and medical record file data | |
CN109243618B (en) | Medical model construction method, disease label construction method and intelligent device | |
CN112101190A (en) | Remote sensing image classification method, storage medium and computing device | |
CN111414393A (en) | Semantic similar case retrieval method and equipment based on medical knowledge graph | |
US20180165413A1 (en) | Gene expression data classification method and classification system | |
CN107273493A (en) | A kind of data-optimized and quick methods of sampling under big data environment | |
Jai-Andaloussi et al. | Medical content based image retrieval by using the Hadoop framework | |
CN111695593A (en) | XGboost-based data classification method and device, computer equipment and storage medium | |
CN112328909B (en) | Information recommendation method and device, computer equipment and medium | |
CN111913999B (en) | Statistical analysis method, system and storage medium based on multiple groups of study and clinical data | |
CN111695336A (en) | Disease name code matching method and device, computer equipment and storage medium | |
CN114187979A (en) | Data processing, model training, molecular prediction and screening method and device thereof | |
CN106445918A (en) | Chinese address processing method and system | |
CN116580849A (en) | Medical data acquisition and analysis system and method thereof | |
CN114496099A (en) | Cell function annotation method, device, equipment and medium | |
US20230056839A1 (en) | Cancer prognosis | |
CN115688760A (en) | Intelligent diagnosis guiding method, device, equipment and storage medium | |
Saravagi et al. | [Retracted] Diagnosis of Lumbar Spondylolisthesis Using a Pruned CNN Model | |
Malini et al. | Opinion mining on movie reviews | |
CN113241193A (en) | Drug recommendation model training method, recommendation method, device, equipment and medium | |
CN114911778A (en) | Data processing method and device, computer equipment and storage medium | |
CN112529319A (en) | Grading method and device based on multi-dimensional features, computer equipment and storage medium | |
CN110852078A (en) | Method and device for generating title | |
CN115116549A (en) | Cell data annotation method, device, equipment and medium | |
CN107168944A (en) | A kind of LDA parallel optimizations method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20200825 |