CN107273493A - A data optimization and rapid sampling method in a big data environment - Google Patents
A data optimization and rapid sampling method in a big data environment
- Publication number
- CN107273493A CN107273493A CN201710452151.4A CN201710452151A CN107273493A CN 107273493 A CN107273493 A CN 107273493A CN 201710452151 A CN201710452151 A CN 201710452151A CN 107273493 A CN107273493 A CN 107273493A
- Authority
- CN
- China
- Prior art keywords
- data
- sub
- data block
- curve
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Abstract
The present invention relates to a data optimization and rapid sampling method in a big data environment, comprising: (1) deploying a large data set in a cloud environment; (2) dividing the large data set into several sub data sets according to numerical attribute and filtering out the sub data sets in numeric form; (3) selecting a sub data set that needs to be analyzed, judging whether its data distribution is close to a normal distribution or a Poisson distribution, and then applying the normal sampling algorithm or the Poisson sampling algorithm proposed by the present invention to rapidly extract data blocks from the sub data set, several of which are sampled for analysis. A sample data block rapidly extracted by the normal partitioning algorithm or the Poisson partitioning algorithm inherits attributes of the sub data set such as the mean and variance, which ensures high consistency between the extracted blocks and the sub data set and makes the blocks representative, so that only the sampled data blocks need to be analyzed. This sampling approach greatly shortens the data analysis time and improves data analysis efficiency.
Description
Technical field
The present invention relates to the field of big data analysis, and more particularly to a data optimization and rapid sampling method in a big data environment.
Background technology
The surge of medical and e-commerce applications has generated huge volumes of data and ushered in the era of "big data". Unlike traditional large data sets, the term "big data" refers not only to the size of the data but also to the speed at which it is generated. Current data mining and analysis technology therefore faces the challenge of processing high-volume data in a short time, and the arrival of the big data era has pushed researchers to look for optimized solutions for massive data, particularly for online business and medical data. In the prior art, a common way of handling the volume of a large data set is matrix decomposition: the data columns are reduced (dimensionality reduction), the large data set is then split using an enhanced UV decomposition method, and the blocks obtained after separation are analyzed. For a very large data set, however, matrix decomposition cannot reduce the volume of the data and does not improve processing efficiency. A review of a large number of domestic and foreign publications, practical applications, and patent documents found no existing technology or application with the same principle as that proposed by the present invention.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data optimization and rapid sampling method in a big data environment that can reduce the data volume and improve data processing efficiency.
The technical solution adopted by the present invention is a data optimization and rapid sampling method in a big data environment, comprising the following steps:
(1) Data preprocessing: deploy the large data set in a cloud environment and divide it into several columns of sub data sets according to numerical attribute, i.e., data possessing the same numerical attribute are placed in the same column sub data set; the sub data sets take two forms, numeric sub data sets and textual sub data sets, and the numeric sub data sets are filtered out of the large data set;
(2) Select a sub data set that needs to be analyzed from the filtered sub data sets, create a storage path for it under the local system, save the sub data set, and fit a curve to it;
(3) Judge the curve obtained by fitting the sub data set: if the shape of the curve approximates a normal distribution curve, perform step (4); if it approximates a Poisson distribution curve, perform step (5);
(4) Set the number of data blocks into which the sub data set needs to be split, split the sub data set into several data blocks using the normal-distribution splitting method, sample several of the blocks, and perform step (9);
(5) Shift the fitted curve approximating the Poisson distribution upward along the ordinate by K units to obtain curve A, then shift the original curve downward along the ordinate by K units to obtain curve B; the region between curve A and curve B forms a standard region;
(6) Set the number of data blocks into which the sub data set needs to be split and, according to this number, split the data in the sub data set evenly, i.e., the amount of data contained in each resulting block equals the total amount of data in the sub data set divided by the number of blocks;
(7) Extract a data block E from the blocks produced by the split and fit a curve to it; check whether the fitted curve falls inside the standard region formed between curve A and curve B: if it does, perform step (9); if it does not, perform step (8);
(8) Select an arbitrary sample datum located in one of the blocks other than block E; if adding this sample datum to block E makes the curve fitted to block E lie inside the standard region, include the data point in block E; if it does not, continue selecting sample data and adding them to block E until the fitted curve of block E lies inside the standard region, then perform step (9);
(9) Perform data analysis on the resulting data blocks.
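Steps (1)-(9) can be sketched minimally in Python on synthetic data. The patent does not specify how the curve-fitting judgment of step (3) is made, so a simple mean-versus-variance heuristic stands in for it (for a Poisson sample the two nearly coincide); all names and thresholds here are illustrative assumptions, not the patent's own implementation.

```python
import random
import statistics

def classify_distribution(data):
    # Crude stand-in for the curve-fitting judgment in step (3): a
    # Poisson-like sample has mean close to variance, while a clearly
    # different variance suggests another shape. The 0.25 threshold is
    # an assumption; the patent's fitting test is not specified.
    mu = statistics.mean(data)
    var = statistics.pvariance(data)
    return "poisson" if abs(var - mu) / max(mu, 1e-9) < 0.25 else "normal"

def split_evenly(data, m):
    # Step (6): each block holds (total amount of data) / (number of blocks).
    size = len(data) // m
    return [data[i * size:(i + 1) * size] for i in range(m)]

random.seed(0)
# Synthetic numeric sub data set: Poisson-like counts with mean about 10.
counts = [sum(random.random() < 0.01 for _ in range(1000)) for _ in range(5000)]

shape = classify_distribution(counts)   # step (3)
blocks = split_evenly(counts, 5)        # step (6)
sample_block = blocks[2]                # step (7): draw one block

# Step (9): the sampled block inherits the mean of the whole sub data set.
print(shape, round(statistics.mean(counts), 2), round(statistics.mean(sample_block), 2))
```

The printed means illustrate the consistency claim: analyzing only the sampled block gives nearly the same summary statistics as the full sub data set.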
The beneficial effects of the invention are as follows: a sub data set possessing the attribute that needs to be analyzed is filtered out of the massive data, the sub data set is then optimized, and finally a few small data blocks are chosen for data analysis. This greatly reduces the amount of data and shortens the data analysis time. If the fitted curve of the sub data set approximates a normal distribution, then the data blocks produced by splitting the sub data set with the normal-distribution algorithm also conform to a normal distribution, so analyzing only blocks sampled from the split is enough to derive the information contained in the parent sub data set and the massive data, improving both the efficiency and the accuracy of data analysis. If the fitted curve of the sub data set approximates a Poisson distribution, then after the sub data set is split evenly, a block extracted from the split conforms to the Poisson distribution once optimized, so again analyzing blocks sampled from the split suffices to derive the information contained in the parent sub data set and the massive data, improving data analysis efficiency.
Preferably, the large data set is denoted A[1, k] and a numeric sub data set within it is denoted S[a, b]. Assume some datum x ∈ [a, b] in the sub data set satisfies f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard deviation; then the sub data set satisfies S[a, b] ~ N(μ, σ²), i.e., it can be concluded that the sub data set follows a normal distribution.
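As a quick numerical check of the density above, the μ and σ values here are illustrative (close to the data set 1 experiment later in the description), and the code only verifies that f behaves as a probability density, integrating to 1.

```python
import math

def normal_pdf(x, mu, sigma):
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)**2 / (2 * sigma**2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 486.5, 212.3   # illustrative values, close to data set 1 below

# Midpoint-rule integral of f over mu +/- 6 sigma; should be close to 1.
a, b, n = mu - 6 * sigma, mu + 6 * sigma, 10_000
dx = (b - a) / n
total = sum(normal_pdf(a + (i + 0.5) * dx, mu, sigma) * dx for i in range(n))
print(round(total, 4))
```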
Preferably, the position of the sub data set is denoted S(pl, pr), where pl and pr represent its positions, and |c − p| is less than Δ, where p denotes the location of the greatest value and Δ denotes the maximum range by which the pointer may adjust either side of a data block set during partitioning. A(p) denotes the largest possible value in the sub data set S[c − Δ, c + Δ], with c = (pl + pr)/2 = pl + B/2 and B = k/m, where c represents the middle position of the sub data set and m represents the number of blocks in the split. Since A(p) ∈ S, assuming S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2, and |c − p| ≤ Δ, it can be concluded that S[a, b] is a sub data set comprising a series of data blocks that each follow a normal distribution, where m_r is a dynamically adjusted value of the parameter m: after m is initialized, its value changes as Δ adjusts the partition, and the remaining region during splitting is adjusted by m_r, with x_i denoting the position of the frequency maximum of the interval selected in the dynamically adjusted data set of the adjacent interval on the left, minus the position of the adjacent interval above it on the left.
Preferably, the normal-distribution splitting method in step (4) is:
First, set some specific initialization values: let the overall sub data set be S[1, n]; let Δ denote the maximum range by which the pointer may adjust either side of the data block set being split; let m denote the number of blocks in the split, so that B = n/m is the length of each block; let LP denote the maximum of the points that are ineligible as cut points; let pl be the leftmost position of the data block set currently being split and pr its rightmost position, i.e., pr = B; let c be the center of the current data block set, i.e., c = B/2; let cl = c − Δ and cr = c + Δ, i.e., cl is the block center moved left by Δ and cr is the block center moved right by Δ; and set p = MaxPoint(S, cl, cr, LP), where p denotes the position of the point closest to the cut point in the block currently being split and LP denotes the maximum of the points closest to the cut point in that block.
Second, begin the split, starting from the default first data block. While pr satisfies pr ≤ n, if |p − c| > Δ, then set LP = [LP, p] and p = MaxPoint(S, cl, cr, LP), i.e., determine the point in the block closest to the cut point, then adjust cl and cr around this position to locate the cut point. Once the block center has been moved left or right and the cut point determined, split the next block; the parameter values of the block to be split are those obtained after adjusting the previously split block, and this continues until the m-th block has been split.
Splitting with the normal-distribution algorithm yields data blocks whose data still conform to a normal distribution, supports quick retrieval of the split blocks, and improves the inclusiveness of large data set analysis.
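The two-part procedure above leaves MaxPoint and the pointer bookkeeping underspecified, so the following sketch is an approximation under stated assumptions: nominal cuts are placed every n/m positions, and each cut may shift by at most Δ positions toward the largest nearby value (a plain argmax over the window stands in for MaxPoint, and the LP exclusion list is omitted).

```python
import random

def segment_with_tolerance(data, m, delta):
    # Nominal cut every B = n // m positions; each cut may move at most
    # `delta` positions, landing on the largest value in its window
    # (a simplified stand-in for the patent's MaxPoint search).
    n = len(data)
    B = n // m
    cuts = [0]
    for i in range(1, m):
        c = i * B                                   # nominal cut position
        lo = max(cuts[-1] + 1, c - delta)
        hi = min(n - 1, c + delta)
        p = max(range(lo, hi + 1), key=lambda j: data[j])
        cuts.append(p)
    cuts.append(n)
    return [data[cuts[i]:cuts[i + 1]] for i in range(m)]

random.seed(1)
values = sorted(random.gauss(0.0, 1.0) for _ in range(1500))
blocks = segment_with_tolerance(values, 5, delta=20)
print([len(b) for b in blocks])   # every block stays within delta of n/m items
```

Because each cut can drift by at most Δ, block sizes stay close to the nominal B = n/m, which is what keeps the blocks comparable to the parent set.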
Preferably, the number of numeric sub data sets is at least one column, because only by adding more numeric data can the precision of big data analysis be improved.
Brief description of the drawings
Fig. 1 is the parameter comparison for the normal distribution of data set 1 in an embodiment of the present invention;
Fig. 2 is the parameter comparison for the normal distribution of data set 2 in an embodiment of the present invention;
Embodiment
The invention is further described below with reference to the drawings and embodiments, so that those skilled in the art can implement it according to the description; the scope of the present invention is not limited to the embodiments.
The present invention relates to a data optimization method in a big data environment, comprising the following steps:
(1) Data preprocessing: deploy the large data set in a cloud environment and divide it into several columns of sub data sets according to numerical attribute, i.e., data possessing the same numerical attribute are placed in the same column sub data set; the sub data sets take two forms, numeric sub data sets and textual sub data sets, and the numeric sub data sets are filtered out of the large data set;
(2) Select a sub data set that needs to be analyzed from the filtered sub data sets, create a storage path for it under the local system, save the sub data set, and fit a curve to it;
(3) Judge the curve obtained by fitting the sub data set: if the shape of the curve is closer to a normal distribution curve, perform step (4); if it is closer to a Poisson distribution curve, perform step (5);
(4) Set the number of data blocks into which the sub data set needs to be split, split the sub data set into several data blocks using the normal-distribution splitting method, sample several of the blocks, and perform step (9);
(5) Shift the fitted curve approximating the Poisson distribution upward along the ordinate by K units to obtain curve A, then shift the original curve downward along the ordinate by K units to obtain curve B; the region between curve A and curve B forms a standard region;
(6) Set the number of data blocks into which the sub data set needs to be split and, according to this number, split the data in the sub data set evenly, i.e., the amount of data contained in each resulting block equals the total amount of data in the sub data set divided by the number of blocks;
(7) Extract a data block E from the blocks produced by the split and fit a curve to it; check whether the fitted curve falls inside the standard region between curve A and curve B: if it does, perform step (9); if it does not, perform step (8);
(8) Select an arbitrary sample datum located in one of the blocks other than block E; if adding this sample datum to block E makes the curve fitted to block E lie inside the standard region, the data point is exactly the point needed; if it does not, continue selecting sample data and adding them to block E until the fitted curve of block E lies inside the standard region, then perform step (9);
(9) Perform data analysis on the resulting data blocks.
The large data set is deployed in a cloud environment. The distributed cloud computing environment allows data to be analyzed and processed under the local system: sub data sets can be distributed across different locations of any cloud architecture, and a single cloud node then processes the sub data sets under its local system. In addition, sub data set splitting is suited to parallel processing in the cloud environment, increasing the flexibility of memory use.
The numerical attribute of each sub data set can be age, duration of illness, birth weight, body mass index of a population, the population size of breast cancer patients, home address, date, sex, and so on. Categories such as age, date, sex, and home address belong to textual sub data sets, while digital categories such as duration of illness, birth weight, body mass index of a population, and the population size of breast cancer patients belong to numeric sub data sets. The latter need to be filtered out of the large data set before analysis and processing; a prerequisite for applying the data optimization method of the present invention is that the large data set contains at least one column of numeric sub data set. A curve is fitted to the filtered sub data set to see whether its shape is closer to a normal distribution or a Poisson distribution. If it is close to a normal distribution, the sub data set is split automatically using the normal distribution so as to create small, representative data blocks; the resulting blocks also conform to a normal distribution, and their mean and variance approximate those of the parent sub data set, so the information contained in the parent sub data set can be derived directly by analyzing some of the blocks. If it is close to a Poisson distribution, the sub data set is split evenly, a data block is extracted from the split, and after optimization the block also conforms to the Poisson distribution, i.e., its mean and variance closely approximate those of the sub data set, so the information contained in the parent sub data set can likewise be derived directly by analyzing that block.
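The standard-region test of steps (5)-(8) can be sketched as follows. Since the patent does not fix how the curves are fitted or how K is chosen, the sketch uses relative value frequencies as the "curve" and an assumed K, checking whether a block's curve lies between the parent curve shifted up by K (curve A) and down by K (curve B).

```python
import random
from collections import Counter

def freq_curve(data):
    # Relative frequency of each value: a discrete stand-in for the
    # fitted curve of a block or sub data set.
    n = len(data)
    return {value: c / n for value, c in Counter(data).items()}

def within_band(block, reference, K):
    # Curve A = reference curve + K, curve B = reference curve - K;
    # the block passes if its curve stays between them everywhere.
    ref = freq_curve(reference)
    blk = freq_curve(block)
    keys = set(ref) | set(blk)
    return all(abs(blk.get(k, 0.0) - ref.get(k, 0.0)) <= K for k in keys)

random.seed(2)
# Poisson-like sub data set (mean about 5) and one of five equal blocks.
population = [sum(random.random() < 0.005 for _ in range(1000)) for _ in range(6000)]
block_E = population[:1200]

print(within_band(block_E, population, K=0.08))    # a fair block stays inside the band
print(within_band([0] * 1200, population, K=0.08)) # a degenerate block does not
```

A block that fails the test would, per step (8), absorb sample data from the other blocks until its curve re-enters the band.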
Preferably, the large data set is denoted A[1, k] and a numeric sub data set within it is denoted S[a, b]. Assume some datum x ∈ [a, b] in the sub data set satisfies f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard deviation; then the sub data set satisfies S[a, b] ~ N(μ, σ²), i.e., it can be concluded that the sub data set follows a normal distribution.
Preferably, the position of the sub data set is denoted S(pl, pr), where pl and pr represent its positions, and |c − p| is less than Δ, where p denotes the location of the greatest value and Δ denotes the maximum range by which the pointer may adjust either side of a data block set during partitioning. A(p) denotes the largest possible value in the sub data set S[c − Δ, c + Δ], with c = (pl + pr)/2 = pl + B/2 and B = k/m, where c represents the middle position of the sub data set and m represents the number of blocks in the split. Since A(p) ∈ S, assuming S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2, and |c − p| ≤ Δ, it can be concluded that S[a, b] is a sub data set comprising a series of data blocks that each follow a normal distribution, where m_r is a dynamically adjusted value of the parameter m: after m is initialized, its value changes as Δ adjusts the partition, and the remaining region during splitting is adjusted by m_r, with x_i denoting the position of the frequency maximum of the interval selected in the dynamically adjusted data set of the adjacent interval on the left, minus the position of the adjacent interval above it on the left.
Preferably, the normal-distribution splitting method in step (4) is:
First, set some specific initialization values: let the overall sub data set be S[1, n]; let Δ denote the maximum range by which the pointer may adjust either side of the data block set being split; let m denote the number of blocks in the split, so that B = n/m is the length of each block; let LP denote the maximum of the points that are ineligible as cut points; let pl be the leftmost position of the data block set currently being split and pr its rightmost position, i.e., pr = B; let c be the center of the current data block set, i.e., c = B/2; let cl = c − Δ and cr = c + Δ, i.e., cl is the block center moved left by Δ and cr is the block center moved right by Δ; and set p = MaxPoint(S, cl, cr, LP), where p denotes the position of the point closest to the cut point in the block currently being split and LP denotes the maximum of the points closest to the cut point in that block.
Second, begin the split, starting from the default first data block. While pr satisfies pr ≤ n, if |p − c| > Δ, then set LP = [LP, p] and p = MaxPoint(S, cl, cr, LP), i.e., determine the point in the block closest to the cut point, then adjust cl and cr around this position to locate the cut point. Once the block center has been moved left or right and the cut point determined, split the next block; the parameter values of the block to be split are those obtained after adjusting the previously split block, and this continues until the m-th block has been split.
Preferably, the number of numeric sub data sets is at least one column, because only by adding more numeric data can the precision of big data analysis be improved.
Listed below are two medical data files that conform to a normal distribution:
1) Diabetes data source: the data set comes from ten years (1999-2008) of clinical care data from 130 U.S. hospitals and integrated delivery networks, comprising 101,767 records with 50 numerical attributes, of which 4,521 records were included.
2) Health data source: this data source consists of hospital records containing patient information and their online resources; the records include the patient's marital status, employment status, length of hospital stay, and so on.
We ran a series of experiments based on these medical data sources; the data files are the diabetes data set and the health data set.
Based on the two data sources, the experimental results are as follows:
Data set 1: μ denotes the mean and σ the standard deviation; this data set is the "diabetes attribute".
Data set 2: μ denotes the mean and σ the standard deviation; this data set is the "health attribute: balance".
After the two medical data sets were automatically split by the normal distribution method, the results are described as follows:
To verify the inclusiveness of the data blocks produced by the normal-distribution splitting method, we obtained blocks by sampling the split for the two data sources and then compared the μ and σ values of the blocks with those of the raw data sets.
In data set 1, the data set is divided into 15 single data blocks; the μ and σ values of the raw data set are compared with those of a partitioned data block in the following table:
Parameter | Data set 1 | Partitioned data block
μ | 486.5324 | 486.552
σ | 212.3062 | 212.1967
Set size | 101766 | 6784.4
The μ and σ values of the partitioned data block are very close to those of data set 1: the difference in μ between data set 1 and the partitioned block is 0.0196, a deviation of about 0.004% for the partitioned block relative to data set 1, and the difference in σ between data set 1 and the partitioned block is 0.1095, a change of about 0.05% for the partitioned block relative to data set 1.
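The relative deviations can be recomputed directly from the table above as a quick check of the comparison just described:

```python
mu_full, mu_block = 486.5324, 486.552         # mean values from the table
sigma_full, sigma_block = 212.3062, 212.1967  # standard deviations from the table

mu_rel = abs(mu_block - mu_full) / mu_full * 100
sigma_rel = abs(sigma_block - sigma_full) / sigma_full * 100
print(f"mean differs by {mu_rel:.4f}%, standard deviation by {sigma_rel:.4f}%")
```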
In data set 2, the data set is divided into 5 single data blocks, and the comparison between the complete data set and a partitioned data block is listed in the following table:
The μ and σ values of the partitioned data block are very close to those of data set 2: the difference in μ between data set 2 and the partitioned block is 4.48, a deviation of about 0.0031% for the partitioned block relative to data set 2, and the difference in σ between data set 2 and the partitioned block is 54.04, a change of about 0.0719% for the partitioned block relative to data set 2.
The probability of the Poisson distribution can be expressed as P(X = k) = λᵏ · e^(−λ) / k!, where k denotes the number of times a certain event occurs in an interval of time, λ is the average number of occurrences per time interval, e is Euler's number, k! is the factorial of k, and P denotes the probability of the event occurring.
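The probability above can be computed in log space to avoid overflow for the large λ values reported below; λ = 99.4 here is illustrative, close to the experimental λ in the table that follows.

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = lam**k * exp(-lam) / k!, evaluated via logarithms so that
    # large lam (about 99 in the experiments) does not overflow a float.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 99.4
ks = range(0, 400)
total = sum(poisson_pmf(k, lam) for k in ks)     # probabilities sum to 1
mean = sum(k * poisson_pmf(k, lam) for k in ks)  # mean equals lam
print(round(total, 6), round(mean, 2))
```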
Sampling data that conform to the Poisson distribution, the comparison between the mean and standard deviation of the resulting data blocks and the mean μ and standard deviation σ of the raw data set is as follows:
Data source | μ | σ | λ |
Raw data set (1000 data) | 100.02889 | 99.095595367901 | 99.441096530586 |
Data block 1 (2000 data) | 99.476941747573 | 99.587138219901 | 99.57710545218 |
Data block 2 (1250 data) | 99.460146053449 | 99.18890265114 | 99.238174168741 |
Data block 3 (1000 data) | 99.508640776699 | 99.276333103959 | 99.314157031216 |
Data block 4 (625 data) | 99.415456674473 | 98.511236501051 | 98.841110280882 |
The results show that the μ and σ values of the data blocks are very close to those of the raw data set, and that the larger the block volume, the better the result (i.e., the closer its μ and σ values are to those of the raw data set): when data block 1 contains 2000 data, the similarity between the raw data set and data block 1 is about 99.45%; when data block 4 contains 625 data, the similarity between the raw data set and data block 4 is about 99.39%. Thus, compared with the raw data set, a larger sampled block comes closer to the information of the raw data set.
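The similarity figures just quoted follow from the table above if similarity is taken as 100 · (1 − |μ_block − μ_raw| / μ_raw); the patent does not define the ratio explicitly, so this interpretation is an assumption.

```python
mu_raw = 100.02889                       # raw data set mean, from the table above
block_means = {                          # block means, from the same table
    "block 1 (2000 data)": 99.476941747573,
    "block 4 (625 data)": 99.415456674473,
}

def similarity(mu_block, mu_ref):
    # Similarity ratio as 100 * (1 - relative mean difference); this
    # reproduces the 99.45% and 99.39% figures quoted in the text.
    return 100 * (1 - abs(mu_block - mu_ref) / mu_ref)

for name, mu in block_means.items():
    print(f"{name}: {similarity(mu, mu_raw):.2f}% similar to the raw set")
```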
The experimental results demonstrate that the two algorithms, for the normal distribution and the Poisson distribution, approach the raw data set with an average similarity of more than 99% in mean (expectation) and variance.
Claims (6)
1. A data optimization and rapid sampling method in a big data environment, characterized by comprising the following steps:
(1), data prediction, large data sets are deployed in cloud environment, are divided into large data sets according to numerical attribute some
Row Sub Data Set, that is, the data for possessing identical numerical attribute are included into same row Sub Data Set, the form of the Sub Data Set
Including numeric form Sub Data Set and textual form Sub Data Set, the Sub Data Set of numeric form is filtered out from large data sets
Come;
(2) some Sub Data Set for needing to be analyzed, is selected from some Sub Data Sets screened, is being locally
A storing path is set up under system, the Sub Data Set is preserved, and curve matching is done to the Sub Data Set;
(3), the Sub Data Set is fitted to obtained curve to be judged, if the distribution form approximate normal distribution of the curve is bent
Line, performs step (4);If the approximate Poisson distribution curve of the distribution form of the curve, then perform step (5);
(4), the data number of blocks for needing to be split to the Sub Data Set is set, using the dividing method of normal distribution to the son
Data set carries out segmentation and draws some data blocks, some data block of being sampled from some data blocks, performs step (9);
(5), the direction for primitive curve towards the ordinate that will be fitted approximate Poisson distribution moves up K unit and obtains curve A, then nearly
Move down K unit like the direction of primitive curve towards the ordinate of Poisson distribution and obtain curve B, the region between curve A and curve B
Form a standard area;
(6), the data number of blocks for needing to be split to the Sub Data Set is set, according to data number of blocks come to the Sub Data Set
In data count according to amount averagely split, i.e., each split data volume=Sub Data Set that obtained data block is included
Total amount of data/data number of blocks;
(7) some data block E, is extracted in some data blocks drawn from segmentation, and the data block is carried out curve fitting,
See whether the curve that data block fitting is obtained falls in the standard area formed between curve A and curve B, if data block
Curve falls in standard area, then be carried out step (9);If the curve of data block is not fallen within standard area, just hold
Row step (8);
(8) sample data, is arbitrarily selected, the sample data is located in other data blocks in addition to data block E, if should
Sample data is added in data block E, the curve that data block E is fitted can be made to be located in standard area, then just by the data
Point is included in data block E;If the sample data, which is added to the curve that in data block E data block E can not be fitted, is located at standard
In region, continue to selection sample data and be put into data block E, be until data block E matched curve is located in standard area
Only, and step (9) is performed;
(9) data analysis, is carried out to the obtained data block.
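The distribution test of step (3) and the standard-region check of steps (5) and (7) can be sketched in Python. This is a minimal illustration, not the patented implementation: the claim only says the curves are "approximately" normal or Poisson, so the log-likelihood criterion and all function names here are assumptions.

```python
import numpy as np
from math import lgamma, log, pi

def classify_distribution(data):
    """Step (3) sketch: decide whether the data look more normal or more
    Poisson by comparing log-likelihoods (an assumed criterion)."""
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), max(data.std(), 1e-9)
    ll_norm = np.sum(-0.5 * np.log(2 * pi * sigma ** 2)
                     - (data - mu) ** 2 / (2 * sigma ** 2))
    lam = max(mu, 1e-9)
    k = np.round(np.clip(data, 0, None)).astype(int)
    ll_pois = np.sum(k * log(lam) - lam
                     - np.array([lgamma(int(v) + 1) for v in k]))
    return "normal" if ll_norm >= ll_pois else "poisson"

def standard_region(curve, K):
    """Step (5): shift the fitted curve up and down the ordinate by K
    units; the band between curve B (= curve - K) and curve A (= curve + K)
    is the standard region."""
    return curve - K, curve + K

def block_in_region(block_curve, lower, upper):
    """Step (7): a data block passes if its fitted curve lies inside
    the standard region."""
    block_curve = np.asarray(block_curve)
    return bool(np.all((block_curve >= lower) & (block_curve <= upper)))
```

In step (8), points from other blocks would then be added to a failing block E until `block_in_region` returns true for E's refitted curve.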
2. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the large data set is denoted A[1, k], and one of its numeric-form sub-datasets is denoted S[a, b]; suppose each datum x ∈ [a, b] in the sub-dataset satisfies
f(x) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²)),
where μ is the mean and σ is the standard deviation; then the sub-dataset satisfies S[a, b] ~ N(μ, σ²),
which shows that the sub-dataset follows a normal distribution.
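The Google Patents page drops the formula images in claim 2, so the density above is reconstructed from the named parameters μ and σ. A quick numerical sanity check of that reconstruction (function name is illustrative):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), the distribution claim 2 tests against.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Over S[a, b] = [mu - 4*sigma, mu + 4*sigma] the density should
# integrate to roughly 1, consistent with S[a, b] ~ N(mu, sigma^2).
mu, sigma = 0.0, 1.0
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 20001)
area = float(np.sum(normal_pdf(x, mu, sigma)) * (x[1] - x[0]))
```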
3. The data optimization and rapid sampling method in a big data environment according to claim 2, characterized in that: the position of the sub-dataset is written S(pl, pr), where pl and pr denote its left and right positions, and the absolute value |c − p| is less than Δ, where p denotes the position of the greatest value, Δ denotes the maximum range by which the pointer may be adjusted on either side of a data block set during partitioning, and A(p) denotes the largest possible value in the sub-dataset S[c − Δ, c + Δ]; c = (pl + pr)/2 = pl + B/2 and B = k/m, where c denotes the centre position of the sub-dataset and m the number of data blocks of the split. Since A(p) ∈ S, assume that S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2 and |c − p| ≤ Δ; it then follows that S[a, b] is a sub-dataset comprising a column of normally distributed data blocks, where m_r is the dynamically adjusted value of the parameter m: after m is initialized, its value changes as Δ adjusts the blocks, and the remaining region in the splitting process is adjusted by m_r, computed from the quantities x_i, where x_i denotes the position of the frequency maximum of the interval selected in the dynamically adjusted data set of the left adjacent interval, minus the position of the adjacent interval to its left.
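The normality evidence used in claim 3 reduces to two small checks: the centre c = (pl + pr)/2 of the sub-dataset, and the requirement that the position p of the maximum lie within Δ of that centre. A sketch with illustrative function names:

```python
def block_centre(pl, pr):
    """Centre of a sub-dataset S(pl, pr): c = (pl + pr) / 2."""
    return (pl + pr) / 2

def peak_near_centre(pl, pr, p, delta):
    """Claim-3 condition |c - p| <= delta: for a normally distributed
    sub-dataset, the position p of the greatest value should sit within
    delta of the centre c."""
    return abs(block_centre(pl, pr) - p) <= delta
```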
4. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the normal-distribution splitting method in step (4) is:
First, set the specific initial values: let the overall sub-dataset be S[1, n]; let Δ denote the maximum range by which the pointer may be adjusted on either side of the data block set being split; let m denote the number of data blocks of the split, so that B = n/m is the length of each data block; let LP hold the maxima of the ineligible cut points; let pl be the leftmost position of the data block set currently being split and pr its rightmost position, i.e. pr = B; let c be the centre of the data block set currently being split, i.e. c = B/2; let cl = c − Δ and cr = c + Δ, i.e. cl is the block centre moved left by Δ and cr is the block centre moved right by Δ; set p = MaxPoint(S, cl, cr, LP), where p denotes the point position closest to the cut point within the data block currently being split and LP the maxima of the points closest to the cut point in that block.
Second, begin splitting from the default first data block: while pr ≤ n, if |p − c| > Δ, then set LP = [LP, p] and p = MaxPoint(S, cl, cr, LP); that is, the point position closest to the cut point is determined within the data block, cl and cr are then adjusted near this position to locate the cut point, and the cut point is fixed after the block centre has been moved left or right; the next data block is then split, its parameter values being those adjusted from the previously split block, until the m-th data block has been split.
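The splitting procedure above can be sketched as follows. This is a simplified stand-in, not the claimed algorithm verbatim: each nominal boundary at a multiple of B = n // m is snapped to the largest frequency within ±Δ of it, which collapses MaxPoint and the cl/cr adjustment into one step and omits the LP rejection list.

```python
import numpy as np

def split_normal(hist, m, delta):
    """Claim-4 sketch: hist[i] is the frequency at position i of S[1, n].
    Returns m+1 cut positions bounding the m data blocks."""
    n = len(hist)
    B = n // m                                   # nominal block length
    cuts = [0]
    for i in range(1, m):
        c = i * B                                # nominal cut point (centre)
        cl, cr = max(c - delta, 0), min(c + delta, n - 1)
        # Snap the boundary to the largest value in the +/-delta window,
        # standing in for p = MaxPoint(S, cl, cr, LP).
        cuts.append(cl + int(np.argmax(hist[cl:cr + 1])))
    cuts.append(n)
    return cuts
```

Each subsequent boundary starts from the parameters of the previously placed one, as the claim requires, since the loop advances the nominal centre block by block.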
5. The data optimization and rapid sampling method in a big data environment according to any one of claims 1 to 4, characterized in that: the number of numeric-form sub-datasets is at least one column.
6. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the file formats supported for the large data set file include CSV, XLS, and TXT.
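For the CSV case, the screening of numeric-form column sub-datasets in claim 1, step (1) can be sketched with the standard library (XLS and TXT would need other readers; the function name is illustrative):

```python
import csv
import io

def numeric_columns(text):
    """Return the headers of the columns whose every value parses as a
    number, i.e. the numeric-form column sub-datasets of claim 1."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    def is_num(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    return [h for j, h in enumerate(header)
            if all(is_num(r[j]) for r in body)]
```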
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710452151.4A CN107273493B (en) | 2017-06-15 | 2017-06-15 | Data optimization and rapid sampling method under big data environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107273493A true CN107273493A (en) | 2017-10-20 |
CN107273493B CN107273493B (en) | 2020-08-25 |
Family
ID=60066298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710452151.4A Expired - Fee Related CN107273493B (en) | 2017-06-15 | 2017-06-15 | Data optimization and rapid sampling method under big data environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107273493B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110317935A1 (en) * | 2010-06-25 | 2011-12-29 | Fujitsu Limited | Image processing device, method thereof, and a computer readable non transitory storage medium storing an image processing program |
CN104636496A (en) * | 2015-03-04 | 2015-05-20 | 重庆理工大学 | Hybrid clustering recommendation method based on Gaussian distribution and distance similarity |
EP2954308A4 (en) * | 2013-02-08 | 2016-02-10 | Services Petroliers Schlumberger | Apparatus and methodology for measuring properties of microporous material at multiple scales |
CN106599798A (en) * | 2016-11-25 | 2017-04-26 | 南京蓝泰交通设施有限责任公司 | Face recognition method facing face recognition training method of big data processing |
Non-Patent Citations (2)
Title |
---|
CLÉCIO S. FERREIRA: "Nonlinear regression models under skew scale mixtures of normal distributions", 《STATISTICAL METHODOLOGY》 *
夏奇思 (Xia Qisi): "Research on a Rough-Set Massive-Data Segmentation Algorithm Based on Attribute Reduction", 《Computer Technology and Development》 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399413A (en) * | 2019-07-04 | 2019-11-01 | 博彦科技股份有限公司 | Sampling of data method, apparatus, storage medium and processor |
US11204931B1 (en) | 2020-11-19 | 2021-12-21 | International Business Machines Corporation | Query continuous data based on batch fitting |
CN117421354A (en) * | 2023-12-19 | 2024-01-19 | 国家卫星海洋应用中心 | Satellite remote sensing big data set statistical method, device and equipment |
CN117421354B (en) * | 2023-12-19 | 2024-03-19 | 国家卫星海洋应用中心 | Satellite remote sensing big data set statistical method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107273493B (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107731269B (en) | Disease coding method and system based on original diagnosis data and medical record file data | |
CN109243618B (en) | Medical model construction method, disease label construction method and intelligent device | |
CN112101190A (en) | Remote sensing image classification method, storage medium and computing device | |
CN111414393A (en) | Semantic similar case retrieval method and equipment based on medical knowledge graph | |
US20180165413A1 (en) | Gene expression data classification method and classification system | |
CN107273493A (en) | A kind of data-optimized and quick methods of sampling under big data environment | |
Jai-Andaloussi et al. | Medical content based image retrieval by using the Hadoop framework | |
CN111695593A (en) | XGboost-based data classification method and device, computer equipment and storage medium | |
CN112328909B (en) | Information recommendation method and device, computer equipment and medium | |
CN111913999B (en) | Statistical analysis method, system and storage medium based on multiple groups of study and clinical data | |
CN111695336A (en) | Disease name code matching method and device, computer equipment and storage medium | |
CN114187979A (en) | Data processing, model training, molecular prediction and screening method and device thereof | |
CN106445918A (en) | Chinese address processing method and system | |
CN116580849A (en) | Medical data acquisition and analysis system and method thereof | |
CN114496099A (en) | Cell function annotation method, device, equipment and medium | |
US20230056839A1 (en) | Cancer prognosis | |
CN115688760A (en) | Intelligent diagnosis guiding method, device, equipment and storage medium | |
Saravagi et al. | [Retracted] Diagnosis of Lumbar Spondylolisthesis Using a Pruned CNN Model | |
Malini et al. | Opinion mining on movie reviews | |
CN113241193A (en) | Drug recommendation model training method, recommendation method, device, equipment and medium | |
CN114911778A (en) | Data processing method and device, computer equipment and storage medium | |
CN112529319A (en) | Grading method and device based on multi-dimensional features, computer equipment and storage medium | |
CN110852078A (en) | Method and device for generating title | |
CN115116549A (en) | Cell data annotation method, device, equipment and medium | |
CN107168944A (en) | A kind of LDA parallel optimizations method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20200825 |