CN107273493A - Data optimization and rapid sampling method in a big data environment - Google Patents

Data optimization and rapid sampling method in a big data environment

Info

Publication number
CN107273493A
CN107273493A (application CN201710452151.4A)
Authority
CN
China
Prior art keywords
data
sub
data block
curve
block
Prior art date
Legal status
Granted
Application number
CN201710452151.4A
Other languages
Chinese (zh)
Other versions
CN107273493B (en)
Inventor
张浩澜
陈剑平
李兴森
Current Assignee
Ningbo Institute of Technology of ZJU
Original Assignee
Ningbo Institute of Technology of ZJU
Priority date
Filing date
Publication date
Application filed by Ningbo Institute of Technology of ZJU
Priority to CN201710452151.4A
Publication of CN107273493A
Application granted
Publication of CN107273493B
Legal status: Expired - Fee Related (anticipated expiration)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Operations Research (AREA)
  • Fuzzy Systems (AREA)
  • Algebra (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to a data optimization and rapid sampling method in a big data environment, comprising: (1) deploying a large dataset in a cloud environment; (2) dividing the large dataset into several sub-datasets according to numerical attributes and screening out the sub-datasets in numeric form; (3) selecting the sub-dataset to be analyzed, judging whether its data distribution is close to a normal distribution or a Poisson distribution, and then using the normal-distribution sampling algorithm or the Poisson sampling algorithm proposed by the present invention to rapidly extract data blocks from the sub-dataset, several of which are sampled for analysis. A sample data block rapidly extracted by the normal-distribution partitioning algorithm or the Poisson partitioning algorithm inherits attributes of the sub-dataset such as its mean and variance, which guarantees high consistency between the extracted data block and the sub-dataset and makes the block representative, so only the sampled data block needs to be analyzed. This sampling approach greatly shortens the data analysis time and improves data analysis efficiency.

Description

Data optimization and rapid sampling method in a big data environment
Technical field
The present invention relates to the field of big data analysis, and in particular to a data optimization and rapid sampling method in a big data environment.
Background art
The surge of medical and e-commerce applications generates huge data volumes and has brought people into the "big data" era. Unlike traditional large datasets, the term "big data" refers not only to the size of the data but also to the speed at which data are generated. Current data mining and analysis techniques therefore face the challenge of processing large volumes of data in a short time, and the arrival of the big data era pushes researchers to find optimized solutions for processing massive data, particularly for online business and medical data. In the prior art, a common method for handling the volume of large datasets is matrix decomposition: the number of data columns is reduced (dimensionality reduction) and the large dataset is then split using an enhanced UV decomposition method, after which the separated data blocks are analyzed. For a very large dataset, however, matrix decomposition can neither reduce the dataset's volume nor improve the data processing efficiency. After consulting a large number of domestic and foreign publications, practical applications, and patent documents, no existing technology or application with the same principle as that proposed by the present invention was found.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data optimization and rapid sampling method in a big data environment that can reduce the data volume and improve data processing efficiency.
The technical solution adopted by the present invention is a data optimization and rapid sampling method in a big data environment, comprising the following steps:
(1) Data preprocessing: deploy the large dataset in a cloud environment and divide it into several columns of sub-datasets according to numerical attributes, i.e., data possessing the same numerical attribute are grouped into the same column sub-dataset; the sub-datasets take two forms, numeric sub-datasets and text sub-datasets, and the numeric sub-datasets are screened out from the large dataset;
(2) Select the sub-dataset to be analyzed from the screened sub-datasets, create a storage path under the local system, save the sub-dataset, and perform curve fitting on it;
(3) Judge the curve obtained by fitting the sub-dataset: if the distribution form of the curve approximates a normal distribution curve, perform step (4); if the distribution form of the curve approximates a Poisson distribution curve, perform step (5);
(4) Set the number of data blocks into which the sub-dataset is to be split, split the sub-dataset into several data blocks using the normal-distribution splitting method, sample one of the resulting data blocks, and perform step (9);
(5) Shift the fitted curve approximating the Poisson distribution upward along the ordinate by K units to obtain curve A, then shift the original curve downward along the ordinate by K units to obtain curve B; the region between curve A and curve B forms a standard region;
(6) Set the number of data blocks into which the sub-dataset is to be split, and split the data in the sub-dataset evenly according to that number, i.e., the data volume contained in each resulting data block equals the total data volume of the sub-dataset divided by the number of data blocks;
(7) Extract a data block E from the data blocks obtained by splitting and perform curve fitting on it; check whether the fitted curve of the data block falls within the standard region formed between curve A and curve B: if it does, perform step (9); if it does not, perform step (8);
(8) Arbitrarily select a sample data point located in a data block other than data block E; if adding this sample data point to data block E makes the fitted curve of data block E fall within the standard region, the data point is included in data block E; if adding it does not, continue selecting sample data points and adding them to data block E until the fitted curve of data block E lies within the standard region, then perform step (9);
(9) Perform data analysis on the obtained data block (an illustrative code sketch of steps (1)-(9) is given below).
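Purely as an illustration of how steps (1)-(9) fit together, the following Python sketch mirrors the workflow under simplifying assumptions: the sub-dataset is a pandas column, the normal-versus-Poisson decision uses a Kolmogorov-Smirnov comparison, and the splitting is an even split rather than the adjusted splitting detailed later; all function and column names are ours, not the patent's.

# Illustrative sketch only; column names, thresholds and the goodness-of-fit
# test are assumptions, not taken from the patent text.
import numpy as np
import pandas as pd
from scipy import stats

def screen_numeric_subdatasets(df):
    # Step (1): keep only the numeric columns as candidate sub-datasets.
    return {col: df[col].dropna().to_numpy() for col in df.select_dtypes("number").columns}

def closer_to_normal(x):
    # Step (3): compare goodness of fit of a normal model and a Poisson model
    # via the Kolmogorov-Smirnov statistic (smaller statistic = better fit).
    mu, sigma = x.mean(), x.std()
    ks_norm = stats.kstest(x, "norm", args=(mu, sigma)).statistic
    ks_pois = stats.kstest(np.round(x), stats.poisson(max(mu, 1e-9)).cdf).statistic
    return ks_norm <= ks_pois

def sample_block(x, m, rng):
    # Steps (4) and (6), simplified: split into m equal blocks and sample one.
    blocks = np.array_split(np.sort(x), m)
    return blocks[rng.integers(m)]

def analyze(block):
    # Step (9): the sampled block should inherit the sub-dataset's mean/variance.
    return {"mean": float(block.mean()), "std": float(block.std())}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"duration": rng.normal(100, 15, 10000)})
    sub = screen_numeric_subdatasets(df)["duration"]
    branch = "normal" if closer_to_normal(sub) else "poisson"
    print(branch, analyze(sample_block(sub, m=15, rng=rng)), analyze(sub))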
The beneficial effects of the invention are as follows: a sub-dataset possessing the attribute to be analyzed is screened out from the massive data, the sub-dataset is then optimized, and finally a small data block is chosen for data analysis. This approach greatly reduces the data volume and shortens the data analysis time. If the fitted curve of the sub-dataset approximates a normal distribution, the data blocks obtained by splitting the sub-dataset with the normal-distribution algorithm also follow a normal distribution, so only a sampled data block needs to be analyzed to derive the information contained in the sub-dataset and the massive data to which it belongs, which improves both the efficiency and the accuracy of data analysis. If the fitted curve of the sub-dataset approximates a Poisson distribution, then after the sub-dataset is split evenly, a data block extracted from it already satisfies the Poisson distribution after optimization; analyzing the sampled data block then yields the information contained in the sub-dataset and the massive data to which it belongs, improving data analysis efficiency.
Preferably, the large dataset is expressed as A[1, k], and a numeric sub-dataset within the large dataset is set as S[a, b]. Suppose a data point x ∈ [a, b] in the sub-dataset satisfies f(x) = (1/(σ√(2π))) · exp(-(x - μ)²/(2σ²)), where μ is the mean and σ is the standard deviation; then the sub-dataset satisfies S[a, b] ~ N(μ, σ²), i.e., the sub-dataset follows a normal distribution.
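In practice this condition can be checked numerically; the sketch below estimates μ and σ from the data and applies a normality test (the D'Agostino-Pearson test from SciPy is our choice of check, not one named by the patent).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sub = rng.normal(486.5, 212.3, 100_000)   # synthetic sub-dataset standing in for S[a, b]
mu, sigma = sub.mean(), sub.std()         # estimates of the mean and standard deviation
stat, p_value = stats.normaltest(sub)     # D'Agostino-Pearson normality test
print(mu, sigma, p_value > 0.05)          # large p-value: treat S as normally distributed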
Preferably, the position of the sub-dataset is set as S(pl, pr), where pl and pr denote its left and right positions and the absolute value of p - (pl + pr)/2 is less than Δ; here p denotes the location of the maximum value, and Δ denotes the maximum range by which the pointer may be adjusted on either side of a data block set during partitioning. A(p) denotes the largest possible value in the sub-dataset S[c - Δ, c + Δ], with c = (pl + pr)/2 = pl + B/2 and B = k/m, where c denotes the middle position of the sub-dataset and m denotes the number of data blocks of the split. Given A(p) ∈ S and assuming S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2 and |c - p| ≤ Δ, it follows that S[a, b] is a sub-dataset composed of a series of data blocks each following a normal distribution, i.e., S[a, b] can be written as the union of such blocks, where m_r is the dynamically adjusted value of the parameter m: after m is initialized, m changes as Δ adjusts the partitioning, and the remaining region in the splitting process is adjusted by m_r, where x_i denotes the position of the frequency maximum of the selected interval in the dynamically adjusted data set minus the position of the adjacent interval on its left.
Preferably, the normal-distribution splitting method described in step (4) is as follows:
First, set some specific initialization values: set the overall sub-dataset as S[1, n]; let Δ denote the maximum range by which the pointer may be adjusted on either side of the data block set being split during partitioning; let m denote the number of data blocks of the split, so that B = n/m denotes the length of each data block; let LP denote the maxima that are not eligible as split points; let pl be the leftmost position of the data block set currently being split; let pr be the rightmost position of the data block set currently being split, i.e., pr = B; let c be the center of the data block set currently being split, i.e., c = B/2; let cl = c - Δ and cr = c + Δ, i.e., cl denotes the data block center moved left by Δ and cr denotes it moved right by Δ; set p = MaxPoint(S, cl, cr, LP), where p denotes the position of the point closest to the split point in the data block currently being split and LP denotes the maxima of points closest to the split point in that data block;
Second, start the splitting from the first data block by default. While pr satisfies pr ≤ n, if |p - c| > Δ, then LP = [LP, p] and p = MaxPoint(S, cl, cr, LP), i.e., the position of the point closest to the split point in the data block is determined, and cl and cr are then adjusted near this position to find the position of the split point; after the data block center has been moved left or right and the split point has been determined, the next data block is split, the parameter values of the data block to be split being the parameter values adjusted after the previous split, until the m-th data block has been split (a sketch of this procedure in code is given below).
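Read as pseudocode, the two stages above can be rendered roughly as follows in Python; the frequency array, the MaxPoint helper and the exact way the boundary is shifted are our interpretation of the description, so this is a sketch of the idea (place block boundaries near frequency peaks, rejecting peaks too far from the nominal center) rather than the patent's exact algorithm.

import numpy as np

def max_point(freq, lo, hi, excluded):
    # MaxPoint(S, cl, cr, LP): index of the largest frequency in [lo, hi],
    # ignoring positions already rejected (stored in LP).
    candidates = [i for i in range(max(lo, 0), min(hi + 1, len(freq))) if i not in excluded]
    return max(candidates, key=lambda i: freq[i]) if candidates else (lo + hi) // 2

def split_normal(freq, m, delta):
    # freq[i] is the frequency of value i in the sub-dataset S[1, n].
    n = len(freq)
    B = n // m                              # nominal block length
    cut_points, LP = [], set()
    pl = 0
    for _ in range(m - 1):
        pr = min(pl + B, n - 1)             # nominal right edge of the block
        c = pl + B // 2                     # nominal center of the block
        p = max_point(freq, pl, pr, LP)
        retries = 0
        while abs(p - c) > delta and retries < (pr - pl):
            LP.add(p)                       # peak too far from the center: reject it
            p = max_point(freq, pl, pr, LP)
            retries += 1
        cut = max(pl + 1, min(pr + (p - c), n - 1))  # shift the boundary by the adjustment
        cut_points.append(cut)
        pl = cut                            # the next block starts where this one ended
    return cut_points

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.clip(rng.normal(50, 10, 100_000).astype(int), 0, 99)
    freq = np.bincount(data, minlength=100)
    print(split_normal(freq, m=5, delta=3))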
Splitting with the normal-distribution algorithm yields data blocks whose data still follow a normal distribution, and the split data blocks support fast retrieval, which improves the inclusiveness of large-dataset analysis.
Preferably, the number of numeric sub-datasets is at least one column, because only by increasing the amount of numeric data can the accuracy of big data analysis be improved.
Brief description of the drawings
Fig. 1 shows the parameter comparison for the normal distribution of dataset 1 in an embodiment of the present invention;
Fig. 2 shows the parameter comparison for the normal distribution of dataset 2 in an embodiment of the present invention;
Embodiment
The invention is further described below with reference to the drawings and in combination with embodiments, so that those skilled in the art can implement it by referring to the description; the scope of the present invention is not limited to the embodiments.
The present invention relates to a data optimization method in a big data environment, comprising the following steps:
(1) Data preprocessing: deploy the large dataset in a cloud environment and divide it into several columns of sub-datasets according to numerical attributes, i.e., data possessing the same numerical attribute are grouped into the same column sub-dataset; the sub-datasets take two forms, numeric sub-datasets and text sub-datasets, and the numeric sub-datasets are screened out from the large dataset;
(2) Select the sub-dataset to be analyzed from the screened sub-datasets, create a storage path under the local system, save the sub-dataset, and perform curve fitting on it;
(3) Judge the curve obtained by fitting the sub-dataset: if the distribution form of the curve is closer to a normal distribution curve, perform step (4); if it is closer to a Poisson distribution curve, perform step (5);
(4) Set the number of data blocks into which the sub-dataset is to be split, split the sub-dataset into several data blocks using the normal-distribution splitting method, sample one of the resulting data blocks, and perform step (9);
(5) Shift the fitted curve close to the Poisson distribution upward along the ordinate by K units to obtain curve A, then shift the original curve downward along the ordinate by K units to obtain curve B; the region between curve A and curve B forms a standard region;
(6) Set the number of data blocks into which the sub-dataset is to be split, and split the data in the sub-dataset evenly according to that number, i.e., the data volume contained in each resulting data block equals the total data volume of the sub-dataset divided by the number of data blocks;
(7) Extract a data block E from the data blocks obtained by splitting and perform curve fitting on it; check whether the fitted curve of the data block falls within the standard region between curve A and curve B: if it does, perform step (9); if it does not, perform step (8);
(8) Arbitrarily select a sample data point located in a data block other than data block E; if adding this sample data point to data block E makes the fitted curve of data block E fall within the standard region, the data point is exactly the point needed; if adding it does not, continue selecting sample data points and adding them to data block E until the fitted curve of data block E lies within the standard region (this Poisson branch is sketched in code after these steps), then perform step (9);
(9) Perform data analysis on the obtained data block.
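As a rough illustration of the Poisson branch, steps (5)-(8), the sketch below builds the standard region by shifting the fitted Poisson curve up and down by K, splits the sub-dataset evenly, and then adds points taken from the other blocks to block E until its empirical curve lies inside the region; the value of K, the use of relative-frequency curves and the random choice of donor points are our assumptions, not the patent's.

import numpy as np
from scipy import stats

def empirical_pmf(values, support):
    # Relative frequency of each value in `support` within `values`.
    counts = np.bincount(values, minlength=support.max() + 1)[support]
    return counts / max(len(values), 1)

def standard_region(sub, support, K):
    # Step (5): fit a Poisson curve to the sub-dataset and shift it by +/- K.
    fitted = stats.poisson(sub.mean()).pmf(support)
    return fitted - K, fitted + K          # curve B (lower) and curve A (upper)

def inside_region(block, support, lower, upper):
    # Step (7): does the block's empirical curve lie inside the standard region?
    curve = empirical_pmf(block, support)
    return bool(np.all((curve >= lower) & (curve <= upper)))

def optimize_block(blocks, e, support, lower, upper, rng):
    # Step (8): keep adding points from the other blocks to block E until its
    # curve lies inside the standard region (or the donor points run out).
    block_e = list(blocks[e])
    donors = np.concatenate([b for j, b in enumerate(blocks) if j != e])
    rng.shuffle(donors)
    for point in donors:
        if inside_region(np.array(block_e), support, lower, upper):
            break
        block_e.append(point)
    return np.array(block_e)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sub = rng.poisson(8, 20000)            # sub-dataset whose curve approximates a Poisson
    support = np.arange(sub.max() + 1)
    lower, upper = standard_region(sub, support, K=0.01)
    blocks = np.array_split(sub, 5)        # step (6): even split into m = 5 blocks
    block_e = optimize_block(blocks, 0, support, lower, upper, rng)
    print(len(block_e), float(sub.mean()), float(block_e.mean()))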
The large dataset is deployed in a cloud environment. The distributed cloud computing environment allows the data to be analyzed and processed under the local system; the sub-datasets can be distributed across different locations of any cloud architecture, and a single cloud node then processes the sub-dataset under its local system. In addition, sub-dataset splitting is suited to parallel processing in the cloud environment and increases the flexibility of memory usage.
The numerical attribute of each sub-dataset can be, for example, age, duration, birth weight, body mass index of a population, population size of breast cancer patients, home address, date, or gender. Data of categories such as age, date, gender, and home address belong to text sub-datasets, while categories in numeric form such as duration, birth weight, body mass index of a population, and population size of breast cancer patients belong to numeric sub-datasets. These sub-datasets must be screened out of the large dataset again for analysis and processing. A prerequisite for using the data optimization method of the present invention is that the large dataset contains at least one numeric sub-dataset column. Curve fitting is performed on the screened sub-dataset to see whether the distribution form of the curve is close to a normal distribution or a Poisson distribution. If it is close to a normal distribution, the sub-dataset is split automatically according to the normal distribution so as to create small and representative data blocks; the resulting data blocks also follow a normal distribution, and their mean and variance approximate the mean and variance of the sub-dataset to which they belong, so the information contained in the sub-dataset can be derived directly by analyzing some of those data blocks. If it is close to a Poisson distribution, the sub-dataset is split evenly, and a data block extracted from the resulting blocks also satisfies the Poisson distribution after optimization, i.e., the mean and variance of that data block are very close to those of the sub-dataset, so the information contained in the sub-dataset can be derived directly by analyzing that data block.
Preferably, the large dataset is expressed as A[1, k], and a numeric sub-dataset within the large dataset is set as S[a, b]. Suppose a data point x ∈ [a, b] in the sub-dataset satisfies f(x) = (1/(σ√(2π))) · exp(-(x - μ)²/(2σ²)), where μ is the mean and σ is the standard deviation; then the sub-dataset satisfies S[a, b] ~ N(μ, σ²), i.e., the sub-dataset follows a normal distribution.
Preferably, the position of the sub-dataset is set as S(pl, pr), where pl and pr denote its left and right positions and the absolute value of p - (pl + pr)/2 is less than Δ; here p denotes the location of the maximum value, and Δ denotes the maximum range by which the pointer may be adjusted on either side of a data block set during partitioning. A(p) denotes the largest possible value in the sub-dataset S[c - Δ, c + Δ], with c = (pl + pr)/2 = pl + B/2 and B = k/m, where c denotes the middle position of the sub-dataset and m denotes the number of data blocks of the split. Given A(p) ∈ S and assuming S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2 and |c - p| ≤ Δ, it follows that S[a, b] is a sub-dataset composed of a series of data blocks each following a normal distribution, i.e., S[a, b] can be written as the union of such blocks, where m_r is the dynamically adjusted value of the parameter m: after m is initialized, m changes as Δ adjusts the partitioning, and the remaining region in the splitting process is adjusted by m_r, where x_i denotes the position of the frequency maximum of the selected interval in the dynamically adjusted data set minus the position of the adjacent interval on its left.
Preferably, the normal-distribution splitting method described in step (4) is as follows:
First, set some specific initialization values: set the overall sub-dataset as S[1, n]; let Δ denote the maximum range by which the pointer may be adjusted on either side of the data block set being split during partitioning; let m denote the number of data blocks of the split, so that B = n/m denotes the length of each data block; let LP denote the maxima that are not eligible as split points; let pl be the leftmost position of the data block set currently being split; let pr be the rightmost position of the data block set currently being split, i.e., pr = B; let c be the center of the data block set currently being split, i.e., c = B/2; let cl = c - Δ and cr = c + Δ, i.e., cl denotes the data block center moved left by Δ and cr denotes it moved right by Δ; set p = MaxPoint(S, cl, cr, LP), where p denotes the position of the point closest to the split point in the data block currently being split and LP denotes the maxima of points closest to the split point in that data block;
Second, start the splitting from the first data block by default. While pr satisfies pr ≤ n, if |p - c| > Δ, then LP = [LP, p] and p = MaxPoint(S, cl, cr, LP), i.e., the position of the point closest to the split point in the data block is determined, and cl and cr are then adjusted near this position to find the position of the split point; after the data block center has been moved left or right and the split point has been determined, the next data block is split, the parameter values of the data block to be split being the parameter values adjusted after the previous split, until the m-th data block has been split.
Preferably, the number of numeric sub-datasets is at least one column, because only by increasing the amount of numeric data can the accuracy of big data analysis be improved.
The following are two medical data files conforming to a normal distribution:
1) Diabetes data source: the dataset comes from ten years (1999-2008) of clinical care data from 130 US hospitals and integrated delivery networks, comprising 101,767 records with 50 numerical attributes, including 4,521 records.
2) Health data source: this data source contains hospital records providing patient information and their online resources; the records include the patient's marital status, employment status, length of hospital stay, and so on.
We carried out a series of experiments based on these medical data sources, namely the diabetes dataset and the health dataset.
Based on the two data sources, the experimental results are as follows:
Dataset 1: μ denotes the mean and σ denotes the standard deviation; this dataset is "diabetes attributes".
Dataset 2: μ denotes the mean and σ denotes the standard deviation; this dataset is "health attribute: balance".
After the above two medical datasets were automatically split according to the normal distribution, the results are as follows:
To verify the inclusiveness of the data blocks produced by the normal-distribution splitting method, we analyzed the two data sources, i.e., a data block obtained by split sampling and the raw dataset, and compared the μ and σ values of both.
For dataset 1, the dataset is divided into 15 individual data blocks, and we compare the μ and σ values of the raw dataset with those of one split data block in the following table:
Parameter Data set 1 Partition data block
μ 486.5324 486.552
σ 212.3062 212.1967
Set sizes 101766 6784.4
The μ and σ values of the split data block are very close to those of dataset 1: the μ difference between dataset 1 and the split data block is 0.0196, a deviation of about 0.004% relative to dataset 1, and the σ difference between dataset 1 and the split data block is 0.1095, a change of about 0.052% relative to dataset 1.
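These percentages follow directly from the tabulated values; a quick check in Python (values taken from the table above):

mu_set, mu_block = 486.5324, 486.552
sigma_set, sigma_block = 212.3062, 212.1967
print(abs(mu_block - mu_set) / mu_set * 100)           # ≈ 0.004 %
print(abs(sigma_block - sigma_set) / sigma_set * 100)  # ≈ 0.052 %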
For dataset 2, the dataset is divided into 5 individual data blocks; the comparison between the complete dataset and one split data block is listed in the following table:
The μ and σ values of the split data block are very close to those of dataset 2: the μ difference between dataset 2 and the split data block is 4.48, a deviation of about 0.0031% relative to dataset 2, and the σ difference between dataset 2 and the split data block is 54.04, a change of about 0.0719% relative to dataset 2.
The probability mass function of the Poisson distribution can be expressed as P(k) = λ^k · e^(-λ) / k!, where k denotes the number of times an event occurs within a time interval, λ is the average number of occurrences within a time interval, e is Euler's number, k! is the factorial of k, and P denotes the probability of the event occurring.
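For reference, the formula can be evaluated directly and cross-checked against a library implementation; λ is set here to the raw dataset's fitted value from the table further below and k to a nearby count, purely as an illustration.

import math
from scipy import stats

lam, k = 99.44, 100
p_manual = lam**k * math.exp(-lam) / math.factorial(k)   # λ^k e^(-λ) / k!
p_scipy = stats.poisson(lam).pmf(k)
print(p_manual, p_scipy)   # both ≈ 0.040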
Data conforming to the Poisson distribution were sampled; the mean and standard deviation of the finally obtained data blocks are compared with the mean μ and standard deviation of the raw dataset as follows:
Data source μ σ λ
Raw data set (1000 data) 100.02889 99.095595367901 99.441096530586
Data block 1 (2000 data) 99.476941747573 99.587138219901 99.57710545218
Data block 2 (1250 data) 99.460146053449 99.18890265114 99.238174168741
Data block 3 (1000 data) 99.508640776699 99.276333103959 99.314157031216
Data block 4 (625 data) 99.415456674473 98.511236501051 98.841110280882
The results show that the μ and σ values of the data blocks are very close to those of the raw dataset. It can be seen that the larger the data block, the better the result (i.e., the closer its μ and σ values are to those of the raw dataset): when data block 1 contains 2000 data points, the similarity between the raw dataset and data block 1 is 99.45%; when data block 4 contains 625 data points, the similarity between the raw dataset and data block 4 is 99.39%. Thus, the larger the sampled data block, the closer it is to the information of the raw dataset.
The experimental results prove that both the normal-distribution and the Poisson-distribution algorithms approximate the raw dataset with more than 99% similarity in the mean (expectation) and the variance.

Claims (6)

1. A data optimization and rapid sampling method in a big data environment, characterized by comprising the following steps:
(1) Data preprocessing: deploy the large dataset in a cloud environment and divide it into several columns of sub-datasets according to numerical attributes, i.e., data possessing the same numerical attribute are grouped into the same column sub-dataset; the sub-datasets take two forms, numeric sub-datasets and text sub-datasets, and the numeric sub-datasets are screened out from the large dataset;
(2) Select the sub-dataset to be analyzed from the screened sub-datasets, create a storage path under the local system, save the sub-dataset, and perform curve fitting on it;
(3) Judge the curve obtained by fitting the sub-dataset: if the distribution form of the curve approximates a normal distribution curve, perform step (4); if the distribution form of the curve approximates a Poisson distribution curve, perform step (5);
(4) Set the number of data blocks into which the sub-dataset is to be split, split the sub-dataset into several data blocks using the normal-distribution splitting method, sample one of the resulting data blocks, and perform step (9);
(5) Shift the fitted curve approximating the Poisson distribution upward along the ordinate by K units to obtain curve A, then shift the original curve downward along the ordinate by K units to obtain curve B; the region between curve A and curve B forms a standard region;
(6) Set the number of data blocks into which the sub-dataset is to be split, and split the data in the sub-dataset evenly according to that number, i.e., the data volume contained in each resulting data block equals the total data volume of the sub-dataset divided by the number of data blocks;
(7) Extract a data block E from the data blocks obtained by splitting and perform curve fitting on it; check whether the fitted curve of the data block falls within the standard region formed between curve A and curve B: if it does, perform step (9); if it does not, perform step (8);
(8) Arbitrarily select a sample data point located in a data block other than data block E; if adding this sample data point to data block E makes the fitted curve of data block E fall within the standard region, the data point is included in data block E; if adding it does not, continue selecting sample data points and adding them to data block E until the fitted curve of data block E lies within the standard region, then perform step (9);
(9) Perform data analysis on the obtained data block.
2. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the large dataset is expressed as A[1, k], and a numeric sub-dataset within the large dataset is set as S[a, b]; suppose a data point x ∈ [a, b] in the sub-dataset satisfies f(x) = (1/(σ√(2π))) · exp(-(x - μ)²/(2σ²)), where μ is the mean and σ is the standard deviation; then the sub-dataset satisfies S[a, b] ~ N(μ, σ²), i.e., the sub-dataset follows a normal distribution.
3. The data optimization and rapid sampling method in a big data environment according to claim 2, characterized in that: the position of the sub-dataset is set as S(pl, pr), where pl and pr denote its left and right positions and the absolute value of p - (pl + pr)/2 is less than Δ; here p denotes the location of the maximum value, and Δ denotes the maximum range by which the pointer may be adjusted on either side of a data block set during partitioning; A(p) denotes the largest possible value in the sub-dataset S[c - Δ, c + Δ], with c = (pl + pr)/2 = pl + B/2 and B = k/m, where c denotes the middle position of the sub-dataset and m denotes the number of data blocks of the split; given A(p) ∈ S and assuming S[a, b] satisfies A(p) = Max(S), c ≈ (a + b)/2 and |c - p| ≤ Δ, it follows that S[a, b] is a sub-dataset composed of a series of data blocks each following a normal distribution, i.e., S[a, b] can be written as the union of such blocks, where m_r is the dynamically adjusted value of the parameter m: after m is initialized, m changes as Δ adjusts the partitioning, and the remaining region in the splitting process is adjusted by m_r, where x_i denotes the position of the frequency maximum of the selected interval in the dynamically adjusted data set minus the position of the adjacent interval on its left.
4. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that the normal-distribution splitting method described in step (4) is:
First, set some specific initialization values: set the overall sub-dataset as S[1, n]; let Δ denote the maximum range by which the pointer may be adjusted on either side of the data block set being split during partitioning; let m denote the number of data blocks of the split, so that B = n/m denotes the length of each data block; let LP denote the maxima that are not eligible as split points; let pl be the leftmost position of the data block set currently being split; let pr be the rightmost position of the data block set currently being split, i.e., pr = B; let c be the center of the data block set currently being split, i.e., c = B/2; let cl = c - Δ and cr = c + Δ, i.e., cl denotes the data block center moved left by Δ and cr denotes it moved right by Δ; set p = MaxPoint(S, cl, cr, LP), where p denotes the position of the point closest to the split point in the data block currently being split and LP denotes the maxima of points closest to the split point in that data block;
Second, start the splitting from the first data block by default. While pr satisfies pr ≤ n, if |p - c| > Δ, then LP = [LP, p] and p = MaxPoint(S, cl, cr, LP), i.e., the position of the point closest to the split point in the data block is determined, and cl and cr are then adjusted near this position to find the position of the split point; after the data block center has been moved left or right and the split point has been determined, the next data block is split, the parameter values of the data block to be split being the parameter values adjusted after the previous split, until the m-th data block has been split.
5. The data optimization and rapid sampling method in a big data environment according to claim 1, 2, 3, or 4, characterized in that: the number of numeric sub-datasets is at least one column.
6. The data optimization and rapid sampling method in a big data environment according to claim 1, characterized in that: the file formats supported for the large dataset file include CSV, XLS, and TXT.
CN201710452151.4A 2017-06-15 2017-06-15 Data optimization and rapid sampling method under big data environment Expired - Fee Related CN107273493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710452151.4A CN107273493B (en) 2017-06-15 2017-06-15 Data optimization and rapid sampling method under big data environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710452151.4A CN107273493B (en) 2017-06-15 2017-06-15 Data optimization and rapid sampling method under big data environment

Publications (2)

Publication Number Publication Date
CN107273493A true CN107273493A (en) 2017-10-20
CN107273493B CN107273493B (en) 2020-08-25

Family

ID=60066298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710452151.4A Expired - Fee Related CN107273493B (en) 2017-06-15 2017-06-15 Data optimization and rapid sampling method under big data environment

Country Status (1)

Country Link
CN (1) CN107273493B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399413A (en) * 2019-07-04 2019-11-01 博彦科技股份有限公司 Sampling of data method, apparatus, storage medium and processor
US11204931B1 (en) 2020-11-19 2021-12-21 International Business Machines Corporation Query continuous data based on batch fitting
CN117421354A (en) * 2023-12-19 2024-01-19 国家卫星海洋应用中心 Satellite remote sensing big data set statistical method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110317935A1 (en) * 2010-06-25 2011-12-29 Fujitsu Limited Image processing device, method thereof, and a computer readable non transitory storage medium storing an image processing program
CN104636496A (en) * 2015-03-04 2015-05-20 重庆理工大学 Hybrid clustering recommendation method based on Gaussian distribution and distance similarity
EP2954308A4 (en) * 2013-02-08 2016-02-10 Services Petroliers Schlumberger Apparatus and methodology for measuring properties of microporous material at multiple scales
CN106599798A (en) * 2016-11-25 2017-04-26 南京蓝泰交通设施有限责任公司 Face recognition method facing face recognition training method of big data processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110317935A1 (en) * 2010-06-25 2011-12-29 Fujitsu Limited Image processing device, method thereof, and a computer readable non transitory storage medium storing an image processing program
EP2954308A4 (en) * 2013-02-08 2016-02-10 Services Petroliers Schlumberger Apparatus and methodology for measuring properties of microporous material at multiple scales
CN104636496A (en) * 2015-03-04 2015-05-20 重庆理工大学 Hybrid clustering recommendation method based on Gaussian distribution and distance similarity
CN106599798A (en) * 2016-11-25 2017-04-26 南京蓝泰交通设施有限责任公司 Face recognition method facing face recognition training method of big data processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CLÉCIO S. FERREIRA: "Nonlinear regression models under skew scale mixtures of normal distributions", Statistical Methodology *
XIA QISI: "Research on a massive data segmentation algorithm for rough sets based on attribute reduction", Computer Technology and Development *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399413A (en) * 2019-07-04 2019-11-01 博彦科技股份有限公司 Sampling of data method, apparatus, storage medium and processor
US11204931B1 (en) 2020-11-19 2021-12-21 International Business Machines Corporation Query continuous data based on batch fitting
CN117421354A (en) * 2023-12-19 2024-01-19 国家卫星海洋应用中心 Satellite remote sensing big data set statistical method, device and equipment
CN117421354B (en) * 2023-12-19 2024-03-19 国家卫星海洋应用中心 Satellite remote sensing big data set statistical method, device and equipment

Also Published As

Publication number Publication date
CN107273493B (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN107731269B (en) Disease coding method and system based on original diagnosis data and medical record file data
CN109243618B (en) Medical model construction method, disease label construction method and intelligent device
CN112101190A (en) Remote sensing image classification method, storage medium and computing device
CN111414393A (en) Semantic similar case retrieval method and equipment based on medical knowledge graph
US20180165413A1 (en) Gene expression data classification method and classification system
CN107273493A (en) A kind of data-optimized and quick methods of sampling under big data environment
Jai-Andaloussi et al. Medical content based image retrieval by using the Hadoop framework
CN111695593A (en) XGboost-based data classification method and device, computer equipment and storage medium
CN112328909B (en) Information recommendation method and device, computer equipment and medium
CN111913999B (en) Statistical analysis method, system and storage medium based on multiple groups of study and clinical data
CN111695336A (en) Disease name code matching method and device, computer equipment and storage medium
CN114187979A (en) Data processing, model training, molecular prediction and screening method and device thereof
CN106445918A (en) Chinese address processing method and system
CN116580849A (en) Medical data acquisition and analysis system and method thereof
CN114496099A (en) Cell function annotation method, device, equipment and medium
US20230056839A1 (en) Cancer prognosis
CN115688760A (en) Intelligent diagnosis guiding method, device, equipment and storage medium
Saravagi et al. [Retracted] Diagnosis of Lumbar Spondylolisthesis Using a Pruned CNN Model
Malini et al. Opinion mining on movie reviews
CN113241193A (en) Drug recommendation model training method, recommendation method, device, equipment and medium
CN114911778A (en) Data processing method and device, computer equipment and storage medium
CN112529319A (en) Grading method and device based on multi-dimensional features, computer equipment and storage medium
CN110852078A (en) Method and device for generating title
CN115116549A (en) Cell data annotation method, device, equipment and medium
CN107168944A (en) A kind of LDA parallel optimizations method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200825