CN101567694B

CN101567694B - Multilevel data sampling method based on connected subgraph

Info

Publication number: CN101567694B
Application number: CN 200910031265
Authority: CN
Inventors: 钱宇; 张康
Original assignee: Individual
Current assignee: Individual
Priority date: 2009-04-30
Filing date: 2009-04-30
Publication date: 2012-04-18
Anticipated expiration: 2029-04-30
Also published as: CN101567694A

Abstract

The invention discloses a multilevel data sampling method based on connected subgraphs, characterized in that the method comprises the following steps: (1) building a K-nearest-neighbor graph or a K-mutual-neighbor graph on the input data, where K is an integer; (2) obtaining the connected subgraphs of the graph built in step (1); (3) computing the mean or the median of each connected subgraph as a sampled point, the set of all sampled points being the result of this round of sampling; and (4) taking the sampling result of step (3) as new input data and repeating steps (1) to (3) until a sampling end condition is met, thereby achieving the required multilevel data sampling. Because the concept of a graph is introduced into the sampling process, the method samples efficiently and runs quickly; and because a sampling end condition can be set, the system can automatically stop further sampling when the sampled points become too few to represent the source data.

Description

A multilevel data sampling method based on connected subgraphs

Technical field

The application relates to a new data sampling method in the field of information processing and statistics. By sampling the data, it reduces the cost of data storage, transmission time, and data analysis.

Background art

Data sampling is widely used in information processing and statistics. With the spread of digital equipment and networks in particular, the volume of data to be processed has grown rapidly, and the growth rate keeps accelerating. This growth is constrained by storage capacity, communication bandwidth, the capacity of systems to interpret the data, and similar limits. In such cases the data often has to be sampled or compressed so that storage, transmission, and analysis are accelerated while the essential information in the data is preserved.

The simplest and most widely used sampling method is random sampling. It runs quickly and is easy to implement, but its shortcomings are equally clear. First, the positions of the sampled points cannot be controlled, so the sampling process is not reproducible and the sampling error is hard to control. Second, regions with few data points have a low probability of being selected and are usually ignored, leaving whole regions without a representative point. Third, the user must specify the number of sampled points in advance, and an ordinary user can hardly know how many sampled points keep the sample as small as possible without serious distortion. An important improvement on random sampling adjusts the positions of the sampled points according to density (how concentrated the data is): the lower the density of a region, the higher its sampling frequency, which guarantees that regions with few data points also receive sampled points. However, this increases the computational cost and does not remedy the other defects of random sampling.

Vector quantization is another typical sampling and compression approach, and the LBG (Linde-Buzo-Gray) algorithm is a typical vector quantization method. It uses the K-means clustering algorithm to produce representative points, and then applies the same set of representative points to new data to compute sampled points. The drawbacks of the LBG algorithm are its long running time and, again, the need for the user to specify the sample size. The LBG algorithm is also a supervised learning method, so it is unsuitable where no training data is available.

The PNN algorithm merges the two closest data points at each step until the number of data points drops to the sample size specified by the user. Its complexity is O(N³), and the user must specify the sample size, so it shares the problem that an ordinary user can hardly determine the number of sampled points.

Summary of the invention

The object of the invention is to provide a multilevel data sampling method based on connected subgraphs that terminates multilevel sampling automatically while reducing algorithmic complexity and sampling time.

To achieve this object, the technical scheme adopted by the invention is a multilevel data sampling method based on connected subgraphs, comprising the following steps:

(1) build a K-nearest-neighbor graph or a K-mutual-neighbor graph on the input data; for N data vectors, K is an integer of

;

(2) obtain the connected subgraphs of the K-nearest-neighbor graph or K-mutual-neighbor graph that was built;

(3) for each connected subgraph, compute its mean or median as a sampled point; the set of all sampled points is the result of this round of sampling;

(4) take the sampling result of step (3) as new input data and repeat steps (1) to (3) until the sampling end condition is met, thereby achieving the required multilevel data sampling.
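Steps (1) to (4) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: it uses the K-nearest-neighbor graph, the mean as the center, a brute-force neighbor search, and "one point left" as the end condition. The function names (`knn_edges`, `connected_components`, `sample_once`, `sample_multilevel`) are hypothetical.

```python
from math import dist


def knn_edges(points, k=1):
    """Step (1): link every point to its k nearest other points (brute force)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return edges


def connected_components(n, edges):
    """Step (2): group vertices 0..n-1 into connected subgraphs (union-find)."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def sample_once(points, k=1):
    """Step (3): one sampled point (the mean) per connected subgraph."""
    comps = connected_components(len(points), knn_edges(points, k))
    dim = len(points[0])
    return [tuple(sum(points[v][c] for v in comp) / len(comp)
                  for c in range(dim))
            for comp in comps]


def sample_multilevel(points, k=1):
    """Step (4): resample until one point remains, keeping every level."""
    levels = []
    while len(points) > 1:
        points = sample_once(points, k)
        levels.append(points)
    return levels
```

Because every point has at least one edge when k ≥ 1, each connected subgraph contains at least two points, so each round at least halves the number of points and the loop terminates. Applied to the 13 first-sampling points listed in embodiment one, this sketch yields 4 points and then 1 point, consistent with the values reported there.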

In the above technical scheme, K in step (1) is an integer variable: the larger the value of K, the fewer sampled points are produced. Unless otherwise specified, K = 1 is normally used. A sampling result can itself be sampled further by repeating steps (1) to (3). Note that if the mean is used in step (3), further sampling should also use the mean, and if the median is used, further sampling should also use the median; the former is called mean sampling and the latter median sampling. Each round of sampling produces fewer sampled points than the previous round, which lowers data fidelity but yields a more refined sample. These samples together form a set of data samples in which the number of sampled points decreases step by step while both the distortion and the level of abstraction increase step by step. It is easy to prove that, if further sampling is never stopped, the final sample contains only a single sampled point, at which the data distortion is largest and the level of abstraction is highest.
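The difference between mean sampling and median sampling can be illustrated with a small helper. The text does not say how the median of a group of vectors is computed; the sketch below assumes a coordinate-wise median, and the name `center` is hypothetical.

```python
from statistics import mean, median


def center(points, mode="mean"):
    """Coordinate-wise mean or median of a group of vectors.

    Assumption: the median is taken separately on each coordinate.
    """
    pick = {"mean": mean, "median": median}[mode]
    return tuple(pick(coord) for coord in zip(*points))
```

Whichever mode is chosen for the first round would then be passed unchanged to every later round, matching the requirement that mean sampling and median sampling are not mixed.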

In the above technical scheme, building the K-nearest-neighbor graph or the K-mutual-neighbor graph is prior art, explained as follows:

A) The input data can be treated as a set of data vectors; every data vector has the same number of attributes, and attribute values may be empty.

B) Each data vector can be represented by a point in a Cartesian coordinate system, the attribute values of the vector being the coordinates of that point.

C) The similarity between a pair of data vectors is the Euclidean distance between the two corresponding points.

D) The K-nearest-neighbor graph connects each data point to the K other data points nearest to it.

E) The K-mutual-neighbor graph requires checking, for each nearest neighbor of each data point X, whether X is in turn one of the K nearest points of that neighbor. If not, the mutual-neighbor graph contains no edge between X and that neighbor.

F) A nearest-neighbor graph or mutual-neighbor graph is created from a set of data points.

G) Whether two data points are close is defined by whether they lie in the same connected subgraph.

H) A group of close data points can be characterized by its center.

I) The center of a group of data points can be defined as the mean or the median of the group.

J) The sampling process can be carried on continuously: the output of each round of sampling is the input of the next round, so the sample keeps shrinking.
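Items D) and E) differ only in when an edge is kept, which a short sketch makes concrete. The code below is a brute-force illustration with hypothetical names (`nearest`, `knn_graph`, `mutual_knn_graph`); it is not taken from the patent and ignores empty attribute values.

```python
from math import dist


def nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]


def knn_graph(points, k):
    """Item D): edge (i, j) if j is among i's k nearest points, or vice versa."""
    return {tuple(sorted((i, j)))
            for i in range(len(points))
            for j in nearest(points, i, k)}


def mutual_knn_graph(points, k):
    """Item E): edge (i, j) only if each point is among the other's k nearest."""
    near = [set(nearest(points, i, k)) for i in range(len(points))]
    return {(i, j)
            for i in range(len(points))
            for j in near[i]
            if i < j and i in near[j]}
```

For the three collinear points (0, 0), (1, 0), (3, 0) with K = 1, the nearest-neighbor graph links all three points into one subgraph, while the mutual-neighbor graph keeps only the edge between the first two, so the third point forms a subgraph of its own.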

In the above technical scheme, the sampling end condition in step (4) is explained as follows:

A) Suppose the data set G contains N vector points. The first sampling A_1 produces sample D_1 with N_1 sampled points, the second sampling A_2 produces sample D_2 with N_2 sampled points, and so on until only 1 sampled point remains. According to the algorithm of claim 1, 1 < ... < N_2 < N_1 < N, and for all i the distortion of D_i is less than that of D_{i+1}.

B) Take as a baseline the reduction in sampled points from N_i to N_{i+1} and the change in distortion from sample D_i to D_{i+1}. If, in the next sampling A_{i+2}, the reduction from N_{i+1} to N_{i+2} changes little in magnitude while the distortion of sample D_{i+2} increases sharply relative to D_{i+1}, the data pattern has been destroyed, and sampling must stop at the end of the previous round.

C) The normal ratio between sampling distortion and sampled-point reduction is initially estimated from the reduction of the first sampling relative to the original data and from the average distance among the original data points themselves. The expected distortion of each subsequent sampling is then estimated from the actual distortion of the previous sampling and the reduction in the number of sampled points in the current sampling.

The actual distortion of a sampling result is defined from the distances between each data point and its nearest sampled point, as expressed by formula (1):

$$ad = \left( \frac{1}{N} \sum_{i=1}^{N} \mathrm{dist}\bigl(X_i, C(X_i)\bigr)^2 \right)^{1/2} \qquad (1)$$

where ad is the distortion, N is the number of input data points, the input data are X_1, X_2, ..., X_N, the corresponding sampled points are C(X_1), C(X_2), ..., C(X_N), C(X_i) is the function that returns the sampled point of X_i, and dist(X_i, C(X_i)) is the Euclidean distance between X_i and C(X_i).
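Formula (1) is a root-mean-square distance, which can be computed directly. The sketch below takes C(X_i) to be the sampled point nearest to X_i, as the definition above states; the function name `actual_distortion` is hypothetical.

```python
from math import dist, sqrt


def actual_distortion(data, samples):
    """Formula (1): RMS distance from each data point to its nearest sample."""
    total = sum(min(dist(x, c) for c in samples) ** 2 for x in data)
    return sqrt(total / len(data))
```

For example, two data points (0, 0) and (2, 0) represented by the single sampled point (1, 0) each lie at distance 1 from it, so the distortion is 1.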

The expected distortion of the sampling result D_i obtained at sampling stage A_i is defined from the actual distortion of the previous stage and the ratio by which the sample shrinks:

$$pd_i = (ad_{i-1} + ad_1) \left( \frac{N_{i-1}}{N_i} \right)^{1/d} - ad_1, \qquad \forall i > 1 \qquad (2)$$

where d is the dimension of the data, ad_1 is the actual distortion of the first sampling stage, ad_{i-1} is the actual distortion of sampling stage A_{i-1}, N_{i-1} is the number of sampled points of stage A_{i-1}, and N_i is the number of sampled points of stage A_i.

When a sampling stage A_t satisfies the condition

$$\forall i < t:\ ad_i \le pd_i, \qquad \text{and} \qquad ad_t > pd_t,$$

that is, when the actual distortion of the t-th sampling is higher than its expected distortion, sampling stops automatically.
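Formula (2) and the stop condition can be checked numerically. The following sketch assumes the per-stage actual distortions and sample sizes are already known; the names `expected_distortion` and `stopping_stage` are hypothetical.

```python
def expected_distortion(ad_prev, ad_first, n_prev, n_cur, d):
    """Formula (2): pd_i from ad_{i-1}, ad_1, N_{i-1}, N_i and dimension d."""
    return (ad_prev + ad_first) * (n_prev / n_cur) ** (1.0 / d) - ad_first


def stopping_stage(ad, n, d):
    """Return the 1-based stage t at which ad_t > pd_t first holds, else None.

    ad[i] and n[i] are the actual distortion and sample size of stage i + 1.
    Stage 1 has no expected distortion, so checking starts at stage 2.
    """
    for i in range(1, len(ad)):
        pd = expected_distortion(ad[i - 1], ad[0], n[i - 1], n[i], d)
        if ad[i] > pd:
            return i + 1
    return None
```

With the values reported in embodiment two (actual distortions 8.785595, 17.587867 and 44.71084 for sample sizes 63, 20 and 7, with d = 2), this sketch gives expected distortions of about 22.4002 and 35.7937, matching the reported figures to within rounding, and it identifies the third stage as the stopping stage.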

Alternatively, the sampling end condition in step (4) is that the system keeps sampling until the number of sampled points is 1. During this process the result of every round of sampling is saved, and the user selects the required sample from the saved results according to the sample size or the sample distortion.
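Selecting from the saved results could look like the sketch below. The selection rules are an assumption, since the text only says the user chooses by size or distortion: here the choice is the coarsest level within a distortion limit, or the finest level within a size limit. The function names and the record layout are hypothetical; the sizes and distortions are those reported for the three rounds of embodiment one.

```python
def pick_by_distortion(levels, max_distortion):
    """Coarsest saved level whose distortion does not exceed the limit."""
    ok = [lv for lv in levels if lv["distortion"] <= max_distortion]
    return min(ok, key=lambda lv: lv["size"]) if ok else None


def pick_by_size(levels, max_size):
    """Finest saved level whose size does not exceed the limit."""
    ok = [lv for lv in levels if lv["size"] <= max_size]
    return max(ok, key=lambda lv: lv["size"]) if ok else None


# Sizes and distortions as reported for the three rounds of embodiment one.
levels = [
    {"size": 13, "distortion": 5.037373},
    {"size": 4, "distortion": 14.352662},
    {"size": 1, "distortion": 59.752489},
]
```

In practice each record would also carry the sampled points themselves; they are left out here to keep the sketch short.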

In a further technical scheme, K = 1 in step (1).

By applying the above technical scheme, the invention has the following advantages over the prior art:

1. The invention introduces the concept of a graph into the sampling process and merges, in each round, a whole connected subgraph made up of nearby points, so sampling is more efficient and the running time is shorter than that of PNN. For N data vectors, the complexity of PNN is O(N³), whereas that of the proposed method is O(N²), reduced further to O(N log N) for low-dimensional data. The sampling process can also be carried on continuously, the output of each round serving as the input of the next. The user need not specify the sample size, and can instead select from the generated set of samples after sampling finishes, according to sample attributes such as size and distortion.

2. The invention can set a sampling stop condition, so that the system automatically stops further sampling when the sampled points become too few to represent the source data.

Description of drawings

Fig. 1 is a schematic diagram of the sampling algorithm in embodiment one;

Fig. 2 is a schematic diagram of the sampling algorithm in embodiment two;

Fig. 3 is a schematic diagram of the data sampling results of embodiment two.

Embodiment

The invention is further described below with reference to the drawings and embodiments:

Embodiment one: as shown in Fig. 1, a multilevel data sampling method based on connected subgraphs comprises the following steps:

(1) build a K-nearest-neighbor graph or a K-mutual-neighbor graph on the input data, with K = 1;

(2) obtain the connected subgraphs of the nearest-neighbor graph or mutual-neighbor graph that was built;

(3) for each connected subgraph, compute its mean as a sampled point; the set of all sampled points is the result of this round of sampling;

(4) take the sampling result of step (3) as new input data and repeat steps (1) to (3) until the number of sampled points is 1, thereby achieving the required multilevel data sampling.

During this process the result of every round of sampling is saved, and the user selects the required sample from the saved results according to the sample size or the sample distortion.

Taking source data of 40 two-dimensional vectors as an example, the sampling results are shown in the table below.

Table 1: source data (40 two-dimensional vectors), sampled three times until the number of sampled points is 1

Sampling | Sample size | Sample distortion

First | 13 | 5.037373

Second | 4 | 14.352662

Third | 1 | 59.752489

Each cell below is an x y coordinate pair; a column is left empty in the rows after its last entry.

Original data | First sampling | Second sampling | Third sampling

169.7998 74.50672 | 173.4262 72.55558 | 176.273 72.16654 | 132.0092 133.3186

166.7934 63.05775 | 166.173 59.49513 | 112.3492 168.3199

202.7356 76.41366 | 195.7937 79.14297 | 107.1235 148.6582

193.5498 70.38806 | 160.2242 73.59232 | 132.291 144.1296

164.6338 53.93999 | 172.0314 92.31077

195.8784 80.32839 | 189.9896 55.90245

197.9796 81.092 | 106.9004 165.3225

171.0991 69.93897 | 96.29275 147.0986

158.0456 73.13734 | 110.0779 147.4216

167.9566 89.69758 | 114.9997 151.4545

162.4027 74.0473 | 134.6302 138.35

180.0624 64.92613 | 129.9518 149.9093

193.5652 54.91849 | 117.798 171.3173

167.0918 61.48766

188.8248 87.49272

173.3717 93.5637

189.9956 57.91901

172.7436 80.8505

186.4079 54.86986

174.7658 93.67102

106.7336 165.4157

93.78157 158.2717

107.0671 165.2292

97.56 145.3172

107.346 151.3961

113.5398 152.9269

136.9691 137.3639

131.4998 153.2854

129.429 137.3368

94.33297 144.2756

114.8089 168.0962

123.047 174.4868

115.5382 171.3688

114.358 151.3219

110.1604 149.505

128.4038 146.5332

112.7274 141.3637

117.1014 150.1147

99.49649 140.5297

137.4924 140.3492

Embodiment two: as shown in Fig. 2, a multilevel data sampling method based on connected subgraphs comprises the following steps:

The sampling end condition in step (4) is as follows:

$$ad = \left( \frac{1}{N} \sum_{i=1}^{N} \mathrm{dist}\bigl(X_i, C(X_i)\bigr)^2 \right)^{1/2} \qquad (1)$$

$$pd_i = (ad_{i-1} + ad_1) \left( \frac{N_{i-1}}{N_i} \right)^{1/d} - ad_1, \qquad \forall i > 1 \qquad (2)$$

When a sampling stage A_t satisfies the condition

$$\forall i < t:\ ad_i \le pd_i, \qquad \text{and} \qquad ad_t > pd_t,$$

sampling stops automatically.

Referring to Fig. 3, the figure illustrates the result of applying the method of this embodiment to a two-dimensional data set for two consecutive rounds of sampling. The original data set has 60000 data points; the first round of sampling yields 18612 sampled points, and the second round further reduces the sample to 5272 points. The figure shows that the sample preserves the distribution pattern of the original data. For pattern recognition applications, such a sample represents the original data with less than 9% of the data, which greatly reduces data storage space, transmission time, and data analysis time.
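The "less than 9%" figure follows directly from the point counts quoted for this example; as a quick arithmetic check of the two sampling ratios relative to the original data set:

```latex
\frac{18612}{60000} \approx 31.0\%, \qquad \frac{5272}{60000} \approx 8.79\% < 9\%
```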

Another data example is shown in the table below.

Table 2: source data (200 two-dimensional vectors); sampling stops automatically after 3 rounds

Sampling | Sample size | Actual distortion | Expected distortion | Outcome

First | 63 | 8.785595 | N/A |

Second | 20 | 17.587867 | 22.400175 | expected distortion > actual distortion; sampling can continue

Third | 7 | 44.71084 | 35.793693 | expected distortion < actual distortion; sampling stops

Each cell below is an x y coordinate pair; a column is left empty in the rows after its last entry.

Original data | First sampling | Second sampling | Third sampling

169.7997776 74.50672302 | 173.426227 72.55558 | 176.272997 72.166536 | 114.129588 81.678394

166.7933783 63.05774696 | 166.172992 59.49513 | 112.349174 168.31988 | 98.778549 170.6385

202.7356144 76.41366387 | 195.793659 79.14297 | 102.559628 142.22789 | 239.078243 181.25253

193.5498314 70.38806409 | 160.224158 73.59232 | 132.291001 144.12965 | 142.265443 240.25743

164.6338182 53.93998559 | 172.031384 92.31077 | 239.945324 165.63614 | 299.776329 241.73492

195.8784235 80.3283904 | 189.989563 55.90245 | 238.211163 196.86892 | 96.537169 326.67356

197.9796412 81.09200337 | 106.900354 165.3225 | 137.050508 247.99492 | 290.446734 71.594537

171.0991433 69.93896589 | 96.292754 147.0986 | 147.480377 232.51995

158.0455791 73.13733714 | 110.077946 147.4216 | 278.149385 310.43626

167.9566124 89.69758466 | 114.99971 151.4545 | 106.978795 322.56198

162.402737 74.04729564 | 134.63016 138.35 | 86.095542 330.78514

180.0623559 64.92612878 | 129.951841 149.9093 | 260.092535 70.635186

193.5651531 54.91848737 | 117.797993 171.3173 | 305.544689 46.601105

167.0917808 61.48766111 | 239.22731 161.3595 | 305.702979 97.547319

188.8247858 87.49271938 | 226.118832 188.5736 | 292.850153 230.32031

173.3717278 93.5637015 | 222.426235 168.0075 | 328.329449 184.44819

189.9956312 57.91901393 | 235.327043 205.4823 | 52.936274 214.40892

172.7436315 80.85050011 | 245.850746 165.365 | 93.756668 184.10616

186.4079048 54.86986105 | 252.277006 167.8126 | 105.704 94.144446

174.7658105 93.67102089 | 253.187613 196.5509 | 60.411768 78.7242

106.7335659 165.4157125 | 156.460007 254.659

93.78156518 158.2716753 | 136.229203 247.991

107.0671423 165.2292331 | 128.699508 239.2955

97.55999842 145.3172039 | 154.11463 235.5789

107.3460249 151.3960818 | 140.846123 229.461

113.5397882 152.9269165 | 142.974101 260.8046

136.9691312 137.3639306 | 120.88972 237.2245

131.4998494 153.2854434 | 265.981118 303.8293

129.4289837 137.3368179 | 292.491996 309.1535

94.33296513 144.2756367 | 282.205436 296.0294

114.808868 168.0962061 | 280.493316 326.4101

123.0469569 174.4868095 | 278.662942 311.712

115.5381537 171.3688195 | 269.061504 315.4833

114.357963 151.3218645 | 96.617666 312.5471

110.1603909 149.5049531 | 90.827076 340.7038

128.4038316 146.5332052 | 114.241393 315.3127

112.7274216 141.3636768 | 84.754761 329.8122

117.101379 150.1147107 | 117.92681 335.7055

99.49648807 140.5296939 | 82.70479 321.8394

137.4923659 140.3491654 | 99.129313 326.6825

235.2346947 162.2671429 | 269.295705 61.01423

221.0634138 196.3193248 | 317.537391 56.33134

239.1552526 159.9851635 | 292.489174 104.9481

218.9283092 169.496724 | 293.551986 36.87087

234.4233497 203.305418 | 250.889365 80.25614

238.394729 161.5411915 | 318.916785 90.14651

225.9241606 166.5181978 | 270.771999 218.9408

248.1528726 164.2897652 | 289.953656 250.0757

220.2353575 187.7092192 | 320.75349 193.944

240.8418064 182.6705168 | 335.905407 174.9524

242.8508888 161.364773 | 318.524202 234.5358

240.5009864 161.638997 | 292.150756 217.7289

253.8990317 166.5638696 | 62.420124 231.0347

253.6184175 194.0966203 | 43.452424 197.7831

252.7568084 199.0051041 | 92.529373 191.1006

222.3540715 191.3795932 | 94.983963 177.1117

226.0995132 184.7891818 | 114.265343 106.6252

243.5486188 166.4402978 | 61.483045 107.5512

236.2307353 207.6592271 | 97.455955 98.69963

250.6549806 169.0613718 | 105.390702 77.1085

153.6266513 264.3702777 | 88.868103 122.937

140.2554482 244.3291167 | 61.503216 44.26813

127.4981005 237.3244286 | 58.249043 84.35325

150.0450694 238.2197243

139.9488918 229.9625585

152.4128685 253.7015052

129.9009162 241.2665011

165.0723569 242.5129241

137.5494144 259.9084716

157.388017 248.9612482

153.8284342 258.0855702

158.1841914 232.9381509

123.5829529 242.5294869

121.7241763 239.8786043

156.4317144 260.3225515

137.6629203 249.5985649

146.6320587 262.0880899

144.7408289 260.4171827

141.7433537 228.9593758

130.7692399 250.0454107

270.2152371 302.4620969

272.177858 306.1354661

295.2478816 314.6564228

282.2274576 303.1310009

281.5348935 326.4685219

279.3628062 312.8095062

291.6605084 306.0408

277.9630781 310.6144287

255.5502594 302.8904372

284.4850095 290.7885228

284.7994825 295.7164115

279.451738 326.3516213

270.3049469 311.9283375

277.3097955 294.4818477

291.3577418 308.7947095

271.100245 314.1428586

265.4750133 315.1515665

272.0883585 314.3626217

266.3389558 321.8309697

291.7018514 307.1219722

93.99414939 312.6508404

88.68348413 340.924915

92.97066779 340.4827158

108.6450879 322.5673166

79.92435981 329.0684224

122.537263 328.2926627

110.6075591 313.373927

80.64741437 320.3693805

116.5127737 339.7257055

104.5108955 331.2226058

87.07737898 328.9883561

99.39594504 324.5342045

93.48109903 324.2907953

83.57849292 329.7692623

84.76216555 323.3094776

114.7303927 339.0982433

88.4388114 331.4226647

117.9328551 310.056421

119.7800703 315.2532746

99.24118189 312.4433646

264.5958928 55.52543732

315.1461845 48.29076454

299.4000389 110.5455044

319.9285974 64.371913

296.6773291 58.69920433

282.4113216 31.95051365

273.9955181 66.50301916

274.787336 95.34290012

260.4228332 77.96874658

321.2433564 85.01493692

239.8905738 80.35811653

301.9841923 109.7890365

253.0161871 76.02233523

299.4793505 24.86013611

293.7851308 104.115055

288.6640596 29.63217691

300.5278703 39.21232171

250.2278647 86.67537928

312.069147 95.20103603

323.4378523 90.22356736

274.1651315 199.4700348

290.7485889 245.0252879

320.9462415 195.4197857

333.9955542 170.8090094

329.4123264 195.6631252

342.1049732 208.4304607

290.6502159 248.8807703

288.4621634 256.3210496

313.718554 227.5113916

326.8414486 229.1041347

289.3432217 220.4176825

271.4961509 213.0431657

316.3975971 196.53059

315.0126038 246.9920203

337.8152603 179.095773

293.2171377 167.177807

294.9582904 215.0402094

272.4049588 225.8314788

265.0217553 237.418322

322.442665 200.4421204

68.83360455 215.1384059

36.98937184 194.9917144

95.12690113 186.1412053

45.78798258 194.24183

53.76184019 218.3447956

95.3531616 175.9992938

55.69049976 246.8171295

93.07948623 190.493174

66.13368371 240.4401933

58.95348222 188.9294226

45.05658057 177.5552126

76.60044508 192.3126568

41.73376966 244.1037793

94.61476488 178.2241594

65.21252494 227.6262047

39.50000902 208.2659874

34.42711922 222.7144593

85.57494506 224.7726419

105.3106608 195.4553072

117.3620314 229.2653928

112.1635718 106.3020536

66.45241694 115.9552987

55.12324246 98.01123497

97.12060662 97.7510742

104.4510167 75.69535689

62.87347605 108.6871273

96.45828527 121.7012639

114.2851887 92.88195385

56.97830472 45.52186784

84.70505196 75.57541713

101.0071849 94.05021798

80.71707671 34.05409407

108.8038105 58.64582851

116.3671141 106.9483625

114.7084426 82.74396325

81.27792014 124.172652

55.03400835 83.2137695

46.81426629 53.22843064

61.46407719 85.49272839

94.24007377 104.2975849

Embodiment three: a multilevel data sampling method based on connected subgraphs comprises the following steps:

(3) for each connected subgraph, compute its median as a sampled point; the set of all sampled points is the result of this round of sampling;

The sampling end condition in step (4) is as follows:

$$ad = \left( \frac{1}{N} \sum_{i=1}^{N} \mathrm{dist}\bigl(X_i, C(X_i)\bigr)^2 \right)^{1/2} \qquad (1)$$

$$pd_i = (ad_{i-1} + ad_1) \left( \frac{N_{i-1}}{N_i} \right)^{1/d} - ad_1, \qquad \forall i > 1 \qquad (2)$$

When a sampling stage A_t satisfies the condition

$$\forall i < t:\ ad_i \le pd_i, \qquad \text{and} \qquad ad_t > pd_t,$$

sampling stops automatically.

Claims

1. A multilevel data sampling method based on connected subgraphs, characterized in that it comprises the following steps:

;

(4) take the sampling result of step (3) as new input data and repeat steps (1) to (3) until the sampling end condition is met, thereby achieving the required multilevel data sampling;

the sampling end condition in step (4) is as follows:

$$ad = \left( \frac{1}{N} \sum_{i=1}^{N} \mathrm{dist}\bigl(X_i, C(X_i)\bigr)^2 \right)^{1/2} \qquad (1)$$

$$pd_i = (ad_{i-1} + ad_1) \left( \frac{N_{i-1}}{N_i} \right)^{1/d} - ad_1, \qquad \forall i > 1 \qquad (2)$$

where d is the dimension of the data, ad_1 is the actual distortion of the first sampling stage, ad_{i-1} is the actual distortion of sampling stage A_{i-1}, N_{i-1} is the number of sampled points of stage A_{i-1}, and N_i is the number of sampled points of stage A_i; i is an integer greater than 1, stage A_1 denotes the first sampling, A_2 the second sampling, ..., and A_i the i-th sampling;

when a sampling stage A_t satisfies the condition

ad_i ≤ pd_i and ad_t > pd_t, that is, when the actual distortion of the t-th sampling is higher than its expected distortion, sampling stops automatically.

2. The multilevel data sampling method based on connected subgraphs according to claim 1, characterized in that K = 1 in step (1).

3. A multilevel data sampling method based on connected subgraphs, characterized in that it comprises the following steps:

;

the sampling end condition in step (4) is as follows:

$$ad = \left( \frac{1}{N} \sum_{i=1}^{N} \mathrm{dist}\bigl(X_i, C(X_i)\bigr)^2 \right)^{1/2} \qquad (1)$$

the expected distortion of the sampling result D_i obtained at sampling stage A_i is defined from the actual distortion of the previous sampling A_{i-1} and the ratio by which the sample shrinks:

$$pd_i = (ad_{i-1} + ad_1) \left( \frac{N_{i-1}}{N_i} \right)^{1/d} - ad_1, \qquad \forall i > 1 \qquad (2)$$

when a sampling stage A_t satisfies the condition:

4. The multilevel data sampling method based on connected subgraphs according to claim 3, characterized in that K = 1 in step (1).