CN106649723B

CN106649723B - Large data set multi-pass random sampling method based on improved pond sampling

Info

Publication number: CN106649723B
Application number: CN201611203570.6A
Authority: CN
Inventors: 许卓明
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2020-09-18
Anticipated expiration: 2036-12-23
Also published as: CN106649723A

Abstract

A multi-pass random sampling method for large data sets, based on improved pond sampling, is provided. The method comprises the following steps: opening a large data set file containing n data records so that the data records can be read from it; inputting the size k of the pond used for random sampling and allocating for the pond a memory space that holds exactly k data records; inputting the number m of random sampling passes, where k × m < n must hold; creating and opening a small data set file so that sampled data records can be written to it; initially marking all data records in the large data set file as "non-sampled"; performing m passes of random sampling, where each pass uses the pond to randomly extract k "non-sampled" data records from the large data set file, appends them to the small data set file, and marks them as "sampled"; and closing the large and small data set files to complete the multi-pass random sampling process. The method is simple and easy to implement and has wide application prospects.

Description

Large data set multi-pass random sampling method based on improved pond sampling

Technical Field

The invention belongs to the technical field of random sampling, relates to a random sampling method and technology of a large data set, and particularly relates to a multi-pass random sampling method of the large data set based on improved pond sampling.

Background

Random sampling is a fundamental technique for many application problems in computer science, statistics and engineering, and is particularly useful for statistically meaningful analysis and processing of large data sets (big datasets). The field of random sampling offers many sampling algorithms (see the authoritative book: Yves Tillé. Sampling Algorithms. Springer Series in Statistics, Springer, New York, 2006), and pond sampling (reservoir sampling) is one of the classic ones (see Section 4.4.5, page 48, and Algorithm 4.4, page 49, of that book). The "pond" is a storage area allocated in computer memory that holds data records during and after the random sampling process. Pond sampling and its related concepts are common knowledge in this technical field.

The traditional pond sampling algorithm was proposed in the 1980s against a typical "small data" background. Big data applications have since become an urgent need, and the traditional algorithm, which performs only a single pass of random sampling, cannot meet their requirements. Suppose a large data set file contains n data records and the pond has size k (i.e., it holds exactly k data records). In practice n is very large while k is comparatively very small, because limited memory resources prevent a running program from allocating much memory to the pond. The traditional pond sampling algorithm can therefore extract only k data records, a very limited number, from the large data set file. Such a small sample can hardly reflect the information content (such as statistical characteristics) of the original large data set file, which inevitably undermines the soundness of any subsequent data analysis.
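For reference, the traditional single-pass algorithm discussed above can be sketched in Python as follows. This is a minimal illustration of the classic technique, not code taken from the patent; the function name `reservoir_sample` and its parameters are assumptions chosen for this example.

```python
import random

def reservoir_sample(records, k, rng=None):
    """Classic single-pass pond (reservoir) sampling over any iterable."""
    if rng is None:
        rng = random.Random()
    pond = []
    for position, record in enumerate(records, start=1):
        if position <= k:
            pond.append(record)            # the first k records fill the pond
        else:
            j = rng.randint(1, position)   # uniform integer in 1..position
            if j <= k:
                pond[j - 1] = record       # happens with probability k / position
    return pond

sample = reservoir_sample(range(1, 1001), k=5, rng=random.Random(0))
print(len(sample))   # 5
```

However large n is, the result holds only k records, which is the limitation the paragraph above describes.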

Therefore, several substantial improvements (detailed in the distinguishing technical features of the present invention described below) must be made to the conventional pond sampling algorithm, so that a single run of the algorithm performs multiple passes of random sampling over a large data set and obtains more random sample data while preserving the "random sampling" property and without increasing the time complexity of the algorithm. The invention also aims to remedy the lack of a multi-pass random sampling method for large data sets and of a complete, practicable technical solution.

Disclosure of Invention

The technical problem to be solved by the invention is to substantially improve the pond sampling algorithm and to provide a large data set multi-pass random sampling method based on improved pond sampling, thereby overcoming the inability of the traditional pond sampling algorithm to perform (multi-pass) random sampling of large data sets.

To solve this technical problem, the invention is realized by the following technical solution:

The invention provides a large data set multi-pass random sampling method based on improved pond sampling, which comprises the following steps:

step S1: opening a large data set file containing n data records so that the data records can be read from it; inputting the size k of the pond, i.e., the memory area used for random sampling, and allocating for the pond a memory space that holds exactly k data records; and inputting the number m of random sampling passes, where the product of k and m must be less than n, i.e., k × m < n;

step S2: creating and opening a small data set file to enable writing of sampled data records therein;

step S3: initially marking all data records in the large data set file as "non-sampled";

step S4: performing m passes of random sampling; in each pass, using the pond to randomly extract k "non-sampled" data records from the large data set file, appending the extracted data records to the small data set file, and marking them as "sampled";

step S5: and closing the large and small data set files to complete the multi-pass random sampling process.

In the method, the step S4 further includes:

using the random sampling pass number as the loop variable of a count-controlled loop, with initial value 1, final value m, and increment 1, the following steps are executed m times:

step S4-1: copying the k "non-sampled" data records at the front of the large data set file into the pond, while marking those data records as "sampled";

step S4-2: randomly replacing, with decreasing probability, certain data records in the pond with certain "non-sampled" data records at the back of the large data set file, while restoring the mark of each replaced data record to "non-sampled" and marking each replacing data record as "sampled";

step S4-3: appending the k data records in the pond, which no longer change, to the tail of the small data set file.

In the method, the step S4-1 further includes:

using the current data record position of the pond as the loop variable of a condition-controlled loop, with initial value 1 and the loop condition that the loop variable is less than or equal to k, the following steps are executed repeatedly, starting with the current data record position value of the large data set file equal to 1:

step S4-1-1: if the current data record of the large data set file is "non-sampled", then the following processing is performed:

copying the current data record in the large data set file to the current data record position in the pond;

recording, for the copied data record in the pond, the position value of the corresponding data record in the large data set file;

marking the copied data record in the large data set file as "sampled";

increasing the current data record position value of the pond by 1;

step S4-1-2: increasing the current data record position value of the large data set file by 1.
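Step S4-1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's own code: the large data set is modelled as an in-memory list, the marks as a list of booleans, positions are 0-based, and the names `fill_pond`, `sampled` and `origin` are hypothetical. The sketch also assumes that at least k "non-sampled" records remain, which the requirement k × m < n guarantees.

```python
def fill_pond(records, sampled, k):
    """Step S4-1: copy the first k "non-sampled" records into the pond.

    Returns (pond, origin, position). origin[i] is the 0-based index in
    `records` of pond[i]; position is the first index not yet examined.
    """
    pond, origin = [], []
    position = 0
    while len(pond) < k:                  # pond not yet full
        if not sampled[position]:         # S4-1-1: only "non-sampled" records
            pond.append(records[position])
            origin.append(position)       # remember where the record came from
            sampled[position] = True      # mark it "sampled"
        position += 1                     # S4-1-2: advance in the large data set
    return pond, origin, position

records = ["a", "b", "c", "d", "e", "f"]
# "b" and "d" were taken by an earlier pass in this made-up example.
sampled = [False, True, False, True, False, False]
pond, origin, position = fill_pond(records, sampled, k=2)
print(pond, origin, position)   # ['a', 'c'] [0, 2] 3
```

Keeping `origin` alongside the pond is what later allows a displaced record to be marked "non-sampled" again.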

In the method, the step S4-2 further includes:

using the current data record position of the large data set file as the loop variable of a count-controlled loop, with initial value equal to the current data record position value r of the large data set file, final value n, and increment 1, the following steps are executed (n − r + 1) times:

step S4-2-1: if the current data record of the large data set file is "non-sampled", then the following steps are performed:

step S4-2-1-1: generating a random integer j from 1 to the current data record position value of the large data set file;

step S4-2-1-2: if the random integer j is less than or equal to the size k of the pond, performing the following processing:

using the recorded position values, restoring to "non-sampled" the mark of the data record in the large data set file that corresponds to the data record at position j in the pond;

replacing the data record at position j in the pond with the current data record in the large data set file;

recording, for the replacing data record now in the pond, the position value of the corresponding data record in the large data set file;

marking the data record in the large data set file that was used for the replacement as "sampled".
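Step S4-2 can be sketched in Python as follows. As before, this is an illustration under assumptions rather than the patent's own code: the data set is an in-memory list, positions are 0-based (so the 1-based position value of the text is `current + 1`), and the names `replace_phase`, `sampled`, `origin` and `start` are hypothetical.

```python
import random

def replace_phase(records, sampled, pond, origin, start, rng):
    """Step S4-2: scan records[start:] and randomly replace pond entries in place."""
    k = len(pond)
    for current in range(start, len(records)):
        if sampled[current]:                  # S4-2-1: skip "sampled" records
            continue
        j = rng.randint(1, current + 1)       # S4-2-1-1: 1..current position value
        if j <= k:                            # S4-2-1-2
            sampled[origin[j - 1]] = False    # displaced record is "non-sampled" again
            pond[j - 1] = records[current]    # replace the record at pond position j
            origin[j - 1] = current           # record the new source position
            sampled[current] = True           # replacing record is now "sampled"

records = list(range(100, 120))
sampled = [True] * 3 + [False] * 17           # state after filling a pond of size 3
pond, origin = records[:3], [0, 1, 2]
replace_phase(records, sampled, pond, origin, start=3, rng=random.Random(7))
print(sum(sampled))   # 3: exactly the records currently in the pond are marked
```

Because a displaced record reverts to "non-sampled", only the k records finally left in the pond stay marked at the end of the pass.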

The beneficial technical effects of the invention mainly comprise four aspects: (1) the defect that the traditional pond sampling algorithm cannot be suitable for large data set (multi-pass) random sampling is overcome; (2) an effective large data set random sampling method is provided; (3) the provided method is simple and easy to implement; (4) the method has wide application prospect in the fields of big data analysis and the like.

The following further describes embodiments of the present invention with reference to the accompanying drawings. Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

Fig. 1 is a flow chart of steps of a large data set multi-pass random sampling method based on improved pond sampling according to the technical scheme of the invention;

Fig. 2 is a schematic diagram of a large data set multi-pass random sampling processing procedure in a large data set multi-pass random sampling method based on improved pond sampling according to the technical scheme of the invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar concepts, objects, elements, etc., or concepts, objects, elements, etc., having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and in the relevant art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

According to the present invention, there is provided a large data set multi-pass random sampling method based on improved pond sampling which, as shown in fig. 1, comprises the following steps S1 to S5:

Step S1: opening a large data set file containing n data records so that the data records can be read from it; inputting the size k of the pond, i.e., the memory area used for random sampling, and allocating for the pond a memory space that holds exactly k data records; and inputting the number m of random sampling passes, where the product of k and m must be less than n, i.e., k × m < n.

Step S2: a small data set file is created and opened to enable the sampled data records to be written therein.

Step S3: all data records in the large data set file are initially marked as "non-sampled".

Step S4: performing m passes of random sampling; in each pass, the pond is used to randomly extract k "non-sampled" data records from the large data set file, the extracted data records are appended to the small data set file, and they are marked as "sampled". The step S4 further includes:

using the random sampling pass number as the loop variable of a count-controlled loop, with initial value 1, final value m, and increment 1, the following steps S4-1, S4-2 and S4-3 are executed m times:

Step S4-1: copying the k "non-sampled" data records at the front of the large data set file into the pond, while marking those data records as "sampled". The step S4-1 further includes:

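using the current data record position of the pond as the loop variable of a condition-controlled loop, with initial value 1 and the loop condition that the loop variable is less than or equal to k, the following steps S4-1-1 and S4-1-2 are executed repeatedly, starting with the current data record position value of the large data set file equal to 1:

Step S4-1-1: if the current data record of the large data set file is "non-sampled", then the following processing is performed:

copying the current data record in the large data set file to the current data record position in the pond;

recording, for the copied data record in the pond, the position value of the corresponding data record in the large data set file;

marking the copied data record in the large data set file as "sampled";

increasing the current data record position value of the pond by 1;

Step S4-1-2: increasing the current data record position value of the large data set file by 1.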

Step S4-2: randomly replacing, with decreasing probability, certain data records in the pond with certain "non-sampled" data records at the back of the large data set file, while restoring the mark of each replaced data record to "non-sampled" and marking each replacing data record as "sampled". The step S4-2 further includes:

using the current data record position of the large data set file as the loop variable of a count-controlled loop, with initial value equal to the current data record position value r of the large data set file, final value n, and increment 1, the following step S4-2-1 is executed (n − r + 1) times:

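Step S4-2-1: if the current data record of the large data set file is "non-sampled", the following steps S4-2-1-1 and S4-2-1-2 are performed:

Step S4-2-1-1: generating a random integer j from 1 to the current data record position value of the large data set file;

Step S4-2-1-2: if the random integer j is less than or equal to the size k of the pond, performing the following processing:

using the recorded position values, restoring to "non-sampled" the mark of the data record in the large data set file that corresponds to the data record at position j in the pond;

replacing the data record at position j in the pond with the current data record in the large data set file;

recording, for the replacing data record now in the pond, the position value of the corresponding data record in the large data set file;

marking the data record in the large data set file that was used for the replacement as "sampled".

Step S4-3: appending the k data records in the pond, which no longer change, to the tail of the small data set file.

Step S5: closing the large and small data set files to complete the multi-pass random sampling process.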

The following further describes the specific implementation of the technical solution of the present invention by a preferred embodiment, and shows the beneficial technical effects of the present invention. The following pseudo code description provides a program code implementation of the technical solution of the present invention:

In this pseudo code description, the bold-faced statements have the conventional meanings well known in computer science and related fields. For example, create/open/close file denotes creating, opening or closing a file; for ... do ... end for denotes a count-controlled loop; while ... do ... end while denotes a condition-controlled loop; and if ... then ... end if denotes a conditional statement. The variables and symbols used in the pseudo code description and their meanings are shown in Table 1.

TABLE 1 variables or symbols in pseudo code and their meanings

Further, the processing procedure of multi-pass random sampling of the large data set (i.e., step S4), which is the core technical feature of the technical solution of the present invention, is explained visually below with reference to fig. 2. As shown in fig. 2, step S4 performs m passes of random sampling, provided that k × m < n (in practical applications, k × m is usually much smaller than n). In each pass, k "non-sampled" data records are randomly extracted from the large data set file (see the left part of fig. 2) using the pond (see the middle part of fig. 2) and appended to the small data set file (see the right part of fig. 2), while they are marked as "sampled". As shown by the arrows, explanatory text and legend in fig. 2, the specific processing procedure of step S4 is as follows:

Step S4-1: the k "non-sampled" data records at the front of the large data set file (for the meaning of "front", see the description at point 2 in fig. 2) are copied into the pond, while those data records are marked as "sampled" (as indicated by the solid arrows and text labels below the left-middle portion of fig. 2).

Step S4-2: certain data records in the pond are randomly replaced, with decreasing probability, with certain "non-sampled" data records at the back of the large data set file (for the meaning of "back", see the description at point 2 in fig. 2), while the replaced data records are marked "non-sampled" again and the replacing data records are marked as "sampled" (as indicated by the solid arrows and text labels above the left-middle portion of fig. 2).

Step S4-3: the k data records in the pond, which no longer change, are appended to the tail of the small data set file (as indicated by the solid arrows and text labels below the middle-right portion of fig. 2).

It is further noted that, as indicated by the text labels in FIG. 2, j in step S4-2 is a random integer from 1 to the current data record position value of the large data set file; only when j ≤ k is the data record at position j in the pond replaced by a "non-sampled" data record from the back of the large data set file (see step S4-2-1-2 or the condition test at Step 16 of the pseudo code). Because the current data record position value of the large data set file keeps increasing during processing, the probability that this condition is satisfied and a data record is randomly replaced gradually decreases. These features are the core of the improved pond sampling algorithm of the present invention, which still maintains the "random sampling" property.
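The decreasing probability can be made concrete with a short calculation. If j is drawn uniformly from 1 to the current position value p, the condition j ≤ k holds with probability k / p whenever p > k. The Python sketch below evaluates this for a few positions; the function name `replacement_probability` and the chosen values of k and p are illustrative assumptions.

```python
from fractions import Fraction

def replacement_probability(k, position):
    """Probability that a uniform integer j in 1..position satisfies j <= k."""
    return Fraction(min(k, position), position)

k = 10
for position in (11, 100, 1_000, 1_000_000):
    print(position, replacement_probability(k, position))
# 11 10/11
# 100 1/10
# 1000 1/100
# 1000000 1/100000
```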

After the above processing procedure has been executed m times, m passes of random sampling over the large data set file are complete, and the k × m data records randomly drawn from the large data set file are stored in the small data set file as the sampling result.
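Steps S1 to S5 can be sketched end to end in Python as follows. This is not the pseudo code referred to above, which is not reproduced in this text, but a minimal illustration under several assumptions: the large data set file is a text file with one newline-terminated record per line; record positions are resolved through an in-memory list of byte offsets; the "sampled" marks are kept in a bytearray (the text does not say how marks are stored); and the names `multi_pass_sample_file`, `seed`, `offsets` and `origin` are hypothetical.

```python
import random

def multi_pass_sample_file(large_path, small_path, k, m, seed=None):
    """Sketch of steps S1-S5 for a text file with one record per line."""
    rng = random.Random(seed)
    # S1 and S2: open the large file for reading, create the small file.
    with open(large_path, "rb") as large, open(small_path, "wb") as small:
        offsets = []                          # byte offset of every record
        while True:
            offset = large.tell()
            if not large.readline():
                break
            offsets.append(offset)
        n = len(offsets)
        if k < 1 or m < 1 or k * m >= n:
            raise ValueError("requires k >= 1, m >= 1 and k * m < n")

        sampled = bytearray(n)                # S3: 0 = "non-sampled", 1 = "sampled"

        def read(position):                   # position is 0-based here
            large.seek(offsets[position])
            return large.readline()

        for _ in range(m):                    # S4: m sampling passes
            pond, origin, position = [], [], 0
            while len(pond) < k:              # S4-1: fill the pond from the front
                if not sampled[position]:
                    pond.append(read(position))
                    origin.append(position)
                    sampled[position] = 1
                position += 1
            for current in range(position, n):        # S4-2: random replacement
                if not sampled[current]:
                    j = rng.randint(1, current + 1)   # 1..current position value
                    if j <= k:
                        sampled[origin[j - 1]] = 0    # displaced record is free again
                        pond[j - 1] = read(current)
                        origin[j - 1] = current
                        sampled[current] = 1
            small.writelines(pond)            # S4-3: append the final pond
    # S5: both files are closed when the with-block exits.
```

Calling, for example, `multi_pass_sample_file("big.txt", "small.txt", k=1000, m=20, seed=42)` on a file with more than 20,000 lines would write 20,000 distinct lines to `small.txt`; the file names here are placeholders. The offset index costs memory proportional to n, so a production version might store the marks and offsets differently.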

The conventional pond (reservoir) sampling algorithm has linear time complexity O(n). The improved pond sampling algorithm of the present invention has time complexity O(m × n) = O(n), since m is a constant much smaller than n, and is therefore still linear.

In summary, it can be understood from the foregoing technical solutions and the specific embodiments (including the preferred embodiments) of the present invention that the beneficial technical effects of the present invention mainly include four aspects: (1) the defect that the traditional pond sampling algorithm cannot be suitable for large data set (multi-pass) random sampling is overcome; (2) an effective large data set random sampling method is provided; (3) the provided method is simple and easy to implement; (4) the method has wide application prospect in the fields of big data analysis and the like.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A large data set multi-pass random sampling method based on improved pond sampling, comprising the following steps:

step S5: closing the large and small data set files to complete the random sampling process for multiple times;

the step S4 further includes:

step S4-3: adding k data records which are not changed any more in the pond to the tail part of the small data set file;

the step S4-1 further includes:

marking the copied data record in the large data set file as 'sampled';

increasing the current data record position value of the pond by 1;

step S4-1-2: increasing the current data record position value of the large data set file by 1;

the step S4-2 further includes:

according to the recorded position value, in the large data set file, of the data record corresponding to each data record in the pond,

restoring to "non-sampled" the mark of the data record in the large data set file that corresponds to the data record at position j in the pond;