CN110618986A - Big data statistical sampling method and device, server and storage medium - Google Patents

Big data statistical sampling method and device, server and storage medium Download PDF

Info

Publication number
CN110618986A
CN110618986A (application CN201910836126.5A)
Authority
CN
China
Prior art keywords
data
sampling
index
test
variance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910836126.5A
Other languages
Chinese (zh)
Inventor
柯瑞强
李宏强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Crystal Ball Education Information Technology Co Ltd
Original Assignee
Crystal Ball Education Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Crystal Ball Education Information Technology Co Ltd filed Critical Crystal Ball Education Information Technology Co Ltd
Priority to CN201910836126.5A priority Critical patent/CN110618986A/en
Publication of CN110618986A publication Critical patent/CN110618986A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a big data statistical sampling method, device, server and storage medium. The method comprises: performing verification processing on original data according to a preset verification flow to obtain preliminary overall data, the preset verification flow comprising removing or marking data with integrity defects during data integrity checking and, during missing data processing, removing some defective data or converting and retaining it according to business requirements and algorithm requirements; sampling the preliminary overall data according to a preset sampling flow to obtain sampling result data, the sampling flow comprising index dimension analysis, index dimension sampling and sampling data synthesis; and performing sampling tests on the sampling result data, the sampling tests comprising a mean variance test, a variance difference test and a distribution difference test. The invention can improve sampling precision and has a wider application range.

Description

Big data statistical sampling method and device, server and storage medium
Technical Field
The invention relates to the technical field of big data statistics, in particular to a big data statistical sampling method, a big data statistical sampling device, a server and a storage medium.
Background
Data sampling is a frequent requirement in data processing, and in fields related to mathematical statistics in particular it is an unavoidable and important processing step. Traditionally, sampling has rarely received dedicated, in-depth study: it is usually handled simply with random-number-based algorithms, which are considered adequate as long as the sampled result does not deviate too much from expectation. Where the precision of a single pass is insufficient, the conventional practice is to sample several times and select the sample whose parameters have relatively good precision.
In recent years, with the development of big data technology and of the various complex theories and calculation algorithms related to it, the shortcomings of traditional data sampling methods have become increasingly apparent, and dedicated research into higher-precision sampling methods has become a practical necessity.
Based on the above, it is necessary to study the sampling method specifically and to design a general big data statistical sampling method that can adapt to as many data statistical sampling scenarios as possible, provide higher-precision mathematical statistical parameters, and meet the data processing requirements of various practical scenarios.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a big data statistical sampling method, device, server and storage medium. The method uses and combines the traditional sampling methods that are common knowledge, namely simple random sampling, systematic sampling and stratified sampling; some of its processing steps are almost identical to systematic sampling, and in a one-dimensional scenario in which the original data are sorted by the dimension value it reduces to systematic sampling. The invention, however, goes further and addresses more general scenario requirements; that is, it can improve sampling precision and has a wider application range.
An embodiment of the present invention provides a big data statistical sampling method, including:
performing verification processing on original data according to a preset verification flow to obtain preliminary overall data; the preset verification flow comprises: removing or marking data with integrity defects during data integrity checking; and, during missing data processing, removing some defective data or converting and retaining it according to business requirements and algorithm requirements;
performing sampling processing on the preliminary overall data according to a preset sampling flow to obtain sampling result data; the sampling flow comprises: index dimension analysis, index dimension sampling and sampling data synthesis;
performing sampling tests on the sampling result data; the sampling tests comprise a mean variance test, a variance difference test and a distribution difference test.
Wherein, the index dimension analysis comprises:
acquiring the index dimensions contained in the preliminary overall data according to business requirements and the characteristics of the later-stage algorithms; assuming that the preliminary overall data contain X pieces of data and that the sampling target is to extract Y pieces of data, where X is far larger than Y, the result of the index dimension analysis is M index dimensions, denoted W1, W2, W3, ..., WM.
Wherein the index dimension sampling comprises:
sorting the preliminary overall data separately according to each of the M index dimensions, wherein after each sorting the overall data are equally divided into Y/M segments;
acquiring one piece of data from each segment at a fixed interval from beginning to end, specifically:
first, the X pieces of data are sorted according to index dimension W1, and Y/M pieces of data are extracted from the sorted X pieces; after this first extraction, X-Y/M pieces of data remain; a second extraction is then performed: the remaining X-Y/M pieces are sorted according to index dimension W2, the sorted result is equally divided into Y/M segments, and Y/M pieces of data are again extracted at the fixed interval, completing the second extraction; proceeding in the same way, after the M index dimensions have all been processed, M groups of sampling data, each containing Y/M pieces, are generated;
the sample data synthesis comprises:
and synthesizing the M groups of sampling data into Y pieces of sampling data, namely the processed sampling result data.
Wherein the mean variance test is a mean variance test performed between the sampling result data and the preliminary overall data; if there are M numerical-calculation-type index dimension fields, M corresponding mean variance tests are required;
the variance difference test adopts a chi-square test; likewise, if there are M numerical-calculation-type index dimension fields, M corresponding chi-square tests are required;
and the distribution difference test adopts a chi-square test: the data are divided into several segments according to each numerical-calculation-type index dimension, and an overall-versus-sample difference test is performed on the data in the different value segments.
An embodiment of the present invention further provides a big data statistical sampling device, including:
the verification processing unit is used for performing verification processing on original data according to a preset verification flow to obtain preliminary overall data; the preset verification flow comprises: removing or marking data with integrity defects during data integrity checking; and, during missing data processing, removing some defective data or converting and retaining it according to business requirements and algorithm requirements;
the sampling processing unit is used for performing sampling processing on the preliminary overall data according to a preset sampling flow to obtain sampling result data; the sampling flow comprises: index dimension analysis, index dimension sampling and sampling data synthesis;
the sampling inspection unit is used for performing sampling tests on the sampling result data; the sampling tests comprise a mean variance test, a variance difference test and a distribution difference test.
An embodiment of the present invention further provides a big data statistics sampling server, including:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the big data statistical sampling method described above.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when run, controls the apparatus on which the storage medium is located to execute the big data statistical sampling method described above.
The embodiment of the invention has the following beneficial effects:
In light of the above embodiments, the big data statistical sampling method, device, server and storage medium provided by the present invention are mainly suited to scenarios of statistical analysis and sampling of big data. When the overall data volume is too large and the big data population needs to be sampled, the traditional approach is basically to combine the business requirements with the characteristics of the data algorithm and to perform the sampling with a random-number algorithm. The greatest value of the present method is that it is simple and easy to implement, and that the precision of the sampling result and of the related statistical indexes relative to the sample population is very high, generally 1-2 orders of magnitude higher than that of the traditional random sampling method.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a big data statistical sampling method according to an embodiment of the present invention;
FIG. 2 is another flow chart of a big data statistical sampling method according to an embodiment of the present invention;
FIG. 3 is a statistical chart of the data verification and sampling results provided by an embodiment of the present invention;
FIG. 4 is a diagram illustrating the result of the mean variance test provided by an embodiment of the present invention;
FIG. 5 is a diagram illustrating the results of variance difference tests provided by one embodiment of the present invention;
FIG. 6 is a table of data for distribution variance testing provided by an embodiment of the present invention;
FIG. 7 is a graph illustrating the results of a distribution variance test provided by one embodiment of the present invention;
fig. 8 is a schematic structural diagram of a big data statistical sampling apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to FIGS. 1-7. FIG. 2 mainly comprises three parts, from top to bottom:
the upper part of the figure is the first part and mainly describes the four types of data involved in the whole method flow: the original data, the original data after verification processing, the intermediate sampling data, and the final sampling result;
the middle part of the figure is the second part and mainly describes the whole flow, which comprises three stages: data verification processing, sampling processing and test processing; the core is the sampling processing, but the data verification and test processing are indispensable steps of the whole flow;
the lower part of the figure is the third part and mainly describes the specific processing steps and contents of the three processing stages. The data verification stage comprises integrity checking and missing data processing; the former amounts to a formal check, the latter to a substantive check and processing. The sampling processing stage comprises three processes, namely index dimension analysis, index dimension sampling and sampling data synthesis, where the index dimension analysis amounts to data preprocessing and the index dimension sampling and sampling data synthesis are processes that are looped or iterated step by step on the basis of that preprocessing; for ordinary single-dimensional sample data only one round of processing is needed. The sampling inspection stage mainly performs three conventional types of test on the sampling result, namely the mean variance test, the variance difference test and the distribution difference test, to verify the validity of the sampled data.
As shown in fig. 1-2, an embodiment of the present invention provides a big data statistical sampling method, including:
s100, according to a preset verification process, verifying the original data to obtain primary overall data; the preset checking process comprises the following steps: removing or marking data with integrity defects during data integrity check processing; and when the missing data is processed, removing or converting and reserving part of the defective data according to business requirements and algorithm requirements.
In a specific embodiment, two types of basic processing are generally performed during the data verification processing: data integrity check processing and missing data processing; in the data integrity checking process, the integrity of the data is generally checked, and the data with integrity defects is removed or marked; in the case of missing data processing, some defective data is removed or converted to be retained according to business requirements and algorithm requirements. And (4) finishing the data verification of the original data to obtain a result, namely primary overall data.
S200, sampling the preliminary overall data according to a preset sampling flow to obtain sampling result data; the sampling process comprises the following steps: index dimension analysis, index dimension sampling and sampling data synthesis.
Wherein, the index dimension analysis comprises:
acquiring the index dimensions contained in the preliminary overall data according to business requirements and the characteristics of the later-stage algorithms; assuming that the preliminary overall data contain X pieces of data and that the sampling target is to extract Y pieces of data, where X is far larger than Y, the result of the index dimension analysis is M index dimensions, denoted W1, W2, W3, ..., WM.
In a specific embodiment, the preliminary overall data are analyzed and the index dimensions contained in the original overall data are sorted out according to business requirements and the characteristics of the later-stage algorithms. In general most index dimensions are numerical fields, but other types of field exist for some special analysis requirements, for example in time-segmented or stratified classification algorithms.
Wherein the index dimension sampling comprises:
sorting the preliminary overall data separately according to each of the M index dimensions, wherein after each sorting the overall data are equally divided into Y/M segments;
acquiring one piece of data from each segment at a fixed interval from beginning to end, specifically:
first, the X pieces of data are sorted according to index dimension W1, and Y/M pieces of data are extracted from the sorted X pieces; after this first extraction, X-Y/M pieces of data remain; a second extraction is then performed: the remaining X-Y/M pieces are sorted according to index dimension W2, the sorted result is equally divided into Y/M segments, and Y/M pieces of data are again extracted at the fixed interval, completing the second extraction; proceeding in the same way, after the M index dimensions have all been processed, M groups of sampling data, each containing Y/M pieces, are generated;
the sample data synthesis comprises:
and synthesizing the M groups of sampling data into Y pieces of sampling data, namely the processed sampling result data.
In a specific embodiment, the overall sample data are sorted separately according to each of the M index dimensions (sorting being the simplest and most classical computer data processing algorithm); after each sorting the overall data are divided into Y/M segments, and one piece of data is acquired from each segment at a fixed interval from beginning to end (the fixed interval, i.e. the step, must be constant, otherwise the quality of the sampled data will not be high). First, the X pieces of data are sorted according to index dimension W1, and Y/M (rounded) pieces of data are extracted from the sorted X pieces; after this first extraction, X-Y/M pieces of data remain. A second extraction is then performed: the remaining X-Y/M pieces are sorted according to index dimension W2, the sorted result is divided into Y/M segments, Y/M (rounded) pieces are again extracted at the fixed interval, and the second extraction is complete. Proceeding in the same way, after all M index dimensions have been processed, M groups of sampling data (without repetition), each containing Y/M pieces, are generated;
Sampling data synthesis: the M groups of sampling data are combined into Y pieces of sampling data, i.e. the processed sampling result data.
S300, sampling inspection is carried out on the sampling result data; the sampling tests include mean variance test, variance difference test, and distribution difference test.
Wherein the mean variance test is a mean variance test performed between the sampling result data and the preliminary overall data; if there are M numerical-calculation-type index dimension fields, M corresponding mean variance tests are required;
the variance difference test adopts a chi-square test; likewise, if there are M numerical-calculation-type index dimension fields, M corresponding chi-square tests are required;
and the distribution difference test adopts a chi-square test: the data are divided into several segments according to each numerical-calculation-type index dimension, and an overall-versus-sample difference test is performed on the data in the different value segments.
In particular embodiments, after sampling is completed, three types of sample inspection analysis are typically performed: mean variance test, variance difference test, distribution difference test.
Mean variance test: a mean variance test is performed between the sample data and the preliminary overall data using a single-sample Z test (Z statistic = (sample mean - population mean) / (population standard deviation / square root of the sample size)). Each numerical-type index dimension can be tested with a single-sample Z test; assuming there are M numerical-type index dimension field types, M tests are required. The tests must come out "not significant", and the Z value should be very low (high precision).
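A minimal numerical sketch of the single-sample Z test as stated above (illustrative only; the function name single_sample_z and the use of Python's statistics module are assumptions, not part of the patent):

import math
import statistics

def single_sample_z(sample, population):
    # Z = (sample mean - population mean) / (population standard deviation / sqrt(sample size))
    n = len(sample)
    pop_mean = statistics.mean(population)
    pop_std = statistics.pstdev(population)        # population standard deviation
    z = (statistics.mean(sample) - pop_mean) / (pop_std / math.sqrt(n))
    # |Z| < 1.96 corresponds to "not significant" at the 0.05 level (two-sided).
    return z, abs(z) < 1.96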
Variance difference test: a chi-square test is adopted here, with the statistic calculated as: chi-square statistic = sample size x sample variance / population variance. Likewise, if there are M numerical-calculation-type index dimension fields, M corresponding chi-square tests are required, and the conclusion of each test must be "not significant".
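An illustrative sketch of the variance difference statistic exactly as stated above (sample size x sample variance / population variance); note that this follows the patent's formula rather than the more common (n-1) form, and the function name is hypothetical:

import statistics

def variance_chi_square(sample, population):
    # chi-square statistic = sample size * sample variance / population variance
    return len(sample) * statistics.pvariance(sample) / statistics.pvariance(population)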
Distribution difference test: a chi-square test is adopted; the data are divided into several segments (generally equal segments) according to each numerical-calculation-type index dimension, an overall-versus-sample difference test is performed on the data in the different value segments, and the test conclusion must be "not significant".
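A sketch of this segment-wise chi-square comparison between sample and population, assuming numerical values in [0, 100] and 5-point left-closed, right-open segments as in the worked example below; the names and the scipy dependency are assumptions:

from scipy.stats import chi2

def distribution_chi_square(sample, population, low=0, high=100, width=5):
    # Count how many values fall into each equal value segment.
    edges = list(range(low, high, width)) + [high]   # [0,5), [5,10), ..., [95,100]
    n_bins = len(edges) - 1
    def counts(values):
        c = [0] * n_bins
        for v in values:
            c[min(int((v - low) // width), n_bins - 1)] += 1   # upper limit folds into last segment
        return c
    obs, pop = counts(sample), counts(population)
    scale = len(sample) / len(population)            # expected segment counts at the sample size
    stat = sum((o - p * scale) ** 2 / (p * scale) for o, p in zip(obs, pop) if p > 0)
    critical = chi2.ppf(0.95, n_bins - 1)            # e.g. 30.14 for 19 degrees of freedom
    return stat, stat < critical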
Once all sampling and inspection processes are complete, the sampling result can be used in subsequent big data statistical analysis, calculation and other processing.
The technical solutions in the embodiments of the present invention will be further clearly and completely described below with reference to practical examples.
Assume there is a batch of test data from a certain type of application, with a total data volume of 322,690 pieces; the data used here are single-dimensional sample data. For comparison with the traditional sampling method, the data are divided into two types, A and B, with 161,526 and 161,164 pieces respectively. The sampling target is 107,552 pieces (roughly 1 piece out of every 3);
first, data verification is performed; the number of invalid pieces of data is determined to be 33, of which 7 are of type A and 26 of type B;
for the type A and type B data, the traditional random sampling method is applied several times and the result with the better statistical indexes is selected;
for the whole population, the present method is used for sampling. Because the sample data are single-dimensional, the index dimension analysis is relatively simple: there is only one dimension, the data type is a computational numerical type, and the values lie between 0 and 100. All the overall data are sorted by the single-dimensional index value and divided into 107,552 equal segments (each segment containing 3 pieces of data, with the last segment possibly containing fewer than 3), and 107,552 pieces of data are extracted starting from the 1st (or 2nd or 3rd) piece with a step of 3. The detailed data verification and sampling statistics are shown in FIG. 3.
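For this single-dimension case, the extraction reduces to a systematic selection after sorting. A tiny runnable illustration with synthetic stand-in values (the real score data are not reproduced here):

import random

scores = [random.uniform(0, 100) for _ in range(322657)]   # stand-in for the 322,657 valid values
sample = sorted(scores)[0::3][:107552]                     # sort by the single index, take every 3rd piece
print(len(sample))                                         # 107552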
Sampling inspection is then performed: the mean variance test, the variance difference test and the distribution difference test are carried out in turn;
1. Mean variance test: processed as a single-sample Z test;
(Z statistic: (sample mean - population mean) / (population standard deviation / square root of the sample size))
The test results are shown in FIG. 4; the calculated Z value is less than 1.96, so the conclusion is that the difference is not significant.
2. Variance difference test: a chi-square test is adopted;
(chi-square statistic: sample size x sample variance / population variance)
The results of the test are shown in fig. 5.
3. Distribution difference test: a chi-square test is adopted.
The full-score range is divided into equal score segments of 5 points each, with the upper limit of 100 included in the last segment, and an overall-versus-sample difference test is performed on the different value segments.
Here the segments are treated in a left-closed, right-open manner; the corresponding data table is shown in FIG. 6.
Looking up the chi-square critical value table, with 19 degrees of freedom the critical value at P = 0.05 is 30.14. When calculating the chi-square statistic, two results are obtained depending on the principles and methods used to process the data; the specific calculation results are shown in FIG. 7.
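For reference, the critical value quoted above can be reproduced programmatically (assuming scipy is available; this is not part of the patent):

from scipy.stats import chi2
print(round(chi2.ppf(0.95, 19), 2))   # 30.14, the P = 0.05 critical value at 19 degrees of freedom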
The conclusion is that the difference is not significant.
From the above test process and results, it is clear that the precision of the sample statistical test indexes obtained with this sampling method is improved by 1-2 orders of magnitude: the Z value of the mean variance test is improved by one order of magnitude or more, the statistic of the variance difference test is basically of the same order of magnitude, and the index of the distribution difference test is improved by about two orders of magnitude.
The above implementation steps and tests use real measured data as an example, and the calculation and test results verify the validity of the method from the standpoint of measured data.
In summary, the big data statistical sampling method provided by the embodiment of the present invention is mainly used in fields related to the statistical analysis and processing of big data. The method originates from the various data calculation processes, related to classical test theory (CTT) and item response theory (IRT), that are applied to large amounts of data in the field of education in practice. Generally, when organizing statistical sampling of data, the sampling method mainly considers factors determined by business rules and the related algorithms; in practical application scenarios, once the business rules and algorithm factors have been accounted for, random sampling is mostly used. Practical experience shows that when completely random sampling is used and the sample is tested after sampling, the result conforms to the general rules and requirements of mathematical statistics, but the precision of the test result is not ideal. The present method provides an accurate and controllable sampling method whose result, while still satisfying the general rules and requirements of mathematical statistics, is improved by 1-2 orders of magnitude, so that the precision of the sampling result under test is improved in principle, providing a nearly ideal method for big data sampling. The method is simple, easy to implement and convenient to apply, and is easy for front-line personnel engaged in statistical analysis and processing of all kinds of data to understand, master and use. It also lends itself to implementation with modern IT technology, in particular big data and cloud computing technology, to achieve low-cost and efficient sampling of big data, and it provides solid methodological support for various application scenarios of big data statistical analysis and processing.
Referring to fig. 8, an embodiment of the present invention further provides a big data statistical sampling apparatus, including:
The verification processing unit 10 is configured to perform verification processing on the original data according to a preset verification flow to obtain preliminary overall data; the preset verification flow comprises: removing or marking data with integrity defects during data integrity checking; and, during missing data processing, removing some defective data or converting and retaining it according to business requirements and algorithm requirements.
In a specific embodiment, two types of basic processing are generally performed during data verification: data integrity checking and missing data processing. In the data integrity check, the integrity of the data is checked and data with integrity defects are removed or marked; in missing data processing, some defective data are removed, or converted and retained, according to business requirements and algorithm requirements. The result obtained once the data verification of the original data is complete is the preliminary overall data.
The sampling processing unit 20 is configured to perform sampling processing on the preliminary overall data according to a preset sampling flow to obtain sampling result data; the sampling process comprises the following steps: index dimension analysis, index dimension sampling and sampling data synthesis.
Wherein, the index dimension analysis comprises:
acquiring the index dimensions contained in the preliminary overall data according to business requirements and the characteristics of the later-stage algorithms; assuming that the preliminary overall data contain X pieces of data and that the sampling target is to extract Y pieces of data, where X is far larger than Y, the result of the index dimension analysis is M index dimensions, denoted W1, W2, W3, ..., WM.
In a specific embodiment, the preliminary overall data are analyzed and the index dimensions contained in the original overall data are sorted out according to business requirements and the characteristics of the later-stage algorithms. In general most index dimensions are numerical fields, but other types of field exist for some special analysis requirements, for example in time-segmented or stratified classification algorithms.
Wherein the index dimension sampling comprises:
sorting the preliminary overall data separately according to each of the M index dimensions, wherein after each sorting the overall data are equally divided into Y/M segments;
acquiring one piece of data from each segment at a fixed interval from beginning to end, specifically:
first, the X pieces of data are sorted according to index dimension W1, and Y/M pieces of data are extracted from the sorted X pieces; after this first extraction, X-Y/M pieces of data remain; a second extraction is then performed: the remaining X-Y/M pieces are sorted according to index dimension W2, the sorted result is equally divided into Y/M segments, and Y/M pieces of data are again extracted at the fixed interval, completing the second extraction; proceeding in the same way, after the M index dimensions have all been processed, M groups of sampling data, each containing Y/M pieces, are generated;
the sample data synthesis comprises:
and synthesizing the M groups of sampling data into Y pieces of sampling data, namely the processed sampling result data.
In a specific embodiment, the overall sample data are sorted separately according to each of the M index dimensions (sorting being the simplest and most classical computer data processing algorithm); after each sorting the overall data are divided into Y/M segments, and one piece of data is acquired from each segment at a fixed interval from beginning to end (the fixed interval, i.e. the step, must be constant, otherwise the quality of the sampled data will not be high). First, the X pieces of data are sorted according to index dimension W1, and Y/M (rounded) pieces of data are extracted from the sorted X pieces; after this first extraction, X-Y/M pieces of data remain. A second extraction is then performed: the remaining X-Y/M pieces are sorted according to index dimension W2, the sorted result is divided into Y/M segments, Y/M (rounded) pieces are again extracted at the fixed interval, and the second extraction is complete. Proceeding in the same way, after all M index dimensions have been processed, M groups of sampling data (without repetition), each containing Y/M pieces, are generated;
Sampling data synthesis: the M groups of sampling data are combined into Y pieces of sampling data, i.e. the processed sampling result data.
A sampling inspection unit 30 for performing sampling inspection on the sampling result data; the sampling tests include mean variance test, variance difference test, and distribution difference test.
Wherein the mean variance test is a mean variance test performed between the sampling result data and the preliminary overall data; if there are M numerical-calculation-type index dimension fields, M corresponding mean variance tests are required;
the variance difference test adopts a chi-square test; likewise, if there are M numerical-calculation-type index dimension fields, M corresponding chi-square tests are required;
and the distribution difference test adopts a chi-square test: the data are divided into several segments according to each numerical-calculation-type index dimension, and an overall-versus-sample difference test is performed on the data in the different value segments.
In particular embodiments, after sampling is completed, three types of sample inspection analysis are typically performed: mean variance test, variance difference test, distribution difference test.
Mean variance test: a mean variance test is performed between the sample data and the preliminary overall data using a single-sample Z test (Z statistic = (sample mean - population mean) / (population standard deviation / square root of the sample size)). Each numerical-type index dimension can be tested with a single-sample Z test; assuming there are M numerical-type index dimension field types, M tests are required. The tests must come out "not significant", and the Z value should be very low (high precision).
Variance difference test: a chi-square test is adopted here, with the statistic calculated as: chi-square statistic = sample size x sample variance / population variance. Likewise, if there are M numerical-calculation-type index dimension fields, M corresponding chi-square tests are required, and the conclusion of each test must be "not significant".
Distribution difference test: a chi-square test is adopted; the data are divided into several segments (generally equal segments) according to each numerical-calculation-type index dimension, an overall-versus-sample difference test is performed on the data in the different value segments, and the test conclusion must be "not significant".
Once all sampling and inspection processes are complete, the sampling result can be used in subsequent big data statistical analysis, calculation and other processing.
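As an illustration of how the three units described above might be composed into a device, the following sketch uses hypothetical class and method names that are not defined in the patent:

class BigDataSamplingDevice:
    # Sketch: verification processing unit -> sampling processing unit -> sampling inspection unit.
    def __init__(self, verify, sample, inspect):
        self.verify = verify        # verification processing unit (callable)
        self.sample = sample        # sampling processing unit (callable)
        self.inspect = inspect      # sampling inspection unit (callable)

    def run(self, raw_data, dims, target):
        population = self.verify(raw_data)                # preliminary overall data
        result = self.sample(population, dims, target)    # sampling result data
        report = self.inspect(result, population)         # mean variance / variance / distribution tests
        return result, report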
The technical solutions in the embodiments of the present invention will be further clearly and completely described below with reference to practical examples.
Assume there is a batch of test data from a certain type of application, with a total data volume of 322,690 pieces; the data used here are single-dimensional sample data. For comparison with the traditional sampling method, the data are divided into two types, A and B, with 161,526 and 161,164 pieces respectively. The sampling target is 107,552 pieces (roughly 1 piece out of every 3);
first, data verification is performed; the number of invalid pieces of data is determined to be 33, of which 7 are of type A and 26 of type B;
for the type A and type B data, the traditional random sampling method is applied several times and the result with the better statistical indexes is selected;
for the whole population, the present method is used for sampling. Because the sample data are single-dimensional, the index dimension analysis is relatively simple: there is only one dimension, the data type is a computational numerical type, and the values lie between 0 and 100. All the overall data are sorted by the single-dimensional index value and divided into 107,552 equal segments (each segment containing 3 pieces of data, with the last segment possibly containing fewer than 3), and 107,552 pieces of data are extracted starting from the 1st (or 2nd or 3rd) piece with a step of 3. The detailed data verification and sampling statistics are shown in FIG. 3.
Sampling inspection is then performed: the mean variance test, the variance difference test and the distribution difference test are carried out in turn;
1. Mean variance test: processed as a single-sample Z test;
(Z statistic: (sample mean - population mean) / (population standard deviation / square root of the sample size))
The test results are shown in FIG. 4; the calculated Z value is less than 1.96, so the conclusion is that the difference is not significant.
2. Variance difference test: a chi-square test is adopted;
(chi-square statistic: sample size x sample variance / population variance)
The results of the test are shown in fig. 5.
3. Distribution difference test: a chi-square test is adopted.
The full-score range is divided into equal score segments of 5 points each, with the upper limit of 100 included in the last segment, and an overall-versus-sample difference test is performed on the different value segments.
Here the segments are treated in a left-closed, right-open manner; the corresponding data table is shown in FIG. 6.
Looking up the chi-square critical value table, with 19 degrees of freedom the critical value at P = 0.05 is 30.14. When calculating the chi-square statistic, two results are obtained depending on the principles and methods used to process the data; the specific calculation results are shown in FIG. 7.
The conclusion is that the difference is not significant.
From the above test process and results, it is clear that the precision of the sample statistical test indexes obtained with this sampling method is improved by 1-2 orders of magnitude: the Z value of the mean variance test is improved by one order of magnitude or more, the statistic of the variance difference test is basically of the same order of magnitude, and the index of the distribution difference test is improved by about two orders of magnitude.
The above implementation steps and tests use real measured data as an example, and the calculation and test results verify the validity of the method from the standpoint of measured data.
In summary, the big data statistical sampling method provided by the embodiment of the present invention is mainly used in fields related to the statistical analysis and processing of big data. The method originates from the various data calculation processes, related to classical test theory (CTT) and item response theory (IRT), that are applied to large amounts of data in the field of education in practice. Generally, when organizing statistical sampling of data, the sampling method mainly considers factors determined by business rules and the related algorithms; in practical application scenarios, once the business rules and algorithm factors have been accounted for, random sampling is mostly used. Practical experience shows that when completely random sampling is used and the sample is tested after sampling, the result conforms to the general rules and requirements of mathematical statistics, but the precision of the test result is not ideal. The present method provides an accurate and controllable sampling method whose result, while still satisfying the general rules and requirements of mathematical statistics, is improved by 1-2 orders of magnitude, so that the precision of the sampling result under test is improved in principle, providing a nearly ideal method for big data sampling. The method is simple, easy to implement and convenient to apply, and is easy for front-line personnel engaged in statistical analysis and processing of all kinds of data to understand, master and use. It also lends itself to implementation with modern IT technology, in particular big data and cloud computing technology, to achieve low-cost and efficient sampling of big data, and it provides solid methodological support for various application scenarios of big data statistical analysis and processing.
An embodiment of the present invention further provides a big data statistics sampling server, including:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the big data statistical sampling method described above.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when run, controls the apparatus on which the storage medium is located to execute the big data statistical sampling method described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for statistically sampling big data, comprising:
performing verification processing on original data according to a preset verification flow to obtain preliminary overall data; the preset verification flow comprises: removing or marking data with integrity defects during data integrity checking; and, during missing data processing, removing some defective data or converting and retaining it according to business requirements and algorithm requirements;
performing sampling processing on the preliminary overall data according to a preset sampling flow to obtain sampling result data; the sampling flow comprises: index dimension analysis, index dimension sampling and sampling data synthesis;
performing sampling tests on the sampling result data; the sampling tests comprise a mean variance test, a variance difference test and a distribution difference test.
2. The big data statistical sampling method of claim 1, wherein the index dimension analysis comprises:
acquiring the index dimensions contained in the preliminary overall data according to business requirements and the characteristics of the later-stage algorithms; assuming that the preliminary overall data contain X pieces of data and that the sampling target is to extract Y pieces of data, where X is far larger than Y, the result of the index dimension analysis is M index dimensions, denoted W1, W2, W3, ..., WM.
3. The big data statistical sampling method of claim 2, wherein the index dimension sampling comprises:
sorting the preliminary overall data separately according to each of the M index dimensions, wherein after each sorting the overall data are equally divided into Y/M segments;
acquiring one piece of data from each segment at a fixed interval from beginning to end, specifically:
first, the X pieces of data are sorted according to index dimension W1, and Y/M pieces of data are extracted from the sorted X pieces; after this first extraction, X-Y/M pieces of data remain; a second extraction is then performed: the remaining X-Y/M pieces are sorted according to index dimension W2, the sorted result is equally divided into Y/M segments, and Y/M pieces of data are again extracted at the fixed interval, completing the second extraction; proceeding in the same way, after the M index dimensions have all been processed, M groups of sampling data, each containing Y/M pieces, are generated;
the sample data synthesis comprises:
and synthesizing the M groups of sampling data into Y pieces of sampling data, namely the processed sampling result data.
4. The big data statistical sampling method according to claim 3, wherein the mean variance test is a mean variance test performed between the sampling result data and the preliminary overall data; if there are M numerical-calculation-type index dimension fields, M corresponding mean variance tests are required;
the variance difference test adopts a chi-square test; likewise, if there are M numerical-calculation-type index dimension fields, M corresponding chi-square tests are required;
and the distribution difference test adopts a chi-square test: the data are divided into several segments according to each numerical-calculation-type index dimension, and an overall-versus-sample difference test is performed on the data in the different value segments.
5. A big data statistical sampling device, comprising:
the verification processing unit is used for performing verification processing on original data according to a preset verification flow to obtain preliminary overall data; the preset verification flow comprises: removing or marking data with integrity defects during data integrity checking; and, during missing data processing, removing some defective data or converting and retaining it according to business requirements and algorithm requirements;
the sampling processing unit is used for performing sampling processing on the preliminary overall data according to a preset sampling flow to obtain sampling result data; the sampling flow comprises: index dimension analysis, index dimension sampling and sampling data synthesis;
the sampling inspection unit is used for performing sampling tests on the sampling result data; the sampling tests comprise a mean variance test, a variance difference test and a distribution difference test.
6. The big data statistical sampling device of claim 5, wherein the index dimension analysis comprises:
acquiring the index dimensions contained in the preliminary overall data according to business requirements and the characteristics of the later-stage algorithms; assuming that the preliminary overall data contain X pieces of data and that the sampling target is to extract Y pieces of data, where X is far larger than Y, the result of the index dimension analysis is M index dimensions, denoted W1, W2, W3, ..., WM.
7. The big data statistical sampling device of claim 6, wherein the metric dimension sampling comprises:
sorting the preliminary overall data separately according to each of the M index dimensions, wherein after each sorting the overall data are equally divided into Y/M segments;
acquiring one piece of data from each segment at a fixed interval from beginning to end, specifically:
first, the X pieces of data are sorted according to index dimension W1, and Y/M pieces of data are extracted from the sorted X pieces; after this first extraction, X-Y/M pieces of data remain; a second extraction is then performed: the remaining X-Y/M pieces are sorted according to index dimension W2, the sorted result is equally divided into Y/M segments, and Y/M pieces of data are again extracted at the fixed interval, completing the second extraction; proceeding in the same way, after the M index dimensions have all been processed, M groups of sampling data, each containing Y/M pieces, are generated;
the sample data synthesis comprises:
and synthesizing the M groups of sampling data into Y pieces of sampling data, namely the processed sampling result data.
8. The big data statistical sampling device according to claim 7, wherein the mean variance test is a mean variance test performed between the sampling result data and the preliminary overall data; if there are M numerical-calculation-type index dimension fields, M corresponding mean variance tests are required;
the variance difference test adopts a chi-square test; likewise, if there are M numerical-calculation-type index dimension fields, M corresponding chi-square tests are required;
and the distribution difference test adopts a chi-square test: the data are divided into several segments according to each numerical-calculation-type index dimension, and an overall-versus-sample difference test is performed on the data in the different value segments.
9. A big data statistics sampling server, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the big data statistical sampling method of any one of claims 1 to 4.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus on which the storage medium resides to perform the big data statistical sampling method of any one of claims 1 to 4.
CN201910836126.5A 2019-09-04 2019-09-04 Big data statistical sampling method and device, server and storage medium Pending CN110618986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910836126.5A CN110618986A (en) 2019-09-04 2019-09-04 Big data statistical sampling method and device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910836126.5A CN110618986A (en) 2019-09-04 2019-09-04 Big data statistical sampling method and device, server and storage medium

Publications (1)

Publication Number Publication Date
CN110618986A true CN110618986A (en) 2019-12-27

Family

ID=68922307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910836126.5A Pending CN110618986A (en) 2019-09-04 2019-09-04 Big data statistical sampling method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN110618986A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178784A (en) * 2020-01-06 2020-05-19 广州赛意信息科技股份有限公司 Quality inspection management system and method
CN111427875A (en) * 2020-03-19 2020-07-17 广东蔚海数问大数据科技有限公司 Sampling method, system and storage medium for data quality detection
CN112818373A (en) * 2021-02-23 2021-05-18 合肥工业大学 Health big data sampling and checking system and method based on Bayesian network
CN112860741A (en) * 2021-01-18 2021-05-28 平安科技(深圳)有限公司 Data sampling detection method, device, equipment and storage medium
CN114153647A (en) * 2021-09-24 2022-03-08 深圳市木浪云科技有限公司 Rapid data verification method, device and system for cloud storage system
CN115050213A (en) * 2022-05-06 2022-09-13 安徽超清科技股份有限公司 Road risk monitoring device based on artificial intelligence
CN117421354A (en) * 2023-12-19 2024-01-19 国家卫星海洋应用中心 Satellite remote sensing big data set statistical method, device and equipment

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178784A (en) * 2020-01-06 2020-05-19 广州赛意信息科技股份有限公司 Quality inspection management system and method
CN111427875A (en) * 2020-03-19 2020-07-17 广东蔚海数问大数据科技有限公司 Sampling method, system and storage medium for data quality detection
CN111427875B (en) * 2020-03-19 2023-09-12 广东蔚海数问大数据科技有限公司 Sampling method, system and storage medium for data quality detection
CN112860741A (en) * 2021-01-18 2021-05-28 平安科技(深圳)有限公司 Data sampling detection method, device, equipment and storage medium
WO2022151590A1 (en) * 2021-01-18 2022-07-21 平安科技(深圳)有限公司 Method, apparatus and device for performing sampling inspection on data, and storage medium
CN112818373A (en) * 2021-02-23 2021-05-18 合肥工业大学 Health big data sampling and checking system and method based on Bayesian network
CN114153647A (en) * 2021-09-24 2022-03-08 深圳市木浪云科技有限公司 Rapid data verification method, device and system for cloud storage system
CN114153647B (en) * 2021-09-24 2022-08-02 深圳市木浪云科技有限公司 Rapid data verification method, device and system for cloud storage system
CN115050213A (en) * 2022-05-06 2022-09-13 安徽超清科技股份有限公司 Road risk monitoring device based on artificial intelligence
CN117421354A (en) * 2023-12-19 2024-01-19 国家卫星海洋应用中心 Satellite remote sensing big data set statistical method, device and equipment
CN117421354B (en) * 2023-12-19 2024-03-19 国家卫星海洋应用中心 Satellite remote sensing big data set statistical method, device and equipment

Similar Documents

Publication Publication Date Title
CN110618986A (en) Big data statistical sampling method and device, server and storage medium
CN110378343A (en) A kind of finance reimbursement data processing method, apparatus and system
CN108197668A (en) The method for building up and cloud system of model data collection
US20070233436A1 (en) Structural analysis apparatus, structural analysis method, and structural analysis program
CN112163553B (en) Material price accounting method, device, storage medium and computer equipment
CN114155244B (en) Defect detection method, device, equipment and storage medium
CN112685324B (en) Method and system for generating test scheme
CN113888531B (en) Concrete surface defect detection method and device, electronic equipment and storage medium
CN111210402A (en) Face image quality scoring method and device, computer equipment and storage medium
WO2024031943A1 (en) Store deduplication processing method and apparatus, device, and storage medium
CN105989001A (en) Image searching method and device, and image searching system
CN115984662B (en) Multi-mode data pre-training and identifying method, device, equipment and medium
CN111860176B (en) Non-metal inclusion full-view-field quantitative statistical distribution characterization method
CN115661572A (en) Casting defect recognition model training method, defect recognition method, device and system
CN115859128B (en) Analysis method and system based on interaction similarity of archive data
CN114943672A (en) Image defect detection method and device, electronic equipment and storage medium
CN112200271A (en) Training sample determination method and device, computer equipment and storage medium
CN112527573A (en) Interface testing method, device and storage medium
CN113986762A (en) Test case generation method and device
CN113850523A (en) ESG index determining method based on data completion and related product
CN112825143A (en) Deep convolutional neural network compression method, device, storage medium and equipment
CN116188686B (en) Method, system and medium for combining character low-surface model by local face reduction
CN116109627B (en) Defect detection method, device and medium based on migration learning and small sample learning
CN116187299B (en) Scientific and technological project text data verification and evaluation method, system and medium
CN110968690A (en) Clustering division method and device for words, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination