CN108319611B - Sampling method and sampling device - Google Patents


Info

Publication number
CN108319611B
Authority
CN
China
Prior art keywords
amount
sampling
sub
sample
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710035012.1A
Other languages
Chinese (zh)
Other versions
CN108319611A (en)
Inventor
尹红军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201710035012.1A priority Critical patent/CN108319611B/en
Publication of CN108319611A publication Critical patent/CN108319611A/en
Application granted granted Critical
Publication of CN108319611B publication Critical patent/CN108319611B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Sampling And Sample Adjustment (AREA)

Abstract

The invention provides a sampling method and a sampling device, wherein the sampling method comprises the following steps: according to the specified sub-layer attributes, the experimental sample set and the control sample set are each subjected to layering processing, so that a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set are obtained; when the second sampling amount of each sub-layer is not more than the first sampling amount, or the first sampling amount is not more than the control expected layering amount, selecting a first target proportional relation meeting a first specified condition from the proportional relation between the first sampling amount and the second sampling amount of each sub-layer; obtaining a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer; and sampling the experimental sample set and the control sample set according to the third sampling amount. Through the technical scheme of the invention, more representative samples can be extracted, so that the validity of the sampling investigation result is improved.

Description

Sampling method and sampling device
[ technical field ]
The present invention relates to the field of data processing technologies, and in particular, to a sampling method and a sampling apparatus.
[ background of the invention ]
Currently, when two user groups are examined in a sampling mode, there are two common modes:
First, given the number of samples to draw from each of the two user groups, the two groups are sampled symmetrically by random sampling.
Second, a user group with a numeric account system may be randomly sampled on certain digits of the account number. For example, if users whose account numbers start with 839, end with 07, and have any three digits in the middle are sampled, 1000 random users are obtained, with middle digits ranging from 000 to 999.
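The second mode can be illustrated with a minimal sketch. The prefix 839, the suffix 07, and the three free middle digits come from the example above; the function name `candidate_accounts` is hypothetical and chosen only for illustration.

```python
# Enumerate every account number matching: prefix + free middle digits + suffix.
# Prefix "839", suffix "07" and three free digits follow the example in the text.

def candidate_accounts(prefix="839", suffix="07", free_digits=3):
    # Zero-pad the middle part so that 0 becomes "000", 7 becomes "007", etc.
    return [
        f"{prefix}{middle:0{free_digits}d}{suffix}"
        for middle in range(10 ** free_digits)
    ]

accounts = candidate_accounts()
print(len(accounts), accounts[0], accounts[-1])
```

The selection is uniform over account numbers only; nothing in it controls for age, gender, or location.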
However, in an actual sampling investigation, the smaller the differences in dimensions other than the one under investigation, the more accurate and realistic the sampling comparison for that dimension. The overall difference between two user groups is often very large, and users differ in many dimensions such as age, gender, and location. Therefore, the random sampling of the first method cannot extract a sample that is sufficiently representative of the user group itself, and the sampling comparison result is accordingly unreliable. The second method only ensures that users are uniformly distributed over account numbers; it is still random sampling in nature, so it also cannot extract a sample that is sufficiently representative of the user group itself.
Therefore, how to extract a more representative sample to further increase the reliability of the sampling comparison result is a technical problem to be solved.
[ summary of the invention ]
The embodiment of the invention provides a sampling method and a sampling device, aiming at solving the technical problem in the related art that the sampling comparison result is unreliable because the extracted samples are not representative, so that more representative samples can be extracted and the reliability of the sampling comparison result is increased.
In a first aspect, an embodiment of the present invention provides a sampling method, including: according to the specified sub-layer attributes, the experimental sample set and the control sample set are each subjected to layering processing, so that a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set are obtained; when the second sampling amount of each sub-layer is not more than the first sampling amount, or the first sampling amount is not more than the control expected layering amount, selecting a first target proportional relation meeting a first specified condition from the proportional relation between the first sampling amount and the second sampling amount of each sub-layer; obtaining a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer; and sampling the experimental sample set and the control sample set according to the third sampling amount.
In the above embodiment of the present invention, optionally, the method further includes: and when the second sampling amount of each sub-layer is larger than the first sampling amount and the first sampling amount is larger than the control expected layering amount, sampling the experimental sample set and the control sample set based on the control expected layering amount.
In the foregoing embodiment of the present invention, optionally, after selecting a first target proportional relationship satisfying a first specified condition from the proportional relationship between the first sample volume and the second sample volume of each sub-layer, the method further includes: judging whether the first target proportional relation meets a second specified condition; if the first target proportional relation meets a second specified condition, executing the step to obtain a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer; and if the first target proportional relation does not meet a second specified condition, selecting a second target proportional relation meeting the first specified condition from other proportional relations except the first target proportional relation, and obtaining a third sample amount of each sub-layer according to the second target proportional relation and the first sample amount of each sub-layer.
In the above embodiment of the present invention, optionally, the first specified condition is that a ratio of the second sample amount to the first sample amount is minimum; or the first specified condition is that the ratio of the first sample volume to the second sample volume is maximum.
In the above embodiment of the present invention, optionally, the second specified condition is that a difference between the first target proportional relation and 0 is greater than a specified threshold; or the second specified condition is that the difference between the first target proportional relation and 0 is equal to a specified threshold.
In the above embodiment of the present invention, optionally, the performing layered processing on the experimental sample set and the control sample set respectively according to the specified sub-layer attributes includes: extracting samples with the sub-layer attributes from the experiment sample set, wherein the total amount of the samples is used as the first sample amount; and extracting samples having the sub-layer attribute from the control sample set, the total amount of the samples being the second sample amount.
In a second aspect, an embodiment of the present invention provides a sampling apparatus, including: a layered sample amount obtaining unit, which performs layering processing on the experimental sample set and the control sample set respectively according to the specified sub-layer attributes, so as to obtain a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set; a first selection unit, which selects a first target proportional relation satisfying a first specified condition from the proportional relation between the first sampling amount and the second sampling amount of each sub-layer when the second sampling amount of each sub-layer is not greater than the first sampling amount, or the first sampling amount is not greater than the control expected layering amount; a coordinated sample amount obtaining unit, which obtains a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer; and a sampling processing unit, which samples the experimental sample set and the control sample set according to the third sampling amount.
In the above embodiment of the present invention, optionally, the sampling processing unit is further configured to: and when the second sampling amount of each sub-layer is larger than the first sampling amount and the first sampling amount is larger than the control expected layering amount, sampling the experimental sample set and the control sample set based on the control expected layering amount.
In the above embodiment of the present invention, optionally, the method further includes: a judging unit that judges whether or not a first target proportional relationship satisfies a second specified condition after the first selecting unit selects the first target proportional relationship satisfying the first specified condition; the coordinated sample amount obtaining unit is specifically configured to: if the first target proportional relation meets a second specified condition, executing the step to obtain a third sample volume of each sub-layer according to the first target proportional relation and the first sample volume of each sub-layer, if the first target proportional relation does not meet the second specified condition, selecting a second target proportional relation meeting the first specified condition from other proportional relations except the first target proportional relation, and obtaining the third sample volume of each sub-layer according to the second target proportional relation and the first sample volume of each sub-layer.
In the above embodiment of the present invention, optionally, the first specified condition is that a ratio of the second sample amount to the first sample amount is minimum; or the first specified condition is that the ratio of the first sample volume to the second sample volume is maximum.
In the above embodiment of the present invention, optionally, the second specified condition is that a difference between the first target proportional relation and 0 is greater than a specified threshold; or the second specified condition is that the difference between the first target proportional relation and 0 is equal to a specified threshold.
In the foregoing embodiment of the present invention, optionally, the layered sample amount obtaining unit is specifically configured to: samples having the sub-layer attribute are extracted from the experimental sample set, the total amount of the samples being the first sample amount, and samples having the sub-layer attribute are extracted from the control sample set, the total amount of the samples being the second sample amount.
With the above technical scheme, which addresses the technical problem in the related art that the sampling comparison result is unreliable because the extracted samples are not representative, the experimental sample set and the control sample set can each be layered according to multiple sub-layer attributes to obtain multiple sub-layers. Each sub-layer of the experimental sample set is then consistent with the corresponding sub-layer of the control sample set in all attributes other than the attribute to be examined; in other words, the sample distribution within each sub-layer of the experimental sample set matches that of the corresponding sub-layer of the control sample set. This improves the comparability of the samples of the two sets and the reliability of the comparison result.
Next, for each sub-layer, if the corresponding control expected layering amount is smaller than the first sampling amount in the experimental sample set, and the first sampling amount is smaller than the second sampling amount in the control sample set, it indicates that the number of samples reaching the corresponding control expected layering amount can be extracted in each sub-layer.
Otherwise, that is, when the second sampling amount of each sub-layer is not greater than the first sampling amount, or the first sampling amount is not greater than the control expected layering amount, the first sampling amount or the second sampling amount of the sub-layer does not reach the control expected layering amount. In this case, a first target proportional relation satisfying a first specified condition may be selected from the proportional relations between the first sampling amount and the second sampling amount of each sub-layer, and the control expected layering amounts of all sub-layers are reduced in equal proportion by it. This yields a third sampling amount that is smaller than both the first sampling amount and the second sampling amount. Finally, the experimental sample set and the control sample set are symmetrically sampled with the third sampling amount, which ensures that the symmetric sample amounts are consistent.
In conclusion, the technical scheme of the invention can provide a new sampling mode, and the symmetrical samples with the sample distribution condition consistent with the total amount of the samples are extracted, so that the symmetrical samples are more comparable, and the reliability of the sampling comparison result can be further increased by using the symmetrical samples for comparison and investigation, and the objectivity and scientificity of experimental comparison are improved.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 illustrates a flow diagram of a sampling method provided by one embodiment of the present invention;
FIG. 2 illustrates a flow chart of a sampling method provided by another embodiment of the present invention;
FIG. 3 illustrates an overall schematic diagram of a data sampling process provided by one embodiment of the present invention;
FIG. 4 shows a block diagram of a sampling device provided by one embodiment of the present invention;
FIG. 5 shows a block diagram of a terminal provided by an embodiment of the present invention;
FIG. 6 shows a block diagram of a terminal provided by another embodiment of the present invention.
[ detailed description ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Fig. 1 shows a flow chart of a sampling method provided by an embodiment of the present invention.
As shown in fig. 1, one embodiment of the present invention provides a sampling method, including:
Step 102, performing layering processing on the experimental sample set and the control sample set respectively according to the specified sub-layer attributes to obtain a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set.
The layering processing refers to dividing the experimental sample set and the control sample set into a plurality of sublayers according to the attributes of the plurality of sublayers and the number of segments of the attributes of each sublayer, and the principle can be understood as follows:
First, the sub-layer attributes of the experimental sample set and the control sample set, and the number of segments of each sub-layer attribute, are determined. For example, when a sampling survey is performed on the population using a specific APP, two sub-layer attributes, age and gender, can be determined. The number of age segments can then be set to 3 (18 years or under, 19 to 35 years, and over 35 years), and the number of gender segments to 2 (male and female).
Then the experimental sample set and the control sample set are layered according to the segments of the first sub-layer attribute to obtain a plurality of subsets, each subset is layered according to the second sub-layer attribute, and so on until all sub-layer attributes have been applied, yielding the final sub-layers. In this way, each sub-layer of the experimental sample set is consistent with the corresponding sub-layer of the control sample set in all attributes other than the attribute to be examined; in other words, the sample distribution within each sub-layer of the experimental sample set matches that of the corresponding sub-layer of the control sample set, which improves the comparability of the two sets and the reliability of the comparison result.
For example, 4 sub-layer attributes may be taken, the number of first sub-layer attribute segments is a, the number of second sub-layer attribute segments is G, the number of third sub-layer attribute segments is L, the number of fourth sub-layer attribute segments is H, the total number of extracted samples is M, and the total size is N, so that the final total number of layers is a × G × L × H.
In addition, the above principle is given only to facilitate understanding. In actual processing, the result of the layering processing can be obtained directly through a predetermined layering function, such as Neyman allocation.
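The layering principle described above can be sketched as follows. This is a minimal illustration, assuming each sample is a dictionary with hypothetical `age` and `gender` fields; the three age segments follow the example in the text, and the helper names are not from the source.

```python
from collections import Counter

def age_segment(age):
    # Three age segments, following the example in the text.
    if age <= 18:
        return "<=18"
    if age <= 35:
        return "19-35"
    return ">35"

def sublayer_key(sample):
    # One sub-layer per combination of attribute segments (age segment, gender).
    return (age_segment(sample["age"]), sample["gender"])

def sublayer_amounts(sample_set):
    # Number of samples that fall into each sub-layer.
    return Counter(sublayer_key(s) for s in sample_set)

experimental = [
    {"age": 17, "gender": "F"}, {"age": 25, "gender": "M"},
    {"age": 30, "gender": "M"}, {"age": 40, "gender": "F"},
]
control = [
    {"age": 16, "gender": "F"}, {"age": 22, "gender": "M"},
    {"age": 50, "gender": "F"}, {"age": 61, "gender": "F"},
    {"age": 33, "gender": "F"},
]

first_amounts = sublayer_amounts(experimental)   # first sampling amount per sub-layer
second_amounts = sublayer_amounts(control)       # second sampling amount per sub-layer
print(first_amounts)
print(second_amounts)
```

Keying each sub-layer by a tuple of segments gives the same result as layering attribute by attribute, without nested loops.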
Step 104, when the second sampling amount of each sub-layer is not more than the first sampling amount, or the first sampling amount is not more than the control expected layering amount, selecting a first target proportional relation meeting a first specified condition from the proportional relation between the first sampling amount and the second sampling amount of each sub-layer.
The total expected control sample amount is set by a user according to actual requirements, or estimated by the system from the total size of the experimental sample set; for example, the system may automatically preset it to 1% of the experimental sample set. Multiplying the preset total expected control sample amount by the share of the experimental sample set that a sub-layer's first sampling amount represents gives the control expected layering amount for that sub-layer.
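A minimal sketch of this calculation follows, assuming the per-sub-layer first sampling amounts are already known. The function name, the sub-layer labels, and the example figures are hypothetical; the 1% preset is the example given in the text.

```python
def control_expected_layering_amounts(first_amounts, expected_total):
    # Split the total expected control sample amount across sub-layers in
    # proportion to each sub-layer's share of the experimental sample set.
    experimental_total = sum(first_amounts.values())
    return {
        layer: expected_total * amount / experimental_total
        for layer, amount in first_amounts.items()
    }

# Hypothetical experimental sample set of 10,000 samples in three sub-layers.
first_amounts = {"a": 6000, "b": 3000, "c": 1000}
expected_total = sum(first_amounts.values()) // 100   # 1% preset, i.e. 100

expected = control_expected_layering_amounts(first_amounts, expected_total)
print(expected)
```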
When the second sampling amount of each sub-layer is not greater than the first sampling amount, or the first sampling amount is not greater than the control expected layering amount, this indicates that the first sampling amount or the second sampling amount of the sub-layer does not reach the control expected layering amount, and the following processing is performed according to a first target proportional relation that satisfies a first specified condition.
Step 106, obtaining a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer.
That is, a first target proportional relation satisfying the first specified condition may be selected from the proportional relations between the first sampling amount and the second sampling amount of each sub-layer, and the control expected layering amounts of all sub-layers are reduced in equal proportion by it. This yields a third sampling amount that is smaller than both the first sampling amount and the second sampling amount. Finally, the experimental sample set and the control sample set are symmetrically sampled with the third sampling amount, which ensures that the symmetric sample amounts are consistent.
Step 108, sampling the experimental sample set and the control sample set according to the third sampling amount.
It should be added that the first specified condition is that the ratio of the second sample amount to the first sample amount is minimum, or the ratio of the first sample amount to the second sample amount is maximum.
In an actual scene, the first sampling amount of the experimental sample set is often greater than the second sampling amount of the control sample set, and the smaller the ratio of the second sampling amount to the first sampling amount (that is, the larger the ratio of the first sampling amount to the second sampling amount), the larger the difference between the two. Therefore, the proportional relation with the minimum ratio of the second sampling amount to the first sampling amount, or equivalently the maximum ratio of the first sampling amount to the second sampling amount, may be selected as the first target proportional relation, and the third sampling amount is obtained by reducing the control expected layering amount accordingly. The third sampling amount is smaller than both the first sampling amount and the second sampling amount, so sampling the experimental sample set and the control sample set with the third sampling amount yields the same number of samples from each.
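One possible reading of steps 104 to 106 is sketched below: the first target proportional relation is taken as the smallest ratio of second to first sampling amount over all sub-layers, and the third sampling amount of a sub-layer is its first sampling amount multiplied by that ratio, rounded down. The rounding rule, the function names, and the example figures are assumptions, not details stated in the text.

```python
import math

def first_target_ratio(first_amounts, second_amounts):
    # First specified condition: the minimum ratio second / first over sub-layers.
    return min(
        second_amounts.get(layer, 0) / amount
        for layer, amount in first_amounts.items()
        if amount > 0
    )

def third_amounts(first_amounts, ratio):
    # Scale every sub-layer by the same ratio; rounding down is an assumption.
    return {layer: math.floor(amount * ratio) for layer, amount in first_amounts.items()}

first = {"a": 600, "b": 300, "c": 100}    # experimental sample set
second = {"a": 300, "b": 240, "c": 90}    # control sample set

ratio = first_target_ratio(first, second)  # ratios are 0.5, 0.8, 0.9
third = third_amounts(first, ratio)
print(ratio, third)
```

Because the chosen ratio is the smallest one, each scaled amount cannot exceed the second sampling amount of its sub-layer, and when the ratio is at most 1 it cannot exceed the first sampling amount either.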
In summary, the embodiment provides a new sampling manner, which can extract a symmetric sample with the same sample distribution condition and sample total amount, such symmetric sample is more comparable, and the reliability of the sampling comparison result can be further increased by using such symmetric sample for comparison and investigation, thereby improving the objectivity and scientificity of the experimental comparison.
Fig. 2 shows a flow chart of a sampling method provided by another embodiment of the present invention.
As shown in fig. 2, another embodiment of the present invention provides a sampling method, including:
Step 202, according to the specified sub-layer attributes, performing layering processing on the experimental sample set and the control sample set respectively to obtain a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set.
In this way, each sub-layer of the experimental sample set is consistent with the sample distribution in the sub-layer of the corresponding control sample set, so that the sample comparability of the two is improved, and the reliability of the comparison result of the two is increased.
Step 204, judging whether the second sampling amount of each sub-layer is larger than the first sampling amount and the first sampling amount is larger than the control expected layering amount; if so, entering step 206, otherwise entering step 208.
Step 206, sampling the experimental sample set and the control sample set based on the control expected layering amount, and ending the process.
For each sub-layer, if the corresponding control expected layering amount is less than the first sampling amount in the experimental sample set and the first sampling amount is less than the second sampling amount in the control sample set, it indicates that the number of samples reaching the corresponding control expected layering amount can be extracted in each sub-layer, so that the samples of the control expected layering amount can be extracted for both the experimental sample set and the control sample set.
Step 208, selecting a first target proportional relation meeting a first specified condition from the proportional relation between the first sample volume and the second sample volume of each sub-layer. The first specified condition is that the ratio of the second sample volume to the first sample volume is minimum; or the first specified condition is that the ratio of the first sample volume to the second sample volume is maximum.
And when the second sampling amount of each sub-layer is not more than the first sampling amount, or the first sampling amount is not more than the control expected layering amount, indicating that the first sampling amount or the second sampling amount of the sub-layer does not reach the control expected layering amount. In an actual scene, the first sampling amount of the experimental sample set is often greater than the second sampling amount of the control sample set, and the smaller the ratio of the second sampling amount to the first sampling amount, that is, the larger the ratio of the first sampling amount to the second sampling amount, the larger the difference between the first sampling amount and the second sampling amount.
Therefore, the proportional relation with the minimum ratio of the second sampling amount to the first sampling amount, or equivalently the maximum ratio of the first sampling amount to the second sampling amount, may be selected as the first target proportional relation, and the third sampling amount is obtained by reducing the control expected layering amount accordingly. The third sampling amount is smaller than both the first sampling amount and the second sampling amount, so sampling the experimental sample set and the control sample set with the third sampling amount yields the same number of samples from each.
Step 210, determining whether the first target proportional relation satisfies a second specified condition, if yes, entering step 212, otherwise, entering step 214. Wherein the second specified condition is that the difference between the first target proportional relation and 0 is greater than a specified threshold; or the second specified condition is that the difference between the first target proportional relation and 0 is equal to a specified threshold.
In an actual scenario, the number of samples of the user group in a certain sub-layer may be 0, in which case the first target proportional relation is 0 and no sample can be drawn. To handle this special case, the first target proportional relation may be checked further so that the case where it is 0 receives special processing.
Step 212, obtaining a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer. In this case, the first target proportional relation is not 0, and the third sampling amount can be obtained by scaling down the control expected layering amounts of all sub-layers by the first target proportional relation.
Step 214, selecting a second target proportional relation meeting the first specified condition from the proportional relations other than the first target proportional relation, and obtaining a third sampling amount of each sub-layer according to the second target proportional relation and the first sampling amount of each sub-layer.
In this case, the first target proportional relation is 0, and a second target proportional relation satisfying the first specified condition may be selected instead; for example, the smallest non-zero proportional relation may be selected as the second target proportional relation, and the control expected layering amounts of all sub-layers are scaled down by it to obtain the third sampling amount.
Step 216, sampling the experimental sample set and the control sample set according to the third sampling amount, so that symmetric samples with consistent sample distribution and total sample amount can be extracted, which further improves the reliability of the sampling comparison result and the objectivity and scientific rigor of the experimental comparison.
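The flow of steps 204 to 214 can be sketched as a single planning function. This is one interpretation under stated assumptions: amounts are rounded down, a ratio "passes" the second specified condition when it is greater than a threshold (0 by default), and the result for each sub-layer is capped at its second sampling amount so that an empty control sub-layer never receives a positive amount. The cap and the function name are additions for illustration, not part of the text.

```python
import math

def plan_sample_amounts(first, second, expected, threshold=0.0):
    layers = list(first)
    # Step 204: can every sub-layer supply the control expected layering amount?
    if all(second.get(k, 0) > first[k] > expected[k] for k in layers):
        # Step 206: sample the control expected layering amount directly.
        return {k: math.floor(expected[k]) for k in layers}
    # Step 208: candidate ratios second / first, smallest first.
    ratios = sorted(second.get(k, 0) / first[k] for k in layers if first[k] > 0)
    # Steps 210 and 214: skip ratios that fail the second specified condition.
    usable = [r for r in ratios if r > threshold]
    if not usable:
        return {k: 0 for k in layers}
    ratio = usable[0]
    # Steps 212 and 214: scale by the chosen ratio (cap is an assumption).
    return {k: min(math.floor(first[k] * ratio), second.get(k, 0)) for k in layers}

# Every sub-layer is large enough: the expected amounts are used as they are.
enough = plan_sample_amounts(
    {"a": 100, "b": 50}, {"a": 1000, "b": 500}, {"a": 10, "b": 5}
)
# Sub-layer "b" is empty in the control set, so the ratio 0 is skipped.
scarce = plan_sample_amounts(
    {"a": 600, "b": 300, "c": 100}, {"a": 300, "b": 0, "c": 90},
    {"a": 60, "b": 30, "c": 10},
)
print(enough, scarce)
```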
Fig. 3 is a general schematic diagram of a data sampling process provided by an embodiment of the present invention.
Generally, the experimental sample set is much smaller than the control sample set; for example, in advertisement placement the exposed group is much smaller than the unexposed group among overall active users. Let the total size of the experimental sample set be N, the maximum variance of the estimator allowed for the experimental object be V, and the population variance be S². As shown in fig. 3, the sample amount n of the experimental sample set is calculated as follows:
Figure BDA0001211398970000101
However, the total amount of expected control samples is often greater than the sample amount n of the experimental sample set. Therefore, after n is determined, if n is detected to be less than the total amount of expected control samples, hierarchical sampling and coordinated sampling need to be performed on the experimental sample set and the control sample set.
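The formula for n is only available as a figure in this text. As a hedged illustration, the sketch below uses the standard textbook relation for simple random sampling without replacement, V = (S^2 / n)(1 - n / N), solved for n. Whether the figure expresses exactly this relation is an assumption, and the function name `required_sample_size` and the example values are hypothetical.

```python
import math


def required_sample_size(population_size, population_variance, max_variance):
    """Smallest n whose estimator variance does not exceed max_variance.

    Assumes simple random sampling without replacement, where the variance
    of the sample mean is (S^2 / n) * (1 - n / N). Solving for n gives
    n = N * S^2 / (N * V + S^2).
    """
    n = (population_size * population_variance) / (
        population_size * max_variance + population_variance
    )
    # Round up to a whole sample and never exceed the population itself.
    return min(population_size, math.ceil(n))


# Hypothetical values: N = 1000, S^2 = 100, V = 0.25.
n = required_sample_size(1000, 100, 0.25)
print(n)
```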
I. Hierarchical sampling:
The number of layers for hierarchical sampling should be chosen with reference to estimator accuracy, sampling cost and so on, and may be 3 to 10. Taking the sampling cost, the practical sampling difficulty and the goal of the experiment into account, candidate dimensions are selected as sub-layer attributes; alternatively, sub-layer attributes with higher correlation are selected based on experience and information gain analysis. The specific sub-layer attributes can be determined according to the actual situation.
For example, 4 sub-layer attributes may be used, with A segments for the first sub-layer attribute, G for the second, L for the third and H for the fourth. With M the total number of extracted samples and N the total size, the final total number of layers is A × G × L × H.
Since each layer needs at least two representative samples, M > max{A, G, L, H} and M > A × G × L × H. With sampling at the sub-layer rate, the number of samples drawn from any sub-layer is:
Figure BDA0001211398970000102
where i is 1, 2, 3 or 4, j is 1, 2, 3 or 4, k is 1, 2, 3 or 4, t is 1, 2, 3 or 4, and A_i G_j L_k H_t refers to one of the A × G × L × H sub-layers,
Figure BDA0001211398970000111
is the total number of sub-layers.
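For illustration, sampling at the sub-layer rate can be sketched as proportional allocation, in which each sub-layer receives a share of the M extracted samples equal to its share of the total size N. Treating the figure's formula as proportional allocation is an assumption; the function name, the rounding-down rule and the layer names are hypothetical. The check for at least two samples per layer follows the requirement stated above.

```python
def proportional_allocation(layer_sizes, total_to_extract, min_per_layer=2):
    """Split total_to_extract across sub-layers in proportion to their size.

    layer_sizes: dict mapping sub-layer name -> number of members.
    Raises ValueError if any sub-layer would get fewer than min_per_layer.
    """
    population = sum(layer_sizes.values())
    allocation = {
        layer: total_to_extract * size // population  # integer share, rounded down
        for layer, size in layer_sizes.items()
    }
    too_small = [layer for layer, n in allocation.items() if n < min_per_layer]
    if too_small:
        raise ValueError(f"fewer than {min_per_layer} samples for: {too_small}")
    return allocation


# Hypothetical sub-layers built from two attributes with two segments each.
sizes = {"A1G1": 500, "A1G2": 300, "A2G1": 150, "A2G2": 50}
allocation = proportional_allocation(sizes, 100)
print(allocation)
```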
Alternatively, the hierarchical sampling may use algorithms other than the one described above, such as kernel allocation.
In a specific scenario, there is an experimental sample set of 10,000 samples and a control sample set of 1,000,000 samples, with three attributes: user age, user gender, and the number of times a user browses or visits a certain game application each week. User age is divided into three bands (0-18, 19-35 and 36-50 years) and user gender into male and female. When the weekly number of visits to the game application is compared between the users in the experimental sample set and those in the control sample set, the samples can therefore be layered by user age and user gender. Within any sub-layer, the experimental samples and the control samples then have the same user age band and user gender, so comparing the weekly number of visits on this basis is more meaningful.
The samples are thus layered into six sub-layers: males aged 0-18, males aged 19-35, males aged 36-50, females aged 0-18, females aged 19-35 and females aged 36-50. Their distribution is shown in Table 1 below. For ease of understanding, all layered amounts in this scenario are integers.
TABLE 1
Sub-layer | Experimental sample layered amount | Control sample layered amount
Male, 0-18 years | 1400 | 150,000
Male, 19-35 years | 3500 | 300,000
Male, 36-50 years | 1600 | 150,000
Female, 0-18 years | 800 | 80,000
Female, 19-35 years | 2000 | 250,000
Female, 36-50 years | 700 | 70,000
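The layering in this scenario can be sketched as follows: each user is mapped to an (age band, gender) sub-layer and the members of each sub-layer are counted. The record layout (`age` and `gender` keys), the function names and the handling of users outside the three age bands are assumptions for illustration.

```python
from collections import Counter

# Age bands used in the scenario, as inclusive (low, high) pairs.
AGE_BANDS = [(0, 18), (19, 35), (36, 50)]


def layer_of(user):
    """Return the (age band, gender) sub-layer of a user, or None."""
    for low, high in AGE_BANDS:
        if low <= user["age"] <= high:
            return (f"{low}-{high}", user["gender"])
    return None  # age outside every band; skipped in this sketch


def layered_amounts(users):
    """Count how many users fall into each sub-layer."""
    layers = (layer_of(user) for user in users)
    return Counter(layer for layer in layers if layer is not None)


# Hypothetical user records.
users = [
    {"age": 17, "gender": "male"},
    {"age": 25, "gender": "male"},
    {"age": 30, "gender": "female"},
    {"age": 33, "gender": "female"},
    {"age": 62, "gender": "male"},  # outside the bands used in the scenario
]
counts = layered_amounts(users)
print(counts)
```

Running the same function over the experimental sample set and the control sample set would yield the two layered-amount columns of a table such as Table 1.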
II. Coordinated sampling:
Next, symmetric sample sets with consistent distribution and size are extracted, on the premise that a certain error is acceptable.
The total number of samples in the experimental sample set is N = Σ N_o, o = 1, 2, …, A × G × L × H, and the expected total number of control samples is EM. The sampling information of all sub-layers is shown in Table 2 below. If EM_o < N_o < M_o holds, each layer may be sampled according to the amount EM_o. If EM_o < N_o < M_o is not satisfied, a sampling bottleneck can be selected for coordination.
TABLE 2
Figure BDA0001211398970000121
When the layered amounts of some layers in the control sample set are particularly small, smaller than the corresponding control expected layering amounts, the layer with the smallest layered amount can be selected as the sampling bottleneck, and the expected layering amount of every layer is reduced in the same proportion according to this bottleneck. Here, the smallest layered amount means the smallest ratio of layered amount to control expected layering amount.
The reduction ratio implied by the sampling bottleneck can be derived as:
Figure BDA0001211398970000122
the user expects a sampling rate of EM/N and a final sampling rate of EM/N
Figure BDA0001211398970000123
In general, the following should be satisfied:
Figure BDA0001211398970000124
The amount of the stratified samples after coordinated sampling is
Figure BDA0001211398970000125
In addition, in rare cases some layered amounts of the control population are 0. If such a layer were taken as the bottleneck, the overall scaling ratio would be 0 and no sample could be drawn. Therefore, the smallest value greater than 0 can be found among the ratios
Figure BDA0001211398970000126
and used as the sampling bottleneck.
Alternatively, some layered amounts of the control population may be close to 0. If such a layer were taken as the bottleneck, the overall scaling ratio would be close to 0 and very few samples would be drawn, which may affect the actual comparison. In this case, the smallest value greater than a predetermined threshold may be taken among the ratios
Figure BDA0001211398970000127
The predetermined threshold is a preset minimum sampling rate required for symmetric sampling to complete normally.
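A minimal sketch of this bottleneck selection is given below, assuming the ratio of each sub-layer is its layered amount divided by its control expected layering amount. The function name `select_bottleneck_ratio`, the toy numbers and the use of exact fractions are assumptions; a threshold of 0 corresponds to taking the smallest ratio greater than 0.

```python
from fractions import Fraction


def select_bottleneck_ratio(layered, expected, threshold=Fraction(0)):
    """Return the smallest per-layer ratio that is greater than threshold.

    layered:  dict mapping sub-layer -> layered amount actually available.
    expected: dict mapping sub-layer -> control expected layering amount.
    """
    ratios = [
        Fraction(layered[layer], expected[layer])
        for layer in expected
        if expected[layer] > 0
    ]
    # Ratios at or below the threshold are not allowed to act as the bottleneck.
    candidates = [ratio for ratio in ratios if ratio > threshold]
    if not candidates:
        raise ValueError("no sub-layer ratio exceeds the threshold")
    return min(candidates)


# Hypothetical sub-layers: "a" is empty, "b" is small, "c" is almost full.
layered = {"a": 0, "b": 40, "c": 90}
expected = {"a": 50, "b": 100, "c": 100}

ratio_nonzero = select_bottleneck_ratio(layered, expected)
ratio_thresholded = select_bottleneck_ratio(layered, expected, Fraction(1, 2))
print(ratio_nonzero, ratio_thresholded)  # 2/5 9/10
```

How the sub-layers that were skipped as bottlenecks are themselves sampled is not specified in the text and is left out of the sketch.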
After symmetric sampling, simple experimental statistics and parameter analysis can be performed on the symmetric samples, for example analysing the effect of exposure versus non-exposure on product conversion. Other consistency tests on statistical parameters can of course also be used.
In the above specific scenario, the number of samples N in the experimental sample set is 10,000 and the number of samples M in the control sample set is 1,000,000. To determine whether coordinated sampling is needed, it is first determined whether, for every layer o, the control expected layering amount EM_o is less than the experimental layered amount N_o, and the experimental layered amount N_o is less than the control layered amount M_o.
When the total amount EM of expected control samples is given as 4000, the control expected layering amount for the first sub-layer, males aged 0-18, is EM_1 = 4000 × 150,000/1,000,000 = 600. Similarly, the control expected layering amounts of the other five sub-layers (males aged 19-35, males aged 36-50, females aged 0-18, females aged 19-35 and females aged 36-50) are 1200, 600, 320, 1000 and 280, respectively. All of them satisfy EM_o < N_o < M_o, so sampling can be performed directly according to EM_o.
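The check in this example can be reproduced with the short sketch below, which uses the layered amounts of Table 1. It assumes, as in the arithmetic above, that the control expected layering amount of a sub-layer is EM multiplied by that sub-layer's share of the control sample set (15% for males aged 0-18). The function and key names are hypothetical.

```python
# Layered amounts from Table 1.
experimental = {
    "male_0_18": 1400, "male_19_35": 3500, "male_36_50": 1600,
    "female_0_18": 800, "female_19_35": 2000, "female_36_50": 700,
}
control = {
    "male_0_18": 150_000, "male_19_35": 300_000, "male_36_50": 150_000,
    "female_0_18": 80_000, "female_19_35": 250_000, "female_36_50": 70_000,
}


def control_expected_amounts(control_layers, expected_total):
    """EM_o: expected_total times each sub-layer's share of the control set."""
    control_total = sum(control_layers.values())
    return {
        layer: expected_total * size // control_total
        for layer, size in control_layers.items()
    }


def can_sample_directly(expected, experimental_layers, control_layers):
    """True if expected < experimental < control holds for every sub-layer."""
    return all(
        expected[layer] < experimental_layers[layer] < control_layers[layer]
        for layer in expected
    )


expected_4000 = control_expected_amounts(control, 4000)
direct_4000 = can_sample_directly(expected_4000, experimental, control)
expected_10000 = control_expected_amounts(control, 10000)
direct_10000 = can_sample_directly(expected_10000, experimental, control)
print(expected_4000, direct_4000, direct_10000)
```

With EM = 4000 the condition holds for all six sub-layers, while with EM = 10,000 it fails (for example, 1500 is not less than 1400 for males aged 0-18), so coordination is needed in that case.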
When the total amount EM of expected control samples is given as 10,000, the control expected layering amount for the first sub-layer, males aged 0-18, is EM_1 = 10,000 × 150,000/1,000,000 = 1500, and the control expected layering amounts of the other sub-layers are calculated in the same way. Since min(M_o/EM_o) = 0.8, the control expected layering amount of each sub-layer can be scaled down by a factor of 0.8, giving the distribution shown in Table 3 below.
TABLE 3
Sub-layer | M_o | EM_o | N_o | M_o/EM_o | Actual extraction R_o
Male, 0-18 years | 1400 | 1500 | 150,000 | 0.93 | 1200
Male, 19-35 years | 3500 | 3000 | 300,000 | 1.17 | 2400
Male, 36-50 years | 1600 | 1500 | 150,000 | 1.07 | 1200
Female, 0-18 years | 800 | 800 | 80,000 | 1 | 640
Female, 19-35 years | 2000 | 2500 | 250,000 | 0.8 | 2000
Female, 36-50 years | 700 | 700 | 70,000 | 1 | 560
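The scaling for EM = 10,000 can be sketched as below. The sketch assumes that the bottleneck ratio is the smallest ratio of a sub-layer's experimental layered amount to its control expected layering amount (2000/2500 = 0.8 for females aged 19-35) and that every control expected layering amount is then multiplied by this ratio, as the text describes. Variable and key names are hypothetical.

```python
from fractions import Fraction

# Experimental layered amounts and control expected layering amounts
# for an expected control total of 10,000.
experimental = {
    "male_0_18": 1400, "male_19_35": 3500, "male_36_50": 1600,
    "female_0_18": 800, "female_19_35": 2000, "female_36_50": 700,
}
expected = {
    "male_0_18": 1500, "male_19_35": 3000, "male_36_50": 1500,
    "female_0_18": 800, "female_19_35": 2500, "female_36_50": 700,
}

# The bottleneck is the sub-layer with the smallest available/expected ratio.
bottleneck_ratio = min(
    Fraction(experimental[layer], expected[layer]) for layer in expected
)

# Scale every expected amount by the same ratio; int() rounds down.
actual = {layer: int(expected[layer] * bottleneck_ratio) for layer in expected}

print(float(bottleneck_ratio))  # 0.8
print(actual)
print(sum(actual.values()))     # 8000
```

Each sub-layer ends up with 0.8 times its control expected layering amount, so the relative distribution over sub-layers is preserved while the total drops from 10,000 to 8000.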
The invention introduces hierarchical sampling to handle inconsistent distributions between the experimental population and the control population, and extracts symmetric samples with consistent total amount and distribution, within an allowed error, so that the experimental effect can be analysed and compared. This provides a sound theoretical basis and a methodological reference for experimental comparison in practice, and ensures the objectivity and scientific rigor of the comparison.
Fig. 4 shows a block diagram of a sampling apparatus provided by an embodiment of the present invention.
As shown in Fig. 4, one embodiment of the present invention provides a sampling apparatus 400, comprising: a hierarchical sample amount acquisition unit 402, a first selection unit 404, a coordinated sample amount acquisition unit 406, and a sampling processing unit 408.
The hierarchical sample amount obtaining unit 402 is configured to perform hierarchical processing on the experiment sample set and the control sample set according to the specified sub-layer attributes, so as to obtain a first sample amount of each sub-layer of the experiment sample set and a second sample amount of each sub-layer of the control sample set.
The first selection unit 404 is configured to select a first target proportional relationship satisfying a first specified condition from proportional relationships between the first sample amount and the second sample amount of each sub-layer when the second sample amount of each sub-layer is not greater than the first sample amount, or the first sample amount is not greater than the control expected layer amount.
The coordinated sample amount obtaining unit 406 is configured to obtain a third sample amount of each sub-layer according to the first target proportional relationship and the first sample amount of each sub-layer.
The sampling processing unit 408 is configured to perform sampling processing on the experimental sample set and the control sample set according to the third sampling amount.
In the above embodiment of the present invention, optionally, the sampling processing unit 408 is further configured to: and when the second sampling amount of each sub-layer is larger than the first sampling amount and the first sampling amount is larger than the control expected layering amount, sampling the experimental sample set and the control sample set based on the control expected layering amount.
In the above embodiment of the present invention, optionally, the sampling apparatus 400 further includes a judging unit 410 which, after the first selection unit 404 selects the first target proportional relationship satisfying the first specified condition, judges whether the first target proportional relationship satisfies a second specified condition. The coordinated sample amount obtaining unit 406 is specifically configured to: if the first target proportional relationship satisfies the second specified condition, obtain the third sample amount of each sub-layer according to the first target proportional relationship and the first sample amount of each sub-layer; and if the first target proportional relationship does not satisfy the second specified condition, select a second target proportional relationship satisfying the first specified condition from the proportional relationships other than the first target proportional relationship, and obtain the third sample amount of each sub-layer according to the second target proportional relationship and the first sample amount of each sub-layer.
In the above embodiment of the present invention, optionally, the first specified condition is that a ratio of the second sample amount to the first sample amount is minimum; or the first specified condition is that the ratio of the first sample volume to the second sample volume is maximum.
In the above embodiment of the present invention, optionally, the second specified condition is that a difference between the first target proportional relation and 0 is greater than a specified threshold; or the second specified condition is that the difference between the first target proportional relation and 0 is equal to a specified threshold.
In the foregoing embodiment of the present invention, optionally, the hierarchical sample amount obtaining unit 402 is specifically configured to: samples having sub-layer properties are extracted from the set of experimental samples, the total amount of the samples being taken as a first sample amount, and samples having sub-layer properties are extracted from the set of control samples, the total amount of the samples being taken as a second sample amount.
Fig. 5 shows a block diagram of a terminal provided by an embodiment of the present invention.
As shown in fig. 5, a terminal 500 according to an embodiment of the present invention includes the sampling apparatus 400 shown in fig. 4, and therefore, the terminal 500 has the same technical effects as the sampling apparatus 400 shown in fig. 4, and will not be described again.
Fig. 6 shows a block diagram of a terminal provided by another embodiment of the present invention.
As shown in Fig. 6, terminal 600 may include a processor 602 coupled to one or more data storage facilities, which may include a storage medium 604 and a memory unit 606. Terminal 600 may also include an input interface 608 and an output interface 610 for communicating with another device or system. Program code executed by the CPU 6022 of the processor 602 may be stored in the storage medium 604 or the memory unit 606.
The processor 602 in the terminal 600 calls the program code stored in the storage medium 604 or the memory unit 606, and can execute the following steps:
performing layering processing on the experimental sample set and the control sample set respectively, according to the specified sub-layer attributes, to obtain a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set;
when the second sampling amount of each sub-layer is not more than the first sampling amount, or the first sampling amount is not more than the contrast expected layering amount, selecting a first target proportional relation meeting a first specified condition from the proportional relation between the first sampling amount and the second sampling amount of each sub-layer;
obtaining a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer;
and sampling the experimental sample set and the control sample set according to a third sampling amount.
In a particular implementation, the processor 602 may further perform:
and when the second sampling amount of each sub-layer is larger than the first sampling amount and the first sampling amount is larger than the control expected layering amount, sampling the experimental sample set and the control sample set based on the control expected layering amount.
In a particular implementation, the processor 602 may further perform:
after a first target proportional relation meeting a first specified condition is selected from the proportional relation between the first sample volume and the second sample volume of each sublayer, whether the first target proportional relation meets a second specified condition is judged;
if the first target proportional relation meets the second specified condition, obtaining a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer;
and if the first target proportional relation does not meet a second specified condition, selecting a second target proportional relation meeting the first specified condition from other proportional relations except the first target proportional relation, and obtaining a third sample amount of each sub-layer according to the second target proportional relation and the first sample amount of each sub-layer.
The first specified condition is that the ratio of the second sample volume to the first sample volume is minimum; or the first specified condition is that the ratio of the first sample volume to the second sample volume is maximum. The second specified condition is that the difference between the first target proportional relation and 0 is greater than a specified threshold; or the second specified condition is that the difference between the first target proportional relation and 0 is equal to a specified threshold.
In one particular implementation, the processor 602 is operable to perform:
extracting samples with the sub-layer attributes from the experimental sample set, wherein the total amount of the samples is used as the first sample amount; and
and extracting samples with the sub-layer attributes from the control sample set, wherein the total amount of the samples is used as the second sample amount.
The technical solution of the present invention has been described in detail above with reference to the drawings. It provides a novel sampling approach that extracts symmetric samples whose distribution and total amount are consistent, making them more comparable. Using these symmetric samples for comparison further increases the reliability of the sampling comparison result and improves the objectivity and scientific rigor of the experimental comparison.
It should be understood that although the terms first, second, third, etc. may be used to describe sample amounts in embodiments of the present invention, these sample amounts should not be limited to these terms. These terms are only used to distinguish sample volumes from each other. For example, the first sample amount may also be referred to as the second sample amount, and similarly, the second sample amount may also be referred to as the first sample amount without departing from the scope of embodiments of the present invention.
The word "if" as used herein may be interpreted as "at the time of" or "when" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It should be noted that the terminal according to the embodiment of the present invention may include, but is not limited to, a Personal Computer (PC), a Personal Digital Assistant (PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), a mobile phone, an MP3 player, an MP4 player, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A sampling method, comprising:
according to the appointed sub-layer attributes, the experimental sample set and the control sample set are subjected to layering processing respectively, so that a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set are obtained;
when the second sampling amount of each sub-layer is not more than the first sampling amount, or the first sampling amount is not more than the contrast expected layering amount, selecting a first target proportional relation meeting a first specified condition from the proportional relation between the first sampling amount and the second sampling amount of each sub-layer; the first specified condition is that the ratio of the second sample volume to the first sample volume is minimum; or, the first specified condition is that the ratio of the first sample volume to the second sample volume is maximum; the contrast expected layering amount is the ratio of the preset total amount of the expected contrast sample multiplied by the first sampling amount of any sublayer in the experimental sample set;
obtaining a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer;
and sampling the experimental sample set and the control sample set according to a third sampling amount.
2. The method of claim 1, further comprising:
and when the second sampling amount of each sub-layer is larger than the first sampling amount and the first sampling amount is larger than the control expected layering amount, sampling the experimental sample set and the control sample set based on the control expected layering amount.
3. The method of claim 1, wherein after selecting a first target proportional relationship satisfying a first specified condition from proportional relationships of the first sample amount and the second sample amount of each sub-layer, the method further comprises:
judging whether the first target proportional relation meets a second specified condition; the second specified condition is that the difference between the first target proportional relation and 0 is greater than a specified threshold; or, the second specified condition is that the difference between the first target proportional relation and 0 is equal to a specified threshold;
if the first target proportional relation meets a second specified condition, executing the step to obtain a third sampling amount of each sub-layer according to the first target proportional relation and the first sampling amount of each sub-layer;
and if the first target proportional relation does not meet a second specified condition, selecting a second target proportional relation meeting the first specified condition from other proportional relations except the first target proportional relation, and obtaining a third sample amount of each sub-layer according to the second target proportional relation and the first sample amount of each sub-layer.
4. The method of claim 1, wherein the layering the experimental sample set and the control sample set according to the sub-layer attributes comprises:
extracting samples with the sub-layer attributes from the experimental sample set, wherein the total amount of the samples is used as the first sample amount; and
and extracting samples with the sub-layer attributes from the control sample set, wherein the total amount of the samples is used as the second sample amount.
5. A sampling device, comprising:
the hierarchical sampling amount acquisition unit is used for respectively carrying out hierarchical processing on the experimental sample set and the control sample set according to the appointed sub-layer attributes so as to obtain a first sampling amount of each sub-layer of the experimental sample set and a second sampling amount of each sub-layer of the control sample set;
a first selection unit that selects a first target proportional relationship that satisfies a first prescribed condition from the proportional relationship between the first sample amount and the second sample amount for each sublayer, when the second sample amount for each sublayer is not greater than the first sample amount, or the first sample amount is not greater than the comparison expected sublayer amount; the first specified condition is that the ratio of the second sample volume to the first sample volume is minimum; or, the first specified condition is that the ratio of the first sample volume to the second sample volume is maximum; the contrast expected layering amount is the ratio of the preset total amount of the expected contrast sample multiplied by the first sampling amount of any sublayer in the experimental sample set;
a coordinated sample amount obtaining unit, which obtains a third sample amount of each sub-layer according to the first target proportional relation and the first sample amount of each sub-layer;
and the sampling processing unit is used for sampling the experimental sample set and the reference sample set according to a third sampling amount.
6. The apparatus of claim 5, wherein the sampling processing unit is further configured to:
and when the second sampling amount of each sub-layer is larger than the first sampling amount and the first sampling amount is larger than the control expected layering amount, sampling the experimental sample set and the control sample set based on the control expected layering amount.
7. The apparatus of claim 5, further comprising:
a judging unit that judges whether or not a first target proportional relationship satisfies a second specified condition after the first selecting unit selects the first target proportional relationship satisfying the first specified condition; the second specified condition is that the difference between the first target proportional relation and 0 is greater than a specified threshold; or, the second specified condition is that the difference between the first target proportional relation and 0 is equal to a specified threshold;
the coordinated sample amount obtaining unit is specifically configured to:
if the first target proportional relation meets a second specified condition, executing the step to obtain a third sample volume of each sub-layer according to the first target proportional relation and the first sample volume of each sub-layer, if the first target proportional relation does not meet the second specified condition, selecting a second target proportional relation meeting the first specified condition from other proportional relations except the first target proportional relation, and obtaining the third sample volume of each sub-layer according to the second target proportional relation and the first sample volume of each sub-layer.
8. The apparatus according to claim 5, wherein the hierarchical sample amount obtaining unit is specifically configured to:
samples having the sub-layer attribute are extracted from the experimental sample set, the total amount of the samples being the first sample amount, and samples having the sub-layer attribute are extracted from the control sample set, the total amount of the samples being the second sample amount.
CN201710035012.1A 2017-01-17 2017-01-17 Sampling method and sampling device Active CN108319611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710035012.1A CN108319611B (en) 2017-01-17 2017-01-17 Sampling method and sampling device

Publications (2)

Publication Number Publication Date
CN108319611A CN108319611A (en) 2018-07-24
CN108319611B true CN108319611B (en) 2022-03-11

Family

ID=62892226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710035012.1A Active CN108319611B (en) 2017-01-17 2017-01-17 Sampling method and sampling device

Country Status (1)

Country Link
CN (1) CN108319611B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant