CN115795324A

CN115795324A - Cluster sampling method, device, medium and program product

Info

Publication number: CN115795324A
Application number: CN202211423048.4A
Authority: CN
Inventors: 刘明
Original assignee: Shanghai Lianshang Network Technology Co Ltd
Current assignee: Shanghai Lianshang Network Technology Co Ltd
Priority date: 2022-11-15
Filing date: 2022-11-15
Publication date: 2023-03-14

Abstract

An object of the present application is to provide a cluster sampling method, device, medium, and program product. The method includes: acquiring a first data group and a second data group obtained by classification based on an original data feature, wherein the first data group comprises data meeting the feature condition of the original data feature, and the second data group comprises data not meeting that feature condition; acquiring first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group; and determining a corresponding third data group according to one of the first data group and the second data group, the first target feature distribution information, and the second target feature distribution information. The application is suited to causal inference where an AB test is not available, is simple to compute, saves computing resources, and improves computing efficiency.

Description

Cluster sampling method, device, medium and program product

Technical Field

The application relates to the field of communication, in particular to a cluster sampling technology.

Background

The AB test is the gold standard for assessing causal effects, but in some scenarios an AB test cannot be carried out or is too costly. For example, in a service promotion effect evaluation scenario, statistics show that users who participate in the service have a higher subsequent opening frequency and income than users who do not. However, the participating users are of higher quality to begin with, and their indexes were already better before the promotion, which raises the problem of how to perform causal inference without an AB test. Another example is estimating the long-term life cycle value of the users of a new product: because an innovative product has been online only briefly, the long-term value of its users cannot be counted directly. Sometimes a function is fitted to historical data and then used to predict the life cycle value of the new product's users, but the feature attributes of the existing product may differ greatly from those of the new product, which leads to a large deviation.

Disclosure of Invention

It is an object of the present application to provide a cluster sampling method, apparatus, medium, and program product.

According to an aspect of the present application, a cluster sampling method is provided, wherein the method includes:

acquiring a first data group and a second data group which are obtained based on original data characteristic classification, wherein the first data group comprises data meeting the characteristic condition of the original data characteristic, and the second data group comprises data not meeting the characteristic condition of the original data characteristic;

acquiring first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group, wherein the at least one target data feature does not include the original data feature;

and determining a corresponding third data group according to one of the first data group and the second data group, the first target feature distribution information, and the second target feature distribution information, wherein the third data group comprises data corresponding to a counterfactual hypothesis condition about the original data feature.

According to another aspect of the present application, there is provided a cluster sampling apparatus, wherein the apparatus comprises:

a first module, configured to acquire a first data group and a second data group obtained by classification based on an original data feature, wherein the first data group comprises data meeting the feature condition of the original data feature, and the second data group comprises data not meeting the feature condition of the original data feature;

a second module, configured to obtain first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group, where the at least one target data feature does not include the original data feature;

and a third module, configured to determine a corresponding third data group according to one of the first data group and the second data group, the first target feature distribution information, and the second target feature distribution information, where the third data group includes data corresponding to a counterfactual assumption condition about the original data feature.

According to an aspect of the present application, there is provided a computer apparatus, wherein the apparatus comprises:

a processor; and

a memory arranged to store computer executable instructions which, when executed, cause the processor to perform the steps of the method as described in any one of the above.

According to an aspect of the application, there is provided a computer readable storage medium having stored thereon a computer program/instructions, characterized in that the computer program/instructions, when executed, cause a system to perform the steps of the method as described in any of the above.

According to an aspect of the application, there is provided a computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the steps of the method as described in any of the above.

Compared with the prior art, the present application determines, based on the differences in various data features between the first data group and the second data group, the feature distributions of the target data feature in the first data group and in the second data group, and resamples according to those feature distributions to determine the corresponding third data group, thereby obtaining data samples corresponding to the counterfactual hypothesis condition about the original data feature. The application is suited to causal inference where an AB test is not available, is simple to compute, saves computing resources, and improves computing efficiency.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 illustrates a flow diagram of a cluster sampling method according to one embodiment of the present application;

FIG. 2 illustrates a device structure diagram of a computer device according to another embodiment of the present application;

FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described in this application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present application is described in further detail below with reference to the attached figures.

In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory. Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-Change Memory (PCM), Programmable Random Access Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

The device referred to in this application includes, but is not limited to, a user device, a network device, or a device formed by integrating a user device and a network device through a network. The user device includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user (e.g., through a touch panel), such as a smart phone or a tablet computer, and the mobile electronic product may run any operating system, such as the Android operating system or the iOS operating system. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud of multiple servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing, which is a kind of distributed computing: a virtual supercomputer consisting of a collection of loosely coupled computers. The network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless ad hoc network, and the like. Preferably, the device may also be a program running on the user device, the network device, or a device formed by integrating the user device and the network device, the touch terminal, or the network device and the touch terminal through a network.

Of course, those skilled in the art will appreciate that the foregoing is by way of example only, and that other existing or future devices, which may be suitable for use in the present application, are also encompassed within the scope of the present application and are hereby incorporated by reference.

In the description of the present application, "a plurality" means two or more unless specifically defined otherwise.

FIG. 1 shows a cluster sampling method according to an aspect of the present application, applied to a computer device. The method includes step S101, step S102, and step S103. In step S101, a first data group and a second data group obtained by classification based on the original data feature are acquired, where the first data group includes data that satisfies the feature condition of the original data feature, and the second data group includes data that does not satisfy the feature condition of the original data feature; in step S102, first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group are obtained, where the at least one target data feature does not include the original data feature; in step S103, a corresponding third data group is determined according to one of the first data group and the second data group, the first target feature distribution information, and the second target feature distribution information, wherein the third data group includes data corresponding to a counterfactual assumption condition about the original data feature.

Take the influence of participating in small-class teaching on examination scores as an example. Statistics show that students who participated in small-class teaching scored 5 points higher on average than students who did not, but whether a student participated in small-class teaching was not determined by a random experiment. For example, the parents of participating students may have higher incomes and provide better educational and developmental environments, so those students might have scored just as well without small-class teaching. For causal inference in non-randomized settings (without an AB test), commonly used solutions include the following:

1) Adding important omitted features to the regression equation: for example, in the small-class teaching example, adding the parents' income as a feature. The drawback of this method is that the omitted features are numerous and difficult to collect;

2) Adding lagged values of the target variable: previously observed outcomes of the target variable are added to the regression equation. This approach is uncommon because it suffers from severe autocorrelation problems;

3) Using advanced cross-sectional econometric models: for example, instrumental variable regression, propensity score matching, Heckman two-stage models, and regression discontinuity, which are mostly used in academia;

4) Using panel data: typical models include fixed effect models, random effect models, and difference-in-differences models.

However, the above models are difficult to apply in Internet-related industries, for reasons including: 1) features beyond user behavior would need to be collected on a large scale, which requires dedicated development; 2) panel data is difficult to collect because users drop out (repeated observations of the same user cannot be made). Therefore, this scheme provides a simple cluster sampling algorithm for causal inference without an AB test.

Specifically, in step S101, a first data group and a second data group obtained by classification based on the original data feature are acquired, where the first data group includes data satisfying the feature condition of the original data feature, and the second data group includes data not satisfying that feature condition. For example, the computer device obtains the corresponding original data and, according to the original data feature, divides it into two groups of data whose index distributions differ considerably, that is, a first data group and a second data group corresponding to mutually exclusive conditions on the original data feature. The first data group is the data set that meets the feature condition of the original data feature, and the second data group is the data set that does not; for example, the first data group is the user group that participated in a promotion service, and the second data group is the user group that did not.
In some cases the user data is too large and only part of it needs to be used. The computer device may then first cluster the original data into several typical user sets according to one or more data features (e.g., preset data features or the at least one target data feature), and determine the corresponding first data group and second data group from these typical user sets by means of upsampling and downsampling. For example, the user data collected by an application is clustered into a plurality of categories according to revenue contribution, opening frequency, and the like, and data is sampled from these categories according to the feature condition of the original data feature to determine the first data group and the second data group respectively. The sampling includes, but is not limited to, sampling without replacement or sampling with replacement.
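
As a non-limiting illustration of step S101, the following Python sketch splits records into a first data group and a second data group by a feature condition. The record layout, the field name `joined_promotion`, and the helper `split_by_condition` are assumptions introduced for this example only and are not defined by the present application.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def split_by_condition(
    records: List[Record], condition: Callable[[Record], bool]
) -> Tuple[List[Record], List[Record]]:
    """Return (first_group, second_group): records that meet / do not meet the condition."""
    first_group: List[Record] = []
    second_group: List[Record] = []
    for record in records:
        if condition(record):
            first_group.append(record)
        else:
            second_group.append(record)
    return first_group, second_group


# Hypothetical user records; "joined_promotion" plays the role of the original data feature.
users = [
    {"user_id": 1, "joined_promotion": True, "revenue": 12.0},
    {"user_id": 2, "joined_promotion": False, "revenue": 3.5},
    {"user_id": 3, "joined_promotion": True, "revenue": 8.0},
    {"user_id": 4, "joined_promotion": False, "revenue": 1.0},
]

first_group, second_group = split_by_condition(users, lambda r: r["joined_promotion"])
```

Because every record goes to exactly one of the two lists, the two groups are mutually exclusive and together cover the original data.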

In step S102, first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group are obtained, where the at least one target data feature does not include the original data feature. For example, a target data feature is a group feature used to describe a data distribution within a data group, and the feature distribution corresponding to the target data feature may be a distribution over groups defined by a threshold, such as the proportion of users whose revenue contribution is greater than or equal to X and the proportion of users whose revenue contribution is less than X, which together make up the user data distribution with respect to revenue contribution. In some cases, the feature distribution information may instead be a group distribution over a plurality of intervals of the target data feature, for example, the proportion of users with revenue contribution below a, the proportion with revenue contribution in the interval a-b, and the proportion with revenue contribution above b. The target feature distribution information indicates the feature distribution of the at least one target data feature in the respective data group; it may be a description of the single-feature distribution of each target data feature in the data group, or a combined feature distribution of combined data features determined by weighting the single-feature distributions of the target data features, and the like.
Based on the foregoing, the computer device may obtain first target feature distribution information of the at least one target data feature in the first data group, and second target feature distribution information of the at least one target data feature in the second data group. Here, a target data feature is a user feature of particular interest to researchers. To distinguish it from the original data feature, on which the two groups differ greatly, the target data features do not include the original data feature. For example, the original data feature is the intervention feature used to separate the user groups, while the target data features are features related to the user behavior to be predicted, such as revenue contribution and application opening frequency. The terms "first" and "second" are used merely to distinguish one noun from another and do not denote any particular order.

In some embodiments, the at least one target data feature comprises one target data feature; wherein the obtaining first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group includes: determining, according to the target data feature, a first target data quantity in the first data group that satisfies the target data feature, and determining the first target feature distribution information in the first data group according to the first target data quantity and the total data quantity of the first data group; and determining, according to the target data feature, a second target data quantity in the second data group that satisfies the target data feature, and determining the second target feature distribution information in the second data group according to the second target data quantity and the total data quantity of the second data group. For example, when the at least one target data feature includes one target data feature, that is, only one data feature describes the user behavior, only the influence of a single target data feature on the data in the non-AB setting needs to be considered.
Specifically, the computer device determines, according to the target data feature, the first target data quantity in the first data group that satisfies the target data feature. The first target data quantity indicates the amount of data satisfying the target data feature, and may be the amount of data in one or more intervals of the target data feature. If the target data feature consists of the interval in which the income contribution is greater than or equal to X, the computer device determines the amount of data in the first data group whose income contribution is greater than or equal to X; if the target data feature consists of several intervals, such as income contribution smaller than a, in a-b, and larger than b, the first target data quantity is the amount of data in the first data group falling in each of these three intervals. Then, from the first target data quantity and the total data quantity of the first data group, the data proportion of each interval of the target data feature can be calculated, and the proportions of the intervals together determine the first target feature distribution information of the target data feature in the first data group. As with the first data group, the computer device may determine, according to the target data feature, a second target data quantity in the second data group that satisfies the target data feature, and determine the second target feature distribution information in the second data group according to the second target data quantity and the total data quantity of the second data group.
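
The proportion calculation described above can be sketched as follows. The helper name `interval_distribution`, the example values, and the convention that a value equal to a boundary falls into the upper interval are assumptions of this sketch; the present application does not fix how boundary values are assigned.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence


def interval_distribution(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each interval delimited by the sorted edges.

    With edges [a, b] the intervals are: below a, from a up to (not including) b,
    and b or above. A single edge [X] gives the two-way split below X / at least X.
    """
    if not values:
        raise ValueError("the data group must not be empty")
    # bisect_right returns the index of the interval each value falls into.
    counts = Counter(bisect_right(edges, v) for v in values)
    total = len(values)
    return [counts.get(i, 0) / total for i in range(len(edges) + 1)]


# Hypothetical revenue contributions for the two data groups, with a = 3 and b = 10.
first_revenue = [1.0, 5.0, 5.0, 12.0]
second_revenue = [0.5, 1.0, 2.0, 4.0]

first_distribution = interval_distribution(first_revenue, [3.0, 10.0])
second_distribution = interval_distribution(second_revenue, [3.0, 10.0])
```

Each returned list is the interval count divided by the total data quantity of the same group, so its entries sum to 1.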

In some embodiments, the at least one target data feature comprises a plurality of target data features; wherein the obtaining first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group includes: determining, according to the plurality of target data features, first unit target feature distribution information of each target data feature in the first data group, so as to acquire a plurality of pieces of first unit target feature distribution information for the plurality of target data features, and determining the first target feature distribution information in the first data group according to the plurality of pieces of first unit target feature distribution information; and determining, according to the plurality of target data features, second unit target feature distribution information of each target data feature in the second data group, so as to acquire a plurality of pieces of second unit target feature distribution information for the plurality of target data features, and determining the second target feature distribution information in the second data group according to the plurality of pieces of second unit target feature distribution information. For example, there may be several target data features, such as revenue contribution, opening frequency, and opening duration.
The computer device may determine, according to the plurality of target data features, the first unit target feature distribution information corresponding to each target data feature, where the first unit target feature distribution information indicates the single-feature distribution of that target data feature in the first data group. If one of the target data features consists of several intervals, such as income contribution smaller than a, in a-b, and larger than b, the corresponding first unit target feature distribution is the proportion, relative to the total data of the first data group, of the data falling in each of these three intervals. The computer device then determines the first target feature distribution information of the plurality of target data features in the first data group according to the first unit target feature distribution information of each target data feature. For example, it assigns a weight to each piece of first unit target feature distribution information and determines the first target feature distribution information from the weights and the first unit target feature distribution information. As another example, the plurality of target data features are permuted and combined to determine corresponding combined data features, and the data proportion of each combined data feature is determined to obtain the first target feature distribution information.
The process of determining the first unit target feature distribution information is the same as or similar to the process of determining the first target feature distribution information when there is only one target data feature. In some embodiments, the determining first unit target feature distribution information of each target data feature in the first data group according to the plurality of target data features includes: sequentially selecting one target data feature to be determined from the plurality of target data features, determining, according to the target data feature to be determined, a first target data quantity in the first data group that satisfies it, and determining the first unit target feature distribution information in the first data group according to the first target data quantity and the total data quantity of the first data group; and the determining second unit target feature distribution information of each target data feature in the second data group according to the plurality of target data features includes: sequentially selecting one target data feature to be determined from the plurality of target data features, determining, according to the target data feature to be determined, a second target data quantity in the second data group that satisfies it, and determining the second unit target feature distribution information in the second data group according to the second target data quantity and the total data quantity of the second data group.
In some embodiments, the determining first target feature distribution information in the first data group according to the plurality of pieces of first unit target feature distribution information includes: permuting and combining the plurality of pieces of first unit target feature distribution information of the plurality of target data features, and determining first combined target feature distribution information of a plurality of combined target data features, so as to acquire the first target feature distribution information of the plurality of target data features; and the determining second target feature distribution information in the second data group according to the plurality of pieces of second unit target feature distribution information includes: permuting and combining the plurality of pieces of second unit target feature distribution information of the plurality of target data features, and determining second combined target feature distribution information of the plurality of combined target data features, so as to acquire the second target feature distribution information of the plurality of target data features.
For example, after the computer device determines the plurality of pieces of first unit target feature distribution information, it may permute and combine the plurality of target data features and determine the corresponding combined target feature distribution information. Suppose target data feature 1 is revenue contribution divided into three intervals (smaller than a, in a-b, and larger than b), and target data feature 2 is opening frequency divided into two intervals (below A and above A). The three intervals of target data feature 1 and the two intervals of target data feature 2 can be combined into six combined target features: revenue contribution smaller than a and opening frequency below A; revenue contribution smaller than a and opening frequency above A; revenue contribution in a-b and opening frequency below A; revenue contribution in a-b and opening frequency above A; revenue contribution larger than b and opening frequency below A; revenue contribution larger than b and opening frequency above A. Accordingly, from the feature distribution ratios of the three intervals of target data feature 1 (e.g., 30%, 20%, 50%) and of the two intervals of target data feature 2 (e.g., 40%, 60%), the feature distribution ratios of the six combined target features can be determined by multiplying the corresponding ratios, giving 12%, 18%, 8%, 12%, 20%, and 30% respectively. The first unit target feature distribution information and the first target feature distribution information can be determined by the foregoing process, and the second unit target feature distribution information and the second target feature distribution information are determined in the same or a similar way.
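
The combination of interval ratios in this example can be reproduced with a short Python sketch. The function name `combine_distributions` and the interval labels are assumptions introduced for illustration; the ratios are the example values given above.

```python
from itertools import product
from typing import Dict, Tuple


def combine_distributions(*marginals: Dict[str, float]) -> Dict[Tuple[str, ...], float]:
    """Multiply per-feature interval shares to get the share of every interval combination."""
    combined: Dict[Tuple[str, ...], float] = {}
    for combo in product(*(m.items() for m in marginals)):
        labels = tuple(label for label, _ in combo)
        share = 1.0
        for _, interval_share in combo:
            share *= interval_share
        combined[labels] = share
    return combined


# Example ratios from the description: three revenue intervals, two opening-frequency intervals.
revenue = {"<a": 0.30, "a-b": 0.20, ">b": 0.50}
opening = {"<A": 0.40, ">A": 0.60}

combined = combine_distributions(revenue, opening)
for labels, share in combined.items():
    print(labels, f"{share:.0%}")
```

Multiplying the single-feature ratios treats the features as if they were independent within the data group. Where the features are correlated, counting the records that fall in each combination directly may be a closer description of the data; that alternative is a suggestion of this note, not part of the example above.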

Of course, those skilled in the art will appreciate that the above-described target data features are merely exemplary, and that other existing or future target data features, as may be applicable to the present application, are also encompassed within the scope of the present application and are hereby incorporated by reference.

In step S103, a corresponding third data group is determined according to one of the first data group and the second data group, the first target feature distribution information, and the second target feature distribution information, wherein the third data group includes data corresponding to a counterfactual assumption condition about the original data feature. For example, after the computer device determines the first target feature distribution information and the second target feature distribution information, it may determine corresponding sampling distribution information from them, and sample from the first data group or the second data group based on the counterfactual assumption condition to determine the corresponding third data group. Estimating a causal effect can be understood as a process of counterfactual inference: for the users who received the intervention (participated in small-class teaching), it asks what each person's performance would have been had they not participated. In reality, however, a person cannot both participate and not participate in small-class teaching, which is the dilemma that causal effect assessment faces under the counterfactual inference framework. For example, when a company evaluates a promotion service for a specific business, the counterfactual assumption condition is "for the users who participated in the promotion service, what would their subsequent behavior have been if the company had not run the promotion at that time". Such data cannot be observed either, because the company cannot have data for the same users both with and without the promotion at that time.
In some embodiments, the method further includes step S104 (not shown). In step S104, a counterfactual assumption condition about the original data feature is obtained, where the counterfactual assumption condition includes a factual condition and an assumption condition corresponding to the original data feature, the factual condition matches one of the first data group and the second data group, and the assumption condition matches the other data group. Step S103 then includes a substep S1031 (not shown) and a substep S1032 (not shown). In step S1031, a matching sampling data group is determined from the first data group and the second data group according to the assumption condition of the counterfactual assumption condition. In step S1032, a corresponding third data group is determined according to the sampling data group, the first target feature distribution information, and the second target feature distribution information, where the third data group includes data corresponding to the counterfactual assumption condition about the original data feature. For example, the counterfactual assumption condition describes the hypothetical scenario of interest: the state of the original data feature in one of the first data group and the second data group is taken as the factual basis, and the state of the original data feature in the other data group is taken as the assumption.
For example, the counterfactual assumption condition takes as its factual condition that the original data feature in the first data group satisfies the feature condition, and takes as its assumption condition that the original data feature in the second data group does not satisfy the feature condition. In the promotion evaluation example, the counterfactual assumption condition is "what would the subsequent behavior of the users who participated in the promotion service have been if the company had not run the promotion at that time". Of course, if not participating in the promotion service is taken as the intervention feature, the counterfactual assumption condition can instead be "what would the subsequent behavior of the users who did not participate in the promotion service have been if the company had promoted to them at that time". In either case, the factual condition and the assumption condition each correspond to one of the first data group and the second data group, and the sampling data group is determined from the first data group and the second data group according to the counterfactual assumption condition, for example by taking the data group corresponding to the assumption condition as the sampling data group. After the computer device determines the sampling data group, it samples from the sampling data group according to the first target feature distribution information and the second target feature distribution information, and determines the sampled data as the third data group.
Specifically, the computer device determines corresponding sampling distribution information according to the first target feature distribution information and the second target feature distribution information, samples from the sampling data group according to that sampling distribution information, and forms the sampled data into the corresponding third data group.
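As a minimal sketch of the selection described above, the following Python fragment picks the sampling data group as the group that matches the assumption condition. The `CounterfactualCondition` class, its field names, the `"first"`/`"second"` labels, and the user identifiers are all hypothetical names introduced only for illustration; the source does not prescribe any data structure.

```python
from dataclasses import dataclass
from typing import Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class CounterfactualCondition:
    """Hypothetical container recording which data group each condition matches."""
    factual_group: str      # "first" or "second"
    assumption_group: str   # the other group


def select_sampling_group(
    condition: CounterfactualCondition,
    first_group: Sequence[T],
    second_group: Sequence[T],
) -> Sequence[T]:
    """Return the data group matching the assumption condition (one possible reading)."""
    if {condition.factual_group, condition.assumption_group} != {"first", "second"}:
        raise ValueError("factual and assumption conditions must match different groups")
    return first_group if condition.assumption_group == "first" else second_group


# Illustrative data: users who participated in the promotion vs. users who did not.
participants = ["u1", "u2"]
non_participants = ["u3", "u4", "u5"]

# "Users who participated: what if the company had not promoted?"
condition = CounterfactualCondition(factual_group="first", assumption_group="second")
sampling_group = select_sampling_group(condition, participants, non_participants)
```

Here the non-participants become the pool to sample from, which matches the example in which the data group corresponding to the assumption condition is taken as the sampling data group.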

In some embodiments, in step S1032, corresponding sampling distribution information is determined according to the first target feature distribution information and the second target feature distribution information, and sampling is performed from the sampling data group according to the sampling distribution information so that the sampled data form the corresponding third data group, wherein the third data group includes data corresponding to the counterfactual assumption condition about the original data feature. For example, the first target feature distribution information and the second target feature distribution information cover the same sub-feature types, while the amount of data corresponding to each sub-feature differs between the two data groups; a sub-feature may be a single target feature or a combined target feature obtained by permuting and combining the target features. From the first and second target feature distribution information, the sampling distribution information for the third data group can be determined; it indicates the sampling proportion and/or the sampling number of each sub-feature of the third data group. For example, different weights may be assigned to the first target feature distribution information and the second target feature distribution information to obtain the sampling distribution information.
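The source does not specify how the weights assigned to the two distributions are combined. One simple reading, shown here purely as an assumption, is a convex combination of the two distributions over the same sub-features. The function name `blend_distributions`, the parameter `first_weight`, and the example proportions are hypothetical.

```python
from typing import Dict


def blend_distributions(
    first_dist: Dict[str, float],
    second_dist: Dict[str, float],
    first_weight: float,
) -> Dict[str, float]:
    """Weighted mix of two distributions over the same sub-features (assumed reading)."""
    if not 0.0 <= first_weight <= 1.0:
        raise ValueError("first_weight must lie in [0, 1]")
    if first_dist.keys() != second_dist.keys():
        raise ValueError("both distributions must cover the same sub-features")
    second_weight = 1.0 - first_weight
    return {
        sub: first_weight * first_dist[sub] + second_weight * second_dist[sub]
        for sub in first_dist
    }


# Illustrative proportions of two sub-features in each data group.
first_dist = {"high_activity": 0.6, "low_activity": 0.4}
second_dist = {"high_activity": 0.2, "low_activity": 0.8}

sampling_dist = blend_distributions(first_dist, second_dist, first_weight=0.75)
```

Because both inputs sum to 1 and the weights sum to 1, the blended result is again a valid distribution; setting `first_weight` to 1.0 reproduces the first distribution exactly.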

In some embodiments, determining the corresponding sampling distribution information according to the first target feature distribution information and the second target feature distribution information includes: determining, according to the sampling data group, the matching one of the first and second target feature distribution information as the sampling target feature distribution information, and determining the other as the weight target feature distribution information; and determining, for each sub-feature in turn, sampling weight information from the feature distribution ratio of that sub-feature in the weight target feature distribution information and its feature distribution ratio in the sampling target feature distribution information, so as to obtain the sampling distribution information corresponding to the third data group. For example, whichever of the first and second target feature distribution information corresponds to the sampling data group is taken as the sampling target feature distribution information, and the other is taken as the weight target feature distribution information. The feature proportion of each sub-feature in the weight target feature distribution information is then divided by the feature proportion of the same sub-feature in the sampling target feature distribution information to obtain the weight of that sub-feature, which yields the sampling weight distribution information over all sub-features.
The computer device may further adjust the feature proportions in the sampling target feature distribution information based on the weight of each sub-feature. A weight greater than 1 indicates that the sub-feature is under-represented in the sampling data group relative to the other data group, so its sampling proportion or sampling number needs to be increased; a weight less than 1 indicates that the sub-feature is over-represented, so its sampling proportion or sampling number needs to be decreased. In some cases, the sampling distribution information may be determined from the sampling weight distribution information together with the sampling target feature distribution information, for example by multiplying the sampling weight of each sub-feature by the feature proportion or the number of data items of that sub-feature in the sampling target feature distribution information.
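The weight computation described above can be sketched as follows. This is an illustrative sketch, not the applicant's implementation: the function names, the dictionary representation of a distribution, the final normalization step, and the example proportions are assumptions. The division of the weight-target proportion by the sampling-target proportion, and the multiplication of each weight by the sampling-target proportion, follow the text.

```python
from typing import Dict


def sampling_weights(
    weight_dist: Dict[str, float],
    sampling_target_dist: Dict[str, float],
) -> Dict[str, float]:
    """Weight of each sub-feature: weight-target proportion / sampling-target proportion."""
    weights = {}
    for sub, target_ratio in weight_dist.items():
        source_ratio = sampling_target_dist.get(sub, 0.0)
        if source_ratio <= 0.0:
            # The sampling data group has no data for this sub-feature.
            raise ValueError(f"sub-feature {sub!r} is absent from the sampling data group")
        weights[sub] = target_ratio / source_ratio
    return weights


def sampling_distribution(
    weights: Dict[str, float],
    sampling_target_dist: Dict[str, float],
) -> Dict[str, float]:
    """Multiply each weight by the sampling-target proportion, then normalize (assumed step)."""
    raw = {sub: weights[sub] * sampling_target_dist[sub] for sub in weights}
    total = sum(raw.values())
    return {sub: value / total for sub, value in raw.items()}


# Illustrative proportions: the weight target group is balanced,
# while the sampling data group is skewed toward low activity.
weight_dist = {"high_activity": 0.5, "low_activity": 0.5}
sampling_target_dist = {"high_activity": 0.25, "low_activity": 0.75}

weights = sampling_weights(weight_dist, sampling_target_dist)
dist = sampling_distribution(weights, sampling_target_dist)
```

In this example the weight for `high_activity` is 2.0 (under-represented, so it is sampled more) and the weight for `low_activity` is about 0.67 (over-represented, so it is sampled less); the resulting sampling distribution reproduces the balanced proportions of the weight target group.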

In some embodiments, sampling from the sampling data group according to the sampling distribution information so that the sampled data form the corresponding third data group includes: sampling from the feature data of the sampling data group according to the sampling weight distribution information corresponding to each sub-feature in the sampling distribution information, and determining the sampled data of each sub-feature to form the corresponding third data group. For example, the number of samples for each sub-feature may be determined from the sampling distribution information and a preset total number of samples, and the data of each sub-feature in the sampling data group is sampled accordingly to obtain the sampled data of each sub-feature, which together form the third data group. In some embodiments, the sampling includes sampling with replacement. For example, to improve the accuracy and reliability of the sampled data, the computer device may sample with replacement when sampling from the sampling data group according to the sampling distribution information, which improves the sampling effect and supports a more realistic and effective analysis. In some embodiments, the number of data items in the third data group is greater than or equal to the number of data items in the first data group or the second data group. For example, likewise to improve the validity of the data distribution in the sampled third data group, the computer device may be configured so that the number of data items sampled into the third data group is greater than or equal to the number of data items in the first data group or the second data group.
In some cases, the number of data items in the third data group may be greater than or equal to a preset ratio of the number of data items in the first data group or the second data group, the preset ratio being set by an administrator or determined according to, for example, the computing efficiency of the currently available computing resources.
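A minimal sketch of the per-sub-feature sampling with replacement might look like the following. The function name `sample_third_group`, the `key` callback used to read a record's sub-feature, the fixed seed, the rounding of per-sub-feature counts, and the example records are assumptions made for illustration; only the idea of drawing each sub-feature's share with replacement comes from the text.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def sample_third_group(
    sampling_group: Sequence[T],
    key: Callable[[T], Hashable],
    sampling_dist: Dict[Hashable, float],
    n_samples: int,
    seed: int = 0,
) -> List[T]:
    """Draw each sub-feature's share of n_samples from the sampling data group, with replacement."""
    rng = random.Random(seed)

    # Bucket the sampling data group by sub-feature.
    pools: Dict[Hashable, List[T]] = {}
    for record in sampling_group:
        pools.setdefault(key(record), []).append(record)

    third_group: List[T] = []
    for sub, ratio in sampling_dist.items():
        count = round(ratio * n_samples)  # assumed rounding rule
        if count == 0:
            continue
        pool = pools.get(sub, [])
        if not pool:
            raise ValueError(f"no data for sub-feature {sub!r} in the sampling data group")
        # random.Random.choices samples with replacement.
        third_group.extend(rng.choices(pool, k=count))
    return third_group


# Illustrative sampling data group: one high-activity user, three low-activity users.
sampling_group = [
    {"user": "u3", "tier": "high"},
    {"user": "u4", "tier": "low"},
    {"user": "u5", "tier": "low"},
    {"user": "u6", "tier": "low"},
]
third_group = sample_third_group(
    sampling_group,
    key=lambda record: record["tier"],
    sampling_dist={"high": 0.5, "low": 0.5},
    n_samples=8,
)
```

Sampling with replacement is what allows the third data group to contain more records than the sampling data group itself, as in this example where 8 records are drawn from a pool of 4.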

The foregoing mainly describes embodiments of a cluster sampling method according to one aspect of the present application. The present application further provides a device capable of implementing the above embodiments, which is described below with reference to Fig. 2.

Fig. 2 illustrates a computer device for cluster sampling according to one aspect of the present application, the device including a one-one module 101, a one-two module 102, and a one-three module 103. The one-one module 101 is configured to obtain a first data group and a second data group obtained by classification based on an original data feature, where the first data group includes data that satisfies a feature condition of the original data feature, and the second data group includes data that does not satisfy the feature condition of the original data feature. The one-two module 102 is configured to obtain first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group, where the at least one target data feature does not include the original data feature. The one-three module 103 is configured to determine a corresponding third data group according to one of the first data group and the second data group, the first target feature distribution information, and the second target feature distribution information, where the third data group includes data corresponding to a counterfactual assumption condition about the original data feature.

In some embodiments, the at least one target data feature comprises one target data feature; wherein the obtaining first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group includes: determining, according to the target data feature, a first target data quantity in the first data group that satisfies the target data feature, and determining the first target feature distribution information in the first data group according to the first target data quantity and the total data quantity of the first data group; and determining, according to the target data feature, a second target data quantity in the second data group that satisfies the target data feature, and determining the second target feature distribution information in the second data group according to the second target data quantity and the total data quantity of the second data group.
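For a single target data feature, the distribution information reduces to a proportion: the number of records that satisfy the feature divided by the group's total. The sketch below divides each group's count by that same group's total, consistent with the multi-feature embodiment. The function name `feature_ratio`, the `satisfies` predicate, and the example records are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def feature_ratio(group: Sequence[T], satisfies: Callable[[T], bool]) -> float:
    """Proportion of records in `group` that satisfy the target data feature."""
    if not group:
        raise ValueError("data group must not be empty")
    target_count = sum(1 for record in group if satisfies(record))
    return target_count / len(group)


def is_active(record: dict) -> bool:
    """Illustrative target data feature: the user was active before the promotion."""
    return record["active_before"]


# Illustrative data groups split on the original data feature (participation).
first_group = [
    {"user": "u1", "active_before": True},
    {"user": "u2", "active_before": True},
    {"user": "u3", "active_before": True},
    {"user": "u4", "active_before": False},
]
second_group = [
    {"user": "u5", "active_before": True},
    {"user": "u6", "active_before": False},
    {"user": "u7", "active_before": False},
    {"user": "u8", "active_before": False},
]

first_ratio = feature_ratio(first_group, is_active)    # 3 of 4 records
second_ratio = feature_ratio(second_group, is_active)  # 1 of 4 records
```

The gap between the two proportions in this toy data illustrates the selection bias that motivates the method: participants were already more active before the promotion.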

In some embodiments, the at least one target data feature comprises a plurality of target data features; wherein the obtaining first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group includes: determining first unit target feature distribution information of each target data feature in the first data group according to the plurality of target data features to acquire a plurality of first unit target feature distribution information of the plurality of target data feature information, and determining first target feature distribution information in the first data group according to the plurality of first unit target feature distribution information; determining second unit target feature distribution information of each target data feature in the second data group according to the plurality of target data features to acquire a plurality of second unit target feature distribution information of the plurality of target data feature information, and determining second target feature distribution information in the second data group according to the plurality of second unit target feature distribution information. 
In some embodiments, the determining first unit target feature distribution information of each target data feature in the first data group according to the plurality of target data features comprises: sequentially selecting one target data feature to be determined from the plurality of target data features, determining a first target data quantity which meets the target data feature to be determined in the first data group according to the target data feature to be determined, and determining first unit target feature distribution information in the first data group according to the first target data quantity and the total data quantity of the first data group; wherein the determining second unit target feature distribution information of each target data feature in the second data group according to the plurality of target data features comprises: sequentially selecting one target data feature to be determined from the plurality of target data features, determining a second target data quantity which meets the target data feature to be determined in the second data group according to the target data feature to be determined, and determining second unit target feature distribution information in the second data group according to the second target data quantity and the total data quantity of the second data group. 
In some embodiments, the determining first target feature distribution information in the first data set from the plurality of first unit target feature distribution information includes: arranging and combining a plurality of first unit target feature distribution information of the plurality of target data features, and determining first combined target feature distribution information of a plurality of combined target data features to acquire first target feature distribution information of the plurality of target data feature information; wherein the determining second target feature distribution information in the second data group according to the plurality of second unit target feature distribution information comprises: and arranging and combining the second unit target feature distribution information of the target data features, and determining second combined target feature distribution information of the combined target data features to acquire second target feature distribution information of the target data features.
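With several target data features, each combination of feature values can be treated as one combined sub-feature. The source describes obtaining the combined distribution by permuting and combining the unit distributions; the sketch below takes a different but related route, stated here as an assumption, and counts the records that fall into each combination directly. The function name `combined_distribution`, the `feature_values` argument, and the example records are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple


def combined_distribution(
    group: Sequence[dict],
    feature_values: Dict[str, List[str]],
) -> Dict[Tuple[str, ...], float]:
    """Proportion of records in `group` for every combination of target feature values."""
    if not group:
        raise ValueError("data group must not be empty")
    names = list(feature_values)
    # Count how many records fall into each observed combination.
    counts = Counter(tuple(record[name] for name in names) for record in group)
    total = len(group)
    # Enumerate every possible combination so unobserved ones appear with proportion 0.
    return {
        combo: counts.get(combo, 0) / total
        for combo in product(*(feature_values[name] for name in names))
    }


# Illustrative data group with two target data features.
group = [
    {"tier": "high", "region": "north"},
    {"tier": "high", "region": "south"},
    {"tier": "low", "region": "north"},
    {"tier": "low", "region": "north"},
]
feature_values = {"tier": ["high", "low"], "region": ["north", "south"]}

dist = combined_distribution(group, feature_values)
```

Enumerating all combinations, including those with no records, keeps the sub-feature types identical across the first and second data groups, which the weight computation relies on.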

In some embodiments, the device further includes a one-four module (not shown) configured to obtain a counterfactual assumption condition about the original data feature, wherein the counterfactual assumption condition includes a factual condition and an assumption condition corresponding to the original data feature, the factual condition matches one of the first data group and the second data group, and the assumption condition matches the other data group; wherein the one-three module 103 includes a one-three-one unit (not shown) and a one-three-two unit (not shown): the one-three-one unit is configured to determine a matching sampling data group from the first data group and the second data group according to the assumption condition of the counterfactual assumption condition; and the one-three-two unit is configured to determine a corresponding third data group according to the sampling data group, the first target feature distribution information, and the second target feature distribution information, wherein the third data group includes data corresponding to the counterfactual assumption condition about the original data feature.

In some embodiments, the one-three-two unit is configured to determine corresponding sampling distribution information according to the first target feature distribution information and the second target feature distribution information, and to sample from the sampling data group according to the sampling distribution information so that the sampled data form the corresponding third data group, wherein the third data group includes data corresponding to the counterfactual assumption condition about the original data feature.

In some embodiments, determining the corresponding sampling distribution information according to the first target feature distribution information and the second target feature distribution information includes: determining, according to the sampling data group, the matching one of the first and second target feature distribution information as the sampling target feature distribution information, and determining the other as the weight target feature distribution information; and determining, for each sub-feature in turn, sampling weight information from the feature distribution ratio of that sub-feature in the weight target feature distribution information and its feature distribution ratio in the sampling target feature distribution information, so as to obtain the sampling distribution information corresponding to the third data group.

In some embodiments, sampling from the sampling data group according to the sampling distribution information so that the sampled data form the corresponding third data group includes: sampling from the feature data of the sampling data group according to the sampling weight distribution information corresponding to each sub-feature in the sampling distribution information, and determining the sampled data of each sub-feature to form the corresponding third data group. In some embodiments, the sampling includes sampling with replacement. In some embodiments, the number of data items in the third data group is greater than or equal to the number of data items in the first data group or the second data group.

Here, the specific implementations of the one-one module 101, the one-two module 102, the one-three module 103, and the one-four module are the same as or similar to the embodiments of step S101, step S102, step S103, and step S104 described above; they are therefore not repeated here and are incorporated herein by reference.

In addition to the methods and apparatus described in the embodiments above, the present application also provides a computer readable storage medium storing computer code that, when executed, performs the method as described in any of the preceding claims.

The present application also provides a computer program product, which when executed by a computer device performs the method of any of the preceding claims.

The present application further provides a computer device, comprising:

one or more processors;

a memory for storing one or more computer programs;

the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method as recited in any preceding claim.

FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described herein.

In some embodiments, as shown in FIG. 3, the system 300 can be implemented as any of the above-described devices in the various embodiments. In some embodiments, system 300 may include one or more computer-readable media (e.g., system memory or NVM/storage 320) having instructions and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described herein.

For one embodiment, system control module 310 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 305 and/or to any suitable device or component in communication with system control module 310.

The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.

System memory 315 may be used to load and store data and/or instructions for system 300, for example. For one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 315 may include double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).

For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.

For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).

NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 320 may be accessible over a network via communication interface(s) 325.

Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. System 300 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.

For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controllers of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310 to form a system on chip (SoC).

In various embodiments, system 300 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, as an Application Specific Integrated Circuit (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

Additionally, some portions of the present application may be applied as a computer program product, such as computer program instructions, which, when executed by a computer, may invoke or provide the method and/or solution according to the present application through the operation of the computer. Those skilled in the art will appreciate that the forms of computer program instructions that reside on a computer-readable medium include, but are not limited to, source files, executable files, installation package files, and the like, and that the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. In this regard, computer readable media can be any available computer readable storage media or communication media that can be accessed by a computer.

Communication media includes media by which communication signals, including, for example, computer readable instructions, data structures, program modules, or other data, are transmitted from one system to another. Communication media may include conductive transmission media such as cables and wires (e.g., fiber optics, coaxial, etc.) and wireless (non-conductive transmission) media capable of propagating energy waves such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied in a modulated data signal, for example, in a wireless medium such as a carrier wave or similar mechanism such as is embodied as part of spread spectrum techniques. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.

By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory such as random access memory (RAM, DRAM, SRAM); non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), and magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM); magnetic and optical storage devices (hard disk, tape, CD, DVD); and other media, now known or later developed, that can store computer-readable information/data for use by a computer system.

An embodiment according to the present application herein comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or solution according to embodiments of the present application as described above.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not to denote any particular order.

Claims

1. A method of cluster sampling, wherein the method comprises:

2. The method of claim 1, wherein the at least one target data characteristic comprises a target data characteristic; wherein the obtaining first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group includes:

determining a first target data quantity meeting the target data characteristics in the first data group according to the target data characteristics, and determining first target characteristic distribution information in the first data group according to the first target data quantity and the total data quantity of the first data group;

and determining a second target data quantity meeting the target data characteristics in the second data group according to the target data characteristics, and determining second target characteristic distribution information in the second data group according to the second target data quantity and the total data quantity of the second data group.

3. The method of claim 2, wherein the at least one target data characteristic comprises a plurality of target data characteristics; wherein the obtaining first target feature distribution information of at least one target data feature in the first data group and second target feature distribution information of the at least one target data feature in the second data group includes:

determining first unit target feature distribution information of each target data feature in the first data group according to the plurality of target data features, so as to acquire a plurality of pieces of first unit target feature distribution information for the plurality of target data features, and determining the first target feature distribution information in the first data group according to the plurality of pieces of first unit target feature distribution information;

determining second unit target feature distribution information of each target data feature in the second data group according to the plurality of target data features, so as to acquire a plurality of pieces of second unit target feature distribution information for the plurality of target data features, and determining the second target feature distribution information in the second data group according to the plurality of pieces of second unit target feature distribution information.

4. The method of claim 3, wherein the determining first unit target feature distribution information of each target data feature in the first data group according to the plurality of target data features comprises:

sequentially selecting one target data feature to be determined from the plurality of target data features, determining a first target data quantity which meets the target data feature to be determined in the first data group according to the target data feature to be determined, and determining first unit target feature distribution information in the first data group according to the first target data quantity and the total data quantity of the first data group;

wherein the determining second unit target feature distribution information of each target data feature in the second data group according to the plurality of target data features comprises:

sequentially selecting one target data feature to be determined from the plurality of target data features, determining a second target data quantity which meets the target data feature to be determined in the second data group according to the target data feature to be determined, and determining second unit target feature distribution information in the second data group according to the second target data quantity and the total data quantity of the second data group.

5. The method of claim 3 or 4, wherein the determining first target feature distribution information in the first data group according to the plurality of first unit target feature distribution information comprises:

permuting and combining the plurality of pieces of first unit target feature distribution information of the plurality of target data features, and determining first combined target feature distribution information of the resulting combined target data features, so as to acquire the first target feature distribution information of the plurality of target data features;

wherein the determining second target feature distribution information in the second data group according to the plurality of second unit target feature distribution information comprises:

and permuting and combining the plurality of pieces of second unit target feature distribution information of the plurality of target data features, and determining second combined target feature distribution information of the resulting combined target data features, so as to acquire the second target feature distribution information of the plurality of target data features.
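For the multi-feature case in claims 3 to 5, one possible reading is that every combination of feature values is treated as a sub-feature and its share of the group is recorded. The sketch below follows that reading by counting combinations directly from the records. This is an assumption, since the claims do not fix how the unit distributions are combined, and the field names `gender` and `tier` and the helper `combined_distribution` are hypothetical.

```python
from collections import Counter


def combined_distribution(group, features):
    """Map each observed combination of feature values to its share of the group."""
    if not group:
        raise ValueError("group must not be empty")
    # One tuple of feature values per record, e.g. ("f", "high").
    counts = Counter(tuple(record[name] for name in features) for record in group)
    total = len(group)
    return {combo: count / total for combo, count in counts.items()}


# Hypothetical first data group with two target data features.
first_group = [
    {"gender": "f", "tier": "high"},
    {"gender": "f", "tier": "high"},
    {"gender": "m", "tier": "high"},
    {"gender": "m", "tier": "low"},
]

first_combined = combined_distribution(first_group, ["gender", "tier"])
```

Applying the same function to the second data group yields a distribution over the same kind of keys, so the two results can be compared combination by combination.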

6. The method of claim 1, wherein the method further comprises:

acquiring a counterfactual hypothesis condition about the original data feature, wherein the counterfactual hypothesis condition comprises a factual condition and a hypothesis condition corresponding to the original data feature, the factual condition is matched with one of the first data group and the second data group, and the hypothesis condition is matched with the other data group;

wherein the determining a corresponding third data group according to one of the first data group and the second data group, the first target feature distribution information, and the second target feature distribution information comprises:

determining a matched sampling data group from the first data group and the second data group according to the hypothesis condition of the counterfactual hypothesis condition;

and determining the corresponding third data group according to the sampling data group, the first target feature distribution information, and the second target feature distribution information.

7. The method of claim 6, wherein the determining the corresponding third data group according to the sampling data group, the first target feature distribution information, and the second target feature distribution information comprises:

determining corresponding sampling distribution information according to the first target feature distribution information and the second target feature distribution information;

and sampling from the sampling data group according to the sampling distribution information, and forming the sampled data into the corresponding third data group, wherein the third data group comprises data corresponding to the counterfactual hypothesis condition about the original data feature.

8. The method of claim 7, wherein the determining corresponding sampling distribution information according to the first target feature distribution information and the second target feature distribution information comprises:

determining, from the first target feature distribution information and the second target feature distribution information, the one that matches the sampling data group as sampling target feature distribution information, and determining the other as weight target feature distribution information;

and determining, for each sub-feature in turn, sampling weight distribution information corresponding to the sub-feature according to the ratio of the feature distribution proportion of the sub-feature in the weight target feature distribution information to the corresponding feature distribution proportion in the sampling target feature distribution information, so as to obtain the sampling distribution information corresponding to the third data group.
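The weight computation in claim 8 can be sketched as a per-sub-feature ratio. The sketch assumes the ratio is taken as the weight-target share divided by the sampling-target share, so that under-represented sub-features in the sampling data group receive weights above 1. The function name `sampling_weights` and the example shares are hypothetical.

```python
def sampling_weights(weight_dist, sampling_dist):
    """Return, per sub-feature, the weight-target share divided by the sampling-target share."""
    weights = {}
    for sub_feature, target_share in weight_dist.items():
        source_share = sampling_dist.get(sub_feature, 0.0)
        if source_share == 0.0:
            if target_share > 0.0:
                # No record with this sub-feature exists to draw from.
                raise ValueError(f"sub-feature {sub_feature!r} is absent from the sampling group")
            continue
        weights[sub_feature] = target_share / source_share
    return weights


# Hypothetical shares: the sampling data group is mostly "low",
# while the other group (the weight target) is mostly "high".
weight_dist = {"high": 0.75, "low": 0.25}
sampling_dist = {"high": 0.25, "low": 0.75}

weights = sampling_weights(weight_dist, sampling_dist)
```

With these numbers, "high" records get weight 3 and "low" records get weight 1/3, which pulls the resampled group toward the weight-target distribution.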

9. The method of claim 8, wherein the sampling from the sampling data group according to the sampling distribution information, and forming the sampled data into the corresponding third data group comprises:

sampling from the feature data corresponding to the sampling data group according to the sampling weight distribution information corresponding to each sub-feature in the sampling distribution information, and determining the sampled data corresponding to each sub-feature to form the corresponding third data group.

10. The method of any one of claims 7 to 9, wherein the sampling comprises sampling with replacement.

11. The method of any one of claims 1 to 10, wherein the data quantity of the third data group is greater than or equal to the data quantity of the first data group or the second data group.
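Claims 9 to 11 can be illustrated together by a weighted draw with replacement from the sampling data group. This is a minimal sketch using the Python standard library's `random.choices`. The record fields, the helper `sample_third_group`, and the choice of a third group twice the size of the sampling group are assumptions for the example.

```python
import random


def sample_third_group(sampling_group, sub_feature_of, weights, size, seed=None):
    """Draw `size` records with replacement, weighting each record by its sub-feature."""
    rng = random.Random(seed)
    # Records whose sub-feature has no weight are never drawn.
    record_weights = [weights.get(sub_feature_of(r), 0.0) for r in sampling_group]
    if not any(w > 0 for w in record_weights):
        raise ValueError("no record has a positive sampling weight")
    return rng.choices(sampling_group, weights=record_weights, k=size)


# Hypothetical sampling data group and per-sub-feature weights.
sampling_group = [
    {"id": 1, "tier": "high"},
    {"id": 2, "tier": "low"},
    {"id": 3, "tier": "low"},
    {"id": 4, "tier": "low"},
]
weights = {"high": 3.0, "low": 1 / 3}

third_group = sample_third_group(
    sampling_group,
    lambda record: record["tier"],
    weights,
    size=2 * len(sampling_group),  # at least as large as the sampling group
    seed=0,
)
```

Because the draw is with replacement, the third data group can be larger than the group it is drawn from, and a record may appear in it more than once.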

12. A computer device, wherein the device comprises:

a processor; and

a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the steps of the method of any one of claims 1 to 11.

13. A computer-readable storage medium having stored thereon a computer program/instructions, characterized in that the computer program/instructions, when executed, cause a system to perform the steps of the method according to any one of claims 1 to 11.

14. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method of any of claims 1 to 11.