WO2022221743A1 - System and method for estimation of treatment effects from observational and corrupted a/b testing data - Google Patents

Info

Publication number
WO2022221743A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
estimation
treatment
matrix
treatment effects
Prior art date
Application number
PCT/US2022/025140
Other languages
French (fr)
Inventor
Vivek Francis FARIAS
Andrew Li
Tianyi PENG
Original Assignee
Massachusetts Institute Of Technology
Priority date: 2021-04-15
Filing date: 2022-04-15
Publication date: 2022-10-20
Application filed by Massachusetts Institute Of Technology filed Critical Massachusetts Institute Of Technology
Publication of WO2022221743A1 publication Critical patent/WO2022221743A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0202Market predictions or forecasting for commercial activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0242Determining effectiveness of advertisements

Abstract

A system is provided for building experiments in the real world that suffer from imperfect controls and for inferring correctly from those experiments. The system contains a storage device for storing transaction data and intervention data, and an estimation engine that performs the steps of: receiving transaction data and intervention data, also referred to herein as observational data; organizing the observational data as a matrix or tensor; transforming the transaction data and intervention data into a panel format; using a de-biased matrix completion algorithm to learn treatment effects of promotions at each store at each time period; and validating the learned treatment effects.

Description

SYSTEM AND METHOD FOR ESTIMATION OF TREATMENT EFFECTS FROM OBSERVATIONAL AND CORRUPTED A/B TESTING DATA
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority to U.S. Provisional Patent Application No.
63/175,500, entitled Causal Inference In Tensor Data with Multiple Treatments Applied in
General Treatment Patterns, which was filed on April 15, 2021. The disclosure of the prior application is incorporated by reference herein in its entirety.
FIELD OF THE INVENTION
The present invention relates to systems for estimating the impact of decisions prior to making such decisions, and particularly to doing so in a manner that minimizes the effect of interference.
BACKGROUND OF THE INVENTION
When considering a digital service that interacts with customers in a variety of different modes on its platform, and seeks to understand the differential impact of any of these modes of interaction on a customer outcome of interest, such impacts are often effectively measured via A/B testing. Modern A/B testing seeks to estimate "conditional average treatment effects" at increasingly fine levels of granularity. In commerce applications, the demands of finer consumer/product-level granularity and a large palette of potential treatments lead to an untenable explosion in the number of A/B tests required (from the perspective of samples, and the cost/complexity of maintaining such tests). Given this challenge, it is desirable to estimate answers to questions typically answered by A/B testing using data that is already available - so-called "observational data".
Different methods have been attempted to address this problem, with unsatisfactory results. The 'Synthetic Control' [Abadie et al. 2003, 2010] framework presents a conceptual approach to addressing the problem above, by compensating for the lack of an actual control (that would exist in A/B testing) with a 'synthetic' one constructed as a composite of other units. In the present example this would correspond to constructing a control for a given customer (or group of customers) with a combination of other customers in the dataset. The synthetic control framework is restricted in its scope: (1) it allows for a limited set of data structures with respect to the observational data available, allowing for only 'panels' (or matrices); and (2) more importantly, it allows for a very restricted set of treatment patterns (i.e., the pattern of data elements that are subject to a treatment must essentially be block shaped). These restrictions preclude the sort of observational data one would have access to on a typical digital platform/service.
Recently, there have been efforts to relax synthetic control restrictions by employing vanilla matrix completion techniques, but they continue to be limited in their use. For example, the approach of [Athey et al. 2018] presents no error recovery guarantees, so that any inference on the correctness of the estimates is not available (i.e., the approach provides no guarantees on whether the treatment effects estimated by the approach are in fact correct). As another example, the approach of [Xiong and Pelger 2019], which places restrictions on the data requiring it to be 'stationary', is untenable on a fast-evolving digital platform where emergent trends are routine. As a further example, the approaches of [Amjad et al. 2018] and [Agarwal et al. 2020], like synthetic control, place severe restrictions on the treatment pattern. Therefore, there is a need to address the shortcomings of the current state of the art.
SUMMARY OF THE INVENTION
The present system and method provide a new approach to estimating treatment effects from data panels that might consist entirely of observational data, or else of data obtained from A/B tests that have potentially been corrupted for a variety of reasons. The present system and method provide a new approach to overcoming the lack of an obvious control in these settings that would otherwise have served to establish counterfactual outcomes. It is first described how such data may be manipulated so as to be viewed as the sum of a low-rank counterfactual matrix, a noise matrix, and a treatment matrix. The present system and method then prescribe an optimization algorithm to recover the average treatment effect corresponding to the treatment matrix along with the counterfactual matrix. It is shown that the optimization algorithm of the estimation engine has many desirable properties, including, but not limited to, it being optimal from among all possible algorithms for this problem in a sense made precise herein.
A system is provided for building experiments in the real world that suffer from imperfect controls and for inferring correctly from those experiments. The system contains a storage device for storing transaction data and intervention data, and an estimation engine that performs the steps of: receiving transaction data and intervention data, also referred to herein as observational data; organizing the observational data as a matrix or tensor; transforming the transaction data and intervention data into a panel format; using a de-biased matrix completion algorithm to learn treatment effects of promotions at each store at each time period; and validating the learned treatment effects.
Other systems, methods, features, and advantages of the present invention will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present invention, and be protected by the accompanying claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram illustrating a network in which the present system and method may be implemented.
FIG. 2 is a schematic diagram further illustrating the estimation system of FIG. 1.
FIG. 3 is a flowchart that illustrates steps performed by the present estimation system.
FIG. 4 is a diagram where the left side provides a promotion pattern of real data, while the right side illustrates an estimation of τ and test errors.
FIG. 5 is a diagram providing an example of transaction data transformed into a panel format.
FIG. 6 is a diagram providing an example of intervention data transformed into a panel format.
DETAILED DESCRIPTION
The problem of causal inference with panel data is a central problem in econometrics.
The following is a formulation of a fundamental version of this problem: Let M* be a low-rank matrix and E be a random matrix with independent, zero-mean, sub-Gaussian entries. For a 'treatment' matrix Z with entries in {0, 1} and some 'treatment effect' τ* ∈ R, we observe the matrix O with entries

$$O_{ij} = M^*_{ij} + E_{ij} + \tau_{ij} Z_{ij},$$

where the τ_ij are unknown, heterogeneous treatment effects whose average over the treated entries is τ*. The problem requires that we estimate the average treatment effect τ*. This setup finds broad applications in causal inference questions for multi-variate time-series data, which arise in areas ranging from macroeconomics and policy evaluation to e-commerce.
The present system and method provide an estimator for τ* that is rate-optimal and asymptotically normal under general conditions on the structure of Z. In particular, the present recovery guarantees are valid under a set of conditions on the treatment matrix Z which relate to its projection on the tangent space of M*. Should these conditions on Z be violated, even by an amount that grows negligibly small with problem size, no algorithm can recover τ*. Therefore, the present system and method generalize the synthetic control paradigm to allow for general treatment patterns. The recovery guarantees of the present system and method are the first of their type. Utilization of an estimator of the present invention on synthetic and real-world data shows a substantial advantage over competing matrix completion-based estimators.
Consider the following common econometric problem: one is provided a sequence of T observations on each of n distinct entities, or units. As a concrete example, in the policy evaluation literature, an entity might correspond to a geographic region with the associated sequence of observations corresponding to some economic time series of interest. In e-commerce, an entity may correspond to a product with the associated sequence of observations corresponding to the sales of that product over time, or with a customer, with the associated sequence of observations corresponding to site-visits for that customer over time.
For each entity, some subset of its observations is potentially impacted by the application of a 'treatment', or intervention. For example, this may correspond to the implementation of a new policy (in the policy evaluation context), or the application of a new type of promotion (in the e-commerce context). The econometric question at hand is to estimate the average effect of this treatment. This problem is of immense applied importance in modern econometrics. This problem can be formalized: Let M* ∈ R^{n×T} be a fixed, unknown matrix and E be a zero-mean random matrix; we refer to M* + E as the "counterfactual" matrix, with each row corresponding to a distinct "unit". A known 'treatment' matrix Z with entries in {0, 1} encodes observations impacted by an intervention. Specifically, we observe a matrix O with entries

$$O_{ij} = M^*_{ij} + E_{ij} + \tau_{ij} Z_{ij},$$

where the τ_ij are unknown, heterogeneous treatment effects. The goal is to recover the average treatment effect

$$\tau^* = \frac{\sum_{ij} \tau_{ij} Z_{ij}}{\sum_{ij} Z_{ij}}.$$
Herein, we refer to this problem as the panel data problem. Note that while we focus on a constant treatment effect, which one may view as capturing the average treatment effect of an intervention, an extension to multiple treatments is also presented.
To allow for meaningful solutions to this problem, assumptions are made on M* and Z; consider the following:
(1) Imputation of Counterfactual Observations: One must make assumptions on M* that, loosely speaking, allow for the imputation of counterfactual entries of M* (i.e., entries in the support of Z). One assumption made to this end is to at least assume that M* is a low-rank (say, rank-r) matrix.
(2) Identifiability: We must rule out the existence of a rank-r matrix M', distinct from M*, for which M' = M* + δZ for some δ ≠ 0, or else identifying τ* is impossible even if E is identically zero.
The synthetic control paradigm, as described in Abadie et al., 2010, Abadie and
Gardeazabal, 2003, which is incorporated herein by reference in its entirety, for instance, requires that Z has support on part of a single row and that the treated row of M* be in the row-space of the untreated rows. An estimator based on linear regression can then allow for estimation of counterfactual values of M*. Other approaches based on the use of propensity scores require knowledge of a generative model for Z and M* to estimate counterfactual values.
More generally, central to any successful approach, one needs a set of assumptions on M* and Z and a complementary approach to imputing counterfactual values of M*.
Against this backdrop, it is natural to imagine that the recent literature on matrix completion might be leveraged to fruitfully solve the panel data problem in greater generality.
Full generality here would mean allowing for arbitrary rank-r matrices M*, and all Z such that M* remains identifiable; that is, all Z for which there exists no rank-r matrix M' = M* + δZ for some non-zero δ.
Attempts to leverage the matrix completion literature to solve the panel data problem essentially view treated entries of M* (i.e., entries in the support of Z) as missing, and then seek to leverage matrix completion techniques to impute these missing entries. Whereas this has the benefit of not assuming any structure a-priori to the impact of the treatments, this approach runs into certain fundamental challenges:
1. Structured Z: Matrix completion techniques typically require that entries be missing at random. In the context of the panel data problem, however, Z is highly structured, so that treating the entries in the support of Z as missing raises immediate challenges. Conversely, if Z were a random matrix with i.i.d. entries (corresponding to the missingness pattern most commonly assumed in matrix completion), the panel data problem is already trivial.
2. Error Guarantees: Ignoring the challenge above, imagine one were able to construct an estimate of M* (say, M̂), and we then proceeded to estimate τ* by measuring the average difference between observations and estimated counterfactuals on the treated entries. It is unclear that even an (optimal) O(1/√n) guarantee on the error in estimating M* would yield useful guarantees on the resulting estimate of τ*. As it turns out, even optimal entry-wise bounds on the error in estimating M* would yield substantially sub-optimal guarantees on the estimate of τ*.
These issues lay bare the challenges with any approach that views the treated observations as 'missing': not only does one need to deal with a potentially non-random missingness pattern, but in addition, one would need some sort of entry-wise control on the estimation error of counterfactual observations. And even that would yield substantially sub-optimal guarantees for τ*. These challenges are reflected in the available results. For example, Athey et al., 2018, which is incorporated by reference in its entirety herein, shows that the recovery of subspaces is possible (with error measured in the Frobenius norm) for a certain stylized choice of Z. This does not lead to guarantees for estimating treatment effects. Agarwal et al., 2020 and Amjad et al., 2018 require, like standard synthetic control, a 'block' treatment pattern, along with an assumption on a certain sub-matrix of M* that trivializes the completion problem. Xiong and Pelger, 2019 make distributional assumptions on M* and Z that effectively require stationarity in the panel data. Additional work, not necessarily related to panel data, but on matrix completion with non-standard observation patterns, exists but by and large is not obviously useful or applicable to the present problem.
The present system and method provide for a unique and novel estimation of multiple treatment effects in observational data that allows for general treatment patterns, provides confidence intervals for any estimated treatment effect, and is extensible to tensors. The present approach is powered by a novel de-biasing technique provided by the present system, which can be implemented for general treatment patterns and multiple treatment effects. The system yields sharp, statistically optimal recovery guarantees and provides a user with a precise joint distribution on the error of estimating all potential treatment effects. In empirical experiments, it is noted that the present system and method provide significantly improved error rates relative to prior approaches, even in settings where those approaches might be used. In particular, the functionality of the estimation approach of the present system and method addresses the need to answer questions germane to multiple A/B tests using observational data.
FIG. 1 is a schematic diagram illustrating a network 2 in which the present system and method may be implemented. As shown by FIG. 1, the network 2 may contain a data server 10 that is in communication with the estimation system of the present invention 100 via the internet
30. In addition, the network 2 may contain a database 20 that is in communication with the estimation system 100 of the present invention via the internet 30. It should be noted that the use of a network is not a requirement of the present system and method, but instead, the present system and method may be implemented on a stand-alone computer. Observational data may be stored within the server 10 and/or the database 20, which is used by the estimation system 100, as is described in further detail herein. It should also be noted that there need not be both a server 10 and a database 20. Instead, only one of the two may be within the network.
Alternatively, the observational data may already be stored within the estimation system 100, thereby alleviating the need for a server 10 or an external database 20 that would be storing and transmitting the observational data. In addition, there may be more than one server 10 and/or more than one external database 20.
The term observational data is intended to cover any data collected in the course of monitoring some performance indicator (or indicators) of interest across a number of distinct observational units (customers, stores, etc.) over time.
Functionality as performed by the present system and method is defined by software modules within the estimation system 100. The estimation system 100, which is illustrated in further detail in FIG. 2, may contain a processor 102, an internal storage device 104, a memory
106 having software 110 stored therein that defines functionality of the estimation system, input and output (I/O) devices (or peripherals) 174, and a local bus 172, or local interface allowing for communication within the estimation system 100. The local interface can be, for example but not limited to, one or more buses or other wired or wireless connections, as is known in the art.
The local bus may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications.
Further, the local bus 172 may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.
The processor 102 is a hardware device for executing software, particularly that stored in the memory 106. The processor 102 can be any custom made or commercially available single core or multi-core processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the present server, a semiconductor based microprocessor (in the form of a microchip or chip set), a microprocessor, or generally any device for executing software instructions.
The memory 106 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.). Moreover, the memory may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory can have a distributed architecture, where various components are situated remotely from one another, but can be accessed by the processor.
The software 110 defines functionality performed by the estimation system 100, in accordance with the present invention. The software 110 in the memory 106 may include one or more separate programs, each of which contains an ordered listing of executable instructions for implementing logical functions of the estimation system 100, as described below. The memory may contain an operating system (O/S) 170. The operating system 170 essentially controls the execution of programs within the estimation system 100 and provides scheduling, input-output control, file and data management, memory management, and communication control and related services. It is noted that the combination of the processor 102, and the memory 106, having the software 110 and operating system 170 therein, may be referred to herein as the estimation engine 108.
The I/O devices 174 may include input devices, for example but not limited to, a keyboard, mouse, scanner, microphone, etc. Furthermore, the I/O devices 174 may also include output devices, for example but not limited to, a printer, display, etc. Finally, the I/O devices
174 may further include devices that communicate via both inputs and outputs, for instance but not limited to, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, or other device.
When the estimation system 100 is in operation, the processor 102 is configured to execute the software 110 stored within the memory 106, to communicate data to and from the memory 106, and to generally control operations of the estimation system 100 pursuant to the software 110, as explained herein.
When the functionality of the estimation system 100 is in operation, the processor 102 is configured to execute the software 110 stored within the memory 106, to communicate data to and from the memory 106, and to generally control operations of the estimation system 100 pursuant to the software 110. The operating system 170 is read by the processor 102, perhaps buffered within the processor 102, and then executed.
When functionality of the estimation system 100 is implemented in software, it should be noted that instructions for implementing the estimation system, can be stored on any computer- readable medium for use by or in connection with any computer-related device, system, or method. Such a computer-readable medium may, in some embodiments, correspond to either or both the memory or the storage device. In the context of this document, a computer-readable medium is an electronic, magnetic, optical, or other physical device or means that can contain or store a computer program for use by or in connection with a computer-related device, system, or method. Instructions for implementing the system can be embodied in any computer-readable medium for use by or in connection with the processor or other such instruction execution system, apparatus, or device. Although the processor has been mentioned by way of example, such instruction execution system, apparatus, or device may, in some embodiments, be any computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. In the context of this document, a “computer-readable medium” can be any means that can store, communicate, propagate, or transport the program for use by or in connection with the processor or other such instruction execution system, apparatus, or device.
Such a computer-readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a nonexhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic) having one or more wires, a portable computer diskette (magnetic), a random access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM,
EEPROM, or Flash memory) (electronic), an optical fiber (optical), and a portable compact disc read-only memory (CDROM) (optical). Note that the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
In an alternative embodiment, where functionality of the estimation system 100 is implemented in hardware, the functionality can be implemented with any or a combination of the following technologies, which are each well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), etc.
The following provides a more detailed description of the present system and method in accordance with exemplary embodiments of the invention. FIG. 3 is a flowchart that illustrates steps performed by the present system. The following description, with regard to FIG. 3, describes these steps in detail. It should be noted that any process descriptions should be understood as representing modules, segments, portions of code, or steps that include one or more instructions for implementing specific logical functions in the process, and alternative implementations are included within the scope of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention. As shown by block 112, the estimation engine 108 first receives transaction and intervention data, also referred to as observational data. This data is used for estimating promotion effects. Two types of observational data are received. A first type of data received is transaction data, for example, at a store level, over a specified period of time. A second type of data is promotions data, for example, at the store level, indicating the starting time and the ending time of a promotion at the store. Such data may be received from the server 10 (FIG. 1), the database 20 (FIG. 1), stored within the storage device 104 of the estimation system 100 (FIG.
2), or received in real-time.
The estimation engine 108 organizes the observational data as a matrix or tensor, as shown by block 114. Entries in the matrix or tensor are either marked 'untreated' or else marked with any treatments applicable to the entry. Herein, the term "treatment" refers to interventions, such as, but not limited to, product promotions or advertising strategies. In practice, data organized in typical NoSQL databases such as Bigtable are routinely organized in this fashion.
Concretely, on a digital platform the modes of such a tensor would correspond respectively to customers, products and time at a minimum; the keys in the database would thus be (customer, product, time) tuples. Matrix completion methods present a means to allow for inference with general treatment patterns.
As shown by block 122, the received raw data, namely, the transaction and intervention data, is transformed into a panel format. An example of transaction data transformed into a panel format is illustrated by FIG. 5 and an example of intervention data transformed into a panel format is illustrated by FIG. 6. Specifically, there are at least two different sequences of paneling data. A first paneling sequence involves paneling data with sales, in which a matrix is created, where rows are individual stores, columns are time units (e.g., weeks), and entries are the sales at each store at each time. A second paneling sequence involves paneling data with interventions, in which a matrix is created, where rows are individual stores, columns are time units, and entries indicate whether a promotion occurred at each store at each time.
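By way of illustration only, the following is a minimal sketch of the paneling step just described, using pandas pivot tables. The column names (store_id, week, sales, start_week, end_week) and the function name build_panels are hypothetical and are not part of the specification; any equivalent transformation of the raw transaction and intervention data into the two panels may be used.

```python
# Illustrative sketch (not part of the specification) of transforming raw
# transaction and promotion records into the two panels described above.
# Column names are hypothetical.
import numpy as np
import pandas as pd

def build_panels(transactions: pd.DataFrame, promotions: pd.DataFrame):
    """transactions: rows of (store_id, week, sales).
    promotions: rows of (store_id, start_week, end_week)."""
    # Panel of outcomes O: rows are stores, columns are weeks, entries are sales.
    O = transactions.pivot_table(
        index="store_id", columns="week", values="sales", aggfunc="sum"
    ).fillna(0.0)

    # Panel of interventions Z: entry is 1 if a promotion ran at that store/week.
    Z = pd.DataFrame(0, index=O.index, columns=O.columns)
    for promo in promotions.itertuples():
        weeks = [w for w in O.columns if promo.start_week <= w <= promo.end_week]
        Z.loc[promo.store_id, weeks] = 1

    return O.to_numpy(dtype=float), Z.to_numpy(dtype=float)
```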
As shown by block 132, the estimation engine 108 then learns the effects of treatment. Specifically, the estimation engine 108 uses a novel de-biased matrix completion algorithm to learn the treatment effects of promotions at each store at each time period. Equations 7 and 8, as provided herein, are used as part of this de-biased matrix completion algorithm to learn the treatment effects of promotions at each store at each time period. The estimation engine 108, through this functionality, also produces confidence intervals on these same treatment effects.
Theorem 2, described herein, combined with the following, assists in producing confidence intervals. The present system and method begin by considering the following natural convex estimator for M* and τ* to get to a 'rough' first estimate of τ*:

$$(\hat{M}, \hat{\tau}) \in \operatorname*{arg\,min}_{M,\,\tau}\; \tfrac{1}{2}\,\lVert O - M - \tau Z\rVert_F^2 + \lambda\,\lVert M\rVert_*.$$

Here λ > 0 is a regularization parameter. Denote by (M̂, τ̂) an optimal solution to this program. While no error guarantees have been available for τ̂ heretofore, this is arguably a natural choice of estimator.
Crucially, this estimator utilizes all observations to simultaneously learn both M* (and thus, counterfactual observations) as well as τ*.
It is unlikely that τ̂ is an optimal estimator. Specifically, this is because the regularizer in the convex estimation problem above introduces bias in the estimation of M*, and thereby introduces a bias in our estimate of τ*. Denote by M̂ = ÛΣ̂V̂ᵀ the singular value decomposition of M̂, and by P_{T̂⊥}(·) the projection onto the orthogonal complement of the tangent space of M̂. For a rank-r matrix X, we colloquially refer to the tangent space of the manifold of up-to-rank-r matrices at X as simply the tangent space of X. The estimation engine 108 derives a correction term that serves to de-bias τ̂ and proposes, for this purpose, the estimator:

$$\tau^d = \hat{\tau} - \frac{\lambda\,\langle Z,\, \hat{U}\hat{V}^\top\rangle}{\lVert P_{\hat{T}^\perp}(Z)\rVert_F^2}.$$

It is worth noting that a natural extension to the panel data problem involves multiple treatment matrices Z_1, ..., Z_k, each with an associated treatment effect τ*_m, so that O = M* + E + Σ_m τ_m ∘ Z_m for some fixed k. While the following focuses primarily on the single-treatment case (given its fundamental role and the state of available estimators for that case), multiple treatments can also be addressed by the current estimation engine.
Functionality of the estimation engine 108 was determined based on structured testing.
We begin by formally defining our problem; we in fact present a generalization to the problem described in the previous section, allowing for multiple treatments. Let M* ∈ R^{n×n} be a fixed rank-r matrix with singular value decomposition (SVD) denoted by M* = U*Σ*V*ᵀ, where U*, V* ∈ R^{n×r} have orthonormal columns, and Σ* ∈ R^{r×r} is diagonal with diagonal entries σ*_1 ≥ ⋯ ≥ σ*_r > 0; let κ = σ*_1/σ*_r be the condition number of M*. There are k treatments that can be applied to each entry, and for each treatment m = 1, ..., k we are given a treatment matrix Z_m ∈ {0, 1}^{n×n} which encodes the entries which have received the m-th treatment (0 meaning no treatment, and 1 meaning being treated). Note that multiple treatments are allowed to be applied to an entry. We then observe a single matrix of outcomes:

$$O = M^* + E + \sum_{m=1}^{k} \tau_m \circ Z_m$$

(∘ is the Hadamard or 'entrywise' product), where each τ_m ∈ R^{n×n} is an unknown matrix of treatment effects, and E ∈ R^{n×n} is a (possibly heterogeneous) random noise matrix. Finally, let τ* ∈ R^k be the vector of average treatment effects, whose value is defined as

$$\tau^*_m = \frac{\langle \tau_m, Z_m\rangle}{\langle Z_m, Z_m\rangle},$$

and let Δ_m = (τ_m − τ*_m) ∘ Z_m be the associated 'residual' matrices. Our problem is to estimate τ* after having observed O and Z_1, ..., Z_k. It is worth noting that the representation above is powerful: for instance, it subsumes the setting where the intervention on any entry is associated with a (0, 1)-valued covariate vector, and the treatment effect observed on that entry is some linear function of this covariate vector plus idiosyncratic noise. Recovery of τ* is then equivalent to recovering covariate-dependent heterogeneous treatment effects.
Our problem also subsumes the synthetic control setting where k = 1 and Z_1 must place support on a single row; the focus of our later analysis will be the case where Z_1 is allowed to be general.
The assumptions imposed in order to state meaningful results can be divided into two groups. The first are assumptions on M* and E that are, by this point, canonical in the matrix completion literature:
Assumption 1 (Random Noise). The entries of E are independent, mean-zero, sub-Gaussian random variables with bounded sub-Gaussian norm.
Assumption 2 (Incoherence). M* is incoherent: ‖U*‖_{2,∞} ≤ √(μr/n) and ‖V*‖_{2,∞} ≤ √(μr/n), where ‖·‖_{2,∞} denotes the maximum l2-norm of the rows of a matrix.
In addition to these standard conditions on M* and E, which we will assume throughout this disclosure, we will also need to impose conditions on the relationship between M* and the Z_m's. Loosely speaking, these conditions preclude treatment matrices that can be "disguised" within M*, in the sense that their projections onto the tangent space of M* are large. Specifically, the formal statements relate to a particular decomposition of the linear space of n × n matrices,

$$\mathbb{R}^{n\times n} = T^* \oplus T^{*\perp},$$

where T* is the tangent space of M* in the manifold consisting of matrices with rank no larger than rank(M*):

$$T^* = \{\,U^*A^\top + BV^{*\top} : A, B \in \mathbb{R}^{n\times r}\,\}.$$

Equivalently, the orthogonal space of T*, denoted T*⊥, is the subspace of matrices whose columns and rows are orthogonal, respectively, to the spaces spanned by U* and V*. Let P_{T*}(·) and P_{T*⊥}(·) denote the projection operators onto T* and T*⊥, respectively.
The estimation engine functionality is constructed in two steps, stated as the following two equations:

$$(\hat{M}, \hat{\tau}) \in \operatorname*{arg\,min}_{M,\,\tau}\; \tfrac{1}{2}\,\Bigl\lVert O - M - \sum_{m=1}^{k}\tau_m Z_m\Bigr\rVert_F^2 + \lambda\,\lVert M\rVert_* \qquad (7)$$

$$\tau^d = \hat{\tau} - D^{-1}\delta \qquad (8)$$

In Eq. 8, define by D ∈ R^{k×k} the Gram matrix with entries D_{lm} = ⟨P_{T̂⊥}(Z_l), P_{T̂⊥}(Z_m)⟩, and by δ ∈ R^k the 'error' vector with components δ_l = λ⟨Z_l, ÛV̂ᵀ⟩, where we have let T̂ denote the tangent space of M̂ and M̂ = ÛΣ̂V̂ᵀ its singular value decomposition.
The first step, Eq. 7, is a natural convex optimization formulation that is used to compute a 'rough' estimate of the average treatment effects. The objective function's first term penalizes choices of M and τ which differ from the observed O, and the second term seeks to penalize the rank of M using the nuclear norm as a (convex) proxy. The tuning parameter λ > 0, which will be specified in our theoretical guarantees, encodes the relative weight of these two objectives.
After the first step, having (M̂, τ̂) as a minimizer of Eq. 7, we could simply use τ̂ as our estimator for τ*. However, a brief analysis of the first-order optimality conditions for Eq. 7 yields a simple, but powerful decomposition of τ̂ − τ* that suggests a first-order improvement to τ̂ via de-biasing:

Lemma 1. Suppose (M̂, τ̂) is a minimizer of Eq. 7. Let M̂ = ÛΣ̂V̂ᵀ, let T̂ denote the tangent space of M̂, and let Δ^τ denote the residual matrix of heterogeneous treatment effects defined above. Denote

$$\Delta_1 = \frac{\lambda\,\langle Z, \hat{U}\hat{V}^\top\rangle}{\lVert P_{\hat{T}^\perp}(Z)\rVert_F^2},\qquad \Delta_2 = \frac{\langle P_{\hat{T}^\perp}(Z),\, E + \Delta^{\tau}\rangle}{\lVert P_{\hat{T}^\perp}(Z)\rVert_F^2},\qquad \Delta_3 = \frac{\langle P_{\hat{T}^\perp}(Z),\, M^*\rangle}{\lVert P_{\hat{T}^\perp}(Z)\rVert_F^2}.$$

Then

$$\hat{\tau} - \tau^* = \Delta_1 + \Delta_2 + \Delta_3. \qquad (9)$$

Consider this error decomposition, Eq. 9, and note that Δ1 is entirely a function of observed quantities. Thus, it is known and removable. The second step, Eq. 8, does exactly this. The resulting de-biased estimator, denoted τ^d, is important. The main results characterize the error τ^d − τ*. The crux of this can be gleaned from the second and third terms of the decomposition: if T̂ is sufficiently 'close' to T*, then Δ3 becomes negligible (because P_{T*⊥}(M*) = 0). Showing closeness of T̂ and T* is the main technical challenge of this work. The remaining error, contributed by Δ2, can then be characterized as a particular 'weighted average' of the (independent) entries of E and the residual matrices, which we show to be min-max optimal.
Lemma 1 is then proved for a single treatment (k = 1). Since k = 1, we suppress redundant subscripts. Consider the first-order optimality conditions of Eq. 7:

$$\langle Z,\, O - \hat{M} - \hat{\tau} Z\rangle = 0, \qquad (10)$$

$$O - \hat{M} - \hat{\tau} Z = \lambda\,(\hat{U}\hat{V}^\top + \hat{W}),\quad \hat{W} \in \hat{T}^\perp,\ \lVert \hat{W}\rVert \le 1 \qquad (11)$$

(Ŵ is called the 'dual certificate'). Combining Eq. 11 and Eq. 10, we have:

$$\lambda\,\langle Z,\, \hat{U}\hat{V}^\top + \hat{W}\rangle = 0, \qquad (12)$$

$$\lambda\,\langle Z, \hat{W}\rangle = -\lambda\,\langle Z, \hat{U}\hat{V}^\top\rangle. \qquad (13)$$

Next, applying ⟨P_{T̂⊥}(Z), ·⟩ to both sides of Eq. 11 and using Eq. 13:

$$(\hat{\tau} - \tau^*)\,\lVert P_{\hat{T}^\perp}(Z)\rVert_F^2 = \lambda\,\langle Z, \hat{U}\hat{V}^\top\rangle + \langle P_{\hat{T}^\perp}(Z),\, E + \Delta^{\tau}\rangle + \langle P_{\hat{T}^\perp}(Z),\, M^*\rangle. \qquad (14)$$

This is equivalent to Eq. 9, completing the proof.
So, to summarize, the estimation engine 108 functionality, as defined by the software 110 in the memory 106 of the estimation engine 108 and executed by the processor 102 of the estimation engine 108, is constructed in two steps: 1) solve the convex program in Eq. 7 to obtain an initial estimate (M̂, τ̂); and 2) de-bias according to Eq. 8. While the estimator has been presented in a setting that allows for multiple treatments (i.e., k > 1), for the sake of simplicity our results here are restricted to the single-treatment (i.e., k = 1) setting. Recall that the work on synthetic control is for a single treatment and a particular form of Z_1 (support on a single row); our results demonstrated herein are for a single treatment but general Z_1. To ease notation, the present disclosure suppresses treatment-specific subscripts (Z_1, τ_1, etc.).
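By way of illustration only, the following is a minimal numerical sketch of the two-step procedure for a single treatment, consistent with the reconstruction of Eqs. 7 and 8 given above. Eq. 7 is solved here by simple alternating minimization (an exact update of τ alternated with singular value soft-thresholding of M); the solver, stopping rule, and function names are implementation choices and assumptions of this sketch, not requirements of the specification.

```python
# Illustrative sketch of the two-step estimator (single treatment, k = 1).
import numpy as np

def svt(A, thresh):
    """Singular value soft-thresholding: the proximal operator of thresh * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def convex_step(O, Z, lam, n_iter=500, tol=1e-7):
    """Eq. 7: minimize 0.5 * ||O - M - tau*Z||_F^2 + lam * ||M||_* by alternating
    exact updates of tau and M (both subproblems have closed-form solutions)."""
    M, tau = np.zeros_like(O), 0.0
    for _ in range(n_iter):
        tau = float((Z * (O - M)).sum() / (Z * Z).sum())   # least-squares update of tau
        M_new = svt(O - tau * Z, lam)                       # nuclear-norm prox update of M
        if np.linalg.norm(M_new - M) <= tol * max(1.0, np.linalg.norm(M)):
            return M_new, tau
        M = M_new
    return M, tau

def debias(O, Z, M_hat, tau_hat, lam, rank_tol=1e-8):
    """Eq. 8 (single treatment): remove the regularization-induced bias from tau_hat."""
    U, s, Vt = np.linalg.svd(M_hat, full_matrices=False)
    r = int((s > rank_tol).sum())
    U, Vt = U[:, :r], Vt[:r, :]
    # P_{T_perp}(Z): project Z onto the orthogonal complement of the tangent space of M_hat.
    PZ = (np.eye(Z.shape[0]) - U @ U.T) @ Z @ (np.eye(Z.shape[1]) - Vt.T @ Vt)
    delta = lam * float((Z * (U @ Vt)).sum())   # bias term lam * <Z, U V^T>
    D = float((PZ * PZ).sum())                  # ||P_{T_perp}(Z)||_F^2
    return tau_hat - delta / D
```

In a multi-treatment setting, the scalar D becomes the k × k Gram matrix and δ a length-k vector, as sketched in the multiple-treatments section below.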
As mentioned previously, the results require a set of conditions that relate the treatment matrix Z to the tangent space T* of M*.
It is assumed that there exist positive constants Cr1, Cr2 such that the projection of Z onto the tangent space T* is suitably small relative to its projection onto T*⊥.
This assumption is necessary for identifying τ* (in a manner made formal by Proposition 2 illustrated herein). Let Δ^τ be the matrix of treatment effect 'residuals.' Our first result establishes a bound on the error rate of τ^d. Note that Δ^τ is a zero-mean matrix and is zero outside the support of Z. Thus, the requirement hereinbelow that its contribution to the error be small is mild. It is trivially met in synthetic control settings. It is also easily seen to be met when Δ^τ has independent, sub-Gaussian entries. Finally, as it turns out, the condition can also admit random sub-Gaussian matrices with complex correlation patterns.
Theorem 1 (Optimal Error Rate). Under the foregoing assumptions, for any C2 > 0 and sufficiently large n, with probability at least 1 − O(n^{−C2}), the bound of Eq. 20 on |τ^d − τ*| holds. Here, Ce is a constant depending (polynomially) on the problem parameters (where Cr1 and Cr2 are the constants appearing in the assumption above).
To begin parsing this result, consider a 'typical' scenario in which the bound of Eq. 20 simplifies. This is minimax optimal (up to log n factors), as shown herein:
Proposition 1 (Minimax Lower Bound). For any estimator τ̂, there exists a problem instance satisfying the assumptions above on which the expected estimation error matches the upper bound of Eq. 20 up to logarithmic factors.
Finally, it is worth considering some special cases under which Eq. 20 reduces further to the rate of Eq. 22, which is the optimal rate (up to log n) achievable even when M* and Δ^τ are known. Any of the following are, alone, sufficient to imply Eq. 22:
Independent residuals: the entries of Δ^τ are independent and sub-Gaussian, with o(1) sub-Gaussian norm.
Synthetic control and block Z: the support of Z consists of an l × c block that is sufficiently sparse.
Panel data regression: Z is sufficiently dense. This recovers the error guarantee available for panel data regression (up to log factors).
A second main result establishes asymptotic normality for the estimation engine 108. This naturally requires some additional control over the variability of Δ^τ. We consider the setting in which the entries of Δ^τ on the support of Z are independent random variables.
Theorem 2 (Asymptotic Normality). Suppose each entry of Δ^τ is a mean-zero, independent random variable with suitably bounded sub-Gaussian norm. Then, with high probability, the error τ^d − τ* coincides, up to a negligible term, with the 'weighted average' of the entries of E and Δ^τ identified above. Consequently, τ^d − τ*, normalized by its standard deviation, is asymptotically standard normal, provided that no single entry's variance dominates the total.
Asymptotic normality is of econometric interest, as it enables inference. Specifically, inference can be performed using a 'plug-in' estimator of the limiting variance, obtained by substituting the tangent space of M^d for that of M*, where M^d is a de-biased estimator for M*.
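By way of illustration only, the following sketches a plug-in confidence interval based on the asymptotic normality discussed above. The homoscedastic variance form used here, σ̂² / ‖P_{T̂⊥}(Z)‖²_F with σ̂² estimated from untreated residuals, is an assumption of this sketch that is consistent with the 'weighted average' error characterization above; it is not a verbatim restatement of Theorem 2, and the function name is hypothetical.

```python
# Illustrative sketch of a plug-in normal-approximation confidence interval.
# The homoscedastic variance form sigma^2 / ||P_{T_perp}(Z)||_F^2 is an assumption
# of this sketch, not a verbatim restatement of Theorem 2.
import numpy as np
from scipy import stats

def plugin_confidence_interval(O, Z, M_hat, tau_d, rank_tol=1e-8, alpha=0.05):
    U, s, Vt = np.linalg.svd(M_hat, full_matrices=False)
    r = int((s > rank_tol).sum())
    U, Vt = U[:, :r], Vt[:r, :]
    PZ = (np.eye(Z.shape[0]) - U @ U.T) @ Z @ (np.eye(Z.shape[1]) - Vt.T @ Vt)

    # Estimate the noise level from residuals on untreated entries.
    sigma2 = float(np.var((O - M_hat)[Z == 0]))

    se = float(np.sqrt(sigma2 / (PZ ** 2).sum()))
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return tau_d - z * se, tau_d + z * se
```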
Referring to block 142 of FIG. 3, the estimation engine 108 then validates the learned treatment effects. The learned treatment effects are checked for robustness. Specific steps in validating learned treatment effects include validating the estimation using simulation, cross-validating the estimation for counterfactuals, and validating the estimation against data collected from randomized controlled experiments, if available. The following provides an example of validating learned treatment effects. It is noted that the present invention is not intended to be limited to the specific experiments exemplified herein.
A set of experiments was conducted on semi-synthetic datasets (the treatment is introduced artificially and thus ground-truth treatment-effect values are known) and real datasets (the treatment is real and ground-truth treatment-effect values are unknown). The results show that the present estimation engine 108 estimator τ^d is more accurate than existing methods and that its performance is robust to various treatment patterns.
The following four benchmarks were implemented: (i) Synthetic Difference-in-Differences (SDID), as exemplified in Synthetic Difference in Differences, 2019, by Dmitry Arkhangelsky, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager, which is incorporated by reference herein in its entirety; (ii) Matrix Completion with Nuclear Norm Minimization (MC-NNM), as exemplified in Matrix Completion Methods for Causal Panel Data Models, Journal of the American Statistical Association, pages 1-41, 2021, by Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi, which is incorporated by reference herein in its entirety; (iii) Robust Synthetic Control (RSC), as exemplified in Robust Synthetic Control, The Journal of Machine Learning Research, 19(1): 802-852, 2018, by Muhammad Amjad, Devavrat Shah, and Dennis Shen, which is incorporated by reference herein in its entirety; and (iv) Ordinary Least Squares (OLS), which selects a fit minimizing the squared error of a regression of O on Z together with fixed effects expressed via the all-ones vector. It is worth noting that SDID and RSC only apply to traditional synthetic control patterns (block and stagger herein).
Semi-Synthetic Data (Tobacco)
The first dataset consists of the annual tobacco consumption per capita for 38 states during 1970-
2001, collected from the prominent synthetic control study of Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. Journal of the American statistical Association, 105(490):493-505, 2010, which is incorporated by reference herein in its entirety (the treated unit California is removed). Similar to Matrix completion methods for causal panel data models. Journal of the American Statistical
Association, pages 1-41, 2021, we view the collected data as M* and introduce artificial treatments. We considered two families of patterns that are common in the economics literature: block and stagger. Block patterns model simultaneous adoption of the treatment, while stagger patterns model adoption at different times. In both cases, treatment continues forever once adopted. Specifically, given the parameters (m1, m2), a set of m1 rows of Z are selected uniformly at random. On these rows, Zij = 1 if and only if j ≥ ti, where for block patterns, ti = m2, and for stagger patterns, ti is selected uniformly from values greater than m2.
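By way of illustration only, the following sketches the block and stagger treatment patterns just described. The parameter names m1 and m2 follow the text above; the function name and defaults are hypothetical.

```python
# Illustrative sketch of the block and stagger treatment patterns described above.
import numpy as np

def make_pattern(n_units, n_periods, m1, m2, kind="block", seed=None):
    """Pick m1 treated units at random; treated unit i adopts at time t_i and stays
    treated thereafter, i.e., Z[i, j] = 1 iff j >= t_i. Assumes m2 < n_periods - 1."""
    rng = np.random.default_rng(seed)
    Z = np.zeros((n_units, n_periods), dtype=int)
    for i in rng.choice(n_units, size=m1, replace=False):
        if kind == "block":
            t_i = m2                                    # simultaneous adoption at m2
        else:
            t_i = int(rng.integers(m2 + 1, n_periods))  # stagger: uniform over times > m2
        Z[i, t_i:] = 1
    return Z
```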
To model heterogeneous treatment effects, the per-entry effects τ_ij are generated around the target average τ* with i.i.d. fluctuations and a unit-specific component characterizing each treated unit. The observation is then O_ij = M*_ij + E_ij + τ_ij Z_ij. The magnitude of τ* is fixed through all experiments relative to the mean value of M*. The hyperparameters for all algorithms were tuned using rank r ≈ 5 (estimated via the spectrum of M*). Next, we compare the performances of the various algorithms on an ensemble of 1,000 instances, with the treatment start for block patterns chosen to match the year 1988, when California passed its law for tobacco control. The results are reported in the first two rows of Table 1 below in terms of the average normalized error |τ^d − τ*| / |τ*|.
Note that the treatment patterns here are 'home court' for the SDID and RSC synthetic control methods, but our approach nonetheless outperforms these benchmarks. One potential reason is that these methods do not leverage all of the available data for learning counterfactuals: MC-NNM and SDID ignore treated observations. RSC ignores even more: in addition, it does not leverage some of the untreated observations in M* on treated units (i.e., observations O_ij for j < t_i on treated units).
Table 1: Comparison of the present algorithm of the present estimation engine 108 (De-biased Convex) to benchmarks on semi-synthetic datasets (Block and Stagger correspond to the Tobacco dataset; the Adaptive pattern corresponds to the Sales dataset). Average normalized error is reported.
As shown by Table 1, the errors using the approach introduced by this invention are substantially lower than state-of-the-art alternatives across all settings.
Semi-Synthetic Data (Sales)
The second dataset consists of weekly sales of 167 products over 147 weeks, collected from a Kaggle competition. In this application, treatment corresponds to various 'promotions' of a product (e.g., price reductions, advertisements, etc.). We introduced an artificial promotion Z, used the collected data as M* (mean value approximately 12,170), and the goal was to estimate the average treatment effect. The heterogeneous treatment effects follow the same generation process as above.
Now the challenge in these settings is that these promotions are often decided based on previous sales. Put another way, the treatment matrix Z is constructed adaptively. We considered a simple model for generating adaptive patterns for Z: Fix parameters (a, b). If the sales of a product reach their lowest point among the past a weeks, then we added promotions for the following b weeks (this models a common preference for promoting low-sale products). Across our instances, (a, b) was generated at random. This represents a treatment pattern where it is unclear how typical synthetic control approaches (SDID, RSC) might even be applied.
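By way of illustration only, the following sketches the adaptive promotion rule just described: if a product's sales reach their lowest point over the past a weeks, a promotion is added for the following b weeks. The rule is applied here to the counterfactual panel M*, the function name is hypothetical, and the distribution used to draw (a, b) in the experiments is not reproduced.

```python
# Illustrative sketch of the adaptive promotion pattern described above.
import numpy as np

def adaptive_pattern(M, a, b):
    """M: (products x weeks) counterfactual sales panel. Returns the 0-1 matrix Z."""
    n, T = M.shape
    Z = np.zeros((n, T), dtype=int)
    for i in range(n):
        j = a                      # need at least `a` past weeks before applying the rule
        while j < T:
            if M[i, j] <= M[i, j - a:j].min():     # lowest point among the past a weeks
                Z[i, j + 1:min(j + 1 + b, T)] = 1  # promote for the following b weeks
                j += b + 1                         # resume after the promoted window
            else:
                j += 1
    return Z
```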
The rank of M* is estimated via the spectrum, with r ≈ 35. See Table 1 for the results averaged over 1,000 instances. The average normalized error for our algorithm is substantially lower than the 27.6% obtained by MC-NNM. We conjecture that the reason for this is that highly structured missing-ness patterns are challenging for matrix-completion algorithms; we overcome this limitation by leveraging the treated data as well. Of course, there is a natural trade-off here: if the heterogeneity in the treatment effects were on the order of the variation in M*, then it is unclear that the treated data would help (and it might, in fact, hurt). But for most practical applications, the treatment effects we seek to estimate are typically small relative to the nominal observed values.
Real Data
This dataset consists of daily sales and promotion information for 571 drug stores over 942 days, collected from the Rossmann Store Sales dataset. The promotion dataset Z is binary (1 indicates a promotion is running on that specific day at that store). The real pattern is highly complex (see FIG. 4) and hence synthetic-control type methods (SDID, RSC) again do not apply. Our goal here is to estimate the average increase of sales τ* brought by the promotion. The left side of FIG. 4 shows the promotion pattern of the real data, while the right side illustrates an estimation of τ and test errors.
The hyperparameters for all algorithms were tuned using rank r ≈ 70 (estimated via cross validation). A test set Ω consisting of 20% of the treated entries is randomly sampled and hidden. The test error is then calculated on Ω, normalized by the mean value Ō of O. FIG. 4 shows the results averaged over 100 instances. The algorithm of the estimation engine 108 provides superior test error. This is potentially a conservative measure, since it captures error in approximating both M* and τ*, and the variation contributed by M* to the observations is substantially larger than that contributed by τ*. Now, whereas the ground truth for τ* is not known here, the negative treatment effects estimated by MC-NNM and OLS seem less likely, since store-wise promotions are typically associated with positive effects on sales.
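By way of illustration only, the following sketches the hold-out evaluation just described: a random 20% of the treated entries is hidden, the model is re-fit, and error is measured on the hidden entries. The exact normalization used for the reported test error is not reproduced here, so a mean-normalized absolute error is used as a stand-in; fit_fn stands for any fitting routine (such as the two-step estimator sketched earlier) and is a hypothetical name.

```python
# Illustrative sketch of the hold-out evaluation on treated entries.
# The normalization below is a stand-in, not the exact metric reported in FIG. 4.
import numpy as np

def holdout_test_error(O, Z, fit_fn, frac=0.2, seed=None):
    """fit_fn(O, Z, hidden_mask) -> (M_hat, tau_hat), fit while ignoring hidden entries."""
    rng = np.random.default_rng(seed)
    treated = np.argwhere(Z == 1)
    picked = treated[rng.choice(len(treated), size=int(frac * len(treated)), replace=False)]
    hidden = np.zeros(Z.shape, dtype=bool)
    hidden[picked[:, 0], picked[:, 1]] = True

    M_hat, tau_hat = fit_fn(O, Z, hidden)            # fit without the hidden entries
    pred = M_hat[hidden] + tau_hat * Z[hidden]       # predicted outcomes on hidden entries
    return float(np.abs(O[hidden] - pred).mean() / abs(O.mean()))
```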
With the treatment effects learned, the learned treatment effects can be used for promotion decisions, to the benefit of a user of the estimation system. Business decisions can be made based on the estimated promotion effects. For example, in the decision of whether to begin distribution of a new product line, the estimation system can be used to make a decision to roll out the new product line where tested promotions have significant positive effects. In addition, it may be decided to perform further investigation where there are less positive effects noted. In addition, when promotions have significant negative effects, a decision may be made by the user to discard the new product line, or to further investigate the product line with different promotions.
Extension to Multiple Treatments
Whereas the focus has been on the case of a single treatment, we consider in this section an extension of our estimator to the setting of multiple treatments. Specifically, for some (fixed) k, we consider that a unit may be simultaneously subject to multiple treatments, so that

$$O = M^* + E + \sum_{m=1}^{k} \tau_m \circ Z_m,$$

where Z_m is a 0-1 matrix whose support indicates the observations impacted by treatment m. Thus, our goal here is to estimate a treatment effect vector τ* ∈ R^k.
The convex estimator utilized for scalar τ* has a natural generalization; specifically, we consider as our first step computing a 'rough' estimate of the treatment effect, τ̂, by solving a natural optimization problem:

$$(\hat{M}, \hat{\tau}) \in \operatorname*{arg\,min}_{M,\,\tau}\; \tfrac{1}{2}\,\Bigl\lVert O - M - \sum_{m=1}^{k}\tau_m Z_m\Bigr\rVert_F^2 + \lambda\,\lVert M\rVert_*. \qquad (25)$$
The de-biasing step in this case is slightly more involved, but follows from a decomposition of the quantity τ̂ − τ* derived from the optimality conditions for the convex program. In particular, we have the following:
Lemma 2. Suppose (M̂, τ̂) is a minimizer for the program in Eq. 25. Let M̂ = ÛΣ̂V̂ᵀ and denote by T̂ the tangent space of M̂. Then, for each treatment l = 1, 2, ..., k, we have

$$\sum_{m=1}^{k} (\hat{\tau}_m - \tau^*_m)\,\langle P_{\hat{T}^\perp}(Z_l),\, P_{\hat{T}^\perp}(Z_m)\rangle = \lambda\,\langle Z_l, \hat{U}\hat{V}^\top\rangle + \Bigl\langle P_{\hat{T}^\perp}(Z_l),\, E + \sum_m \Delta_m\Bigr\rangle + \langle P_{\hat{T}^\perp}(Z_l),\, M^*\rangle.$$
This decomposition immediately suggests the appropriate de-biasing required. Specifically, define by D ∈ ℝ^{k×k} the Gram matrix with entries

[Equation]

and the "error" vectors with components

[Equation]
Then, Lemma 2 establishes

[Equation]

Noting that Δ₁ is entirely a function of the solution to Eq. 25, and thus is known, we propose the de-biased estimator:

[Equation]
Our definition implicitly assumes that D is invertible. We view this as a natural assumption on
(the absence of) collinearity in treatments.
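The entries of D and the exact correction term are rendered as images in the original; as a minimal sketch under stated assumptions, the code below takes D_{lm} = <P_{T⊥}(Z_l), P_{T⊥}(Z_m)>, where P_{T⊥} projects onto the orthogonal complement of the tangent space of M̂, and applies a correction of the assumed form τ^d = τ̂ − D⁻¹Δ₁, with the known vector Δ₁ (computed per Lemma 2) supplied by the caller:

import numpy as np

def debias(tau_rough, M_hat, Zs, delta1, rank_tol=1e-8):
    # Assumed de-biasing: tau_d = tau_rough - D^{-1} delta1, with D the Gram matrix
    # of the treatment patterns projected off the tangent space of M_hat.
    U, s, Vt = np.linalg.svd(M_hat, full_matrices=False)
    r = int(np.sum(s > rank_tol))
    U, V = U[:, :r], Vt[:r, :].T

    def proj_T_perp(A):
        # P_{T_perp}(A) = (I - U U^T) A (I - V V^T)
        return A - U @ (U.T @ A) - (A @ V) @ V.T + U @ (U.T @ A @ V) @ V.T

    PZ = [proj_T_perp(Z) for Z in Zs]
    D = np.array([[np.sum(Pl * Pm) for Pm in PZ] for Pl in PZ])
    # Invertibility of D corresponds to the absence of collinearity in treatments.
    return np.asarray(tau_rough) - np.linalg.solve(D, np.asarray(delta1))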
Now, the RMSE for this estimator is simply

[Equation]
For fixed k, τ^d as defined herein is also a near-optimal estimator under suitable conditions on M* and the treatment matrices Z_l that are analogous to those for the single-treatment case. Specifically, we see that if

[Equation]

(which will likely require that k be fixed), then the third term is negligible. This is since

[Equation] = 0,

so that the error is dominated by the second term. Now if, in addition, the error revealed by the second term is optimal (by way of comparison with the

[Equation]

error of the least squares estimator for the case when M* is known).

Claims

CLAIMS

We claim:
1. A system for building experiments in the real world that suffer from imperfect controls and for inferring correctly from the experiments, wherein the system comprises: a storage device for storing transaction data and intervention data; and an estimation engine that performs the steps of: receiving transaction data and intervention data, also referred to herein as observational data; organizing the observational data as a matrix or tensor; transforming the transaction data and intervention data into a panel format; using a de-biased matrix completion algorithm to learn treatment effects of promotions at each store at each time period; and validating the learned treatment effects.
2. The system of claim 1, wherein the observational data comprises at least two types of data, wherein a first type of observational data received is transaction data, and a second type of observational data is promotions data.
3. The system of claim 2, wherein the transaction data is at a store level, over a specified period of time.
4. The system of claim 2, wherein the promotions data is at the store level, indicating the starting time and the ending time of a promotion at the store.
5. The system of claim 1, wherein the learned treatment effects are checked for robustness.
6. The system of claim 1, wherein the steps of validating the learned treatment effects include validating the estimation using simulation, cross-validating the estimation for counterfactuals, and validating the estimation for data collected from randomized controlled experiments.
PCT/US2022/025140 2021-04-15 2022-04-15 System and method for estimation of treatment effects from observational and corrupted a/b testing data WO2022221743A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163175500P 2021-04-15 2021-04-15
US63/175,500 2021-04-15

Publications (1)

Publication Number Publication Date
WO2022221743A1 true WO2022221743A1 (en) 2022-10-20

Family

ID=83639775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/025140 WO2022221743A1 (en) 2021-04-15 2022-04-15 System and method for estimation of treatment effects from observational and corrupted a/b testing data

Country Status (1)

Country Link
WO (1) WO2022221743A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117954114A (en) * 2024-03-26 2024-04-30 北京大学 Real world data borrowing method and system based on tendency grading and power priori


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080294996A1 (en) * 2007-01-31 2008-11-27 Herbert Dennis Hunt Customized retailer portal within an analytic platform



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22789064

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22789064

Country of ref document: EP

Kind code of ref document: A1