CA2403249A1 - Gradient criterion method for neural networks and application to targeted marketing - Google Patents

Gradient criterion method for neural networks and application to targeted marketing

Info

Publication number
CA2403249A1
CA2403249A1
Authority
CA
Canada
Prior art keywords
application server
customer
multithreaded
central application
maximum likelihood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002403249A
Other languages
French (fr)
Inventor
Yuri Galperin
Vladimir Fishman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Marketswitch Corp
Original Assignee
Marketswitch Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Marketswitch Corp filed Critical Marketswitch Corp
Publication of CA2403249A1 publication Critical patent/CA2403249A1/en
Abandoned legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The present invention is drawn to a unique application of the Maximum Likelihood statistical method to commercial neural network technologies. The present invention utilizes the specific nature of the output in target marketing problems and makes it possible to produce more accurate and predictive results by minimizing a gradient criterion to produce model weights to get the maximum likelihood result. It is best used on "noisy" data and when one is interested in determining a distribution's overall accuracy, or best general description of reality.

Description

TITLE OF THE INVENTION: Gradient Criterion Method for Neural Networks and Application to Targeted Marketing

FIELD OF THE INVENTION:
This invention relates generally to the development of neural network models to optimize the effects of targeted marketing programs. More specifically, this invention is an improvement on the Maximum Likelihood method of training neural networks using a gradient criterion, and is specially designed for binary output having strongly uneven proportions, which is typical of direct marketing problems.

BACKGROUND OF THE INVENTION:
The goal of most modeling procedures is to minimize the discrepancy between real results and model outputs. If the discrepancy, or error, can be accumulated on a record-by-record basis, it is suitable for gradient algorithms like Maximum Likelihood.
The goal of target marketing modeling is typically to find a method to calculate the probability that any prospect on the list will respond to an offer. The neural network model is built based on experimental data (a test mailing), and the traditional approach to this problem is to choose a model and compute model parameters with a model-fitting procedure.
The topology of the model (for example, the number of nodes, and the input and transfer functions) defines the formula that expresses the probability of response as a function of attributes.
In a special model fitting procedure, the output of the model is tested against actual output (from the results of a test mailing) and the discrepancy is accumulated in a special error function. Different types of error functions can be used (e.g., mean square, absolute error); model parameters are determined to minimize the error function. The best fitting of model parameters is an implicit indication that the model is good (not necessarily the best) in terms of its original objective.
Thus the model building process is defined by two entities: the type of model and the error (or utility) function. The type of model defines the ability of the model to discern various patterns in the data. For example, increasing the number of nodes results in more complicated formulae, so a model can more accurately discern complicated patterns.
The "goodness" of the model is ultimately defined by the choice of an error function, since it is the error function that is minimized during the model training process.
To reach the goal of modeling, one wants to use a utility function that assigns probabilities that are most in compliance with the results of the experiment (the test mailing). The Maximum Likelihood criterion is the explicit measure of this compliance. However, the modeling process as it exists today has a significant drawback: it uses conventional utility functions (least mean square, cross entropy) only because there is a mathematical apparatus developed for these utility functions.
What would really be useful is a process that builds a response model by directly maximizing the likelihood.
For example, suppose a random variable X exists with the distribution p(X, A), where A is an unknown vector of parameters to be estimated based on the independent observations of X: (x1, x2, ..., xN). The goal is to find a vector A that makes the probability of the output p(x1, A) * p(x2, A) * ... * p(xN, A) maximally possible. Note that the function p(X, A) should be a known function of two variables. The Maximum Likelihood technique provides the mathematical apparatus to solve this optimization problem.
In general, the Maximum Likelihood method can be applied to neural networks as follows. Let the neural network calculate a value of the output variable y based on the input vector X. The observed values (y1, y2, ..., yN) represent the actual output with some error e. Assuming that this error has, for example, a normal distribution, the method can find weights W of the neural network that make the probability of the output p(y1, W) * p(y2, W) * ... * p(yN, W) maximally possible. In the case of a normal probability function, the Maximum Likelihood criterion is equivalent to the Least Mean Square criterion, which is, in fact, the one most widely used for neural network training.
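The equivalence for normal errors can be checked numerically. In the sketch below (an illustration with assumed values, not part of the patent), the Gaussian negative log-likelihood differs from the least-mean-square criterion only by the constant factor 1/(2*sigma^2) once the weight-independent term is dropped, so the two criteria share the same minimizing weights:

```python
import numpy as np

rng = np.random.default_rng(0)
yhat = rng.uniform(size=100)                  # model outputs for some fixed W
y = yhat + rng.normal(0.0, 0.1, size=100)    # observed outputs with normal error

sigma = 0.1
# Gaussian negative log-likelihood, dropping the constant N*ln(sigma*sqrt(2*pi))
# that does not depend on the weights:
nll = np.sum((y - yhat) ** 2) / (2.0 * sigma ** 2)
# Least-mean-square criterion over the same residuals:
lms = np.sum((y - yhat) ** 2)

# nll is exactly lms scaled by 1/(2*sigma^2), so both criteria rank candidate
# weight sets identically.
print(nll / lms)
```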
In the case of target marketing, the observed output X is a binary variable that is equal to 1 if a customer responded to the offer, and is 0 otherwise. The normality assumption is too rough, and leads to a sub-optimal set of neural network weights if used in neural network training. This is a typical direct marketing scenario.
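A small illustration (numbers chosen for this note, not from the patent) of why the normality assumption is rough for binary output: the least-mean-square penalty for a badly mis-scored responder saturates near 1, while a likelihood-based penalty grows without bound, so LMS underweights exactly the confident mistakes that matter most when responders are rare:

```python
import math

# A responder (actual output 1) scored with two candidate probabilities p.
for p in (0.01, 0.2):
    lms = (1.0 - p) ** 2     # least-mean-square penalty, bounded by 1
    nll = -math.log(p)       # likelihood-based penalty, unbounded as p -> 0
    print(p, round(lms, 3), round(nll, 3))
```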

SUMMARY OF THE INVENTION:
The present invention represents a unique application of the Maximum Likelihood statistical method to commercial neural network technologies. The present invention utilizes the specific nature of the output in target marketing problems and makes it possible to produce more accurate and predictive results. It is best used on "noisy" data and when one is interested in determining a distribution's overall accuracy, or best general description of reality.
The present invention provides a competitive advantage over off-the-shelf modeling packages in that it greatly enhances the application of Maximum Likelihood to quantitative marketing applications such as customer acquisition, cross-selling/up-selling, predictive customer profitability modeling, and channel optimization.
Specifically, the superior predictive modeling capability provided by using the present invention means that marketing analysts will be better able to:
    • Predict the propensity of individual prospects to respond to an offer, thus enabling marketers to better identify target markets.
    • Identify customers and prospects who are most likely to default on loans, so that remedial action can be taken, or so that those prospects can be excluded from certain offers.
    • Identify customers or prospects who are most likely to prepay loans, so a better estimate can be made of revenues.
    • Identify customers who are most amenable to cross-sell and up-sell opportunities.
    • Predict claims experience, so that insurers can better establish risk and set premiums appropriately.
    • Identify instances of credit-card fraud.

BRIEF DESCRIPTION OF THE DRAWINGS:

Figure 1 shows the dataflow of the method of training the model of the present invention. Figure 2 illustrates a preferred system architecture for employing the present invention.

The present invention uses the neural network to calculate a propensity score g(X, W), where W is a set of weights of the neural network and X is a vector of customer attributes (the input vector). The probability that a customer with attributes X will respond to an offer can be calculated by the formula:
    p = g(X, W) / (1 + g(X, W))

If there are N independent samples and among them n are responders, the probability of such an output is:
    L = [ ∏_{i∈resp} g(X_i, W) · ∏_{i∈non-resp} (1 - g(X_i, W)) ] / ∏_{i=1..N} (1 + g(X_i, W))

Using the logarithm of L as a training criterion (training error) in the form:
    Err = -ln L = Σ_{i=1..N} ln(1 + g_i) - Σ_{i∈resp} ln g_i - Σ_{i∈non-resp} ln(1 - g_i),  where g_i = g(X_i, W),

the neural network training procedure finds the optimal weights W that minimize Err and thus maximize the likelihood L of the observed output. One can use back propagation or a similar method to perform the training. The gradient criterion that is required by a training procedure is computed as follows:
    ∂Err/∂W = Σ_{i=1..N} g'_i / (1 + g_i) - Σ_{i∈resp} g'_i / g_i + Σ_{i∈non-resp} g'_i / (1 - g_i),  where g'_i = ∂g(X_i, W)/∂W

In order for the training procedure to be robust and stable, the output of the neural network should be in the middle of the working interval [0, 1]. To ensure that, the present invention introduces the normalized propensity score f, which is related to g as:
    g(X, W) = f^{1/τ}(X, W)

Now, let f be the output of the neural network and choose the parameter τ in such a way that f is of the order of 0.5.
Let R be the average response rate in the sample. The above condition is satisfied if:

    τ = 1 / ln((1 - R) / R)
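For an assumed response rate (the 5% figure below is illustrative, not from the patent), τ can be computed directly; a customer at the average response rate, whose odds are g = R/(1 - R), then receives a normalized score f = g^τ = exp(-1) ≈ 0.37, which is indeed of the order of 0.5:

```python
import math

R = 0.05                               # assumed average response rate
tau = 1.0 / math.log((1.0 - R) / R)    # tau = 1 / ln((1 - R) / R)

g_avg = R / (1.0 - R)                  # odds of the average-rate customer
f_avg = g_avg ** tau                   # normalized score for that customer
print(tau, f_avg)                      # f_avg = exp(-1), of the order of 0.5
```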
While training the model, the criterion being optimized is calculated from the output of the neural network using the formula:

    Err = -ln L = Σ_{i=1..N} ln(1 + f_i^{1/τ}) - (1/τ) Σ_{i∈resp} ln f_i - Σ_{i∈non-resp} ln(1 - f_i^{1/τ})

The gradient criterion is computed as follows:
    ∂Err/∂W = (1/τ) [ Σ_{i=1..N} f_i^{1/τ-1} f'_i / (1 + f_i^{1/τ}) - Σ_{i∈resp} f'_i / f_i + Σ_{i∈non-resp} f_i^{1/τ-1} f'_i / (1 - f_i^{1/τ}) ],  where f'_i = ∂f(X_i, W)/∂W
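The un-normalized criterion and its gradient above can be checked numerically. The sketch below uses assumed stand-ins (a single sigmoid unit for the network, synthetic responders), not the patent's implementation, and verifies the analytic gradient against central finite differences:

```python
import numpy as np

def propensity(w, X):
    # Hypothetical stand-in for the network: one sigmoid unit keeps g in (0, 1),
    # so that ln(g_i) and ln(1 - g_i) are both defined.
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def train_error(w, X, resp):
    # Err = sum_i ln(1 + g_i) - sum_resp ln(g_i) - sum_non-resp ln(1 - g_i)
    g = propensity(w, X)
    return (np.sum(np.log1p(g)) - np.sum(np.log(g[resp]))
            - np.sum(np.log(1.0 - g[~resp])))

def train_error_grad(w, X, resp):
    # dErr/dW = sum_i [1/(1+g_i) - resp_i/g_i + (1-resp_i)/(1-g_i)] * dg_i/dW
    g = propensity(w, X)
    coef = (1.0 / (1.0 + g) - np.where(resp, 1.0 / g, 0.0)
            + np.where(resp, 0.0, 1.0 / (1.0 - g)))
    dg = (g * (1.0 - g))[:, None] * X        # derivative of the sigmoid unit
    return coef @ dg

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
resp = rng.uniform(size=50) < 0.2            # roughly 20% responders
w = rng.normal(scale=0.1, size=3)

# Central finite differences should agree with the analytic gradient.
eps = 1e-6
num = np.array([(train_error(w + eps * e, X, resp)
                 - train_error(w - eps * e, X, resp)) / (2 * eps)
                for e in np.eye(3)])
print(np.max(np.abs(num - train_error_grad(w, X, resp))))
```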
The method was tested on a variety of business cases against both the Least Mean Square and Cross-Entropy criteria. In all cases the method gave a 20% to 50% improvement in the lift on the top 20% of the target marketing sample customer pools.
As shown in Figure 1, the method inputs data from modeling database 11 into a selected model 12 to calculate scores 13. The error 14 is calculated from a comparison with the known responses from modeling database 11 and checked for convergence 15 below a desired level. When convergence occurs, a new model 16 results, to be used for targeted marketing 17. Otherwise, the process minimizes the error, solves for a new set of weights at 18, and begins a new iteration.
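The Figure 1 loop can be sketched in code. This is an illustrative reading of the dataflow, not the patent's engine: the single-sigmoid scoring model, learning rate, and stopping rule are assumptions, and the comments map steps to the figure's reference numerals:

```python
import numpy as np

def scores(w, X):
    # Hypothetical stand-in for the selected model 12: one sigmoid unit.
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def error(g, resp):
    # Err = sum ln(1+g) - sum_resp ln(g) - sum_non-resp ln(1-g)
    return (np.sum(np.log1p(g)) - np.sum(np.log(g[resp]))
            - np.sum(np.log(1.0 - g[~resp])))

def train(X, resp, w0, tol=1e-8, max_iter=500, lr=0.01):
    w, w_prev, prev = w0.copy(), w0.copy(), np.inf
    for _ in range(max_iter):
        g = scores(w, X)                         # calculate scores (13)
        e = error(g, resp)                       # error vs. known responses (14)
        if prev - e < tol:                       # convergence check (15)
            break                                # w_prev is the new model (16)
        prev, w_prev = e, w.copy()
        coef = (1.0 / (1.0 + g) - np.where(resp, 1.0 / g, 0.0)
                + np.where(resp, 0.0, 1.0 / (1.0 - g)))
        w = w - lr * (coef @ ((g * (1.0 - g))[:, None] * X))  # new weights (18)
    return w_prev

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                    # modeling database 11 (synthetic)
resp = rng.uniform(size=200) < 0.2               # known responses (synthetic)
w0 = np.zeros(3)
w = train(X, resp, w0)                           # model for targeted marketing (17)
```

The returned weights always come from the last iteration whose error strictly decreased, so the loop never hands back a model worse than its predecessor.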
The present invention operates on a computer system and is used for targeted marketing purposes. In a preferred embodiment, as shown in Figure 2, the system runs on a three-tier architecture that supports CORBA as an intercommunications protocol. The desktop client software on targeted marketing workstations 20 supports JAVA. The central application server 22 and multithreaded calculation engines 24, 25 run on Windows NT or UNIX. Modeling database 26 is used for training new models to be applied for targeted marketing related to customer database 28. The recommended minimum system requirements for application server 22 and multithreaded calculation engines 24, 25 are as follows:
HP Platform
Processor: HP
Memory: 256 MB
Disk Space: 10 MB*
Operating System: HP/UX 11 (32 Bit)
Protocol: TCP/IP
Daemons: Telnet and FTP (Optional)
Permissions: Read/Write permissions in area of server installation (no root permissions)

*Approximately 100 MB per 1 million records in the customer database. The above assumes the user client is installed on a PC with the recommended configuration found below.

The recommended minimum requirements for the targeted marketing workstations are as follows:

Used in conjunction with a neural network, the present invention provides a user with data indicating the individuals or classes of individuals who are most likely to respond to direct marketing.

Claims

7. A system for training neural networks with a maximum likelihood utility function, comprising:

a central application server;

a modeling database connected to said central application server;

at least one workstation networked to said central application server;

at least one multithreaded calculation engine networked to said central application server; and software instructions on said central application server, at least one workstation and at least one multithreaded calculation engine so as to provide for:

said at least one workstation to select an initial model function for a propensity score g(X,W), where W is a set of weights of the neural network and X is a vector of customer attributes from a modeling database; and said at least one multithreaded calculation engine to calculate propensity scores for the customers in the modeling database;

calculate a training error Err; measure the error to check for convergence below a desired value;
obtain a new model and apply it to new data when convergence occurs;
minimize the error to solve for new weights W by minimizing the gradient criterion defined by the formula:

begin a new iteration of the process by calculating new propensity scores for the customers in the modeling database.

8. The system for training neural networks with a maximum likelihood utility function of claim 7, further comprising:

a customer database connected to said central application server and said at least one multithreaded calculation engine; and software instructions to apply the new model to customer data from said customer database upon being selected by said at least one workstation.

9. The system for training neural networks with a maximum likelihood utility function of claim 7, further comprising:

software instructions on said at least one multithreaded calculation engine to:

define f as a normalized propensity score related to g(X,W) by the formula:

g(X,W) = f^{1/τ}(X,W), where f is the output of the neural network; and choose the parameter τ in such a way that f is of the order of 0.5;

wherein R is an average response rate in the sample and the above condition is satisfied if:

wherein:

and gradient criterion is computed as follows:

10. The system for training neural networks with a maximum likelihood utility function of claim 7, further comprising:

a customer database connected to said central application server and said at least one multithreaded calculation engine; and software instructions to apply the new model to a top 20% of a targeted marketing sample customer pool selected from said customer database by said at least one workstation.

11. The system for training neural networks with a maximum likelihood utility function of claim 9, further comprising:

a customer database connected to said central application server and said at least one multithreaded calculation engine; and software instructions to apply the new model to a top 20% of a targeted marketing sample customer pool selected from said customer database by said at least one workstation.

Claims 7-12 added to define the apparatus of the invention.

All the remaining claims are unchanged.
CA002403249A 1999-03-15 2000-03-15 Gradient criterion method for neural networks and application to targeted marketing Abandoned CA2403249A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12421799P 1999-03-15 1999-03-15
US60/124,217 1999-03-15
PCT/US2000/006735 WO2000055790A2 (en) 1999-03-15 2000-03-15 Gradient criterion method for neural networks and application to targeted marketing

Publications (1)

Publication Number Publication Date
CA2403249A1 true CA2403249A1 (en) 2000-09-21

Family

ID=22413524

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002403249A Abandoned CA2403249A1 (en) 1999-03-15 2000-03-15 Gradient criterion method for neural networks and application to targeted marketing

Country Status (3)

Country Link
AU (1) AU3884000A (en)
CA (1) CA2403249A1 (en)
WO (1) WO2000055790A2 (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6993493B1 (en) 1999-08-06 2006-01-31 Marketswitch Corporation Method for optimizing net present value of a cross-selling marketing campaign
SE519507C2 (en) * 2000-10-31 2003-03-04 Goesta Granlund Method and apparatus for training associative networks for artificial vision
US8346593B2 (en) 2004-06-30 2013-01-01 Experian Marketing Solutions, Inc. System, method, and software for prediction of attitudinal and message responsiveness
US8732004B1 (en) 2004-09-22 2014-05-20 Experian Information Solutions, Inc. Automated analysis of data to generate prospect notifications based on trigger events
US7711636B2 (en) 2006-03-10 2010-05-04 Experian Information Solutions, Inc. Systems and methods for analyzing data
US8036979B1 (en) 2006-10-05 2011-10-11 Experian Information Solutions, Inc. System and method for generating a finance attribute from tradeline data
US8027871B2 (en) 2006-11-03 2011-09-27 Experian Marketing Solutions, Inc. Systems and methods for scoring sales leads
US8606626B1 (en) 2007-01-31 2013-12-10 Experian Information Solutions, Inc. Systems and methods for providing a direct marketing campaign planning environment
US8606666B1 (en) 2007-01-31 2013-12-10 Experian Information Solutions, Inc. System and method for providing an aggregation tool
US20080294540A1 (en) 2007-05-25 2008-11-27 Celka Christopher J System and method for automated detection of never-pay data sets
US7996521B2 (en) 2007-11-19 2011-08-09 Experian Marketing Solutions, Inc. Service for mapping IP addresses to user segments
US20100174638A1 (en) 2009-01-06 2010-07-08 ConsumerInfo.com Report existence monitoring
WO2010132492A2 (en) 2009-05-11 2010-11-18 Experian Marketing Solutions, Inc. Systems and methods for providing anonymized user profile data
US9652802B1 (en) 2010-03-24 2017-05-16 Consumerinfo.Com, Inc. Indirect monitoring and reporting of a user's credit data
US9152727B1 (en) 2010-08-23 2015-10-06 Experian Marketing Solutions, Inc. Systems and methods for processing consumer information for targeted marketing applications
US8930262B1 (en) 2010-11-02 2015-01-06 Experian Technology Ltd. Systems and methods of assisted strategy design
US9235728B2 (en) 2011-02-18 2016-01-12 Csidentity Corporation System and methods for identifying compromised personally identifiable information on the internet
US11030562B1 (en) 2011-10-31 2021-06-08 Consumerinfo.Com, Inc. Pre-data breach monitoring
US10255598B1 (en) 2012-12-06 2019-04-09 Consumerinfo.Com, Inc. Credit card account data extraction
US8812387B1 (en) 2013-03-14 2014-08-19 Csidentity Corporation System and method for identifying related credit inquiries
US10102536B1 (en) 2013-11-15 2018-10-16 Experian Information Solutions, Inc. Micro-geographic aggregation system
US10262362B1 (en) 2014-02-14 2019-04-16 Experian Information Solutions, Inc. Automatic generation of code for attributes
US9576030B1 (en) 2014-05-07 2017-02-21 Consumerinfo.Com, Inc. Keeping up with the joneses
US11257117B1 (en) 2014-06-25 2022-02-22 Experian Information Solutions, Inc. Mobile device sighting location analytics and profiling system
US10339527B1 (en) 2014-10-31 2019-07-02 Experian Information Solutions, Inc. System and architecture for electronic fraud detection
US10242019B1 (en) 2014-12-19 2019-03-26 Experian Information Solutions, Inc. User behavior segmentation using latent topic detection
US11151468B1 (en) 2015-07-02 2021-10-19 Experian Information Solutions, Inc. Behavior analysis using distributed representations of event data
US9767309B1 (en) 2015-11-23 2017-09-19 Experian Information Solutions, Inc. Access control system for implementing access restrictions of regulated database records while identifying and providing indicators of regulated database records matching validation criteria
US20180060954A1 (en) 2016-08-24 2018-03-01 Experian Information Solutions, Inc. Sensors and system for detection of device movement and authentication of device user based on messaging service data from service provider
US10699028B1 (en) 2017-09-28 2020-06-30 Csidentity Corporation Identity security architecture systems and methods
US10896472B1 (en) 2017-11-14 2021-01-19 Csidentity Corporation Security and identity verification system and architecture
US11682041B1 (en) 2020-01-13 2023-06-20 Experian Marketing Solutions, Llc Systems and methods of a tracking analytics platform
CN112070593B (en) * 2020-09-29 2023-09-05 中国银行股份有限公司 Data processing method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05346915A (en) * 1992-01-30 1993-12-27 Ricoh Co Ltd Learning machine and neural network, and device and method for data analysis
US5504675A (en) * 1994-12-22 1996-04-02 International Business Machines Corporation Method and apparatus for automatic selection and presentation of sales promotion programs
US5774868A (en) * 1994-12-23 1998-06-30 International Business And Machines Corporation Automatic sales promotion selection system and method

Also Published As

Publication number Publication date
AU3884000A (en) 2000-10-04
WO2000055790A3 (en) 2000-12-14
WO2000055790A2 (en) 2000-09-21
WO2000055790B1 (en) 2001-02-22

Similar Documents

Publication Publication Date Title
CA2403249A1 (en) Gradient criterion method for neural networks and application to targeted marketing
Heiat Comparison of artificial neural network and regression models for estimating software development effort
US7080052B2 (en) Method and system for sample data selection to test and train predictive algorithms of customer behavior
Cho Tourism forecasting and its relationship with leading economic indicators
US6640215B1 (en) Integral criterion for model training and method of application to targeted marketing optimization
US8005833B2 (en) User profile classification by web usage analysis
Lin et al. Forecasting from non‐linear models in practice
US20040236734A1 (en) Rating system and method for identifying desirable customers
US20020188507A1 (en) Method and system for predicting customer behavior based on data network geography
US20060155596A1 (en) Revenue forecasting and sales force management using statistical analysis
Cook Savings rates and income distribution: further evidence from LDCs
EP1110159A1 (en) Enhancing utility and diversifying model risk in a portfolio optimization framework
JP2003523578A (en) System and method for determining the validity of an interaction on a network
CA2590438A1 (en) System and method for predictive product requirements analysis
Antel Costly employment contract renegotiation and the labor mobility of young men
AU2001261702A1 (en) Revenue forecasting and sales force management using statistical analysis
KR100751966B1 (en) rating system and method for identifying desirable customers
AU7351300A (en) Method for modeling market response rates
CN113408908A (en) Multi-dimensional credit evaluation model construction method based on performance ability and behaviors
Dańko Analysis of reporting behavior using the hopit R-package (v0. 11.5)
Langlet et al. An Application of the Bootstrap Variance Estimation Method to the Canadian Participation and Activity Limitation Survey,”
CN116204697A (en) Content determination method, content determination device, computer readable storage medium and computer device
Sarkar Estimating diffusion models using repeated cross-sections: Quantifying the digital divide
Hasker et al. An analysis of strategic behavior in ebay auctions
Jain A guideline to statistical approaches in computer performance evaluation studies

Legal Events

Date Code Title Description
FZDE Dead