WO2020036590A1

WO2020036590A1 - Evaluation and development of decision-making models

Info

Publication number: WO2020036590A1
Application number: PCT/US2018/046669
Authority: WO
Inventors: Sean J. COLLINS; IV Charles F. BAKER; Jessica Michelle ABRAHAMS; John A. CALABRESE; Katlyn I. FLYNN; Martin R. FRENZEL; Vladimir GABER; Anthony D. ICUSPIT; Tsehsin Jason Liu; Madison Elizabeth Packer; Kristen P. PARTON; Josée POIRIER; Bridget I. SALNA; Marco Schmidt
Original assignee: Connect Financial LLC
Priority date: 2018-08-14
Filing date: 2018-08-14
Publication date: 2020-02-20

Abstract

Among other things, a machine-based method improves execution by a processor in a production environment of a process that uses input data to generate decisions about actions to be applied to a real-world system to improve a state of the real-world system. The decisions conform to data evidencing decisions that improved the state of the real-world system based on known input data. The real-world system operates according to open-ended numbers of and types of variables and is dependent on subjective knowledge about decisions about actions to be applied. Input data for at least some of the variables and subjective knowledge about decisions to be applied is unavailable from time to time. The data evidencing decisions that improved the state of the real-world system is incomplete. The machine-based method includes the following activities. The processor executes a first executable model using available input data to generate decisions on actions applicable to the real-world system. The processor executes a second executable model using the available input data to generate decisions on actions applicable to the real-world system. The processor executes a comparative evaluation process to generate comparative performance data indicative of a performance of the first executable model compared to a performance of the second executable model, based on the available input data and on the decisions generated by execution of the respective executable models. At least one human expert guides generation of a third executable model based on (a) the available input data, (b) the decisions generated by execution of the first and second executable models, (c) the comparative performance data, and (d) the subjective knowledge of the at least one human expert. The activities are repeated for additional iterations. At least one human expert selects one of the models for use in a production environment. The selected one of the models is executed in the production environment.

Description

Evaluation and Development of Decision-Making Models

Background

This description relates to evaluation and development of decision-making models.

In some applications, decision-making models are used to propose decisions aimed at improving a state of a real-world system (the modeled system).

An example of such a real-world system is the system of interpretation of English language sentences; a model of such a system can be developed to interpret the meaning(s) of an arbitrary English sentence to understand an intention expressed in the sentence. If the sentence were a Wall Street Journal headline (June 21, 2014) "Republicans Grill IRS Chief Over Lost Emails", two interpretations are possible:“Republicans harshly question the chief about the emails” or

“Republicans cook the chief using email as the fuel.” Developing a technical solution in the form of a model that can correctly interpret the sentence is a difficult task because there are a potentially unlimited number of variables to consider given the possible different contexts for the interpretation (e.g., politics, culture, business, conventions); the interpretation is subjective because different people may have different perspectives about the correct interpretation; the evidence supporting alternative interpretations may be incomplete; related data needed to develop the model may be unavailable; and human knowledge about the correct interpretation may be unavailable.

Similar challenges may exist in developing models for other systems, such as a model to evaluate a personal financial system (for advising actions to take to improve a state of the system). When the real-world system is a personal financial system of a consumer, the state can be the current or future financial well-being of the consumer, the decisions can be about current or future financial actions, and the decisions can be used to produce advice (guidance) designed to encourage or cause the consumer to engage in the actions for the purpose of improving the state.

A decision-making model is typically implemented as instructions executable by a processor.

The instructions process input data and generate outputs (results) as decisions about financial actions. Decisions generated as outputs by a model can, in the personal financial context, include whether, how, when, and where to generate income, save money, and spend it. To generate outputs based on input data, the executable instructions of decision-making models can implement a wide variety of algorithms and modeling strategies and approaches.

Decision-making models are typically generated by knowledge experts. The generation of a decision-making model can take advantage of historical data (sometimes large volumes of historical data) that represent real-world inputs to a real-world system being modeled and real- world outputs (such as actual decisions or related advice) produced by the real-world system based on the real-world inputs. When the model is put into production, the sets of real-world input data applied to the model can differ from any actual set of inputs found in the historical data.

The decisions that are the outputs of a model (or the related advice) can be used to cause an action (either automatically or through conduct of the human) with respect to the modeled system, alter the characteristics, nature, or operation of the system, or provide alerts, notices, or information to the system or its users, among other things. For example, a consumer could be advised that the lowest cost choice for cellphone service is Plan A from Wireless Supplier X, which may cause the consumer to switch from Plan B of Supplier Y, or, if inflation rates had increased making a projected level of retirement savings unlikely to achieve, the consumer could be encouraged to increase her savings.

Some decision-making models are more effective than others in producing useful outputs in response to real-world input data. The generation of effective models can be difficult, time- consuming, and sometimes unsuccessful. The effectiveness of a given model and its

effectiveness relative to other models may change over time as new data and new sources of data become available, insights are developed from use of the models in a production environment, the financial environment changes (for example, the body responsible for monetary policy raises interest rates causing a rise in credit card APRs), or a new class of consumers is identified to which the model does not apply effectively.

To improve the effectiveness of models, model developers use them, put them into production, test them, evaluate them, and revise them manually. The revisions can be based not only on a better understanding of why the model’s internal features or structure degrade the quality of its outputs, but also on the availability of information represented by new data, for example, data about consumers in general or about a particular consumer accumulated as time passes. Models to be used for a particular application can vary in their robustness, coverage, limitations, biases, policies, and performance boundaries (e.g., false positive rates). Parties who generate or acquire a model for use in an application typically test and evaluate it in advance (or have a third party do so) to determine that it will operate effectively in the application for which it is being created or acquired. For example, a financial institution that plans to operate a model to generate investment or savings advice for consumers who have accounts at the institution will insist on proof of the usefulness of the model.

Among the types of models that may be used to simulate the behavior of a system, such as a human-oriented personal financial system, one category relies on statistical analysis or processing of large volumes of historical input data and outputs (e.g., decisions) as predictors of what outputs a system will produce when presented with a set of current inputs. Another category of model is based on rules that express the model designer’s understanding of how the real-world system behaves.

United States patent 9,704,107 (incorporated here by reference in its entirety) describes a technology that combines statistical and rules-based features for modeling purposes to overcome some of the shortcomings of each technique if it were used alone.

Input data received from a variety of sources and expressed in inconsistent formats may be useful in the execution of a model. As described in United States patent application 15/593,870 (incorporated here by reference in its entirety), to assure that the model can process such data effectively, an ontology of predefined concepts can be maintained and applied to the input data.

Summary

In general, in an aspect, a machine-based method improves execution by a processor in a production environment of a process that uses input data to generate decisions about actions to be applied to a real-world system to improve a state of the real-world system. The decisions conform to data evidencing decisions that improved the state of the real-world system based on known input data. The real-world system operates according to open-ended numbers of and types of variables and is dependent on subjective knowledge about decisions about actions to be applied. Input data for at least some of the variables and subjective knowledge about decisions to be applied is unavailable from time to time. The data evidencing decisions that improved the state of the real-world system is incomplete. The machine-based method includes the following activities. The processor executes a first executable model using available input data to generate decisions on actions applicable to the real-world system. The processor executes a second executable model using the available input data to generate decisions on actions applicable to the real-world system. The processor executes a comparative evaluation process to generate comparative performance data indicative of a performance of the first executable model compared to a performance of the second executable model, based on the available input data and on the decisions generated by execution of the respective executable models. At least one human expert guides generation of a third executable model based on (a) the available input data, (b) the decisions generated by execution of the first and second executable models, (c) the comparative performance data, and (d) the subjective knowledge of the at least one human expert. The activities are repeated for additional iterations. At least one human expert selects one of the models for use in a production environment. The selected one of the models is executed in the production environment.

Implementations may include one or a combination of two or more of the following features. The executable models are different executable models. The executable models are two or more different versions of a given executable model. The processor executes the first executable model, the second executable model, or the third executable model using real-time input data in a production mode. The processor executes the first executable model, the second executable model, or the third executable model using historical input data in a non-production mode. The processor executes the first executable model, the second executable model, or the third executable model based at least partly on inputs from a knowledge expert. The processor executes a comparative evaluation process using production data. The processor executes a comparative evaluation process by generating a report of an evaluation of each of the models.

The processor executes a comparative evaluation process by executing at least one of the first executable model, the second executable model, and the third executable model in real time in a production environment using real-time data and executing at least one of the first executable model, the second executable model, or the third executable model later in a non-production environment using the same real-time data processor executes a comparative evaluation process by executing at least two of the first executable model, the second executable model, and the third executable model in a non-production environment. The processor executes a comparative evaluation process using predictions of outcomes based on decision outputs of the first executable model, the second executable model, or the third executable model. The processor executes a comparative evaluation process using predictions of outcomes based on decision outputs of the first executable model, the second executable model, or the third executable model. The processor receives input data expressed in accordance with concepts specified in ontologies associated with the first executable model, the second executable model, or the third executable model. The processor executes the first executable model, the second executable model, or the third executable model to generate decisions expressed in accordance with concepts specified in ontologies associated with the first executable model, the second executable model, or the third executable model. The processor presents a user interface to the human expert to enable the human expert to guide generation of the third executable model. The user interface enables the human expert to identify profiles of input data.

In general, in an aspect, a knowledge expert is enabled to develop a candidate model of a real- world system by an iterative process that includes, in each iteration of the development process, the following activities. The knowledge expert interactively generates a version of the candidate model. The version of the candidate model is run automatically to generate corresponding model outputs based on one or more known profiles of input data. The known profiles represent possible influences on the real-world system for which particular real-world outputs are expected. Information about the version model, the data profiles, the expected real-world outputs, and the model outputs is presented automatically to the knowledge expert. A subject matter expert can analyze performance of the developed candidate model by an iterative process that includes, in each iteration of the approval process: automatically running the developed candidate model using one or more of the known profiles of input data or one or more real-world profiles of input data to generate corresponding model outputs, automatically evaluating performance of the developed candidate model based on the outputs, and presenting the results of the performance evaluation to a subject matter expert. The subject matter expert can cause further development of the developed candidate model or can approve the developed candidate model for use in a production environment.

Implementations may include one or a combination of two or more of the following features. The input data is expressed in accordance with one or more concepts specified in one or more ontologies associated with the candidate model. The model outputs are expressed in accordance with one or more concepts specified in one or more ontologies associated with the candidate model. The presenting of the results of the performance evaluation of the subject matter expert includes generating a report. The instructions are executable by the one or more processors to present a user interface to the knowledge expert to enable the knowledge expert to generate the versions of the candidate model and to specify the input data. The instructions are executable by the one or more processors to present a user interface to the subject matter expert to enable the subject matter expert to review the results of the performance evaluation. The candidate model includes a machine learning model. The candidate model includes a rule-based model. The candidate model is run in a machine-based development mode. The candidate model is run in a machine-based production mode. The user interface enables a user to identify profiles of input data. The user interface enables a user to identify input data that corresponds to known results. The user interface enables a user to prototype a model using model evaluation, report generation, and interaction with inputs and outputs of the model.

In general, and an aspect, an iterative development process for one or more models of real-world systems includes, each iteration, the following activities. Current versions of two or more competing models of real-world systems are run to generate corresponding model outputs for each of the competing models based on the same one or more profiles of input data. The profiles of input data represent possible influences on the real-world system for which particular outputs are expected. The relative performances of the two or more competing models are evaluated. Information is provided to a human expert about the relative performances. Revised current versions are received of one or more of the competing models developed by the human expert.

Implementations may include one or a combination of two or more of the following features. The two or more competing models of real-world systems include an original model and a revised version of the original model. The two or more competing models of real-world systems include models that were independently developed. The running of the current versions includes running the current versions in an off-line mode. The running of the current versions includes running at least one of the current versions in a production mode. The evaluating relative performances of the two or more competing models includes evaluating precision or accuracy or both. The running of the current versions includes running the current versions automatically. The evaluating of the relative performances of the models includes evaluating the relative performances automatically. .The evaluating of the relative performances of the models includes a confusion matrix. The revised current versions of the model include versions applicable at a given time to two or more contexts. The two or more contexts are associated with respective different groups of end users.

In general, in an aspect, an iterative process is performed for development of a model of a real- world system. Each iteration of the process includes the following activities: a knowledge expert can interactively develop a current version of the model based on subjective knowledge of cause and effect relationships. The current version of the model is automatically evaluated by running the version of the model using profiles of input data to generate test outputs. A subject matter expert evaluates the test outputs using expert knowledge. The knowledge expert provides a revised current version of the model.

Implementations may include one or a combination of two or more of the following features. Evaluating the current version of the model includes using profiles of input data for which expected outputs are known. Evaluating the current version of the model includes evaluating the performance of the model in generating expected test outputs. Evaluating the current version of the model includes evaluating precision or accuracy among other metrics. Enabling the subject matter expert to evaluate the test outputs includes providing a report of the outputs to the subject matter expert. The report includes a confusion matrix.

These and other aspects, features, and implementations (a) can be expressed as methods, apparatus, systems, components, program products, methods of doing business, means or steps for performing a function, and in other ways, and (b) will become apparent from the following description, including the claims.

Description

Figures 1, 2, 4 through 8, 12, and 13 are block diagrams.

Figures 3 and 9 are schematic diagrams.

Figures 10 and 11 are presentation diagrams.

Figure 14 is a matrix diagram.

Figure 15 is a results diagram. Here we describe a technology for developing and evaluating decision-making models that are to be used, for example, in production environments. The decision-making models can relate to a broad range of systems and can be particularly useful for systems that involve human behavior, such as personal financial systems of consumers.

When a processor executes a decision-making model in a production environment, the execution may be slow, use a large amount of processing resources, or produce outputs that are suboptimal. An aspect of the technology that we describe here (which we sometimes call simply“the technology”) improves execution by a processor in a production environment of a process that uses input data to generate decisions on actions to be applied to a personal financial system to improve a state of the personal financial system.

Among other things, the technology can make it possible to generate, evaluate, and improve models for personal financial systems in order to produce and maintain optimal models for various aspects of the personal financial system even though such a system is open-ended (as explained below) and therefore difficult to model. Automated aspects of model development and evaluation are combined with manual involvement of human experts in an iterative process that includes comparative evaluation of competitive models. This combination of activities can drive a faster, cheaper, and more effective model development process. In this approach, human experts can have the role of determining which model processes will work best at generating good decisions, inferring the likelihood of consumers acting on those decisions (in some cases, with the aid of machine-generated predictions about such behavior), and inferring how changes in the environment in which the system operates will affect the outcome.

The input data for such a model may include information about the consumer’s personal financial system, the economic environment in which the personal financial system operates, available financial products that can play a role in the consumer’s personal financial system, and a consumers' preferences and financial goals. The input data can represent historical, current, or predicted future information about an internal state of the modeled system or the external state of the environment in which the modeled system operates. In the case of a personal financial system, the internal state can represent factors such as the amounts of money in bank accounts, the balances on credit card accounts, or the age of the consumer, to name a few. The external state can represent such factors such as the condition of the economy and the availability of financial products relevant to the consumer’s personal financial system.

In some cases, the decisions generated by the model can be expressed as advice to the consumers. The advice can include reports on the current state or expected future state of the personal financial system or coaching, guidance, or instructions aimed at encouraging the consumers to take financial actions that will improve the future state.

Real-world systems that involve human behavior can be difficult to model because human behavior is often unpredictable and may change over time as an individual passes through developmental stages, as the environment within which the individual behaves changes, and as the society shifts its behavioral norms. For example, a decision-making model can be designed to provide advice to assure a person’s retirement savings will be enough to cover her expenses until she dies if her financial actions (income, savings, and spending, for example) remain consistent over time and conform to typical human behavior. But she may decide unpredictably to begin spending money on annual vacations that are so large as to undercut the prediction of solvency until death. Or advice aimed at achieving a certain future financial state (say, ending the month with at least a targeted incremental amount of savings) may be effective if the consumer buys coffee each morning consistently with his past conduct, but may fail if anxiety about his job causes him to spend an unexpectedly large money on coffee in a particular month.

The processes represented by the executable instructions of a decision-making model for a personal financial system can, in effect, analyze the state of the personal financial system,

“diagnose” financial“problems” of that state, and propose possible“remedies” (eg., decisions) for the problems identified by the diagnoses. During development of a model, experts participate in specifying the processes by which the model will“diagnose” and then propose“treatments” for a personal financial system. During evaluation of a model, the expert applies her knowledge and experience in determining, for example, whether the future state of the personal financial system is going to be optimized if the user acts on decisions produced by the model. Models that predict the future states of the personal financial system and the future states of the economic environment may be provided to assist the human expert or to enable automated evaluation and development of models. A principle of operation of the technology is that, if the consumer performs the actions contemplated by the remedies, the future state of her personal financial system (the outcome) will be improved.

Implementing the technology by evaluating and developing useful models of personal financial systems is hard for at least the following reasons.

1. Personal financial systems, unlike many systems that are the subjects of modeling, have open- ended characteristics that make it hard to develop effective models in the usual way, for example:

a. Unlike data for many systems, data and features of data that characterize transactions, events, goals, results, influences, inputs, outputs, and states of personal financial systems can involve an essentially unlimited number and kinds of variables and parameters. The input data may not be represented by a tractable number of specifically identifiable variables, but rather may be represented by arbitrary variables or types of variables and by a large (essentially unlimited) number of variables. For example, input data in the form of personal financial information may be distributed among different financial institutions and represented by a large number of sometimes inconsistent variables, economic and market information can be expressed in input data in a wide variety of ways, and information about available financial products may be essentially unlimited.

b. Choices about the best next actions for the consumer to take are made by knowledge experts in selecting input data to be applied to models, the relationships among the input data in generating decisions, and the processes for generating the decisions based on the input data. These choices are based on their unique experiences and knowledge and are therefore subjective and depend on the particular expert who is doing the job.

c. Input data (used, for example, as evidence on which to base the diagnoses) is often incomplete, ambiguous, or uncertain for reasons similar to those noted in a. above. As a result, the availability of input data for use in generating or evaluating or using the models in production environments may be limited.

d. Input data (evidence) needed for model evaluation and generation may be incomplete because its availability may not be certain or the information may not be trustworthy, for example, as a result of flaws in data collection related to consumer populations, geography, and other factors. In other words, the available data (which provides evidence of the state of the system and inputs and outputs of the system) may be sparse or otherwise incomplete, inconclusive (consistent with more than one possible hypothesis or explanation), ambiguous (not determinative of a particular conclusion), inconsistent (favoring different hypotheses), or not credible, or combinations of them.

e. Useful expert knowledge may not be available or may entail inconsistencies among financial advising experts, for example, due to limitations of individual experts' experiences, observations, market predictions, and personal preferences.

2. Such a model can produce alternative proposed decisions based on the input data and can produce decisions that need to be acted on in a sequence to work properly. The decisions therefore must be processed to select the best decisions and define the proper order for acting on them in light of the consumer’s defined financial goals.

3. In any case, the decisions produced by such a model can be effective in improving a future state of the personal financial system (the outcome) only to the extent that the consumer acts on them in an expected way. But human behavior is hard to predict. Without good predictions about that behavior (e.g., the probabilities of various actions being taken by consumers in response to information presented about proposed decisions) it is hard to be confident that decisions produced by a model will actually produce the intended improvement (outcome). In the technology, predictions about the actions of consumers can be based on concurrent executions of machine learning models using historical information.

4. There is no hard-and-fast rule about what is a good outcome in a personal financial system. Good decisions may depend on a consumer’s personal preferences. (A consumer may prefer not to have credit card debt even though doing so might enhance his financial outcome.) And good outcomes must be measured against a consumer’s personal financial goals. (A consumer may accept a lower amount of retirement savings than might be possible to achieve.) Goals and outcomes of personal financial systems can be expressed in subjective terms (“I want to be comfortable that my spending habits are consistent with my savings needs.”). The consumer may be able to define such preferences and goals, but a consumer may be mistaken about her own intentions, and her preferences and goals may change. 5. Whatever the measure of good outcomes, outcomes also depend on unknown future states of the environment in which the personal financial system operates such as the economic environment or the availability of products that can serve to achieve good outcomes. For a model to be able to produce the best decisions, the model must have access to predictions about the future states of the environment.

6. To be able to evaluate a model in terms of its ability to cause improved outcomes, one must be able to predict the future outcome of a personal financial system and then determine the quality of that outcome based on the personal financial goals of the consumer.

7. There may be no objective, verifiable“ground truth” for evaluating a model. The subjectivity can relate to the inability to evaluate, by objective measures, the correctness of the decisions and corresponding advice produced by a model. For example, an objective measurement for financial advising for a 20-year-long retirement plan would be based on the financial results across the 20 years. Yet the development of the applicable model must be done currently rather than waiting for 20 years to assemble the necessary profile data set. Therefore, the development process takes advantage of the experiences of the knowledge expert and the subject matter expert to perform model evaluation.

Typical non-machine-based (entirely human) financial advising is based on individual experiences, preferences, and analysis of human experts. As a result, biases and hidden interests (e.g., commissions paid to the advisor) can prevent a consumer from being presented with an accurate and balanced picture (and therefore from understanding) the state of her personal financial system and the best strategies to manage debt or investment, for example.

The technology can effectively emulate an understanding of the actual state of a consumer’s personal financial system and mimic the delivery of unbiased financial advice by a well-trained expert applying proper data analysis. To be able to serve a very large number of consumers at a reasonable price, machine-based financial advice technology typically needs to be implemented using artificial intelligence (e.g., in the form of models) to mimic human experts. Even when good financial advice is provided by such technology to customers, it must be accompanied by good explanations of the reasons for the advice in order to motivate typical consumers to act on the guidance. In addition, the technology enables high-level financial strategic advice to be revised as input data about a population of consumers changes (for example, changes in norms) and as input data about a given consumer change (for example through lifetime developmental phases). The technology enables the generation of models that are adaptable to dynamically changing input data. The technology accommodates the natural characteristics of good human financial advising, including its subjectivity and personalization and the potential existence of multiple versions of the state of the personal financial system and advice about it. Operation of the technology is based on combinations of analytical approaches and statistical approaches.

When a computer processes a model subject to these difficulties, the processing can be inefficient (in terms of storage and processing capabilities used) and ineffective (in terms of producing suboptimal decisions as outputs). The technology described here improves the machine processing by enabling optimization of models that can be executed by the processor more efficiently and effectively. The optimization of such models addresses the concerns identified above by, among other things, applying expert knowledge to cope with the

intractability of the number and types of variables of input data, unavailability of input data, the incompleteness of evidence, and the lack of objective metrics for evaluating models, reducing the influence of subjectivity in the generation of the models, providing predictions of human behavior responsive to decisions and advice of the model, acquiring information about consumers’ preferences and goals, and predicting future states of a consumer’s personal financial system.

The technology described here can be used for evaluating and developing decision-making models and implementing them in production environments. The technology is useful in a broad range of applications and particularly, for example, for model generation, evaluation, and production use in contexts that are open-ended in one or more respects including the ones discussed above.

In addition to running models using profile data sets in a development or evaluation mode, the developed models can also be run in a production mode, or in a mode that combines

development or evaluation with production.

In a development or evaluation mode, the decisions generated by a model need not be used to provide advice in real-time interactions with consumers. In the development or evaluation mode, the profile data sets may include historical or artificially generated input data or real-time input data or a combination of them.

In a production mode, the decisions and advice generated by a model are typically used in real- time interactions with consumers through an interactive user interface. The profile data sets used in a production mode typically include real-time input data.

In some applications, the technology is applied to historical data generated during operation of models in a production environment. The comparison of the models can include quantitative and qualitative analysis. In such applications, the process can be at least partially automated and may include human involvement for the analysis. The models being compared in evaluations performed by the technology can be the same model applied to different profile data sets at different successive times or different versions of a model or different models applied to the same data sets or different data sets at a given time or at successive times. The outputs produced by a model in the production mode can be compared to the outputs produced by a different version of the model or a different model.

The technology provides an evaluation process that semi -automatically exposes to human experts outcomes of different models or different versions of models in different modes (current production versus current non-production, current non-production versus current non-production and current non-production versus future production or non-production). This enables human experts to impart their human insights in the context of the environments (and market situations) to help identify the relative effectiveness of models notwithstanding the previously mentioned constraints of arbitrary variables, subjective judgement, incomplete evidence, and constrained availability of data and knowledge. With the help of such evaluation features, the technology can continuously improve models by considering them against challenger models generated using different technologies or by other human experts.

The technology applies not only to the evaluation and validation of one or more models, but also to a general development process for decision-making models for systems characterized by the constraints described above. The general development process can enable an iterative improvement of models leading to optimal models for such systems, including personal financial systems. Typical decision-making model development in the context of advice about personal financial systems uses single, binary variables to measure the performance of models, such as default rates or profitability of mortgages with respect to underwriting models. The technology provides a much more nuanced approach to model evaluation, but starting with modeling strategies (e.g., algorithms and modeling approaches) that are based partly on human knowledge and not solely on data values. The strategies include the use of expert systems, rules engines, and Bayesian probabilities, for example.

Complicating the process of model development are the many possible styles of available modeling including, among others, deep learning, ensemble, neural networks, regularization, rules-based, regression, Bayesian, decision tree, dimensionality reduction, instance based, and clustering, belief functions, Baconian, and fuzzy logic. Each of them has advantages and disadvantages with respect to the characteristics of typically available financial data, namely that the data is incomplete, inconclusive, ambiguous, inconsistent, or not believable. One role of the knowledge expert is to select and apply appropriate modeling techniques in the course of developing, evaluating, and putting into production the models that result from using the technology.

By comparing two or more models one of which may be considered a champion and the other of which may be considered a challenger, the comparison can rely on a relative truth framework and include a human in the evaluation and validation process to identify weaknesses and to incrementally improve a model from one version to the next version.

Over time, as issues associated with, for example, the availability of data become less prominent, the model generation technology can reduce its reliance on the subjective judgments of subject matter experts to decide whether a decision or the related advice will produce a favorable change in the state of the user’s personal financial system. Determining whether a change in state is favorable can be made using empirical evidence. For example, the technology could recommend a 0% refinancing of credit card debt in July to a particular consumer based on her profile, because the applicable model determines that she will be able to pay down the debt within the introductory period of the credit card and therefore that doing so is a good way to service her debt. At one point in time, say January, 2018, a combination of model evaluation and validation and the input of subject matter experts may have determined that such a rule would perform well. Nevertheless, two years later and with the benefit of hindsight using data from the same consumer (who never paid off the balance during the introductory period at 0%) and from a large number of other consumers having similar profiles, it may be determined that, in general, people having profiles who have never had positive cash flow and have more than three revolving credit cards almost never paid down their debt during the introductory period of a 0% credit card. A new version of the credit card strategy model can then be generated which never recommends a 0% credit card refinancing to people with that profile.

In other words, the comparative evaluation of two or more versions of a model or two or more models is part of a more inclusive process of generating and developing models. In the more inclusive process, models can be generated, revised, tested, compared, evaluated, updated, and changed in an iterative process over time based on profile data sets that change over time, and in the context of a model environment that also may change over time. As a result, over a period of time, models used in production environments can be continually updated so that the most effective models are always currently being applied in providing decisions and advice to end- users (e.g., consumers). Over a long period of time, based on the accumulation of more and more profile data sets in the iterative development of decision-making models, it may be possible to develop and put into production and update as necessary models that are close to optimally effective at any given time and from time to time.

In some implementations of the technology, a combination of automated processes and human involvement of knowledge experts and subject matter experts to incorporate human-based knowledge that is used to perform the evaluation and development of decision-making models.

The incorporation of human-based knowledge in the process is useful in dealing with the uncertainty of outcomes in the operation of the models in production environments. Knowledge experts and subject matter experts can address such uncertainty in the model development process.

In theory, it may be possible to perform the iterative development of decision-making models using comparative evaluation fully automatically based on automatic accumulation of profile data sets, automatic generation and updating of models based on automated evaluation techniques, and automatic application of the updated models in production environments. Some features of the technology may be useful in creating platforms for model development that incorporate predictions of future states of personal financial systems based on a current state and the actions that the consumer is predicted to take based on the generated decisions. With such additional capabilities, it may be possible to generate and evaluate models and drive them towards optimal models automatically and without human involvement.

As noted earlier, the technology described here improves the machine processing of models of personal financial systems by enabling optimization of models that can be executed by the processor more efficiently and effectively. The optimization of such models addresses the concerns identified above by, among other things, applying expert knowledge to cope with the intractability of the number and types of variables of input data, unavailability of input data, the incompleteness of evidence, and the lack of objective metrics for evaluating models, reducing the influence of subjectivity in the generation of the models, providing predictions of human behavior responsive to decisions and advice of the model, acquiring information about consumers’ preferences and goals, and predicting future states of a consumer’s personal financial system.

The technology is broadly applicable to automated and partially automated generation, evaluation, comparison, revision, and implementation (among other things) of a wide variety of machine-based decision-making models of real-world systems.

For purposes of discussion, we describe examples of development, evaluation, and productive use of machine-based models in financial analysis, financial management, and financial advising, particularly in the context of personal financial systems. The principles associated with these examples also apply to a broad range of other contexts and applications. These principles are especially useful in (but are not limited to) machine-based modeling of real world systems that are unpredictable or difficult to characterize or that develop and change over time, such as systems governed or affected by human behavior, for example, personal financial systems.

In the technology that we describe here, a rigorous, defined structure (discussed below) can be applied to implement a useful and effective model development process.

The structure of the model development process enables human participants to consider the performance of a model, which may appear to be effective, within a broader context to assure that the model outputs are useful and integrated well into a comprehensive system for providing decisions and advice to consumers.

In some implementations of the technology, the evaluated and developed models can be put into a production environment as part of a comprehensive financial advice platform (such as the platform offered by Cinch Financial of Boston, Mass.) for providing, for example, proposed decisions and advice about personal financial systems and executing actions on behalf of consumers.

Decision-making modeling is an important tool in a machine-based financial advice technology. Because the quality of decision-making models will directly affect the quality of financial advice generated by machine-based financial advice technology, the creation, development, updating, measurement, evaluation, testing, and comparison of historical, current, and future models are important activities. Typically these tasks are performed in a one-off manner by individual model builders and experts and are performed in ad hoc sequences and iterations. In some

implementations of the technology, these tasks can be performed in various contexts: the context of a production system in which real-time financial advice is provided to customers based on real-time data inputs, the context of a development system in which the models are not being used to provide real-time financial advice to consumers, or the context that involves a

combination of the two.

In some cases, multiple models may operate in parallel within such a comprehensive platform. For example, in a comprehensive platform serving consumers with respect to their personal financial systems, the models could include a credit model, a savings model, an investment model, an insurance model, and a spending model, among others. Each model can be

characterized by one or more algorithms or modeling approaches that use portions of input profile data sets to generate corresponding decisions and advice. One aspect of the knowledge expert’s job is to specify and choose features of the model for a given aspect of personal financial systems.

Information about machine-based financial advice technology for providing financial advice to consumers can be found in the following United States patent applications and patents, all of which are incorporated here by reference in their entireties: patent application serial numbers 14/304,633, 14/989,935, 15/231,266, 15/627,938, 15/593,870, 15/681,462, 15/681,482, 16/008,954, 15/886,524, 15/844,034, 15/886,670, and 15/886,535; and patents 9,704,107 and 10,013,538.

As an example of the complexity of decisions and advice to be generated by a model, consider advice that might be given to a consumer with respect to his credit card debt. To know what to recommend, a credit card model could consider how much a consumer can pay against a credit card balance, which is a reflection of income, consumption level, and liquidity (reserves). The output of the model could also depend on the user’s credit profile (inferring whether the user would get approved for more credit and at what price) and behavioral profile (whether the user might benefit from not having access to revolving credit). So to know if recommending a 0% balance transfer is the right advice for the consumer, the model would need to consider not only financing cost (how much the consumer paid in interest), but whether the consumer is able to pay down the debt, stay out of debt afterwards, and maintain a healthy liquidity level and a positive cash flow so that the consumer is no longer consuming more than he can afford. A conventional answer to the evaluation question for such a model: Was that good advice? is simply "Yes, because he saved on interest paid.” That simple answer fails to account for the complexity of the situation.

In some implementations of the model development and evaluation technology, evaluations of models are done by comparison of the relative merits of the decisions (e.g., outputs) produced respectively by two or more models or two or more versions of a model based on a common set of input data.

In some cases, the input data used for comparative evaluation represents a large number of profile data sets. When the modeled system is a personal financial system, each profile data set can include input data associated with a particular consumer at a particular time, including data specific to the consumer (a current bank balance, for example) at the particular time and data associated with an environment (such as the current inflation rate).

When the compared models are run for evaluation purposes, the outputs (decisions) may be, for example, the same for 80% of the input profile data sets and different for 20% of the input profile data sets. Yet, because of the nature of the system being modeled, there may be no absolute expected output. By analyzing the differences in outcomes and their frequencies and the profile data sets that produced them, it is possible to evaluate which of the models may be more effective and for which kinds of profile data sets. Then a choice can be made whether to substitute a more effective model for a less effective model in, for example, a production environment.

In some cases, the comparative evaluation is done by running models in a development or evaluation mode on what may be called black-and-white profile data sets, in which the output of the system being modeled in response to a given profile data set is known. The comparative evaluation is done outside of the context of a production mode and can be performed partially or entirely automatically.

In some instances, models or versions of models can be compared based on their expected performances relative to future outcomes predicted by simulators. For example, if a consumer takes current actions based on advice from respective models, the comparison can analyze which model will produce the more favorable future outcome as indicated by the simulation. The simulation and comparison can be performed partially or entirely automatically.

In some implementations, the process of comparative evaluation can include production mode testing to determine how an end user (e.g. a consumer) uses advice provided by models. This sort of comparative evaluation can be done at the level of profile data sets associated with individual end-users rather than statistically.

The technology uses a“relative truth” framework and an evaluation process that includes human participation. The generation of the decision-making models is based on comparison and evaluation of the performances of two or more of the models or two or more versions of a model. In some implementations, the goal of the comparison and evaluation is not to determine objectively whether a model or which model performs best against a perfect verifiable“ground truth” because the ground truth may not be known. In some examples, models are evaluated based on their coherence with expertise of one or more human subject matter experts. In this context, a departure from what a subject matter expert considers to be better with respect to a model can be viewed as a flaw or limitation of the model to be addressed in other versions of the model or in other models. In some cases, the choice of a model for deployment into a production environment is based on which model has fewer or less severe flaws or limitations in that sense.

The technology streamlines and improves the ability for humans to adjust (tune), evaluate, and validate models in a non-production environment (e.g., a development and evaluation environment or mode) to help solve input data issues and ensure soundness of the model, regardless of the approach taken by the model in producing outputs. In some cases, the human input can be applied in a non-production environment and is not applied to a real-time profile data set for a particular end-user or to evaluate the result of the model when operated in a production environment. The technology is configured to accommodate any number of arbitrary variables, the subjective nature of the decisions of knowledge experts, the flaws of applying only standard unit testing as a way to evaluate a model, constraints on the availability of data for testing (e.g., test profile data sets), and the need to incorporate into the model evaluation and development process the subjective knowledge of subject matter experts.

Turning to more detailed examples, as shown in figure 1, in some implementations of the machine-based financial advice technology, financial advice 10 can be provided in real-time to end-users (e.g., consumers) 12 from a machine-based production system 14 that executes one or more decision-making models 16 based on external input data (e.g., profile data sets) 18 obtained from the consumers and from a wide range of data sources 20. The external input data 18 is processed based on one or more ontologies 22 to derive internal data 24 that conforms to internal data concepts 26 and formats for expressing them, as defined by the ontologies. Additional information about the use of ontologies in machine-based financial advice technology can be found in United States patent application serial 15/593,870, incorporated here by reference in its entirety.

An example JSON format for input data is shown below.

"refinance": {

"apr": {

"name" : "user. credit_card. refinance. apr", "type": "percent",

"value": 0.16,

"estimated": true

} ,

"count": {

"name" :

user . credit_card . refinance. count", "type": "int",

"value": 2,

"estimated": false

},

"can_payoff " : {

"name" :

user . c red it_card . refinance. can_payoff ",

"type": "boolean",

"value": true,

"estimated": false

},

"balance_trend" : {

"name" :

user . c red it_card . refinance. ba la nce_t rend",

"type": "usd",

"value": -320.6666666667, "estimated": true

b

"ending_balance" : {

"name" :

user . c red it_card . refinance. ending_ba la nee" ,

"usd",

5128.28, ed": false yoff " : {

user. cnedit_cand . refinance. can_fast_payoff ",

"type": "boolean",

"value": true,

"estimated": false

An example of code to provide input data to a model follows:

void observe(String name, Object value) { log.debug('Applying evidence [{}] with value [{}] (type: {}) String. format('%-75 s', name), String. format('%-20s', value), value. class. name)

if (value in Boolean) {

network.observeBoolean(name, (Boolean) value)

} else if (value in Double) {

network.observeDouble(name, value)

} else if (value in BigDecimal) {

network.observeDouble(name, value. toDouble())

} else if (value in Integer) {

network.observeDouble(name, value. toDouble())

} else {

throw new OperationNotSupportedException('Evidence of type [' + value class + '] currently not supported.')

}

An example segment of code to extract and update personal transactional data is shown below.

MxtraTransaction updateTransaction(UncommonMessageMetaInfo meta,

MxtraTransactionsMessage.ExtendedTransaction updated,

MxtraTransactionsMessage.TransactionGroup group) {

MxtraTransaction transaction = getOrCreate(updated.id ?: updated.transactionld, meta.userld, updated. accountld) if (transactioaupdatedAt == null || transaction. updatedAt <= meta.createdAt) {

def overrides = [

id : trimPrefix(updated.id),

offsetld : trimPrefixupdated.offsetld),

bucket : group.bucket,

bucketSetBy : group.bucketSetBy,

bucketSetAt : group.bucketSetAt,

periodicity : group.periodicity,

periodicity SetBy : group.periodicitySetBy,

periodicity Set At : group.periodicitySetAt,

extendedCategory : updated. extendedCategory?. id,

updatedAt : mcta.crcatcdAt

1 def ignore = []

Util.setProperties(transaction, updated, overrides, ignore)

if (masker. MASK TRAN S ACTION C ATEGORIES . contains(transaction. category)) {

transactiomshortDescription = masker.mask(transaction.shortDescription)

tra nsac ti o n . lo ng Dc sc ri pti o n = masker.mask(transaction.longDescription)

}

transaction

}

The financial advice 10 can include reports on and assessments of the state of a consumer’s personal financial system, decisions such as allocation of financial resources to financial needs, advice related to the decisions, aspects of financial strategies, and recommendations about products or consumer actions aimed at implementing financial strategies to improve the state of the consumer’s personal financial system. Recommendations about products or actions or activities can be based on product information derived from external product information sources 29. Monitoring functions 28 of the machine-based production system 14 can determine whether financial advice has been adopted or implemented and whether the state of the consumer’s personal financial system has been improved over time, among other things.

As shown in figure 13, the machine-based production system 370 can be implemented as software and software modules 372 stored in memory 374 and executed by one or more processors as part of a computer or other machine 376. The machine can be located at the place where a consumer is receiving the financial advice, or at a server or servers, or a combination of them. The models 360, ontologies 380, internal data 382, and other information can be stored in one or more databases 384 that are associated with and located at the same place or places as the machine, or separated from and located at a different location from the machine. The machine includes communication devices 386 to enable it to exchange data and information with consumer devices 370, external data sources 46, external product information sources 392, and communicate with external databases 394. Typically the communication occurs through the Internet. The consumers interact with the machine-based production system through user interfaces of apps running on consumer devices such as smart phones, tablets, laptops, or workstations. The machine-based production system can serve a very large number of consumers by receiving requests for advice and data and information from the consumers and providing them with financial advice.

When the processors 356 in machine 376 process models, the processing can be inefficient (in terms of storage and processing capabilities used) and ineffective (in terms of producing suboptimal decisions as outputs) unless the model has been iteratively improved over time. The technology described here improves the machine processing by enabling the model generation process to develop better and better models that can be executed by the processor more efficiently and effectively.

As shown in figure 2, in addition to the machine-based production system that implements online execution of models at run time when financial advice is being generated and provided in real time to consumers in a production mode, the machine-based financial advice technology can include a machine-based off-line (non-production-mode) model execution platform 30 that can execute one or more models 32 during the course of model development or model evaluation or both, for example. The model execution platform is off-line in the sense that it can operate without interacting with consumers as would be done if the models were to be executed on the machine-based production system of figure 1. The models 32 can be the same as one or more of the models being executed by the machine-based production system or can be models being developed, tested or evaluated, compared in a competitive analysis, or previously used models, and combinations of them, among others. In some cases, the models can be supplied from the machine-based production system or can be provided to the machine-based production system, or both. The off-line model execution platform can include and present a user interface 34 to one or more users 36 (whom we sometimes call“development users”). The user interface provides features to enable expert model developers and other users to engage in one or more of the activities associated with model development.

The development users 36 are not operating in the role of consumers to whom financial advice is being provided in a production run-time sense. Rather the development users 36 typically are people or entities engaged in the creation, development, updating, testing, comparing, and observing of the models, and in combinations of those activities. The development users can be generating and developing models for use on production systems that they control or for use on production systems of other parties. The development users can be testing or evaluating or comparing models to produce evidence that may be required for their own purposes or by other parties demonstrating that the models are accurate and robust, among other things. One or more of the development users 36 may be domain experts (subject matter experts) in the field of financial advice or financial advice technology, for example, or may be knowledge experts in the field of model development (including model evaluation), or may be non-experts trained in the use of the off-line model execution platform.

Among other functions, the user interface 36 enables a user to locate, identify, develop, configure, execute, observe, compare, test, or analyze (or combinations of those) the operation and results of one or more of the models. We sometimes use the term“development” or“model development” broadly to include, for example, each of those activities, as well as data selection and management mentioned below, and other similar or related activities, and combinations of them. The user interface enables a development user to select data, data records, or sets of data or data records (e.g., profile data sets) 38 to be used in executing one or more of the models during model development or evaluation or both. The data or data records or sets of data can be selected by the user based on a variety of factors, such as the user’s knowledge of the system being modeled, the type of the model, other characteristics of the model, the character and volume of the data set, the likelihood that the data or data set will provide a good test of the model, and other factors, and combinations of these.

A report generator 40 can provide reports 42 about model development through the user interface based on information stored on the platform and the results of operations by the platform, including the evaluation of one or more of the models. Among the information provided in such a report can be information about the models, the state of development, input data and sets of input data, including input data or sets of input data used in the execution of one or more of the models, results of the execution, evaluations of the performance of one or more of the models, comparisons of the performance of one or more of the models with one or more other models, comparisons of the performance of one or more of the models at a first point in time with performance of one or more of the models at a second point in time, and a wide variety of other information useful to the development user in evaluating the accuracy, effectiveness, utility, and other characteristics of one or more of the models taken alone or in combinations. As shown in figure 13, the off-line model execution platform 350 can be implemented as software and software modules 352 stored in memory 354 and executed by one or more processors 356 as part of a computer or other machine 358. The machine can be located at the place where a user is engaging in model development or evaluation, or at a server or servers, or a combination of them. The models 360, data 362, sets of data, reports 364, and other information can be stored in one or more databases 366 that are associated with and located at the same place or places as the machine, or separated from and located at a different location from the machine. The machine includes communication devices 368 to enable it to exchange data and information with user devices 370 and external data sources 46, and communicate with external databases 48. Typically the communication occurs through the Internet. The users interact with the model execution platform through the user interface 34.

As shown in figure 3, characteristics of a model can be arrayed along two dimensions, one of which represents the extent to which the model is based on statistical analysis (e.g., machine learning) of data or on theory or knowledge (e.g., a rules-based approach). Along the other of the two dimensions the models can be characterized with respect to the purpose of the modeling on a spectrum from association and correlation to causation.

In some implementations, it is useful for the financial advice technology to execute models concurrently either by concurrent execution in real time or by execution of a primary model in real time and execution of secondary models off-line in a batch mode.

Therefore, as suggested by figures 1 and 2, in some cases, the financial advice technology performs at least two basic functions.

One function is to implement a machine-based production system for providing real-time financial advice (for example, recommendations for action by a consumer regarding her personal financial system) based on a model-driven financial strategy in such a way that the platform is adaptable and reusable for various financial realms and financial topics, for example, or for non- fmancial realms and topics.

A second function, related to the first function, is to establish and support an engineering process for developing and evaluating models (a model development and evaluation technology or more simply“the technology”) to be used in the machine-based production system. The engineering process (e.g., the model development process) provides a disciplined broadly applicable approach for model evaluation and development. The engineering process can satisfy several criteria. By enabling the analysis of a non-technical human system to be defined in technical terms, for example, as a classification task, the resulting model or models can be used to classify features of current data (e.g., input data or profile data sets) that characterize the human system (e.g., the personal financial system) to derive results that are appropriate to the input data. The results can include information, strategies, suggestions, or other advice, among other things. In addition to processing and classifying data features, the engineering process enables conversion of concepts and knowledge about a system being modeled into rules. On the statistical model (e.g., machine learning) side, the process enables feature engineering to identify useful data in the profile data sets. The engineering process enables effective selection of suitable machine learning algorithms and models. And it is possible to test and evaluate and compare models offline (using a model execution platform) to decide which models can or ought to be put into online production use.

Figure 4 provides a diagram (supplementing the diagram of figure 2) depicting an architecture of a model evaluation (and model development) platform 50 that is part of the technology. The blocks in figure 4 illustrate software components, inputs, and outputs, of the model development platform. The cylinders in figure 4 illustrate databases or other bodies of data.

In general, the model evaluation platform supports a partly human, partly semi-automated, and potentially fully automated process for evaluating a model including the following steps:

Examples of input data and data records to be used in the evaluation can be selected by human model evaluators or in some cases semi-automatically

For example, a human expert may enter criteria for input data and data records through an unstructured interface, or in a formal query, and a semi-automated process can use the entered information as the basis for identifying data and records that meet the criteria or the query partly on the basis of inference.

For example, a human expert might enter, as a criterion“debt of greater than $10,000”. The process then might infer semi-automatically that the criterion is referring to aggregate debt of credit card accounts or other revolving accounts. In another example, the following sequence of actions might be taken: (a) at a particular time when the technology is to be used to execute a rule model A or to perform a statistically-based process such as a Bayesian model B, the evaluators could select data fields x, y, z ... for respective users 1, 2, 3, ... as the input data or data records, (b) After the human experts receive the results of the evaluation, they could alter the models to become model versions Al and Bl.

(3) The human experts then could cause comparative evaluations to be done of pairs of the versions of the models, such as the pairs A-Al, B-Bl, and Al-Bl. The human experts can cause the comparative evaluations to be done consistently by selecting the same set of data x, y, z ... for the same users 1, 2, 3 ... to be applied to the different models being compared. In some cases this could be done manually by the human experts. In some examples, the process could be executed semi-automatically by having the human expert simply identify a stored history of a previous evaluation. Then the technology can apply the same selected data or data records used in the selected previous evaluation in performing the evaluation of the new comparative evaluation to be done. In a more automated implementation, if a human expert has applied the same input data, such as fields x, y, and z, to a comparative evaluation of models A and B (with the result that A is determined to be better than B) and also to a comparative evaluation of B and Bl (with the result that B is better than Bl), the technology would automatically determine that model A is better than model Bl without requiring a human expect to manually conduct a comparative evaluation of A against Bl using the fields x, y, and z.

More than one human model evaluator can participate in the process of evaluating the model.

The selected input data or data records are then subjected to a data preparation process that includes cleaning up the data, removing personally identifiable information, and applying filters, among other things. The resulting cleaned data is provided to a model execution process that executes the model using the selected data and produces outputs or results (e.g., decisions or predictions). A model evaluation process then semi-automatically or automatically evaluates the performance of the model based on the input data or data records and the resulting outputs. A report of the evaluation is provided semi-automatically or automatically. Expert human evaluators can then analyze the report for purposes of evaluating the model. In some instances, the evaluation can apply not only to a single model and comparison of its performance over time as it is being developed, but can also apply to a comparison of two or more different models or versions of models that are treated as being in competition for the best performance with respect to the profile data sets. This enables an iterative process leading towards an optimal model.

An example of a JSON format for output from a model is shown below.

{

"model_name": "whisper",

"results": [

{

"name" : "keep existing card _pay_in_full ",

"reason": "CHECKINGMONEY",

"recommended": true

},

{

"name" : "new_0_percent_credit_card",

"reason": "REJPAYINFULL",

"recommended": false

},

{

"name": "new_personal_loan",

"reason": "REJPAYINFULL",

"recommended": false

},

{

"name" : "new_low_rate_credit_card",

"reason": "REJPAYINFULL",

"recommended": false

}

],

"execution_time_ms": 245,

"details": {

"message": "No debug information available for this model"

}

We use the term“semi-automatic” broadly to include, for example, any operation, process, or task that is performed at least partly by a computer, device, or other machine and is partly supervised, assisted, supplemented, or managed by a human. We use the term“automatic” broadly to include, for example, any operation, process, or task that is performed by a computer, device, or other machine without requiring involvement by a human. Generally, a feature of the technology is providing a mechanism for ongoing apples-to-apples comparison of the performances of individual models, versions of a given model, and types of models as they are developed, implemented, and adapted over time to changing data patterns and changing understanding of how the modeled systems behave. The mechanism also applies to ongoing apples-to-apples comparison of the performance of two or more different models of a given modeled system, both at a given point in time and at successive points in time. Through this mechanism, development and adaptation of competing challenger models proceeds more quickly and more effectively over a long period of time. In the resulting models enable the production mode processing of the model to be more efficient and more effective.

As shown in figure 4, in some implementations, one or more selected models 52 are each evaluated by running the model in the context of one or more selected profile data sets 64 (e.g., sets of input data or data records) to produce model outputs 56 which are used by a report builder 58 to generate one or more reports 60. The reports provide information useful in model development, evaluation, and deployment. The process of selecting the model and data profiles and running the model to generate the report can be iterated one or more times. The iterations enable a user to modify and retest a model, to compare results from applying different profile data sets to a given version of a given model, to test hypotheses about flaws or strengths of models and types of models, to evaluate the range (in terms of profiles) of good or bad performance of a model or a type of model, and to perform other activities as part of model evaluation, development, and deployment.

We use the term“profile data set” broadly to include, for example, any set of values or input data or data records representing one or more data features that characterize a state or behavior of a system being modeled, such as a personal financial system. The complexity of a profile data set can range from simple to complicated. In some cases, each profile data set to be selected in the model evaluator is associated with one or more known or expected outputs. The known or expected outputs may be part of the profile data set or separate from the profile. The known or expected outputs may be familiar to the user (e.g., a subject matter expert or a knowledge expert) who is engaged in the model development and evaluation and can evaluate the relationship between the outputs generated by the model and the expected outputs of the real world system. A key purpose of the model evaluation process is to enable the development user to select test profiles, run models with them, and study the generated results. In some cases, a profile data set is a single set of data representing a single example. In some instances, a profile data set is a template of examples each of which conforms to features or parameter values defined by the profile.

The profile data set (e.g., input data, data records, or data sets) to be used during a given test run of a model or models can be a profile data set 64 selected from existing production profile data sets by a profile selector 62 (controlled by a development user through the user interface). In some cases, the profile data set can be a test profile built using a test profile builder 66 (also controlled by a user). In some cases, the profile data set can be combination of a profile data set selected from existing production profile data sets and adjustments of the profile data sets made by the development user. For building a test profile data set, the development user can specify any value for any data feature that is characteristic of the system being modeled.

Production profile data sets needed for running a profile on a model can be obtained by the model evaluator using an interactive query service 68. The query service uses whatever query protocol is appropriate for a PCI-compliant production database 70 containing the production data. The query service 68 can be used either automatically by the model evaluator based on the selected or built profile or can be used by a development user through a query feature 71 of the user interface, or a combination of them.

An example segment of code for a profile builder is:

class ModellnputFileBuilder {

ModelInputFileBuilder(String excelFileName, String caseSheetName, String

mappingsSheetsName,

String strategiesSheetName, String reasonCodesSheetName, String outputDir) {

@SuppressWarnings('GroovyAssignabilityCheck')

def buildFileFromRow(Row row) {

// Create file

log. info('Creating model input file: {}', "${outputDir}/${userId}.json")

Files. write(Paths.get(outputDir, "${userld}.json"),

JsonOutput.prettyPrint(JsonOutput.toJson(miFileTemplate)). bytes)

} /*** Returns JSON structure with the strategy/reason code information contained in the given row. */

def buildLabels(Row row) {

def labels = []

if (strategiesColumnlndex) {

Cell strategiesCell = row.getCell(strategiesColumnlndex)

if (strategiesCell) {

Map<String, String> parseMappings(Sheet mappingSheet) { When PCI-compliant data is drawn from the database as a result of the query, it passes through a PII scrubber 72 that removes personally identifiable information and loads the scrubbed data into a modeling PII-cleaned data storage where it is held for use when the model is run. PCI- compliant data is stored in a production cluster data store, in cleartext, after being scrubbed of personally identifiable information. This data is regularly transferred out of the production cluster, through an intermediary host, into an isolated environment where it is held for ingestion by the model evaluator.

The user selects the model 52 using a model selector component 76 accessible through the user interface. Through the user interface, the development user can select from a displayed list any existing version of any model in any state of development, including any model being used on the machine-based production system 14. The revised current versions of the model include versions applicable at a given time to two or more contexts. The two or more contexts are associated with respective different groups of end users such as end users of different age groups, genders, or income groups.

Once the model is selected and the profile data sets are selected or built the development user can start a model runner component 78, which executes the model using the profile data sets. In some implementations two or more models can be run concurrently on a given profile data sets or on different profile data sets.

An example of an entry point (e.g., a web service call) to start the evaluation of candidate models follows:

Strategy Service

strategy Service @RequestMapping(value = 'strategies', method =

RequestMethod.POST)

@ResponseBody

ResponseEntity getStrategies(@RequestParam(value = 'debug', defaultValue = 'false') boolean debug,

@RequestBody Map request) {

def profileld = request.metadata?.profile_id ?: 'unspecified'

if (! request. values) { log.error("Malformed request. Model inputs should be nested under 'values'. Profile: { } ", profileld)

return

ResponseEntity. status(HttpStatus.BAD_REQUEST).body([error: "Model inputs should be nested under 'values'"])

}

try { log.info('Received request for profile: { }', profileld) def result = strategyService.evaluateStrategies(request.values As shown in figure 5, in some implementations of a model development and evaluation process, the flow for model generation and evaluation includes a prototyping step 92 in which a development user can work through, for example, a netica user interface 94 to define the characteristics of a model that will be expressed in a model structure 96. The prototyping step need not be limited to the use of the Netica modeling platform but could be implemented using a wide variety of model prototyping platforms such as Figaro, BayesFusion, BayesianLab, or other Bayesian modeling tools.

An example of code for a Bayesian network model is shown below.

/* *

* Inserts a node in the network, retaining type information.

*

* @param name name of the node

* @param nodeType the logical type of the node (e.g. remedy, diagnosis, etc.)

* @param element the Figaro [[com. era. figaro. language. Element Element]] instance that represents the node

* @tparam T type of values produced by the probabilistic process represented by the node

*/

def insert[T: TypeTag](name: String, nodeType: String, element: Element[T]): Unit = { if (contains(name)) throw new RuntimeException("The network already contain a node with name [" + name + "]")

nodes += (name -> Node(name, nodeType, element, typeOf[T]))

}

/* *

* Retrieve a node by name in a type-safe manner. Example use:

{ { {network.getNode[Double]("double_node_name")} } }

* An exception will be thrown if the given type does not match the type of the node with the given name.

*

* @param name name of the node to be retrieved

* @tparam T type of values produced by the probabilistic process represented by the node * @retum the [[com. cra.figaro. language. Element Element]] instance that represents the node

*/

def getNode[T: TypeTag](name: String): Element[T] = {

nodes. get(name) match {

case Some(Node(_, element, valueType)) =>

valueType match {

case x if x =:= typeOf[T] => element.asInstanceOf[Element[T]]

case _ => throw new RuntimeException(" Value type mismatch for node [" + name + "]. Expected [" +

typeOf[T].toString + "], actual [" + valueType.toString + "]")

}

case _ => throw new RuntimeException("No node with name " + name)

}

An example of code to execute a Bayesian network model is shown below:

class BayesianReasoningComponent implements ReasoningService {

private static final List<String> TARGET NODE TYPES = ['remedy', 'diagnosis', 'move']

/* *

* This factory returns a clean instance of the Bayesian network.

*/

private final ObjectFactory<BayesianNetwork> factory

BayesianReasoningComponent(ObjectFactory<BayesianNetwork> factory) {

this. factory = factory

}

/* *

* {@inheritDoc}

* <pxp> * This implementation uses the {@link ObjectFactory} member to fetch a new instance of the {@link BayesianNetwork}

* prototype bean on each invocation.

*/

@Async

@Override

CompletableFuture<List<NodeResult>> evaluate(Map< String, Object> modellnputs) {

// Get Bayesian network instance

BayesianNetwork network = factory. object

log.debug('Obtained fresh bayesian network instance: {}', network)

// Clean up network, just in case

network.unobserveAll()

// Apply MI as evidence if there is a correspondent node in the network

modellnputs. findAll { network.contains(it.key) }.each { network.observe(it.key, it.value) }

// Get probabilities

CompletableFuture.completedFuture(network.getProbabilities(TARGET_NODE_TYPES))

}

The user who typically will perform the prototyping can be a knowledge engineer 102. During the prototyping process, the netica platform uses the information provided by the knowledge engineer about the features, characteristics, and structure of the intended model to generate the executable model structure 96. The model structure is captured in a .neta file. Also during the prototyping process, the model structure can be executed by the netica platform using so-called black-and-white profile data 80. The term“black-and-white” refers to profile data sets for which the expected output of the system being modeled is known in advance by the knowledge engineer. Therefore, it is possible for the knowledge engineer to determine easily whether the model structure when executed using the black-and-white profile data set is producing the expected output.

When the model structure is run using the black-and-white profile data sets, the results can be captured in an Excel report 100. The knowledge engineer has access both to the Excel report and to the .neta description of the model structure. This enables the knowledge engineer to understand flaws and advantages in the model structure and to develop the model by interaction through the GET 94. The feedback through the prototyping platform also includes specifying information about black-and-white data and black-and-white profiles that should be used by the model structure during its execution.

The generation of black-and-white input data and black-and-white profile data sets can begin with a data builder process 86 that can be controlled by the knowledge engineer or in some cases performed semi-automatically or automatically. The data builder process 86 enables the definition of data values and data features for which the expected results of the system being modeled are known. Particular examples can be created and stored in a black-and-white database 84. A profile generator 82 can use information from the model structure 96 to automatically or semi-automatically use appropriate data values and data from the black-and-white database to produce black-and-white profile data sets. For example, suppose a model is designed to use data fields a, b, and c as model inputs subject to certain filtering criteria. For example, the filtering criteria could be: If the value of“a” is >20, then the model generates a different decision than if that criterion is not met. If the value of“b” is between 0.15 and 0.6, then the model generates a specific decision. If the value of“c” is null or not null, then different respective decisions are generated. A profile generator then could generate black-and-white profile data sets

automatically by searching sample data records with those major criteria in mind. The profile generator could identify data records such that 50% of the records have values of“a” > 20 and 50% of the records have values of“a” < 20; some records have“b” values within the range 0.15 and 0.6 and other records have“b” values not within that range; and some records have“c” values that are null and some records have non-null“c” values.

In another example, the model structure may use values for interest rates charged on credit card accounts of the consumer. Black-and-white input data can be built that contains values for such interest rates for respective credit card accounts. The model structure may use a composite value that combines the interest rates charged on the credit cards accounts. And the knowledge engineer may know that if the composite value of the interest rates is higher than 25%, the personal financial system of the consumer should terminate the use of the highest interest rate credit card account. The black-and-white profile generator could generate a profile data set that enumerates the interest rates of the different credit card account and identifies the highest interest rate credit card account. An illustration of execution of a profile generator follows:

Text of Model Evaluator task“generateTestData” ip-l0-254-23-44:model-evaluator anthony$ /gradlew generateTestData - Pexcel=samples/dml_test2.xlsx -PcaseSheet=Cases -PmappingsSheet=Mapping - PstrategiesSheet=Strategies

Starting a Gradle Daemon (subsequent builds will be faster)

Download https://jcenter.bintray.com/com/google/guava/guava/maven-metadata.xml Download https://jcenter.bintray.com/joda-time/joda-time/maven-metadata.xml

> Task :generateTestData

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/Users/anthony/.gradle/caches/modules-2/files- 2. l/org.slf4j/slf4j-simple/l.7.25/8dacf95l4f0c707cbbcdd6fd699e8940d42fb54e/slf4j-simple- l.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/Users/anthony/.gradle/caches/modules-2/files- 2. l/ch.qos.logback/logback- classic/l.2.3/7c4f3c474fb2c04ld8028740440937705ebb473a/logback-classic- 1.2.3 j ar ! / org / slf4j /impl/ StaticLoggerBinder . class] SLF4J: See http://www.slf4j.org/codes. html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org. slf4j .impl. SimpleLoggerFactory]

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Parsing excel file... [main] INFO com.cinchfmancial.ModellnputFileBuilder - Mapping sheet contains 952 rows

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Detected [19] model input mappings in [Mapping] sheet

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Cases sheet contains 27 rows [main] INFO com.cinchfmancial.ModellnputFileBuilder - Detected strategy column with name [Strategies] in sheet [Cases]

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Detected reason codes column with name [Reason code] in sheet [Cases]

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Detected [19] model input columns in [Cases] sheet

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Starting model input file generation...

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file:

test_data/generated/User l.json

test_data/generated/User 2.json

test_data/generated/User 3.json

test_data/generated/User 4.json [main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file:

test_data/generated/User 5.json

test_data/generated/User 6.json

test_data/generated/User 7.json [main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User 8.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User 9.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User lO.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User 11 j son

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l2.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l3.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l4.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l5.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l6.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l7.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l8.json

[main] INFO com.cinchfmancial.ModellnputFileBuilder - Creating model input file: test_data/generated/User l9.json

[main] INFO com.cinchfmancial. As shown in figure 5, the knowledge engineer can iterate the process (suggested by the plus/minus symbol in block 102) of adjusting the model, reviewing the .neta file and the Excel report, and having the model structure run with the black-and-white profile data sets, until the knowledge engineer considers the model structure to be suitable for consideration as a candidate model 104 to be evaluated alone or compared with other models and possibly put into productive use. The knowledge engineer can then embed the .neta file of the candidate model in a wrapper 106. The wrapper provides a mapping between the output of the model (in the case of most probabilistic models, a set of results with their probabilities) and what the production system expects from the model. In some implementations, a set of rules can be used to map the probabilities to yes/no for each of the outputs and to disallow certain combinations of outputs. A model evaluator 108, such as the model evaluator described above, then processes the candidate model by running the model using black-and-white data 110 or production data 112 or a combination of them. The results of running the model are provided in an intermediate report 114.

An example of an execution of candidate models follows:

Text of Model Evaluator task“evaluate” using Prod Data ip-l0-254-23-44:model-evaluator anthony$ /gradlew evaluate -Pmodels=netica, whisper - Pdata=te st data/ prod

> Task : checkout-model Api-netica-devel op Already on 'develop'

Your branch is up-to-date with 'origin/develop'.

> Task :pull-modelApi-netica-develop Already up-to-date.

> Task :debt-strategist:downloadNetworkFile downloading s3://cinchfmancial-dev-dependencies/branches/develop/bayesian-network-debt- netica-0.1. l9.neta to /Users/anthony/dev/model-evaluator/model_api/build/generated- resources/netica/debt network.neta

> Task :debt-strategist:downloadNodeMappingFile downloading s3://cinchfmancial-dev-dependencies/branches/develop/bayesian-network-debt- netica-node-mapping-0.1. l9.csv to /Users/anthony/dev/model- evaluator/model_api/build/generated-resources/netica/node_mappings.csv

> Task : query-model Api-netica-devel op

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/Users/anthony/.gradle/caches/modules-2/files- 2. l/ch.qos.logback/logback- classic/l.2.3/7c4f3c474fb2c04ld8028740440937705ebb473a/logback-classic- 1.2.3 Jar!/ org / slf4j /impl/ StaticLoggerBinder . class]

SLF4J: See http://www.slf4j.org/codes. html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org. slf4j .impl. SimpleLoggerFactory]

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 006bl7d2- bb0a-43ff-9953-fe4lf5fbff99.json [main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 008a7ed0- 8762-4495-b09l-45d56d64cl06.json

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 00dee30d- 6a07-4e7e-b7cc-c5df7l^:325bdl.json [main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 0l205f94- 484c-4882-bda8-410112fe6dd7.j son

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 0l38cd3e- 56ad-4lc9-9cl7-365f2l55le9c.json [main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 01424782- 2ad0-480d-a074-eebcd6969832.json

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 01609151- 527a-4390-bd69-7c3072bab739.json

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 0l85b4c0- 4ac0-4e74-a758-e27f01 e3 lbc9.j son

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 0l99ff84-c7c3- 420a-afcb-68a4f0792cld.json

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 0lc362ef- 3469-4al 7-ae78-31837735dl43.j son [main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 022e272f- e07e-449e-ae9f-7dl5300cd340.json

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 026lac5d- c6de-472d-bb82-059be069b9c6.json

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 02902lfe- 61 cd-46f5-b2e0-9db2a23 e8d0e.j son

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 03df6e32-f888- 40d4-8a45-5f20fd668c6f.j son

[main] INFO com.cinchfmancial.ModelApiClient - Querying Model Api for file: 0486282e- 4dc9-4fd3 -8b 10- 170539d 1 a07b j son

An example of code for running an intermediate report follows: python run intermediate report.py -h

usage: run intermediate report.py [-h]— model 1 MODEL1— model2 MODEL2—truth

TRUTH [-results RESULTS] [-save SAVE]

[-output OUTPUT]

[-sensitivity SENSITIVITY]

launch intermediate report for comparing model outputs

required arguments:

— model 1 MODEL1 name of first model result to compare (default: None)

— model2 MODEL2 name of second model result to compare (default: None)

—truth TRUTH choose one of the two models to use as relative ground truth (default: None) optional arguments:

—results RESULTS path to results JSON (default: data/results.json)

—save SAVE, -s SAVE path to save drilldown report (default:

drilldown report.xlsx)

-output OUTPUT, -o OUTPUT

target to output report (default: report.html)

-sensitivity SENSITIVITY

path to node sensitivity report JSON (default: None)

A successful running of an intermediate report follows:

python run intermediate report.py— model 1 netica— model2 whisper—truth whisper—results data/results.j son—sensitivity data/debt-strategy -network-sensitivity j son

[NbConvertApp] Converting notebook intermediate report.ipynb to html

[NbConvertApp] Executing notebook with kernel: python3

[NbConvertApp] Writing 831976 bytes to report.html

Examples of inputs to report (of two models to be compared and evaluated) to generate an intermediate report follow: [

{

"profile id" : "0a82dec3-34a7-4aa4-bc86-584fb9877n24", "models": [

{

"model name" : "netica",

"results": [

{

"name" : "keep_existing_card",

"reason": "KEEPCARD",

"recommended" : true

},

{

"name" : "new_0_percent_credit_card",

"reason": "REJNOC RE) APPROVE",

"recommended": false

},

{

"name" : "new_low_rate_credit_card",

"reason": "REJNOC ARD APPROVE",

"recommended": false

},

{

"name": "new_personal_loan",

"reason": "REJLOANNOTLOGICAL",

"recommended": false

}

],

"execution time ms": 1101,

"details": {

"net_execution_time_ms": 455,

"wrapper_execution_time_ms" : 30

}

},

{

"model name" : "whisper",

"results": [

{

"name" : "keep_existing_card",

"reason": "KEEPCARD", recommended" : true

},

{

"name" : "new_0 _percent_credit_card",

"reason": "REJLONGP AY OFF " ,

"recommended": false

},

{

"name": "new_personal_loan",

"reason": "REJLOANNOTLOGICAL",

"recommended": false

},

{

"name" : "new_low_rate_credit_card",

"reason": "REJNEWCARDNOTLOGICAL",

"recommended": false

}

],

"execution_time_ms": 741,

"details": {

"message": "No debug information available for this model"

}

],

An example result/comparison diagram is shown in figure 15.

Among the components of the intermediate report may be a precision and recall section 116, a confusion matrix 118, an early performance section 120, and a drill down section 122.

An example of a function to support drilldown reports into profiles with non-matching results from two files follows:

def drilldown(data):

FUNCTIONALITY:

takes in parsed dataframe and selects profiles that have non-matching model results ARGS:

data: dataframe with profiles as rows and model results as columns RETURNS:

data nonmatch: filtered data containing profiles with non-matching results return data nonmatch

An example of a calculation of a percentage of non-matching profiles between models follows: def percent_nonmatching(data, data nonmatch):

FUNCTIONALITY:

calculates percent of profiles that have non-matching model results

ARGS:

data: dataframe with profiles as rows and model results as columns

data nonmatch: data with only non-matching profiles

RETURNS:

percent nonmatch: percent of profiles with non-matching results num profiles = len(data)

num profiles nonmatch = len(data nonmatch)

percent nonmatch = 100*(num_profiles_nonmatch/num_profiles)

return percent nonmatch

An example of code to generate a heat map matrix comparing model outputs follows: def plotfunction(data_match, data nonmatch):

FUNCTIONALITY:

creates visual comparisons (bar plot, heatmap confusion matrix) of profiles with matching or

non-matching model results

ARGS:

data match: data with only matching profiles

data nonmatch: data with only non-matching profiles

RETURNS:

bar plots and heatmap confusion matrices comparing model outputs return barplot(data match), barplot(data nonmatch), heatmap(data match),

heatmap(data nonmatch) A sample heat map matrix is shown in figure 14.

The contents of the drill down section 122 can be implemented by parsing profile data sets from the production data database 112 based on information about a production machine learning model 126 that may be the subject of the model evaluation. The intermediate report produces the drill down section from the results of the model evaluator. The model evaluator results contain (1) the results of having run each model against profile data sets and (2) the profile data sets that were used by the model. The drilldown section shows, side by side, the results of different models for a particular profile, if they didn't match, and the relevant model inputs in that profile.

An example of code to calculate an average time to run each model follows.

def performance stats(data time)

FUNCTIONALITY:

creates plot comparing distribution of execution times for models and outputs average execution times

ARGS:

data time: columns of data containing execution times for each profile and each model

RETURNS:

distplot: histogram of execution times

avg time: average execution time for each model return distplot, avg time

A subject matter expert 128 reviews the intermediate report and may determine that alterations should be made in an updated version of the model as part of the prototyping (e.g., model generation or model development) process, in which case the process shown in figure 5 is repeated (iterated). For models trained using profile data sets, the subject matter expert’s understanding of the decision process represented in the model is important because the expert is able to identify areas where there may be gaps or inconsistencies in the training profile data sets. Subject matter experts also validate or disprove the assumptions and policies created by the modeler. Or the subject matter expert may determine that the performance is adequate, in which case the subject matter expert can produce a model definition report 88 and a load performance stress report 90. The model definition report can include data from the intermediate report and a description of the model including its approach, limitations, and future improvements. The performance stress report can indicate whether the model is ready to go into production from an engineering standpoint, for example, is it fast enough and able to handle the expected processing load.

The process involving the subject matter expert can be iterated repeatedly until the candidate model is considered usable for production or some other purpose.

Therefore, figure 5 describes a model development and evaluation process that includes two levels of iteration. One level of iteration involves the original prototyping and development of successive versions of a model by the knowledge engineer. That level of iteration lies within an overarching iteration performed by the subject matter expert with respect to evaluation of the performance of the prototyped model.

As shown in figure 6, in typical model development systems, a deployed model 138 (one that has been put into a production environment or mode) can be subjected to online evaluation 136 by observing results of applying the deployed model 138 to profile data sets from a live

(production) database 132 to produce production-mode evaluation results 140. Also in typical model development systems, a prototype model 142 under development can be subjected to off- line evaluation 144 using historical data 146 in a process of off-line evaluation of live data 148 that produces validation results 150. Once a prototype model has been validated and accepted for use in a production environment, the prototype model can be set up as the deployed model.

A feature of the technology is that the off-line (non-production) evaluation 144 can include an execution of a prototype model using either historical data or live (real time production) data. And more than one prototype model or deployed model can be subjected to off-line evaluation including in a comparison mode in which a prototype model is viewed as a challenger of a deployed model or in which one version of a prototype model is viewed as a challenger of an earlier version of a prototype model. Different modes of use of the technology can be arranged based on when the models are applied and whether the uses involve end user participation, including the following modes: (1) Generating decisions and advice and interacting with end users in a production mode in real time, and performing offline operation to apply the same production profile data sets to other models and versions of models for comparison purposes.

(2) Running models solely offline using profile data sets to evaluate models and compare them. (3) Same as (1) and including prediction of a future outcome if advice based on decisions of the model are applied, for example, a predicted 5-yr outcome if an end user takes specific actions like paying extra $200 against credit debt every month.

(4) Same as (2) and including prediction of a future outcome if advice based on decisions of the model are applied, for example, a predicted 5-yr outcome if an end user takes specific actions like paying extra $200 against credit debt every month.

When a deployed model being used in a production system is subjected to off-line model execution and evaluation, one or more model experts serving as evaluators can specify and select profile data sets to be used for evaluation of the deployed model.

Referring to figure 7, to facilitate the work of experts (for example, knowledge engineers or subject matter experts or other development users) who evaluate models as part of the model development process, the user interface 160 presented to the experts can include a variety of features. The user interface can provide a query builder 162 that enables the user to select profile data sets based on a variety of characteristics or features. As an example, the query builder could enable the user to specify that the model be evaluated using profile data sets for consumers whose demographic characteristics include an age between 30 and 35. A field selector 164 may then be used to choose the fields of data for the records (data sets) selected by the query to be applied to the model being evaluated. A data joining feature 166 of the user interface can be used to select related data from different data sources and to specify how the selected records of data should be joined, for example, based on the same profile ID.

In evaluating a model, the development user must decide how many different data records (e.g. profile data sets) should be executed during the running of the model. The user interface can provide a number-of-records feature 168 to enable the user to specify the number. In addition to specifying the number of records to be applied to the model, the user can select, specify, or define a sampling principle or algorithm to be applied in selecting that number of records from one or more databases. The user interface provides a sampling algorithm feature 170 for this purpose. As examples, the sampling could be: (1) totally random with no criteria to filter, (2) sampling within a specific target users (such as gender = F, age = 20 - 30, location = New Yolk city, select 0.5M such users randomly), or (3) sampling within the specific target users with ratio matching the population (such as gender = F, age = 20 - 30, location = New York city, select 0.5M such users according to five income levels proportionally: if 30% of the set population is in a range of 30k - 50k income level, then 30% of the randomly selected 0.5M should be within this range; if 10% of this set population is in 50k - 80k income range, then 10% of the randomly selected 0.5M should be within this range, and, if 2% of this set population is in the range of 80k - l50k, then 2% of the randomly selected 0.5M should be within this range.

To facilitate the automatic or semi-automatic evaluation of the model or comparison of models, the user can specify the outputs or results that are expected to be produced by the system being modeled based on the records applied to the model as a result of the selections made by the development user through the interface elements discussed above. The development user can define or specify or select the outputs or results in a variety of ways. For example, the development user can select or define rules or combinations of conditions to be applied to input data records or other information to determine what the outputs or results should be. In some instances, the user can specify manually and literally the outputs to be expected. Other approaches can also be used. An expected results feature 172 of the user interface provides this capability.

A wide variety of other features and capabilities can also be provided on the user interface to enable the user to define how the model should be run, data records (e.g. profile data sets) to be applied to the model during the execution, and other features and combinations of them.

As explained above, the technology provides an off-line (non-production) model execution platform to perform human-assisted, semi-automatic, and automatic evaluation of models being developed or in use in production environments. One feature of the technology that enables the semi-automatic and automatic evaluation of models and also facilitates human-assisted evaluation of models is the use of input data formats, output data formats, and performance metrics formats that are predefined as part of concepts contained in one or more ontologies or hierarchies of ontologies. Reference is made to the patent application cited earlier (and incorporated in its entirety) regarding the use of ontologies for financial advice technology. In some implementations, one or more ontologies can be used to define concepts, a hierarchy of concepts, and protocols for expressing data for such concepts in a common format. When input data is received by a model, the data can be processed in a way that interprets the data

consistently with a predefined part of one of the ontologies. Then the model can treat the processed data in accordance with an internal concept contained in one or more of the ontologies. Similarly, models can be configured so that outputs or results that they deliver are in a predetermined format defined by concepts that are part of the ontologies. As a result, analysis and comparison of model results on an apples-to-apples basis can be achieved easily. Ontologies, hierarchies of ontologies, and protocols for defining the formats of data corresponding to concepts of the ontologies can also be used in the expression of metrics for the performance of models. The use of ontologies facilitates the human-assisted, semi-automatic, or automatic execution of models during an evaluation process and the evaluation comparison of models based on their outputs and on performance measures of their outputs.

As shown in figure 8, in some implementations of the technology, model evaluation and the generation of an evaluation report (such as a comparison report 186 of the results of the comparison of two different models or a given model at two different stages of development) can be done semi-automatically by the off-line model execution platform 184. The semi-automatic operation is facilitated by the use of ontology-compliant data 194 as an input to an ontology- based model 190 that produces ontology-compliant results 196. The evaluation process 192 uses the ontology-compliant results and ontology-compliant evaluation metrics. The use of ontologies enables the off-line model execution platform to execute its processes semi-automatically to produce a useful comparison report. We say that the operations are semi-automatic (rather than fully automatic) because of, for example, the role of the knowledge engineers and subject matter experts 180 through the user interface 182 in configuring the off-line model execution platform for execution as explained earlier.

In some implementations, a model evaluation report of the technology can have two parts. One part is generated automatically by, for example, the off-line model execution platform. The second part can be manually generated by subject matter experts and can contain, for example, evaluations, comments, descriptions, and other information determined by the experts to be useful for inclusion.

Among other things, the model evaluation report (or a comparison report, for example) can include some or all of the following elements:

1. The motivation for the model. a. The problem that the model is attempting to solve.

b. The type of the model.

c. For a statistical type of model, a description of the data available for training.

d. The mechanism used for evaluating the model (online using production data or off-line using production data or historical data or a combination of them).

2. The evaluation mechanisms.

a. The evaluation mechanisms used for off-line evaluation.

b. The testing mechanisms used for online evaluation.

c. Hyper-parameter search.

3. Evaluation metrics.

a. Classification metrics.

i. Accuracy.

ii. Confusion matrix.

iii. Per-class accuracy.

iv. Log-loss.

v. Area under curve.

b. Ranking metrics.

i. Precision-recall. ii. Precision-recall curve and the Fl score. The Fl score combines precision and recall into one metric; it is the harmonic mean of the two, i.e. Fl score =

2 * (preci sion*recall)/ (preci sion+recall) .

iii. normalized discounted cumulative gain (NDCG).

c. Regression metrics i. Root-mean-square error (RMSE).

ii. Quantiles of errors.

d. The basis of the conclusions of the evaluation report and related discussion and statistics suggesting reasons for discrepancies between outcomes of competitive models or a model and“truth”.

e. Access to information used in the evaluation.

i. Where and how to access the model and information about the current model logic.

ii. The location of the data used for the evaluation.

iii. The location of log information about the execution of the model. iv. Other notes.

v. Comparison between Mis values for models being compared.

In some implementations, the evaluation report can include a Venn diagram or chart representing the relationship of precision and recall related to the results of the running of the model. As shown in figure 9, the square 200 represents the results of running the model. The square 204 represents the results known to be relevant. The rectangle 202 (the overlap of the two squares) represents the“happy correct results”, that is, the results of running the model that are known to be relevant.

Mathematically, precision and recall can be defined as precision = TP/(TP+FP) and recall = TP/(TP+FN) where TP=true positive, FP=false positive and FN=false negative).

Precision versus accuracy can be illustrated by the table shown in figure 10. The upper left target in figure 10 illustrates model results that are both accurate (because all of the results obtained during the testing hit“the bull’s-eye”) and precise. The upper right target illustrates model results that are precise (because the results are bunched together) but are inaccurate (based on a systematic error) because the results are off the bull’s-eye. The lower left target illustrates model results that are accurate (because centered on the bull’s-eye) that are imprecise (because they are not tightly bunched, indicating a reproducibility error). The lower right target illustrates model results that are both inaccurate and imprecise.

In some instances, the model evaluation report can include a confusion matrix (or confusion table) that shows a breakdown of correct and incorrect results, such as when the model is a classification model. As shown in figure 11, the rows of the confusion matrix represent known expected results from the modeled system. The columns of the confusion matrix represent actual results from running the model. For the example shown in the figure, there were 100 examples known to be in the positive class, and 200 examples known to be in the negative class. The matrix illustrates that the model produced lower recall for examples in the positive class (only 80 of the 100 examples that were positive or determined by the model to be positive) than it did for examples in the negative class (195 and 95 of 200 total samples were correctly classified). This per-class information is useful because it would not be apparent from a report of the overall accuracy of the model for both positive and negative examples, for which the accuracy would be 275 of a total of 300 examples.

In some implementations, the model evaluation report can include an area-under-curve (AUC) report. The curve can be a receiver operating characteristic (ROC) curve used to show the sensitivity of a classifier model in the form of a curve of the rate of true positives against the rate of false positives. In other words, the ROC curve demonstrates how many more correct positive classifications of examples by the model can be gained by allowing more and more false positive classifications of examples by the model. For example, a perfect classification model that makes no mistakes would hit a true positive rate of 100% immediately, without incurring any false positive, and AUC would be 1, a situation which almost never happens in practice.

The model evaluation engine can use various types of validation mechanisms in terms of the data used for validation. The mechanisms can include holdout validation in which a portion of the example data is used for training and a different portion is used for validation of the trained model. Another mechanism is K fold cross validation in which the data examples are divided into K sets, one of which is a validation set and the others of which are used for training. The third mechanism is a bootstrap resampling mechanism in which an original set of data examples is resampled to produce a resampled data set to which training and validation are applied.

In some implementations, the technology uses a combination of parameters and hyper parameters in evaluating machine learning decision-making models. Typically, a model parameter is a value that is learned during training and defines a relationship between features of the data and a target to be learned. In other words, parameters are central to the process of model training. Hyper parameters, by contrast, are specified outside of the process of training the model. Variations of linear regression, for example, have such hyper parameters. In some cases the hyper parameter is called a“regularization” parameter. Decision tree models have hyper parameters such as a depth of the model and a number of leaves in a tree.

As discussed earlier, an important feature of the technology is in supporting a model

development process that is disciplined, rigorous, adaptable, automated or semi-automated, easy to use, easy to understand, consistent, accurate, and informative, among other things. The model development process establishes generalized target model performance standards, validates existing model accuracy and identifies opportunities for improving models, and creates a repeatable workflow for model generation and evaluation.

The model development process can include the following sequence of activities that are performed by the respective identified experts.

1. Define the concept of the model and the policy associated with model (business analyst)

2. Determine features that will achieve the concept and the policy (data scientist, probabilistic modeler)

3. Develop a prototype model (knowledge expert or modeler)

4. Evaluate the prototype model using the model evaluation platform to access production data, and calculate precision and recall, based on expert knowledge (business analyst, knowledge expert)

5. Create and tune a production version of the model (modeler, subject matter expert, artificial intelligence engineer) 6. Evaluate the integration of the production model with other features of the system such as a financial advice system (business analyst)

7. Test the success of the integration (software developer, quality assurance team)

In some implementations, rules for evaluation and model development workflow can be established to assure consistency and quality in the evaluation and development process. For example, a rule may require that each model be assessed using precision and recall analysis. This analysis can include baseline assessments conducted on existing models and establishment of standards for model accuracy and improvement of a model compared with existing models using the baseline assessment. Another rule is to require the creation of a“ground truth” set of expected results for a set of data, so that evaluation of a model against the input data will be effective. The rules can include a rule requiring that all candidate models for integration into a production system be justified by an evaluation report that includes several components. The components can include statistical performance evaluation against training data sets and ground truth expected results. Another component can be knowledge assessment of the soundness of a model compared with ground truth business logic and the concept of the platform in which the model will operate. A required component of the evaluation report can be a rationale for a recommendation to integrate the model into a production platform in which the recommendation articulates the reasons for the change or the addition of the model. For example, the

recommendation could cite a 50% improvement in accuracy.

As discussed earlier, the model development and evaluation technology supports evolutionary model development and ontology principles of common input data. For this purpose, the model development and evaluation technology serves the following objectives: It enables the mimicking of human experts using other approaches. It performs knowledge engineering to design, for example, Bayesian network-based models. It performs data-driven training for black box approaches such as deep learning. It applies consistent model evaluation criteria to enable the model evaluations to be performed automatically or semi-automatically. It allows each model to be developed over time and to compete at any time and concurrently with other models for use in production contexts.

Among the advantages of the evolutionary model development enabled by the model evaluation technology described here are the following: the model evaluation process is performed semi- automatically which facilitates more rapid, more logical, and more disciplined evaluation by experts of competing types of models (such as machine learning, Bayesian, deep learning, and others) and specific competing models. The technology can enable model development to be done in competition with the development of models by human experts (such as model development by portfolio managers for investment purposes). The technology provides transparency with respect to the performances of types of models in particular models which can lead to broader use of more effective models. Transparency can stimulate a competitive ecosystem to motivate the evolution of different competitive model types and models.

As shown in figure 12, the model development and evaluation technology can involve interaction between the development process for models being used in production environments 302 and candidate models under development 304. With respect to models being used in production environments, the technology enables a user to implement data and feature enhancements 306 of a production model. With respect to candidate models under development, the technology enables a user to define an initial prototype 308 and develop the prototype. Data and feature enhancements can be part of the process of developing the initial prototype, and the data and feature enhancements can be generated in the course of the prototype development. Ontology alignment 310 (that is, assuring that the inputs, outputs, and metrics conform to define concepts of one or more ontologies) is applied to the data and feature enhancements 306 and to developed rules/Bayesian models 312 that result from the prototyping 308. As part of the model development 312, the user can define target use cases 314. Eventually, the rules/Bayesian and model can be put into production deployment 316. Any model in production and any candidate model can be compared by challenger/champion comparison testing 318 using production data in an off-line mode 320. If the challenger performs better than the champion, the production environment can be upgraded by deploying the challenger 322.

Other implementations are also within the scope of the following claims.

Claims

1. A machine-based method for improving execution by a processor in a production environment of a process that uses input data to generate decisions about actions to be applied to a real-world system to improve a state of the real-world system,

the decisions conforming to data evidencing decisions that improved the state of the real- world system based on known input data,

the real-world system operating according to open-ended numbers of and types of variables and dependent on subjective knowledge about decisions about actions to be applied, input data for at least some of the variables and subjective knowledge about decisions to be applied being unavailable from time to time, and

the data evidencing decisions that improved the state of the real-world system being incomplete,

the machine-based method to improve the execution by the processor in the production environment comprising

(a) the processor executing a first executable model using available input data to generate decisions on actions applicable to the real-world system,

(b) the processor executing a second executable model using the available input data to generate decisions on actions applicable to the real-world system,

(c) the processor executing a comparative evaluation process to generate comparative performance data indicative of a performance of the first executable model compared to a performance of the second executable model, based on the available input data and on the decisions generated by execution of the respective executable models,

(d) enabling at least one human expert to guide generation of a third executable model based on (a) the available input data, (b) the decisions generated by execution of the first and second executable models, (c) the comparative performance data, and (d) the subjective knowledge of the at least one human expert,

the actions (a) through (d) being repeated for additional iterations, at least one human expert selecting one of the models for use in a production environment, and executing the selected one of the models in the production environment.

2. The machine-based method of claim 1 in which the processor executing the first executable model, the second executable model, and the third executable model comprising the processor executing two or more different executable models.

3. The machine-based method of claim 1 in which the processor executing the first executable model, the second executable model, and the third executable model comprising the processor executing two or more different versions of a given executable model.

4. The machine-based method of claim 1 in which the processor executing the first executable model, the second executable model, or the third executable model comprises using real-time input data in a production mode.

5. The machine-based method of claim 1 in which the processor executing the first executable model, the second executable model, or the third executable model comprises using historical input data in a non-production mode.

6. The machine-based method of claim 1 in which the processor executing the first executable model, the second executable model, or the third executable model comprises executing based at least partly on inputs from a knowledge expert.

7. The machine-based method of claim 1 in which the processor executing a comparative evaluation process comprises executing a model evaluator using production data.

8. The machine-based method of claim 1 in which the processor executing a comparative evaluation process comprises generating a report of an evaluation of each of the models.

9. The machine-based method of claim 1 in which the processor executing a comparative evaluation process comprises executing at least one of the first executable model, the second executable model, and the third executable model in real time in a production environment using real-time data and executing at least one of the first executable model, the second executable model, or the third executable model later in a non-production environment using the same real time data.

10. The machine-based method of claim 1 in which the processor executing a comparative evaluation process comprises executing at least two of the first executable model, the second executable model, and the third executable model in a non-production environment.

11. The machine-based method of claim 9 in which the processor executing a comparative evaluation process comprises comprising using predictions of outcomes based on decision outputs of the first executable model, the second executable model, or the third executable model.

12. The machine-based method of claim 10 in which the processor executing a comparative evaluation process comprises using predictions of outcomes based on decision outputs of the first executable model, the second executable model, or the third executable model.

13. The machine-based method of claim 1 comprising the processor receiving input data expressed in accordance with concepts specified in ontologies associated with the first executable model, the second executable model, or the third executable model.

14. The machine-based method of claim 1 comprising the processor executing the first executable model, the second executable model, or the third executable model to generate decisions expressed in accordance with concepts specified in ontologies associated with the first executable model, the second executable model, or the third executable model.

15. The machine-based method of claim 1 comprising the processor presenting a user interface to the human expert to enable the human expert to guide generation of the third executable model.

16. The machine-based method of claim 15 in which the processor presenting a user interface enables the human expert to identify profiles of input data.

17. An apparatus comprising

one or more processors, and

a storage containing instructions executable by the one or more processors to perform an iterative development process for one or more models of real-world systems, each iteration including: running current versions of two or more competing models of financial systems to generate corresponding model outputs for each of the competing models based on the same one or more profiles of input data, the profiles of input data representing possible influences on the real-world system for which particular outputs are expected,

evaluating relative performances of the two or more competing models, providing information about the relative performances to a human expert, and receiving revised current versions of one or more of the competing models developed by the human expert.

18. The apparatus of claim 17 in which the two or more competing models of real-world systems comprise an original model and a revised version of the original model.

19. The apparatus of claim 17 in which the two or more competing models of real-world systems comprise models that were independently developed.

20. The apparatus of claim 17 in which the running of the current versions comprises running the current versions in an off-line mode.

21. The apparatus of claim 17 in which running of the current versions comprises running at least one of the current versions in a production mode.

22. The apparatus of claim 17 in which evaluating relative performances of the two or more competing models comprises evaluating precision or accuracy or both.

23. The apparatus of claim 17 in which the running of the current versions comprises running the current versions automatically.

24. The apparatus of claim 17 in which the evaluating of the relative performances of the models comprises evaluating the relative performances automatically.

25. The apparatus of claim 17 in which the evaluating of the relative performances of the models comprises a confusion matrix.

26. An apparatus comprising

one or more processors, and a storage containing instructions executable by the one or more processors to perform an iterative process for development of a model of a real-world system, each iteration of which includes:

enabling a subject matter expert to interactively develop a current version of the model based on subjective knowledge of cause and effect relationships,

automatically evaluating the current version of the model by running the version of the model using profiles of input data to generate test outputs,

enabling the subject matter expert to evaluate the test outputs using expert knowledge and provide a revised current version of the model.

27. The apparatus of claim 26 in which evaluating the current version of the model comprises using profiles of input data for which correct outputs are known.

28. The apparatus of claim 26 in which evaluating the current version of the model comprises evaluating the performance of the model in generating correct test outputs.

29. The apparatus of claim 26 in which evaluating the current version of the model comprises evaluating precision or accuracy or both.

30. The apparatus of claim 26 in which enabling the subject matter expert to evaluate the test outputs comprises providing a report of the outputs to the subject matter expert.