EP1131750A1

EP1131750A1 - Modelling tool with controlled capacity

Info

Publication number: EP1131750A1
Application number: EP99972333A
Authority: EP
Inventors: Bernard Alhadef; Philippe Baratier; Marie-Annick Giraud
Original assignee: Sofresud SA
Current assignee: Sofresud SA
Priority date: 1998-11-17
Filing date: 1999-11-16
Publication date: 2001-09-12
Also published as: FR2786002A1; US20020156603A1; FR2786002B1; WO2000029992A1

Abstract

The invention concerns a method for modelling digital data from a data sample comprising means for acquiring input data, means for preparing the input data, means for constructing a learning model on the processed data, means for analysing the resulting model, means for operating the resulting model, characterised in that it consists in controlling by regression the consistency of the standard learning process by adding to the covariance matrix a disturbance in the form of the product of a scalar quantity μ by a matrix H during the model computation.

Description

MODELING TOOL WITH CONTROLLED CAPACITY

The field of the present invention consists of learning and modeling methods. The invention makes it possible to find a model for predicting the evolution of a phenomenon from a set of digital data of any size. It can be produced in the form of specially designed integrated circuits and is then in the form of a specific element operating independently. It can also be performed in software form and be integrated into a computer program. It can in particular be used for processing a digital signal in an electronic circuit. In a more general application, it allows the modeling of non-linear phenomena, the analysis of phenomena thanks to immediately usable formulas and the generation of robust models. The precision allowed by these new methods makes it possible to significantly increase learning speeds. The invention can also be used in the field of risk analysis by insurance companies. These keep in a more or less structured form the characteristics of drivers, their vehicles and claims suffered or caused. From these available elements, we can find out which are the high risk elements.

In the modeling of physical phenomena, the events analyzed generally correspond to the data collected by the various sensors in the measurement chain. We can, for example, determine which combinations of factors are the source of defective products, so anticipate problems and gain productivity.

In the field of flow management, these events will rather correspond to the information collected over time. We can, for example, look for the relationships existing between the flows considered and calendar data, or variables more specific to the application considered, such as meteorological data for electricity consumption or promotion periods for sales analysis, which will make it possible to better manage the stocks and the load of the manufacturing factories.

In the banking sector, the events will represent on the one hand the profile of the customers and on the other hand, a description of the operations. The modeling will highlight, for example, the risk factors linked to individuals and operations.

The learning problem is to find dependencies using a limited number of observations. It is therefore a question of choosing, from a given set of functions f(x, α), α ∈ A, A being a set of parameters, the one which makes it possible to best approach the output.

If L(y, f(x, α)) is a measure of the difference between the actual output y and the output predicted by the model f(x, α), we must therefore minimize the effective risk:

$$R(\alpha) = \int L(y, f(x, \alpha)) \, dF(x, y) \qquad (1)$$

while knowing that the probability distribution F(x, y) is unknown and that the only information available is contained in the k observations $(x_1, y_1), \dots, (x_k, y_k)$ (the training data).

Conventionally, we seek the function which minimizes the empirical risk calculated on the basis of the training data:

$$R_{emp}(\alpha) = \frac{1}{k} \sum_{i=1}^{k} L(y_i, f(x_i, \alpha)) \qquad (2)$$

Then, we postulate that it will be the best approximation of the function which minimizes the effective risk given by (1). The problem is to know to what extent a system built on the principle of minimization of the empirical risk (2) is generalizable, that is to say, makes it possible to minimize the effective risk (1) on data that have not been learned.

Mathematically, a problem is said to be well-posed as soon as it admits a single solution and this solution is stable, that is to say that a slight modification of the initial conditions can modify the form of the solutions only in an infinitesimal way. Problems that do not have these properties are called ill-posed problems.

It frequently happens that the problem of finding f satisfying the equality Af = u is ill-posed: even if there is a unique solution to this equation, a small variation of the right-hand side can cause large variations in the solution.

As soon as the right-hand side is not exact ($u_\varepsilon$ instead of $u$, with $\|u - u_\varepsilon\| \le \varepsilon$), the functions which minimize the empirical risk $R(f) = \|Af - u_\varepsilon\|^2$ are not necessarily good approximations of the sought solution, even when ε tends to 0.

An improvement in the search for solutions consists in minimizing another functional, called regularized, of the form:

$$R_{reg}(f) = R(f) + \lambda(\varepsilon)\,\Omega(f) \qquad (3)$$

where:

- Ω(f) is a functional which belongs to a special type of operators called regularizers,

- λ(ε) is a well-chosen constant which depends on the noise level existing on the data.

We then obtain a series of solutions which converges towards the right solution when ε tends to 0. By minimizing the regularized risk rather than the empirical risk, one thus obtains from a limited number of observations a solution that can be generalized to all cases. The introduction of the regularizing term provides a unique solution to an ill-posed problem. This solution may be slightly less faithful than the conventional solution, but it has the fundamental property of being stable, thereby resulting in greater robustness of the results.

The methods for solving ill-posed problems show that there are other inductive principles which make it possible to obtain a better capacity for regularization than the principle consisting in minimizing the error made on the learning set.

Therefore, the main objective of theoretical analysis is to find the principles that make it possible to control the generalization capacity of learning systems and to build algorithms which implement these principles.

Vapnik's theory is the tool that makes this possible: necessary and sufficient conditions for a learning process based on the principle of minimization of the empirical error to be generalizable have been established, leading to a new inductive principle called the principle of minimization of structural risk.

We can show that the effective risk verifies an inequality of the form:

$$R(\alpha) < R_{emp}(\alpha) + F(h, k) \qquad (4)$$

where:

- h is the Vapnik-Chervonenkis dimension of the space of functions f(x, α) among which the solution is sought,

- k is the number of observations available to build the model,

- F is an increasing function of h and a decreasing function of k.

We immediately see that, as the number k of the available observations is finite, the fact of minimizing the empirical error cannot be enough to minimize the effective error. The general idea of the principle of minimizing structural risk is to take into account the two terms of the right-hand side of (4), rather than just the empirical risk. This implies constraining the structure of the set of functions f(x, α) among which the solution is sought, so as to limit or even control the parameter h.

Following this principle, the development of new algorithms to control the robustness of learning systems becomes possible.

The invention relates to a new modeling technology for very general use, the essential characteristics of which relate to the efficiency of the method, the simplicity of the models obtained, and their robustness, that is to say their performance on data not used for learning. The implementation of this technique in a computer, electronic or mechanical system equipped with sensors and model operating functions makes it possible to design a tool capable of adapting to and controlling an environment in which complex and changing phenomena exist, and where the sensors only partially account for all of the phenomena involved. Furthermore, the extreme simplicity of the models obtained provides the user of the tool with an intuitive understanding of the phenomena he seeks to control.

The invention uses both conventional techniques, such as the calculation of covariance matrices, and more recent theories, such as those of statistical regularization and the consistency of learning processes. The invention consists in that the covariance matrices are not used as such, but according to a new method which consists on the one hand of perturbing the covariance matrix in a certain way, and on the other hand of adjusting the level of the added noise in another way. We describe here mathematically how to add noise to the data and control it, but it is possible to perform these operations electronically or mechanically.

The invention consists of a method of modeling digital data from a sample of data comprising means for acquiring the input data, means for preparing the input data, means for building a model by learning on the processed data, means for analyzing the performance of the model obtained, and means for exploiting the model obtained, characterized in that the consistency of the conventional regression learning process is controlled by adding to the covariance matrix a perturbation in the form of a matrix H depending on a vector of k parameters Λ = (λ₁, λ₂, ..., λ_k), or in the form of the product of a scalar λ by a matrix H, during the calculation of the model. The matrix H can be such that H(p+1, p+1) is different from at least one of the terms H(i, i) for i between 1 and p.

In what follows, we consider that two numbers are close when their relative deviation does not exceed 10%. Advantageously, the matrix H satisfies the following conditions: H(i, i) is close to 1 for i between 1 and p, H(p+1, p+1) is close to 0, and H(i, j) is close to 0 for i different from j. In a variant, the matrix H satisfies the following conditions: H(i, i) is close to a variable a for i between 1 and p, H(p+1, p+1) is close to a variable b, and H(i, j) is close to a variable c for i different from j, with a = b + c.

In an advantageous variant, the matrix H satisfies the following additional conditions: a is close to 1 - 1/p, b is close to 1, c is close to -1/p, where p is the number of variables in the model.
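The two families of matrix H described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the "close to" tolerances are ignored (exact values are used), and indices are 0-based in the code whereas the text counts from 1.

```python
import numpy as np

def build_h_diagonal(p):
    """H with 1 on the first p diagonal entries, 0 at (p+1, p+1), 0 elsewhere."""
    h = np.eye(p + 1)
    h[p, p] = 0.0  # entry associated with the constant column
    return h

def build_h_abc(p, a, b, c):
    """H with a on the first p diagonal entries, b at (p+1, p+1), c off-diagonal."""
    h = np.full((p + 1, p + 1), float(c))
    idx = np.arange(p)
    h[idx, idx] = a
    h[p, p] = b
    return h

def build_h_advantageous(p):
    """Variant a = 1 - 1/p, b = 1, c = -1/p, which satisfies a = b + c."""
    return build_h_abc(p, 1.0 - 1.0 / p, 1.0, -1.0 / p)
```

Both constructions yield a symmetric (p+1) x (p+1) matrix, the size of the covariance-type matrix to which the perturbation is added.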

Preferably, an automatic module for adjusting the parameter λ is added. This module can be produced by the integration of a module for separating the training data into two preferably disjoint subsets: one serving as a learning base for the modeling process, the other serving to adjust the value of the parameter λ according to a criterion of validity of the model obtained on data which did not participate in the learning. The adjustment module can also be used to adjust the vector of parameters Λ. In both cases, this module can be automated, either by acting directly on the parameter(s), or through a coding function (exponential, logarithm or others).

The basic data separation module can be carried out by external software of spreadsheet or database type and can perform a purely random sorting in two subsets or a random sorting in two subsets, while respecting the representativeness of the input vectors in the two subsets.

Advantageously, the data are prepared by a statistical normalization of the columns of data, by a reconstruction of the missing data or by detection and possible correction of outliers.

This preparation can be carried out by a univariate or multivariate polynomial development relating to all or part of the inputs, by a trigonometric development of the inputs, or by an explanatory development of the date-type inputs.

A preferential variant consists in using a change of reference, resulting from an analysis in principal components with possible simplification or one or more forward or backward time offsets of all or part of the columns containing time variables.

Advantageously, a preparation explorer is added, which is based on a description of the preparations possible by the user and on an exploration strategy based either on a pure performance criterion in learning or in generalization, or on a compromise between these performance and capacity of the learning process obtained. In a variant, an operating module is added to the modeling process generating polynomial formulas mono or multivariate descriptive of the phenomenon, trigonometric formulas descriptive of the phenomenon, or formulas descriptive of the phenomenon containing expansions of dates into calendar indicators.

The general block diagram of the invention is given in FIG. 1. It includes all or part of the following elements:

- a data acquisition module (1);

- a data preparation module (2);

- a modeling module (3);

- a performance analysis module (4);

- an optimization module (5);

- a module for exploring the preparations (6);

- an operating module (7).

The purpose of the data acquisition module (1) is to collect all the information necessary for developing the models. The collection is done through acquisition configuration information, which is transmitted by an operator, either once and for all during the design of the system, or dynamically according to new needs identified during its operation. Data can be collected through physical measurement sensors, from databases through queries, or both. By configuring the acquisition, the operator defines a modeling problem to be treated with the tool. On request, this module produces a raw history of the phenomenon, characterized by a table comprising in columns the magnitudes characteristic of the phenomena (for example from sensors), and in rows the events, each corresponding to an observation of the phenomenon. This historical table can be supplemented by a description of the data including information which can be useful for modeling, then for the exploitation of the models. The description typically contains the following information:

- column name;

- reference of the associated sensor;

- nature of the data (boolean, integer, numeric indicator, date, region, ...).

The data preparation module (2), also called the data processing module, makes it possible to refine the characteristics of the raw data resulting from the acquisition. From the historical table, and from the description of the data, this module creates a more complex table, where each column is obtained from a processing operating on one or more of the columns of the historical table. The treatments on a column can be in particular:

- a transformation of the column by a conventional function (log, exp, sin, ...), each element of the column being replaced by its image by the chosen function;

- a univariate polynomial development of order K, generating K columns from an input column x, corresponding to the variables x, x², ..., x^K;

- a spectral development of period T and of order K, generating 2K columns from an input column x, the first K columns being equal to cos(2πix/T) (i being between 1 and K), and the last K being equal to sin(2πix/T) (i being between 1 and K);

- a development in calendar indicators, generating, for a date input column, a list of finer indicators of events associated with this date (annual, weekly and monthly trigonometric developments; Boolean indicators for the day of the week, public holidays, bridge days, the day before and the day after a bridge day or a public holiday; vacation indicators; start and end of vacations specific to each region; ...).
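The univariate polynomial and spectral developments listed above can be sketched as follows. The function names are hypothetical, and the order of the spectral development is taken to be K, as the count of 2K generated columns suggests.

```python
import numpy as np

def polynomial_columns(x, K):
    """Return the K columns x, x**2, ..., x**K for an input column x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** i for i in range(1, K + 1)])

def spectral_columns(x, T, K):
    """Return 2K columns: cos(2*pi*i*x/T) for i = 1..K, then sin(2*pi*i*x/T)."""
    x = np.asarray(x, dtype=float)
    orders = np.arange(1, K + 1)
    angles = 2.0 * np.pi * np.outer(x, orders) / T  # shape (len(x), K)
    return np.hstack([np.cos(angles), np.sin(angles)])
```

For a daily series with a weekly pattern, one would for instance call `spectral_columns(day_index, T=7, K=3)`, where `day_index` is an assumed integer day counter.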

The data preparation module can also act on several columns or several groups of columns. It can in particular carry out the following constructions:

- from a date column and a region column, the preparer can develop meteorological indicators (wind, precipitation, humidity, etc.) for the same day or for adjacent days. This operation is carried out from a meteorological database;

- from two groups of columns G1 and G2, the preparer can create a new group of columns G3 comprising the cross products between all the columns of the two groups;

- from a group of columns G, comprising p variables $x_1, x_2, \dots, x_p$, the preparer can generate all the polynomial terms of degree less than or equal to K, that is, a group of columns each comprising a term of the type $x_1^{K_1} x_2^{K_2} \cdots x_p^{K_p}$ with $K_1 + \dots + K_p \le K$, all $K_i$ being between 0 and K.
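The multivariate polynomial construction can be sketched by enumerating the exponent tuples whose sum does not exceed K. The function name is hypothetical, and the constant term (all exponents zero) is left out here on the assumption that the constant column is added later by the modeling module.

```python
import itertools
import numpy as np

def multivariate_polynomial_columns(X, K):
    """Generate all monomials of total degree 1..K over the columns of X.

    Returns the new columns and the list of exponent tuples (K1, ..., Kp).
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    exponents = [e for e in itertools.product(range(K + 1), repeat=p)
                 if 1 <= sum(e) <= K]
    # Each monomial is the row-wise product of the columns raised to their exponents.
    columns = [np.prod(X ** np.array(e), axis=1) for e in exponents]
    return np.column_stack(columns), exponents
```

The number of generated columns grows quickly with p and K, which is why the text stresses that the modeling module must cope with a very large number of explanatory columns.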

The data preparation module can also perform operations on the rows, in particular:

- centering, which subtracts from each element of a column the average obtained on its column;

- reduction, which divides each element of a column by the standard deviation of its column;

- statistical normalization, which chains the two previous operations.
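Centering followed by reduction can be sketched as below. The function name is hypothetical; the population standard deviation is used and constant columns are left at zero rather than divided by zero, both of which are assumptions of this sketch.

```python
import numpy as np

def normalize_columns(X):
    """Center each column on its mean, then divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    safe_std = np.where(std > 0, std, 1.0)  # avoid dividing constant columns by 0
    return (X - mean) / safe_std, mean, std
```

The returned `mean` and `std` would be kept so that new inputs can be normalized in the same way when the model is used.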

The data preparation module can also carry out global operations so as in particular to reduce the dimension of the problem:

- the elimination of a column if its standard deviation is zero;

- the elimination of a column whose correlation with a previous column is greater than a threshold;

- the elimination of a column whose correlation with the output is less than a threshold;

- carrying out an analysis in principal components, which leads to a change of reference by privileging the principal axes of representation of the phenomenon, and the possible elimination of the non-significant columns.
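The three column-elimination rules above can be sketched as a single pass over the columns. The function name and the threshold values are hypothetical, correlations are taken in absolute value, and "a previous column" is read here as "a previously retained column"; the principal component analysis is not covered.

```python
import numpy as np

def select_columns(X, y, max_pair_corr=0.95, min_output_corr=0.05):
    """Return the indices of the columns kept after the three elimination rules."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    kept = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() == 0:
            continue  # constant column
        if any(abs(np.corrcoef(col, X[:, i])[0, 1]) > max_pair_corr for i in kept):
            continue  # redundant with a column already kept
        if abs(np.corrcoef(col, y)[0, 1]) < min_output_corr:
            continue  # too weakly correlated with the output
        kept.append(j)
    return kept
```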

The data preparation module also makes it possible to define the treatment of missing values. By default, a sample containing one or more missing values will be ignored. However, the user can replace the missing value according to several criteria:

- average of the value on the column;

- average of the value over a subset of the column;

- most frequent value (for boolean or enumerated data);

- choice of a fixed replacement value;

- estimation of this value based on modeling according to the other variables.

Another way of dealing with missing values is to consider them as a particular state of the variable, which one can for example take into account by creating an additional Boolean column indicating whether the value is present or not.
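Two of the treatments above, replacement by the column average and the additional Boolean presence column, can be sketched together. The function name is hypothetical and missing values are assumed to be encoded as NaN.

```python
import numpy as np

def fill_missing_with_mean(column):
    """Replace NaN entries by the mean of the observed entries.

    Also returns a Boolean column that is True where the value was present.
    """
    column = np.asarray(column, dtype=float)
    present = ~np.isnan(column)
    filled = np.where(present, column, column[present].mean())
    return filled, present
```

If a column has no observed value at all, the mean is undefined; such a column would have to be handled separately.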

The data preparation module also makes it possible to detect and process suspicious values. Detection is based on the following criteria:

- data outside a range defined by the operator;

- data outside a range calculated by the system (for example, a range centered on the mean with a width of K times the standard deviation, analysis of the extreme percentiles, ...);

- for Boolean or enumerated data, values whose number of occurrences is less than a given threshold.

Samples containing one or more suspicious values can be treated according to the same methods as those proposed for the missing values.
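The range-based detection criteria can be sketched as below. The function name is hypothetical, and the system-calculated range is interpreted here as the mean plus or minus K standard deviations, which is one possible reading of the text.

```python
import numpy as np

def flag_suspicious(column, K=3.0, low=None, high=None):
    """Flag values outside mean +/- K*std, or outside an operator range [low, high]."""
    column = np.asarray(column, dtype=float)
    mean, std = column.mean(), column.std()
    suspicious = np.abs(column - mean) > K * std
    if low is not None:
        suspicious |= column < low
    if high is not None:
        suspicious |= column > high
    return suspicious
```

The flagged entries could then be set to NaN and passed to whatever missing-value treatment has been chosen.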

For time variables of type X(t), the preparation module also makes it possible to automatically generate the columns corresponding to the variable X taken at different anterior or posterior instants. Thus, the variable X(t) is replaced by a group of variables: {X(t-k·dt), ..., X(t-dt), X(t), X(t+dt), ..., X(t+n·dt)}
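The generation of time-shifted columns can be sketched as follows for a regularly sampled series. The function name is hypothetical, and rows for which the full window is not available are simply dropped, which is an assumption of this sketch.

```python
import numpy as np

def lagged_columns(x, k, n):
    """Build the columns X(t-k*dt), ..., X(t), ..., X(t+n*dt) from a series x.

    Only the instants t with a complete window are kept as rows.
    """
    x = np.asarray(x, dtype=float)
    rows = np.arange(k, len(x) - n)   # instants t having k past and n future samples
    offsets = np.arange(-k, n + 1)    # -k, ..., 0, ..., +n
    return x[rows[:, None] + offsets[None, :]]
```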

The data preparation module offers all these possibilities individually but also allows the user to combine these treatments thanks to a suitable command language. All of these data preparation possibilities are also accessible in the preparation exploration module. The preparation process preferably ends with a statistical normalization operation.

The modeling module (3) constitutes the heart of the invention. With its new technology, it can process a large number of input columns while controlling the validity and robustness of the model. It is therefore perfectly suited to the data preparer described above, which is likely to generate a very large number of explanatory columns.

The modeling module uses a history of data after preparation. It can be used on all of these data, but gives its full performance when it is only used on part (of the rows) of this data, this part being defined by the optimization module.

The modeling module proceeds as follows:

- the table of input data after preparation constitutes a matrix called [X], and the outputs corresponding to these inputs constitute a column vector [Y];

- we build a matrix [Z] from the matrix [X] by completing it on the right with a column of 1s;

- the model vector [w] is obtained by the following formula:

$$[w] = \left( [Z]^{T} [Z] + \lambda [H] \right)^{-1} \left( [Z]^{T} [Y] \right)$$

where [H] is a particular matrix allowing fast calculations, and where λ is a scalar;

- the output y* of the model for an input vector $[x] = (x_1, \dots, x_p)$ is obtained by appending a constant equal to 1 to the vector [x], to obtain the vector $[z] = (x_1, \dots, x_p, 1)$, then by performing the dot product between the vector [w] and the vector [z], that is $y^{*} = w_1 x_1 + \dots + w_p x_p + w_{p+1}$.
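The calculation of the model vector and of the model output can be sketched with NumPy as below. The function names are hypothetical; when no matrix H is supplied, the sketch assumes the variant with 1 on the first p diagonal entries and 0 for the constant term. A linear solve is used instead of an explicit matrix inverse, which gives the same [w] when the matrix is invertible.

```python
import numpy as np

def fit_model(X, Y, lam, H=None):
    """Compute w = (Z^T Z + lam * H)^(-1) (Z^T Y), with Z = [X, 1]."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    k, p = X.shape
    Z = np.hstack([X, np.ones((k, 1))])  # append the column of 1s
    if H is None:
        H = np.eye(p + 1)
        H[p, p] = 0.0  # assumed default: the constant term is not perturbed
    return np.linalg.solve(Z.T @ Z + lam * H, Z.T @ Y)

def predict(w, X):
    """Model output y* = w_1 x_1 + ... + w_p x_p + w_{p+1} for each row of X."""
    X = np.asarray(X, dtype=float)
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    return Z @ w
```

With λ = 0 this reduces to ordinary least squares; increasing λ shrinks the coefficients attached to the perturbed diagonal entries.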

The analysis module (4) assesses the performance of the model according to certain criteria, the performance being evaluated either on the learning history, that is to say on the data having been used for the calculation of the matrix [X], or on data which did not take part in learning (the so-called "generalization" history). The performances are evaluated by comparing, on the designated history, the vector [y], corresponding to the actual value of the output, to the vector [y*], corresponding to the value of the output obtained by application of the model. The comparison can be made with conventional statistical indicators of error with or without screening.

The analysis module also makes it possible to sort historical data either in rows or in columns. The sorting criterion in rows relates to the modeling error. This criterion makes it possible to separate individuals conforming to the model from nonconforming individuals. Nonconforming individuals may be due to anomalies encountered with the sensors, but they may also reveal abnormal or original behavior, information which can prove to be very valuable depending on the context. The sorting in columns is carried out according to the model vector [w]. This makes it possible to order the factors influencing the phenomenon according to their positive or negative contribution to the phenomenon.

The object of the optimization module (5) is to adjust the parameter λ. To do this, it separates the historical data into two parts, one serving as a learning base for the modeling module, and the other serving to analyze the performance of the model on unlearned data. The optimization module automatically activates the modeling module by varying the parameter λ so as to obtain optimum performance on unlearned data. The very construction of the model and of the perturbation matrix [H] gives the scalar λ particular properties, in particular that of acting on the effective capacity of the learning structure.

The optimization criterion can be chosen by the operator from among all the possibilities offered by the analysis module.

Data separation can be done directly by the operator, but can also be supported by the system in different ways.
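The adjustment of λ on unlearned data can be sketched as a search over a grid of candidate values. Everything here is illustrative: the function names, the random 70/30 separation, the mean squared error criterion and the choice of H (1 on the first p diagonal entries, 0 for the constant term) are assumptions of this sketch.

```python
import numpy as np

def fit_model(X, Y, lam):
    """w = (Z^T Z + lam * H)^(-1) Z^T Y with an assumed diagonal H."""
    k, p = X.shape
    Z = np.hstack([X, np.ones((k, 1))])
    H = np.eye(p + 1)
    H[p, p] = 0.0
    return np.linalg.solve(Z.T @ Z + lam * H, Z.T @ Y)

def select_lambda(X, Y, lambdas, learn_fraction=0.7, seed=0):
    """Return the candidate lambda with the lowest error on unlearned data."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    k = X.shape[0]
    order = np.random.default_rng(seed).permutation(k)  # purely random separation
    n_learn = int(round(learn_fraction * k))
    learn, unseen = order[:n_learn], order[n_learn:]
    Z_unseen = np.hstack([X[unseen], np.ones((len(unseen), 1))])
    errors = []
    for lam in lambdas:
        w = fit_model(X[learn], Y[learn], lam)
        errors.append(float(np.mean((Z_unseen @ w - Y[unseen]) ** 2)))
    return lambdas[int(np.argmin(errors))], errors
```

A grid such as `[10.0 ** e for e in range(-4, 5)]` corresponds to acting on λ through an exponential coding function rather than directly.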

The module for exploring the preparations (6) constitutes the second level for adjusting the capacity of the learning structure. This module links the models (with or without optimization of the scalar λ) by changing the data preparation at each step. This module uses a description of the possible preparations provided by the user. This description defines in an orderly manner columns, groups of columns, and preparations operating on these columns or groups of columns. For example, the description of the possible preparations can define among the variables of the basic history:

- a possible polynomial development of column 1, from degree 1 at least to 5 at most;

- a possible trigonometric development of column 2, from degree 1 at least to 7 at most;

- a possible multivariable polynomial development on columns 4 to 8, of degree 1 at least and 3 at most;

- all or part of the other columns without specific treatment.

This description makes it possible to formalize the knowledge of the user in relation to the phenomenon to be modeled. The preparation explorer then relieves the user of the tedious tasks of exploring possible preparations, by performing data preparation, modeling, performance analysis, and recording the references of the test and the results obtained.

This exploration is done through parameters left free by the description completed by the user. The explorer can activate different methods to achieve this function. Among these methods, the simplest is the systematic exploration of all possible combinations, in the parameters left to the operator. However, this method can prove to be very costly in computation time, since the number of computations increases exponentially with the number of parameters.

Another method consists in making random draws in the possible parameters, then in sorting the results so as to approach the most interesting areas.

A third method consists in implementing a control of the capacity of the second level learning process. We use for this the fact that for each type of development (polynomial, trigonometric, ...), the capacity of the learning process increases with the parameter (degree of development). The method starts from a minimum preparation (all parameters are at their minimum), then it considers all possible preparations by incrementing only one of the parameters. On each of the preparations obtained, the method launches a modeling and chooses from among all the models obtained, the one which has the best performance according to a certain criterion.

Depending on the objective pursued by the user, this criterion can be:

- a minimum of error with or without screening, on data not learned;

- the ratio between one of the previous criteria and the capacity of the learning structure after preparation (this capacity can also be approximated by known formulas);

- the ratio between the increase in one of the previous criteria and the increase in the capacity of the learning structure;

- a function increasing with an error criterion as described above, and decreasing with the capacity of the learning structure.
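The third exploration method, which starts from the minimum preparation and increments one parameter at a time, can be sketched as a greedy search. The function name, the dictionary representation of the preparation parameters and the stopping rule (stop when no single increment improves the score) are assumptions; `evaluate` stands for any user-supplied function that prepares the data, fits a model and returns a criterion to minimize.

```python
def explore_preparations(min_params, max_params, evaluate):
    """Greedy exploration: increment one preparation parameter per step.

    min_params / max_params map a parameter name to its bounds (degrees).
    evaluate(params) returns a score where lower is better.
    """
    current = dict(min_params)
    best_score = evaluate(current)
    history = [(dict(current), best_score)]
    while True:
        candidates = []
        for name in current:
            if current[name] < max_params[name]:
                trial = dict(current)
                trial[name] += 1  # increment only this parameter
                candidates.append((evaluate(trial), trial))
        if not candidates:
            break  # every parameter is at its maximum
        score, trial = min(candidates, key=lambda item: item[0])
        if score >= best_score:
            break  # no single increment improves the criterion
        current, best_score = trial, score
        history.append((dict(current), best_score))
    return current, best_score, history
```

Compared with the systematic exploration of all combinations, each step here costs one modeling run per free parameter.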

The operating module (7) allows the tool to transmit the results of the modeling to a user or to a host system. In a simple version, it can calculate the output of the model evaluated on unlearned data by giving indications as to the reliability of the estimate. In a more elaborate version, the operation can transmit the elaborate model, its preparation and its performance to a host system. In an even more elaborate version, the tool is fully controllable by the host system, like an industrial process control system for example, by giving it unprecedented possibilities in terms of adaptability to a complex and changing environment.

It is also possible that the basic data separation module performs a sequential sorting, for example: 70% in learning; 20% in generalization; 10% in test, or, in another variant, a first sequential sorting into two subsets (the first subset comprising the learning and generalization data, the second subset comprising the test data). The data separation module then performs a random sorting on the first subset to separate the learning and generalization subsets.
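Both separation variants can be sketched on row indices as follows. The function names are hypothetical, and the 70/20/10 proportions are the example values from the text.

```python
import numpy as np

def sequential_split(n, learn=0.7, generalization=0.2):
    """Sequential sorting: first rows for learning, next for generalization, rest for test."""
    n_learn = int(round(n * learn))
    n_gen = int(round(n * generalization))
    idx = np.arange(n)
    return idx[:n_learn], idx[n_learn:n_learn + n_gen], idx[n_learn + n_gen:]

def sequential_then_random_split(n, learn=0.7, generalization=0.2, seed=0):
    """Sequential cut isolating the test rows, then random learning/generalization split."""
    n_first = int(round(n * (learn + generalization)))
    n_learn = int(round(n * learn))
    idx = np.arange(n)
    shuffled = np.random.default_rng(seed).permutation(idx[:n_first])
    return shuffled[:n_learn], shuffled[n_learn:], idx[n_first:]
```

In the second variant the test rows remain the most recent ones, while learning and generalization rows are drawn at random from the earlier part of the history.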

The basic data separation module can also perform a selection-type sorting of one or more samples according to a law programmed in advance (for example: every N-th sample) for the generation of the learning, generalization and/or test subsets.

We can also prepare missing, aberrant or exceptional data in one or more groups so as to group them in the same category and apply a particular treatment to them (for example: a weighting, a “false alarm” category, ...).

In a variant, each explanatory variable is evaluated for its explanatory power (or discriminating power) in relation to the phenomenon studied. This process makes it possible, on the one hand, to select the predominant variables from a list and to eliminate the second-order variables, and on the other hand to explain the phenomenon studied. The preparation of the data can be carried out by segmentation algorithms which can, for example, be of the “decision tree” or “support vector machine” type.

Preferably, each state of a “nominal” variable (for example the postal code or “APE” code) is associated with a table of values translating its meaning with respect to the phenomenon studied (for example: number of inhabitants of the commune, standard of living in the commune, average age of the inhabitants of the commune, ...). We can then code the nominal variables in the form of an array of Boolean or real variables.

In flow modeling applications, we transform the temporal data (dates) by applying the carry-over rules resulting from the knowledge of the phenomena studied. For example, for a financial flow, when a day is a holiday, the associated amounts are carried over according to a business rule, partly to the preceding days and partly to the following days, with weighting coefficients. We can also process the flows (for example financial exchanges) by identifying the periodic deadlines (for example monthly deadlines) and by applying to them postponement rules specific to each deadline (for example: if the day of the deadline is a holiday, postpone the transactions to the following day, ...).

A post-processing function (which can be derived from the coefficient λ) allowing the robustness (or precision) of the generated model on new unlearned data to be calculated can be applied to the result.

When the database has only a few characteristic elements of a phenomenon to be modeled, the learning, generalization and forecasting spaces may not be disjoint (for example: use of data belonging to the “learning” subset to generate the “generalization” or “forecast” subsets).

The prepared data can be shared between different uses of the data modeling method according to the invention.

One manages the whole of the data in a specific environment while guaranteeing the availability of information by using for example a file system, a database or a specific tool. Multiple users can be guaranteed simultaneous access to the data. To this end, we define a relational structure based on variables, phenomena, and models to store and manage all the basic data, descriptive formulas of the phenomenon.

Claims

1 - Method for modeling digital data from a data sample comprising a step of acquiring the input data (1), a step of preparing the input data (2), a step of constructing a model by learning on the processed data (3), a step of analyzing the performance of the model obtained (4), and a step of exploiting the model obtained (7), characterized in that the consistency of the classical regression learning process is controlled by adding to the covariance matrix a disturbance in the form of a matrix H depending on a vector of k parameters Λ = (λ₁, λ₂, ..., λ_k) or in the form of the product of a scalar λ by a matrix H during the calculation of the model.

2 - Data modeling method according to the main claim characterized in that the matrix H satisfies the following conditions: H(i,i) is close to 1 for i between 1 and p, H(p+1,p+1) is close to 0 and H(i,j) is close to 0 for i different from j.

3 - Data modeling method according to the main claim characterized in that the matrix H satisfies the following conditions: H(i,i) is close to a for i between 1 and p, H(p+1,p+1) is close to b, H(i,j) is close to c for i different from j and a = b + c.

4 - Data modeling method according to claim 3 characterized in that the matrix H satisfies the following additional conditions: a is close to 1 - 1/p, b is close to 1, c is close to -1/p, where p is the number of variables in the model.

5 - Data modeling method according to the main claim characterized in that the matrix H satisfies the following condition: H(p+1,p+1) is different from at least one of the terms H(i,i) for i between 1 and p.

6 - Data modeling method according to any one of the preceding claims characterized in that an additional step (5) of adjusting either the scalar λ or the parameter vector Λ is carried out, this step being able to be automated, either by acting directly on the parameter(s) or through a coding function (exponential, logarithmic or other).

7 - Data modeling method according to claim 6 characterized in that the step of adjusting the scalar λ or the vector of parameters Λ is carried out by the integration of a module for separating the learning data into two preferably disjoint subsets: one serving as a learning basis for the modeling method according to claim 1, the other serving to adjust the value of the parameter λ or the vector Λ according to a criterion of validity of the model obtained on data not having participated in the learning.

8 - Data modeling method according to claim 6 or 7 characterized in that the step of separating the basic data is carried out by an operator using, for example, external software of the spreadsheet or database type, or specific tools.

9 - Data modeling method according to any one of claims 6 to 8 characterized in that the step of separating the basic data performs a purely random sorting into two subsets.

10 - Data modeling method according to any one of claims 6 to 8 characterized in that the step of separating the basic data performs a random sorting into two subsets, while respecting the representativeness of the input vectors in both subsets.

11 - Data modeling method according to any one of claims 6 to 8 characterized in that the basic data separation module carries out sequential sorting.

12 - Data modeling method according to any one of claims 6 to 8 characterized in that the basic data separation module carries out a first sorting into two subsets, the first subset comprising the learning and generalization data, the second subset comprising the test data.

13 - Data modeling method according to any one of claims 6 to 8 characterized in that the basic data separation module carries out a choice type sorting of at least one sample according to a law programmed in advance for the generation of training, generalization and/or test subsets.

14 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by statistical normalization of the data columns.

15 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by reconstituting the missing data.

16 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by detection and possible correction of aberrant values.

17 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by a single or multi-variable polynomial expansion covering all or part of the inputs.

18 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by periodic development of the inputs.

19 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by an explanatory development of the date-type inputs.

20 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by a change of reference, resulting from a principal component analysis with possible simplification.

21 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by one or more forward or backward temporal shifts of all or part of the columns containing temporal variables.

22 - Data modeling method according to any one of the preceding claims characterized in that a preparation explorer (6) is added, which is based on a description by the user of the possible preparations and on an exploration strategy based either on a criterion of pure performance in learning or in generalization, or on a compromise between this performance and the capacity of the learning process obtained.

23 - Data modeling method according to any one of the preceding claims characterized in that an operating module (7) generating single- or multi-variable polynomial formulas descriptive of the phenomenon is added to the modeling method.

24 - Data modeling method according to any one of the preceding claims characterized in that an operating module (7) generating periodic formulas descriptive of the phenomenon is added to the modeling method.

25 - Data modeling method according to any one of the preceding claims characterized in that an operating module (7) generating descriptive formulas of the phenomenon containing developments of dates into calendar indicators is added to the modeling method.

26 - Data modeling method according to claim 18 characterized in that the periodic development is a trigonometric development.

27 - Data modeling method according to claim 24 characterized in that the periodic formulas descriptive of the phenomenon are trigonometric.

28 - Data modeling method according to any one of the preceding claims characterized in that the 'nominal' type data is prepared in order to reduce the number of distinct states by carrying out one or more of the following actions:

- calculate the quantity of information provided by each state;

- group together the homogeneous states with respect to the phenomenon studied;

- create a specific state bringing together all the elementary states that do not provide significant information on the phenomenon.

29 - Data modeling method according to any one of the preceding claims characterized in that missing, aberrant or exceptional data is grouped into one or more groups to apply specific treatments to them.

30 - Data modeling method according to any one of the preceding claims characterized in that the nominal variables are coded in the form of a table of Boolean or real variables.

31 - Data modeling method according to any one of the preceding claims characterized in that, for each input variable, its explanatory power in relation to the phenomenon studied is calculated.

32 - Data modeling method according to any one of the preceding claims characterized in that the data is prepared by segmentation algorithms which can, for example, be of the 'decision tree' or 'support vector machine' type.

33 - Data modeling method according to any one of the preceding claims characterized in that each state of a 'nominal' variable is associated with a table of values reflecting its meaning with regard to the phenomenon studied.

34 - Data modeling method according to any one of the preceding claims characterized in that the data is transformed by applying the carryover rules resulting from the knowledge of the phenomena studied.

35 - Data modeling method according to any one of the preceding claims characterized in that the flows are processed by identifying the periodic deadlines and applying carryover rules specific to each deadline.

36 - Data modeling method according to any one of the preceding claims characterized in that the learning, generalization and prediction spaces may not be disjoint.

37 - Data modeling method according to any one of the preceding claims characterized in that a relational structure based on the variables, phenomena and models is defined to store and manage all of the basic data and the descriptive formulas of the phenomenon.

38 - Device for modeling digital data from a data sample comprising means for acquiring the input data (1), means for preparing the input data (2), means for constructing a model by learning on the processed data (3), means for analyzing the performance of the model obtained (4), and means for exploiting the model obtained (7), characterized in that it comprises means for controlling the consistency of the conventional learning method by regression by adding to the covariance matrix a disturbance in the form of a matrix H depending on a vector of k parameters Λ = (λ₁, λ₂, ..., λ_k), or in the form of the product of a scalar λ by a matrix H, during the calculation of the model.
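By way of non-limiting illustration, and without forming part of the claims, the disturbance of the covariance matrix recited in claims 1 and 2, together with the adjustment of the scalar λ on data not having participated in the learning (claims 6, 7 and 9), could be sketched as follows. This is a minimal sketch under stated assumptions: NumPy is used; the covariance matrix is taken to be ZᵀZ, where Z is the input matrix with a constant column appended in position p+1; H is taken exactly equal to the limiting values of claim 2; the candidate values of λ are coded exponentially as powers of ten; and the validity criterion is the mean squared error. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def make_h(p):
    """Matrix H as in claim 2: 1 on the first p diagonal terms, 0 at (p+1, p+1), 0 elsewhere."""
    h = np.eye(p + 1)
    h[p, p] = 0.0  # the constant term is left undisturbed
    return h


def add_constant(x):
    """Append a constant column, so that index p+1 corresponds to the constant term."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


def fit(x, y, lam, h):
    """Solve (Z^T Z + lam * H) w = Z^T y, Z being x with a constant column."""
    z = add_constant(x)
    return np.linalg.solve(z.T @ z + lam * h, z.T @ y)


def predict(x, w):
    return add_constant(x) @ w


def adjust_lambda(x_learn, y_learn, x_val, y_val, exponents):
    """Try lam = 10**e for each exponent and keep the value with the lowest
    mean squared error on data that did not take part in the learning."""
    h = make_h(x_learn.shape[1])
    best = None
    for e in exponents:
        lam = 10.0 ** e
        w = fit(x_learn, y_learn, lam, h)
        err = float(np.mean((predict(x_val, w) - y_val) ** 2))
        if best is None or err < best[1]:
            best = (lam, err, w)
    return best


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(60, 3))
y = x @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=60)

# Purely random separation into two disjoint subsets (as in claim 9).
idx = rng.permutation(60)
learn, val = idx[:40], idx[40:]

lam, err, w = adjust_lambda(x[learn], y[learn], x[val], y[val], range(-4, 3))
print("selected lambda:", lam)
print("validation error:", err)
print("coefficients (last one is the constant term):", w)
```

With λ = 0 the calculation reduces to ordinary least squares; increasing λ shrinks the first p coefficients while leaving the constant term free, which is one way of reading the condition that H(p+1,p+1) is close to 0.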