WO2001065395A1 - Infinite level meta-learning through compression - Google Patents

Infinite level meta-learning through compression

Info

Publication number
WO2001065395A1
WO2001065395A1 PCT/SE2001/000465
Authority
WO
WIPO (PCT)
Prior art keywords
model
entity
compression
history sequence
history
Prior art date
Application number
PCT/SE2001/000465
Other languages
French (fr)
Inventor
Peter Nordin
Original Assignee
Vill Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vill Ab filed Critical Vill Ab
Priority to AU2001241321A priority Critical patent/AU2001241321A1/en
Publication of WO2001065395A1 publication Critical patent/WO2001065395A1/en

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The present invention relates to a method and a computer program for data compression and/or machine learning, such as data prediction, including meta-learning of how to learn in a closed system with many levels of meta-learning or infinite meta-levels. Compression is used both for the learning of target data and for meta-learning. A history entity is used for history compression to remember a trace of what the learning entity has done so far. The learning entity is provided with feedback by adding random strings to the history entity. Random strings are a negative reinforcement for an entity which is trying to achieve compression. The reinforcement can be used both in an off-line system without an environment (internal reinforcement) and as external reinforcement from an environment.

Description

INFINITE LEVEL META-LEARNING THROUGH COMPRESSION
Field of the invention
The present invention relates to a method and a computer program for data compression and/or machine learning, such as data prediction, including meta-learning of how to learn in a closed system with many levels of meta-learning or infinite meta-levels. Compression is used both for the learning of target data and for meta-learning. A history entity is used for history compression to remember a trace of what the learning entity has done so far. The learning entity is provided with feedback by adding random strings to the history entity. Random strings are a negative reinforcement for an entity which is trying to achieve compression. The reinforcement can be used both in an off-line system without an environment (internal reinforcement) and as external reinforcement from an environment.
Machine learning systems have been produced in the art for the solution of problems such as classification, prediction of time-series data, symbolic regression, optimal control, language understanding, etc. Examples of various machine learning techniques are neural networks, fuzzy networks, genetic algorithms, evolutionary strategies, evolutionary algorithms, decision tree algorithms, evolutionary programming, ADATE, cellular automata, simulated annealing, reinforcement learning and many others.
The invention can also be seen as a very general method for reinforcement learning. Reinforcement learning is a computational approach to learning from interaction with an environment, where a system tries to decide on actions given states in the environment and a numerical reward/reinforcement signal. To obtain high reward, a reinforcement learning system must prefer actions that it has tried in the past and found to be effective in producing reward. In most cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.
Background of the invention
Formally, data compression can be used as a method for prediction according to Solomonoff's induction principles [3]. The probability μ that the string x is continued with y goes to 1 as the length of x, l(x), grows unboundedly, if y can be maximally compressed with respect to x so that Km(xy) - Km(x) is minimized, where Km is the complexity [4, page 333]. Even though Km is not computable, computable compressions can be used as approximations to prediction probabilities. This is another way of formulating Occam's razor: of two possible explanations the simplest one is the most probable, where simple means that it can be expressed with less information [1].
The prefix version of Solomonoff's theory of inductive inference can be used to motivate Occam's razor in general. Assume that we want to predict the next symbol in a sequence of known symbols. The sequence could consist of any type of symbols such as numbers, letters or bits. We will use bits in the example below, without loss of generality, since any symbol can be represented as a sequence of bits. We have the observed training sequence s and our objective is to predict the next bit. (The theory can be expanded to any alphabet size.) Let si be the event that s is followed by symbol i, where i is either 1 or 0. Using Bayes' formula we get:
P(i|s) = P(si) / P(s)    (1)
If P(s0) > P(s1) we predict a 0 as the next symbol, and if P(s0) < P(s1) we predict a 1. Because P(si) = O((1/2)^K(si)) when i is either 1 or 0, the continuation with lower Kolmogorov complexity will (in general) be the most probable [6, page 221].
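This prediction rule can be approximated with any off-the-shelf compressor standing in for the incomputable complexity Km. The following sketch is not part of the original disclosure; it assumes Python's standard zlib module as the compressor, and simply takes the candidate continuation with the smaller compressed length as the more probable one:

    import zlib

    # Hedged illustration: a standard compressor (zlib) approximates Km, and
    # the continuation whose compressed length is smaller is predicted.
    def predict_next_bit(s: bytes) -> int:
        len0 = len(zlib.compress(s + b"0"))
        len1 = len(zlib.compress(s + b"1"))
        return 0 if len0 <= len1 else 1

    s = b"01" * 200              # a bit string with an obvious period of two
    print(predict_next_bit(s))   # expected to print 0, continuing the pattern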
The sequence prediction formulation is also equivalent to finding the shortest program p which, when run, produces the requested sequence as the first part of its output; the prediction is obtained if the program is run beyond the length of the training data sequence. As an example, consider the sequence:
ABCDEABCDEABCDEABCDEABC... (3 )
The sequence could be described as: "Output 'ABCDE' five times, then output 'ABC' and end", but a shorter description would be "Output 'ABCDE' indefinitely". The first description would predict the empty string, while the second description would predict ...DEABCDEABCDE.... According to Occam's razor, and more formally Solomonoff's induction principles, the second continuation is more likely. So a prediction can be obtained by letting a short algorithmic description of the data be executed past the length of the original sequence. In essence, any sequence prediction problem can be formulated as the problem of coding or compressing the sequence with a short program, which upon decompression produces not only the original sequence but also the probable continuation. The shorter the program (higher compression), the more probable the prediction.
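As a sketch of this principle (not from the original text), the shorter, non-terminating description of sequence (3) can be written as a small generator and run past the length of the training data; the overshoot is read off as the prediction:

    # The shorter description of (3): "Output 'ABCDE' indefinitely".
    def short_description(n_chars: int) -> str:
        out = ""
        while len(out) < n_chars:
            out += "ABCDE"
        return out[:n_chars]

    training = "ABCDEABCDEABCDEABCDEABC"          # sequence (3)
    extended = short_description(len(training) + 10)
    assert extended.startswith(training)          # consistent with the data
    print(extended[len(training):])               # -> "DEABCDEABC"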
Furthermore, most machine learning problems can be cast into a sequence prediction problem.
In summary, a machine learning algorithm generally has to solve the problem of finding a short program compressing a data sequence, in order to predict the continuation by executing this program beyond the length of the original sequence. The problem that a learning algorithm should solve is therefore defined with few restrictions, and the question of "how" is of course the important one.
Many machine learning problems can be formulated as the prediction problem of finding a set of short fixed-length continuations y, given an original string x, maximizing the probability μ(y|x) by finding short programs pi which upon execution by a general machine U output "xy...":
U(pi) = xyi...    (4)
The hypotheses represented by the pi are ordered by l(pi), so that the most likely continuation y corresponds to the program with the shortest length l.
We would ideally like to find a function c which, given x, produces the shortest p, but since this would be equivalent to computing the non-computable Kolmogorov complexity K(x) (the length of the shortest such program pk), we must be satisfied with short approximations to pk:
c(x) = p    (5)
If we are satisfied with approximations we can regard c as a primitive recursive function which can be performed by a machine equivalent to a universal Turing machine U. If we further assume that U is an imperative register machine we could try to find a program w for U which will compress x:
U(w, x) = p    (6)
The compression program w will be a sequence of instructions "w = i1 i2 ... in" which, when executed on U, will attempt to compress any x. Since U is imperative, we have at any given time the trace of instructions performed so far as well as any temporary results obtained when trying to compress x. With this we must decide the next ideal action (list of instructions) to perform. This formulation casts the "how" problem into just another sequence prediction problem. The compression algorithm represented by w should constantly attempt to predict an optimal continuation in the sequence of executed instructions. A good program should thereby try to meta-monitor what it has done so far in order to learn the best continuation, and furthermore meta-meta-monitor its learning in order to better learn how to learn. In the ideal situation it should be able to perform meta-self-inspection to arbitrary levels. The approach presented in this application provides a simple program framework with the possibility of achieving at least some of these goals.
The problem of data compression of history information and meta-level learning has been addressed by Jürgen H. Schmidhuber [9] and [10]. However, his approaches are not generally applicable and not very efficient. Further, they are constrained in expressiveness. Still further, they are not well suited for the application domain of data compression.
Summary of the invention
It is therefore an object of the present invention to provide a method, a system and a program to at least alleviate the above-discussed problems.
This object is achieved with the invention as it is defined by the appended claims.
Brief description of the drawings
For exemplifying purposes, the invention will be described in closer detail in the following with reference to embodiments thereof illustrated in the attached drawings, wherein:
Fig 1 is a first schematic flow diagram of an embodiment of the method according to the invention; Fig 2 is a second schematic flow diagram of an embodiment of the method according to the invention; and
Fig 3 is a schematic illustration of the entities and their relationship within the invention.
Description of preferred embodiments
Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The present invention generally relates to the art of computerized learning systems and computerized data compression systems as well as computerized data prediction systems. The invention makes use of the following entities:
• H or h is the history sequence entity, storing information on what the system has done so far and what the outcome (compression) was.
• K is the set of initial instructions/functions/methods/steps. This set can comprise any function/instruction or compression methods, as well as meta-instructions concatenating instructions into new instructions, self-modifying instructions or instructions triggering events on any level of the system.
• p is the currently best model, a "program" that can be run incrementally to produce a sequence and that can execute for a limited time or indefinitely. p is the result of a compression of H or of initialisation to an initial model.
The entities and their relationship within the invention are illustrated schematically in fig 3. In a first embodiment of the method according to the invention, as is illustrated in fig 1, the following steps are performed:
• S1: Initialization with a bootstrap method (such as random selection, enumeration, etc.)
• S2: Write performed actions and their results to the history sequence entity.
• S3: Use the currently best compressed model p to choose the next step(s).
• S4: Execute the next step(s), compressing H.
• S5: Update the currently best model p.
• S6: If a termination criterion is not true, go to S2. Else, print the result produced by the best (shortest) model.
• S7: End
More specifically this can be done in the following way, as is illustrated in fig 2:
• S11: Initialize the p entity. This can be done by a model that selects Ks randomly. The initial p can be any feasible initial model that selects instruction steps by many different methods: arbitrarily, by enumeration or by another selection scheme.
• S12: Run p and let it produce the next Kx.
• S13: Execute Kx on H.
• S14: If the model py resulting from Kx is smaller than the current p, replace the current p with the new model: p ← py.
• S15: Append Kx to the sequence H, and append py to the sequence H.
• S16: If a termination criterion is not true, go to S12 and run p again. If the termination criterion is met, print the result produced by the best (shortest) model.
• S17: End.

Use of the invention in Compression and Machine Learning/Artificial Intelligence
The present invention is a method for machine learning and data compression. Data compression, machine learning and prediction are related tasks. Assume that we have the data entity:
"0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108" (9)
This data entity can be compressed to the computer program:
For i=0 to 108 step 4: print i : next (10)
This computer program produces the same data sequence but with fewer characters.
The program in (10) produces exactly the same sequence, but if we relax the constraints from "the exact sequence" to "a sequence with the same beginning" we can use the same principle for prediction. An example of such a program is:
1 print i: i=i+4 : goto 1 (11)
This program produces a sequence which starts as in (9) and then continues indefinitely:
112, 116, 120, ... indefinitely. (12)
There are good theoretical reasons to see the numbers in (12) as a prediction from (9) . In general underlying models such as (10) and (11) can be ranked and sorted according to their size (length or complexity) . Thus (11) is a slightly more probable model than (10) because (11) is shorter than (10) . It is possible to formulate a generic method for prediction from this principle: If the objective is to predict the continuation of the data series S, then the following method can be devised:
I. Find the shortest computer program P that produces a series that starts with S.
II. Run this program P a few additional increments after S is produced; P will then produce a probable prediction of the continuation of S.
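A minimal sketch of steps (I) and (II) for the data entity (9) is given below. It is illustrative only: a trivially small hypothesis space of counting programs stands in for a real search over programs, and S is taken as a prefix of (9):

    S = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48]   # prefix of (9)

    def run_program(a, d, n):
        """The candidate 'program': emit n terms counting from a in steps of d."""
        return [a + d * i for i in range(n)]

    # Step (I): find a program that reproduces the series S.
    a, d = next((a, d) for a in range(10) for d in range(1, 10)
                if run_program(a, d, len(S)) == S)

    # Step (II): run P a few increments past S to obtain a probable prediction.
    print(run_program(a, d, len(S) + 3)[len(S):])   # -> [52, 56, 60]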
The method above is formulated as a data prediction problem, but most problems within machine learning and artificial intelligence can be cast as a sequence prediction problem; thus (I) and (II) can be seen as a general goal for intelligent algorithms.
The difficulty is in step (I): to find a feasible method to identify P.
The invention is a general and effective method for step (I) and can therefore be used for machine learning, compression, prediction and other tasks requiring intelligent systems.
The method uses the compression principle not only for the data S to be compressed, but on many different levels.
The history entity
A history entity (H) is created as a kind of "trace" of what the algorithm has done so far. The history entity is a sequence of what has been done and what the result was. In the simplest case the algorithm works by combining other possible compression methods. However, the system can also be built from more atomic "instructions" which do not constitute full compression methods.
Assume that we have the compression methods:
K1, K2, K3, K4 (13)
K can also constitute commands for setting meta-parameters of the system or for combining other Ks together. In principle K could be any kind of instruction/function, but here it is exemplified with compression functions. We can start the procedure with an arbitrary (random) sequence of these methods (or with some other simple procedure such as enumeration).
We apply this sequence of compression methods in turn to the data sequence S that should be compressed or whose continuation should be predicted. Call the results of these compression attempts pi.
All this information is then stored in the history entity (H).
Example: H = "K2, K4, K1, p1"
We can repeat this procedure until we have a few examples like this in the history entity:
H = "K2, K4, K1, p1, K1, K3, K2, p2, K5, K1, K1, p3"
When the history entity H has reached a certain size we can use H itself as input to the algorithm. If we have stored the sequence of events (the trace) in H, then a compression of H, that is p, can be used to predict not only the continuation of S but also the most feasible continuation of H itself, where feasible means a sequence of "instructions" Ks that gives maximal future compression/prediction. By compressing H and getting a prediction of the continuation Kx, we are no longer limited to choosing K arbitrarily; instead we can apply the Kx which is most feasible given the experiences in the history entity.
When p is run it produces K2, K4, K1, p1, K1, K3, K2, p2, K5, K1, K1, p3, Kx, Ky, ...
The feasibility of Kx is given by the fact that the compression of the strings p1, p2, p3 can be seen as a fitness measurement of how well the algorithm performs its compression task. According to game theory, the optimal action is the one with the highest product of probability and expected return. This kind of reasoning is built into the invention through the fact that the probability of any given prediction is inversely proportional to the length of the compressed sequence. The length of a sequence is proportional to the logarithm of its probability, and thus the concatenation of strings in the history entity corresponds to a multiplication of probability and expected outcome. By adding two lengths in the entity we perform an addition of logarithms, thus a multiplication, and the total length of the sequence defines how feasible a strategy is for the rest of the execution of the algorithm.

The Kx produced by the method is then executed and the process continues. The point is that the system compresses H repeatedly and recursively on higher and higher levels of abstraction. The algorithm not only learns how to learn in this problem, or how to learn to learn, but how to learn to learn to learn ... up to arbitrary meta-levels. In this way we can produce the most feasible future prediction/compression.
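The mechanism can be sketched as follows. This is a simplified illustration under assumed details: the Ks are taken to be a handful of ordinary string compressors, and "compressing H" to predict its continuation is approximated by ranking candidate continuations by the code length of H under a fixed compressor; the invention itself places no such restrictions on K or on how H is compressed.

    import bz2, lzma, zlib

    # Candidate "instructions" K: each maps the data to a compressed model p.
    K = {"K1": zlib.compress, "K2": bz2.compress, "K3": lzma.compress}

    S = b"ABCD" * 64          # the data whose continuation we care about
    H = b""                   # history entity: trace of actions and results

    def history_cost(h: bytes) -> int:
        """Shorter code length = more regular history = more feasible."""
        return len(zlib.compress(h))

    for step in range(5):
        # Predict the most feasible continuation of H: the candidate K whose
        # addition keeps the history entity most compressible.
        kx = min(K, key=lambda name: history_cost(H + name.encode()))
        p = K[kx](S)                             # execute Kx on S
        H += kx.encode() + b"|" + p + b";"       # append action and result to H
        print(step, kx, len(p), history_cost(H))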
In summary:
• An artificial intelligence problem can be formulated as a sequence prediction problem.
• The most feasible sequence prediction is achieved by optimal compression of the sequence, and the resulting "model" is used to incrementally generate a prediction.
• The invention exploits this principle by not only trying to predict the continuation of the input data sequence, but instead attempting to predict the continuation of a sequence H which consists of concatenations of experiences from compression attempts and the resulting compressions.
• The invention uses this method recursively and iteratively on an H that contains higher and higher levels of experiences and meta-experiences, all folded together in the sequence H.
Fixed number of execution steps
A variant of the method is the case when the prediction/compression is needed in a maximal number of execution steps. The task for the system is then to find the most probable continuation of the series in X steps. This means that a maximal length L is attributed to H. The method then has a related execution flow:
1. Initial steps: Start by selecting and executing an initial sequence of Ks. The initial Ks can be chosen randomly, arbitrarily or by another simple method. Add this sequence to the history entity H, e.g. H = "K2, K4, K1, p1, K1, K3, K2, p2, K5, K1, K1, p3".
2. Use the best model/program p so far and execute it to produce a prediction over all L steps of H. Note the best achieved compression of S.
3. Apply the predicted next step Kx from H in step 2 and go to step 2 if H does not have length L.
4. Print the resulting prediction of S.
5. End
The invention formally
In the following, a formal, general description of an instance of the invention is presented.
A central part of the invention is the history string h, which the system constantly tries to compress. The objective of the system is to compress x as well as possible in order to predict the continuation y. The initial history string h0 is set to the (self-delimited) input string x:
h0 = x    (7)
Let us for the moment assume that the instruction set of U includes some standard compression methods f1, f2, ..., fn such as Huffman coding [7] and Lempel-Ziv [8] (this will not restrict the generality, as seen later). It is to be noted that all compression methods, even simple standard methods, can be used for the prediction. Dictionary-based methods such as Lempel-Ziv can be used to predict sequences such as those in (5), and Huffman coding can predict symbols when fed by a random string, etc. The null-model, which just copies the input to the output, predicts a random letter of the alphabet. All compression methods result in programs or models which can be decoded on the same machine N. There is also a working memory M which will store the output of the compression, f(x) = M, and a sorted vector H storing the best m hypotheses (programs), H = {j1, j2, ..., jm}. Initially H is filled with programs which just copy input to output. A special instruction iM transfers M to be sorted into the H vector. Note that the compression functions f1, f2, ..., fn all include the iM instruction. The invention has a second numeric parameter, which is the number of instructions to be performed before self-inspection, η. The first action by the system is to perform η random instructions ws = i1 i2 ... iη. The system then appends the trace ws and the best model j1: h1 = h0 ws j1. In general h is updated to:
hn+1 = hn wsn+1 j1n    (8)
The best hypothesis j is then executed past the length of h, and the resulting prediction is used to decide which instructions are to be performed next. The same procedure is repeated until a maximal number of iterations is exceeded or until the program terminates itself by suggesting the empty string as the next instructions, ws = ε.
By recursively feeding back the best model j1 and compressing the whole history string, the system will gradually be forced not only to compress well to give a good prediction, but to compress well and at the same time predict an action which it foresees to give the maximum tradeoff between future compression and probability. A specific variant can be defined formally as follows:
Definition 1 Generic Meta Compressor System
Given an input sequence x, it is required to search for and return a prediction y for the most probable continuation of x. The inputs to the system are:
A decoding function: N
A maximum number of iterations: max
The number of instructions before self-inspection: η
A universal machine: U
An initial set of hypotheses returning random predictions: H0
A prediction function using N to run past the length of h: r
An execution function executing the predicted instructions ws: e
1. h0 = x
2. while ws ≠ ε and i < max do
3.   i ← i + 1
4.   wsn+1 ← r(j1n, N, hn, η)
5.   Hn+1 ← e(U, wsn+1, H, hn)
6.   hn+1 ← hn wsn+1 j1n
7. od
8. return: use j1 to output y
The system starts by assigning the input to the original history list (1). It iterates over four steps until it terminates because the maximal number of iterations is exceeded or because the empty string is predicted as the optimal next list of instructions (2). The current best model j1n is executed by r using the decoding function N, and the decoding is executed past the length of the current history string until η new instructions have been produced. The instructions wsn+1 will be used in the next step to create a new model (3). The universal machine U is used to execute the predicted instructions. The output (in M) is checked for consistency with the history list and then ranked, sorted and possibly added to the hypothesis vector H (5). The history list is updated by appending the predicted instructions and the best compressed model (6). Upon termination the best model is used to extract the prediction goal y (7).
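The control flow of Definition 1 can be transcribed almost line by line. The sketch below is structural only: the callables r and e, the machine U and the initial hypotheses H0 are passed in as stand-ins, since the definition leaves their internals open.

    def generic_meta_compressor(x, N, max_iter, eta, U, H0, r, e):
        """Structural transcription of steps 1-8 of Definition 1."""
        h = x                                # 1. h0 = x
        H = list(H0)                         #    hypothesis vector, best model first
        best = H[0]                          #    current best model j1
        ws, i = None, 0
        while ws != "" and i < max_iter:     # 2. while ws != epsilon and i < max do
            i += 1                           # 3. i <- i + 1
            ws = r(best, N, h, eta)          # 4. predict the next eta instructions
            H = e(U, ws, H, h)               # 5. execute them; rank and update H
            best = H[0]
            h = h + ws + best                # 6. append trace and best model to h
        return best                          # 8. use j1 to output the prediction y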
Example
Below we give a simple example to illustrate the principle. Let us assume that we want to predict the next letter in the sequence
x=h=ABCDABCDABCDABCDABCDA... (9)
Further assume that the instruction set of U contains a compression function fd representing dictionary compression with a two-letter window size. In this simple case this means that it can replace a two-letter sequence with a single letter. If we assume that the first random ws contains fd and that we are working with a production-rule decoder Np, we get compressed output from fd:
M ← fd(h)    (10)
M = AB←E, CD←F : EFEFEFEFEFE    (11)
Np(M) = ABCDABCDABCDABCDABCDAB = hB    (12)
By coding and decoding in this way we obtain the prediction B as the next symbol. M will go into H since it is consistent with h when decoded, and since the output M is shorter than the models produced by quoting the original x. The history string will be updated:
hn+1 ← hn ws j1    (13)
h1 = ABCDABCDABCDABCDABCDA | ... fd ... | AB←E, CD←F : EFEFEFEFEFE |    (14)
The best model j2 is appended to the end of the history string. Note that the central idea of the system is that j, which is the best compression from the system's point of view, is essentially random, and the system is punished with the addition of "random information" to the next compression cycle. In order to be punished as little as possible, the system must avoid as much of these random strings as possible, meaning it must learn to do as much compression as possible of the complete history string. Note that finding additional patterns in the history string might result in a "cascade" of lowered complexity, since the original x is coded one time for each iteration. Compressing the history string means not only compressing the input x but also compressing and finding patterns and predicting the best action to be taken in the next iteration. This internal reinforcement with "random" strings can be expanded to an external feedback from an environment of real random strings, which is discussed below.
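The worked example (10)-(14) can be reproduced mechanically. The sketch below makes two assumptions beyond the text: fd is taken to replace exactly the fixed two-letter windows AB and CD (a lone trailing A is rounded up to E), and the production-rule decoder Np simply expands the rules.

    h = "ABCDABCDABCDABCDABCDA"            # the initial history string (9)
    rules = {"E": "AB", "F": "CD"}         # production rules AB<-E, CD<-F

    def f_d(s: str) -> str:
        """Dictionary compression with a two-letter window: replace AB and CD;
        a lone trailing 'A' is rounded up to 'E' (= 'AB'), which is what makes
        the decoded model overshoot h by one letter and yield the prediction."""
        out, i = "", 0
        while i < len(s):
            pair = s[i:i + 2]
            if pair == "AB" or (s[i] == "A" and i == len(s) - 1):
                out, i = out + "E", i + 2
            elif pair == "CD":
                out, i = out + "F", i + 2
            else:
                out, i = out + s[i], i + 1
        return out

    def N_p(m: str) -> str:
        """Production-rule decoder: expand every rule symbol."""
        return "".join(rules.get(c, c) for c in m)

    M = f_d(h)                             # (11): 'EFEFEFEFEFE'
    decoded = N_p(M)                       # (12): 'ABCD...AB' = hB
    print(M, decoded[len(h):])             # prediction: 'B'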
In the next iteration it will be h1 that is compressed, which now includes information about what has been done and what the result was.
As the iterations proceed, h will build up experience about how different instructions affect the search:
hn = x ws0 j10 ws1 j11 ws2 j12 ws3 j13 ws4 j14 ws5 j15 ... wsn j1n
In a sequence like this it is possible for the system to find patterns about meta-learning. Consider this example:
h = x | fgood | jshort | fbad | jlong | fgood | jshort | fgood | jshort | fbad | jlong |
It is possible to find the pattern showing that the good learning function fgood often precedes a short model jshort, while the not so good function fbad results in worse compression and prediction confidence in the form of the longer model jlong. It is therefore likely that a good future model will predict | fgood | jshort |, since it will enable a shorter model in H by predicting shorter future models j. Due to the recursive nature of h, there is no limit to the levels of insight into the original compression problem and the system's model of its own performance in the domain; finding patterns on an arbitrary meta-level is possible.

In the above it has been assumed that the system includes compression functions. However, it is possible to relax this assumption and just let any instruction operate on the memory entity M.

Another issue to be discussed is the growth of the history entity. So far we have assumed that the history entity will have a maximal size and after that work as a buffer. An alternative approach would be to include a special construct RANDOM(N) which, when appearing in the history string, should be interpreted as a random sequence of N letters. As the history list grows to the limit, arbitrary segments of it will be replaced by the RANDOM construct with corresponding N. The advantage of this approach is that it is consistent with the principle of the system and there is no way for the system to get a better position by "forgetting": everything that is forgotten will be impossible to compress. The storage capacity of the history string will be almost infinite, since a segment is replaced by a symbol with a logarithmic relationship to the segment's length.
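One possible rendering of the RANDOM(N) construct is sketched below. It is illustrative only: which segment to replace is left arbitrary here (the oldest part is chosen), and the limit MAX_LEN is an assumed parameter, not part of the original disclosure.

    import re

    MAX_LEN = 1000

    def forget(h: str, seg_start: int, seg_len: int) -> str:
        """Replace a segment of the history string by the RANDOM(N) construct;
        the marker costs only O(log N) characters, but the forgotten part is
        treated as incompressible random data."""
        return h[:seg_start] + f"RANDOM({seg_len})" + h[seg_start + seg_len:]

    def effective_length(h: str) -> int:
        """Length of h where each RANDOM(N) marker counts as N forgotten letters."""
        forgotten = sum(int(n) for n in re.findall(r"RANDOM\((\d+)\)", h))
        return len(re.sub(r"RANDOM\(\d+\)", "", h)) + forgotten

    h = "K2,K4,K1,p1," * 200                  # a history grown past the limit
    if len(h) > MAX_LEN:
        h = forget(h, 0, len(h) - MAX_LEN)    # compact the oldest segment
    print(len(h), effective_length(h))        # stored size vs. represented size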
Apart from feeding back the compression result to the history list for compression in the next iteration, it is feasible to use the random string as feedback from an environment. If the system interacts with an environment, it can be reinforced by the length of random strings written to the history string. The relationship between the length of the external random strings and the size of the history list will determine how the emphasis is distributed between internal compression and fulfilling the external goals of the environment.
Handling of errors
The invention may further be adapted to automatically manage and handle errors. For example, in the case where the model p or a candidate model py does not exactly match the beginning of H, this can be administered by estimating the information content of the error vector and adding the information content to the size of p/py.
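One possible realization is sketched below; estimating the information content of the error vector by its compressed length is an assumption made here for illustration, not a requirement of the invention.

    import zlib

    def penalized_size(model: bytes, reproduced: bytes, target_prefix: bytes) -> int:
        """Size of a candidate model plus the estimated information content of
        the error vector between its output and the actual beginning of H."""
        n = min(len(reproduced), len(target_prefix))
        errors = bytes(a ^ b for a, b in zip(reproduced[:n], target_prefix[:n]))
        return len(model) + len(zlib.compress(errors))

The comparison between p and py (e.g. in step S14) can then be made on penalized_size instead of on the raw model length.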
Conclusion
The invention has now been described by way of examples. The invention is useful within many technical areas and applications, and especially machine learning applications and data compression applications.
Machine learning applications of the invention include, but are not limited to: data mining, signal processing, image processing and understanding, reasoning, natural language understanding, search, information filtering, speech recognition, automatic programming, data series prediction, automatic model building, design support, dialogue systems, optimal control, etc. Data compression applications of the invention include, but are not limited to: text compression, image compression, video compression, audio and speech compression, lossy and lossless compression, etc., for transfer, storage, presentation, etc.
Hence, many alternatives of the invention are possible. For example, the model may be any model which could estimate a next step based on a history sequence, the history entity may comprise any kind of data, the end criterion could be of several different types, etc. Such and other closely related alternatives must be considered to be within the scope of the invention, as it is defined by the appended claims.
References
[1] Wolff J.G. (1993) Computing, Cognition and Compression. AI Communications 6(2), pages 107-127.
[2] Schmidhuber J. (1994) Discovering Problem Solutions with Low Kolmogorov Complexity and High Generalization Capability. Technical Report FKI-194-94, Computer Science Department, Technische Universität München, Germany.
[3] Li M., Vitanyi P. (1992) Inductive Reasoning and Kolmogorov Complexity. Journal of Computer and System Sciences, 44, pages 343-384.
[4] Li M., Vitanyi P. (1997) An Introduction to Kolmogorov Complexity and its Applications. Springer, New York.
[5] Hanson N.R. (1969) Perception and Discovery. Freeman, Cooper & Co, page 359.
[6] Nordin P. (1997) Evolutionary Program Induction of Binary Machine Code and its Applications. Krehl Verlag, Münster, Germany.
[7] Huffman D.A. (1952) A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, Volume 40, Number 9, pages 1098-1101.
[8] Ziv J., Lempel A. (1977) A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, Volume 23, Number 3, pages 337-343.
[9] Schmidhuber J.H. (1992) Learning Complex, Extended Sequences Using the Principle of History Compression. Neural Computation, vol. 4, pages 234-242.
[10] Schmidhuber J.H. (1991) Adaptive History Compression for Learning to Divide and Conquer. In Proc. International Joint Conference on Neural Networks, Singapore, volume 2, pages 1130-1135. IEEE.

Claims

1. A method for computerized machine learning and/or data compression comprising the steps:
(a) analyze a set of performed actions (K) and their outcome in a history sequence entity (H) with a first model (p), said first model being able to reproduce at least a part of the history sequence entity;
(b) use said first model to produce a next action (Kx) to be made on the history sequence entity (H);
(c) execute said next action (Kx) on said history sequence entity (H);
(d) analyze the resulting history sequence entity (H) and determine a second model (py), said second model being able to reproduce the history sequence entity;
(e) determine whether the second model (py) is smaller than the first model (p), and if so replace the first model with the second model; and
(f) repeating the steps (b) to (e) until an end criterion is reached.
2. A method according to claim 1, wherein at least one of the models is a compression model.
3. A method according to claim 1 or 2, wherein the end criterion is a maximum number of iterations.
4. A method according to claim 1 or 2, wherein the end criterion is a predetermined next action (Kx) resulting from the first model (p).
5. A method according to any one of the preceding claims, further comprising the step of initially choosing an initialization model (p) to be used as the first model.
6. A method according to any one of the preceding claims, further comprising the step of adding the determined next action (Kx) to said history sequence entity (H) as well as the determined second model (py), in case the second model (py) is smaller than the first model (p).
7. A method according to any one of the preceding claims, wherein the actions (K) comprise at least one instruction to a step to be performed.
8. A method according to any one of the preceding claims, wherein the actions (K) comprise at least one function.
9. A method according to any one of the preceding claims, wherein the actions (K) comprise at least one meta-instruction concatenating actions into new actions or action triggering events.
10. A method according to any one of the preceding claims, wherein the first model (p) could generate an endless sequence, of which the history sequence entity is a part.
11. A method according to any one of the preceding claims, wherein the first model (p) is able to reproduce the whole history sequence entity.
12. A method according to any one of the preceding claims, wherein, in the case where the first or second model does not exactly match the beginning of the history sequence entity (H), this is administered by estimating the information content of an error vector and adding the information content to the size of the model.
13. A method according to any one of the preceding claims, wherein the history sequence entity is compressed on gradually higher levels of abstraction during the recursions.
14. A method according to any one of the preceding claims, comprising the additional step of adding a random string to the history sequence entity.
15. A computer program product for computerized machine learning and/or compression, said program comprising codes for executing the steps:
(a) analyze a set of performed actions (K) and their outcome in a history sequence entity (H) with a first model (p), said first model being able to reproduce at least a part of the history sequence entity;
(b) use said first model to produce a next action (Kx) to be made on the history sequence entity (H);
(c) execute said next action (Kx) on said history sequence entity (H);
(d) analyze the resulting history sequence entity (H) and determine a second model (py), said second model being able to reproduce the history sequence entity;
(e) determine whether the second model (py) is smaller than the first model (p), and if so replace the first model with the second model; and
(f) repeating the steps (b) to (e) until an end criterion is reached.
PCT/SE2001/000465 2000-03-03 2001-03-05 Infinite level meta-learning through compression WO2001065395A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001241321A AU2001241321A1 (en) 2000-03-03 2001-03-05 Infinite level meta-learning through compression

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US18892800P 2000-03-03 2000-03-03
US60/188,928 2000-03-03

Publications (1)

Publication Number Publication Date
WO2001065395A1 true WO2001065395A1 (en) 2001-09-07

Family

ID=22695149

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2001/000465 WO2001065395A1 (en) 2000-03-03 2001-03-05 Infinite level meta-learning through compression

Country Status (2)

Country Link
AU (1) AU2001241321A1 (en)
WO (1) WO2001065395A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003050959A1 (en) * 2001-11-15 2003-06-19 Auckland Uniservices Limited Method, apparatus and software for lossy data compression and function estimation
CN112269769A (en) * 2020-11-18 2021-01-26 远景智能国际私人投资有限公司 Data compression method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0489351A2 (en) * 1990-11-27 1992-06-10 Fujitsu Limited Software work tool
EP0685822A2 (en) * 1994-06-02 1995-12-06 International Business Machines Corporation System and method for compressing data information
EP0729237A2 (en) * 1995-02-24 1996-08-28 International Business Machines Corporation Adaptive multiple dictionary data compression
EP0783151A1 (en) * 1996-01-05 1997-07-09 Siemens Corporate Research, Inc. The delta learning system for using expert advice to revise diagnostic expert system fault hierarchies

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0489351A2 (en) * 1990-11-27 1992-06-10 Fujitsu Limited Software work tool
EP0685822A2 (en) * 1994-06-02 1995-12-06 International Business Machines Corporation System and method for compressing data information
EP0729237A2 (en) * 1995-02-24 1996-08-28 International Business Machines Corporation Adaptive multiple dictionary data compression
EP0783151A1 (en) * 1996-01-05 1997-07-09 Siemens Corporate Research, Inc. The delta learning system for using expert advice to revise diagnostic expert system fault hierarchies

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHAN PHILIP K. ET AL.: "On the accuracy of meta-learning for scalable data mining", May 1995 (1995-05-01), pages 11 - 14, Retrieved from the Internet <URL:http://www.cs.columbia.edu/~pkc/> [retrieved on 20010621] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003050959A1 (en) * 2001-11-15 2003-06-19 Auckland Uniservices Limited Method, apparatus and software for lossy data compression and function estimation
AU2002366676B2 (en) * 2001-11-15 2008-01-10 Auckland Uniservices Limited Method, apparatus and software for lossy data compression and function estimation
US7469065B2 (en) 2001-11-15 2008-12-23 Auckland Uniservices Limited Method, apparatus and software for lossy data compression and function estimation
CN112269769A (en) * 2020-11-18 2021-01-26 远景智能国际私人投资有限公司 Data compression method and device, computer equipment and storage medium
CN112269769B (en) * 2020-11-18 2023-12-05 远景智能国际私人投资有限公司 Data compression method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
AU2001241321A1 (en) 2001-09-12

Similar Documents

Publication Publication Date Title
US11531889B2 (en) Weight data storage method and neural network processor based on the method
Gueniche et al. Compact prediction tree: A lossless model for accurate sequence prediction
WO2019155064A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
JP3305190B2 (en) Data compression device and data decompression device
JP3083730B2 (en) System and method for compressing data information
US20220092031A1 (en) Data compression method and computing device
CN107395211B (en) Data processing method and device based on convolutional neural network model
CN110442702B (en) Searching method and device, readable storage medium and electronic equipment
CN111860783B (en) Graph node low-dimensional representation learning method and device, terminal equipment and storage medium
Veness et al. Compress and control
CN114567332A (en) Text secondary compression method, device and equipment and computer readable storage medium
Tomohiro et al. Faster Lyndon factorization algorithms for SLP and LZ78 compressed text
CN114861747A (en) Method, device, equipment and storage medium for identifying key nodes of multilayer network
Lou et al. Autoqb: Automl for network quantization and binarization on mobile devices
Samplawski et al. Towards objection detection under iot resource constraints: Combining partitioning, slicing and compression
JP7163515B2 (en) Neural network training method, video recognition method and apparatus
Townsend Lossless compression with latent variable models
Lifshits et al. Speeding up HMM decoding and training by exploiting sequence repetitions
WO2001065395A1 (en) Infinite level meta-learning through compression
Louati et al. Design and compression study for convolutional neural networks based on evolutionary optimization for thoracic X-Ray image classification
Mozes et al. Speeding up HMM decoding and training by exploiting sequence repetitions
CN112686306B (en) ICD operation classification automatic matching method and system based on graph neural network
JPH03247167A (en) Data compression system
Raj et al. An aggregating strategy for shifting experts in discrete sequence prediction
Shermer et al. Neural Markovian predictive compression: An algorithm for online lossless data compression

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1) EPC (COMMUNICATION DATED 13-02-2004, EPO FORM 1205A)

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase