WO2018224165A1

WO2018224165A1 - Device and method for clustering a set of test objects

Info

Publication number: WO2018224165A1
Application number: PCT/EP2017/064136
Authority: WO
Inventors: Alexandros AGAPITOS; Janakiraman THIYAGARAJAH; Peng Lv; Hongbin Wang; Haiping Chen; Luca De Matteis
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2017-06-09
Filing date: 2017-06-09
Publication date: 2018-12-13

Abstract

Clustering device configured to cluster a set of test objects based on a set of labelled objects, the clustering device comprising a preparation unit configured to generate a set of genuine training pairs, each comprising two objects with a same label, and a set of impostor training pairs, each comprising two objects with different labels, a learning unit configured to learn, based on the genuine training pairs and the impostor training pairs, a mapping function that maps objects into a target space such that in the target space objects with different labels have a higher distance between them than objects with a same label, and an application unit configured to apply the learned distance measure to the set of test objects to obtain a clustering, wherein learning the mapping function comprises first using genetic programming to learn a structure of the mapping function and then performing an evolutionary parameter optimization to determine optimum parameters of the learned structured mapping function.

Description

DEVICE AND METHOD FOR CLUSTERING A SET OF TEST OBJECTS

TECHNICAL FIELD

The present invention relates to a clustering device and a method for clustering a set of test objects. The present invention also relates to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out such a method.

BACKGROUND

Clustering or cluster analysis refers to the task of grouping a set of objects such that similar objects are assigned to the same group (wherein the group is also referred to as cluster). Clustering problems occur in many technical fields and a large number of algorithms have been suggested to perform clustering.

In the absence of prior domain knowledge, the vast majority of cluster analysis applications rely on the Euclidean distance metric. However, the Euclidean distance metric has proven inefficient in high-dimensional, sparse, or non-isotropic input spaces.

The distance metric can be implemented to compute the Euclidean distance after performing a linear transformation of the input space X to a new space X' as the multiplication of matrix L with vector

There exist two families of methods for learning matrix L:

1. Eigenvetor methods (i.e. Principal Component Analysis, Linear Discriminant Analysis)

2. Convex optimisations over the space of semi-definite matrices.

Conventional Mahalanobis distance metric learning methods seek a linear transformation, which cannot always capture the non-linear manifold where high-dimensional input objects usually lie on.

A second way to improve the distance metric is to map the input space X into a new space X' using a non-linear function, and calculate the Euclidean distance in that space. The Siamese architecture based on Convolutional neural networks has been previously used for the nonlinear mapping.

Among the eigenvector methods, the problem with Principal Component Analysis is that this method operates without class label information to derive informative linear projections. The problem with the Linear Discriminant Analysis is that its projections are based on second- order statistics, and work to separate classes whose conditional densities are Gaussian. In cases where this assumption does not hold, the performance of LDA may be poor. Convex optimizations over the space of semi-definite matrices generate a linear transformation that minimizes the expected classification error rate when distance is computed in the transformed space. These methods are crucially tied to the requirement of continuous and differentiable loss functions in order for error to be minimized by gradient descent methods. The resulting loss function with respect to the model parameters are non-convex, and thus the optimization can in principle suffer from poor local optima. It has been shown in practice that the results of these methods are greatly dependent on the initialization of distance metric parameters.

Learning direct non-linear mappings to the transformed space is reduced to the problem of minimizing a non-convex loss function with respect to a number of model parameters. Non- convexity of the minimizing function may result in sub-optimal solutions. Another potential problem is that of model selection and overfitting because in the case of non-linear mappings, the hypothesis space is very rich. This can be a serious concern in cases of a limited number of training examples.

Other algorithms make use of domain- specific knowledge. Ideally, this knowledge is obtained by "learning" from labelled examples. However, such machine learning methods can have the problem that a large number of training examples are required in order for the algorithm to obtain sufficient information about the problem domain. This can involve a significant com- putational effort. SUMMARY OF THE INVENTION

The objective of the present invention is to provide a clustering device and a method for clustering the set of text objects, wherein the clustering device and the method for clustering the set of text objects overcome one or more of the above-mentioned problems of the prior art.

A first aspect of the invention provides clustering device configured to cluster a set of test objects based on a set of labelled objects, the clustering device comprising:

a preparation unit configured to generate a set of genuine training pairs, each compris- ing two objects with a same label, and a set of impostor training pairs, each comprising two objects with different labels,

a learning unit configured to learn, based on the genuine training pairs and the impostor training pairs, a mapping function that maps objects into a target space such that in the target space objects with different labels have a higher distance between them than objects with a same label, and

an application unit configured to apply the learned distance measure to the set of test objects to obtain a clustering,

wherein learning the mapping function comprises first using genetic programming to learn a structure of the mapping function and then performing an evolutionary parameter optimiza- tion to determine optimum parameters of the learned structured mapping function.

The cluster device of the first aspect has the advantage that the difficult problem of learning the mapping function is divided into two separate tasks: First, genetic programming is used to learn a structure of the mapping function. Secondly, an evolutionary parameter optimization is used to determine optimum parameters of the mapping function that has been learned in the first step. Experiments have shown that with this two-step approach, superior results can be achieved.

The labelled objects serve as training objects, from which the mapping function can be learned and then be applied to the test objects (for which the labels are not previously known).

The first step of learning a structure of the mapping function can comprise iteratively evaluation a fitness measure of a number of different structures of the mapping function. Thus, iteratively a better structure of the mapping function can be determined. The clustering device of the first aspect can be applied to different kinds of test objects. For example, these can be real or complex-valued vectors. The test objects can relate to different kind of applications. In particular, they can be representations of alarms that are generated by network devices.

In a first implementation of the clustering device according to the first aspect, the learning unit is configured to sample an impostor training pair with a probability that is a function of a difficulty measure and a measure of when the impostor training pair was last sampled.

Learning the mapping function can require dealing with a large number of labelled (training) objects. As this can be a computational burden, the clustering device of the first implementation has the advantage that learning can be focused on those objects that are difficult to learn. For example, the difficulty measure of an object can be based on whether in previous itera- tions this object has been assigned to the correct label (group) or not.

In other embodiments, the difficulty measure can be based on a distance between the objects from impostor pairs in target space, i.e., after applying the mapping function. In a second implementation of the clustering device according to the first aspect as such or according to the first implementation of the first aspect, the probability of sampling an impostor training pair is given by:

wherein s a generation number,

T is a total number of impostor training pair,

) is a number of generations since this training pair was last selected, and is a difficulty measure that is determined in an iterative

manner.

Experiments have shown that this represents an efficient and powerful way of determining the sampling probability. In a third implementation of the clustering device according to the first aspect, each training pair comprises a first element X_j and a second element y_j and the difficulty measure is itera- tively determined based on:

wherein η is a predetermined parameter.

Experiments have shown that particularly good results can be achieved if the difficulty measure is determined as specified above. In a fourth implementation of the clustering device according to the first aspect, the learning unit is configured to evaluate a fitness function on the set of labelled training objects, wherein evaluating the fitness function comprises comparing an average distance of genuine training pairs with an average distance of imposture training pairs, wherein preferably the distance is evaluated in the target space.

A second aspect of the invention refers to a method for clustering a set of test objects based on a set of labelled training objects, the method comprising:

generating a set of genuine training pair, each comprising two objects with a same label, and a set of impostor training pairs, each comprising two objects with different labels,

learning, based on the genuine training pairs and the impostor training pairs, a mapping function that maps objects into a target space such that in the target space objects with different labels have a higher distance between them than objects with a same label, and

- applying the learned distance measure to the set of test objects to obtain a clustering, wherein learning the mapping function comprises first using genetic programming to learn a structure of the mapping function and then performing an evolutionary parameter optimization to determine optimum parameters of the structured mapping function.

The methods according to the second aspect of the invention can be performed by the clustering device according to the first aspect of the invention. Further features or implementations of the method according to the second aspect of the invention can perform the functionality of the clustering device according to the first aspect of the invention and its different implementation forms.

In a first implementation of the method for clustering a set of test objects of the second aspect, learning the mapping function comprises sampling an impostor training pair with a probability that is a function of a difficulty measure and a measure of when the impostor training pair was last sampled.

In a second implementation of the method for clustering a set of test objects of the second aspect as such or according to the first implementation of the second aspect, the probability of sampling an impostor training pair is given by:

wherein is a generation number,

T is a total number of impostor training pair, is a number of generations since this train

ing pair was last selected, and is a difficulty measure that is determined in an iterative

manner.

In a third implementation of the method for clustering a set of test objects of the second aspect as such or according to any of the preceding implementations of the second aspect, each training pair comprises a first element Xj and a second element yj and the difficulty measure is iteratively determined based on:

wherein η is a predetermined parameter. In a fourth implementation of the method for clustering a set of test objects of the second aspect as such or according to any of the preceding implementations of the second aspect, learning the mapping function comprises sampling a balanced number of genuine training pairs and impostor training pairs. In a fifth implementation of the method for clustering a set of test objects of the second aspect as such or according to any of the preceding implementations of the second aspect, obtaining the clustering is based on a threshold that is determined based by maximizing a weighted sum of a hit rate and a slack rate on a labelled validation set.

In a sixth implementation of the method for clustering a set of test objects of the second aspect as such or according to any of the preceding implementations of the second aspect, learning the mapping function comprises evaluating a fitness function on the set of labelled training objects, wherein evaluating the fitness function comprises comparing an average distance of genuine training pairs with an average distance of imposture training pairs, wherein preferably the distance is evaluated in the target space.

In a seventh implementation of the method for clustering a set of test objects of the second aspect as such or according to any of the preceding implementations of the second aspect, the fitness function is given by:

wherein is a set of all genuine training pairs, is a set of all impostor train

ing pairs, G is the mapping function, 1 is a predetermined parameter and SD is a measure of standard deviation within a group.

Experiments have shown that this fitness function is efficient to compute, yet leads to superior clustering results.

In an eighth implementation of the method for clustering a set of test objects of the second aspect as such or according to any of the preceding implementations of the second aspect, the measure of standard deviation is given by:

wherein

A third aspect of the invention refers to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of the second aspect or one of the implementations of the second aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

To illustrate the technical features of embodiments of the present invention more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description are merely some embodiments of the present invention, modifications on these embodiments are possible without departing from the scope of the present invention as defined in the claims.

FIG. 1 is a block diagram illustrating a clustering device, FIG. 2 is a flow chart of a method for clustering a set of test objects,

FIG. 3 illustrates an example of a crossover operator,

FIG. 4 illustrates a transformation function: G: X -> Z,

FIG. 5 shows a processing pipeline of a learning system of a clustering device,

FIG. 6 shows a generic cluster analysis system based on trainable similarity metrics, FIG. 7 shows a high-level view of a clustering device,

FIG. 8 shows a detailed architecture of a model induction system,

FIG. 9 illustrates an example of a genetic program representing a 2-dimensional trans- formation as an array of two individual expression-trees,

FIG. 10 shows a table that summarizes the generalization performance of an online network alarm grouping system, and FIG. 11 shows a table that summarizes the generalization performance of an online network alarm grouping system.

DETAILED DESCRIPTION OF EMBODIMENTS

In the following, a similarity metric or distance metric is a mapping over

the vector space X. A similarity metric possesses the following properties:

Note that the terms similarity metric and distance metric will be used interchangeably in the following.

The Euclidean distance between M-dimensional vectors X and X' is defined as

The Li norm of the difference between M-dimensional vectors X and X' is defined as

where abs(y) denotes the absolute value of y. Li norm is denoted as IIX - X'l

An Evolutionary Algorithm is an iterative stochastic search algorithm for searching spaces of objects, where the search process is loosely modelled after the biological process of natural selection (i.e. survival of the fittest). A recipe for solving a problem using an Evolutionary Algorithm is as follows:

^■ Define a representation (i.e. hypothesis) space in which candidate solutions can be specified.

^■ Define a fitness criterion for quantifying the quality of a solution.

■ Define variation operators (i.e. mutation, crossover) for generating offspring from a parent or a set of parents.

^■ Define the parent selection (i.e. fitness-proportionate, tournament, rank) and replacement policy (i.e. generational, steady-state). Iterate the process of Fitness Evaluation -> Selection -> Variation -> Replacement

A Genetic Program can be seen as a function composition represented using a directed acyclic graph structure, which is amenable to evolutionary modification. A function composition is the point- wise application of one function to the result of another one to produce a third function.

For example, given three functions:

the genetic program w where p can be formed by the

composition of f, g, h as follows:

This is represented as the directed acyclic graph, also known as an expression-tree. An ex- pression-tree is evaluated in a feed-forward fashion starting from the leaf nodes.

Genetic Programs can be generated as function compositions using a set of functions F and a set of terminals T. Elements of the function set are the primitive building blocks (i.e. functions) of a function composition. Elements of the terminal set represent constants and varia- bles of a function composition.

Given function and terminal sets of primitive elements, a maximum depth of the tree, and the probability of selecting a terminal node, a random program can be generated with the following recursive algorithm:

The function rand() selects a real- value uniform-randomly in the interval [0.0, 1.0]. The function CHOOSE_RAND_ELEMENT() choses an element uniform-randomly from within the set supplied as arguent to the function invocation. The function ARITY() is passed an element of the function set as argument and returns the arity of the function. As an example, f(a,b)=a+b has an arity of 2, where log(a) has an arity of 1.

Genetic programs are structures that are amenable to evolutionary modification. We define three variation operators, namely subtree mutation, point mutation, and subtree crossover, presented below.

The Subtree Mutation Operator is operating on a single parent tree. It picks a tree-node at random in the parent tree and generates an offspring graph by replacing the subtree rooted at the selected node by a randomly generated subtree using a random expression-tree generation algorithm:

The Point Mutation Operator is operating on a single parent expression-tree. It picks a tree node at random in the parent graph and generates an offspring by replacing that node with a randomly selected function from the function set (in case of an inner node), or with a random- ly selected terminal from the terminal set (in case of a leaf node).

FIG. 1 shows a clustering device 100 configured to cluster a set of test objects based on a set of labelled objects. The clustering device comprises a preparation unit 110, a learning unit 120 and an application unit 130. The preparation unit 110 is configured to generate a set of genuine training pairs, each comprising two objects with a same label, and a set of impostor training pairs, each comprising two objects with different labels.

The learning unit 120 is configured to learn, based on the genuine training pairs and the impostor training pairs, a mapping function that maps objects into a target space such that in the target space objects with different labels have a higher distance between them than objects with a same label.

The application unit 130 is configured to apply the learned distance measure to the set of test objects to obtain a clustering.

In the above, learning the mapping function comprises first using genetic programming to learn a structure of the mapping function and then performing an evolutionary parameter optimization to determine optimum parameters of the learned structured mapping function.

The preparation unit 110, the learning unit 120 and the application unit 130 can be implemented on a same processor.

FIG. 2 shows a method 200 for clustering a set of test objects based on a set of labelled training objects. The method 200 can be performed e.g. by the clustering device 100 of FIG. 1.

The method 100 comprises a first step of generating 210 a set of genuine training pairs, each comprising two objects with a same label, and a set of impostor training pairs, each comprising two objects with different labels.

The method comprises a second step of learning 220, based on the genuine training pairs and the impostor training pairs, a mapping function that maps objects into a target space such that in the target space objects with different labels have a higher distance between them than objects with a same label.

The method comprises a third step of applying 230 the learned distance measure to the set of test objects to obtain a clustering. Learning the mapping function comprises first using genetic programming to learn a structure of the mapping function and then performing an evolutionary parameter optimization to determine optimum parameters of the structured mapping function.

The Crossover Operator is operating on a pair or parent graphs. It picks two nodes at random, one from each parent expression-tree and then generates two offspring expression-trees by swapping the subtrees rooted at the previously selected nodes between the parent programs. FIG. 3 illustrates the crossover operator.

A first parent genetic program is

A second parent genetic program is

A first offspring program is

A second offspring program is

We suggest that the edges of a directed graph of a genetic program are further parameterized to generate composite functions of the form h(x; w), where x is a vector of independent variables, and w is the vector of parameters. Given a function composition, the size of the weight vector w is equal to a*b, where a is the number of functions that make up the composition, and b is the arity of every such function. Each of the elements of w is used as a coefficient for each of the independent variables.

As an example, given

then we can write down the analytic function composition

The weight vector is a real-valued encoding of the parameterization of the composite function. Real- valued encodings are amenable to evolutionary search and optimization with the use of Evolution Strategies. Embodiments can be used to solve problems faced in the operations for Cloud and more particularly to improve the system and method for clustering incoming alarm data in the context of automating the process of root cause analysis. Cloud datacenter operators have recognized the need for a method and system to impose structure to an incoming stream of alarm data, by partitioning and organizing alarm data into related subsets. This can greatly facilitate the process of alarm correlation rule mining, and root cause analysis. Cluster analysis, also called data segmentation, relates to grouping or segmenting a collection of objects into subsets or clusters, such that those in the same cluster are more closely related to one another than objects assigned to different clusters. Central to cluster analysis is the process measuring the degree of similarity (or dissimilarity) between the individual objects. This is performed by means of a similarity metric. The choice of similarity metric comes from sub- ject matter considerations. This implies that the similarity metric needs to be specifically defined for the grouping task at hand. In the absence of prior domain knowledge, the standard Euclidean metric is used for computing distances between objects.

In order to improve the performance of clustering algorithms, a metric learning approach can be adopted. Metric learning methods attempt to generate a transformation of the original input space into a new feature space in which objects that belong to a particular group are "closer " than objects that belong into different groups. This transformation in the current method is generated with a supervised learning procedure by providing to the machine examples of genuine (i.e. positive) and impostor (i.e. negative) pairs of objects. The trainability of metric ensures that:

1. The similarity metric can be customized to any particular clustering domain, provided that labelled examples are provided.

2. It is a general method for object similarity computation that can be applied to cluster- ing algorithms like K-Means, Hierarchical Agglomerative Clustering, and Self- Organizing Maps.

Firstly, there is a need of producing optimal or near-optimal data clustering as opposed to sub- optimal clustering that may be generated with standard Euclidean-distance-based algorithms that do not use any prior knowledge of the problem domain. To achieve optimal or near- optimal performance, the parameters controlling the size and number of clusters given a set of objects should be fine-tuned using ground-truth or labelled examples of what constitutes a cluster of objects. This is in contrast with the current practice of cluster analysis in which these parameters are human-controlled and need to be specified in each new application of a clustering algorithm.

Secondly, there is a need for the method to generate low-dimensional representations of high- dimensional data, which can accelerate online clustering algorithms whose time-complexity factors in the input dimensionality. This is of significant importance in time-critical application of online grouping of telecom network alarm data.

Further requirements can include one or more of:

1. The ability to generate an input transformation represented as function compositions of linear and / or non-linear components can circumvent potentially sub-optimal performance of linear transformations introduced in previous work on the subject matter.

2. The ability to perform input dimensionality reduction will have an impact of saving CPU-time of online clustering.

3. The ability to generate optimal or near-optimal clustering by parametrically controlling the agglomeration procedure of hierarchical agglomerative clustering.

4. The ability to group alarm data, in light of different similarity metrics required for different domains, i.e. Cloud, Radio, Fixed network.

5. The ability to enhance the performance of Genetic Programming in very large hypothesis spaces.

A problem in the prior art is a sub-optimal clustering performance that may result in cases where the standard Euclidean distance metric applied to an original input space X is used as a similarity measure for clustering. The application of Euclidean distance in X assumes that no prior information is used about the problem domain in calculating similarity between objects. This is illustrated in FIG. 4.

In this example the number of clusters is known a priory, and is set to three. We hypothesize that there exists some space Z that can be reached via a transformation (i.e. mapping) G ap- plied to input X, in which objects belonging to a group are "close" to one another, whereas objects belonging to different groups are "far" from one another. Formally, the original input representation X is transformed via function G into a new representation Z in which the Li norm of a difference between objects that belong to the same group is small, whereas the Li norm of a difference between objects that belong to different groups is large. We denote the original M-dimensional input space as X, and the 2-dimensional space that results after the transformation via function G as Z. We learn G: X -> Z using a novel Evolutionary Computation that comprises a cascade of two learning stages:

1. Structural Learning

2. Parametric Optimization

Structural Learning aims at generating the overall function composition made out of linear and / or non-linear functions; the function composition is represented as a directed acyclic graph (i.e. expression-tree). The basis for Structural learning is the method of Genetic Programming. Parametric Optimization acts as a local search optimization method, and aims at fine-tuning the function composition. It starts by parameterizing the edges of the evolved expression-tree. Analytically, this corresponds to parameterizing the arguments of each function that makes up the composition with real-valued weights. The basis for optimization is the method of Evolution Strategy.

A schematic representation of a learning system is depicted in FIG. 5. The learning system 500 comprises a structural learning unit 510, which determines a structure F(x). This is passed to a parametric optimization unit 520. The parametric optimization unit 520 determines optimum parameters of the mapping function and passes the optimized mapping function G(x) to a transformation repository 530. Thus, differently initialized cascades of structural followed by parametric learning give rise to different models (i.e. transformations G(x)), and these are stored in the transformation repository. Once the repository is populated with a number of models, the runtime operation of the clustering system can be performed as illustrated in FIG. 6.

FIG. 6 illustrates a generic cluster analysis system 600 that is based on trainable similarity metrics. Original input vectors are transformed into a new vector representation that is used in the computation of average Li norms between pairs of objects, required in the process of hierarchical agglomerative clustering. A distance threshold controls the termination of the agglomeration procedure, and designates the output of the whole process, which is a number of groups of objects. In particular, the cluster analysis system 600 comprises a transformation repository 610, which passes a transformation (mapping function) to a similarity evaluation unit 620. The similarity evaluation unit 620 uses the transformation G(x) to evaluate a Li similarity between two objects xi and x₂. The Li similarity value is then passed to a similarity metric averaging unit 630, which evaluates an average. The average value is then passed to a hierarchical ag- glomerative clustering unit 640. Based on a termination criterion 650 that is provided from an external unit 650, the hierarchical agglomerative clustering unit 640 then determines object groups as output. A high-level view of a processing sequence of a clustering system is given in FIG. 7. The processing sequence 700 comprises:

Training Data Processing 710, responsible for data generation

- Model Induction 720, responsible for generating mappings (i.e. transformations) G(x)

- Distance Threshold Determination 730, for deciding on the parameter that will be used in conjunction with agglomerative clustering to determine optimal or near-optimal clustering

- Hierarchical Agglomerative Online Clustering 740, as the ultimate application of the system to real-time data. Training Data Processing

Training examples can be arranged into sets of objects that constitute a group. Each object in each of these sets is paired up with every other object in these sets in order to generate a set of combinations for genuine (i.e. positive) and impostor (i.e. negative) examples. The set of combinations can be exhaustive.

As a demonstration of the process of creating training examples consider the following. Given three groups of objects Groupl={Al, A2, A3 }, Group2={A4, A5 }, and Group3={A6, A7} we generate the set of genuine training examples Dgenuine = {Al, A2, 1 }, {Al, A3, 1 }, {A2, A3, 1 }, {A4, A5, 1 }, {A6, A7, 1 } and impostor training examples Dimpostor {Al, A4, 0}, {Al, A5, 0 }, {A2, A4, 0}, {A2, A5, 0}, {A3, A4, 0}, {A3, A5, 0}, {Al, A6, 0}, {Al, A7, 0}, {A2, A6, 0 }, {A2, A7, 0}, {A3, A6, 0}, {A3, A7, 0}, {A4, A6, 0}, {A4, A7, 0}, {A5, A6, 0}, {A5, A7, 0 } . The labels 0, 1 indicate genuine, impostor pairs respectively. The examples in

are further grouped as follows:

This grouping will facilitate the definition of loss function that ensures that the average deviation of Li norms within a group is minimized. The process of creating a set of genuine and impostor examples results in a highly unbalanced set of examples. The set of examples can be exhaustive. Preferably, a down-sampling mechanism of impostor examples is used. Therein, the probability of sampling a group of impostor training examples can be a function of difficulty and sampling recency of said example. The algorithm can randomly select a mini-batch of groups of impostor examples at each genera- tion, with a bias, so that a group of impostor examples is more likely to be selected if it is difficult or has not been selected for several generations. Mini-batch sampling ensures a balanced distribution between genuine and impostor examples.

The process for mini-batch sampling is as follows:

- In the first pass of the entire set Dimpostor groups, of size T, in generation g, each training case i is assigned a weight W, which is the sum of its current difficulty, D, exponentiated to a certain power d, and the number of generations since it was last selected, A, exponentiated to a certain power a.

In the second pass of the entire set of Dimpostor groups, each group is given a probability of bein selected. A group' s probability of being selected is given by

If a group is selected to be in the mini-batch of current generation g, its difficulty Di is set to 0 and its age Ai is set to 1, otherwise its difficulty remains unchanged and its age incremented. While executing each genetic program of a population of genetic programs of size K using the selected group of impostor examples, the difficulty of the said group is incremented by

where

is the learning rate set to 0.1

IGroupl is the cardinality of the group of impostor pairs

is the Li norm between 2-dimensional vectors

Model Induction

The detailed architecture of a model induction system 800 is illustrated in FIG 8. A Model Induction unit 820 is provided with a number of inputs. These can include Training Data 810, a Function set, a hypothesis space 814, a loss function 816, and a dynamic mini-batch sam- pling 818. The hypothesis space 814 can be determined based on information 812 that includes a function set, a terminal set and a tree depth value.

As outlined above, the Model Induction Unit 820 comprises a structural learning unit and a parametric optimization unit.

The Model Induction Unit 820 interacts with an evolutionary algorithm 830. The evolutionary algorithm comprises the following steps:

First, in step 831, an initial population is created. Subsequently, in step 832, the fitness of each individual is evaluated. Then, a selection is applied 833 and crossover/mutation is per- formed 834. Subsequently, it is evaluated whether a termination criterion has been reached already. If it has been reached, the method ends, otherwise it continues with the fitness evaluation in step 832. Structural Learning aims at generating the overall function composition made out of linear and / or non-linear functions; the function composition is represented as a directed acyclic graph (i.e. expression-tree). The basis for Structural learning is the method of Genetic Programming. Parametric Optimization acts as a local search optimization method, and aims at fine-tuning the function composition. It starts by parameterizing the edges of the evolved ex- pression-tree. Analytically, this corresponds to parameterizing the arguments of each function that makes up the composition with real-valued weights. The basis for optimization is the method of Evolution Strategy.

The application of Genetic Programming requires the experimenter to specify the maximum depth of tree- structured genetic programs, as well as the function and terminal sets, which collectively define the hypothesis space.

As an example of a hypothesis search space we consider all possible tree- structures that can be generated using the following realizations for function / terminal sets and tree-depth:

- Function set F={add(a,b), subtract(a,b), multiply(a,b), divide(a,b), sin(a), log(a), sqrt(a), pow(a,b), IF- THEN-ELSE, GreaterThanOrEqual, LessThan}

Terminal set T={ Original Input Features, Random constants in { - 1.0, 1.0 } }

Maximum tree- depth: 10 The task is to learn transformations that map an M-dimensional input vector X into a 2- dimensional feature vector Z. The representation of a solution is an array of two expression- trees. Each of the expression-trees evaluates to one component of vector Z=[Zi , Z₂]. An example is given in FIG. 9.

The loss function (to be minimized) is the fitness criterion that can be used in the "Evaluate Fitness of Each Individual" process of the Evolutionary Algorithm illustrated in FIG. 8. The loss function can be defined as follows:

where

G(x) is a transformation that maps M-dimensional input X into 2-dimensional vector Z with components Gi(x) and G₂(x)

- Dgenuine is the set of genuine groups with cardinality I Dgenuine I. Each example is a pair of objects (x, y). The set of genuine pairs is further organized into groups of all the pairs that can be extracted from a given group.

- Dimpostor is the set of impostor examples with cardinality I Dimpostor I. Each example is a pair of objects (x, y). The set of impostor pairs is further organized into groups of GroupxVsGroupy.

- Group; is a group of genuine of impostor examples

A is a parameter that controls the tradeoff between minimizing the average Li norm of genuine pairs, and maximizing the average Li norm of impostor pairs.

- Tanh(x) is the hyperbolic tangent function with scalar argument x

Learning is reduced to the task of searching for a genetic program that minimizes the proposed loss function. The search engine of Genetic Programming is an Evolutionary Algorithm, which is illustrated as a flowchart in FIG. 8. The initialization of population can be performed using a random-tree-generation algorithm. The weights in the edges of a genetic program can all be set to the value of 1 during Structural learning. A predetermined maximum number of generations can be used to terminate this process. At the end of Structural learning we select the individual genetic program that achieves the best value of the loss function calculated on a separate validation set. This genetic program is Gbest(x ; w). The process of Parametric optimization follows, in which an Evolution Strategy ES(1+ λ ) is used to further optimize the weighted edges of Gbest(x ; w) with vector w. The encoding of a solution is the weight vector extracted from the expression-tree of Gbest(x ; w). The calculation of the loss function used for estimating the fitness of a population of weight vectors is based on the weight-parameterized function composition Gbest(x ; w) as described above. The output of parametric optimization is the best-evolved function composition resulted from the process, with the best-evolved weight vector w', denoted as Gbest(x ; w').

A Model Induction life-cycle is now terminated. Several invocations (i.e. 20 independent cycles) of this life-cycle can be be realized in order to populate a repository of evolved trans- formations. Every independent run of Evolutionary Algorithm for Structural Learning will be typically initialized in different parts of the search space. Randomization during initialization is achieved using a random genetic program generation algorithm, as illustrated above.

Distance Threshold Determination

Given a repository of evolved input-space transformations, this process aims at determining the value of a distance threshold parameter that will be used in agglomerative clustering algorithm with the aim of reproducing the groups of objects of a validation set as precisely as possible. The determination of distance parameter is performed by measuring average pairwise Li norm between objects in a group during agglomerative clustering, then selecting the threshold value that terminates agglomeration and sets a trade-off between average Hit Rate and average Slack Rate metrics. Importantly, Hit Rate and Slack Rate metrics are evaluated on a validation set, independent of the one used during training. Let T be a target set of objects with cardinality ITI, and Y be a predicted set of objects with cardinality IYI,

The hit rate can be defined as and the slack rate correspondingly as

wherein ^ denotes the set-intersection operator and ^ denotes the set- difference operator. The algorithm illustrated below can be used to generate groups of objects given a set of objects X, a set of transformations G, and a similarity threshold T.

For a given value of threshold T, the average Hit Rate and average Slack Rate is calculated for resulting grouping C. A grid-search procedure (with certain step-size K) is followed to test an array of threshold values, and a threshold T is selected that maximizes the weighted sum of i_{n a validation set}

Parametric Agglomerative Hierarchical Clustering

The above algorithm can be used to group a set of objects in real-time. Significant speedups in execution time can be obtained by calculating the proximity matrix containing the average Li distance of each pair of points.

An application of the above generic system for cluster analysis is the online grouping of network alarm data. For this purpose, the system of FIG. 6 can be used, wherein the set of objects are alarm objects, e.g. grouped to a block of alarms. The block of alarms can be provided from an alarm accumulation and/or pre-processing unit, which obtains a stream of alarms, e.g. directly from a plurality of network components.

The Accumulator can use some business logic to partition the incoming stream of Alarms into "Blocks" of alarms. In each such block, basic pre-processing removes duplicate and flapping alarms.

Hierarchical Agglomerative clustering requires the computation of pairwise distances between alarms, and these are computed as Li norms on the transformed alarm representations. Averaging the distances computed using different transformations enhances the robustness of estimated similarity. A similarity threshold is used to halt the agglomeration process of hierarchical clustering. The output of the process is the partitioning of alarms into a number of groups. Simulations have confirmed the effectiveness and applicability of the proposed device and method. The tables in FIG. 10 and 11 summarize the generalization performance of an online network alarm grouping system. The system was trained on 21 days of alarm data. It attained on average a 87.8% hit rate when tested on data that were drawn from the same distribution as the training data, and a 79.3% hit rate when tested on data drawn from a different distribution from the training data.

Embodiments of the invention can include:

A novel non-differentiable, non-convex loss function that cannot be minimized by tra- ditional gradient-based methods, but is amenable to minimization with stochastic search algorithms from the field of Evolutionary Computation.

A novel mini-batch sampling method specifically developed to deal with the class imbalance problem inherent in the generation of training examples as genuine and impostor pairs of objects.

A novel two-stage model induction process that combines global optimization of a function composition with local optimization of its variables.

A novel method able to produce optimal or near-optimal grouping of objects using a parametric agglomerative clustering algorithm and a trainable similarity metric. The output of parametric agglomerative clustering is controlled by a single threshold value that is chosen based on validation data performance.

• A Li distance averaging mechanism as opposed to standalone distance metrics used previously.

Possible applications of the presented device and method can include:

• Significant importance in time-critical application areas where learning has to be done on real-time streaming data and applied along with the learning

• Alternative solution to the dependency on domain knowledge for learning metrics that is used in various machine learning algorithms related to kernel regression, prototype- based classification, distance-based clustering.

• In cloud operations & management, where decisions have to be taken at real-time, when applied on numerous set of metrics across domains

• In IoT area, when decisions have to be taken for optimizations & dynamic planning of network

• In other 5G domain, for prediction of results on Network slicing, optimization & planning

• As a precursor stage for Root Cause Analysis

The foregoing descriptions are only implementation manners of the present invention, the scope of the present invention is not limited to this. Any variations or replacements can be easily made through person skilled in the art. Therefore, the protection scope of the present invention should be subject to the protection scope of the attached claims.

Claims

CLAIMS 1. Clustering device (100) configured to cluster a set of test objects based on a set of labelled objects, the clustering device comprising:

a preparation unit (110) configured to generate a set of genuine training pairs, each comprising two objects with a same label, and a set of impostor training pairs, each comprising two objects with different labels,

a learning unit (120) configured to learn, based on the genuine training pairs and the impostor training pairs, a mapping function that maps objects into a target space such that in the target space objects with different labels have a higher distance between them than objects with a same label, and an application unit (130) configured to apply the learned distance measure to the set of test objects to obtain a clustering,

wherein learning the mapping function comprises first using genetic programming to learn a structure of the mapping function and then performing an evolutionary parameter optimization to determine optimum parameters of the learned structured mapping function.

2. The clustering device (100) of claim 1, wherein the learning unit (120) is configured to sample an impostor training pair with a probability that is a function of a difficulty measure and a measure of when the impostor training pair was last sampled.

3. The clustering device (100) of claim 2, wherein the probability of sampling an impostor training pair is given by:

wherein is a generation

number, T is a total number of impostor training pairs,

is a number of generations since this training pair was last selected, and is a difficulty measure that is

determined in an iterative manner.

4. The clustering device (100) of claim 2 or 3, wherein each training pair comprises a first element and a second element y_;- and the difficulty measure is iteratively de

termined based on:

wherein η is a predetermined parameter.

5. The clustering device (100) of one of the previous claims, wherein the learning unit (120) is configured to evaluate a fitness function on the set of labelled training objects, wherein evaluating the fitness function comprises comparing an average distance of genuine training pairs with an average distance of imposture training pairs, wherein preferably the distance is evaluated in the target space.

6. Method (200) for clustering a set of test objects based on a set of labelled training objects, the method comprising:

generating (210) a set of genuine training pairs, each comprising two objects with a same label, and a set of impostor training pairs, each comprising two objects with different labels,

learning (220), based on the genuine training pairs and the impostor training pairs, a mapping function that maps objects into a target space such that in the target space objects with different labels have a higher distance between them than objects with a same label, and

applying (230) the learned distance measure to the set of test objects to obtain a clustering,

wherein the learning (220) the mapping function comprises first using genetic programming to learn a structure of the mapping function and then performing an evolutionary parameter optimization to determine optimum parameters of the structured mapping function.

7. The method (200) of claim 6, wherein learning (220) the mapping function comprises sampling an impostor training pair with a probability that is a function of a difficulty measure and a measure of when the impostor training pair was last sampled.

8. The method (200) of claim 7, wherein the probability of sampling an impostor training pair is given by:

wherein

is a generation number, Tis a total number of impostor training pairs, A

is a number of generations since this training pair was last selected, and D is a difficulty measure that is

determined in an iterative manner.

9. The method (200) of claim 7 or 8, wherein each training pair comprises a first element and a second element and the difficulty measure is iteratively determined based

on:

wherein η is a predetermined parameter.

10. The method (200) of one of claims 6 to 9, wherein learning (220) the mapping function comprises sampling a balanced number of genuine training pairs and impostor training pairs.

11. The method (200) of one of claims 6 to 10, wherein obtaining the clustering is based on a threshold that is determined based by maximizing a weighted sum of a hit rate and a slack rate on a labelled validation set.

12. The method (200) of one of claims 6 to 11, wherein learning (220) the mapping function comprises evaluating a fitness function on the set of labelled training objects, wherein evaluating the fitness function comprises comparing an average distance of genuine training pairs with an average distance of imposture training pairs, wherein preferably the distance is evaluated in the target space.

13. The method (200) of claim 12, wherein the fitness function is given by:

wherein is a set of all genuine training pairs, D_impostor is a set of all impostor

training pairs, G is the mapping function, 1 is a predetermined parameter and SD is a measure of standard deviation within a group.

14. The method (200) of claim 13, wherein the measure of standard deviation is given by:

wherein

15. A computer-readable storage medium storing program code, the program code comprising instructions that when executed by a processor carry out the method of one of claims 6 to 14.