WO2022016363A1 - Similar data set identification - Google Patents

Similar data set identification

Info

Publication number
WO2022016363A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature vectors
data objects
sequences
data sets
Application number
PCT/CN2020/103233
Other languages
French (fr)
Inventor
Feng Liu
Xiaoxuan Zhang
Dino PACANDI
Mengmeng Liu
Zhiming Zhang
Nan LIANG
Jiawei Xu
Hang Li
Zhentao Liu
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Feng Liu
Application filed by Telefonaktiebolaget Lm Ericsson (Publ), Feng Liu filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/CN2020/103233
Publication of WO2022016363A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3612 Software analysis for verifying properties of programs by runtime analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3684 Test management for test design, e.g. generating new test cases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3696 Methods or tools to render software testable
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2237 Vectors, bitmaps or matrices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • Although only three testers are shown in Fig. 1, the present disclosure is not limited thereto. In some other embodiments, a different number of testers may be involved. Further, although a pool of test resources 110 is shown, the test resources 110 and/or the log server 120 may be distributed across different physical locations. Further, although only three TCs and three logs are shown, the present disclosure is not limited thereto. For example, in some embodiments, the tester #1 100-1 may submit more than one TC and the test resources 110 may generate more than one test log for each of the test cases.
  • Although the logs for the different TCs may have a same content format, the present disclosure is not limited thereto. For example, different testers may use different test resources and may design different scripts for different test procedures, and therefore different logs with different formats may be expected.
  • the system 10 may optionally comprise a similarity detection system (SDS) 130 for identifying similar test cases according to an embodiment of the present disclosure, which will be described in detail with reference to Fig. 2.
  • the similarity detection system 130 may retrieve the logs from the log server 120 for subsequent analysis shown in Fig. 2.
  • the SDS 130 may receive the TCs from the testers 100 directly without the test results/responses. In such a case, the SDS 130 may identify similar test cases from the scripts of the TCs directly without the test results/responses.
  • Fig. 2 is a message flow diagram illustrating exemplary messages exchanged between different nodes for processing test logs according to an embodiment of the present disclosure.
  • the SDS 130 may be operated on a big data platform which may be a scalable platform for distributed file storage and distributed computing.
  • a typical platform may comprise HDFS (a distributed file system) , Spark (a distributed computation framework) , and HBase (a database) , and so on.
  • middleware is provided for common functions, such as fetching, post-processing, and/or storing of the logs, triggering machine learning (ML) algorithms, and pushing back processing results.
  • the ML algorithms involved may be those described below with reference to Fig. 3 to Fig. 9, and the processing results may be presented to the users via a web portal in a visualized and user-friendly manner.
  • the SDS 130 may comprise a plurality of nodes including a master node 131 and one or more slave nodes 133-1, ..., 133-n, which may be physical or logical entities enabled by the platform and/or the middleware.
  • the master node 131 may initialize the process of similar test case identification, and may trigger starting of jobs of similar test case identification at the slave nodes 133-1, ..., 133-n (collectively, "the slave nodes 133" ) at steps 201a and 201b.
  • the slave node #1 133-1 may request test logs from the log server 120 upon the start of the job assigned to itself, and fetch, retrieve, or otherwise receive, at step 203a, a part of the logs in response to the request.
  • the slave node #n 133-n may request test logs from the log server 120 upon the start of the job assigned to itself, and fetch, retrieve, or otherwise receive, at step 203b, another part of the logs in response to the request. In this way, multiple slave nodes 133 may process the logs stored in the log server 120 in parallel.
  • the slave nodes 133 may process the logs, respectively, at steps 204a and 204b.
  • the processing of the logs may comprise, but is not limited to, log cleaning, log formatting, etc.
  • the logs may have a unified format, such as a JSON format as described below, for subsequent processing.
  • the processed logs are transmitted from the slave nodes 133 to the master node 131.
  • the present disclosure is not limited thereto.
  • the processed logs may be further processed at the slave nodes 133 locally, for example, by a machine learning algorithm described below.
  • the processed logs are subjected to the machine learning algorithms for similarity detection.
  • the execution of the machine learning algorithms may be distributed across multiple nodes (such as, the slave nodes 133) .
  • the execution of the machine learning algorithms may be performed on a same node, such as, the master node 131.
  • the results of the ML algorithms may be stored.
  • For example, the ML results may be stored locally at the master node 131, or at a remote storage, such as the log server 120 or another remote database.
  • Fig. 3 is a flow chart illustrating an exemplary method 300 for identifying similar data sets (or more specifically, similar test cases) according to an embodiment of the present disclosure.
  • the method 300 may comprise steps 310 through 340.
  • the present disclosure is not limited thereto. In some other embodiments, one or more of steps 310 -340 may be omitted or additional steps may be included. Further, the order of the steps is not limited to that shown in Fig. 3.
  • the method 300 may begin with an optional step 310 where the logs (e.g. the logs which are retrieved from the log server 120 shown in Fig. 2) are pre-processed.
  • the scripts of the test cases may be processed rather than logs.
  • the logs may be cleaned to exclude information unnecessary for similarity detection, such as, an IP address of a host on which a test procedure is executed, a fixed string identifying the developer of the test software/hardware, or the like.
  • the step 310 may be optional.
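As an illustration of the cleaning at step 310, a minimal sketch is given below. The choice of what counts as unnecessary data is deployment-specific; the patterns shown (an IPv4 host address and a fixed developer string) are hypothetical examples, not the patent's actual implementation.

```python
import re

# Hypothetical patterns for information that is unnecessary for similarity
# detection; a real deployment would tailor this list to its test environment.
VOLATILE_PATTERNS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),   # IPv4 address of the test host
    re.compile(r"Developed by [\w ]+"),           # fixed string naming the developer
]

def clean_log_line(line: str) -> str:
    """Strip volatile tokens so that two logs differing only in such tokens
    are still treated as identical by the later similarity analysis."""
    for pattern in VOLATILE_PATTERNS:
        line = pattern.sub("", line)
    return line

def clean_log(lines):
    return [clean_log_line(line) for line in lines]
```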
  • At step 320, the plurality of test logs may be converted into a plurality of data sequences, respectively, each of the data sequences being a sequence of one or more data objects.
  • each test log may comprise one or more steps, and each of the steps may be represented in a format of JSON object, as mentioned above.
  • An example of how a log is converted into a sequence of JSON objects is shown in Fig. 4.
  • Fig. 4 is an exemplary diagram illustrating the step 320 of the method 300 shown in Fig. 3.
  • In Fig. 4, a test log 125 is presented, in which a test case TC #1 may comprise multiple steps, for example, a test step #1 through a test step #n.
  • this example is given for the purpose of illustration only, and the present disclosure is not limited thereto.
  • The test step #1 comprises several attributes and their corresponding values to indicate different aspects of a test step, such as "time_stamp", "type", "case_id", "seq_no", and "event", as shown by the log 125.
  • The attributes/values shown in Fig. 4 are presented for the purpose of illustration only, and therefore the present disclosure is not limited thereto.
  • an attribute may comprise its own sub-attributes.
  • the attribute "event" may comprise two sub-attributes, "description" and "testStepStartTime".
  • The test step #n also comprises several attributes and their corresponding values, which may or may not be different from those of the test step #1.
  • The attribute "event" of the test step #n comprises two sub-attributes, "description" and "result".
  • a tree data structure may be used for a test step as shown in Fig. 4.
  • Each of the attributes and values may be represented by a node in the tree structure, and a leaf node in the tree structure may indicate a value.
  • the test step #1 is represented by a tree structure in which the root node indicates that this tree structure is associated with "Test Step #1", and each of the child nodes at the first layer indicates an attribute of the test step #1. Further, each of the child nodes at the second layer indicates either a value or a sub-attribute of the test step #1.
  • If a child node at the second layer is a sub-attribute, this child node may have its own child node(s).
  • the child node "event" at the first layer may have two child nodes "description" and "testStepStartTime" at the second layer, which may in turn have their own child nodes at the third layer, "The sample test starts" and "1562929996650", respectively.
  • any attribute/sub-attribute may have one or more sub-attributes and corresponding values.
  • a leaf node in the tree structure is a value. In this way, for each test step in the log 125, a JSON object 400 may be determined, and naturally a sequence of JSON objects 410 may be obtained for each log 125 or each test case by concatenating the determined JSON objects.
  • The log 125 is shown in Fig. 4 for the purpose of illustration only, and therefore the present disclosure is not limited thereto; a log may comprise different content and/or have a different format.
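To make the conversion at step 320 concrete, the sketch below (an illustration, not the patent's code) parses a test step from JSON and builds the tree of Fig. 4, with inner nodes for attributes/sub-attributes and leaf nodes for values; the sample attribute values are taken from the figure.

```python
import json
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    label: str                                      # attribute name, or a value at a leaf
    children: list = field(default_factory=list)

def json_to_tree(label, value):
    """Recursively turn a (nested) JSON object into the tree of Fig. 4:
    inner nodes are attributes or sub-attributes, leaves are values."""
    node = TreeNode(label)
    if isinstance(value, dict):
        for key, sub in value.items():
            node.children.append(json_to_tree(key, sub))
    else:
        node.children.append(TreeNode(str(value)))  # a leaf node holds the value
    return node

step = json.loads("""{
    "type": "test_step",
    "seq_no": 1,
    "event": {"description": "The sample test starts",
              "testStepStartTime": "1562929996650"}
}""")
tree = json_to_tree("Test Step #1", step)
```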
  • a sequence of JSON objects may be obtained for each of the test cases, and then the step 330 may be performed where a feature vector for each of the data objects may be determined such that each of the plurality of data sets corresponds to a sequence of one or more feature vectors.
  • feature vectors for the data objects may be determined in a manner described with reference to Fig. 5.
  • Fig. 5 is an exemplary flow chart illustrating the step 330 of the method 300 shown in Fig. 3.
  • all pairwise similarities for the JSON objects may be arranged in the form of a similarity matrix.
  • a similarity matrix is a symmetric matrix, and each element of the matrix is a pairwise similarity between two JSON objects.
  • an element s_ij of a similarity matrix S is the similarity between a JSON object i and a JSON object j.
  • s_ij is equal to s_ji, and therefore the matrix S is symmetric.
  • the similarity matrix S can be computed in parallel by techniques like multi-process, multithread, CUDA, map-reduce, etc.
  • a similarity between two data objects may be determined based on a distance between the two data objects.
  • the distance between the two data objects may be calculated based on the tree edit distance algorithm or the tree kernel algorithm.
  • the similarity between the two JSON objects may be calculated by the following equation: s = e^(-γ·d), where s represents the similarity between the two data objects, e represents the base of the natural logarithm, γ is a hyper parameter which is adjustable, and d represents the distance between the two data objects.
  • More generally, the similarity may be calculated by the following equation: s = f(d), where s represents the similarity between the two data objects and is greater than or equal to zero, f(·) represents a monotonically decreasing function, and d represents the distance between the two data objects and is greater than or equal to zero.
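The following sketch shows the pairwise similarity computation under these definitions. Here `tree_distance` is a stand-in for any tree edit distance or tree kernel implementation, and exp(-gamma * d) is one concrete monotonically decreasing f(d); since the published equation is rendered as an image, the symbol gamma for the adjustable hyper parameter is an assumption.

```python
import numpy as np

def similarity_matrix(objects, tree_distance, gamma=1.0):
    """Build the symmetric similarity matrix S with S[i, j] = exp(-gamma * d_ij),
    where d_ij = tree_distance(objects[i], objects[j])."""
    n = len(objects)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):          # d_ij = d_ji, so compute one half and mirror
            d = tree_distance(objects[i], objects[j])
            S[i, j] = S[j, i] = np.exp(-gamma * d)
    return S
```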
  • If it is determined that all the pairwise similarities are needed, the method proceeds to step 334 where all similarities are calculated. Otherwise, the method proceeds to step 332 where only a part of the pairwise similarities is calculated, and the rest of the pairwise similarities may be approximated from the calculated part of the pairwise similarities by using the Nystrom approximation at step 333. In this way, computing resources and time may be saved by the approximation.
  • The criterion for determining whether all pairwise similarities are required may be whether the number of JSON objects and/or the computing capability consumed is above a certain threshold. However, some errors may be introduced by the Nystrom approximation, and therefore this is a tradeoff between efficiency and accuracy.
  • Fig. 6 is a diagram illustrating an exemplary application of the Nystrom approximation algorithm according to an embodiment of the present disclosure.
  • the Nystrom approximation algorithm is an effective approximation for similarity matrix computation.
  • the Nystrom approximation can also be performed in parallel by techniques like multi-process, multithread, CUDA, map-reduce, etc.
  • In Fig. 6, the similarity matrix S is an n-by-n similarity matrix required by subsequent steps; the sub-matrices A (m-by-m), B, and B^T are the calculated part of the matrix S, and the rest of the matrix S is the sub-matrix C to be approximated, where B^T is the transpose of B since the matrix S is a symmetric matrix.
  • The sub-matrix C may be approximated as C ≈ B^T U Λ^(-1) U^T B (that is, C ≈ B^T A^(-1) B), where U is a matrix formed of the eigenvectors of the sub-matrix A and Λ is a diagonal matrix formed of the eigenvalues of the sub-matrix A.
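A NumPy sketch of this approximation is given below. It assumes the standard Nystrom extension C ≈ B^T A^(-1) B, with A^(-1) obtained from the eigendecomposition A = U Λ U^T; it illustrates the technique named in Fig. 6 rather than reproducing the patent's code.

```python
import numpy as np

def nystrom_full_similarity(A, B):
    """Approximate the n-by-n matrix S = [[A, B], [B^T, C]] from the computed
    m-by-m block A and the computed m-by-(n-m) block B."""
    eigvals, U = np.linalg.eigh(A)       # A = U diag(eigvals) U^T, A is symmetric
    inv_lam = np.diag(1.0 / eigvals)     # assumes A has no zero eigenvalues;
                                         # a pseudo-inverse is safer in practice
    C = B.T @ U @ inv_lam @ U.T @ B      # the approximated block
    return np.block([[A, B], [B.T, C]])
```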
  • the method proceeds to the step 335 where the JSON objects may be mapped into a vector space based on the pairwise similarities calculated.
  • the mapping can be achieved by using any valid graph embedding algorithm, such as the Laplacian eigenmap algorithm, the deep random walk algorithm, etc.
  • Fig. 7 is a diagram illustrating an exemplary application of the graph embedding algorithm according to an embodiment of the present disclosure.
  • each of the JSON objects may be mapped into a point, or a vector pointing from the origin to the point, in the vector space based on the similarities between itself and the other JSON objects.
  • a general principle of the mapping is to place two points as close as possible when their pairwise similarity is high, and as far apart as possible when their pairwise similarity is low.
  • In this way, similar JSON objects may be mapped into vectors gathered closely together, and non-similar JSON objects may be mapped into vectors separated farther apart.
  • feature vectors are determined for all the JSON objects, and naturally a sequence of vectors is determined for each sequence of JSON objects or a test case (or generally speaking, a data set) .
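As one example of this embedding step, a minimal Laplacian eigenmap sketch (one of the graph embedding algorithms named above) is shown below; the embedding dimensionality is an illustrative assumption.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(S, dim=2):
    """Embed each object into a dim-dimensional space from the similarity
    matrix S by solving the generalized eigenproblem L y = lambda D y with
    L = D - S, keeping the eigenvectors of the smallest non-zero eigenvalues
    so that highly similar objects land close together."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    eigvals, eigvecs = eigh(L, D)        # eigenvalues in ascending order
    return eigvecs[:, 1:dim + 1]         # skip the trivial constant eigenvector
```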
  • the plurality of data sets may be clustered into one or more data clusters based on similarities between their corresponding sequences of feature vectors, each of the data clusters being a cluster of similar data sets.
  • the clustering may be described with reference to Fig. 8 and Fig. 9.
  • Fig. 8 is an exemplary flow chart illustrating the step 340 of the method of Fig. 3
  • Fig. 9 is an exemplary flow chart illustrating the step 342 of the method of Fig. 8.
  • A (time) sequence of multi-variable vectors is also known as a multivariate time series, and a (time) sequence of uni-variable vectors is also known as a univariate time series.
  • a feature vector generated at step 335 may be a multi-variable vector, and therefore at step 341, the method provides an option to reduce a multivariate time series to a univariate time series, for which some time series clustering algorithms are specifically designed. In other words, to cater for these univariate-only clustering algorithms, a multivariate time series has to be converted into a univariate time series.
  • If it is determined at step 341 that a multivariate time series is to be converted into a univariate time series, then the method proceeds to step 342 where a univariate time series is obtained, for example, by a process shown in Fig. 9.
  • At step 343, the multi-variable feature vectors may be clustered into vector clusters.
  • the step 343 may be performed based on one of Kmeans, Kmeans++, Kmedoids, Spectral clustering, and Gaussian mixture model.
  • similar multi-variate feature vectors may be grouped together to form a cluster.
  • the multivariate feature vectors may be converted into univariate feature vectors, respectively, by replacing multivariate feature vectors, which belong to a same vector cluster, with a same univariate feature vector.
  • For example, two similar multivariate vectors, such as (x_1, y_1) and (x_2, y_2), may be represented by a same univariate vector, such as z_1, if they are grouped into a same vector cluster.
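A sketch of this reduction using k-means (one of the options listed above) follows; the number of vector clusters is a hypothetical parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def to_univariate(sequences, n_clusters=8):
    """Replace each multivariate feature vector with the label of its vector
    cluster, so that every multivariate time series becomes a univariate one;
    vectors in the same cluster map to the same symbol."""
    all_vectors = np.vstack(sequences)   # pool the vectors from all series
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(all_vectors)
    out, pos = [], 0
    for seq in sequences:
        out.append(labels[pos:pos + len(seq)])
        pos += len(seq)
    return out
```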
  • pairwise similarities among the time series may be calculated at step 345.
  • a similarity between two sequences of feature vectors may be determined by a distance based algorithm.
  • the dynamic time warping (DTW) algorithm may be used for both univariate and multivariate time series, while the Longest Common SubSequences (LCSS) may be used for univariate time series.
  • the similarity between the time series may be calculated by the following equation, which is similar to the above equation for calculating similarities between two JSON objects: s = e^(-γ·d), where γ is a hyper parameter and e represents the base of the natural logarithm.
  • More generally, s = f(d), where s represents the similarity between the two time series and is greater than or equal to zero, f(·) represents a monotonically decreasing function, and d represents the distance between the two time series and is greater than or equal to zero.
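For illustration, a textbook dynamic time warping implementation is sketched below; it returns the distance d between two sequences of feature vectors (univariate or multivariate), from which a similarity may then be derived as described above.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences, using the
    Euclidean norm as the local cost between feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.atleast_1d(a[i - 1]) - np.atleast_1d(b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```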
  • the Nystrom approximation algorithm may also be used to reduce the calculations of the pairwise similarities among the time series.
  • the plurality of time series may be clustered into one or more clusters based on the pairwise similarities between the sequences of feature vectors.
  • the plurality of time series may be clustered into one or more clusters based on the pairwise similarities between the sequences of feature vectors by the k-medoids algorithm or the spectral clustering algorithm.
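A minimal sketch of this final step with scikit-learn's spectral clustering, which accepts a precomputed affinity (similarity) matrix, is shown below; the number of clusters is an assumption, and a k-medoids implementation that accepts precomputed distances could be substituted.

```python
from sklearn.cluster import SpectralClustering

def cluster_test_cases(S, n_clusters=5):
    """Cluster the test cases (time series) directly from their pairwise
    similarity matrix S; label[i] is the cluster of test case i."""
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(S)
```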
  • Fig. 10 schematically shows an embodiment of an arrangement 1000 which may be used in an electronic device or a node (e.g., the master node 131, the slave nodes 133, the SDS 130) according to an embodiment of the present disclosure.
  • The arrangement 1000 comprises a processing unit 1006, e.g., with a Digital Signal Processor (DSP) or a Central Processing Unit (CPU).
  • the processing unit 1006 may be a single unit or a plurality of units to perform different actions of procedures described herein.
  • the arrangement 1000 may also comprise an input unit 1002 for receiving signals from other entities, and an output unit 1004 for providing signal (s) to other entities.
  • the input unit 1002 and the output unit 1004 may be arranged as an integrated entity or as separate entities.
  • the arrangement 1000 may comprise at least one computer program product 1008 in the form of a non-volatile or volatile memory, e.g., an Electrically Erasable Programmable Read-Only Memory (EEPROM) , a flash memory and/or a hard drive.
  • the computer program product 1008 comprises a computer program 1010, which comprises code/computer readable instructions, which when executed by the processing unit 1006 in the arrangement 1000 causes the arrangement 1000 and/or the electronic device in which it is comprised to perform the actions, e.g., of the procedure described earlier in conjunction with Fig. 1 to Fig. 9 or any other variant.
  • the computer program 1010 may be configured as a computer program code structured in computer program modules 1010A, 1010B, and 1010C.
  • the code in the computer program of the arrangement 1000 includes: a converting module 1010A for converting the plurality of data sets into a plurality of data sequences, respectively, each of the data sequences being a sequence of one or more data objects; a determining module 1010B for determining a feature vector for each of the data objects such that each of the plurality of data sets corresponds to a sequence of one or more feature vectors; and a clustering module 1010C for clustering the plurality of data sets into one or more data clusters based on similarities between their corresponding sequences of feature vectors, each of the data clusters being a cluster of similar data sets.
  • the computer program modules could essentially perform the actions of the flow illustrated in Fig. 1 to Fig. 9, to emulate the network elements. In other words, when the different computer program modules are executed in the processing unit 1006, they may correspond to different modules in the various network elements.
  • Although the code means in the embodiments disclosed above in conjunction with Fig. 9 are implemented as computer program modules which, when executed in the processing unit, cause the arrangement to perform the actions described above in conjunction with the figures mentioned above, at least one of the code means may in alternative embodiments be implemented at least partly as hardware circuits.
  • the processor may be a single CPU (Central processing unit) , but could also comprise two or more processing units.
  • the processor may include general purpose microprocessors, instruction set processors and/or related chip sets, and/or special purpose microprocessors such as Application Specific Integrated Circuits (ASICs).
  • the processor may also comprise board memory for caching purposes.
  • the computer program may be carried by a computer program product connected to the processor.
  • the computer program product may comprise a computer readable medium on which the computer program is stored.
  • the computer program product may be a flash memory, a Random-access memory (RAM) , a Read-Only Memory (ROM) , or an EEPROM, and the computer program modules described above could in alternative embodiments be distributed on different computer program products in the form of memories within the UE.

Abstract

The present disclosure is related to a method, an electronic device, and a system for identifying similar data sets from a plurality of data sets. The method comprises: converting the plurality of data sets into a plurality of data sequences, respectively, each of the data sequences being a sequence of one or more data objects; determining a feature vector for each of the data objects such that each of the plurality of data sets corresponds to a sequence of one or more feature vectors; and clustering the plurality of data sets into one or more data clusters based on similarities between their corresponding sequences of feature vectors, each of the data clusters being a cluster of similar data sets.

Description

SIMILAR DATA SET IDENTIFICATION

Technical Field
The present disclosure is generally related to the field of data mining, and in particular, to a method, an electronic device, and a system for identifying similar data sets from a plurality of data sets.
Background
Nowadays, a device has to be thoroughly tested before it is allowed to be operated in a modern telecommunication network. For example, in a Third Generation Partnership Project (3GPP) Long Term Evolution (LTE) or New Radio (NR) -compliant network, a user equipment (UE), an evolved NodeB (eNB), a gNodeB (gNB), or any other network node has to fulfill numerous requirements of technical specifications provisioned by 3GPP, which may include thousands of pages of documents and involve hundreds of features. Therefore, many resources for testing are involved and numerous test cases (TCs) are designed.
A test case is a set of actions or steps executed to verify a particular feature or functionality of a device or a software program. A test case may contain test steps, test data, pre-conditions, and post-conditions developed for a specific test scenario to verify a requirement. A test case includes specific variables or conditions, using which a testing engineer (or a tester) can compare expected and actual results to determine whether a device or a software program is functioning as per the requirements of a technical specification.
For example, a simple test case for testing a LOGIN function may be as follows:
Table 1: Exemplary Test Case
[Table 1 is published as an image in the original document and is not reproduced here.]
For a complex device, such as an LTE- or NR-compliant device, it is impossible for a single tester to design and test all the test cases required by the 3GPP technical specifications. Typically, different testers may be assigned to design test cases for different aspects of the device, respectively. However, for those common functions or features involved in the different aspects, similar or even identical test cases may well be designed by different testers, which may result in a waste of test resources (such as time, CPU cycles, network bandwidth, or the like).
Therefore, it is desirable to detect similar test cases, so that the similar test cases may be optimized by changing/merging the code or scripts of the test cases to improve testing efficiency or to save test resources.
Summary
According to a first aspect of the present disclosure, a method for identifying similar data sets from a plurality of data sets is provided. The method comprises: converting the plurality of data sets into a plurality of data sequences, respectively, each of the data sequences being a sequence of one or more data objects; determining a feature vector for each of the data objects such that each of the plurality of data sets corresponds to a sequence of one or more feature vectors; and clustering the plurality of data sets into one or more data clusters based on similarities between their corresponding sequences of feature vectors, each of the data clusters being a cluster of similar data sets.
In some embodiments, each of the plurality of data sets is a log for steps of a test case, and each of the data objects is a JavaScript Object Notation (JSON) object, and wherein the step of converting the plurality of data sets into a plurality of data sequences, respectively, comprises: converting the plurality of logs into the plurality of sequences of JSON objects, respectively, each of the JSON objects corresponding to a step of a corresponding test case.
In some embodiments, the step of determining a feature vector for each of the data objects comprises: determining pairwise similarities between the data objects; and mapping each of the data objects to a feature vector based on the pairwise similarities between the data objects.
In some embodiments, the mapping is achieved by a graph embedding algorithm.
In some embodiments, the step of determining pairwise similarities between the data objects comprises: determining whether all the pairwise similarities for the data objects are needed or not, at least based on the number of the data objects.
In some embodiments, the method further comprises: calculating all the pairwise similarities for the data objects in response to determining that all the pairwise similarities for the data objects are needed.
In some embodiments, the method further comprises: calculating one or more pairwise similarities for the data objects in response to determining that all the pairwise similarities for the data objects are not needed; and deriving all the pairwise similarities for the data objects from the one or more pairwise similarities for the data objects by the Nystrom approximation algorithm.
In some embodiments, a similarity between two data objects is determined based on a distance between the two data objects.
In some embodiments, the distance between the two data objects is calculated based on the tree edit distance algorithm or the tree kernel algorithm.
In some embodiments, the similarity is calculated by the following equation:
s = f (d)
wherein s represents the similarity between the two data objects and is greater than or equal to zero, f (·) represents a monotonically decreasing function, and d represents the distance between the two data objects and is greater than or equal to zero.
In some embodiments, the step of clustering the plurality of data sets into one or more data clusters based on similarities between their corresponding sequences of feature vectors comprises: determining pairwise similarities between the sequences of feature vectors; and clustering the plurality of data sets into one or more data clusters based on the pairwise similarities between the sequences of feature vectors.
In some embodiments, before the step of determining pairwise similarities between the sequences of feature vectors, the method further comprises: in response to determining that the feature vectors are multivariate feature vectors, clustering the feature vectors into vector clusters; and converting the multivariate feature vectors into univariate feature vectors, respectively, by replacing multivariate feature vectors, which belong to a same vector cluster, with a same univariate feature vector.
In some embodiments, the step of clustering the feature vectors into vector clusters in response to determining that the feature vectors are multivariate feature vectors is performed based on one of Kmeans, Kmeans++, Kmedoids, Spectral clustering, and Gaussian mixture model.
In some embodiments, a similarity between two sequences of feature vectors is determined by a distance based algorithm.
In some embodiments, the step of clustering the plurality of data sets into one or more data clusters based on the pairwise similarities between the sequences of feature vectors comprises: clustering the plurality of data sets into one or more data clusters based on the pairwise similarities between the sequences of feature vectors by the k-medoids algorithm or the spectral clustering algorithm.
In some embodiments, before the step of converting the plurality of data sets into a plurality of data sequences, the method further comprises: cleaning the plurality of data sets by removing unnecessary data from the data sets.
According to a second aspect of the present disclosure, an electronic device is provided. The electronic device comprises: a processor; a memory storing instructions which, when executed by the processor, cause the processor to perform any method described above.
According to a third aspect of the present disclosure, a computer program is provided. The computer program comprises instructions which, when executed by at least one processor, cause the at least one processor to carry out any method described above.
According to a fourth aspect of the present disclosure, a carrier containing the computer program described above is provided, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
According to a fifth aspect of the present disclosure, a system for identifying similar test cases from a plurality of test cases is provided. The system comprises: one or more computing nodes comprising one or more slave nodes and a master node, wherein the master node is configured to trigger the one or more slave nodes to perform any method described above, collectively.
Brief Description of the Drawings
The foregoing and other features of the present disclosure will become more fully apparent from the following description and appended claims, taken in conjunction with the accompanying drawings. Understanding that these drawings depict only several embodiments in accordance with the disclosure and therefore are not to be considered limiting of its scope, the disclosure will be described with additional specificity and detail through use of the accompanying drawings.
Fig. 1 is an overview diagram illustrating a system for testing according to an embodiment of the present disclosure.
Fig. 2 is a message flow diagram illustrating exemplary messages exchanged between different nodes for processing test logs according to an embodiment of the present disclosure.
Fig. 3 is a flow chart illustrating an exemplary method for identifying similar data sets according to an embodiment of the present disclosure.
Fig. 4 is a diagram illustrating an exemplary step of the method of Fig. 3.
Fig. 5 is a flow chart illustrating another exemplary step of the method of Fig. 3.
Fig. 6 is a diagram illustrating an exemplary application of the Nystrom approximation algorithm according to an embodiment of the present disclosure.
Fig. 7 is a diagram illustrating an exemplary application of the graph embedding algorithm according to an embodiment of the present disclosure.
Fig. 8 is a flow chart illustrating yet another exemplary step of the method of Fig. 3.
Fig. 9 is a flow chart illustrating an exemplary step of the method of Fig. 8.
Fig. 10 schematically shows an embodiment of an arrangement which may be used in an electronic device according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, the present disclosure is described with reference to embodiments shown in the attached drawings. However, it is to be understood that those descriptions are just provided for illustrative purpose, rather than limiting the present disclosure. Further, in the following, descriptions of known structures and techniques are omitted so as not to unnecessarily obscure the concept of the present disclosure.
Those skilled in the art will appreciate that the term "exemplary" is used herein to mean "illustrative, " or "serving as an example, " and is not intended to imply that a particular embodiment is preferred over another or that a particular feature is essential. Likewise, the terms "first" , "second" , "third" , "fourth" , "fifth" , "sixth, " and similar terms, are used simply to distinguish one particular instance of an item or feature from another, and do not indicate a particular order or arrangement, unless the context clearly indicates otherwise. Further, the term "step, " as used herein, is meant to be synonymous with "operation" or "action. " Any description herein of a sequence of steps does not imply that these operations must be carried out in a particular order, or even that these operations are carried out in any order at all, unless the context or the details of the described operation clearly indicates otherwise.
Conditional language used herein, such as "can, " "might, " "may, " "e.g., " and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or states. Thus, such conditional language is not generally intended to imply that features, elements and/or states are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and/or states are included or are to be performed in any particular embodiment. Also, the term "or" is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term "or" means one, some, or all of the elements in the list. Further, the term "each, " as used herein, in addition to having its ordinary meaning, can mean any subset of a set of elements to which the term "each" is applied.
The term "based on" is to be read as "based at least in part on. " The term "one embodiment" and "an embodiment" are to be read as "at least one embodiment. " The term "another embodiment" is to be read as "at least one other embodiment. " Other definitions, explicit and implicit, may be included below. In addition, language such as the phrase "at least one of X, Y and Z, " unless specifically stated otherwise, is to be understood with the context as used in general to convey that an item, term, etc. may be either X, Y, or Z, or a combination thereof.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises", "comprising", "has", "having", "includes" and/or "including", when used herein, specify the presence of stated features, elements, and/or components etc., but do not preclude the presence or addition of one or more other features, elements, components and/or combinations thereof. It will be also understood that the terms "connect(s)", "connecting", "connected", etc. when used herein, just mean that there is an electrical or communicative connection between two elements and they can be connected either directly or indirectly, unless explicitly stated to the contrary.
Of course, the present disclosure may be carried out in other specific ways than those set forth herein without departing from the scope and essential characteristics of the disclosure. One or more of the specific processes discussed below may be carried out in any electronic device comprising one or more appropriately configured processing circuits, which may in some embodiments be embodied in one or more application-specific integrated circuits (ASICs) . In some embodiments, these processing circuits may comprise one or more microprocessors, microcontrollers, and/or digital signal processors programmed with appropriate software and/or firmware to carry out one or more of the operations described above, or variants thereof. In some embodiments, these processing circuits may comprise customized hardware to carry out one or more of the functions described above. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Although multiple embodiments of the present disclosure will be illustrated in the accompanying Drawings and described in the following Detailed Description, it should be understood that the disclosure is not limited to the disclosed embodiments, but instead is also capable of numerous rearrangements, modifications, and substitutions without departing from the present disclosure as will be set forth and defined within the claims.
As mentioned above, similar or identical test cases should be identified from numerous test cases designed by different testers to save test resources and to optimize test procedures. Traditionally, a method for detecting similar test cases is to cluster the test cases by their names. However, such a clustering is obviously not accurate enough since different test cases may have similar test case names or a same test case may be named differently by different testers.
Therefore, some embodiments of the present disclosure propose a solution for identifying similar test cases from a plurality of test cases, or more generally speaking, a solution for identifying similar data sets from a plurality of data sets. According to an embodiment of the present disclosure, a test case comprising several test steps or actions may be wrapped into a sequence of JavaScript Object Notation (JSON) objects with one JSON object for each test step, as will be described in detail below. Thus, the execution and/or response of a test case can be represented by an ordered sequence of JSON objects.
To be specific, each JSON object may be regarded as a tree (data structure), and a similarity metric for trees (e.g., the tree edit distance, the tree kernel, etc.) may be used to measure the similarity between two JSON objects. Then, machine learning techniques (such as the Laplacian eigenmap, deep walk, etc.) may be used to embed each JSON object into a vector space; that is, for each JSON object, a numerical feature vector is generated. Once a numerical representation of each JSON object is determined, each test case may be converted to a sequence of vectors, and then machine learning techniques may be applied for time series clustering. For example, the pairwise similarities among different time series may be calculated by using any time series similarity metric, such as dynamic time warping, longest common subsequences, etc., and then the time series may be clustered by any clustering algorithm that supports pairwise similarity.
In this way, two sequences of JSON objects representing two test cases may be compared to each other not only in terms of their content, but also in terms of their tree structures, thereby resulting in an accurate comparison therebetween.
Next, an introduction of an exemplary test system will be given with reference to Fig. 1.
Fig. 1 is an overview diagram illustrating a system 10 for testing according to an embodiment of the present disclosure. The system 10 may comprise one or more testers 100-1, 100-2, and 100-3 (collectively, "the testers 100" ) , test resources 110, and a log server 120.
As mentioned above, a plurality of testers 100-1, 100-2, and 100-3 may be assigned to test different aspects of a same device or software program. In such a case, they may design different test cases (TCs), TC #1, TC #2, and TC #3, respectively, and submit these TCs to the test resources 110 for testing. For example, these test cases may be scripts for invoking different test procedures or programs at the test resources 110, and may be uploaded and executed at the test resources 110 to perform corresponding tests. Please note that although it is recited in the present embodiment that the testers are human beings, the present disclosure is not limited thereto. For example, in some other embodiments, a tester may be an Artificial Intelligence (AI) or a machine which is capable of designing test cases and/or testing them.
When the test procedures at the test resources 110 are completed, logs for the test procedures may be generated, respectively. For example, as shown in Fig. 1, different logs, Log #1, Log #2, and Log #3, are generated in response to the TC #1, TC #2, and TC #3 submitted to and executed at the test resources 110. These logs may be stored at the log server 120.
In some embodiments, the log server 120 may be a file server storing different log files, or a centralized or distributed database storing a table having different entries for different test cases, or any other storage storing the logs. As shown in Fig. 1, the log server 120 may be a database which hosts a table having three entries for the three TCs, respectively. The first entry of the table may be a record, "TC #1: step a1, result; step a2 {sub-step a2-1, ... } , result; ... " , which reflects that the test case TC#1 may comprise two or more steps, a1 and a2, and the step a2 may further comprise sub-steps including the sub-step a2-1. Further, this entry may include a response or result for each of the steps/sub-steps. However, in some other embodiments, one or more steps may not have their response recorded or obtained.
Further, although only three testers are shown in Fig. 1, the present disclosure is not limited thereto. In some other embodiments, a different number of testers may be involved. Further, although a pool of test resources 110 is shown, the test resources 110 and/or the log server 120 may be distributed across different physical locations. Further, although only three TCs and three logs are shown, the present disclosure is not limited thereto. For example, in some embodiments, the tester #1 100-1 may submit more than one TC, and the test resources 110 may generate more than one test log for each of the test cases.
Furthermore, although it is shown in Fig. 1 that the logs for the different TCs may have the same content format, the present disclosure is not limited thereto. In fact, different testers may use different test resources and may design different scripts for different test procedures, and therefore different logs with different formats may be expected.
Further, as shown in Fig. 1, the system 10 may optionally comprise a similarity detection system (SDS) 130 for identifying similar test cases according to an embodiment of the present disclosure, which will be described in detail with reference to Fig. 2. In some embodiments, the similarity detection system 130 may retrieve the logs from the log server 120 for the subsequent analysis shown in Fig. 2. Additionally or alternatively, the SDS 130 may receive the TCs from the testers 100 directly, without the test results/responses. In such a case, the SDS 130 may identify similar test cases from the scripts of the TCs directly, without the test results/responses.
Fig. 2 is a message flow diagram illustrating exemplary messages exchanged between different nodes for processing test logs according to an embodiment of the present disclosure.
As shown in Fig. 2, the SDS 130 may be operated on a big data platform, which may be a scalable platform for distributed file storage and distributed computing. In some embodiments, a typical platform may comprise HDFS (a distributed file system), Spark (a distributed computation framework), HBase (a database), and so on. However, the present disclosure is not limited thereto. On the platform, middleware is provided for common functions, such as fetching, post-processing, and/or storing of the logs, triggering machine learning (ML) algorithms, and pushing back processing results. Further, the ML algorithms involved may be those described below with reference to Fig. 3 through Fig. 9, and the processing results may be presented to the users via a web portal in a visualized and user-friendly manner.
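By way of illustration only, a minimal sketch of such a middleware stage on a Spark-based platform is given below. The HDFS paths, the application name, and the process_log () helper are hypothetical placeholders, not part of any particular deployment; a cleaning sketch for the helper is shown further below in connection with step 310.

```python
# A minimal, hedged sketch of the middleware stage on a Spark-based platform.
# The HDFS paths and the process_log() helper are hypothetical placeholders.
from pyspark.sql import SparkSession

def process_log(line: str) -> str:
    # Placeholder for log cleaning/formatting (see the cleaning sketch below).
    return line.strip()

spark = SparkSession.builder.appName("similar-test-case-identification").getOrCreate()
raw = spark.read.text("hdfs:///logs/raw/")                   # distributed fetching of the logs
processed = raw.rdd.map(lambda row: process_log(row.value))  # distributed post-processing
processed.saveAsTextFile("hdfs:///logs/processed/")          # stored for the ML stage
```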
Referring to Fig. 2, the SDS 130 may comprise a plurality of nodes including a master node 131 and one or more slave nodes 133-1, ..., 133-n, which may be physical or logical entities enabled by the platform and/or the middleware.
As shown in Fig. 2, the master node 131 may initialize the process of similar test case identification, and may trigger starting of jobs of similar test case identification at the slave nodes 133-1, ..., 133-n (collectively, "the slave nodes 133" ) at  steps  201a and 201b. At step 202a, the slave node #1 133-1 may request test logs from the log server 120 upon the start of the job assigned to itself, and fetch, retrieve, or otherwise receive, at step 203a, a part of the logs in response to the request. Similarly, at step 202b, the  slave node #n 133-n may request test logs from the log server 120 upon the start of the job assigned to itself, and fetch, retrieve, or otherwise receive, at step 203b, another part of the logs in response to the request. In this way, multiple slave nodes 133 may process the logs stored in the log server 120 in parallel.
Upon reception of the logs, the slave nodes 133 may process the logs, respectively, at steps 204a and 204b. The processing of the logs may comprise, but is not limited to, log cleaning, log formatting, etc. After the processing, the logs may have a unified format, such as the JSON format described below, for subsequent processing.
At steps 205a and 205b, the processed logs are transmitted from the slave nodes 133 to the master node 131. However, the present disclosure is not limited thereto. For example, in some other embodiments, the processed logs may be further processed at the slave nodes 133 locally, for example, by a machine learning algorithm described below.
At step 206, the processed logs are subjected to the machine learning algorithms for similarity detection. In some embodiments, the execution of the machine learning algorithms may be distributed across multiple nodes (such as the slave nodes 133). In some other embodiments, the execution of the machine learning algorithms may be performed on a single node, such as the master node 131.
At step 207, the results of the ML algorithms may be stored. In some embodiments, the ML results may be stored locally at the master node 131, or at a remote storage, such as the log server 120 or another remote database.
Next, a detailed description of the ML algorithms mentioned in step 206 will be given with reference to Fig. 3 through Fig. 9.
Fig. 3 is a flow chart illustrating an exemplary method 300 for identifying similar data sets (or, more specifically, similar test cases) according to an embodiment of the present disclosure. As shown in Fig. 3, the method 300 may comprise steps 310 through 340. However, the present disclosure is not limited thereto. In some other embodiments, one or more of steps 310 through 340 may be omitted or additional steps may be included. Further, the order of the steps is not limited to that shown in Fig. 3.
Referring to Fig. 3, the method 300 may begin with an optional step 310 where the logs (e.g. the logs which are retrieved from the log server 120 shown in Fig. 2) are  pre-processed. As mentioned above, the scripts of the test cases may be processed rather than logs. For example, in step 310 (or the  steps  204a, 204b shown in Fig. 2) , the logs may be cleaned to exclude information unnecessary for similarity detection, such as, an IP address of a host on which a test procedure is executed, a fixed string identifying the developer of the test software/hardware, or the like. However, since the log data per se may already be cleaned and therefore no pre-processing is needed, the step 310 may be optional.
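By way of illustration only, the following sketch shows one possible form of the cleaning in step 310. The regular expression for IPv4 addresses and the fixed developer string are assumptions for the example; the actual patterns to be excluded depend on the logs at hand.

```python
# A hedged sketch of the optional pre-processing in step 310: removing
# information unnecessary for similarity detection. The IPv4 pattern and
# the fixed developer string below are illustrative assumptions only.
import re

def clean_log(text: str) -> str:
    text = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "", text)  # drop host IP addresses
    text = text.replace("Developed by ExampleVendor", "")    # drop a fixed identifying string
    return text

print(clean_log("host 192.168.0.1 ran step a1 (Developed by ExampleVendor)"))
# the IP address and the vendor string are removed from the log line
```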
At step 320, the plurality of test logs (or, generally, data sets) may be converted into a plurality of data sequences, respectively, each of the data sequences being a sequence of one or more data objects.
To be specific, each test log may comprise one or more steps, and each of the steps may be represented in a format of JSON object, as mentioned above. An example of how a log is converted into a sequence of JSON objects is shown in Fig. 4.
Fig. 4 is an exemplary diagram illustrating the step 320 of the method 300 shown in Fig. 3. As shown in Fig. 4, an exemplary test log 125 is presented in which a test case TC #1 may comprise multiple steps, for example, a test step #1 through a test step #n. However, this example is given for the purpose of illustration only, and the present disclosure is not limited thereto.
Referring to Fig. 4, the test step #1 comprises several attributes and their corresponding values to indicate different aspects of a test step, such as "time_stamp", "type", "case_id", "seq_no", and "event", as shown by the log 125. However, please note that the attributes/values shown in Fig. 4 are presented for the purpose of illustration only, and therefore the present disclosure is not limited thereto. Further, an attribute may comprise its own sub-attributes. For example, the attribute "event" may comprise two sub-attributes, "description" and "testStepStartTime". Similarly, the test step #n also comprises several attributes and their corresponding values, which may or may not be different from those of the test step #1. Further, the attribute "event" of the test step #n comprises two sub-attributes, "description" and "result".
To reflect both the content and the structure of the log 125, a tree data structure may be used for a test step, as shown in Fig. 4. Each of the attributes and values may be represented by a node in the tree structure, and a leaf node in the tree structure may indicate a value. For example, the test step #1 is represented by a tree structure in which the root node indicates that this tree structure is associated with "Test Step #1", and each of the child nodes at the first layer indicates an attribute of the test step #1. Further, each of the child nodes at the second layer indicates either a value or a sub-attribute of the test step #1. If a child node at the second layer is a sub-attribute, then this child node may have its own child nodes. For example, the child node "event" at the first layer may have two child nodes, "description" and "testStepStartTime", at the second layer, which may in turn have their own child nodes at the third layer, "The sample test starts" and "1562929996650", respectively. In other words, any attribute/sub-attribute may have one or more sub-attributes and corresponding values. Typically, a leaf node in the tree structure is a value. In this way, for each test step in the log 125, a JSON object 400 may be determined, and naturally a sequence of JSON objects 410 may be obtained for each log 125 or each test case by concatenating the determined JSON objects.
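By way of illustration only, the sketch below shows how one test step might be wrapped into a JSON object (a nested dictionary) and how a test case becomes an ordered sequence of such objects. The attribute names mirror those in Fig. 4; the value of "type" is an illustrative assumption.

```python
# A sketch of step 320: wrapping test steps into JSON objects so that each
# log/test case becomes an ordered sequence of JSON objects (nested dicts).
# The attribute names follow the example of Fig. 4.
import json

test_step_1 = {
    "time_stamp": "1562929996650",
    "type": "testStepStarted",      # illustrative value
    "case_id": "TC#1",
    "seq_no": 1,
    "event": {                      # an attribute with sub-attributes
        "description": "The sample test starts",
        "testStepStartTime": "1562929996650",
    },
}

# A test case is the concatenation of its steps' JSON objects.
tc1_sequence = [test_step_1]        # ... append one JSON object per step
print(json.dumps(tc1_sequence, indent=2))
```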
Please note that the log 125 is shown in Fig. 4 for the purpose of illustration only, and therefore the present disclosure is not limited thereto and it may comprise different content and/or have a different format.
Referring back to Fig. 3, after the step 320, a sequence of JSON objects may be obtained for each of the test cases, and then the step 330 may be performed where a feature vector for each of the data objects may be determined such that each of the plurality of data sets corresponds to a sequence of one or more feature vectors. For example, feature vectors for the data objects may be determined in a manner described with reference to Fig. 5.
Fig. 5 is an exemplary flow chart illustrating the step 330 of the method 300 shown in Fig. 3. Once the JSON objects are obtained for a test case, it is determined whether all pairwise similarities for these JSON objects are required at step 331.
In some embodiments, all pairwise similarities for the JSON objects may be arranged in the form of a similarity matrix. A similarity matrix is a symmetric matrix, and each element of the matrix is a pairwise similarity between two JSON objects. For example, an element $s_{i,j}$ of a similarity matrix $S$ is the similarity between a JSON object $i$ and a JSON object $j$. Typically, $s_{i,j}$ is equal to $s_{j,i}$, and therefore the matrix $S$ is symmetric. In some embodiments, the similarity matrix $S$ can be computed in parallel by techniques like multi-processing, multithreading, CUDA, map-reduce, etc.
In some embodiments, a similarity between two data objects may be determined based on a distance between the two data objects. In some embodiments, the distance  between the two data objects may be calculated based on the tree edit distance algorithm or the tree kernel algorithm. In some embodiments, the similarity between the two JSON objects may be calculated by the following equation:
$$s = e^{-\gamma d}$$

wherein s represents the similarity between the two data objects, e represents the base of the natural logarithm, γ is an adjustable hyperparameter, and d represents the distance between the two data objects.
In a more general sense, the similarity may be calculated by the following equation:
s = f (d)
wherein s represents the similarity between the two data objects and is greater than or equal to zero, f (·) represents a monotonically decreasing function, and d represents the distance between the two data objects and is greater than or equal to zero.
If it is determined that all pairwise similarities for these JSON objects are required, then the method proceeds to step 334 where all similarities are calculated. Otherwise, the method proceeds to step 332 where only a part of the pairwise similarities is calculated, and the rest of the pairwise similarities may be approximated from the calculated part by using the Nystrom approximation at step 333. In this way, computing resources and time may be saved by the approximation. In some embodiments, the criterion for determining whether all pairwise similarities are required is whether the number of JSON objects and/or the consumed computing capability is above a certain threshold. However, some error may be introduced by the Nystrom approximation, and therefore this is a tradeoff between efficiency and accuracy.
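By way of illustration only, the sketch below computes all pairwise similarities (step 334) from tree edit distances, with s = e^(-γd) as above. It assumes the third-party zss package (an implementation of the Zhang-Shasha tree edit distance); the json_to_tree conversion and the value of γ are illustrative only.

```python
# A hedged sketch of steps 331/334: computing the full pairwise similarity
# matrix from tree edit distances. Assumes the third-party zss package;
# the JSON-to-tree conversion handles only nested dicts and scalar leaves.
import math
import numpy as np
from zss import Node, simple_distance

def json_to_tree(obj, label="root"):
    # Each attribute becomes a node; each value becomes a leaf node.
    node = Node(label)
    if isinstance(obj, dict):
        for key, value in obj.items():
            node.addkid(json_to_tree(value, key))
    else:
        node.addkid(Node(str(obj)))
    return node

def similarity_matrix(json_objects, gamma=0.1):
    trees = [json_to_tree(o) for o in json_objects]
    n = len(trees)
    S = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):              # S is symmetric: compute upper half only
            d = simple_distance(trees[i], trees[j])
            S[i, j] = S[j, i] = math.exp(-gamma * d)
    return S
```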
The Nystrom approximation algorithm will be briefly introduced with reference to Fig. 6. Fig. 6 is a diagram illustrating an exemplary application of the Nystrom approximation algorithm according to an embodiment of the present disclosure. The Nystrom approximation algorithm is an effective approximation for similarity matrix computation. The Nystrom approximation can also be performed in parallel by techniques like multi-process, multithread, CUDA, map-reduce, etc.
As shown in Fig. 6, assume that the similarity matrix $S$ is the n-by-n similarity matrix required by the subsequent steps, partitioned as

$$S = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$$

where the sub-matrices $A$ (m-by-m), $B$, and $B^T$ are the calculated part of $S$, the remaining sub-matrix $C$ is the part to be approximated, and $B^T$ is the transpose of $B$ since $S$ is symmetric. The following approximation applies:

$$\bar{S} = \bar{U} \Lambda \bar{U}^T$$

where

$$\bar{U} = \begin{bmatrix} U \\ B^T U \Lambda^{-1} \end{bmatrix}$$

is a matrix formed of $U$ and $U$'s Nystrom extension (i.e., $B^T U \Lambda^{-1}$), $U$ is a matrix formed of the eigenvectors of the sub-matrix $A$, and $\Lambda$ is a matrix formed of the eigenvalues of $A$. Further, according to the definitions of eigenvector and eigenvalue, $AU = U\Lambda$, or equivalently $A = U\Lambda U^T$, since $U^T U$ is equal to the identity matrix when $A$ is symmetric.

Based thereon, the following equations hold:

$$\bar{S} = \bar{U} \Lambda \bar{U}^T = \begin{bmatrix} U \\ B^T U \Lambda^{-1} \end{bmatrix} \Lambda \begin{bmatrix} U^T & \Lambda^{-1} U^T B \end{bmatrix} = \begin{bmatrix} A & B \\ B^T & B^T A^{-1} B \end{bmatrix}$$

where $\bar{S}$ is the approximation of the similarity matrix $S$ and $B^T A^{-1} B$ is the Nystrom approximation of the sub-matrix $C$.
Referring back to Fig. 5, when all similarities are calculated, either directly or indirectly by the Nystrom approximation, the method proceeds to step 335 where the JSON objects may be mapped into a vector space based on the pairwise similarities calculated. In some embodiments, the mapping can be achieved by using any valid graph embedding algorithm, such as the Laplacian eigenmap algorithm, the deep random walk algorithm, etc. Next, a brief introduction of the graph embedding algorithm will be given with reference to Fig. 7.
Fig. 7 is a diagram illustrating an exemplary application of the graph embedding algorithm according to an embodiment of the present disclosure. As shown in Fig. 7, each of the JSON objects may be mapped into a point, or a vector pointing from the origin to that point, in the vector space, based on the similarities between itself and the other JSON objects. A general principle of the mapping is to make two points as close as possible when their pairwise similarity is high, and as far apart as possible when their pairwise similarity is low. For example, as shown in Fig. 7, similar JSON objects may be mapped into vectors gathered closely together, and non-similar JSON objects may be mapped into vectors separated farther apart.
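By way of illustration only, step 335 may be sketched with scikit-learn's SpectralEmbedding, which implements a Laplacian-eigenmap-style embedding over a precomputed affinity (similarity) matrix; the embedding dimension chosen here is an arbitrary assumption.

```python
# A sketch of step 335: embedding JSON objects into a vector space from
# their precomputed pairwise similarity matrix S, using scikit-learn's
# SpectralEmbedding (a Laplacian eigenmap implementation).
import numpy as np
from sklearn.manifold import SpectralEmbedding

def embed(S: np.ndarray, dim: int = 8) -> np.ndarray:
    embedding = SpectralEmbedding(n_components=dim, affinity="precomputed")
    return embedding.fit_transform(S)  # one feature vector per JSON object
```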
Referring back to Fig. 5, after the step 335, feature vectors are determined for all the JSON objects, and naturally a sequence of vectors is determined for each sequence of JSON objects or a test case (or generally speaking, a data set) .
Referring further back to Fig. 3, at step 340, the plurality of data sets may be clustered into one or more data clusters based on similarities between their corresponding sequences of feature vectors, each of the data clusters being a cluster of similar data sets. In some embodiments, the clustering may be described with reference to Fig. 8 and Fig. 9.
Fig. 8 is an exemplary flow chart illustrating the step 340 of the method of Fig. 3, and Fig. 9 is an exemplary flow chart illustrating the step 342 of the method of Fig. 8. As shown in Fig. 8, it is determined whether a (time) sequence of multi-variable vectors (also known as multivariate time series) is to be converted into a (time) sequence of uni-variable vectors (also known as univariate time series) at step 341.
In some embodiments, a feature vector generated at step 335 may be a multi-variable vector, and therefore at step 341, the method provides an option to reduce a multivariate time series to a univariate time series, for which some time series clustering algorithms are specifically designed. In other words, to cater for these univariate-only clustering algorithms, a multivariate time series has to be converted into a univariate time series.
If it is determined that a multivariate time series is to be converted into a univariate time series at step 341, then the method proceeds to step 342 where a univariate time series is obtained, for example, by a process shown in Fig. 9.
Referring to Fig. 9, at step 343, the multivariate feature vectors may be clustered into vector clusters. In some embodiments, the step 343 may be performed based on one of Kmeans, Kmeans++, Kmedoids, Spectral clustering, and Gaussian mixture model. In other words, similar multivariate feature vectors may be grouped together to form a cluster.
At step 344, the multivariate feature vectors may be converted into univariate feature vectors, respectively, by replacing multivariate feature vectors which belong to the same vector cluster with the same univariate feature vector. For example, two similar multivariate vectors, such as $(x_1, y_1)$ and $(x_2, y_2)$, may be represented by the same univariate vector, such as $z_1$, if they are grouped into the same vector cluster.
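By way of illustration only, the sketch below pools the multivariate feature vectors of all test cases, clusters them (here with k-means, one of the options named above), and replaces each vector with its cluster label; the number of clusters is an assumption for the example.

```python
# A sketch of steps 343/344: reducing multivariate time series to univariate
# ones by clustering the multivariate feature vectors and replacing each
# vector with its cluster label.
import numpy as np
from sklearn.cluster import KMeans

def to_univariate(sequences, n_clusters=16, seed=0):
    all_vectors = np.vstack(sequences)          # pool the vectors of all test cases
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(all_vectors)
    out, pos = [], 0
    for seq in sequences:                       # re-split the labels per sequence
        out.append(labels[pos:pos + len(seq)].astype(float))
        pos += len(seq)
    return out                                  # one univariate series per test case
```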
Referring back to Fig. 8, whether multivariate or univariate time series are obtained, pairwise similarities among the time series may be calculated at step 345. In some embodiments, a similarity between two sequences of feature vectors may be determined by a distance-based algorithm. For example, the dynamic time warping (DTW) algorithm may be used for both univariate and multivariate time series, while the Longest Common SubSequences (LCSS) algorithm may be used for univariate time series. In some embodiments, if the metric is a distance d, then the relation:
$$s = e^{-\gamma d}$$

may be used to convert the distance d to the similarity s, where γ is a hyperparameter and e represents the base of the natural logarithm.
In a more general sense, the similarity between the time series may be calculated by the following equation, which is similar to the above equation for calculating similarities between two JSON objects:
s = f (d)
wherein s represents the similarity between the two time series and is greater than or equal to zero, f (·) represents a monotonically decreasing function, and d represents the distance between the two time series and is greater than or equal to zero.
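By way of illustration only, a minimal DTW implementation and the corresponding distance-to-similarity conversion may look as follows; the Euclidean local cost and the value of γ are assumptions for the example.

```python
# A sketch of step 345: dynamic time warping (DTW) between two sequences of
# feature vectors, converted to a similarity with s = exp(-gamma * d).
import math
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # works for uni- and multivariate series
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def dtw_similarity(a, b, gamma=0.1) -> float:
    return math.exp(-gamma * dtw_distance(a, b))
```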
Further, in some other embodiments, the Nystrom approximation algorithm may also be used to reduce the calculations of the pairwise similarities among the time series.
After that, at step 346, the plurality of time series may be clustered into one or more clusters based on the pairwise similarities between the sequences of feature vectors, for example, by the k-medoids algorithm or the spectral clustering algorithm.
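By way of illustration only, the clustering of step 346 may be sketched with scikit-learn's SpectralClustering over a precomputed affinity matrix; the number of clusters is an assumption for the example.

```python
# A sketch of step 346: clustering the time series from a precomputed
# pairwise similarity matrix using spectral clustering. Series falling in
# the same cluster correspond to similar test cases.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_series(S: np.ndarray, n_clusters: int = 3) -> np.ndarray:
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(S)  # one cluster label per test case
```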
Finally, all time series (or their corresponding sequences of JSON objects, or the corresponding test cases) in the same cluster can be regarded as similar test cases, and further optimization may be performed thereon.
Fig. 10 schematically shows an embodiment of an arrangement 1000 which may be used in an electronic device or a node (e.g., the master node 131, the slave nodes 133, the SDS 130) according to an embodiment of the present disclosure. Comprised in the arrangement 1000 is a processing unit 1006, e.g., with a Digital Signal Processor (DSP) or a Central Processing Unit (CPU). The processing unit 1006 may be a single unit or a plurality of units performing different actions of the procedures described herein. The arrangement 1000 may also comprise an input unit 1002 for receiving signals from other entities, and an output unit 1004 for providing signal (s) to other entities. The input unit 1002 and the output unit 1004 may be arranged as an integrated entity or as separate entities.
Furthermore, the arrangement 1000 may comprise at least one computer program product 1008 in the form of a non-volatile or volatile memory, e.g., an Electrically Erasable Programmable Read-Only Memory (EEPROM) , a flash memory and/or a hard drive. The computer program product 1008 comprises a computer program 1010, which comprises code/computer readable instructions, which when executed by the processing unit 1006 in the arrangement 1000 causes the arrangement 1000 and/or the electronic device in which it is comprised to perform the actions, e.g., of the procedure described earlier in conjunction with Fig. 1 to Fig. 9 or any other variant.
The computer program 1010 may be configured as a computer program code structured in  computer program modules  1010A, 1010B, and 1010C. Hence, in an exemplifying embodiment when the arrangement 1000 is used in an electronic device, the code in the computer program of the arrangement 1000 includes: a converting module 1010A for converting the plurality of data sets into a plurality of data sequences, respectively, each of the data sequences being a sequence of one or more data objects; a determining module 1010B for determining a feature vector for each of the data objects such that each of the plurality of data sets corresponds to a sequence of one or more feature vectors; and a clustering module 1010C for clustering the plurality of data sets into one or more data clusters based on similarities between their corresponding sequences of feature vectors, each of the data clusters being a cluster of similar data sets.
The computer program modules could essentially perform the actions of the flows illustrated in Fig. 1 to Fig. 9, to emulate the nodes described above. In other words, when the different computer program modules are executed in the processing unit 1006, they may correspond to different modules in the various nodes.
Although the code means in the embodiments disclosed above in conjunction with Fig. 10 are implemented as computer program modules which, when executed in the processing unit, cause the arrangement to perform the actions described above in conjunction with the figures mentioned above, at least one of the code means may in alternative embodiments be implemented at least partly as hardware circuits.
The processor may be a single CPU (Central Processing Unit), but could also comprise two or more processing units. For example, the processor may include general-purpose microprocessors, instruction set processors and/or related chip sets, and/or special-purpose microprocessors such as Application-Specific Integrated Circuits (ASICs). The processor may also comprise board memory for caching purposes. The computer program may be carried by a computer program product connected to the processor. The computer program product may comprise a computer-readable medium on which the computer program is stored. For example, the computer program product may be a flash memory, a Random Access Memory (RAM), a Read-Only Memory (ROM), or an EEPROM, and the computer program modules described above could in alternative embodiments be distributed on different computer program products in the form of memories within the electronic device.
The present disclosure is described above with reference to the embodiments thereof. However, those embodiments are provided for illustrative purposes only, rather than limiting the present disclosure. The scope of the disclosure is defined by the attached claims as well as equivalents thereof. Those skilled in the art can make various alterations and modifications without departing from the scope of the disclosure, which all fall into the scope of the disclosure.

Claims (20)

  1. A method (300) for identifying similar data sets from a plurality of data sets, the method (300) comprising:
    converting (320) the plurality of data sets into a plurality of data sequences, respectively, each of the data sequences being a sequence of one or more data objects;
    determining (330) a feature vector for each of the data objects such that each of the plurality of data sets corresponds to a sequence of one or more feature vectors; and
    clustering (340) the plurality of data sets into one or more data clusters based on similarities between their corresponding sequences of feature vectors, each of the data clusters being a cluster of similar data sets.
  2. The method (300) of claim 1, wherein each of the plurality of data sets is a log for steps of a test case, and each of the data objects is a JavaScript Object Notation (JSON) object, and
    wherein the step (320) of converting the plurality of data sets into a plurality of data sequences, respectively, comprises:
    converting (320) the plurality of logs into the plurality of sequences of JSON objects, respectively, each of the JSON objects corresponding to a step of a corresponding test case.
  3. The method (300) of claim 1, wherein the step (330) of determining a feature vector for each of the data objects comprises:
    determining (331, 332, 333, 334) pairwise similarities between the data objects; and
    mapping (335) each of the data objects to a feature vector based on the pairwise similarities between the data objects.
  4. The method (300) of claim 3, wherein the mapping (335) is achieved by a graph embedding algorithm.
  5. The method (300) of claim 3, wherein the step (331) of determining pairwise similarities between the data objects comprises:
    determining (331) whether all the pairwise similarities for the data objects are needed or not, at least based on the number of the data objects.
  6. The method (300) of claim 5, further comprising:
    calculating (334) all the pairwise similarities for the data objects in response to determining that all the pairwise similarities for the data objects are needed.
  7. The method (300) of claim 5, further comprising:
    calculating (332) one or more pairwise similarities for the data objects in response to determining that not all the pairwise similarities for the data objects are needed; and
    deriving (333) all the pairwise similarities for the data objects from the one or more pairwise similarities for the data objects by the Nystrom approximation algorithm.
  8. The method (300) of claim 3, wherein a similarity between two data objects is determined based on a distance between the two data objects.
  9. The method (300) of claim 8, wherein the distance between the two data objects is calculated based on the tree edit distance algorithm or the tree kernel algorithm.
  10. The method (300) of claim 8, wherein the similarity is calculated by the following equation:
    s = f (d)
    wherein s represents the similarity between the two data objects and is greater than or equal to zero, f (·) represents a monotonically decreasing function, and d represents the distance between the two data objects and is greater than or equal to zero.
  11. The method (300) of claim 1, wherein the step (340) of clustering the plurality of data sets into one or more data clusters based on similarities between their corresponding sequences of feature vectors comprises:
    determining (345) pairwise similarities between the sequences of feature vectors; and
    clustering (346) the plurality of data sets into one or more data clusters based on the pairwise similarities between the sequences of feature vectors.
  12. The method (300) of claim 11, wherein before the step (345) of determining pairwise similarities between the sequences of feature vectors, the method (300) further comprises:
    in response to determining (341) that the feature vectors are multivariate feature vectors,
    clustering (342) the feature vectors into vector clusters; and
    converting (342) the multivariate feature vectors into univariate feature vectors, respectively, by replacing multivariate feature vectors, which belong to a same vector cluster, with a same univariate feature vector.
  13. The method (300) of claim 12, wherein the step (342) of clustering the feature vectors into vector clusters in response to determining that the feature vectors are multivariate feature vectors is performed based on one of Kmeans, Kmeans++, Kmedoids, Spectral clustering, and Gaussian mixture model.
  14. The method (300) of claim 11, wherein a similarity between two sequences of feature vectors is determined by a distance based algorithm.
  15. The method (300) of claim 11, wherein the step (340) of clustering the plurality of data sets into one or more data clusters based on the pairwise similarities between the sequences of feature vectors comprises:
    clustering (340) the plurality of data sets into one or more data clusters based on the pairwise similarities between the sequences of feature vectors by the k-medoids algorithm or the spectral clustering algorithm.
  16. The method (300) of claim 1, wherein before the step (320) of converting the plurality of data sets into a plurality of data sequences, the method further comprises:
    cleaning (310) the plurality of data sets by removing unnecessary data from the data sets.
  17. An electronic device (130, 131, 133, 1000) , comprising:
    a processor (1006) ; and
    a memory (1008) storing instructions (1010) which, when executed by the processor (1006) , cause the processor (1006) to perform the method of any of claims 1-16.
  18. A computer program comprising instructions (1010) which, when executed by at least one processor (1006) , cause the at least one processor (1006) to carry out the method of any of claims 1-16.
  19. A carrier (1008) containing the computer program (1010) of claim 18, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium (1008) .
  20. A system (130) for identifying similar test cases from a plurality of test cases, the system (130) comprising:
    one or more computing nodes comprising one or more slave nodes (133-1, 133-n) and a master node (131) ,
    wherein the master node (131) is configured to trigger the one or more slave nodes (133-1, 133-n) to perform the method of any of claims 1-16 collectively.