CN104050162B

CN104050162B - Data processing method and data processing equipment

Info

Publication number: CN104050162B
Application number: CN201310075814.7A
Authority: CN
Inventors: 黄琦珍; 张军; 钟朝亮; 松尾昭彦
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-03-11
Filing date: 2013-03-11
Publication date: 2017-10-13
Anticipated expiration: 2033-03-11
Also published as: CN104050162A

Abstract

A data processing method and a data processing apparatus are disclosed. The data processing method includes: a clustering step of performing collaborative clustering on a plurality of samples having a plurality of dimensions to obtain a first number of sample clusters, a second number of dimension clusters, and an objective function value representing the information relationship before and after clustering; a weight calculation step of calculating, based on the obtained first number of sample clusters, second number of dimension clusters, and objective function value, weights representing the degree of association between each sample cluster and each dimension cluster; a dimension cluster sorting step of sorting the dimension clusters based on the calculated weights so that, when the distribution of the dimension clusters and the sample clusters is visualized, the sample cluster having the highest degree of association with each dimension cluster is distributed near that dimension cluster and different sample clusters are separated from one another; and a visualization step of visualizing the distribution of the dimensions and the samples based on the determined ordering of the dimension clusters.

Description

Data processing method and data processing device

Technical Field

The present disclosure relates to a data processing method and a data processing apparatus, and more particularly, to a data processing method and a data processing apparatus for improving visualization quality of multi-dimensional sample data by using a weighted dimension cluster algorithm.

Background

Visualizing data by graphical means allows information to be conveyed and communicated more clearly. Displaying different categories of data in a data set separately, and displaying data of the same category adjacently in the graph, helps the user to select among the different categories of data. For example, a user may need to select different classes of services from a large number of web services for a mashup, so visualizing the cluster structure of the web services helps the user to intuitively pick out the desired class of services.

In the big data era, high-dimensional data are ubiquitous, for example web service data expressed using different keywords, genetic data expressed using different experimental conditions, and astronomical data expressed using different observation indices. Radial coordinate visualization (Radviz) is a widely used visualization technique that can effectively display cluster structures in high-dimensional data sets. Radviz maps the dimensions (i.e., features) of the samples onto a circle, then calculates the coordinates of each sample using Hooke's law from physics and maps the sample into the circle. The visualization quality of Radviz depends on the order of the dimensions on the circle; an improper order often causes problems such as sample clusters being displayed too densely, samples being concentrated near the center of the circle, overlapping clusters, and clutter. Conventional Radviz dimension ordering methods include random ordering (see non-patent document 1 below), ordering based on similar dimensions (see non-patent documents 2 and 3 below), and t-statistic ordering based on dimension means (see non-patent document 4 below). However, each of these prior-art ordering methods has disadvantages: with random ordering, the visualization quality is itself random; ordering based on similar dimensions places similar dimensions together but cannot guarantee that the sample clusters associated with those dimensions lie close to them; and ordering based on dimension means associates samples with the dimensions on which they have larger values but does not consider the similarity between dimensions.

Reference list

[Non-patent document 1]: P. Hoffman, G. Grinstein, K. Marx, I. Grosse and E. Stanley, "DNA visual and analytic data mining," in Proceedings of the 8th Conference on Visualization '97, pages 437-ff., Los Alamitos, CA, USA, 1997.

[Non-patent document 2]: L. Di Caro, V. Frias-Martinez and E. Frias-Martinez, "Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz," in M. J. Zaki et al. (Eds.): PAKDD 2010, Part II, LNAI 6119, pp. 125-132, Springer-Verlag, Heidelberg, 2010.

[Non-patent document 3]: M. Ankerst, S. Berchtold and D. A. Keim, "Similarity Clustering of Dimensions for an Enhanced Visualization of Multidimensional Data," in InfoVis, 1998.

[Non-patent document 4]: J. Sharko, G. Grinstein and K. A. Marx, "Vectorized Radviz and its application to multiple cluster datasets."

[Non-patent document 5]: I. S. Dhillon, S. Mallela and D. S. Modha, "Information-theoretic co-clustering," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 89-98, ACM, 2003.

[Non-patent document 6]: Y. Cheng and G. M. Church, "Biclustering of expression data," in Proceedings of the International Conference on Intelligent Systems for Molecular Biology, volume 8, pages 93-103, 2000.

[Non-patent document 7]: H. Cho, I. Dhillon, Y. Guan and S. Sra, "Minimum sum-squared residue co-clustering of gene expression data," in Proceedings of the Fourth SIAM International Conference on Data Mining, vol. 114, 2004.

[Non-patent document 8]: D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, "Fully automatic cross-associations," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 79-88.

Disclosure of Invention

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood, however, that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

In view of the above circumstances, an object of the present invention is to provide a data processing method and a data processing apparatus capable of enabling a visualization method to more accurately display a data cluster structure by performing collaborative clustering on a plurality of samples having a plurality of dimensions to obtain sample clusters and dimension clusters, calculating weights of the dimension clusters according to the degree of association between each sample cluster and the dimension cluster, and ranking the dimension clusters based on the dimension cluster weights, so that when the distributions of the sample clusters and the dimension clusters are visualized, the sample clusters are close to the dimension cluster related thereto and sample clusters of different categories are significantly separated.

Fig. 1 schematically illustrates, for the Radviz visualization method, the difference between the visualization effect obtained by randomly ordering the dimensions (left side of fig. 1) and the visualization effect obtained by applying the present invention (right side of fig. 1). It can be seen that, when the dimensions are ordered randomly, the sample clusters are displayed too densely for the user to see which dimensions each sample cluster is related to. According to the invention, by properly ordering the dimensions, each visualized sample cluster lies close to the dimensions related to it and different sample clusters are well separated, so that a user can infer the category of a sample cluster from the names of the dimensions near it.

According to an aspect of the present invention, there is provided a data processing method including: a clustering step, in which a plurality of samples with a plurality of dimensions are subjected to collaborative clustering to obtain a first number of sample clusters, a second number of dimension clusters and objective function values, wherein the objective function values represent information relationships before and after clustering; a weight calculation step of calculating a weight representing a degree of association between each sample cluster and each dimension cluster based on the obtained first number of sample clusters, second number of dimension clusters, and objective function values; a dimension cluster sorting step of sorting the dimension clusters based on the calculated weights so that when the distributions of the dimension clusters and the sample clusters are visualized, the sample cluster with the highest degree of association with each dimension cluster is distributed near the dimension cluster and different sample clusters are separated from each other; and a visualization step of visualizing the distribution of the dimensions and the samples based on the ranking of the dimension clusters determined in the dimension cluster ranking step.

According to a preferred embodiment of the present invention, the dimension cluster sorting step further comprises: a dimension cluster allocation sub-step of, for each dimension cluster, determining a sample cluster having the highest degree of association with the dimension cluster based on the determined weight and allocating the dimension cluster to the determined sample cluster; and a first sorting sub-step of sorting the dimension clusters based on the result of the allocation in the dimension cluster allocation sub-step so that all the dimension clusters allocated to the same sample cluster are arranged at adjacent positions.

According to another preferred embodiment of the present invention, the dimension cluster sorting step further comprises: a second sorting sub-step of sorting the dimension clusters that are arranged at adjacent positions and allocated to the same sample cluster, based on the weight of each of those dimension clusters with respect to that sample cluster.

According to another preferred embodiment of the invention, radial coordinate visualization (Radviz) is used in the visualization step to visualize the distribution of samples and dimensions.

According to another preferred embodiment of the present invention, the visualizing step may further comprise: a dimension cluster arrangement substep of arranging the dimension clusters on a circle according to the determined sequence; a sample coordinate calculation sub-step of calculating coordinates of each sample of the plurality of samples within a circle based on arrangement of the dimension clusters on the circle; and a visualization sub-step of visualizing the distribution of the dimensions and the samples based on the arrangement of the dimension clusters and the coordinates of the samples.

According to another aspect of the present invention, there is also provided a data processing apparatus comprising: a clustering unit configured to perform collaborative clustering on a plurality of samples having a plurality of dimensions to obtain a first number of sample clusters, a second number of dimension clusters, and objective function values, wherein the objective function values represent information relationships before and after clustering; a weight calculation unit configured to calculate a weight based on the obtained first number of sample clusters, the second number of dimension clusters, and the objective function value, the weight representing a degree of association between each sample cluster and each dimension cluster; a dimension cluster sorting unit configured to sort the dimension clusters based on the calculated weights such that when the distributions of the dimension clusters and the sample clusters are visualized, the sample cluster with the highest degree of association with each dimension cluster is distributed near the dimension cluster and different sample clusters are separated from each other; and a visualization unit configured to visualize the distribution of the dimensions and the samples based on the ranking of the dimension clusters determined by the dimension cluster ranking unit.

According to still another aspect of an embodiment of the present invention, there is also provided a storage medium including a program code readable by a machine, which, when executed on an information processing apparatus, causes the information processing apparatus to execute a data processing method according to the present invention.

Furthermore, according to still another aspect of embodiments of the present invention, there is also provided a program product including machine-executable instructions that, when executed on an information processing apparatus, cause the information processing apparatus to execute a data processing method according to the present invention.

Therefore, according to the embodiment of the present invention, the effect of visualizing high-dimensional data can be improved, so that the sample clusters are distributed close to the relevant dimension clusters thereof, and different sample clusters are obviously separated.

Additional aspects of embodiments of the present invention are set forth in the description section that follows, wherein the detailed description is presented to fully disclose preferred embodiments of the present invention and not to limit it.

Drawings

The invention may be better understood by referring to the detailed description presented below in conjunction with the following drawings, in which like or similar reference numerals are used throughout the figures to indicate like or similar parts. The accompanying drawings, which are incorporated in and form a part of this specification together with the detailed description below, further illustrate the preferred embodiments of the present invention and explain the principles and advantages of the invention. In the drawings:

FIG. 1 is a diagram schematically illustrating an example of visualization effects according to the prior art and according to the present invention;

FIG. 2 is a flowchart illustrating an example process of a data processing method according to an embodiment of the invention;

FIG. 3 is a flowchart showing an example of a specific processing operation of the weight calculation step shown in FIG. 2;

FIG. 4 is a flowchart showing an example of a specific processing operation of the dimension cluster sorting step shown in FIG. 2;

FIG. 5 is a flowchart showing an example of a specific processing operation of the visualization step shown in FIG. 2;

FIG. 6 is a block diagram showing an example configuration of a data processing apparatus according to an embodiment of the present invention;

FIG. 7 is a block diagram showing a detailed configuration example of the weight calculation unit shown in FIG. 6;

FIG. 8 is a block diagram showing a detailed configuration example of the dimension cluster sorting unit shown in FIG. 6;

FIG. 9 is a block diagram showing a detailed configuration example of the visualization unit shown in FIG. 6; and

FIG. 10 is a block diagram showing an example configuration of a personal computer as an information processing apparatus employed in the embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.

It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.

A data processing method and a data processing apparatus according to an embodiment of the present invention will be described below with reference to fig. 1 to 10.

First, an exemplary process flow of a data processing method according to an embodiment of the present invention will be described with reference to fig. 2.

As shown in fig. 2, the data processing method according to the present invention may include a clustering step S210, a weight calculation step S212, a dimension cluster sorting step S214, and a visualization step S216. The processing in each step will be described in detail below.

First, in a clustering step S210, a plurality of samples having a plurality of dimensions are cooperatively clustered to obtain a first number of sample clusters, a second number of dimension clusters, and objective function values, where the objective function values represent information relationships before and after clustering.

As an example, it is assumed that the input data is an m × n data matrix, that is, the input data includes m samples and n dimensions, wherein the raw data needs to be preprocessed to form the data matrix, rows of the matrix represent the samples, and columns of the matrix represent the dimensions, that is, each sample is represented by n dimensions (i.e., features). Here, a collaborative clustering algorithm is used to cluster rows and columns (i.e., samples and dimensions) of the data matrix, respectively, because although only the dimension clusters need to be sorted, all outputs of the collaborative clustering are needed to calculate the weights of the dimension clusters, and thus a unidirectional clustering method cannot be used. The output after co-clustering the input data matrix includes k (i.e., a first number) sample clusters and l (i.e., a second number) dimension clusters and objective function values representing information relationships before and after clustering.

Preferably, as an example, information theory-based collaborative clustering (ITCC) is adopted as a specific collaborative clustering algorithm in the present invention, and in this case, the objective function value represents mutual information loss before and after clustering.

ITCC measures the quality of a clustering result by the mutual information loss between the input matrix and the output matrix obtained by clustering; the smaller the mutual information loss, the better the clustering. Let $\hat{x}$ denote the index of a sample cluster (row cluster) after clustering and $\hat{y}$ denote the index of a dimension cluster (column cluster) after clustering. The $k$ sample clusters in the clustering output are expressed as $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k\}$, the $l$ dimension clusters are expressed as $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_l\}$, and the mutual information loss can be represented by the following objective function (see the above-mentioned non-patent document 5):

$$I(X;Y) - I(\hat{X};\hat{Y}) = D\big(p(X,Y)\,\|\,q(X,Y)\big)$$

where $p(X,Y)$ is the joint distribution given by the normalized input matrix, $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence, and $q(x,y) = p(\hat{x},\hat{y})\,p(x\mid\hat{x})\,p(y\mid\hat{y})$ for $x \in \hat{x}$, $y \in \hat{y}$.

It should be understood, however, that the ITCC algorithm described above is merely an example and not a limitation. Those skilled in the art may employ other collaborative clustering algorithms to cluster the input data matrix, such as an algorithm whose objective function is based on the mean squared residue (see non-patent document 6 described above), an algorithm whose objective function is based on the sum-squared residue (see non-patent document 7 described above), or an algorithm whose objective function is based on code length (see non-patent document 8 described above), as long as the algorithm outputs sample clusters and dimension clusters and its objective function can be expressed as a sum of sub-portions, one for each pair of a sample cluster and a dimension cluster.
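As an illustration of the mutual information loss used as the ITCC objective, the following is a minimal sketch rather than the patent's implementation. It assumes a non-negative data matrix that is normalized into a joint distribution, and cluster assignments supplied as integer label lists; the function names and the label encoding are hypothetical.

```python
import numpy as np

def mutual_information(p):
    """Mutual information (in nats) of a joint distribution given as a 2-D array."""
    px = p.sum(axis=1, keepdims=True)   # row marginal, shape (rows, 1)
    py = p.sum(axis=0, keepdims=True)   # column marginal, shape (1, cols)
    mask = p > 0                        # zero cells contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def mutual_information_loss(data, row_labels, col_labels, k, l):
    """I(X;Y) - I(X_hat;Y_hat) for a given co-clustering (assumed encoding:
    row_labels[r] in 0..k-1 and col_labels[c] in 0..l-1)."""
    p = np.asarray(data, dtype=float)
    p = p / p.sum()                     # joint distribution p(x, y)
    R = np.eye(k)[row_labels]           # one-hot sample-cluster membership, m x k
    C = np.eye(l)[col_labels]           # one-hot dimension-cluster membership, n x l
    p_hat = R.T @ p @ C                 # clustered joint distribution, k x l
    return mutual_information(p) - mutual_information(p_hat)
```

A co-clustering that matches the block structure of the matrix loses no mutual information, whereas one that mixes the blocks loses some; a smaller value therefore indicates a better clustering.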

Next, in the weight calculation step S212, a weight representing the degree of association between each dimension cluster and each sample cluster is calculated based on the first number of sample clusters, the second number of dimension clusters, and the objective function value obtained in the clustering step S210.

An example of a specific processing procedure of the weight calculation step S212 will be described in detail below with reference to fig. 3.

As shown in fig. 3, the weight calculation step S212 may further include a normalization sub-step S310 and a weight determination sub-step S312. The processing in each step will be described in detail below.

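As described above, since the objective function can be expressed as a sum of sub-portions, one for each pair of a sample cluster and a dimension cluster, the objective function value is the sum of the contribution values of the sub-portions. Taking the $k$ sample clusters and the $l$ dimension clusters obtained by the ITCC algorithm as an example, the objective function is the sum of $k \times l$ sub-portions, and the equivalent transformation is as follows, where $q(x,y) = p(\hat{x},\hat{y})\,p(x\mid\hat{x})\,p(y\mid\hat{y})$ as in non-patent document 5:

$$
\begin{aligned}
D\big(p(X,Y)\,\|\,q(X,Y)\big)
&= \sum_{\hat{x}} \sum_{\hat{y}} \sum_{x \in \hat{x}} \sum_{y \in \hat{y}} p(x,y) \log \frac{p(x,y)}{q(x,y)} \\
&= \sum_{x \in \hat{x}_1} \sum_{y \in \hat{y}_1} p(x,y) \log \frac{p(x,y)}{q(x,y)}
 + \sum_{x \in \hat{x}_1} \sum_{y \in \hat{y}_2} p(x,y) \log \frac{p(x,y)}{q(x,y)}
 + \dots
 + \sum_{x \in \hat{x}_k} \sum_{y \in \hat{y}_l} p(x,y) \log \frac{p(x,y)}{q(x,y)}
\end{aligned}
$$

As can be appreciated from the above equivalent transformation, each sub-portion can be represented as $\sum_{x \in \hat{x}} \sum_{y \in \hat{y}} p(x,y) \log \frac{p(x,y)}{q(x,y)}$. The value of a sub-portion can be understood as the contribution of one sample cluster $\hat{x}$ and one dimension cluster $\hat{y}$ to the mutual information loss, and is denoted here by $c(\hat{x}, \hat{y})$.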
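The decomposition of the objective into one sub-portion per pair of sample cluster and dimension cluster can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's code: it uses the approximating distribution q from non-patent document 5, assumes that no sample, dimension, or cluster has zero total mass, and uses a hypothetical function name and label encoding.

```python
import numpy as np

def subpart_contributions(data, row_labels, col_labels, k, l):
    """Return the k x l matrix of sub-portion values whose sum is D(p || q).

    Assumed encoding: row_labels[r] in 0..k-1, col_labels[c] in 0..l-1.
    """
    p = np.asarray(data, dtype=float)
    p = p / p.sum()                          # joint distribution p(x, y)
    R = np.eye(k)[row_labels]                # m x k one-hot membership
    C = np.eye(l)[col_labels]                # n x l one-hot membership
    p_hat = R.T @ p @ C                      # p(x_hat, y_hat), k x l
    px, py = p.sum(axis=1), p.sum(axis=0)    # marginals p(x), p(y)
    px_hat, py_hat = R.T @ px, C.T @ py      # cluster marginals
    # q(x, y) = p(x_hat, y_hat) * p(x | x_hat) * p(y | y_hat)
    q = (R @ p_hat @ C.T) * np.outer(px / (R @ px_hat), py / (C @ py_hat))
    terms = np.zeros_like(p)
    mask = p > 0                             # cells with p = 0 contribute 0
    terms[mask] = p[mask] * np.log(p[mask] / q[mask])
    return R.T @ terms @ C                   # sum the terms inside each block
```

Summing all entries of the returned matrix recovers the mutual information loss, so each entry can be read as the share of the loss attributable to one sample cluster and one dimension cluster.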

In the normalization substep S310, the contribution values of the above-mentioned respective subsections are normalized.

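Preferably, as an example, in the normalization sub-step S310, the contribution values are normalized on a per-sample-cluster basis, that is, for each sample cluster, the contribution values of the sub-portions associated with that sample cluster and each of the second number of dimension clusters are normalized together.

Specifically, a linear conversion function is used here, for example, to limit the contribution values of the above sub-portions to the range of 0 to 1. Writing $c(\hat{x}, \hat{y}_i)$ for the contribution value of the sub-portion for a sample cluster $\hat{x}$ and a dimension cluster $\hat{y}_i$, the normalized contribution value $c'(\hat{x}, \hat{y}_i)$ can be expressed as follows:

$$c'(\hat{x}, \hat{y}_i) = \frac{c(\hat{x}, \hat{y}_i) - c_{\min}(\hat{x})}{c_{\max}(\hat{x}) - c_{\min}(\hat{x})}, \qquad 1 \le i \le l$$

where $c_{\min}(\hat{x}) = \min_{1 \le i \le l} c(\hat{x}, \hat{y}_i)$ represents the minimum contribution value with respect to the sample cluster $\hat{x}$, and $c_{\max}(\hat{x}) = \max_{1 \le i \le l} c(\hat{x}, \hat{y}_i)$ represents the maximum contribution value with respect to the sample cluster $\hat{x}$.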
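The per-sample-cluster normalization can be sketched as below. The sketch assumes that the linear conversion is a min-max rescaling of each row of a k × l contribution matrix (rows being sample clusters); the function name and the handling of a constant row are assumptions made for illustration.

```python
import numpy as np

def normalize_contributions(contrib):
    """Rescale each row (one sample cluster) of a k x l matrix to [0, 1]."""
    c = np.asarray(contrib, dtype=float)
    lo = c.min(axis=1, keepdims=True)        # minimum contribution per sample cluster
    hi = c.max(axis=1, keepdims=True)        # maximum contribution per sample cluster
    # Assumption: a row whose values are all equal maps to zeros instead of 0/0.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (c - lo) / span
```

After this step the smallest contribution in each row becomes 0 and the largest becomes 1, which makes the dimension clusters directly comparable within one sample cluster.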

It should be understood that this normalization method is merely an example and not a limitation, and one skilled in the art may employ any other method to normalize the contribution values of the various sub-portions, as long as the normalized contribution values facilitate comparing the mutual information loss of all the dimensional clusters with respect to each sample cluster.

Furthermore, it should be understood that, since the contribution value measures the mutual information loss, the larger the contribution value of a sub-portion, the more the sample cluster and the dimension cluster constituting that sub-portion contribute to the mutual information loss, which indicates a lower degree of association between that sample cluster and that dimension cluster.

Based on this understanding, preferably, in the weight determination sub-step S312, the weight associated with each sub-portion is determined based on the contribution value of the sub-portion normalized in the normalization sub-step S310, such that the weight is inversely related to the contribution value.

As an example, the weights are determined here by a simple exchange process: for a particular sample cluster, the weight of the dimension cluster with the largest contribution value is set equal to the smallest contribution value, and the weight of the dimension cluster with the smallest contribution value is set equal to the largest contribution value; the weight of the dimension cluster with the second largest contribution value is set equal to the second smallest contribution value, and the weight of the dimension cluster with the second smallest contribution value is set equal to the second largest contribution value; and so on. For a sample cluster $\hat{x}$ with normalized contribution values $c'$, this exchange process can be expressed as follows:

$$w\big(\hat{x}, \hat{y}_{(t)}\big) = c'\big(\hat{x}, \hat{y}_{(l-t+1)}\big), \qquad 1 \le t \le l$$

where $\hat{y}_{(t)}$ is the dimension cluster whose contribution value is at the $t$-th position when the contribution values for $\hat{x}$ are arranged in descending order, and $\hat{y}_{(l-t+1)}$ is the dimension cluster whose contribution value is at the $(l-t+1)$-th position in that order, with positions counted from 1.
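The exchange of contribution values into weights can be sketched as follows, assuming the input is a k × l matrix of normalized contribution values with one row per sample cluster. The function name is hypothetical, and ties are resolved in whatever order `numpy.argsort` returns.

```python
import numpy as np

def swap_weights(norm_contrib):
    """Within each row, give the dimension cluster with the largest contribution
    the smallest value, the second largest the second smallest, and so on."""
    c = np.asarray(norm_contrib, dtype=float)
    w = np.empty_like(c)
    for i, row in enumerate(c):
        order = np.argsort(row)              # column indices, ascending contribution
        w[i, order] = row[order[::-1]]       # rank-reversed values become the weights
    return w
```

Because the values are only permuted within a row, the weights keep the same range as the normalized contributions while reversing their ranking, so a high weight signals a strong association.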

Through the above weight determination process, weights representing the degree of association between each sample cluster and each dimension cluster are obtained, so that the objective function value obtained after the collaborative clustering is converted into a $k \times l$ weight matrix, which can be represented by the following table, where $W_{ij}$ denotes the weight of dimension cluster $\hat{y}_j$ with respect to sample cluster $\hat{x}_i$:

$$
\begin{array}{c|cccc}
 & \hat{y}_1 & \hat{y}_2 & \cdots & \hat{y}_l \\
\hline
\hat{x}_1 & W_{11} & W_{12} & \cdots & W_{1l} \\
\hat{x}_2 & W_{21} & W_{22} & \cdots & W_{2l} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\hat{x}_k & W_{k1} & W_{k2} & \cdots & W_{kl}
\end{array}
$$

It can be seen that, in the above processing steps, for each particular sample cluster, normalizing the weights of all dimension clusters to values between 0 and 1 and the sum of the weights of all dimension clusters to 1 facilitates intuitive comparison of the degrees of association of all the dimension clusters with that sample cluster. It should be understood that the weight determination process described with reference to fig. 3 is only an example and not a limitation, and those skilled in the art may employ other methods to determine the weight with respect to each dimension cluster and each sample cluster, as long as the weight reflects the degree of association between the dimension cluster and the sample cluster. For example, instead of being negatively correlated with the contribution values, the weights may be positively correlated with them, in which case a larger weight indicates a lower degree of association between the dimension cluster and the sample cluster.

Next, referring back to fig. 2, after the weights for the respective sample clusters and the respective dimension clusters are determined, in a dimension cluster sorting step S214, the dimension clusters are sorted based on the weights calculated in the weight calculation step S212 so that when the dimension clusters and the distribution of the sample clusters are visualized, the sample cluster with the highest degree of association with each dimension cluster is distributed in the vicinity of the dimension cluster and different sample clusters are separated from each other. The processing in the dimension cluster sorting step will be described in detail below with reference to fig. 4 as an example.

First, in the dimension cluster allocation sub-step S410, for each dimension cluster, based on the determined weight, a sample cluster having the highest degree of association with the dimension cluster is determined and the dimension cluster is allocated to the determined sample cluster. It should be understood that this is because one sample cluster may be simultaneously related to multiple dimensional clusters, i.e., one sample cluster may be represented with multiple features.

Next, in the first sorting sub-step S412, the dimension clusters are sorted based on the allocation result of the dimension cluster allocation sub-step S410 so that all the dimension clusters allocated to the same sample cluster are arranged at adjacent positions.

For example, assume that there are three sample clusters $\hat{x}_1$, $\hat{x}_2$ and $\hat{x}_3$ and eight dimension clusters $\hat{y}_1$ to $\hat{y}_8$, and that in the dimension cluster allocation sub-step S410 the dimension clusters are divided into a first set allocated to sample cluster $\hat{x}_1$ (including dimension clusters $\hat{y}_1$, $\hat{y}_3$ and $\hat{y}_5$), a second set allocated to sample cluster $\hat{x}_2$, and a third set allocated to sample cluster $\hat{x}_3$. Then, when the dimension clusters are sorted in the first sorting sub-step S412, the dimension clusters of the first set are arranged adjacent to one another, as are those of the second set and those of the third set.

It should be understood that, according to the process in the first sorting sub-step S412, the relative sorting between different sets of the three sets of dimensional clusters and the sorting of the dimensional clusters in each set of dimensional clusters may be arbitrary as long as it is ensured that the dimensional clusters in the same set are adjacently arranged together.
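The allocation sub-step and the first sorting sub-step can be sketched as follows, assuming a k × l weight matrix in which a larger weight means a stronger association. The function name and the choice to list the groups in sample cluster order are illustrative assumptions; as noted above, any relative order of the groups is acceptable.

```python
import numpy as np

def group_dimension_clusters(weights):
    """Allocate each dimension cluster (column) to the sample cluster (row) with
    the highest weight, then list the columns so that each group is contiguous."""
    w = np.asarray(weights, dtype=float)
    owner = w.argmax(axis=0)                 # owning sample cluster of each column
    order = [int(j)
             for i in range(w.shape[0])      # groups taken in sample cluster order
             for j in np.flatnonzero(owner == i)]
    return owner, order
```

For a 2 × 4 weight matrix in which columns 0 and 2 are dominated by the first sample cluster and columns 1 and 3 by the second, the resulting order places columns 0 and 2 next to each other, followed by columns 1 and 3.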

Preferably, in order to further improve the quality of subsequent visualization, all the dimension clusters included in each set of dimension clusters may be further sorted. Thus, the dimension cluster sorting step may further comprise a second sorting substep S414.

In the second sorting sub-step S414, for adjacently arranged dimension clusters assigned to the same sample cluster determined in the first sorting sub-step S412, the dimension clusters are further sorted based on their weights with respect to the sample cluster.

For example, for the first set of dimension clusters $\hat{y}_1$, $\hat{y}_3$ and $\hat{y}_5$ allocated to sample cluster $\hat{x}_1$ as described above, suppose that the weights of $\hat{y}_1$, $\hat{y}_3$ and $\hat{y}_5$ with respect to sample cluster $\hat{x}_1$ are W₁₁, W₁₃ and W₁₅, respectively, and that W₁₃ > W₁₁ > W₁₅, so that the ordering of the dimension clusters within the first set is $\hat{y}_3$, $\hat{y}_1$ and $\hat{y}_5$ in turn. Similarly, the orderings within the second and third sets of dimension clusters are determined from their weights with respect to sample clusters $\hat{x}_2$ and $\hat{x}_3$, respectively.
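The second sorting sub-step amounts to a descending sort within each group. The sketch below uses made-up numerical weights that merely satisfy W₁₃ > W₁₁ > W₁₅; the function name and the dictionary layout are hypothetical.

```python
def sort_group_by_weight(group, weights_for_cluster):
    """Order the dimension clusters of one group by descending weight with
    respect to the sample cluster the group is allocated to."""
    return sorted(group, key=lambda j: weights_for_cluster[j], reverse=True)

# Hypothetical weights of dimension clusters 1, 3 and 5 for sample cluster 1.
example_weights = {1: 0.6, 3: 0.9, 5: 0.2}
first_set_order = sort_group_by_weight([1, 3, 5], example_weights)
```

With these illustrative values, `first_set_order` lists dimension cluster 3 first, then 1, then 5, matching the ordering described in the text.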

The processing of the dimension cluster sorting step described with reference to fig. 4 can be implemented, for example, by the following algorithm:

    for each sample cluster i
        order the dimension clusters in descending order according to the weights to form an array:
            important_dim_clusters[i]
    for each dimension cluster j
        for each sample cluster i
            if important_dim_clusters[i][j] is assigned to a sample cluster ai
                if weight of (i, important_dim_clusters[i][j]) > weight of (ai, important_dim_clusters[i][j])
                    remove important_dim_clusters[i][j] from ai
                else continue
            assign important_dim_clusters[i][j] to sample cluster i

According to the algorithm described above, all the dimension clusters assigned to each sample cluster can be determined and further sorted in descending order of weight.

The above algorithm is only an example algorithm for implementing the dimension cluster sorting step described above, and those skilled in the art can adopt other algorithms to perform dimension cluster sorting according to the above principle to realize the visualization effect that the sample cluster is close to the dimension cluster related thereto and different sample clusters are separated.
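One possible Python rendering of the example algorithm is sketched below. It assumes a k × l weight matrix with rows as sample clusters and columns as dimension clusters, and it returns, for each sample cluster, the dimension clusters assigned to it in descending order of weight. The function name and the data structures are assumptions made for illustration.

```python
import numpy as np

def order_dimension_clusters(weights):
    """Greedy assignment following the example algorithm: walk down every sample
    cluster's ranked list in parallel and keep each dimension cluster with the
    sample cluster for which its weight is highest."""
    w = np.asarray(weights, dtype=float)
    k, l = w.shape
    # important_dim_clusters[i]: dimension clusters sorted by descending weight for i
    important = [np.argsort(-w[i], kind="stable") for i in range(k)]
    assigned_to = {}                         # dimension cluster -> sample cluster
    members = [[] for _ in range(k)]         # assignment lists, one per sample cluster
    for j in range(l):                       # rank position in the sorted lists
        for i in range(k):
            d = int(important[i][j])
            a = assigned_to.get(d)
            if a is not None:
                if w[i, d] <= w[a, d]:
                    continue                 # current owner has the higher weight
                members[a].remove(d)         # take d away from the weaker owner
            assigned_to[d] = i
            members[i].append(d)
    return members

# Flattening the groups gives the order in which dimension clusters go on the circle.
groups = order_dimension_clusters([[0.9, 0.1, 0.8, 0.2], [0.3, 0.7, 0.1, 0.6]])
circle_order = [d for group in groups for d in group]
```

Because each sample cluster appends dimension clusters in the order of its own ranked list, every group ends up sorted by descending weight without a separate sorting pass.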

In order to more intuitively represent the relationship between the dimension clusters and the sample clusters in a graphical manner, thereby facilitating the determination of the sample category, the distribution of the dimension clusters and the sample clusters is visualized according to the determined ordering of the dimension clusters by adopting a proper visualization algorithm.

Referring back to fig. 2, in the visualization step S216, the distribution of the dimensions and samples is visualized based on the ranking of the dimension clusters determined in the dimension cluster ranking step S214. Preferably, radial coordinate visualization (Radviz) is employed in the visualization step S216.

The processing operation of the visualization step will be described in detail below with reference to fig. 5. As shown in fig. 5, the visualization step S216 may further include a dimension cluster arrangement substep S510, a sample coordinate calculation substep S512, and a visualization substep S514.

First, in the dimension cluster arrangement substep S510, the dimension clusters are arranged on a circle in the determined order.

Specifically, as an example, after the ordering of the dimension clusters is determined, all the dimension clusters are arranged on a circle in the counterclockwise direction, starting from the position where the arc angle is zero. Taking the dimension clusters determined above as an example, the dimension clusters are arranged on the circle in the counterclockwise direction according to their determined order. It should be understood that, in the present invention, since the number of dimensions included in each dimension cluster may be greater than 1, and only the relative order between the dimension clusters is determined when the dimension clusters are sorted, the ordering of the dimensions within one dimension cluster may be arbitrary, provided that, when visualization is performed, all dimensions in the same dimension cluster are mapped to adjacent positions on the circle.
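As an illustration of this arrangement, the following Python sketch places dimensions on a unit circle counterclockwise from angle zero. The equal angular spacing, the unit radius, the function name `place_dimensions_on_circle`, and the dimension names are assumptions made for this example; the text above only requires that dimensions of the same cluster occupy adjacent positions.

```python
import math


def place_dimensions_on_circle(ordered_dim_clusters):
    """Hypothetical helper: ordered_dim_clusters is a list of dimension
    clusters (each a list of dimension names) in the determined order.
    Returns a dict mapping each dimension to (x, y) on the unit circle."""
    # Flattening keeps all dimensions of one cluster adjacent; the order
    # inside a cluster is simply the order in which they were given.
    dims = [d for cluster in ordered_dim_clusters for d in cluster]
    count = len(dims)
    anchors = {}
    for position, dim in enumerate(dims):
        # Counterclockwise from angle zero, equally spaced (assumption).
        angle = 2.0 * math.pi * position / count
        anchors[dim] = (math.cos(angle), math.sin(angle))
    return anchors


if __name__ == "__main__":
    for name, xy in place_dimensions_on_circle([["a", "b"], ["c", "d"]]).items():
        print(name, xy)
```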

Next, in the sample coordinate calculation sub-step S512, the coordinates of each sample within the circle are calculated based on the arrangement of the dimension clusters on the circle. Specifically, from the coordinates of the respective dimension clusters on the circle determined in the dimension cluster arrangement substep S510, the mapping formula of Radviz is employed to calculate the coordinates of each sample within the circle. The specific calculation method of the sample coordinates is well known to those skilled in the art and will not be described herein.
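For readers unfamiliar with Radviz, the commonly used mapping places each sample at the weighted average of the dimension anchor positions, with the sample's values (scaled to the range 0 to 1 per dimension) serving as the weights. The sketch below shows that standard mapping; it is not taken from this document, and the function name `radviz_point` and the choice to place an all-zero sample at the center are assumptions.

```python
def radviz_point(values, anchors):
    """Standard Radviz mapping (sketch).

    values  : one sample's dimension values, assumed already scaled to [0, 1]
    anchors : list of (x, y) anchor positions on the circle, in the same
              order as the values
    """
    total = sum(values)
    if total == 0:
        # No dimension attracts the sample; put it at the center (assumption).
        return (0.0, 0.0)
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)


if __name__ == "__main__":
    square = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
    print(radviz_point([1.0, 1.0, 0.0, 0.0], square))
```

Because each sample is pulled toward the anchors of the dimensions on which it has large values, placing related dimension clusters next to each other tends to keep samples of the same cluster together.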

In the visualization substep S514, the distribution of the dimensions and the samples is visualized based on the arrangement of the dimension clusters and the coordinates of the samples.

The distribution of samples and dimensions obtained after visualization with Radviz according to an embodiment of the invention is shown on the right side of fig. 1. It can be seen that the samples are distributed close to the dimensions associated with them and the different sample categories are clearly distinguished, so that the user can intuitively and accurately determine the categories of the samples according to the visualized samples and the dimension distribution map, thereby facilitating the user to efficiently select a desired service category when selecting among a large number of web services, for example.

It should be understood that, although the process of visualizing the distribution of samples and dimensions is described above in terms of Radviz, other visualization methods, such as parallel coordinate visualization, may also be applied by one skilled in the art in accordance with the principles of the present invention.

Although the above describes in detail an example of a data processing method according to an embodiment of the present invention with reference to fig. 1 to 5, it should be understood by those skilled in the art that the flow chart shown in the drawings is only exemplary, and the flow of the above method may be modified accordingly according to actual application and specific requirements. Further, it is to be understood that the above examples are not to be construed as limiting the invention and that the skilled person can apply suitable modifications to the above described processes to other applications based on the principles taught.

Next, a functional configuration example of the data processing apparatus according to the present invention will be described in detail with reference to fig. 6 to 9.

First, referring to fig. 6, the data processing apparatus 600 according to an embodiment of the present invention may include a clustering unit 610, a weight calculating unit 612, a dimension cluster sorting unit 614, and a visualization unit 616. The functional configurations of the respective units will be described in detail below, respectively.

Clustering unit 610 may be configured to co-cluster a plurality of samples having a plurality of dimensions to obtain a first number of sample clusters, a second number of dimension clusters, and objective function values representing information relationships before and after clustering.

Preferably, the clustering unit 610 may employ, for example, an information theory-based collaborative clustering (ITCC) algorithm to perform collaborative clustering on m samples having n dimensions to obtain k sample clusters, l dimension clusters, and objective function values representing mutual information loss before and after clustering. As described above, the objective function value may be expressed in the form of a sum of sub-parts (k × l sub-parts) with respect to each sample cluster and each dimension cluster. The equivalent transformation process may refer to the description in the above method embodiments, and is not described herein again.
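To make the decomposition into k × l sub-parts concrete, the following Python sketch computes, for a given co-clustering, the per-pair contributions to the mutual information loss in the form commonly used for ITCC, where the loss equals the Kullback-Leibler divergence between the joint distribution p and its co-cluster approximation q. It is offered under the assumption that this matches the equivalent transformation referred to above; the function names and the input layout (a joint probability table plus cluster index lists) are hypothetical.

```python
import math


def mutual_information(p):
    """Mutual information (natural log) of a joint probability table p."""
    p_row = [sum(row) for row in p]
    p_col = [sum(row[y] for row in p) for y in range(len(p[0]))]
    return sum(
        p[x][y] * math.log(p[x][y] / (p_row[x] * p_col[y]))
        for x in range(len(p)) for y in range(len(p[0])) if p[x][y] > 0
    )


def cocluster_contributions(p, row_cluster, col_cluster, k, l):
    """Return a k x l table whose entries sum to the mutual information loss.

    p           : m x n joint probability table (entries sum to 1)
    row_cluster : row_cluster[x] is the sample cluster index of sample x
    col_cluster : col_cluster[y] is the dimension cluster index of dimension y
    """
    m, n = len(p), len(p[0])
    p_row = [sum(p[x]) for x in range(m)]
    p_col = [sum(p[x][y] for x in range(m)) for y in range(n)]
    p_rc, p_cc = [0.0] * k, [0.0] * l
    p_block = [[0.0] * l for _ in range(k)]
    for x in range(m):
        p_rc[row_cluster[x]] += p_row[x]
        for y in range(n):
            p_block[row_cluster[x]][col_cluster[y]] += p[x][y]
    for y in range(n):
        p_cc[col_cluster[y]] += p_col[y]

    contrib = [[0.0] * l for _ in range(k)]
    for x in range(m):
        a = row_cluster[x]
        for y in range(n):
            b = col_cluster[y]
            if p[x][y] > 0:
                # q(x, y) = p(a, b) * p(x | a) * p(y | b)
                q = p_block[a][b] * (p_row[x] / p_rc[a]) * (p_col[y] / p_cc[b])
                contrib[a][b] += p[x][y] * math.log(p[x][y] / q)
    return contrib
```

Summing all entries of the returned table gives the total loss; with every sample and dimension in its own cluster the loss is zero, and with a single sample cluster and a single dimension cluster it equals the full mutual information.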

It should be understood that collaborative clustering algorithms other than ITCC may also be employed, such as those whose objective function is based on residual mean square, residual sum of squares, code length, etc., and the objective function values of these collaborative clustering algorithms may also be expressed in the form of a sum of individual sub-parts.

The weight calculation unit 612 may be configured to calculate a weight representing a degree of association between each dimension cluster and each sample cluster based on the resulting first number of sample clusters, second number of dimension clusters, and objective function values.

An example of the functional configuration of the weight calculation unit 612 will be described in detail below with reference to fig. 7.

Referring to fig. 7, the weight calculation unit 612 may further include a normalization module 710 and a weight determination module 712.

In particular, the normalization module 710 may be configured to normalize the contribution value of each sub-portion obtained by equivalently transforming the objective function, the contribution value of each sub-portion representing the contribution of the sample cluster and the dimension cluster of that sub-portion to the objective function value (e.g., mutual information loss).

Preferably, the normalization module 710 can be configured to normalize, on a per sample cluster basis, the contribution values of the sub-portions associated with the sample cluster and each of the second number of dimension clusters.

For a specific normalization process, reference may be made to corresponding descriptions in the above method embodiments, and details are not described herein again.

The weight determination module 712 may be configured to determine the weight associated with each sub-portion based on the normalized contribution value of that sub-portion such that the weight is inversely related to the contribution value.

It should be understood that, as described above, since the contribution value represents the contribution of one sample cluster and one dimension cluster to the mutual information loss, a larger contribution value indicates a lower degree of association between the sample cluster and the dimension cluster. Thus, the processing performed by the normalization module 710 and the weight determination module 712 makes the weights inversely related to the contribution values, so that a larger weight represents a higher degree of association between the sample cluster and the dimension cluster. Further, because the weight values are normalized on a per sample cluster basis to values between 0 and 1, with the weights of all the dimension clusters with respect to a given sample cluster summing to 1, the degree of association between each dimension cluster and each sample cluster can be compared intuitively.
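One way to obtain weights with these properties is sketched below in Python. The exact formula is given in the method embodiment and is not reproduced in this passage, so the formula used here, w = (1 - c / sum(c)) / (l - 1) per sample cluster, is purely an assumption chosen because it is inversely related to the contribution, lies between 0 and 1, and sums to 1 over the l dimension clusters. The sketch also assumes non-negative contribution values; the function name is hypothetical.

```python
def weights_from_contributions(contrib):
    """Hypothetical weight computation.

    contrib[i][j] is the (assumed non-negative) contribution of sample
    cluster i and dimension cluster j to the objective function value.
    Returns weights[i][j] with each row summing to 1.
    """
    weights = []
    for row in contrib:
        l = len(row)
        total = sum(row)
        if l < 2 or total == 0:
            # Degenerate case (assumption): nothing to compare, use equal weights.
            weights.append([1.0 / l] * l)
            continue
        # Normalize per sample cluster, then invert so that a smaller
        # contribution (less information loss) yields a larger weight.
        normalized = [c / total for c in row]
        weights.append([(1.0 - value) / (l - 1) for value in normalized])
    return weights


if __name__ == "__main__":
    print(weights_from_contributions([[1.0, 3.0, 4.0]]))
```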

Next, referring back to fig. 6, the dimension cluster sorting unit 614 may be configured to sort the dimension clusters based on the calculated weights such that when visualizing the distribution of the dimension clusters and the sample clusters, the sample cluster most associated with each dimension cluster is distributed near the dimension cluster and different sample clusters are separated from each other. An example of the functional configuration of the dimension cluster sorting unit 614 will be described in detail below.

Referring to fig. 8, the dimension cluster sorting unit 614 may further include a dimension cluster allocation module 810, a first sorting module 812, and a second sorting module 814.

In particular, the dimension cluster assignment module 810 can be configured to, for each dimension cluster, determine a sample cluster having a highest degree of association with the dimension cluster based on the determined weight and assign the dimension cluster to the determined sample cluster.

The first ordering module 812 can be configured to order the dimension clusters based on the assignment result of the dimension cluster assignment module 810 such that all dimension clusters assigned to the same sample cluster are arranged in adjacent positions.

The dimension clusters assigned to the same sample cluster are arranged adjacent to each other through the sorting process performed by the first sorting module 812, but the specific sorting of the dimension clusters relative to each other is not determined. Therefore, preferably, to further improve the subsequent visualization quality, all the dimensional clusters assigned to the same sample cluster may be further sorted.

The second ordering module 814 can be configured to further order the adjacently positioned dimension clusters assigned to the same sample cluster, as determined by the first ordering module 812, based on their weights with respect to that sample cluster.

Preferably, the dimension cluster sorting process performed by the modules may be implemented by, for example, an algorithm described in the above method embodiment, and is not described herein again.

Referring back to fig. 6, the visualization unit 616 may be configured to visualize the distribution of dimensions and samples based on the rankings of the dimension clusters determined by the dimension cluster ranking unit 614.

Preferably, the visualization unit 616 may be configured to employ a Radviz method to visualize the dimensions and the distribution of the samples in order to facilitate the user to intuitively determine the relationship between the samples and the dimensions.

An example of the functional configuration of the visualization unit 616 will be described in detail below with reference to fig. 9. As shown in fig. 9, the visualization unit 616 may further include a dimension cluster arrangement module 910, a sample coordinate calculation module 912, and a visualization module 914. Examples of functional configurations of the respective modules will be described in detail below, respectively.

The dimension cluster arrangement module 910 may be configured to arrange the dimension clusters on a circle in the previously determined ordering. As can be seen from the above, when the dimension clusters are arranged on the circle according to the determined order, the dimension clusters assigned to the same sample cluster are arranged on the circle in sequence according to their weight values, and the ordering of the dimensions within each dimension cluster may be arbitrary as long as the dimensions in the same dimension cluster are arranged adjacently on the circle.

The sample coordinate calculation module 912 may be configured to calculate the coordinates of each sample within a circle based on the arranged coordinates of the dimensional clusters on the circle. This coordinate calculation process is well known to those skilled in the art and will not be described in detail herein.

The visualization module 914 may be configured to visualize the distribution of the dimensions and samples based on the arrangement of the dimension clusters on the circle and the coordinates of the samples. The visualization effect according to the invention may be, for example, as shown on the right side of fig. 1.

It should be appreciated that, although the visualization process is described above in terms of Radviz, other visualization algorithms may be employed as desired based on the principles of the present invention.

It should be noted that the data processing apparatus according to the embodiment of the present invention corresponds to the foregoing method embodiment, and therefore, for parts that are not described in detail in the apparatus embodiment, please refer to descriptions of corresponding positions in the method embodiment, which is not described herein again.
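The data flow between the four units of the apparatus 600 can be sketched as a simple pipeline. The Python outline below is only an architectural illustration: the class name, attribute names, and the exact values passed between units are assumptions, and each unit is represented by an arbitrary callable supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DataProcessingApparatus:
    """Hypothetical wiring of units 610, 612, 614, and 616."""
    clustering_unit: Callable[[Any], Any]                 # unit 610
    weight_calculation_unit: Callable[[Any], Any]         # unit 612
    dimension_cluster_sorting_unit: Callable[[Any], Any]  # unit 614
    visualization_unit: Callable[[Any, Any], Any]         # unit 616

    def process(self, samples: Any) -> Any:
        # 610: co-cluster samples into sample clusters, dimension clusters,
        # and an objective function value.
        clustering = self.clustering_unit(samples)
        # 612: derive sample-cluster / dimension-cluster weights.
        weights = self.weight_calculation_unit(clustering)
        # 614: order the dimension clusters based on the weights.
        ordering = self.dimension_cluster_sorting_unit(weights)
        # 616: visualize samples and dimensions using that ordering.
        return self.visualization_unit(samples, ordering)
```

Each callable can be replaced independently, which mirrors the statement above that other co-clustering or visualization algorithms may be substituted.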

Further, it should be noted that the above series of processes and means may also be implemented by software and/or firmware. In the case of implementation by software and/or firmware, a program constituting the software is installed from a storage medium or a network to a computer having a dedicated hardware structure, such as a general-purpose personal computer 1000 shown in fig. 10, which is capable of executing various functions and the like when various programs are installed.

In fig. 10, a Central Processing Unit (CPU) 1001 executes various processes in accordance with a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 to a Random Access Memory (RAM) 1003. The RAM1003 also stores data necessary when the CPU1001 executes various processes and the like, as necessary.

The CPU1001, ROM1002, and RAM1003 are connected to each other via a bus 1004. An input/output interface 1005 is also connected to the bus 1004.

The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, and the like. The communication section 1009 performs communication processing via a network such as the internet.

A drive 1010 is also connected to the input/output interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as needed, so that a computer program read out therefrom is installed into the storage section 1008 as needed.

In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1011.

It will be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1011 shown in fig. 10, in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1011 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM1002, a hard disk included in the storage section 1008, or the like, in which programs are stored and which are distributed to users together with the device including them.

It is also to be noted that the steps of executing the series of processes described above may naturally be executed chronologically according to the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Furthermore, the terms "comprises," "comprising," or any other variation thereof, in embodiments of the present invention are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

According to an embodiment of the invention, the following additional notes are also disclosed:

Supplementary note 1. a data processing method, comprising:

a clustering step, in which a plurality of samples with a plurality of dimensions are subjected to collaborative clustering to obtain a first number of sample clusters, a second number of dimension clusters and objective function values, wherein the objective function values represent information relationships before and after clustering;

a weight calculation step of calculating a weight representing a degree of association between each sample cluster and each dimension cluster based on the obtained first number of sample clusters, second number of dimension clusters, and objective function values;

a dimension cluster sorting step of sorting the dimension clusters based on the calculated weights so that, when the dimension clusters and the distribution of the sample clusters are visualized, the sample cluster with the highest degree of association with each dimension cluster is distributed near the dimension cluster and different sample clusters are separated from each other; and

a visualization step of visualizing the distribution of the dimensions and the samples based on the ranking of the dimension clusters determined in the dimension cluster ranking step.

Supplementary note 2. the data processing method according to supplementary note 1, wherein the objective function value is a sum of contribution values of a plurality of sub-portions, the number of the plurality of sub-portions being a product of the first number and the second number, and the contribution value of each sub-portion representing a contribution, to the objective function value, of the one dimension cluster and the one sample cluster of that sub-portion, and wherein the weight calculating step further comprises:

a normalization sub-step of normalizing the contribution value of each of the plurality of sub-portions; and

a weight determination sub-step of determining the weight associated with each sub-portion on the basis of the normalized contribution value of that sub-portion, so that the weight is inversely related to the contribution value.

Supplementary note 3. the data processing method according to supplementary note 2, wherein in the normalization sub-step, the contribution values of the sub-portions associated with the sample cluster and each of the second number of dimension clusters are normalized on a per sample cluster basis.

Supplementary note 4. the data processing method according to supplementary note 1, wherein the dimension cluster sorting step further includes:

a dimension cluster allocation sub-step of, for each dimension cluster, determining a sample cluster having the highest degree of association with the dimension cluster based on the determined weight and allocating the dimension cluster to the determined sample cluster; and

a first sorting sub-step of sorting the dimension clusters based on the allocation result in the dimension cluster allocation sub-step so that all the dimension clusters allocated to the same sample cluster are arranged at adjacent positions.

Supplementary note 5. the data processing method according to supplementary note 4, wherein the dimension cluster sorting step further includes:

a second sorting sub-step of sorting the dimension clusters that are arranged at adjacent positions and assigned to the same sample cluster, based on the weight of each of those dimension clusters with respect to the sample cluster.

Supplementary note 6. the data processing method according to any one of supplementary notes 1 to 5, wherein radial coordinate visualization (Radviz) is employed in the visualization step to visualize the distribution of the samples and the dimensions.

Supplementary note 7. the data processing method according to supplementary note 6, wherein the visualizing step further comprises:

a dimension cluster arrangement substep of arranging the dimension clusters on a circle according to the determined ordering;

a sample coordinate calculation sub-step of calculating coordinates of each sample of the plurality of samples within the circle based on arrangement of the dimension clusters on the circle; and

a visualization sub-step of visualizing the distribution of the dimensions and the samples based on the arrangement of the dimension clusters and the coordinates of the samples.

Supplementary note 8. the data processing method according to supplementary note 7, wherein the arrangement of the respective dimensions within each dimension cluster on the circle is unordered.

Supplementary note 9. the data processing method according to any one of supplementary notes 1 to 8, wherein collaborative clustering based on information theory is employed in the clustering step.

Supplementary note 10. the data processing method according to supplementary note 9, wherein the objective function value represents the mutual information loss before and after clustering by the information theory-based collaborative clustering.

Supplementary note 11. a data processing apparatus, comprising:

a clustering unit configured to perform collaborative clustering on a plurality of samples having a plurality of dimensions to obtain a first number of sample clusters, a second number of dimension clusters, and an objective function value, wherein the objective function value represents an information relationship before and after clustering;

a weight calculation unit configured to calculate a weight representing a degree of association between each sample cluster and each dimension cluster based on the obtained first number of sample clusters, second number of dimension clusters, and objective function values;

a dimension cluster sorting unit configured to sort the dimension clusters based on the calculated weights such that when the dimension clusters and the distribution of the sample clusters are visualized, the sample cluster with the highest degree of association with each dimension cluster is distributed near the dimension cluster and different sample clusters are separated from each other; and

a visualization unit configured to visualize the dimensions and the distribution of the samples based on the ordering of the dimension clusters determined by the dimension cluster ordering unit.

Supplementary note 12. the data processing apparatus according to supplementary note 11, wherein the objective function value is a sum of contribution values of a plurality of sub-portions, the number of the plurality of sub-portions being a product of the first number and the second number, and the contribution value of each of the plurality of sub-portions representing a contribution of one dimension cluster and one sample cluster to the objective function value, and wherein the weight calculation unit further comprises:

a normalization module configured to normalize the contribution values of each of the plurality of sub-portions; and

a weight determination module configured to determine the weight associated with each sub-portion based on the normalized contribution value of that sub-portion such that the weight is inversely related to the contribution value.

Supplementary note 13. the data processing apparatus according to supplementary note 12, wherein the normalization module is further configured to normalize, on a per sample cluster basis, the contribution values of the sub-portions associated with the sample cluster and each of the second number of dimension clusters.

Supplementary note 14. the data processing apparatus according to supplementary note 11, wherein the dimension cluster sorting unit further comprises:

a dimension cluster allocation module configured to determine, for each dimension cluster, based on the determined weight, a sample cluster having a highest degree of association with the dimension cluster and allocate the dimension cluster to the determined sample cluster; and

a first ordering module configured to order the dimension clusters based on the allocation result of the dimension cluster allocation module so that all dimension clusters allocated to the same sample cluster are arranged at adjacent positions.

Supplementary note 15. the data processing apparatus according to supplementary note 14, wherein the dimension cluster sorting unit further comprises:

a second sorting module configured to sort the dimension clusters that are arranged at adjacent positions and assigned to the same sample cluster, based on the weight of each of those dimension clusters with respect to the sample cluster.

Supplementary note 16. the data processing apparatus according to any one of supplementary notes 11 to 15, wherein the visualization unit is configured to employ radial coordinate visualization (Radviz) to visualize the distribution of the samples and the dimensions.

Supplementary note 17. the data processing apparatus according to supplementary note 16, wherein the visualization unit further comprises:

a dimension cluster arrangement module configured to arrange the dimension clusters on a circle in the determined ordering;

a sample coordinate calculation module configured to calculate coordinates of each sample of the plurality of samples within the circle based on an arrangement of the dimension clusters on the circle; and

a visualization module configured to visualize the dimensions and the distribution of the samples based on the arrangement of the dimension clusters and the coordinates of the samples.

Supplementary note 18. the data processing apparatus according to supplementary note 17, wherein the arrangement of the respective dimensions within each dimension cluster on the circle is unordered.

Supplementary note 19. the data processing apparatus according to any one of supplementary notes 11 to 18, wherein the clustering unit is configured to employ information theory-based collaborative clustering.

Supplementary note 20. the data processing apparatus according to supplementary note 19, wherein the objective function value represents the mutual information loss before and after clustering using the information theory-based collaborative clustering.

Supplementary note 21. a storage medium including machine-readable program code that, when executed on an information processing apparatus, causes the information processing apparatus to execute the data processing method according to any one of supplementary notes 1 to 10.

Supplementary note 22. a program product including machine-executable instructions that, when executed on an information processing apparatus, cause the information processing apparatus to execute the data processing method according to any one of supplementary notes 1 to 10.

Claims

1. A method of data processing, comprising:

a visualization step of visualizing the distribution of the dimensions and the samples based on the ranking of the dimension clusters determined in the dimension cluster ranking step,

wherein the objective function value is a sum of contribution values of a plurality of sub-portions, the number of the plurality of sub-portions being a product of the first number and the second number, and the contribution value of each sub-portion represents a contribution to the objective function value with respect to one dimension cluster and one sample cluster of the sub-portion, and wherein the weight calculating step further comprises:

2. The data processing method of claim 1, wherein the dimension cluster ordering step further comprises:

3. The data processing method of claim 2, wherein the dimension cluster ordering step further comprises:

4. A method of data processing according to any one of claims 1 to 3, wherein radial coordinate visualization (Radviz) is employed in the visualizing step to visualize the distribution of the samples and the dimensions.

5. The data processing method of claim 4, wherein the visualizing step further comprises:

6. A data processing apparatus comprising:

a visualization unit configured to visualize the dimensions and the distribution of the samples based on the ordering of the dimension clusters determined by the dimension cluster ordering unit,

wherein the objective function value is a sum of contribution values of a plurality of sub-portions, the number of the plurality of sub-portions being a product of the first number and the second number, and the contribution value of each of the plurality of sub-portions representing a contribution of one dimensional cluster and one sample cluster to the objective function value, and wherein the weight calculation unit further comprises:

7. The data processing apparatus according to claim 6, wherein the dimension cluster sorting unit further comprises:

8. The data processing apparatus according to claim 7, wherein the dimension cluster sorting unit further comprises:

9. The data processing apparatus of any of claims 6 to 8, wherein the visualization unit is configured to employ radial coordinate visualization (Radviz) to visualize the distribution of the samples and the dimensions.

10. The data processing apparatus of claim 9, wherein the visualization unit further comprises: