CN106067029B

CN106067029B - Data space-oriented entity classification method

Info

Publication number: CN106067029B
Application number: CN201610348890.4A
Authority: CN
Inventors: 王念滨; 王红滨; 周连科; 祝官文; 何鸣; 王瑛琦; 宋奎勇
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2016-05-24
Filing date: 2016-05-24
Publication date: 2019-06-18
Anticipated expiration: 2036-05-24
Also published as: CN106067029A

Abstract

A data space-oriented entity classification method, belonging to the field of natural language processing. It addresses the problem that, in an evolving environment, entities can no longer be classified under the assumption that they are static. First, an improved, evolutionary K-Means clustering framework is proposed for evolving data space entities, that is, a target cost function based on the contour value and KL-divergence is defined. Second, a novel similarity measurement method for data space entities is designed. Then, an evolutionary K-Means clustering algorithm is proposed according to heuristic rules. In addition, the proposed evolutionary clustering framework is extended to handle the cases where the number of clusters changes over time or snapshot entities are added or removed over time. The invention not only captures the current entity clustering result with high quality, but also robustly reflects the historical clustering situation.

Description

Data space-oriented entity classification method

Technical Field

The invention relates to a data space-oriented entity classification method.

Background

Data space integration is one of the important approaches to data space construction. Because a data space faces large-scale data with diverse structures, complex semantic relations and distributed storage, data space integration mainly comprises two aspects of work: (1) integration of entities; and (2) integration of entity relations. Existing data space integration work mainly focuses on entity relation integration and has proposed some effective strategies and methods, whereas research on entity integration is relatively scarce. Studying data space integration, and entity integration in particular, is therefore of great significance. As an important step of entity integration, entity classification has wide application, for example in question answering systems, relation extraction, data space query, machine translation and text clustering. Research on entity classification techniques for data spaces is therefore important.

Currently, the classification of (named) entities has attracted a lot of attention from scholars in the field of Natural Language Processing (NLP). This work falls into two main categories: coarse-grained entity classification and fine-grained entity classification. Coarse-grained entity classification assigns a group of entities to a small set of coarse-grained class labels; the number of classes is usually fewer than 20 and there is no hierarchy between the classes, for example person names, organization names and place names. Common methods include machine learning-based methods and knowledge-assisted methods based on ontologies and external resources. For example, Chifu et al. use an unsupervised neural network model for unsupervised classification of named entities, Kliegr proposes an unsupervised named entity classification method based on Bag-of-Articles, and Gamallo and Garcia propose a resource-based named entity classification system. Fine-grained entity classification assigns entities to finer-grained categories; the number of categories is larger and the category hierarchy is more complex. For example, FIGER uses 112 Freebase types and HYENA uses 505 YAGO types. Typical methods are based on context features and grammatical features. For example, Gillick et al. propose a context-dependent fine-grained entity classification method, and, based on grammatical features, Giuliano and Gliozzo propose a fine-grained entity classification method based on an instance learning algorithm, thereby generating a richer people ontology.

However, the entity classification methods in the NLP field described above usually classify using prior entity class knowledge such as context information, linguistic information and external knowledge features, and the objects to be classified are static; entity classification in data spaces has rarely been studied. In a data space environment, entity classification is a more challenging task, mainly in the following respects: (1) Richness of entity information. A data space entity contains not only its name but also rich attribute feature information and content feature information, which are in fact more important; a more appropriate similarity function is therefore needed to evaluate the similarity between data space entities. (2) Entity class hysteresis. Because a data space is built in an integrate-as-you-build (pay-as-you-go) manner, entity class knowledge is acquired gradually, so clustering is a more appropriate way to achieve entity classification. (3) Dynamic evolution of entities. Traditional entity classification methods rest on a strict assumption: entities are static and do not evolve over time. This assumption no longer holds in a data space environment, where both the extracted entity information and the number of entities change over time. Classifying entities in an evolving environment is therefore more challenging.

Disclosure of Invention

The invention aims to provide a data space-oriented entity classification method, to solve the problem that, in an evolving environment, entities can no longer be classified under the assumption that they are static.

A data space-oriented entity classification method is realized by the following steps:

step one, providing an evolutionary K-Means clustering framework for evolving data space entities, that is, defining a target cost function based on the contour value and KL-divergence;

step two, designing a data space entity similarity measurement method;

step three, providing an evolutionary K-Means clustering algorithm, and solving the initial point selection problem and the evolving data space entity classification problem;

step four, extending the evolutionary K-Means clustering framework of step one to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.

The invention has the beneficial effects that:

The invention provides an improved, evolutionary K-Means clustering framework for evolving data space entities, that is, it defines a target cost function based on the contour value and KL-divergence. The framework considers not only the quality of the current clustering, i.e. the snapshot cost, but also the temporal smoothness of all historical cluster structures, i.e. the historical cost. The similarity measure considers not only the rich information of the entities themselves, such as their structured and unstructured feature information, but also the historical occurrence patterns of the entities, so that the similarity between entities is measured more accurately. An evolutionary K-Means clustering algorithm is provided, which solves the initial point selection problem and the evolving data space entity classification problem. Finally, the evolutionary K-Means clustering framework is further extended so that it remains applicable when the number of clusters changes over time or snapshot entities are added or removed over time.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 shows snapshot graphs at adjacent time steps to which the present invention relates; the figure shows the snapshot graphs at time steps t-1 and t. Each snapshot graph comprises 6 vertices, each vertex corresponds to a snapshot entity, and the numbers on the edges represent the similarity between snapshot entities. From time step t-1 to t, some similarities change (e.g., between snapshot entities V₁ and V₂, and between snapshot entities V₃ and V₄) and some remain unchanged (e.g., between snapshot entities V₅ and V₆).

Suppose the snapshot graph G^t contains n vertices; then the n × n matrix W^t is the adjacency matrix of the snapshot graph G^t. Over time, the history of similarity between snapshot entities is captured by a series of snapshot graphs <G¹, G², …, G^h, …, G^t>;

FIG. 3 shows a clustering scenario of evolving entities according to the present invention, with snapshot graphs at time steps t-1 and t, where each vertex represents a snapshot entity and the number on each edge represents the similarity score between snapshot entities. At time step t-1, the six snapshot entities should clearly be clustered as in clustering 1; at time step t, the clustering result is not unique: the six snapshot entities may be clustered as in clustering 2 or as in clustering 3. Both clustering 2 and clustering 3 ensure good entity clustering quality at the current time step t. However, clustering 2 is preferred, because it is more consistent with the clustering result of the historical time step t-1;

FIG. 4 shows snapshot costs at different λ involved in experiment 1 of the present invention;

FIG. 5 shows the snapshot costs at different β for experiment 2 according to the present invention;

FIGS. 6 to 8 show the historical costs of experiment 3 according to the present invention at different α;

FIG. 9 shows the mutual information value under different evolution methods according to experiment 4 of the present invention;

FIG. 10 is a schematic diagram of the average run time of each iteration of the SD-EKM according to the present invention;

FIG. 11 is a schematic diagram of the average number of iterations to convergence according to the present invention.

Detailed Description

The first embodiment is as follows:

the entity classification method for data space of the present embodiment is shown in conjunction with the flowchart in fig. 1, and is implemented by the following steps:

designing a data space entity similarity measurement method;

The second embodiment is as follows:

Different from the first embodiment, in the data space-oriented entity classification method of this embodiment, the process in step one of providing an evolutionary K-Means clustering framework, that is, of defining a target cost function based on the contour value and KL-divergence, is as follows:

Step 1.1, defining the total target cost function as a linear combination:

The target cost function consists of two parts: the snapshot cost of the current time step and the historical cost of the historical time steps, denoted Cost_snapshot and Cost_temporal respectively. The former measures only the snapshot quality of the current clustering result with respect to the current entity information, and reflects the metric of the clustering algorithm; a higher snapshot cost means lower snapshot quality. The latter measures temporal smoothness by how well the current cluster structure fits the historical cluster structures; a higher historical cost means that the cluster structures of successive time steps are less consistent, i.e. less temporally smooth, and the historical costs of different historical time steps receive different weights. The total target cost function, formula (1), is defined as a linear combination of the snapshot cost of the current time step and the weighted historical costs of the historical time steps, and is used to evaluate the quality of the K-Means clustering of evolving entities.

In formula (1), 0 ≤ α ≤ 1 is the weight factor of the snapshot cost; Cost_snapshot^t is the snapshot cost of the current time step t; Cost_temporal^h is the historical cost of the historical time step h; and the exponential factor in t-h gives the historical cost of a time step h a larger weight the closer h is to the current time step t. The smaller the total target cost function, the better; that is, the closer a historical time step h is to the current time step t, the more the temporal smoothness with respect to its cluster structure matters;
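The linear combination in step 1.1 can be sketched as follows. This is a minimal illustration rather than the patented formula: the function name `total_cost`, the use of 1 - α as the weight of the historical part, and the decay weight exp(-(t - h)) are assumptions chosen so that historical time steps closer to t weigh more, as the text requires.

```python
import math


def total_cost(snapshot_cost, historical_costs, alpha):
    """Combine the snapshot cost with decayed historical costs.

    historical_costs[h - 1] is the historical cost of time step h = 1 .. t-1,
    so the current time step is t = len(historical_costs) + 1.
    ASSUMPTION: the historical part is weighted by (1 - alpha) and each
    historical cost by exp(-(t - h)); the source only states that closer
    time steps receive larger weights.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    t = len(historical_costs) + 1
    weighted_history = sum(
        math.exp(-(t - h)) * cost
        for h, cost in enumerate(historical_costs, start=1)
    )
    return alpha * snapshot_cost + (1.0 - alpha) * weighted_history
```

With α = 1 the function reduces to the snapshot cost alone; lowering α trades current clustering quality for agreement with the history.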

Step 1.2, measuring the snapshot cost based on the contour value:

Let the snapshot graph of the current time step t be G^t = (V^t, E^t, W^t), where |V^t| = n and W^t is the similarity matrix between snapshot entities; the specific similarity calculation is given by formula (11). The entity partition based on this snapshot graph is Z^t = {Z_1^t, Z_2^t, …, Z_k^t}, where the clusters together cover V^t and are pairwise disjoint. The snapshot cost measures the snapshot quality of the current clustering result with respect to the current snapshot entities, and reflects the metric of the clustering algorithm; a higher snapshot cost means lower snapshot quality. Many mature criteria for evaluating clusterings exist, such as tabulation, the sum of squared error criterion, the contour value (silhouette value), and class label-based accuracy and recall. These criteria differ in whether they refer to a gold standard, in their dependence on the similarity measure, in their bias with respect to the number of clusters, and so on. Because in the present setting the number of entities is large, the method depends on a specific similarity measure, and only the entities themselves are referred to, the contour value criterion is adopted to measure the quality of the K-Means clustering result. The contour value, also called the contour coefficient (silhouette coefficient), is a clustering evaluation method proposed by Kaufman and Rousseeuw that refers only to the data itself and not to a gold standard. Each cluster is represented by a contour, which shows which objects lie well within the cluster and which lie far from it; the method reflects both cohesion and separation, and the larger the contour value, the better the clustering. The snapshot cost is defined by formula (2).

In formula (2), k is the number of clusters (entity partitions) at the current time step t, Z_p^t is the p-th cluster (entity partition), and the remaining term is the average contour value of cluster Z_p^t. According to formula (1), the smaller the snapshot cost, the better, whereas the larger the average contour value, the better the clustering, so the average contour value is inversely related to the snapshot cost. The physical meaning of formula (2) is that, under the entity partition Z^t, a larger average contour value means better clustering quality and hence a more accurate reflection of the characteristics of the current snapshot entities;

Step 1.3, since each cluster Z_p^t contains a set of snapshot entities, the average contour value of each cluster Z_p^t is defined as the mean of the contour values of all snapshot entities in the cluster (formula (3)).

In formula (3), v_i^t is a snapshot entity in cluster Z_p^t, |Z_p^t| is the number of snapshot entities in cluster Z_p^t, and the contour value of snapshot entity v_i^t is given by formula (4).

In formula (4), the first quantity is the average similarity between snapshot entity v_i^t and the other snapshot entities in the cluster Z_p^t to which it belongs, and the second quantity is the maximum, over the other clusters, of the average similarity between v_i^t and all snapshot entities in that cluster. The larger the contour value of v_i^t, the more the average similarity of v_i^t to snapshot entities inside its own cluster exceeds its average similarity to snapshot entities in other clusters, i.e. the more clearly v_i^t is correctly classified;

Step 1.4, following the meaning of the contour value of a snapshot entity in formula (4) of step 1.3, the average similarity of a snapshot entity to the other snapshot entities in its own cluster is defined by formula (5), and its maximum average similarity to the other clusters is defined by formula (6).

In formulas (5) and (6), v_i^t and v_i'^t are snapshot entities in the same cluster, v_j^t is a snapshot entity in a different cluster, w_ii' is the similarity between the snapshot entities v_i^t and v_i'^t in the same cluster, and w_ij is the similarity between the snapshot entities v_i^t and v_j^t in different clusters;

Substituting formulas (3) to (6) into formula (2), the snapshot cost is rewritten as formula (7).
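The contour (silhouette) computation of steps 1.3 and 1.4 can be illustrated with a short sketch on a similarity matrix. The normalisation (a - b) / max(a, b) is the usual similarity-based form of the Kaufman and Rousseeuw coefficient and is an assumption here, as are the function names and the choice to give singleton clusters a contour value of 0.

```python
import numpy as np


def contour_values(W, labels):
    """Contour value of every snapshot entity from a similarity matrix W.

    a = average similarity to the other members of the entity's own cluster,
    b = largest average similarity to the members of any other cluster.
    ASSUMPTION: value = (a - b) / max(a, b); singletons get 0.
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    values = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the entity itself
        others = [W[i, labels == c].mean() for c in clusters if c != labels[i]]
        if not same.any() or not others:
            continue
        a = W[i, same].mean()
        b = max(others)
        denom = max(a, b)
        values[i] = (a - b) / denom if denom > 0 else 0.0
    return values


def average_contour_per_cluster(W, labels):
    """Mean contour value of the snapshot entities in each cluster."""
    values = contour_values(W, labels)
    labels = np.asarray(labels)
    return {int(c): float(values[labels == c].mean()) for c in np.unique(labels)}
```

A partition that groups mutually similar entities yields values close to 1, while a partition that splits them yields negative values, which is the behaviour the snapshot cost relies on.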

Step 1.6, measuring the historical cost based on KL-divergence:

First, the historical cost measures temporal smoothness by how well the current cluster structure fits the historical cluster structures; a smaller historical cost means that the cluster structures of successive time steps are more consistent, i.e. temporally smoother. For convenience of discussion, consider a series of snapshot graphs G¹, G², …, G^h, …, G^t. At the current time step t, the entity partition based on the corresponding snapshot graph G^t is Z^t. At the historical time step h, the entity partition based on the corresponding historical snapshot graph G^h is Z^h.

Second, a measure for comparing two cluster partitions is defined. Inspired by graph-factorization clustering, a bipartite graph is used to represent the relation between entities and clusters, and the entity partition Z^t is transformed into a joint probability distribution on the bipartite graph:

Let BG^t = (V^t, C^t, F^t, P^t) be the bipartite graph corresponding to the snapshot graph G^t = (V^t, E^t, W^t), where V^t is the set of snapshot entities; C^t is the set of clusters; F^t is the set of edges, each edge having one endpoint in V^t and the other in C^t; and P^t is an n × k joint probability matrix, which serves as the edge weight matrix of the bipartite graph. The joint probability of entity v_i^t and cluster c_j^t is computed with the joint probability formula p_ij = P(c_j^t) · P(v_i^t | c_j^t), where P(c_j^t) is the probability that cluster c_j^t occurs and P(v_i^t | c_j^t) is the probability that entity v_i^t occurs given cluster c_j^t. If entity v_i^t belongs to cluster c_j^t, then p_ij = (n_j / n) · (1 / n_j) = 1 / n, where n_j and n are the number of snapshot entities in cluster c_j^t and the number of all snapshot entities, respectively; otherwise p_ij = 0. Since the clustering in the present invention is a hard clustering rather than a soft clustering, i.e. an entity can belong to only one cluster, for any row i of the joint probability matrix P^t there is exactly one column j such that p_ij is not 0, so each row sums to 1 / n and the entries of P^t sum to 1.
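For a hard clustering, the joint probability matrix described above has a very simple shape, which the following sketch makes concrete. The function name and the convention that `labels[i]` holds the 0-based cluster index of entity i are assumptions made for the illustration.

```python
import numpy as np


def joint_probability_matrix(labels, k):
    """Build the n x k joint probability matrix of a hard clustering.

    Entry (i, j) is P(c_j) * P(v_i | c_j) = (n_j / n) * (1 / n_j) = 1 / n
    when entity i belongs to cluster j, and 0 otherwise.
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    P = np.zeros((n, k))
    P[np.arange(n), labels] = 1.0 / n
    return P
```

Summing a column recovers the cluster probability n_j / n, and the whole matrix sums to 1, so it can be treated as a probability distribution over (entity, cluster) pairs.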

Third, the literature on classification and clustering already offers a number of methods for comparing two cluster partitions, such as the centrometric difference method, the chi-square method, the correlation coefficient method and the KL-divergence method. In the present setting, the cluster partition problem is regarded as a joint probability distribution of entities and clusters, so measuring the difference between two cluster partitions is equivalent to measuring the difference between two probability distributions. Since KL-divergence (also called relative entropy) is a measure from information theory of the difference between two probability distributions, the KL-divergence method is used for the measurement:

Given the bipartite graph BG^t = (V^t, C^t, F^t, P^t) of the current time step t and the bipartite graph BG^h = (V^h, C^h, F^h, P^h) of the historical time step h, together with the entity partition Z^t of the current time step t and the entity partition Z^h of the historical time step h, where BG^t corresponds to Z^t and BG^h corresponds to Z^h, the historical cost for the two time steps h and t is defined by formula (8), the KL-divergence between the joint probability matrices P^t and P^h.

In formula (8), n is the number of snapshot entities, k is the number of clusters at the current time step t, p_ij^t is an element of the joint probability matrix P^t between snapshot entities and clusters at time step t, and p_ij^h is an element of the joint probability matrix P^h between snapshot entities and clusters at the historical time step h;

Fourth, the foregoing analysis shows that the joint probability matrices P^t and P^h are sparse, i.e. they contain zero elements, whereas the standard KL-divergence does not support p_ij^t = 0 or p_ij^h = 0. The joint probability matrices P^t and P^h are therefore smoothed as follows: a constant e^-12 is added to each element of P^t and P^h, and the resulting elements are then renormalized. Evaluated on the smoothed probability matrices, formula (8) becomes formula (9).

In formula (9), n is the number of snapshot entities, k is the number of clusters, and the two elements are those of the smoothed joint probability matrices between entities and clusters at time step t and at the historical time step h, respectively;
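The smoothing and the KL-divergence based historical cost can be sketched as below. Two points are assumptions: the constant is read as e^-12 (exp(-12)) exactly as printed, and the divergence is taken from the current matrix to the historical one, since the source does not show the direction. The function names are illustrative.

```python
import numpy as np

EPS = np.exp(-12.0)  # smoothing constant, read literally as e^-12


def smooth(P, eps=EPS):
    """Add eps to every element and renormalise so the entries sum to 1."""
    Q = np.asarray(P, dtype=float) + eps
    return Q / Q.sum()


def historical_cost(P_t, P_h, eps=EPS):
    """KL-divergence between the smoothed matrices of time steps t and h.

    ASSUMPTION: direction D(P_t || P_h). Both matrices must have the same
    shape, i.e. the same entities and the same number of clusters.
    """
    Pt, Ph = smooth(P_t, eps), smooth(P_h, eps)
    return float(np.sum(Pt * np.log(Pt / Ph)))
```

Identical partitions give a cost of 0, and every entity that changes cluster adds roughly (1/n) · log(1 / (n · eps)) to the cost, so the size of the smoothing constant directly scales the penalty for disagreement with the history.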

Fifth, substituting formula (7) and formula (8) into formula (1), the total target cost function is equivalent to formula (10).

In formula (10), 0 ≤ α ≤ 1 is the weight factor of the snapshot cost; k is the number of clusters (entity partitions) at the current time step t; Z_p^t is the p-th element of the entity partition Z^t; w_ii' and w_ij are elements of W^t in the snapshot graph G^t = (V^t, E^t, W^t); v_i^t, v_i'^t and v_j^t are snapshot entities in G^t; n is the number of snapshot entities in V^t of the bipartite graph BG^t = (V^t, C^t, F^t, P^t); and the remaining elements are those of the smoothed joint probability matrices of time steps t and h.

The third embodiment is as follows:

Different from the first or second embodiment, in the data space-oriented entity classification method of this embodiment, the process in step two of designing the data space entity similarity measurement method is as follows:

On the one hand, a snapshot entity contains rich information, such as structured attribute information and unstructured content information; on the other hand, in a data space environment, entities appear repeatedly over time, and this historical occurrence pattern also helps in judging whether two entities are similar. For this reason, the similarity of data space entities, i.e. snapshot entities, is measured from both the entity's own information and its historical occurrence pattern: the similarity function of snapshot entities is composed of the self-similarity and the historical similarity, and is defined as:

$$\mathrm{sim}(v_i^t, v_j^t) = \beta \cdot \mathrm{sim}_{self}(v_i^t, v_j^t) + (1-\beta) \cdot \mathrm{sim}_{hist}(v_i^t, v_j^t) \qquad (11)$$

where 0 ≤ β ≤ 1 is the weight of the self-similarity, v_i^t and v_j^t are snapshot entities at the current time step t, sim_self is the self-similarity between v_i^t and v_j^t, and sim_hist is the historical similarity between them;

Intuitively, entities in the same class share a high proportion of identical or similar attribute names, while entities in different classes share a low proportion; in addition, some entities contain only unstructured information, and two entities with similar content are also likely to belong to the same class. Therefore, based on the structured feature information (the attribute features of a snapshot entity) and the unstructured feature information (its content features), the self-similarity between snapshot entities is defined as:

$$\mathrm{sim}_{self}(v_i^t, v_j^t) = \lambda \cdot \mathrm{sim}_{attr}(A_i^t, A_j^t) + (1-\lambda) \cdot \mathrm{sim}_{cont}(T_i^t, T_j^t) \qquad (12)$$

where 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity, sim_attr and sim_cont are the attribute feature similarity and the content feature similarity of the snapshot entities, A_i^t denotes the attribute features of snapshot entity v_i^t, and T_i^t denotes its content features;

If the occurrence frequency patterns of two snapshot entities over the past historical time steps are relatively consistent, this correlation indicates that the two snapshot entities of the current time step are similar. The historical similarity is measured with the classical Pearson correlation coefficient:

$$\mathrm{sim}_{hist}(v_i^t, v_j^t) = \frac{\sum_h (f_i^h - \bar{f}_i)(f_j^h - \bar{f}_j)}{\sqrt{\sum_h (f_i^h - \bar{f}_i)^2}\,\sqrt{\sum_h (f_j^h - \bar{f}_j)^2}} \qquad (13)$$

where v_i^t and v_j^t are snapshot entities at the current time step t, f_i^h and f_j^h are the numbers of times v_i^t and v_j^t occur at historical time step h, and \bar{f}_i and \bar{f}_j are the averages of their numbers of occurrences over all historical time steps;

Substituting formula (12) and formula (13) into formula (11), the similarity function of snapshot entities is rewritten as formula (14), in which the self-similarity and the historical similarity are written out in full.

In formula (14), v_i^t and v_j^t are snapshot entities at the current time step t; the historical term uses the numbers of times they occur at each historical time step h and the averages of those counts over all historical time steps; the self term uses their attribute features and their content features; 0 ≤ β ≤ 1 is the weight of the self-similarity; and 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity.
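The combined similarity of the third embodiment can be sketched as follows. The source does not specify how attribute and content similarity are computed, so a Jaccard overlap of attribute names and of content words is used here purely as a hypothetical stand-in; the function names, the default weights, and the value 0 returned for a constant occurrence history are likewise assumptions.

```python
import numpy as np


def historical_similarity(counts_i, counts_j):
    """Pearson correlation of two per-time-step occurrence count series."""
    x = np.asarray(counts_i, dtype=float)
    y = np.asarray(counts_j, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    denom = np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())
    # ASSUMPTION: a constant history carries no pattern, so return 0.
    return float((dx * dy).sum() / denom) if denom > 0 else 0.0


def jaccard(a, b):
    """Hypothetical stand-in for the attribute / content similarity."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def entity_similarity(attrs_i, attrs_j, words_i, words_j,
                      counts_i, counts_j, beta=0.5, lam=0.5):
    """beta * self-similarity + (1 - beta) * historical similarity,
    where self-similarity = lam * attribute + (1 - lam) * content."""
    self_sim = lam * jaccard(attrs_i, attrs_j) + (1 - lam) * jaccard(words_i, words_j)
    return beta * self_sim + (1 - beta) * historical_similarity(counts_i, counts_j)
```

Because the Pearson coefficient ranges over [-1, 1], the combined score can fall below 0 for entities with opposite occurrence patterns; a caller that needs a score in [0, 1] would have to rescale the historical term.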

The fourth embodiment is as follows:

Different from the third embodiment, in the data space-oriented entity classification method of this embodiment, the process in step three of providing an evolutionary K-Means clustering algorithm and solving the initial point selection problem and the evolving data space entity classification problem is as follows:

First, some definitions are given so that the initial center points can be chosen well; the evolutionary K-Means clustering algorithm is then described in detail.

The quality of the initial points is well known to strongly affect the quality of K-Means clustering, and traditional random selection easily leads to problems such as slow convergence. Before solving the initial point selection problem, the following definitions are therefore made:

Definition of η-neighbors at time t: given a snapshot graph G^t = (V^t, E^t, W^t) and a parameter 0 < η ≤ 1, the η-neighbors at time t are formally defined for any snapshot entity v_i^t in terms of |V^t|, the number of vertices in the snapshot graph G^t, and the elements w_ij of W^t;

Definition of similarity density at time t: given a snapshot graph G^t = (V^t, E^t, W^t) and the η-neighbors at time t, the similarity density at time t is formally defined for any snapshot entity v_i^t;

These definitions show that the higher the similarity density of a snapshot entity v_i^t, the more η-neighbors it has and the higher its average similarity to the other snapshot entities among them; and the higher the similarity density, the more likely the snapshot entity is to be a cluster center. Defining the similarity density at time t avoids choosing as a K-Means cluster center point a snapshot entity in a low-density area, noise such as an isolated snapshot entity, or a snapshot entity on the edge of a cluster;
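The two definitions can be illustrated with a small sketch. The exact formulas are not shown in the text, so both functions rest on assumptions: the η-neighbors of an entity are taken to be the other entities whose similarity to it is at least η, and the similarity density is taken to be the sum of the similarities to those neighbors, which grows with both the number of neighbors and their average similarity.

```python
import numpy as np


def eta_neighbors(W, i, eta):
    """Indices of the other entities whose similarity to entity i is >= eta.

    ASSUMPTION: threshold-based neighborhood on the similarity matrix W.
    """
    W = np.asarray(W, dtype=float)
    idx = np.flatnonzero(W[i] >= eta)
    return idx[idx != i]


def similarity_density(W, i, eta):
    """Sum of the similarities between entity i and its eta-neighbors.

    ASSUMPTION: density = (number of neighbors) * (average similarity).
    """
    W = np.asarray(W, dtype=float)
    return float(W[i, eta_neighbors(W, i, eta)].sum())
```

Under these assumptions an isolated entity has no η-neighbors and a density of 0, so it is never preferred as a cluster center.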

Second, the first initial center point is chosen as the snapshot entity with the maximum similarity density;

The remaining initial center points are chosen by the following rule: exclude the η-neighbor snapshot entities of the already selected initial center points, and prefer a snapshot entity whose average similarity to all selected initial center points is low and whose similarity density is high (for example, an average similarity to all selected initial center points of 0.3 and a similarity density of 10). The rule is formulated as formula (15).

In formula (15), 1 ≤ l ≤ j-1 is the index of an already selected initial center point; the union is the η-neighbor union of all selected initial center points; the similarity term is the similarity between snapshot entity v_i^t and a selected initial center point; the density term is the similarity density of v_i^t at time t; and the added constant 1 prevents the denominator from being zero;
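The selection rule can be sketched as a greedy procedure. Since formula (15) itself is not shown, the score used below, density divided by (1 + average similarity to the selected centers), is an assumption that matches the description (high density, low similarity to chosen centers, a constant 1 in the denominator). The threshold-based neighborhood, the sum-based density and the fallback when every entity is excluded are assumptions as well.

```python
import numpy as np


def select_initial_centers(W, k, eta):
    """Greedily pick k initial center indices from a similarity matrix W."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    # ASSUMPTION: eta-neighbors are the other entities with similarity >= eta.
    nbr = (W >= eta) & ~np.eye(n, dtype=bool)
    # ASSUMPTION: similarity density = sum of similarities to eta-neighbors.
    density = (W * nbr).sum(axis=1)
    centers = [int(np.argmax(density))]
    while len(centers) < k:
        # Exclude the selected centers and all of their eta-neighbors.
        excluded = nbr[centers].any(axis=0)
        excluded[centers] = True
        candidates = np.flatnonzero(~excluded)
        if candidates.size == 0:
            # Fallback (assumption): allow neighbors, but never reuse a center.
            candidates = np.setdiff1d(np.arange(n), centers)
        avg_sim = W[np.ix_(candidates, centers)].mean(axis=1)
        score = density[candidates] / (1.0 + avg_sim)
        centers.append(int(candidates[np.argmax(score)]))
    return centers
```

The sketch expects k to be no larger than the number of entities; ties are broken in favour of the lowest index.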

Third, the basic idea of the evolutionary K-Means clustering algorithm is to run the K-Means clustering algorithm in a loop over all time steps up to the current time step. For each time step, the initial center points are selected based on the similarity density and formula (15), and the following operations are then iterated:

1) each snapshot entity is assigned to the cluster center point with which it has the highest similarity;

2) the cluster center points are updated, until the convergence condition of minimum target cost in formula (10) is reached;

The specific process of the evolutionary K-Means clustering algorithm is as follows:

Input: a series of snapshot entity sets O = {O¹, O², …, O^h, …, O^t} at different time steps, and the set of cluster numbers K = {k¹, k², …, k^h, …, k^t} corresponding to the different time steps;

Output: the set of clustering results C = {C¹, C², …, C^h, …, C^t} of all time steps, where h denotes a time step, h = 1, 2, …, t.

(1) For each time step h, execute in a loop:

(2) compute the similarity matrix W^h of the snapshot entity set O^h at the current time step h using formula (14), and construct the corresponding snapshot graph G^h = (V^h, E^h, W^h);

(3) initialize the set of cluster center points to empty;

(4) select the initial center points: first select the snapshot entity with the highest similarity density as the first initial center point; then compute the remaining initial center points according to formula (15), with j taken in ascending order up to k, where the superscript h denotes the time step;

(5) execute in a loop: assign each snapshot entity in the snapshot entity set O^h to the cluster whose cluster center is most similar to it; update the center point of each cluster and record the clustering result C^h; until the convergence condition of minimum target cost function in formula (10) is satisfied;

accumulate and update the clustering results of the different time steps;

and return the clustering result C for all time steps.
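The loop structure of the algorithm can be sketched as follows. This is a deliberately simplified illustration, not the patented procedure: the initial centers are passed in rather than derived from formula (15), each center is updated to the cluster member with the highest total similarity to its cluster (a medoid-style assumption, since the update rule is not spelled out), and the loop stops when the assignment no longer changes, which stands in for the minimum-cost convergence condition of formula (10).

```python
import numpy as np


def cluster_snapshot(W, centers, max_iter=100):
    """Cluster one snapshot from its similarity matrix W and initial centers.

    `centers` must hold distinct entity indices. Returns (labels, centers).
    """
    W = np.asarray(W, dtype=float)
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # 1) assign every entity to its most similar center
        new_labels = np.argmax(W[:, centers], axis=1)
        new_labels[centers] = np.arange(len(centers))  # a center stays put
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stand-in stopping rule (assumption)
        labels = new_labels
        # 2) update each center to the member most similar to its cluster
        for p in range(len(centers)):
            members = np.flatnonzero(labels == p)
            totals = W[np.ix_(members, members)].sum(axis=1)
            centers[p] = int(members[np.argmax(totals)])
    return labels, centers


def evolutionary_kmeans(W_seq, centers_seq):
    """Run the per-snapshot clustering over all time steps in order."""
    results = []
    for W, centers in zip(W_seq, centers_seq):
        labels, _ = cluster_snapshot(W, centers)
        results.append(labels)  # accumulate the result of this time step
    return results
```

A fuller version would evaluate the total target cost after each iteration, using the clustering results already accumulated for earlier time steps, and stop when that cost no longer decreases.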

The fifth embodiment is as follows:

Different from the first, second or fourth embodiment, in the data space-oriented entity classification method of this embodiment, the process in step four of extending the evolutionary K-Means clustering framework of step one to the cases where the number of clusters changes over time or snapshot entities are added or removed over time is as follows:

In the evolutionary K-Means clustering framework of the first to fourth embodiments, the number of clusters does not change over time, and the set of snapshot entities to be clustered is the same at all time steps, i.e. no snapshot entity is added or removed. In practical applications these two assumptions are too restrictive. The evolutionary K-Means clustering framework proposed above is therefore extended to handle the following cases:

first, when the number of clusters changes over time:

when the cluster number k^h of the historical time step h is less than the cluster number k^t of the current time step t, it suffices to append the corresponding columns to the joint probability matrix P^h, every entry of the appended columns being zero. This is because, for a newly added cluster, the joint probability of any snapshot entity with that cluster at the historical time step h is zero. After the expansion, the expanded P^h and P^t are both n × k^t matrices, and equation (10) is modified to:

when the cluster number k^h of the historical time step h is greater than the cluster number k^t of the current time step t, it suffices to append the corresponding columns to the joint probability matrix P^t, every entry of the appended columns being zero. This is because, for a deleted cluster, the joint probability of any snapshot entity with that cluster at the current time step t is zero. After the expansion, P^h and the expanded P^t are both n × k^h matrices, and equation (10) is modified to:
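
As a non-limiting illustration, the column expansion described in the two cases above can be sketched in Python. The joint probability matrices are represented as lists of rows; the function names `pad_columns` and `align_cluster_counts` are hypothetical, and the sketch assumes the appended cluster columns are placed after the existing ones.

```python
def pad_columns(P, target_cols):
    """Return a copy of the joint probability matrix P with zero columns appended."""
    padded = []
    for row in P:
        if len(row) > target_cols:
            raise ValueError("matrix already has more columns than the target")
        # Appended entries are zero: the entity never co-occurs with those clusters.
        padded.append(list(row) + [0.0] * (target_cols - len(row)))
    return padded


def align_cluster_counts(P_h, P_t):
    """Expand whichever matrix has fewer cluster columns so both shapes match."""
    k = max(len(P_h[0]), len(P_t[0]))
    return pad_columns(P_h, k), pad_columns(P_t, k)
```

Both returned matrices have n rows and max(k^h, k^t) columns, so they can be compared entry by entry.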

second, when a snapshot entity is added or removed over time:

assume that P^h at the historical time step h is an n^h × k joint probability matrix, that P^t at the current time step t is an n^t × k joint probability matrix, and that n₀ snapshot entities appear in both time steps h and t. When snapshot entities of the historical time step h have been removed, the joint probability of those removed snapshot entities with the current clusters at time step t is 0, so the corresponding zero rows are added to P^t to obtain the expanded P^t. When snapshot entities are newly added at the current time step t, the joint probability of those new snapshot entities with the historical clusters at time step h is 0, so the corresponding zero rows are added to P^h to obtain the expanded P^h. After the expansion, the expanded P^h and the expanded P^t are both (n^h+n^t-n₀) × k matrices, and equation (10) is modified to:

in the formula, the marked symbol denotes the matrix obtained by processing a matrix X with the smoothing method in formula (9), and the marked element symbol denotes an element of that smoothed matrix.
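
As a non-limiting illustration, the row expansion and the smoothing step can be sketched in Python. The sketch makes several assumptions: each matrix is given as a mapping from a snapshot entity identifier to its row of k joint probabilities, the constant written as e^-12 is read as exp(-12), and renormalization divides by the sum over the whole matrix. The names `align_rows`, `smooth`, and `EPSILON` are hypothetical.

```python
import math

# Assumption: the smoothing constant written "e^-12" is read as exp(-12).
EPSILON = math.exp(-12)


def align_rows(rows_h, rows_t, k):
    """Expand both matrices to the union of entities seen at steps h and t.

    rows_h and rows_t map an entity id to a list of k joint probabilities.
    An entity missing from one step receives an all-zero row in that step.
    """
    all_ids = sorted(set(rows_h) | set(rows_t))
    zero_row = [0.0] * k
    P_h = [list(rows_h.get(e, zero_row)) for e in all_ids]
    P_t = [list(rows_t.get(e, zero_row)) for e in all_ids]
    return all_ids, P_h, P_t


def smooth(P):
    """Add a small constant to every element, then renormalize to sum to 1."""
    shifted = [[p + EPSILON for p in row] for row in P]
    total = sum(sum(row) for row in shifted)
    return [[p / total for p in row] for row in shifted]
```

After alignment both matrices have n^h + n^t - n₀ rows, and after smoothing no element is zero, which keeps a KL-divergence between the two matrices finite.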

The sixth specific implementation mode:

This embodiment differs from the fifth embodiment in that a snapshot entity is defined as follows: given a time step t, the snapshot entity at time t is formally represented as o^t(Attr, Cont), where Attr denotes the structured feature information of the snapshot entity o^t, e.g., the set of attribute names in a tuple, with Attr＝{a₁,a₂,…,a_n}; Cont denotes the unstructured feature information of the snapshot entity o^t, e.g., the set of keywords in the content, with Cont＝{keyword₁,keyword₂,…,keyword_m}; n and m denote the number of elements in the sets Attr and Cont, respectively;

Note that the structured and unstructured feature information of an entity o may change over time. In practice this can have various causes, for example the entity information changes as the number of information sources from which the entity is extracted grows. The problem of extracting entity information over time is, however, not addressed here. The set of all snapshot entities at the current time step t is denoted O^t. Snapshot entities at adjacent time steps are illustrated in Fig. 2.

The seventh embodiment:

This embodiment differs from the first, second, fourth, or sixth embodiment in that, as an example, at time t a paper snapshot entity o^t contains the attributes title, author, and size, as well as the unstructured content information data space, entity, and classification; then o^t({title, author, size}, {data space, entity, classification});
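
As a non-limiting illustration, the paper snapshot entity of this example can be written as a small Python data structure. The class name `SnapshotEntity` and its field names are hypothetical; only the two feature sets Attr and Cont and the time step come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotEntity:
    """A snapshot entity o^t(Attr, Cont) observed at one time step."""
    time_step: int
    attr: frozenset  # Attr: structured feature information (attribute names)
    cont: frozenset  # Cont: unstructured feature information (content keywords)


# The paper snapshot entity from the example above, at a hypothetical time step 1.
paper = SnapshotEntity(
    time_step=1,
    attr=frozenset({"title", "author", "size"}),
    cont=frozenset({"data space", "entity", "classification"}),
)
```

Sets are used because the text defines Attr and Cont as sets, so set overlap between two entities is directly available to a similarity measure.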

The eighth embodiment:

This embodiment differs from the seventh embodiment in that a snapshot graph is defined as follows:

given a time step t, the snapshot graph at time t is formally represented as G^t＝(V^t,E^t,W^t), where each vertex in V^t represents a snapshot entity, each edge in E^t indicates that the two snapshot entities it connects are similar to each other, and each element of W^t is the weight of the corresponding edge, i.e., the similarity score between the two snapshot entities at time step t. Snapshot graphs at adjacent time steps are illustrated in Fig. 3.
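
As a non-limiting illustration, a snapshot graph G^t＝(V^t,E^t,W^t) can be built from a list of entities and a pairwise similarity function. In this Python sketch the Jaccard coefficient over keyword sets is only a stand-in for the similarity measure of the method, an edge is assumed to exist whenever the similarity is greater than zero, and the diagonal of W is assumed to be 1. The names `jaccard` and `build_snapshot_graph` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (a stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_snapshot_graph(entities, similarity=jaccard):
    """Return (V, E, W) for one time step.

    V is the list of vertex indices, E the list of edges (i, j) with i < j,
    and W the symmetric n x n similarity matrix.
    """
    n = len(entities)
    V = list(range(n))
    W = [[0.0] * n for _ in range(n)]
    E = []
    for i in range(n):
        W[i][i] = 1.0  # assumption: an entity is fully similar to itself
        for j in range(i + 1, n):
            s = similarity(entities[i], entities[j])
            W[i][j] = W[j][i] = s
            if s > 0:  # assumption: an edge exists when the similarity is positive
                E.append((i, j))
    return V, E, W
```

For three entities with the keyword sets {data, space}, {data, entity}, and {journal}, only the first two share a keyword, so the graph has the single edge (0, 1).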

Experiment and result analysis:

experimental setup:

The experiment uses the DBLP data of the March 2015 release as the base data set, with the download address http:// DBLP. The extracted entity categories are papers, doctoral theses, authors, conferences, journals, and universities. The following points should be noted: (1) paper entities come from inproceedings records or from article records whose key is prefixed with 'journals'; doctoral thesis entities come from phdthesis records; author entities come from www records or author tags; conference entities come from the booktitle tag of inproceedings records whose key is prefixed with 'conf'; journal entities come from the journal tag, or from the booktitle tag of inproceedings records whose key is prefixed with 'journals'; and university entities come from the school tag; (2) only entities generated between 2005 and 2014 are selected, one time step corresponds to one year, and about 3M entities are extracted in total; (3) to model the evolving nature of data space entities, 20% of the entities in the entity set of each time step are selected at random, and some of their attribute or content information is then randomly removed; (4) to model the pay-as-you-go nature of the data space, no category labels are provided for the collected entities, i.e., there is no classification information known in advance (ground truth); (5) to test the scalability of the method, the number of entities of each category is reduced in equal proportion, generating DBLP data sets of sizes 2.5M, 2M, 1.5M, and 1M.

The experimental environment is as follows: a PC with an Intel(R) Core(TM) i5-4570 CPU at 3.20 GHz, 4 GB of memory, and a 1 TB hard disk, running Windows 7 (64-bit); all algorithms in the experiments are implemented in Java. Unless otherwise specified, in all experiments the parameter k of the evolutionary K-Means algorithm defaults to 6 and the data set size is 3M.

Effectiveness and scalability evaluation

(1) Selection of parameters

Next, the influence of the variation of different parameters on the clustering effect is tested by three sets of experiments, so as to determine the optimal values of the parameters λ, β and α.

Experiment 1 evaluates the influence of the weight λ in the entity similarity function on the clustering effect. Because the experiment focuses only on the influence of λ, the parameters are set to α = 1 and β = 1, and the data of all time steps are aggregated together. On this basis the experiment is repeated 50 times, and the average snapshot cost for each value of λ is recorded. In Fig. 4 the abscissa represents the value of the weight λ and the ordinate represents the snapshot cost (see formula (7)). Fig. 4 shows that the snapshot cost gradually decreases as λ increases; at λ = 0.6 the snapshot cost is smallest (0.5) and the clustering effect is best, after which the snapshot cost gradually increases. This indicates that, in the self-similarity measure, the attribute feature information of an entity plays a more important role than its content information, mainly because entities of the same category are more likely to have similar attribute features than similar content. Fig. 4 shows the snapshot cost under different values of λ.

Experiment 2 evaluates the influence of the weight β in the entity similarity function on the clustering effect. Because the experiment focuses only on the influence of β, and Experiment 1 shows that the effect is best at λ = 0.6, the parameters are set to α = 1 and λ = 0.6, and the data of all time steps are aggregated together. On this basis the experiment is run 50 times, and the average snapshot cost for each value of β is recorded. In Fig. 5 the abscissa represents the value of the weight β and the ordinate represents the snapshot cost (see formula (7)). Fig. 5 shows that the snapshot cost tends to decrease as β increases; at β = 0.75 the snapshot cost is smallest (0.36), i.e., the clustering effect is best, after which the snapshot cost gradually increases.
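
As a non-limiting illustration of the roles of the two weights tuned in Experiments 1 and 2, the following Python sketch combines the similarity components. It assumes a linear combination in both places, with λ weighting the attribute similarity against the content similarity inside the self-similarity, and β weighting the self-similarity against the historical similarity; the historical similarity is the Pearson correlation coefficient of the two entities' occurrence counts over the historical time steps. The function names and the handling of constant count sequences are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long count sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: an undefined correlation is treated as 0
    return cov / math.sqrt(var_x * var_y)


def entity_similarity(sim_attr, sim_cont, counts_i, counts_j, lam=0.6, beta=0.75):
    """Combine self-similarity and historical similarity.

    Assumption: both combinations are linear; lam and beta default to the
    best values reported in Experiments 1 and 2.
    """
    self_sim = lam * sim_attr + (1 - lam) * sim_cont
    hist_sim = pearson(counts_i, counts_j)
    return beta * self_sim + (1 - beta) * hist_sim
```

With β = 1 the historical term vanishes, which matches the setting used in Experiment 1 to isolate the effect of λ.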

Experiment 3 evaluates the influence of the weight α in the target cost function on the clustering effect. Following the conclusions of the previous two experiments, the parameters are set to β = 0.75 and λ = 0.6. The experiment is repeated 50 times on the DBLP data sets of consecutive time steps (1-10), and the average values for each time step are recorded. In Fig. 6 the abscissa represents the time step and the ordinate represents the snapshot cost (see formula (7)). Fig. 6 shows that, as time evolves (i.e., as the time step increases), the snapshot cost first decreases sharply and then becomes almost stable, mainly because the entities' own information and historical pattern information are gradually enriched and converge during this process, so that the clustering effect improves and reaches a more stable state. It can also be observed that the snapshot cost gradually decreases as α increases: the larger α is, the more the evolutionary algorithm emphasizes the quality of the current clustering result, and thus the smaller the snapshot cost. Fig. 7 shows the corresponding historical cost, which gradually increases as α increases.

(2) Comparison of effects of different methods

Experiment 4 compares the clustering effect of the present method (Similarity Density-Based Evolutionary K-Means Clustering, SD-EKM) with that of other reference methods. Several reference methods are used: (1) the naive method (Evolutionary K-Means Clustering, N-EKM), an evolutionary K-Means based on random initial point selection, which is another version of SD-EKM that differs only in how the initial points are selected; (2) the classical PCM-EKM method, an evolutionary K-Means clustering method based on preserving cluster membership proposed by Yun Chi et al.; following the experimental conclusion of Yun Chi, the parameter α is also set to 0.9, with one minor modification in this experiment: the original similarity measure is replaced by the entity similarity measure proposed here; (3) the IND method, which runs the K-Means algorithm independently on the data of each time step without considering the data of historical time steps. Because the data set has no entity classification information known in advance (i.e., no ground-truth classification) and the target cost functions of these methods have no unified measure, the experiment uses mutual information for evaluation (see the definition of mutual information between two partitions proposed by Xu et al.^[136]). In general, the higher the mutual information between two partitions, the more likely they are to be similar. All experiments are repeated 50 times on the DBLP data sets of consecutive time steps (1-10), and the mutual information values for all time steps are recorded.

In Fig. 9 the abscissa represents the time step and the ordinate represents the mutual information. Fig. 9 shows that: (1) the mutual information of SD-EKM, N-EKM, and PCM-EKM is significantly better than that of IND, and the mutual information of the former three is relatively stable as time evolves; this is mainly because the first three methods adopt the idea of evolutionary clustering and take the entity information of historical time steps into account; (2) the mutual information of SD-EKM and N-EKM is better than that of PCM-EKM and more stable over consecutive time steps, mainly because PCM-EKM considers only the historical information of the most recent time step, whereas SD-EKM and N-EKM consider the historical information of all historical time steps; (3) SD-EKM is superior to N-EKM: SD-EKM follows the guiding principle of similarity density, which largely avoids selecting noise data (such as snapshot entities in low-density regions) as initial center points, so its clustering result is better. Fig. 9 shows the comparison of the effects of the different methods.
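
As a non-limiting illustration of the evaluation measure, the following Python sketch computes the standard mutual information between two partitions of the same entities, each given as a list of cluster labels. This is the textbook definition with natural logarithms; the definition in the cited reference may differ, for example by a normalization, so the sketch is an assumption rather than the measure actually used. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same entities."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same entities")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi
```

Two partitions that agree up to a renaming of the clusters reach the maximum value, while statistically independent partitions give zero.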

(3) Scalability

Experiment 5 tests the scalability of SD-EKM in terms of execution time. The experiment runs SD-EKM 50 times on data sets of different sizes and records the average running time per iteration and the average number of iterations executed until convergence. In Fig. 10 the abscissa represents the data set size and the ordinate represents the average running time per iteration. In Fig. 11 the abscissa represents the data set size and the ordinate represents the average number of iterations when the SD-EKM algorithm converges. Figs. 10 and 11 show that (1) the average running time per iteration of the SD-EKM of the present invention is nearly linear in the data set size; and (2) the number of iterations required for convergence is almost insensitive to the data set size, at around 650. Together, these results show that the running time of the SD-EKM of the present invention grows approximately linearly with the data set size, so the method scales well.

The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims

1. A data space-oriented entity classification method, characterized in that the method is realized by the following steps:

step one, proposing an evolved K-Means clustering framework, namely defining a target cost function based on the contour value and KL-divergence;

step two, designing a data space entity similarity measurement method;

step three, solving the initial point selection problem and the evolved data space entity classification problem based on the evolved K-Means clustering framework;

step four, under the condition that the number of clusters changes over time or snapshot entities are added or removed over time, expanding the K-Means clustering framework evolved in step one;

the process in step one of proposing the evolved K-Means clustering framework, namely defining a target cost function based on the contour value and KL-divergence, comprises first defining a total target cost function as a linear combination:

the target cost function consists of two parts, the snapshot cost of the current time step and the historical cost of the historical time steps, denoted Cost_snapshot and Cost_temporal respectively; the total target cost function, defined as a linear combination of the two, is used to evaluate the K-Means clustering quality of evolving entities, and its formula is as follows:

in the formula, 0 ≤ α ≤ 1 is the weight factor of the snapshot cost; Cost_snapshot represents the snapshot cost of the current time step t; Cost_temporal represents the historical cost of the historical time step h; the factor e^t-h indicates that the closer a historical time step h is to the current time step t, the greater the weight of its historical cost and the smaller the permitted deviation of its cluster structure, i.e., the closer the historical time step h is to the current time step t, the better the temporal smoothness of its cluster structure;

step two, measuring the snapshot cost based on the contour value:

let the snapshot graph of the current time step t be G^t＝(V^t,E^t,W^t), where V^t is the set of snapshot entities, E^t is the set of edges representing similarity between snapshot entities, |V^t|＝n, and W^t is the similarity matrix between snapshot entities; an entity partition is formed based on the snapshot graph, with the stated condition holding for clusters p and q where p is not equal to q; the quality of the K-Means clustering result is measured with the contour value criterion, where the contour value, also called the contour coefficient, is a clustering evaluation method that refers only to the data and not to a gold standard; in this evaluation method each cluster is represented by a contour, which reflects which objects lie within the cluster and which lie far from it; the method reflects the two factors of cohesion and separation, and the larger the contour value, the better the clustering effect; the snapshot cost is defined as:

where k represents the number of clusters at the current time step t,it indicates the p-th cluster, and,representing a clusterAverage contour value of (a); the average contour value is in inverse proportion to the snapshot cost;

wherein the first quantity represents the average similarity between a snapshot entity and the other snapshot entities in the cluster to which it belongs, and the second quantity represents the maximum, over the other clusters, of the average similarity between the snapshot entity and all snapshot entities in that cluster; the larger the contour value of a snapshot entity, the more its average similarity to the snapshot entities in its own cluster exceeds its average similarity to the snapshot entities in other clusters;

definition ofThe formula of (1) is:

in the formula,into a clusterThe entity of the medium snapshot is the entity,into a clusterThe entity of the medium snapshot is the entity,for snapshot entities in the same clusterAndthe similarity between the two groups is similar to each other,for snapshot entities in different clustersAndsimilarity between them;

sixthly, performing historical cost measurement based on KL-divergence:

first, given the sequence of snapshot graphs G¹,G²,…,G^h,…,G^t，

under the historical time step h an entity partition is obtained based on the corresponding historical snapshot graph G^h; secondly, a metric for comparing two cluster partitions is defined: the relationship between entities and clusters is represented by a bipartite graph, and the problem of the entity partition Z^t is transformed into a joint probability distribution problem in the bipartite graph:

let BG^t＝(V^t,C^t,F^t,P^t) be the bipartite graph corresponding to the snapshot graph G^t＝(V^t,E^t,W^t); where V^t is the snapshot entity set; C^t is the cluster set; F^t is the edge set, the two endpoints of each edge coming from the set V^t and the set C^t respectively; P^t is an n × k joint probability matrix corresponding to the edge weight matrix of the bipartite graph; the joint probability of an entity and a cluster is calculated with the joint probability formula as the product of the probability that the cluster occurs and the probability that the entity occurs given that the cluster occurs; if the entity belongs to the cluster, the two probabilities are determined from n_j and n, which are respectively the number of snapshot entities in the cluster and the number of all snapshot entities; if not, the joint probability is zero; in the joint probability matrix P^t, for any row i there is exactly one column j such that p_ij is not 0;

Thirdly, measuring by adopting a KL-divergence method:

fourth, the joint probability matrix P^t or P^h is smoothed as follows: a constant e^-12 is added to each element of P^t or P^h; the processed elements are then renormalized; the smoothed probability matrices are recorded with their own symbols respectively, and equation (8) is modified to:

wherein 0 ≦ α ≦ 1 is the weight factor of the snapshot cost; k represents the number of clusters at the current time step t; the cluster symbol represents the p-th element of the entity partition Z^t; w^t _ii′ and w^t _ij are elements of the matrix W^t of the snapshot graph G^t＝(V^t,E^t,W^t); the entity symbols represent snapshot entities in G^t; n represents the number of snapshot entities in V^t of the bipartite graph BG^t＝(V^t,C^t,F^t,P^t); and the remaining two symbols represent elements of the two smoothed joint probability matrices;

the process in step two of designing the data space entity similarity measurement method is as follows:

the data space entity is a snapshot entity, and the similarity of snapshot entities is measured from the entity's own information and from its historical occurrence pattern information; that is, the similarity function of snapshot entities consists of a self-similarity and a historical similarity, and is defined as:

based on the structured feature information, corresponding to the attribute feature information of the snapshot entities, and the unstructured feature information, corresponding to their content feature information, the self-similarity between snapshot entities is defined as:

the historical similarity is measured with the classical Pearson correlation coefficient, specifically:

wherein,for the snapshot entity at the current time step t,andare respectively snapshot entitiesAndthe number of times the historical time step h occurs,andseparately snapshotting entitiesAndaverage of the number of occurrences at all historical time steps;

wherein,for the snapshot entity at the current time step t,andare respectively snapshot entitiesAndthe number of times the historical time step h occurs,andseparately snapshotting entitiesAndthe average of the number of occurrences over all historical time steps,andare respectively snapshot entitiesAndthe characteristic of the properties of (a) to (b),andas snapshot entitiesAndthe content characteristics of (1) 0- β are self-similarityλ is more than or equal to 0 and less than or equal to 1 as the similarity of attribute featuresThe weight of (c);

step three, the process for solving the initial point selection problem and the evolved data space entity classification problem based on the evolved K-Means clustering framework is that,

firstly, the following relevant definitions are made:

definition of η-neighbors at time t: given a snapshot graph G^t＝(V^t,E^t,W^t) and a parameter 0<η ≦ 1, for any snapshot entity its η-neighbors at time t are formally defined as:

wherein, | V^tI is snapshot G^tThe number of the middle top points is,is W^tMiddle element;

in the formula w^t _ijFor snapshot entities in different clustersAnda similarity value therebetween;

the principle for selecting the initial center points other than the first initial center point is determined as follows: the η-neighbor snapshot entities of the already selected initial center points are excluded, the average similarity to all selected initial center points is to be low, and the similarity density is to be high; this is formulated as:

input: a series of snapshot entity sets of different time steps O＝{O¹,O²,…,O^h,…,O^t}, and the set of cluster numbers corresponding to the different time steps K＝{k¹,k²,…,k^h,…,k^t}；

output: the set of clustering results of all time steps C＝{C¹,C²,…,C^h,…,C^t}; where h represents a time step, and h＝1,2,…,t；

(1) for each time step h, loop:

(2) using formula (14), compute the similarity matrix W^h corresponding to the snapshot entity set O^h at the current time step h, and construct the corresponding snapshot graph G^h＝(V^h,E^h,W^h)；

(3) initialize the set of cluster center points to empty;

(4) select the initial center points: first, select the snapshot entity with the highest similarity density as the first initial center point; then compute the remaining initial center points according to formula (15);

(5) loop: assign each snapshot entity in the snapshot entity set O^h to the cluster whose center is most similar to it; update the center point of each cluster and record the clustering result C^h; repeat until the convergence condition is satisfied, i.e., the target cost function in formula (10) reaches its minimum;

and return the clustering results C of all time steps;

step four, under the condition that the number of clusters changes along with time or the snapshot entity is added or removed along with time, the process of expanding the K-Means clustering framework evolved in the step one is as follows,

first, when the number of clusters changes over time:

when the cluster number k^h of the historical time step h is less than the cluster number k^t of the current time step t, the corresponding columns are appended to the joint probability matrix P^h, every entry of the appended columns being zero; after the expansion, the expanded P^h and P^t are both n × k^t matrices, and equation (10) is modified to:

when the cluster number k^h of the historical time step h is greater than the cluster number k^t of the current time step t, the corresponding columns are appended to the joint probability matrix P^t, every entry of the appended columns being zero; after the expansion, P^h and the expanded P^t are both n × k^h matrices, and equation (10) is modified to:

second, when a snapshot entity is added or removed over time:

assume that P^h at the historical time step h is an n^h × k joint probability matrix, that P^t at the current time step t is an n^t × k joint probability matrix, and that n₀ snapshot entities appear in both time steps h and t; when snapshot entities of the historical time step h have been removed, the joint probability of those removed snapshot entities with the current clusters at time step t is 0, so the corresponding zero rows are added to P^t to obtain the expanded P^t; when snapshot entities are newly added at the current time step t, the joint probability of those new snapshot entities with the historical clusters at time step h is 0, so the corresponding zero rows are added to P^h to obtain the expanded P^h; after the expansion, the expanded P^h and the expanded P^t are both (n^h+n^t-n₀) × k matrices, and equation (10) is modified to:

2. The data space-oriented entity classification method of claim 1, characterized in that: the snapshot entity at time t is formally represented as o^t(Attr, Cont), where Attr denotes the structured feature information of the snapshot entity o^t, with Attr＝{a₁,a₂,…,a_n}; Cont denotes the unstructured feature information of the snapshot entity o^t, with Cont＝{keyword₁,keyword₂,…,keyword_m}; n and m denote the number of elements in the sets Attr and Cont, respectively; and the set of all snapshot entities at the current time step t is denoted O^t.

3. The data space-oriented entity classification method according to claim 1 or 2, characterized in that: if the snapshot entity is a paper, then at time t the paper snapshot entity o^t contains the attributes title, author, and size, as well as the unstructured content information data space, entity, and classification; then o^t({title, author, size}, {data space, entity, classification}).

4. The data space-oriented entity classification method of claim 3, characterized in that: given a time step t, the snapshot graph at time t is formally represented as G^t＝(V^t,E^t,W^t), where each vertex in V^t represents a snapshot entity, each edge in E^t indicates that the two snapshot entities it connects are similar to each other, and each element of W^t is the weight of the corresponding edge, i.e., the similarity score between the two snapshot entities at time step t.