CN106067029A - The entity classification method in data-oriented space - Google Patents
- Publication number
- CN106067029A CN106067029A CN201610348890.4A CN201610348890A CN106067029A CN 106067029 A CN106067029 A CN 106067029A CN 201610348890 A CN201610348890 A CN 201610348890A CN 106067029 A CN106067029 A CN 106067029A
- Authority
- CN
- China
- Prior art keywords
- entity
- snapshot
- sigma
- time step
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
A method for classifying entities in a data space, belonging to the field of natural language processing. It addresses the problem that, in an evolving environment, existing methods cannot classify entities because they assume entities are in a static state. First, for evolving data-space entities, an improved, evolutionary K-Means clustering framework is proposed, i.e. an objective cost function is defined based on silhouette values and KL divergence. Second, a novel similarity measure for data-space entities is designed. Third, following heuristic rules, an evolutionary K-Means clustering algorithm is proposed. In addition, the evolutionary clustering framework is further extended to handle the cases where the number of clusters changes over time or snapshot entities are added or removed over time. The invention not only captures the current entity clustering result with high quality, but also robustly reflects the historical clustering situation.
Description
Technical field
The present invention relates to a method for classifying entities in a data space.
Background technology
Data space integration is one of the important approaches to building a data space. A data space faces heterogeneous, semantically complex, large-scale data stored in a distributed fashion, so data space integration mainly comprises two kinds of work: (1) entity integration; (2) entity-relationship integration. Existing data space integration work focuses mainly on entity-relationship integration, for which some effective strategies and methods have been proposed, whereas research on entity integration is relatively scarce. Integration over data spaces, and entity integration in particular, is therefore significant. As an important step of entity integration, entity classification is widely used, for example in question answering, relation extraction, data space querying, machine translation, and text clustering. Entity classification technology for data spaces is therefore of real significance.
At present, research on classifying (named) entities has attracted broad attention from many scholars in the field of natural language processing (NLP). This work falls into two broad classes: coarse-grained entity classification and fine-grained entity classification. Coarse-grained entity classification aims to divide a group of entities into a small set of coarse-grained class labels; the number of classes is typically fewer than 20 and the classes have no hierarchy, e.g. person names, organization names, and place names. Common approaches include machine-learning-based methods and methods based on auxiliary knowledge such as ontologies and external resources. For example, Chifu et al. use an unsupervised neural network model to classify named entities without supervision, Kliegr proposes an unsupervised Bag-of-Articles named-entity classification method, and Gamallo and Garcia propose a resource-based named-entity classification system. Fine-grained entity classification divides entities into finer-grained classes; there are more classes and the class hierarchy is more complex. For example, FIGER uses 112 Freebase types, and HYENA uses 505 YAGO types. Typical methods are context-based or based on grammatical features. For example, Gillick et al. propose a context-dependent fine-grained entity classification method based on grammatical features, and Giuliano and Gliozzo propose a fine-grained entity classification method based on instance-based learning, thereby generating a richer ontology.
However, the entity classification methods of the NLP field above usually rely on contextual information, linguistic information, and prior entity-class knowledge such as external knowledge features, and the objects being classified are static; entity classification technology in data spaces, by contrast, has rarely been studied. In a data space environment, entity classification is a challenging task, mainly for the following reasons. (1) Richness of entity information. As described in Chapter 2, a data space entity comprises not only its name but also rich attribute-feature and content-feature information; in fact this information is especially important, so a more appropriate similarity function is needed to assess the similarity between data space entities. (2) Lag of entity-class knowledge. Because a data space advocates a pay-as-you-go mode of building while integrating, knowledge about entity classes is acquired only gradually, so clustering is the more appropriate technique for realizing entity classification. (3) Dynamic evolution of entities. Traditional entity classification methods make a strict assumption: entities are static and do not evolve over time. This assumption does not hold in a data space environment, where the extracted information about each entity and the number of entities change constantly. Classifying entities in such an evolving environment is therefore all the more challenging.
Summary of the invention
The purpose of the invention is to solve the problem that, in an evolving environment, existing methods cannot classify entities because they assume entities are in a static state, and to propose a method for classifying entities in a data space.
A method for classifying entities in a data space, realized by the following steps:
Step 1: for evolving data space entities, propose an evolutionary K-Means clustering framework, i.e. define an objective cost function based on silhouette values and KL divergence;
Step 2: design a similarity measure for data space entities;
Step 3: propose an evolutionary K-Means clustering algorithm, solving the initial-point selection problem and the problem of classifying evolving data space entities;
Step 4: extend the evolutionary K-Means clustering framework of Step 1 to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.
The invention has the following benefits:
For evolving data space entities, an improved, evolutionary K-Means clustering framework is proposed, i.e. an objective cost function is defined based on silhouette values and KL divergence. It considers not only the quality of the current clustering, i.e. the snapshot cost, but also the temporal smoothness with respect to all historical clustering structures, i.e. the history cost. The designed similarity measure for data space entities considers not only the entity's own richer information, such as its structured feature information and unstructured feature information, but also the historical occurrence-pattern information between entities, and thus measures the similarity between entities more accurately. An evolutionary K-Means clustering algorithm is proposed that solves the initial-point selection problem and the problem of classifying evolving data space entities. Finally, the evolutionary K-Means clustering framework is further extended so that the cases where the number of clusters changes over time or snapshot entities are added or removed over time, i.e. the generality problem of the evolutionary K-Means clustering framework, are handled.
Accompanying drawing explanation
Fig. 1 is the flow chart of the invention;
Fig. 2 is a schematic diagram of the snapshot graphs at adjacent time steps, showing the snapshot graphs at time steps t-1 and t. A snapshot graph contains 6 vertices, each vertex corresponding to one snapshot entity, and the number on an edge represents the similarity between snapshot entities. From time step t-1 to t, some similarities change (e.g. between snapshot entities V_1 and V_2, and between V_3 and V_4), while others do not (e.g. between V_5 and V_6).
Assume the snapshot graph G_t contains n vertices; then W_t ∈ R^(n×n) is an adjacency (similarity) matrix on G_t. As time evolves, the similarity history between snapshot entities is captured by a series of snapshot graphs <G_1, G_2, ..., G_h, ..., G_t>;
Fig. 3 shows the clustering scenario for evolving entities, illustrating the snapshot graphs at time steps t-1 and t, where each vertex represents a snapshot entity and the number on an edge is the similarity score between snapshot entities. Clearly, at time step t-1 the six snapshot entities should be clustered as in clustering 1. At time step t the clustering result is not unique; for example, the six snapshot entities may be clustered as in clustering 2 or as in clustering 3. Both clustering 2 and clustering 3 guarantee good clustering quality at the current time step t, but by the basic idea of this method clustering 2 is preferred, because it is more consistent with the clustering result of the historical time step t-1;
Fig. 4 shows the snapshot cost under different λ in Experiment 1;
Fig. 5 shows the snapshot cost under different β in Experiment 2;
Fig. 6 to Fig. 8 show the history cost under different α in Experiment 3;
Fig. 9 shows the association measure under different evolution methods in Experiment 4;
Fig. 10 is a schematic diagram of the average running time per iteration of SD-EKM;
Fig. 11 is a schematic diagram of the average number of iterations to convergence.
Detailed description of the invention
Embodiment 1:
The method for classifying entities in a data space of this embodiment, with the flow chart shown in Fig. 1, is realized by the following steps:
Step 1: for evolving data space entities, propose an evolutionary K-Means clustering framework, i.e. define an objective cost function based on silhouette values and KL divergence;
Step 2: design a similarity measure for data space entities;
Step 3: propose an evolutionary K-Means clustering algorithm, solving the initial-point selection problem and the problem of classifying evolving data space entities;
Step 4: extend the evolutionary K-Means clustering framework of Step 1 to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.
Embodiment 2:
Different from Embodiment 1, in the method of this embodiment the process in Step 1 of proposing the evolutionary K-Means clustering framework, i.e. defining the objective cost function based on silhouette values and KL divergence, is as follows.
Step 1.1: define the total objective cost function as a linear combination.
The cost function consists of two parts: the snapshot cost of the current time step and the history cost of the historical time steps, denoted Cost_snapshot and Cost_temporal respectively. The former measures only the snapshot quality of the current clustering result with respect to the current entity information and reflects the quality of the clustering algorithm; clearly, a higher snapshot cost means lower snapshot quality. The latter measures temporal smoothness according to how well the current clustering structure fits the historical clustering structures; clearly, a higher history cost means that the clustering structures of consecutive time steps agree poorly, i.e. temporal smoothness is weak. In addition, the history costs of different historical time steps carry different weights. The total objective cost function, used to assess the clustering quality of the evolutionary entity K-Means, is defined as the linear combination of the snapshot cost of the current time step and the history costs of the historical time steps, as in formula (1):
In the formula, 0 ≤ α ≤ 1 is the weight factor of the snapshot cost; Cost_snapshot^t is the snapshot cost of the current time step t; Cost_temporal^h is the history cost of historical time step h; and the factor e^(h-t) gives a historical step h that is closer to the current time step t a heavier weight, since its degree of departure is smaller. Because the total objective cost function is the smaller the better, the closer h is to the current time step t, the better the temporal smoothness with respect to the clustering structure of historical time step h should be;
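As a concrete reading of formula (1), the linear combination with exponentially decayed history weights can be sketched as follows. This is a minimal sketch: the function name `total_cost` and the exact decay factor e^(h-t) are assumptions based on the description above, not the patent's reference implementation.

```python
import math

def total_cost(t, snapshot_cost, history_costs, alpha=0.5):
    """Formula (1), reconstructed: alpha * Cost_snapshot(t) plus
    (1 - alpha) * sum over history steps h of e^(h-t) * Cost_temporal(h),
    so that history steps nearer to t weigh more."""
    decayed = sum(math.exp(h - t) * cost for h, cost in history_costs.items())
    return alpha * snapshot_cost + (1 - alpha) * decayed
```

With alpha = 1 only the current snapshot quality matters; with alpha = 0 only agreement with the historical clusterings matters.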
Step 1.2: measure the snapshot cost based on silhouette values.
Let the snapshot graph of the current time step t be G_t = (V_t, E_t, W_t), where |V_t| = n and W_t is the similarity matrix between snapshot entities; the concrete similarity computation is given in formula (11). The entity partition obtained on this snapshot graph is Z_t = {c_1^t, c_2^t, ..., c_k^t}. The snapshot cost is intended to measure the snapshot quality of the current clustering result with respect to the current snapshot entities, reflecting the quality of the clustering algorithm; clearly, a higher snapshot cost means lower snapshot quality. Many mature criteria for assessing clusterings exist, such as contingency tables, the sum-of-squared-errors criterion, the silhouette value, and precision and recall based on class labels. These criteria differ in whether they refer to a gold standard, in their dependence on the similarity measure, and in their bias toward certain cluster counts. Because in this setting the number of entities is large, the method depends on a specific similarity measure, and only the entities themselves can be referred to, the silhouette-value criterion is used to measure the quality of the K-Means clustering result. The silhouette value, also called the silhouette coefficient, is a cluster-evaluation method proposed by Kaufman and Rousseeuw that refers only to the data itself and not to a gold standard. In this method each cluster is represented by a silhouette, which reflects the objects lying inside the cluster and those far from it, capturing the two influencing factors of cohesion and separation; the larger the silhouette value, the better the clustering. The snapshot cost is defined by formula (2):
In the formula, k is the number of clusters of the entity partition at the current time step t, c_p^t is the p-th cluster (of the entity partition), and AvgSil(c_p^t) is the mean silhouette value of cluster c_p^t. By formula (1), the snapshot cost is the smaller the better, and the larger the mean silhouette value, the better the clustering, so the mean silhouette value is inversely related to the snapshot cost. The physical meaning of formula (2) is that, under the entity partition Z_t, the larger the mean silhouette value, the better the clustering quality, and the more accurately the partition reflects the characteristics of the current snapshot entities;
Step 1.3: each cluster c_p^t contains a group of snapshot entities o_i^t; the mean silhouette value AvgSil(c_p^t) of cluster c_p^t is defined in formula (3) as the average of the silhouette values of all snapshot entities in the cluster:
In the formula, o_i^t is a snapshot entity in cluster c_p^t; |c_p^t| is the number of snapshot entities in cluster c_p^t; and Sil(o_i^t) is the silhouette value of snapshot entity o_i^t, measured by formula (4):
where a(o_i^t) is the average similarity of snapshot entity o_i^t to the other snapshot entities in its own cluster c_p^t, and b(o_i^t) is the maximum average similarity of o_i^t to all snapshot entities of any other cluster. The larger the value of Sil(o_i^t), the more the intra-cluster average similarity of o_i^t exceeds its inter-cluster average similarity, and thus the more correctly o_i^t is classified;
Step 1.4: based on the physical meaning of formula (4) for the silhouette value Sil(o_i^t) in Step 1.3, a(o_i^t) is defined by formula (5) and b(o_i^t) by formula (6):
In the formulas, o_i^t and o_i'^t are snapshot entities of cluster c_p^t, w_ii' is the similarity between snapshot entities o_i^t and o_i'^t of the same cluster, and w_ij is the similarity between snapshot entities o_i^t and o_j^t of different clusters;
Step 1.5: substituting formulas (3) to (6) into formula (2), the snapshot cost is rewritten as formula (7):
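The silhouette computation of formulas (3)-(7) over a similarity matrix can be sketched as below. This is a hedged reconstruction: the form Sil = (a - b)/max(a, b) with similarities (rather than distances), and the normalization of the snapshot cost as 1 minus the mean of the per-cluster average silhouettes, are assumptions consistent with the text, not the patent's exact formulas.

```python
import numpy as np

def silhouette(W, labels, i):
    """Sil(o_i) = (a - b) / max(a, b), where a is the average similarity
    of entity i to the rest of its own cluster (cf. formula (5)) and b is
    the maximum average similarity to any other cluster (cf. formula (6))."""
    n = len(labels)
    own = labels[i]
    same = [j for j in range(n) if labels[j] == own and j != i]
    a = float(np.mean([W[i][j] for j in same])) if same else 0.0
    b = 0.0
    for c in set(labels) - {own}:
        members = [j for j in range(n) if labels[j] == c]
        b = max(b, float(np.mean([W[i][j] for j in members])))
    denom = max(a, b)
    return (a - b) / denom if denom > 0 else 0.0

def snapshot_cost(W, labels):
    """Cf. formulas (2)-(3): average the silhouettes inside each cluster,
    then turn the mean over clusters into a cost (smaller is better)."""
    per_cluster = []
    for c in set(labels):
        members = [i for i in range(len(labels)) if labels[i] == c]
        per_cluster.append(np.mean([silhouette(W, labels, i) for i in members]))
    return 1.0 - float(np.mean(per_cluster))
```

A well-separated two-cluster similarity matrix gives a low snapshot cost under the correct labeling and a high cost under a scrambled one.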
Step 1.6: measure the history cost based on KL divergence.
First, the history cost is intended to measure temporal smoothness according to how well the current clustering structure fits the historical clustering structures; clearly, a smaller history cost means that the clustering structures of consecutive time steps agree well, i.e. temporal smoothness is strong. For the discussion, let the snapshot graphs be G_1, G_2, ..., G_h, ..., G_t; at the current time step t, the entity partition based on the corresponding snapshot graph G_t is denoted Z_t; at a historical time step h, the entity partition based on the corresponding historical snapshot graph G_h is denoted Z_h.
Second, define a measure for comparing two clusterings. Inspired by the idea of graph factorization clustering, a bipartite graph is used to represent the relation between entities and clusters, converting the entity partition Z_t into a joint-probability-distribution problem on the bipartite graph:
Let BG_t = (V_t, C_t, F_t, P_t) be a bipartite graph corresponding to the snapshot graph G_t = (V_t, E_t, W_t), where V_t is the set of snapshot entities, C_t is the set of clusters, and F_t is the set of edges, the two endpoints of each edge coming from V_t and C_t respectively. P_t is an n × k joint probability matrix corresponding to the edge-weight matrix of the bipartite graph; its entries are computed with the joint probability formula p_ij = p(c_j^t) · p(o_i^t | c_j^t), the joint probability between entity o_i^t and cluster c_j^t, where p(c_j^t) is the probability that cluster c_j^t occurs and p(o_i^t | c_j^t) is the probability that entity o_i^t occurs given cluster c_j^t. If entity o_i^t belongs to cluster c_j^t, then p_ij = (n_j/n) · (1/n_j) = 1/n, where n_j and n are the number of snapshot entities in cluster c_j^t and the number of all snapshot entities respectively; otherwise p_ij = 0. Because in the present invention the clustering is a hard clustering rather than a soft one, i.e. an entity can belong to only one cluster, for any row i of the joint probability matrix P_t there exists exactly one column j such that p_ij is not 0, and thus the entries of P_t sum to 1;
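For the hard clustering described above, each row of P_t therefore holds a single nonzero entry 1/n in the column of the cluster that the entity belongs to. A minimal sketch (the helper name is an assumption):

```python
import numpy as np

def joint_probability_matrix(labels, k):
    """n x k joint probability matrix P_t of the bipartite graph BG_t:
    p_ij = p(c_j) * p(o_i | c_j) = (n_j/n) * (1/n_j) = 1/n if entity i
    belongs to cluster j, else 0 (hard clustering, rows have one
    nonzero entry, all entries sum to 1)."""
    n = len(labels)
    P = np.zeros((n, k))
    for i, c in enumerate(labels):
        P[i, c] = 1.0 / n
    return P
```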
Third, the literature on classification and clustering contains many methods for comparing two clusterings, such as the centroid-difference method, the chi-square method, the correlation-coefficient method, and the KL-divergence method. In the present setting, the clustering problem is regarded as a joint-probability-distribution problem between entities and clusters, so measuring the difference between two clusterings is equivalent to measuring the difference between two probability distributions. Since the KL divergence (also called relative entropy) is a measure derived from information theory for determining the difference between two probability distributions, the KL-divergence method is used:
Given the bipartite graph BG_t = (V_t, C_t, F_t, P_t) of the current time step t and the bipartite graph BG_h = (V_h, C_h, F_h, P_h) of a historical time step h, with the entity partition Z_t of the current time step t and the entity partition Z_h of the historical time step h, where BG_t corresponds to Z_t and BG_h corresponds to Z_h, the history cost of the two time steps h and t is defined by formula (8):
where n is the number of snapshot entities, k is the number of clusters, p_ij^t is an element of the joint probability matrix P_t between snapshot entities and clusters at time step t, and p_ij^h is an element of the joint probability matrix P_h between snapshot entities and clusters at historical time step h;
Fourth, from the analysis above, the joint probability matrix P_t (or P_h) is a sparse matrix, i.e. it contains zero elements, and the standard KL divergence does not support p_ij^t or p_ij^h being 0. Therefore the joint probability matrix P_t (or P_h) is smoothed as follows: the constant ε = e^(-12) is added to each element p_ij^t (or p_ij^h), and the elements are then renormalized. The probability matrices after smoothing are denoted P̃_t and P̃_h respectively, and formula (8) is modified into formula (9):
where n is the number of snapshot entities, k is the number of clusters, p̃_ij^t is an element of the smoothed joint probability matrix P̃_t between entities and clusters at time step t, and p̃_ij^h is an element of the smoothed joint probability matrix P̃_h between entities and clusters at historical time step h;
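The smoothing-then-KL computation can be sketched as follows; the divergence direction KL(P̃_t || P̃_h) is an assumption consistent with the text, since the formula itself is only referenced here by number.

```python
import numpy as np

def smooth(P, eps=np.exp(-12)):
    """Add the constant eps = e^(-12) to every element and renormalize,
    so the standard KL divergence is well defined on sparse matrices."""
    Q = np.asarray(P, dtype=float) + eps
    return Q / Q.sum()

def history_cost(P_t, P_h):
    """Formula (9), reconstructed: KL divergence between the smoothed
    joint probability matrices of time steps t and h."""
    Qt, Qh = smooth(P_t), smooth(P_h)
    return float(np.sum(Qt * np.log(Qt / Qh)))
```

Identical partitions yield a history cost near zero; disagreeing partitions yield a large one.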
Fifth, substituting formulas (7) and (9) into formula (1), the total objective cost function becomes formula (10):
where 0 ≤ α ≤ 1 is the weight factor of the snapshot cost, k is the number of clusters (of the entity partition) at the current time step t, c_p^t is the p-th element of the entity partition Z_t, w_ii' and w_ij are elements of W_t in the snapshot graph G_t = (V_t, E_t, W_t), o_i^t and o_j^t are snapshot entities of G_t, n is the number of snapshot entities in V_t of the bipartite graph BG_t = (V_t, C_t, F_t, P_t), and p̃_ij^t and p̃_ij^h are elements of the smoothed joint probability matrices P̃_t and P̃_h.
Embodiment 3:
Different from Embodiment 1 or 2, in the method of this embodiment the process in Step 2 of designing the similarity measure for data space entities is as follows.
On the one hand, a snapshot entity itself carries rich information, such as structured attribute information and unstructured content information. On the other hand, in a data space environment an entity may occur many times, and this historical occurrence-pattern information also helps to judge whether two entities are similar. Therefore, for data space entities, i.e. snapshot entities, similarity is measured from both the entity's own information and the historical occurrence patterns between entities; that is, the similarity function of snapshot entities consists of two parts, self similarity and historical similarity, and is defined by formula (11):
where 0 ≤ β ≤ 1 is the weight of the self similarity, o_i^t and o_j^t are snapshot entities at the current time step t, sim_self(o_i^t, o_j^t) is the self similarity between snapshot entities o_i^t and o_j^t, and sim_hist(o_i^t, o_j^t) is the historical similarity between them;
Intuitively, entities of the same class share a higher proportion of identical or similar attribute names, while entities of different classes share a lower proportion. In addition, some entities contain only unstructured information, and two entities with similar content are, to some extent, likely to belong to the same class. Accordingly, based on the structured feature information corresponding to the attribute features of a snapshot entity and the unstructured feature information corresponding to its content features, the self similarity between snapshot entities is defined by formula (12):
where 0 ≤ λ ≤ 1 is the weight of the attribute-feature similarity, and sim_attr and sim_content are the attribute-feature similarity and the content-feature similarity respectively, computed over the attribute features and content features of the two snapshot entities;
If the pattern of occurrence counts of two snapshot entities over past history steps is relatively consistent, then for the two snapshot entities at the current time step this historical-pattern correlation indicates that they are similar. The classical Pearson correlation coefficient is therefore used to measure historical similarity, as in formula (13):
where o_i^t and o_j^t are the snapshot entities at the current time step t, the per-step quantities are the numbers of occurrences of o_i^t and o_j^t at each historical time step h, and the corresponding means are their average occurrence counts over all historical time steps;
Substituting formulas (12) and (13) into formula (11), the similarity function of snapshot entities is rewritten as formula (14), in which o_i^t and o_j^t are the snapshot entities at the current time step t, the occurrence counts at each historical time step h and their averages over all historical time steps are as in formula (13), the attribute features and content features are as in formula (12), 0 ≤ β ≤ 1 is the weight of the self similarity, and 0 ≤ λ ≤ 1 is the weight of the attribute-feature similarity.
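Formulas (11)-(14) combine into a single similarity function. A minimal sketch, assuming Jaccard overlap for the attribute-feature and content-feature similarities (which this passage does not fix), with the Pearson correlation over historical occurrence counts as in formula (13):

```python
import numpy as np

def jaccard(a, b):
    """Hypothetical stand-in for sim_attr / sim_content: set overlap."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pearson(x, y):
    """Formula (13): Pearson correlation of historical occurrence counts."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def entity_similarity(attrs_i, attrs_j, content_i, content_j,
                      hist_i, hist_j, beta=0.5, lam=0.5):
    """Formula (14), reconstructed:
    sim = beta * (lam * sim_attr + (1 - lam) * sim_content)
        + (1 - beta) * sim_hist."""
    self_sim = lam * jaccard(attrs_i, attrs_j) + \
        (1 - lam) * jaccard(content_i, content_j)
    return beta * self_sim + (1 - beta) * pearson(hist_i, hist_j)
```

Two entities with identical attribute names, identical content terms, and proportional occurrence histories score 1; unrelated entities with anti-correlated histories score below 0.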
Embodiment 4:
Different from Embodiment 3, in the method of this embodiment the process in Step 3 of proposing the evolutionary K-Means clustering algorithm and solving the initial-point selection problem and the classification of evolving data space entities is as follows.
First, some related definitions are given so that the initial center points can be selected well; then the evolutionary K-Means clustering algorithm is described in detail.
It is well known that the quality of the initial points greatly affects the quality of the K-Means clustering, and the traditional random-selection method easily causes problems such as slow convergence. Therefore, before solving the initial-point selection problem, the following definitions are made:
η-neighbors at time t: given a snapshot graph G_t = (V_t, E_t, W_t) and a parameter 0 < η ≤ 1, for any snapshot entity o_i^t, the η-neighbors of o_i^t at time t are formally defined as the snapshot entities whose similarity to o_i^t is at least η, where |V_t| is the number of vertices in the snapshot graph G_t and w_ij is an element of W_t;
Similarity density at time t: given a snapshot graph G_t = (V_t, E_t, W_t) and the η-neighbors at time t, for any snapshot entity o_i^t, the similarity density of o_i^t at time t is defined over its η-neighbors.
From the definitions above: the higher the similarity density of a snapshot entity o_i^t, the more η-neighbors it has and the higher its average similarity to the other snapshot entities among them, and the higher the probability that it serves as a cluster center. The definition of similarity density at time t avoids choosing, as the cluster centers of K-Means, snapshot entities in low-density regions, noise data such as isolated snapshot entities, or snapshot entities at cluster edges;
Second, the selection principle for the first initial center point is determined: choose the snapshot entity with the maximum similarity density.
The selection principle for each initial center point other than the first is: exclude the snapshot entities that are η-neighbors of the already-selected initial center points; prefer a low average similarity to all selected initial center points; and prefer a high similarity density. For example, a candidate whose average similarity to the selected center points is 0.3 and whose similarity density is 10 would score well. This principle is formalized as formula (15):
where 1 ≤ l ≤ j-1 indexes the already-selected initial center points, the union of the η-neighbors of all selected initial center points is excluded, the similarity between a snapshot entity o_i^t and a selected initial center point enters the denominator, and the similarity density of o_i^t at time t enters the numerator; the coefficient 1 is added to prevent the denominator from being zero;
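The η-neighbor, similarity-density, and formula-(15) selection steps can be sketched as follows. The assumptions are hedged: η-neighbors are taken as entities with similarity at least η, density as the summed similarity to those neighbors (which grows with both their count and their average similarity), and the formula-(15) score as density divided by 1 plus the summed similarity to the already-chosen centers.

```python
def eta_neighbors(W, i, eta):
    """Indices j != i whose similarity w_ij to entity i is at least eta."""
    return [j for j in range(len(W)) if j != i and W[i][j] >= eta]

def similarity_density(W, i, eta):
    """Summed similarity of entity i to its eta-neighbors."""
    return sum(W[i][j] for j in eta_neighbors(W, i, eta))

def pick_initial_centers(W, k, eta=0.5):
    """First center: maximum similarity density; remaining centers:
    exclude eta-neighbors of chosen centers, then maximize
    density / (1 + similarity to chosen centers), cf. formula (15)."""
    n = len(W)
    centers = [max(range(n), key=lambda i: similarity_density(W, i, eta))]
    while len(centers) < k:
        excluded = set(centers)
        for c in centers:
            excluded.update(eta_neighbors(W, c, eta))
        candidates = [i for i in range(n) if i not in excluded]
        if not candidates:  # fall back if exclusion empties the pool
            candidates = [i for i in range(n) if i not in centers]
        score = lambda i: similarity_density(W, i, eta) / (
            1 + sum(W[i][c] for c in centers))
        centers.append(max(candidates, key=score))
    return centers
```

On a two-block similarity matrix this picks one seed from each block, which is the behavior the heuristic is designed to produce.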
The basic thought of the K-Means clustering algorithm the 3rd, performing evolution is as follows: in the institute to current time step
Having in time step, circulation performs K-Means clustering algorithm;Wherein, each time step performs K-Means clustering algorithm
Process is, selects initial center point based on similarity density and formula (15), is then iteratively performed following operation:
1) snapshot entity is assigned to bunch central point that similarity is the highest,
2) bunch central point is updated, until it reaches the condition of convergence that in formula (10), target cost is minimum;
The detailed procedure of the evolutionary K-Means clustering algorithm is as follows:
Input: a sequence of snapshot entity sets for different time steps, O = {O1, O2, …, Oh, …, Ot}, and the corresponding set of cluster counts K = {k1, k2, …, kh, …, kt};
Output: the clustering result set of all time steps, C = {C1, C2, …, Ch, …, Ct}; where h denotes the time step, h = 1, 2, …, t;
(1) for each time step h, loop over the following:
(2) use formula (14) to compute the similarity matrix Wh of the snapshot entity set Oh at the current time step h, and build the corresponding snapshot graph Gh = (Vh, Eh, Wh);
(3) initialize the cluster center point set to empty;
(4) select the initial center points: first choose the snapshot entity with the highest similarity density as the first initial center point, then select the remaining initial center points according to formula (15), where j runs in ascending order from 1 to kh and the subscript h denotes the time step;
(5) loop: assign each snapshot entity in Oh to the cluster whose center is most similar to it; update the center point of each cluster and record the clustering result Ch; repeat until the convergence condition of minimizing the objective cost function in formula (10) is met;
cumulatively update the clustering results of the different time steps;
and return the clustering results C of all time steps.
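The per-time-step loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the similarity matrix `W` is assumed precomputed (formula (14)), the density-based seeding is a simplified stand-in for formula (15), and cluster centers are represented by medoid entities rather than means.

```python
def cluster_time_step(W, k, max_iter=50):
    """One pass of steps (2)-(5) for a single time step.

    W -- n x n similarity matrix (list of lists) of snapshot graph G_h
    k -- cluster count k_h
    Returns assign, with assign[i] = cluster index of snapshot entity i.
    """
    n = len(W)
    # (4) seeding: total similarity as a simplified similarity density;
    # remaining seeds prefer dense entities dissimilar to chosen centers,
    # in the spirit of formula (15).
    density = [sum(row) for row in W]
    centers = [max(range(n), key=lambda i: density[i])]
    while len(centers) < k:
        def score(i):
            avg_sim = sum(W[i][c] for c in centers) / len(centers)
            return density[i] / (1.0 + avg_sim)
        candidates = [i for i in range(n) if i not in centers]
        centers.append(max(candidates, key=score))

    assign = [0] * n
    for it in range(max_iter):
        # 1) assign each snapshot entity to the most similar center
        new_assign = [max(range(k), key=lambda c: W[i][centers[c]])
                      for i in range(n)]
        if it > 0 and new_assign == assign:
            break                      # converged
        assign = new_assign
        # 2) update each cluster center to its most central member (medoid)
        for c in range(k):
            members = [i for i in range(n) if assign[i] == c]
            if members:
                centers[c] = max(members,
                                 key=lambda i: sum(W[i][j] for j in members))
    return assign

def evolutionary_kmeans(sim_matrices, cluster_counts):
    """Step (1): loop over all time steps and collect C_1 .. C_t."""
    return [cluster_time_step(W, k)
            for W, k in zip(sim_matrices, cluster_counts)]
```

A convergence test against formula (10) would replace the simple fixed-point check here; the sketch stops when assignments stabilize.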
Specific embodiment five:
Unlike specific embodiments one, two or four, in the entity classification method for data spaces of the present embodiment, in the case described in step four where the number of clusters changes over time or snapshot entities are added or removed over time, the process of extending the evolutionary K-Means clustering framework of step one is as follows.
In the evolutionary K-Means clustering framework described in specific embodiments one to four, the number of clusters is assumed not to change over time, and the set of snapshot entities to be clustered is identical in all time steps, i.e. no snapshot entities are added or removed. In practical applications, however, these two qualifying assumptions are too strict. This subsection therefore extends the evolutionary K-Means clustering framework proposed above to handle the following situations:
First, when the number of clusters changes over time:
When the cluster count kh at historical time step h is smaller than the cluster count kt at the current time step t, it suffices to append the corresponding columns of zeros to the joint probability matrix Ph. This is because, for a newly added cluster, the joint probability of a snapshot entity co-occurring with that cluster at historical time step h is zero. After the extension, the extended Ph and Pt are both n × kt joint probability matrices, and formula (10) is accordingly revised as:
When the cluster count kh at historical time step h is larger than the cluster count kt at the current time step t, it suffices to append the corresponding columns of zeros to the joint probability matrix Pt. This is because, for a deleted cluster, the joint probability of a snapshot entity co-occurring with that cluster at the current time step t is zero. After the extension, Ph and the extended Pt are both n × kh joint probability matrices, and formula (10) is accordingly revised as:
Second, when snapshot entities are added or removed over time:
Assume that at historical time step h, Ph is an nh × k joint probability matrix; at the current time step t, Pt is an nt × k joint probability matrix; and n0 snapshot entities occur in both time steps h and t. When snapshot entities present at historical time step h have been removed, then, for time step t, the joint probability of those removed snapshot entities co-occurring with the current clusters is 0, so the corresponding rows are appended to Pt, yielding the extended matrix. Conversely, when snapshot entities are newly added at the current time step t, then, for historical time step h, the joint probability of those newly added snapshot entities co-occurring with the historical clusters is 0, so the corresponding rows are appended to Ph, yielding the extended matrix. After the extension, both extended matrices are (nh + nt − n0) × k joint probability matrices, and formula (10) is accordingly revised as:
In the formula, the overline symbol denotes the matrix obtained after processing a matrix X with the smoothing method of formula (9), and the corresponding entries are the elements of that smoothed matrix.
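The two extensions above are pure zero-padding of the joint probability matrices. A minimal sketch, assuming matrices stored as lists of lists; how the appended rows are ordered against the union of entities across time steps is left to the caller:

```python
def pad_columns(P, k_target):
    """Append zero columns so P has k_target columns: used when the
    historical cluster count k_h differs from the current k_t.  A zero
    column encodes that no snapshot entity co-occurs with the added
    (or deleted) cluster at that time step."""
    return [row + [0.0] * (k_target - len(row)) for row in P]

def pad_rows(P, n_target):
    """Append zero rows so P has n_target rows: used when snapshot
    entities are added or removed between time steps h and t.  A zero
    row encodes that the entity does not co-occur with any cluster
    at that time step."""
    k = len(P[0])
    return P + [[0.0] * k for _ in range(n_target - len(P))]
```

After padding, both matrices have identical shape and can be compared with the KL-divergence history cost of formula (10).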
Specific embodiment six:
Unlike specific embodiment five, in the entity classification method for data spaces of the present embodiment, the snapshot entity is defined as follows: at time t, the snapshot entity can be formalized as ot = (Attr, Cont). Here, Attr denotes the structured feature information of the snapshot entity ot, such as the set of attribute names in a tuple, with Attr = {a1, a2, …, an}; Cont denotes the unstructured feature information of the snapshot entity ot, such as the set of keywords in its content, with Cont = {keyword1, keyword2, …, keywordm}; n and m denote the number of elements of the sets Attr and Cont, respectively.
Note that, at different times, the structured and unstructured feature information of an entity o may change. In reality this can have many causes, for instance an increase in the information sources from which the entity is extracted may change the entity's information. This chapter, however, is not concerned with the problem of extracting entity information at any given time. All snapshot entities at the current time step t are denoted as a set, and a schematic diagram of the snapshot graphs at adjacent time steps is shown in Figure 2.
Specific embodiment seven:
Unlike specific embodiments one, two, four or six, in the entity classification method for data spaces of the present embodiment, at time t a paper-class snapshot entity ot contains the attributes title, author and size, and also contains the unstructured content information data space, entity and classification; then ot = ({title, author, size}, {data space, entity, classification}).
Specific embodiment eight:
Unlike specific embodiment seven, in the entity classification method for data spaces of the present embodiment, the snapshot graph is defined as follows: at time step t, the snapshot graph can be formalized as Gt = (Vt, Et, Wt), where, in the graph Gt, each vertex represents a snapshot entity, each edge indicates that two snapshot entities have a similarity, and each edge weight is the similarity score between the two snapshot entities at time step t. A schematic diagram of the snapshot graphs at adjacent time steps is shown in Figure 3.
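Building Gt = (Vt, Et, Wt) from a list of snapshot entities can be sketched as below. The similarity function is supplied by the caller (it stands in for formula (14), whose exact form is defined elsewhere in the text); the edge threshold `eps` is an illustrative assumption:

```python
def build_snapshot_graph(entities, similarity, eps=0.0):
    """Build G_t = (V_t, E_t, W_t) from a list of snapshot entities.

    similarity -- function returning the similarity score of two
                  snapshot entities at time t (caller-supplied)
    Returns (V, E, W): vertex index list, weighted edge list, and the
    symmetric similarity matrix W_t.
    """
    n = len(entities)
    V = list(range(n))
    W = [[0.0] * n for _ in range(n)]
    E = []
    for i in range(n):
        W[i][i] = 1.0                    # an entity is fully similar to itself
        for j in range(i + 1, n):
            w = similarity(entities[i], entities[j])
            W[i][j] = W[j][i] = w
            if w > eps:                  # keep only edges above the threshold
                E.append((i, j, w))
    return V, E, W
```

One graph is built per time step; the sequence G1, …, Gt then drives the evolutionary clustering.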
Experiments and analysis of results:
Experimental setup:
The experiments use the DBLP data released in March 2015 as the basic dataset; the download address is http://dblp.uni-trier.de/. The extracted entity classes include paper, doctoral thesis, author, conference, journal, and university institution. The following points should be noted: (1) paper entities come from inproceedings records or from Article records whose key has the prefix "journals"; doctoral thesis entities come from phdthesis records; author entities come from WWW records or author tags; conference entities come from the booktitle tag of inproceedings records whose key has the prefix "conf"; journal entities come from the journal tag or from the booktitle tag of inproceedings records whose key has the prefix "journals"; university institutions come from the school tag. (2) Only entities produced in the time span from 2005 to 2014 are selected, with a time step of 1 year; the total number of entities after extraction is about 3M. (3) To simulate the evolutionary properties of data space entities, this chapter randomly chooses 20% of the entities in the entity set of each time step and then randomly removes some of their attribute information or content information. (4) To simulate the pay-as-you-go characteristic of data spaces, no class labels are assigned to the collected entities, i.e. there is no previously known classification information (ground truth). (5) To test the scalability of the proposed method, the number of entities of the different classes is repeatedly reduced in equal proportion, thereby generating DBLP datasets of sizes 2.5M, 2M, 1.5M and 1M.
The experimental environment is as follows: the PC host uses an Intel(R) Core(TM) i5-4570 CPU at 3.20GHz, with 4G of memory and a 1TB hard disk; the operating system is Windows 7 (64-bit), and all algorithms in the experiments are implemented in Java. Unless stated otherwise, in all experiments the default value of the parameter k in the evolutionary K-Means algorithm of this chapter is 6, and the dataset size is 3M.
Effectiveness and scalability evaluation
(1) Choice of parameters
The three groups of experiments below test the influence of different parameter values on the clustering effect, so as to determine the optimal values of the parameters λ, β and α respectively.
Experiment 1 evaluates the influence of the choice of the weight λ in the entity similarity function on the clustering effect. Since this experiment is only concerned with the influence of changes in the weight λ, the parameters are set to α = 1 and β = 1; in addition, the datasets of all time steps are aggregated, the experiment is rerun 50 times on this basis, and the average snapshot cost corresponding to each λ is recorded. In Figure 4, the abscissa represents the different values of the weight λ, and the ordinate represents the snapshot cost (see formula (7)). As can be seen from Figure 4, the snapshot cost gradually decreases as λ increases; when λ reaches 0.6, the snapshot cost is minimal (0.5 at this point), i.e. the clustering effect is optimal; afterwards, the snapshot cost gradually increases again. This shows that, for the entity self-similarity measure, the attribute feature information of an entity plays a more important role than its content information. This is mainly because, for entities of similar classes, the probability that their attribute features are similar is larger than the probability that their content is similar. The snapshot cost under different λ is shown in Figure 4.
Experiment 2 evaluates the influence of the choice of the weight β in the entity similarity function on the clustering effect. Since this experiment is only concerned with the influence of changes in the weight β, and experiment 1 showed the best results at λ = 0.6, this experiment sets the parameters α = 1 and λ = 0.6, then aggregates the datasets of all time steps, runs the experiment 50 times on this basis, and records the average snapshot cost corresponding to each β. In Figure 5, the abscissa represents the different values of the weight β, and the ordinate represents the snapshot cost (see formula (7)). As can be seen from Figure 5, the overall trend of the snapshot cost is to decrease as β increases; when β reaches 0.75, the snapshot cost is minimal (0.36 at this point), i.e. the clustering effect is optimal; afterwards, the snapshot cost gradually increases again. This shows that, for the entity similarity measure, not only the self-information of an entity but also its historical occurrence pattern information should be taken into account. Moreover, the self-information of an entity affects the quality of the entity clustering (classification) effect more than its historical occurrence pattern information does.
Experiment 3 evaluates the influence of the choice of the weight α in the objective cost function on the clustering effect. Following the conclusions of the first two experiments, this experiment sets the parameters β = 0.75 and λ = 0.6, then reruns the experiment 50 times on the DBLP datasets of the consecutive time steps (1-10), and records the corresponding mean values for all time steps. In Figure 6, the abscissa represents the time step and the ordinate represents the snapshot cost (see formula (7)). As can be seen from Figure 6, as time evolves (i.e. the time step increases), the snapshot cost at first decreases sharply and then tends to become almost stable. This is mainly because, during evolution, the self-information and historical pattern information of the entities gradually become richer until they converge, so that the clustering effect strengthens until it reaches a relatively steady state. It can further be observed that the snapshot cost gradually decreases as the value of α increases. This is mainly because the larger the value of α, the more the evolutionary clustering algorithm emphasizes the quality of the current clustering result, and hence the smaller the snapshot cost. In Figure 7, the abscissa represents the time step and the ordinate represents the history cost (see formula (9)). As can be seen from Figure 7, the history cost gradually decreases as the value of α decreases. This is mainly because the smaller the value of α, the more the evolutionary clustering algorithm emphasizes the smoothness of the historical clustering results. In Figure 8, the abscissa represents the value of the weight α and the ordinate represents the total cost (see formula (10)). It can be observed from the figure that when α is 0.9 the total cost is minimal, i.e. the clustering effect is then best. This shows that the evolutionary K-Means clustering algorithm proposed in this chapter can trade off well between the snapshot cost and the history cost. The history cost under different α is shown in Figures 6 to 8.
(2) Effect comparison of different methods
Experiment 4 compares the proposed method (Similarity Density-Based Evolutionary K-Means Clustering, SD-EKM) with other baseline methods in terms of clustering effect. The following baseline methods are designed for this experiment: (1) a naive method (Evolutionary K-Means Clustering, N-EKM), i.e. evolutionary K-Means with randomly selected initial points. It is another version of the proposed SD-EKM; the difference lies in the way the initial points are selected. (2) The classical PCM-EKM method, an evolutionary K-Means clustering method based on preserving cluster membership proposed by Yun Chi et al. Following the experimental conclusions of Yun Chi, the parameter α is likewise set to 0.9, but a small change is made in this experiment: the original similarity measure is replaced with the entity similarity measure proposed in this chapter. (3) The IND method, in which the data of each time step are run through the K-Means algorithm independently, without considering historical time steps. Since this dataset has no previously known entity classification information (that is, no theoretical ground-truth classification), and the cost functions of these methods provide no unified measure, the evaluation metric of this experiment uses the mutual information measure as a reference (interested readers may refer to Xu et al.[136] for the proposed definition of mutual information between two partitions); in essence, the higher the mutual information between two partitions, the larger the probability that they are similar. All experiments are rerun 50 times on the DBLP datasets of the consecutive time steps (1-10), and the mutual information values corresponding to all time steps are recorded. In Figure 9, the abscissa represents the time step and the ordinate represents the mutual information. As can be seen from Figure 9: (1) the mutual information of the SD-EKM, N-EKM and PCM-EKM methods is significantly better than that of the IND method, and as time evolves the mutual information values of the former remain relatively stable. This is mainly because the first three methods adopt the idea of evolutionary clustering and take the entity information of historical time steps into account. (2) The mutual information of the SD-EKM and N-EKM methods is relatively better than that of the PCM-EKM method, and their stability over consecutive time steps is also relatively better; this is mainly because the PCM-EKM method only considers the historical information of the previous time step, whereas the methods of the present invention, SD-EKM and N-EKM, consider the historical information of all historical time steps. (3) The SD-EKM method is better than N-EKM. This is mainly because the SD-EKM method of the present invention uses the similarity density criterion, which largely avoids selecting noise data (such as snapshot entities in low-density regions) as initial center points, so that the clustering result is better. A schematic comparison of the effects of the different methods is shown in Figure 9.
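The reference metric used above can be sketched directly from its definition. This is a plain (unnormalized, natural-log) mutual information between two partitions; the exact normalization used by Xu et al.[136] may differ, so treat it as an illustration of the idea rather than the paper's formula:

```python
from collections import Counter
from math import log

def mutual_information(part_a, part_b):
    """Mutual information between two partitions of the same entities.

    part_a, part_b -- cluster label per entity (same length).
    Higher values mean the two partitions agree more closely.
    """
    n = len(part_a)
    pa = Counter(part_a)                 # marginal counts of partition A
    pb = Counter(part_b)                 # marginal counts of partition B
    pab = Counter(zip(part_a, part_b))   # joint counts
    mi = 0.0
    for (a, b), nab in pab.items():
        p_ab = nab / n
        mi += p_ab * log(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi
```

Two identical partitions (up to relabeling) attain the maximal value; independent partitions give a value near zero.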
(3) Scalability
Experiment 5 tests the scalability of the proposed SD-EKM method with respect to execution time. This experiment reruns SD-EKM 50 times on datasets of different sizes and records, respectively, the average running time of each iteration and the average number of iterations executed until the method converges. In Figure 10, the abscissa represents the dataset size and the ordinate represents the average running time of each iteration. In Figure 11, the abscissa represents the dataset size and the ordinate represents the average number of iterations at convergence of the SD-EKM algorithm. From Figures 10 and 11 it can be seen that (1) the average running time of each iteration of the proposed SD-EKM is almost linear in the dataset size; (2) the number of iterations required for the algorithm to converge is almost insensitive to the dataset size, at about 650 iterations. These two groups of experiments show that the average running time of the proposed SD-EKM is linear in the dataset size, i.e. it has good scalability. The average running time of each iteration is shown in Figure 10, and a schematic diagram of the average number of iterations at convergence is shown in Figure 11.
The present invention may also have various other embodiments; without departing from the spirit and essence of the present invention, those skilled in the art may make various corresponding changes and deformations according to the present invention, but these corresponding changes and deformations shall all fall within the protection scope of the appended claims of the present invention.
Claims (8)
1. An entity classification method for data spaces, characterised in that the method is realised by the following steps:
step one, for evolving data space entities, proposing an evolutionary K-Means clustering framework, i.e. defining an objective cost function based on silhouette values and KL-divergence;
step two, designing a data space entity similarity measurement method;
step three, proposing an evolutionary K-Means clustering algorithm, and solving the initial point selection problem and the classification problem of evolving data space entities;
step four, in the case where the number of clusters changes over time or snapshot entities are added or removed over time, extending the evolutionary K-Means clustering framework of step one.
2. The entity classification method for data spaces according to claim 1, characterised in that the process described in step one, of proposing the evolutionary K-Means clustering framework, i.e. defining the objective cost function based on silhouette values and KL-divergence, is as follows:
step 1.1, defining the total objective cost function by way of a linear combination:
the cost function consists of two parts, the snapshot cost of the current time step and the history cost of the historical time steps, denoted Costsnapshot and Costtemporal respectively; the total objective cost function, which is used to evaluate the quality of the K-Means clustering of evolving entities and comprises these two parts, is defined by way of a linear combination according to the following formula:
in the formula, 0 ≤ α ≤ 1 represents the weight factor of the snapshot cost; the snapshot-cost term refers to the current time step t and the history-cost term to the historical time steps h; the decay factor e^(−(t−h)) indicates that the closer a historical time step h is to the current time step t, the heavier the weight of its history cost and the smaller its allowed deviation, i.e. the closer to the current time step t, the better the temporal smoothness of the clustering structure at historical time step h;
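The linear combination of step 1.1 can be sketched as follows. The decay factor e^(−(t−h)) and the summation over the historical time steps are a reading of the garbled source text, so this is a hedged sketch of formula (1), not its verbatim form:

```python
from math import exp

def total_cost(t, snapshot_cost_t, history_costs, alpha):
    """Sketch of formula (1): alpha * Cost_snapshot(t) plus
    (1 - alpha) * the history costs of the historical time steps h < t,
    each weighted by e^(-(t-h)) so that steps closer to t weigh more.

    history_costs -- mapping {h: Cost_temporal(h, t)}
    alpha         -- weight factor of the snapshot cost, 0 <= alpha <= 1
    """
    weighted_history = sum(exp(-(t - h)) * cost
                           for h, cost in history_costs.items())
    return alpha * snapshot_cost_t + (1 - alpha) * weighted_history
```

With alpha near 1 the current clustering quality dominates; with alpha near 0 the smoothness against history dominates, matching the trade-off discussed in the experiments.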
step 1.2, carrying out the silhouette-value-based measurement of the snapshot cost:
let the snapshot graph of the current time step t be Gt = (Vt, Et, Wt), where |Vt| = n and Wt is the similarity matrix between snapshot entities; the entity partition obtained on the basis of this snapshot graph consists of k clusters; the silhouette value criterion is used to measure the quality of the K-Means clustering result, wherein the silhouette value, also called the silhouette coefficient, is a cluster evaluation method that refers only to the data itself, without reference to a gold standard; in this cluster evaluation method each cluster is represented by one silhouette, the silhouette reflects which objects lie within a cluster and which objects are far from it, the method thereby reflects the two influence factors of cohesion and separation, and the larger the silhouette value, the better the clustering effect; the snapshot cost is defined as:
in the formula, k represents the number of clusters at the current time step t, and each cluster has a mean silhouette value; the mean silhouette value is inversely related to the snapshot cost;
step 1.3, given that each cluster comprises a group of snapshot entities, the mean silhouette value of each cluster is defined as the mean of the silhouette values of all snapshot entities in the cluster, specifically:
in the formula, the terms denote, respectively, a snapshot entity in the cluster, the number of snapshot entities in the cluster, and the silhouette value of a snapshot entity, whose measurement formula is expressed as:
wherein the first term represents the average similarity between a snapshot entity and the other snapshot entities in the cluster to which it belongs, and the second term represents the maximum average similarity between the snapshot entity and all snapshot entities in the other clusters; the larger the silhouette value, the more the average intra-cluster similarity of the snapshot entity exceeds its average inter-cluster similarity;
step 1.4, based on the physical meaning of the silhouette value measurement formula (4) for a snapshot entity described in step 1.3, the intra-cluster average similarity is defined by the formula:
and the inter-cluster maximum average similarity is defined by the formula:
in the formulas, the entities range over the snapshot entities of the respective clusters, wii′ is the similarity between snapshot entities in the same cluster, and wij is the similarity between snapshot entities in different clusters;
step 1.5, substituting formulas (3) to (6) into formula (2), the snapshot cost is rewritten as:
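The silhouette computation of steps 1.2-1.5 can be sketched over a similarity matrix. Since formula (7) itself is not recoverable from the text, the sketch uses the common similarity-based silhouette s(i) = (a(i) − b(i)) / max(a(i), b(i)) and takes cost = 1 − mean silhouette as an illustrative stand-in for "inversely related":

```python
def silhouette_snapshot_cost(W, assign, k):
    """Silhouette-style snapshot cost for a similarity matrix W and a
    cluster assignment.

    a(i): average similarity of entity i to its own cluster (cohesion);
    b(i): maximum average similarity to any other cluster (separation);
    the mean silhouette is inversely related to the snapshot cost.
    """
    n = len(W)
    clusters = [[i for i in range(n) if assign[i] == c] for c in range(k)]
    s_values = []
    for i in range(n):
        own = [j for j in clusters[assign[i]] if j != i]
        a = sum(W[i][j] for j in own) / len(own) if own else 0.0
        others = [sum(W[i][j] for j in m) / len(m)
                  for c, m in enumerate(clusters) if c != assign[i] and m]
        b = max(others) if others else 0.0
        denom = max(a, b)
        s_values.append((a - b) / denom if denom > 0 else 0.0)
    return 1.0 - sum(s_values) / n       # lower cost = better clustering
```

A well-separated partition yields a low cost, a mixed one a high cost, in line with the criterion's use in the experiments.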
step 1.6, carrying out the KL-divergence-based measurement of the history cost:
first, set up the snapshot graphs G1, G2, …, Gh, …, Gt;
at the current time step t, the entity partition based on the corresponding snapshot graph Gt is denoted Zt;
at historical time step h, the entity partition based on the corresponding historical snapshot graph Gh is denoted Zh;
second, define a measure for comparing two clusterings: a bipartite graph is used to represent the relation between entities and clusters, converting the entity partition Zt problem into a joint probability distribution problem on the bipartite graph:
let BGt = (Vt, Ct, Ft, Pt) be a bipartite graph corresponding to the snapshot graph Gt = (Vt, Et, Wt); wherein Vt is the set of snapshot entities; Ct is the set of clusters; Ft is the set of edges, the two endpoints of each edge coming from the sets Vt and Ct respectively; Pt is an n × k joint probability matrix corresponding to the edge weight matrix of the bipartite graph; it is computed with the joint probability formula, i.e. the joint probability between an entity and a cluster is determined as the product of the probability that the cluster occurs and the probability that the entity occurs given the cluster; if an entity belongs to a cluster, then pij = (nj/n)·(1/nj) = 1/n, where nj and n are respectively the number of snapshot entities in the cluster and the number of all snapshot entities; otherwise pij = 0; for the joint probability matrix Pt, for any row i there exists exactly one column j such that pij is not 0;
third, the KL-divergence method is used for the measurement:
given the bipartite graph BGt = (Vt, Ct, Ft, Pt) of the current time step t and the bipartite graph BGh = (Vh, Ch, Fh, Ph) of historical time step h, together with the entity partition Zt of the current time step t and the entity partition Zh of historical time step h, wherein BGt corresponds to Zt and BGh corresponds to Zh, the history cost of the two time steps h and t is defined as follows:
wherein n is the quantity of snapshot entities, k is the number of clusters, the first elements are those of the joint probability matrix Pt between snapshot entities and clusters at time step t, and the second are those of the joint probability matrix Ph between snapshot entities and clusters at historical time step h;
fourth, the joint probability matrix Pt or Ph is smoothed as follows: a constant ε, with ε = e^(−12), is added to each element of Pt or Ph; the elements are then re-normalised, and the probability matrices after smoothing are denoted accordingly; formula (8) is then modified to:
wherein n is the quantity of snapshot entities, k is the number of clusters, and the elements are those of the smoothed joint probability matrices between entities and clusters at time step t and at historical time step h, respectively;
fifth, substituting formula (7) and formula (8) into formula (1), the target total cost function is equivalent to:
wherein 0 ≤ α ≤ 1 is the weight factor of the snapshot cost, k represents the number of clusters at the current time step t, the partition elements are those of the entity partition Zt, wii′ or wij represents an element of Wt in the snapshot graph Gt = (Vt, Et, Wt) between snapshot entities of Gt, n represents the number of snapshot entities in Vt of the bipartite graph BGt = (Vt, Ct, Ft, Pt), and the remaining elements are those of the smoothed joint probability matrices at time steps t and h.
3. The entity classification method for data spaces according to claim 1 or claim 2, characterised in that the process of designing the data space entity similarity measurement method described in step two is as follows:
the similarity of data space entities, i.e. snapshot entities, is measured according to the self-information of the entities and the historical occurrence pattern information of the entities; that is, the similarity function of snapshot entities is composed of two parts, self-similarity and historical similarity, and its expression is defined as:
wherein 0 ≤ β ≤ 1 is the weight of the self-similarity, the entities are snapshot entities at the current time step t, and the two terms are, respectively, the self-similarity and the historical similarity between the snapshot entities;
based on the structured feature information corresponding to the attribute feature information of a snapshot entity and the unstructured feature information corresponding to its content feature information, the self-similarity between snapshot entities is defined as follows:
wherein 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity, the two terms are respectively the attribute feature similarity and the content feature similarity of the snapshot entities, and their arguments are respectively the attribute features and content features of the snapshot entities;
the classical Pearson correlation coefficient is used to measure the historical similarity, specifically:
wherein the entities are snapshot entities at the current time step t, the counts are respectively the numbers of times the snapshot entities occur at historical time step h, and the means are respectively the mean values of the occurrence counts of the snapshot entities over all historical time steps;
substituting formula (12) and formula (13) into formula (11), the similarity function of snapshot entities is rewritten as:
wherein the entities are snapshot entities at the current time step t; the counts are respectively the numbers of times the snapshot entities occur at historical time step h, and the means are the mean values of their occurrence counts over all historical time steps; the attribute features and content features are those of the snapshot entities; 0 ≤ β ≤ 1 is the weight of the self-similarity, and 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity.
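The composite similarity function of formulas (11)-(14) can be sketched as below. The claim does not specify the form of the attribute and content feature similarities, so Jaccard set overlap is used as an assumed stand-in; the Pearson part follows formula (13) over per-time-step occurrence counts:

```python
from math import sqrt

def jaccard(a, b):
    """Set-overlap similarity; an assumed stand-in for the unspecified
    attribute/content feature similarities of formula (12)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pearson(x, y):
    """Formula (13): Pearson correlation of two entities' occurrence
    counts over the historical time steps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) *
               sum((yi - my) ** 2 for yi in y))
    return num / den if den else 0.0

def entity_similarity(e1, e2, beta=0.75, lam=0.6):
    """Formula (14): beta-weighted self-similarity (itself a lam-weighted
    mix of attribute and content similarity) plus historical similarity.

    e1, e2 -- (attr_set, cont_set, history_counts) triples.
    """
    a1, c1, h1 = e1
    a2, c2, h2 = e2
    self_sim = lam * jaccard(a1, a2) + (1 - lam) * jaccard(c1, c2)
    return beta * self_sim + (1 - beta) * pearson(h1, h2)
```

The defaults beta = 0.75 and lam = 0.6 follow the optimal values reported in experiments 1 and 2.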
4. The entity classification method for data spaces according to claim 3, characterised in that the process, described in step three, of proposing the evolutionary K-Means clustering algorithm and solving the initial point selection problem and the classification problem of evolving data space entities is as follows:
first, the following related definitions are made:
definition of the η-neighbours at time t: given a snapshot graph Gt = (Vt, Et, Wt) and a parameter 0 < η ≤ 1, then, for any snapshot entity, its η-neighbours at time t are formally defined as:
wherein |Vt| is the number of vertices in the snapshot graph Gt, and the weights are elements of Wt;
definition of the similarity density at time t: given a snapshot graph Gt = (Vt, Et, Wt) and the η-neighbours at time t, then, for any snapshot entity, its similarity density at time t is formally defined as:
second, the selection principle of the first initial center point is determined to be the snapshot entity with the maximum similarity density;
the selection principle of the initial center points other than the first is determined as: exclude the snapshot entities in the η-neighbours of the initial center points already selected; have a low average similarity to all selected initial center points; and have a high similarity density as the current candidate point; this principle can be formalized as the following equation:
wherein 1 ≤ l ≤ j−1 indexes the initial center points already selected, the neighbourhood term is the union of the η-neighbours of all selected initial center points, the similarity term is the similarity between a snapshot entity and a selected initial center point, and the density term is the similarity density of the snapshot entity at time t; the constant 1 is added to the denominator to prevent it from being zero;
third, the basic idea of performing the evolutionary K-Means clustering algorithm is as follows: over all time steps up to the current time step, the K-Means clustering algorithm is executed in a loop; wherein, at each time step, the process of performing the K-Means clustering algorithm is to select initial center points based on similarity density and formula (15), and then to iteratively perform the following operations:
1) assign each snapshot entity to the cluster whose center point is most similar to it,
2) update the cluster center points, until the convergence condition of minimizing the objective cost in formula (10) is reached;
the detailed procedure of the evolutionary K-Means clustering algorithm is as follows:
Input: a sequence of snapshot entity sets for different time steps, O = {O1, O2, …, Oh, …, Ot}, and the corresponding set of cluster counts K = {k1, k2, …, kh, …, kt};
Output: the clustering result set of all time steps, C = {C1, C2, …, Ch, …, Ct}; where h denotes the time step, h = 1, 2, …, t;
(1) for each time step h, loop over the following:
(2) use formula (14) to compute the similarity matrix Wh of the snapshot entity set Oh at the current time step h, and build the corresponding snapshot graph Gh = (Vh, Eh, Wh);
(3) initialize the cluster center point set to empty;
(4) select the initial center points: first choose the snapshot entity with the highest similarity density as the first initial center point, then select the remaining initial center points according to formula (15), where j runs in ascending order from 1 to kh and the subscript h denotes the time step;
(5) loop: assign each snapshot entity in Oh to the cluster whose center is most similar to it; update the center point of each cluster and record the clustering result Ch; repeat until the convergence condition of minimizing the objective cost function in formula (10) is met;
cumulatively update the clustering results of the different time steps;
and return the clustering results C of all time steps.
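The η-neighbour and similarity-density definitions and the seeding principle of formula (15) can be sketched as follows. The exact inequality in the η-neighbour definition and the normalisation of the density are not recoverable from the text, so both are labelled as plausible readings:

```python
def eta_neighbours(W, i, eta):
    """η-neighbours of snapshot entity i at time t: vertices of the
    snapshot graph whose similarity to i is at least eta (one plausible
    reading of the definition)."""
    return {j for j in range(len(W)) if j != i and W[i][j] >= eta}

def similarity_density(W, i, eta):
    """Similarity density of entity i: total similarity to its
    η-neighbours, normalised by the number of vertices |V_t|
    (an assumed normalisation)."""
    return sum(W[i][j] for j in eta_neighbours(W, i, eta)) / len(W)

def pick_initial_centers(W, k, eta=0.5):
    """Initial-center selection in the spirit of formula (15): start with
    the densest entity, then repeatedly pick a dense entity outside the
    η-neighbours of the chosen centers that is dissimilar to them."""
    n = len(W)
    centers = [max(range(n), key=lambda i: similarity_density(W, i, eta))]
    while len(centers) < k:
        banned = set(centers)
        for c in centers:
            banned |= eta_neighbours(W, c, eta)
        pool = ([i for i in range(n) if i not in banned]
                or [i for i in range(n) if i not in centers])
        def score(i):
            avg = sum(W[i][c] for c in centers) / len(centers)
            return similarity_density(W, i, eta) / (1.0 + avg)
        centers.append(max(pool, key=score))
    return centers
```

Excluding the η-neighbours of chosen centers is what keeps seeds out of one dense region, which is the claimed advantage of SD-EKM over random seeding.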
5. according to the entity classification method in data-oriented space described in claim 1,2 or 4, it is characterised in that: described in step 4
In the case of number of clusters amount changes in time or snapshot entity adds in time or removes, spread step one develops
The process of K-Means cluster framework is,
First, when the number of clusters changes over time:
When the cluster number k_h at historical time step h is smaller than the cluster number k_t at current time step t, it suffices to add the corresponding columns, with entries 0, to the joint probability matrix P_h, thereby extending it. After the extension, the extended P_h and P_t are both n × k_t joint probability matrices; accordingly, formula (10) is revised as:
When the cluster number k_h at historical time step h is larger than the cluster number k_t at current time step t, the corresponding columns, with entries 0, are added to the joint probability matrix P_t, thereby extending it. After the extension, P_h and the extended P_t are both n × k_h joint probability matrices; accordingly, formula (10) is revised as:
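The two column-extension cases above both amount to zero-padding the narrower joint probability matrix so that the historical and current time steps compare matrices of equal width. A minimal sketch, assuming the matrices are plain NumPy arrays:

```python
import numpy as np

def pad_clusters(P_h, P_t):
    """Zero-pad columns so the historical matrix P_h (n x k_h) and the
    current matrix P_t (n x k_t) both become n x max(k_h, k_t): the
    joint probability of an entity with a cluster that does not exist
    at that time step is taken to be 0."""
    k_h, k_t = P_h.shape[1], P_t.shape[1]
    if k_h < k_t:
        P_h = np.hstack([P_h, np.zeros((P_h.shape[0], k_t - k_h))])
    elif k_t < k_h:
        P_t = np.hstack([P_t, np.zeros((P_t.shape[0], k_h - k_t))])
    return P_h, P_t
```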
Second, when snapshot entities are added or removed over time:
Assume that at historical time step h, P_h is an n_h × k joint probability matrix; that at current time step t, P_t is an n_t × k joint probability matrix; and that n_0 snapshot entities occur in both time steps h and t. When snapshot entities present at historical time step h have been removed, then for time step t the joint occurrence probability of those removed snapshot entities with the current clusters is 0, and the corresponding rows are added to P_t to obtain the extended P_t. Likewise, when snapshot entities are newly added at current time step t, then for historical time step h the joint occurrence probability of those newly added snapshot entities with the historical clusters is 0, and the corresponding rows are added to P_h to obtain the extended P_h. After the extension, the extended P_h and the extended P_t are both (n_h + n_t − n_0) × k joint probability matrices; accordingly, formula (10) is revised as:
In the formulas, a tilde over a matrix X denotes the matrix obtained after processing X with the smoothing method of formula (9), and p̃_ij denotes an element of such a smoothed matrix.
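The row-extension case can be sketched the same way: rows for entities absent from one time step are filled with zeros, so that both matrices cover the union of the (n_h + n_t − n_0) distinct entities. The entity-id bookkeeping below is illustrative and not part of the patent:

```python
import numpy as np

def align_entity_rows(ids_h, P_h, ids_t, P_t):
    """Extend P_h (n_h x k) and P_t (n_t x k) over the union of
    snapshot-entity ids, so both become (n_h + n_t - n_0) x k; an
    entity absent from a time step gets a zero row (joint occurrence
    probability 0 with every cluster of that step)."""
    union = sorted(set(ids_h) | set(ids_t))
    pos = {e: i for i, e in enumerate(union)}
    k = P_h.shape[1]

    def extend(ids, P):
        out = np.zeros((len(union), k))
        for e, row in zip(ids, P):
            out[pos[e]] = row
        return out

    return extend(ids_h, P_h), extend(ids_t, P_t)
```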
6. The entity classification method for the data space according to claim 5, characterized in that: the snapshot entity is defined as follows. At time t, a snapshot entity is formally expressed as o_t = (Attr, Cont), where Attr denotes the structured feature information of the snapshot entity o_t, with Attr = {a_1, a_2, …, a_n}, and Cont denotes the unstructured feature information of the snapshot entity o_t, with Cont = {keyword_1, keyword_2, …, keyword_m}; n and m denote the number of elements of the sets Attr and Cont respectively. All snapshot entities at the current time step t are denoted O_t.
7. The entity classification method for the data space according to claim 1, 2, 4 or 6, characterized in that: the snapshot entity is a paper; then at time t, the paper snapshot entity o_t comprises the attributes title, author and size, and also comprises the unstructured content information data space, entity and classification; thus o_t = ({title, author, size}, {data space, entity, classification}).
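For illustration, the (Attr, Cont) pair of claims 6 and 7 maps naturally onto a small data structure. A sketch in Python, with hypothetical field values (the title, author and size below are made up):

```python
from dataclasses import dataclass

@dataclass
class SnapshotEntity:
    """Snapshot entity o_t = (Attr, Cont): structured attributes plus
    unstructured keyword content, per claims 6 and 7."""
    attr: dict   # structured features a_1..a_n, here keyed by name
    cont: list   # unstructured keywords keyword_1..keyword_m

# The paper example from claim 7:
paper = SnapshotEntity(
    attr={"title": "An Example Paper", "author": "A. Author", "size": 12},
    cont=["data space", "entity", "classification"],
)
```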
8. The entity classification method for the data space according to claim 7, characterized in that: the snapshot graph is defined as follows. For a time step t, the snapshot graph at time t can be formally represented as G_t = (V_t, E_t, W_t), where in the graph G_t each vertex represents a snapshot entity, each edge indicates that the two snapshot entities it connects have a similarity, and the weight of each edge is the similarity score between those two snapshot entities at time step t.
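A snapshot graph of the shape described in claim 8 can be assembled from any pairwise similarity function; formula (14) is not reproduced in this text, so a caller-supplied scorer stands in below:

```python
import numpy as np

def build_snapshot_graph(entities, similarity, threshold=0.0):
    """Build G_t = (V_t, E_t, W_t): one vertex per snapshot entity, an
    edge wherever the similarity score exceeds `threshold`, and W_t
    holding the edge weights (the similarity scores)."""
    n = len(entities)
    V = list(range(n))
    E = []
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity(entities[i], entities[j])
            if s > threshold:
                E.append((i, j))
                W[i, j] = W[j, i] = s
    return V, E, W

# Example with a simple Jaccard keyword overlap as the scorer:
ents = [{"data space", "entity"}, {"entity", "classification"}, {"parking"}]
V, E, W = build_snapshot_graph(ents, lambda a, b: len(a & b) / len(a | b))
```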
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610348890.4A CN106067029B (en) | 2016-05-24 | 2016-05-24 | The entity classification method in data-oriented space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106067029A true CN106067029A (en) | 2016-11-02 |
CN106067029B CN106067029B (en) | 2019-06-18 |
Family
ID=57420728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610348890.4A Expired - Fee Related CN106067029B (en) | 2016-05-24 | 2016-05-24 | The entity classification method in data-oriented space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106067029B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060200431A1 (en) * | 2005-03-01 | 2006-09-07 | Microsoft Corporation | Private clustering and statistical queries while analyzing a large database |
CN102388390A (en) * | 2009-04-01 | 2012-03-21 | 微软公司 | Clustering videos by location |
CN103902699A (en) * | 2014-03-31 | 2014-07-02 | 哈尔滨工程大学 | Data space retrieval method applied to big data environments and supporting multi-format feature |
CN103984681A (en) * | 2014-03-31 | 2014-08-13 | 同济大学 | News event evolution analysis method based on time sequence distribution information and topic model |
US20140320388A1 (en) * | 2013-04-25 | 2014-10-30 | Microsoft Corporation | Streaming k-means computations |
CN104731811A (en) * | 2013-12-20 | 2015-06-24 | 北京师范大学珠海分校 | Cluster information evolution analysis method for large-scale dynamic short texts |
2016-05-24: CN application CN201610348890.4A, patent CN106067029B (en), status not active: Expired - Fee Related
Non-Patent Citations (6)
Title |
---|
GUANWEN ZHU 等: "Query Planning with Source Descriptions for Deep Web", 《PROCEEDINGS OF THE 2012 INTERNATIONAL CONFERENCE ON CYBERNETICS AND INFORMATICS》 * |
YUN CHI 等: "On evolutionary spectral clustering", 《ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA》 * |
侯薇 等: "一种基于隶属度优化的演化聚类算法", 《计算机研究与发展》 * |
祝官文 等: "基于主题和表单属性的深层网络数据源分类方法", 《电子学报》 * |
董红斌 等: "协同演化算法在聚类中的应用", 《模式识别与人工智能》 * |
高兵 等: "基于共享最近邻密度的演化数据流聚类算法", 《北京科技大学学报》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108806355A (en) * | 2018-04-26 | 2018-11-13 | 浙江工业大学 | A kind of calligraphy and painting art interactive education system |
CN108806355B (en) * | 2018-04-26 | 2020-05-08 | 浙江工业大学 | Painting and calligraphy art interactive education system |
CN108932528A (en) * | 2018-06-08 | 2018-12-04 | 哈尔滨工程大学 | Similarity measurement and method for cutting in chameleon algorithm |
CN109543712A (en) * | 2018-10-16 | 2019-03-29 | 哈尔滨工业大学 | Entity recognition method on temporal dataset |
CN110033644A (en) * | 2019-04-22 | 2019-07-19 | 泰华智慧产业集团股份有限公司 | Parking position reserves air navigation aid and system |
CN111161819A (en) * | 2019-12-31 | 2020-05-15 | 重庆亚德科技股份有限公司 | Traditional Chinese medical record data processing system and method |
CN114386422A (en) * | 2022-01-14 | 2022-04-22 | 淮安市创新创业科技服务中心 | Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction |
CN114386422B (en) * | 2022-01-14 | 2023-09-15 | 淮安市创新创业科技服务中心 | Intelligent auxiliary decision-making method and device based on enterprise pollution public opinion extraction |
CN116450830A (en) * | 2023-06-16 | 2023-07-18 | 暨南大学 | Intelligent campus pushing method and system based on big data |
CN116450830B (en) * | 2023-06-16 | 2023-08-11 | 暨南大学 | Intelligent campus pushing method and system based on big data |
Also Published As
Publication number | Publication date |
---|---|
CN106067029B (en) | 2019-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106067029A (en) | The entity classification method in data-oriented space | |
Paredes et al. | Machine learning or discrete choice models for car ownership demand estimation and prediction? | |
US20180349384A1 (en) | Differentially private database queries involving rank statistics | |
US9355367B2 (en) | System and method for using graph transduction techniques to make relational classifications on a single connected network | |
CN106817251B (en) | Link prediction method and device based on node similarity | |
Guo et al. | Local community detection algorithm based on local modularity density | |
Liu et al. | Deep learning approaches for link prediction in social network services | |
Adcock et al. | Tree decompositions and social graphs | |
Gong et al. | Identification of multi-resolution network structures with multi-objective immune algorithm | |
Chen et al. | Exploiting structural and temporal evolution in dynamic link prediction | |
Karpatne et al. | Predictive learning in the presence of heterogeneity and limited training data | |
CN107220311A (en) | A kind of document representation method of utilization locally embedding topic modeling | |
Shahbazi et al. | A survey on techniques for identifying and resolving representation bias in data | |
CN114154557A (en) | Cancer tissue classification method, apparatus, electronic device, and storage medium | |
Ban et al. | Micro-directional propagation method based on user clustering | |
CN106651461A (en) | Film personalized recommendation method based on gray theory | |
CN101241520A (en) | Model state creation method based on characteristic suppression in finite element modeling | |
Zhang et al. | Closeness degree-based hesitant trapezoidal fuzzy multicriteria decision making method for evaluating green suppliers with qualitative information | |
Rahmani Seryasat et al. | Predicting the number of comments on Facebook posts using an ensemble regression model | |
Chhabra et al. | Missing value imputation using hybrid k-means and association rules | |
CN117036781A (en) | Image classification method based on tree comprehensive diversity depth forests | |
Paul et al. | Community detection using Local Group Assimilation | |
de Sá et al. | A novel approach to estimated Boulingand-Minkowski fractal dimension from complex networks | |
Huang et al. | Community detection algorithm for social network based on node intimacy and graph embedding model | |
Ma et al. | Discover semantic topics in patents within a specific domain |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | | Granted publication date: 20190618; Termination date: 20200524 |