CN106067029A - The entity classification method in data-oriented space - Google Patents
- Publication number
- CN106067029A CN106067029A CN201610348890.4A CN201610348890A CN106067029A CN 106067029 A CN106067029 A CN 106067029A CN 201610348890 A CN201610348890 A CN 201610348890A CN 106067029 A CN106067029 A CN 106067029A
- Authority
- CN
- China
- Prior art keywords
- entity
- snapshot
- sigma
- time step
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
A method for classifying entities in a data space, belonging to the field of natural language processing. It addresses the problem that, in an evolving environment, existing methods cannot classify entities because they assume entities are in a static state. First, for evolving data-space entities, an improved, evolutionary K-Means clustering framework is proposed, i.e. an objective cost function is defined based on silhouette values and KL divergence. Second, a novel similarity measure for data-space entities is designed. Third, following heuristic rules, an evolutionary K-Means clustering algorithm is proposed. In addition, the evolutionary clustering framework is further extended to handle the cases where the number of clusters changes over time or snapshot entities are added or removed over time. The invention not only captures the current entity clustering result with high quality, but also robustly reflects the historical clustering situation.
Description
Technical field
The present invention relates to a method for classifying entities in a data space.
Background technology
Data space integration is one of the important approaches to building a data space. A data space faces heterogeneous, semantically complex, large-scale data stored in a distributed fashion, so data space integration mainly comprises two kinds of work: (1) entity integration; (2) entity-relationship integration. Existing data space integration work focuses mainly on entity-relationship integration, for which some effective strategies and methods have been proposed, whereas research on entity integration is relatively scarce. Integration over data spaces, and entity integration in particular, is therefore significant. As an important step of entity integration, entity classification is widely used, for example in question answering, relation extraction, data space querying, machine translation, and text clustering. Entity classification technology for data spaces is therefore of real significance.
At present, research on classifying (named) entities has attracted broad attention from many scholars in the field of natural language processing (NLP). This work falls into two broad classes: coarse-grained entity classification and fine-grained entity classification. Coarse-grained entity classification aims to divide a group of entities into a small set of coarse-grained class labels; the number of classes is typically fewer than 20 and the classes have no hierarchy, e.g. person names, organization names, and place names. Common approaches include machine-learning-based methods and methods based on auxiliary knowledge such as ontologies and external resources. For example, Chifu et al. use an unsupervised neural network model to classify named entities without supervision, Kliegr proposes an unsupervised Bag-of-Articles named-entity classification method, and Gamallo and Garcia propose a resource-based named-entity classification system. Fine-grained entity classification divides entities into finer-grained classes; there are more classes and the class hierarchy is more complex. For example, FIGER uses 112 Freebase types, and HYENA uses 505 YAGO types. Typical methods are context-based or based on grammatical features. For example, Gillick et al. propose a context-dependent fine-grained entity classification method based on grammatical features, and Giuliano and Gliozzo propose a fine-grained entity classification method based on instance-based learning, thereby generating a richer ontology.
However, the entity classification methods of the NLP field above usually rely on contextual information, linguistic information, and prior entity-class knowledge such as external knowledge features, and the objects being classified are static; entity classification technology in data spaces, by contrast, has rarely been studied. In a data space environment, entity classification is a challenging task, mainly for the following reasons. (1) Richness of entity information. As described in Chapter 2, a data space entity comprises not only its name but also rich attribute-feature and content-feature information; in fact this information is especially important, so a more appropriate similarity function is needed to assess the similarity between data space entities. (2) Lag of entity-class knowledge. Because a data space advocates a pay-as-you-go mode of building while integrating, knowledge about entity classes is acquired only gradually, so clustering is the more appropriate technique for realizing entity classification. (3) Dynamic evolution of entities. Traditional entity classification methods make a strict assumption: entities are static and do not evolve over time. This assumption does not hold in a data space environment, where the extracted information about each entity and the number of entities change constantly. Classifying entities in such an evolving environment is therefore all the more challenging.
Summary of the invention
The purpose of the invention is to solve the problem that, in an evolving environment, existing methods cannot classify entities because they assume entities are in a static state, and to propose a method for classifying entities in a data space.
A method for classifying entities in a data space, realized by the following steps:
Step 1: for evolving data space entities, propose an evolutionary K-Means clustering framework, i.e. define an objective cost function based on silhouette values and KL divergence;
Step 2: design a similarity measure for data space entities;
Step 3: propose an evolutionary K-Means clustering algorithm, solving the initial-point selection problem and the problem of classifying evolving data space entities;
Step 4: extend the evolutionary K-Means clustering framework of Step 1 to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.
The invention has the following benefits:
For evolving data space entities, an improved, evolutionary K-Means clustering framework is proposed, i.e. an objective cost function is defined based on silhouette values and KL divergence. It considers not only the quality of the current clustering, i.e. the snapshot cost, but also the temporal smoothness with respect to all historical clustering structures, i.e. the history cost. The designed similarity measure for data space entities considers not only the entity's own richer information, such as its structured feature information and unstructured feature information, but also the historical occurrence-pattern information between entities, and thus measures the similarity between entities more accurately. An evolutionary K-Means clustering algorithm is proposed that solves the initial-point selection problem and the problem of classifying evolving data space entities. Finally, the evolutionary K-Means clustering framework is further extended so that the cases where the number of clusters changes over time or snapshot entities are added or removed over time, i.e. the generality problem of the evolutionary K-Means clustering framework, are handled.
Accompanying drawing explanation
Fig. 1 is the flow chart of the invention;
Fig. 2 is a schematic diagram of the snapshot graphs at adjacent time steps, showing the snapshot graphs at time steps t-1 and t. A snapshot graph contains 6 vertices, each vertex corresponding to one snapshot entity, and the number on an edge represents the similarity between snapshot entities. From time step t-1 to t, some similarities change (e.g. between snapshot entities V_1 and V_2, and between V_3 and V_4), while others do not (e.g. between V_5 and V_6).
Assume the snapshot graph G_t contains n vertices; then W_t ∈ R^(n×n) is an adjacency (similarity) matrix on G_t. As time evolves, the similarity history between snapshot entities is captured by a series of snapshot graphs <G_1, G_2, ..., G_h, ..., G_t>;
Fig. 3 shows the clustering scenario for evolving entities, illustrating the snapshot graphs at time steps t-1 and t, where each vertex represents a snapshot entity and the number on an edge is the similarity score between snapshot entities. Clearly, at time step t-1 the six snapshot entities should be clustered as in clustering 1. At time step t the clustering result is not unique; for example, the six snapshot entities may be clustered as in clustering 2 or as in clustering 3. Both clustering 2 and clustering 3 guarantee good clustering quality at the current time step t, but by the basic idea of this method clustering 2 is preferred, because it is more consistent with the clustering result of the historical time step t-1;
Fig. 4 shows the snapshot cost under different λ in Experiment 1;
Fig. 5 shows the snapshot cost under different β in Experiment 2;
Fig. 6 to Fig. 8 show the history cost under different α in Experiment 3;
Fig. 9 shows the association measure under different evolution methods in Experiment 4;
Fig. 10 is a schematic diagram of the average running time per iteration of SD-EKM;
Fig. 11 is a schematic diagram of the average number of iterations to convergence.
Detailed description of the invention
Embodiment 1:
The method for classifying entities in a data space of this embodiment, with the flow chart shown in Fig. 1, is realized by the following steps:
Step 1: for evolving data space entities, propose an evolutionary K-Means clustering framework, i.e. define an objective cost function based on silhouette values and KL divergence;
Step 2: design a similarity measure for data space entities;
Step 3: propose an evolutionary K-Means clustering algorithm, solving the initial-point selection problem and the problem of classifying evolving data space entities;
Step 4: extend the evolutionary K-Means clustering framework of Step 1 to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.
Embodiment 2:
Different from Embodiment 1, in the method of this embodiment the process in Step 1 of proposing the evolutionary K-Means clustering framework, i.e. defining the objective cost function based on silhouette values and KL divergence, is as follows.
Step 1.1: define the total objective cost function as a linear combination.
The cost function consists of two parts: the snapshot cost of the current time step and the history cost of the historical time steps, denoted Cost_snapshot and Cost_temporal respectively. The former measures only the snapshot quality of the current clustering result with respect to the current entity information and reflects the quality of the clustering algorithm; clearly, a higher snapshot cost means lower snapshot quality. The latter measures temporal smoothness according to how well the current clustering structure fits the historical clustering structures; clearly, a higher history cost means that the clustering structures of consecutive time steps agree poorly, i.e. temporal smoothness is weak. In addition, the history costs of different historical time steps carry different weights. The total objective cost function, used to assess the clustering quality of the evolutionary entity K-Means, is defined as the linear combination of the snapshot cost of the current time step and the history costs of the historical time steps, as in formula (1):
In the formula, 0 ≤ α ≤ 1 is the weight factor of the snapshot cost; Cost_snapshot^t is the snapshot cost of the current time step t; Cost_temporal^h is the history cost of historical time step h; and the factor e^(h-t) gives a historical step h that is closer to the current time step t a heavier weight, since its degree of departure is smaller. Because the total objective cost function is the smaller the better, the closer h is to the current time step t, the better the temporal smoothness with respect to the clustering structure of historical time step h should be;
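As a concrete reading of formula (1), the linear combination with exponentially decayed history weights can be sketched as follows. This is a minimal sketch: the function name `total_cost` and the exact decay factor e^(h-t) are assumptions based on the description above, not the patent's reference implementation.

```python
import math

def total_cost(t, snapshot_cost, history_costs, alpha=0.5):
    """Formula (1), reconstructed: alpha * Cost_snapshot(t) plus
    (1 - alpha) * sum over history steps h of e^(h-t) * Cost_temporal(h),
    so that history steps nearer to t weigh more."""
    decayed = sum(math.exp(h - t) * cost for h, cost in history_costs.items())
    return alpha * snapshot_cost + (1 - alpha) * decayed
```

With alpha = 1 only the current snapshot quality matters; with alpha = 0 only agreement with the historical clusterings matters.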
Step 1.2: measure the snapshot cost based on silhouette values.
Let the snapshot graph of the current time step t be G_t = (V_t, E_t, W_t), where |V_t| = n and W_t is the similarity matrix between snapshot entities; the concrete similarity computation is given in formula (11). The entity partition obtained on this snapshot graph is Z_t = {c_1^t, c_2^t, ..., c_k^t}. The snapshot cost is intended to measure the snapshot quality of the current clustering result with respect to the current snapshot entities, reflecting the quality of the clustering algorithm; clearly, a higher snapshot cost means lower snapshot quality. Many mature criteria for assessing clusterings exist, such as contingency tables, the sum-of-squared-errors criterion, the silhouette value, and precision and recall based on class labels. These criteria differ in whether they refer to a gold standard, in their dependence on the similarity measure, and in their bias toward certain cluster counts. Because in this setting the number of entities is large, the method depends on a specific similarity measure, and only the entities themselves can be referred to, the silhouette-value criterion is used to measure the quality of the K-Means clustering result. The silhouette value, also called the silhouette coefficient, is a cluster-evaluation method proposed by Kaufman and Rousseeuw that refers only to the data itself and not to a gold standard. In this method each cluster is represented by a silhouette, which reflects the objects lying inside the cluster and those far from it, capturing the two influencing factors of cohesion and separation; the larger the silhouette value, the better the clustering. The snapshot cost is defined by formula (2):
In the formula, k is the number of clusters of the entity partition at the current time step t, c_p^t is the p-th cluster (of the entity partition), and AvgSil(c_p^t) is the mean silhouette value of cluster c_p^t. By formula (1), the snapshot cost is the smaller the better, and the larger the mean silhouette value, the better the clustering, so the mean silhouette value is inversely related to the snapshot cost. The physical meaning of formula (2) is that, under the entity partition Z_t, the larger the mean silhouette value, the better the clustering quality, and the more accurately the partition reflects the characteristics of the current snapshot entities;
Step 1.3: each cluster c_p^t contains a group of snapshot entities o_i^t; the mean silhouette value AvgSil(c_p^t) of cluster c_p^t is defined in formula (3) as the average of the silhouette values of all snapshot entities in the cluster:
In the formula, o_i^t is a snapshot entity in cluster c_p^t; |c_p^t| is the number of snapshot entities in cluster c_p^t; and Sil(o_i^t) is the silhouette value of snapshot entity o_i^t, measured by formula (4):
where a(o_i^t) is the average similarity of snapshot entity o_i^t to the other snapshot entities in its own cluster c_p^t, and b(o_i^t) is the maximum average similarity of o_i^t to all snapshot entities of any other cluster. The larger the value of Sil(o_i^t), the more the intra-cluster average similarity of o_i^t exceeds its inter-cluster average similarity, and thus the more correctly o_i^t is classified;
Step 1.4: based on the physical meaning of formula (4) for the silhouette value Sil(o_i^t) in Step 1.3, a(o_i^t) is defined by formula (5) and b(o_i^t) by formula (6):
In the formulas, o_i^t and o_i'^t are snapshot entities of cluster c_p^t, w_ii' is the similarity between snapshot entities o_i^t and o_i'^t of the same cluster, and w_ij is the similarity between snapshot entities o_i^t and o_j^t of different clusters;
Step 1.5: substituting formulas (3) to (6) into formula (2), the snapshot cost is rewritten as formula (7):
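The silhouette computation of formulas (3)-(7) over a similarity matrix can be sketched as below. This is a hedged reconstruction: the form Sil = (a - b)/max(a, b) with similarities (rather than distances), and the normalization of the snapshot cost as 1 minus the mean of the per-cluster average silhouettes, are assumptions consistent with the text, not the patent's exact formulas.

```python
import numpy as np

def silhouette(W, labels, i):
    """Sil(o_i) = (a - b) / max(a, b), where a is the average similarity
    of entity i to the rest of its own cluster (cf. formula (5)) and b is
    the maximum average similarity to any other cluster (cf. formula (6))."""
    n = len(labels)
    own = labels[i]
    same = [j for j in range(n) if labels[j] == own and j != i]
    a = float(np.mean([W[i][j] for j in same])) if same else 0.0
    b = 0.0
    for c in set(labels) - {own}:
        members = [j for j in range(n) if labels[j] == c]
        b = max(b, float(np.mean([W[i][j] for j in members])))
    denom = max(a, b)
    return (a - b) / denom if denom > 0 else 0.0

def snapshot_cost(W, labels):
    """Cf. formulas (2)-(3): average the silhouettes inside each cluster,
    then turn the mean over clusters into a cost (smaller is better)."""
    per_cluster = []
    for c in set(labels):
        members = [i for i in range(len(labels)) if labels[i] == c]
        per_cluster.append(np.mean([silhouette(W, labels, i) for i in members]))
    return 1.0 - float(np.mean(per_cluster))
```

A well-separated two-cluster similarity matrix gives a low snapshot cost under the correct labeling and a high cost under a scrambled one.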
Step 1.6: measure the history cost based on KL divergence.
First, the history cost is intended to measure temporal smoothness according to how well the current clustering structure fits the historical clustering structures; clearly, a smaller history cost means that the clustering structures of consecutive time steps agree well, i.e. temporal smoothness is strong. For the discussion, let the snapshot graphs be G_1, G_2, ..., G_h, ..., G_t; at the current time step t, the entity partition based on the corresponding snapshot graph G_t is denoted Z_t; at a historical time step h, the entity partition based on the corresponding historical snapshot graph G_h is denoted Z_h.
Second, define a measure for comparing two clusterings. Inspired by the idea of graph factorization clustering, a bipartite graph is used to represent the relation between entities and clusters, converting the entity partition Z_t into a joint-probability-distribution problem on the bipartite graph:
Let BG_t = (V_t, C_t, F_t, P_t) be a bipartite graph corresponding to the snapshot graph G_t = (V_t, E_t, W_t), where V_t is the set of snapshot entities, C_t is the set of clusters, and F_t is the set of edges, the two endpoints of each edge coming from V_t and C_t respectively. P_t is an n × k joint probability matrix corresponding to the edge-weight matrix of the bipartite graph; its entries are computed with the joint probability formula p_ij = p(c_j^t) · p(o_i^t | c_j^t), the joint probability between entity o_i^t and cluster c_j^t, where p(c_j^t) is the probability that cluster c_j^t occurs and p(o_i^t | c_j^t) is the probability that entity o_i^t occurs given cluster c_j^t. If entity o_i^t belongs to cluster c_j^t, then p_ij = (n_j/n) · (1/n_j) = 1/n, where n_j and n are the number of snapshot entities in cluster c_j^t and the number of all snapshot entities respectively; otherwise p_ij = 0. Because in the present invention the clustering is a hard clustering rather than a soft one, i.e. an entity can belong to only one cluster, for any row i of the joint probability matrix P_t there exists exactly one column j such that p_ij is not 0, and thus the entries of P_t sum to 1;
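For the hard clustering described above, each row of P_t therefore holds a single nonzero entry 1/n in the column of the cluster that the entity belongs to. A minimal sketch (the helper name is an assumption):

```python
import numpy as np

def joint_probability_matrix(labels, k):
    """n x k joint probability matrix P_t of the bipartite graph BG_t:
    p_ij = p(c_j) * p(o_i | c_j) = (n_j/n) * (1/n_j) = 1/n if entity i
    belongs to cluster j, else 0 (hard clustering, rows have one
    nonzero entry, all entries sum to 1)."""
    n = len(labels)
    P = np.zeros((n, k))
    for i, c in enumerate(labels):
        P[i, c] = 1.0 / n
    return P
```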
Third, the literature on classification and clustering contains many methods for comparing two clusterings, such as the centroid-difference method, the chi-square method, the correlation-coefficient method, and the KL-divergence method. In the present setting, the clustering problem is regarded as a joint-probability-distribution problem between entities and clusters, so measuring the difference between two clusterings is equivalent to measuring the difference between two probability distributions. Since the KL divergence (also called relative entropy) is a measure derived from information theory for determining the difference between two probability distributions, the KL-divergence method is used:
Given the bipartite graph BG_t = (V_t, C_t, F_t, P_t) of the current time step t and the bipartite graph BG_h = (V_h, C_h, F_h, P_h) of a historical time step h, with the entity partition Z_t of the current time step t and the entity partition Z_h of the historical time step h, where BG_t corresponds to Z_t and BG_h corresponds to Z_h, the history cost of the two time steps h and t is defined by formula (8):
where n is the number of snapshot entities, k is the number of clusters, p_ij^t is an element of the joint probability matrix P_t between snapshot entities and clusters at time step t, and p_ij^h is an element of the joint probability matrix P_h between snapshot entities and clusters at historical time step h;
Fourth, from the analysis above, the joint probability matrix P_t (or P_h) is a sparse matrix, i.e. it contains zero elements, and the standard KL divergence does not support p_ij^t or p_ij^h being 0. Therefore the joint probability matrix P_t (or P_h) is smoothed as follows: the constant ε = e^(-12) is added to each element p_ij^t (or p_ij^h), and the elements are then renormalized. The probability matrices after smoothing are denoted P̃_t and P̃_h respectively, and formula (8) is modified into formula (9):
where n is the number of snapshot entities, k is the number of clusters, p̃_ij^t is an element of the smoothed joint probability matrix P̃_t between entities and clusters at time step t, and p̃_ij^h is an element of the smoothed joint probability matrix P̃_h between entities and clusters at historical time step h;
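The smoothing-then-KL computation can be sketched as follows; the divergence direction KL(P̃_t || P̃_h) is an assumption consistent with the text, since the formula itself is only referenced here by number.

```python
import numpy as np

def smooth(P, eps=np.exp(-12)):
    """Add the constant eps = e^(-12) to every element and renormalize,
    so the standard KL divergence is well defined on sparse matrices."""
    Q = np.asarray(P, dtype=float) + eps
    return Q / Q.sum()

def history_cost(P_t, P_h):
    """Formula (9), reconstructed: KL divergence between the smoothed
    joint probability matrices of time steps t and h."""
    Qt, Qh = smooth(P_t), smooth(P_h)
    return float(np.sum(Qt * np.log(Qt / Qh)))
```

Identical partitions yield a history cost near zero; disagreeing partitions yield a large one.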
Fifth, substituting formulas (7) and (9) into formula (1), the total objective cost function becomes formula (10):
where 0 ≤ α ≤ 1 is the weight factor of the snapshot cost, k is the number of clusters (of the entity partition) at the current time step t, c_p^t is the p-th element of the entity partition Z_t, w_ii' and w_ij are elements of W_t in the snapshot graph G_t = (V_t, E_t, W_t), o_i^t and o_j^t are snapshot entities of G_t, n is the number of snapshot entities in V_t of the bipartite graph BG_t = (V_t, C_t, F_t, P_t), and p̃_ij^t and p̃_ij^h are elements of the smoothed joint probability matrices P̃_t and P̃_h.
Embodiment 3:
Different from Embodiment 1 or 2, in the method of this embodiment the process in Step 2 of designing the similarity measure for data space entities is as follows.
On the one hand, a snapshot entity itself carries rich information, such as structured attribute information and unstructured content information. On the other hand, in a data space environment an entity may occur many times, and this historical occurrence-pattern information also helps to judge whether two entities are similar. Therefore, for data space entities, i.e. snapshot entities, similarity is measured from both the entity's own information and the historical occurrence patterns between entities; that is, the similarity function of snapshot entities consists of two parts, self similarity and historical similarity, and is defined by formula (11):
where 0 ≤ β ≤ 1 is the weight of the self similarity, o_i^t and o_j^t are snapshot entities at the current time step t, sim_self(o_i^t, o_j^t) is the self similarity between snapshot entities o_i^t and o_j^t, and sim_hist(o_i^t, o_j^t) is the historical similarity between them;
Intuitively, entities of the same class share a higher proportion of identical or similar attribute names, while entities of different classes share a lower proportion. In addition, some entities contain only unstructured information, and two entities with similar content are, to some extent, likely to belong to the same class. Accordingly, based on the structured feature information corresponding to the attribute features of a snapshot entity and the unstructured feature information corresponding to its content features, the self similarity between snapshot entities is defined by formula (12):
where 0 ≤ λ ≤ 1 is the weight of the attribute-feature similarity, and sim_attr and sim_content are the attribute-feature similarity and the content-feature similarity respectively, computed over the attribute features and content features of the two snapshot entities;
If the pattern of occurrence counts of two snapshot entities over past history steps is relatively consistent, then for the two snapshot entities at the current time step this historical-pattern correlation indicates that they are similar. The classical Pearson correlation coefficient is therefore used to measure historical similarity, as in formula (13):
where o_i^t and o_j^t are the snapshot entities at the current time step t, the per-step quantities are the numbers of occurrences of o_i^t and o_j^t at each historical time step h, and the corresponding means are their average occurrence counts over all historical time steps;
Substituting formulas (12) and (13) into formula (11), the similarity function of snapshot entities is rewritten as formula (14), in which o_i^t and o_j^t are the snapshot entities at the current time step t, the occurrence counts at each historical time step h and their averages over all historical time steps are as in formula (13), the attribute features and content features are as in formula (12), 0 ≤ β ≤ 1 is the weight of the self similarity, and 0 ≤ λ ≤ 1 is the weight of the attribute-feature similarity.
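Formulas (11)-(14) combine into a single similarity function. A minimal sketch, assuming Jaccard overlap for the attribute-feature and content-feature similarities (which this passage does not fix), with the Pearson correlation over historical occurrence counts as in formula (13):

```python
import numpy as np

def jaccard(a, b):
    """Hypothetical stand-in for sim_attr / sim_content: set overlap."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pearson(x, y):
    """Formula (13): Pearson correlation of historical occurrence counts."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def entity_similarity(attrs_i, attrs_j, content_i, content_j,
                      hist_i, hist_j, beta=0.5, lam=0.5):
    """Formula (14), reconstructed:
    sim = beta * (lam * sim_attr + (1 - lam) * sim_content)
        + (1 - beta) * sim_hist."""
    self_sim = lam * jaccard(attrs_i, attrs_j) + \
        (1 - lam) * jaccard(content_i, content_j)
    return beta * self_sim + (1 - beta) * pearson(hist_i, hist_j)
```

Two entities with identical attribute names, identical content terms, and proportional occurrence histories score 1; unrelated entities with anti-correlated histories score below 0.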
Embodiment 4:
Different from Embodiment 3, in the method of this embodiment the process in Step 3 of proposing the evolutionary K-Means clustering algorithm and solving the initial-point selection problem and the classification of evolving data space entities is as follows.
First, some related definitions are given so that the initial center points can be selected well; then the evolutionary K-Means clustering algorithm is described in detail.
It is well known that the quality of the initial points greatly affects the quality of the K-Means clustering, and the traditional random-selection method easily causes problems such as slow convergence. Therefore, before solving the initial-point selection problem, the following definitions are made:
η-neighbors at time t: given a snapshot graph G_t = (V_t, E_t, W_t) and a parameter 0 < η ≤ 1, for any snapshot entity o_i^t, the η-neighbors of o_i^t at time t are formally defined as the snapshot entities whose similarity to o_i^t is at least η, where |V_t| is the number of vertices in the snapshot graph G_t and w_ij is an element of W_t;
Similarity density at time t: given a snapshot graph G_t = (V_t, E_t, W_t) and the η-neighbors at time t, for any snapshot entity o_i^t, the similarity density of o_i^t at time t is defined over its η-neighbors.
From the definitions above: the higher the similarity density of a snapshot entity o_i^t, the more η-neighbors it has and the higher its average similarity to the other snapshot entities among them, and the higher the probability that it serves as a cluster center. The definition of similarity density at time t avoids choosing, as the cluster centers of K-Means, snapshot entities in low-density regions, noise data such as isolated snapshot entities, or snapshot entities at cluster edges;
Second, the selection principle for the first initial center point is determined: choose the snapshot entity with the maximum similarity density.
The selection principle for each initial center point other than the first is: exclude the snapshot entities that are η-neighbors of the already-selected initial center points; prefer a low average similarity to all selected initial center points; and prefer a high similarity density. For example, a candidate whose average similarity to the selected center points is 0.3 and whose similarity density is 10 would score well. This principle is formalized as formula (15):
where 1 ≤ l ≤ j-1 indexes the already-selected initial center points, the union of the η-neighbors of all selected initial center points is excluded, the similarity between a snapshot entity o_i^t and a selected initial center point enters the denominator, and the similarity density of o_i^t at time t enters the numerator; the coefficient 1 is added to prevent the denominator from being zero;
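The η-neighbor, similarity-density, and formula-(15) selection steps can be sketched as follows. The assumptions are hedged: η-neighbors are taken as entities with similarity at least η, density as the summed similarity to those neighbors (which grows with both their count and their average similarity), and the formula-(15) score as density divided by 1 plus the summed similarity to the already-chosen centers.

```python
def eta_neighbors(W, i, eta):
    """Indices j != i whose similarity w_ij to entity i is at least eta."""
    return [j for j in range(len(W)) if j != i and W[i][j] >= eta]

def similarity_density(W, i, eta):
    """Summed similarity of entity i to its eta-neighbors."""
    return sum(W[i][j] for j in eta_neighbors(W, i, eta))

def pick_initial_centers(W, k, eta=0.5):
    """First center: maximum similarity density; remaining centers:
    exclude eta-neighbors of chosen centers, then maximize
    density / (1 + similarity to chosen centers), cf. formula (15)."""
    n = len(W)
    centers = [max(range(n), key=lambda i: similarity_density(W, i, eta))]
    while len(centers) < k:
        excluded = set(centers)
        for c in centers:
            excluded.update(eta_neighbors(W, c, eta))
        candidates = [i for i in range(n) if i not in excluded]
        if not candidates:  # fall back if exclusion empties the pool
            candidates = [i for i in range(n) if i not in centers]
        score = lambda i: similarity_density(W, i, eta) / (
            1 + sum(W[i][c] for c in centers))
        centers.append(max(candidates, key=score))
    return centers
```

On a two-block similarity matrix this picks one seed from each block, which is the behavior the heuristic is designed to produce.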
The basic thought of the K-Means clustering algorithm the 3rd, performing evolution is as follows: in the institute to current time step
Having in time step, circulation performs K-Means clustering algorithm;Wherein, each time step performs K-Means clustering algorithm
Process is, selects initial center point based on similarity density and formula (15), is then iteratively performed following operation:
1) snapshot entity is assigned to bunch central point that similarity is the highest,
2) bunch central point is updated, until it reaches the condition of convergence that in formula (10), target cost is minimum;
The detailed procedure of the evolutionary K-Means clustering algorithm is as follows:
Input: a sequence of snapshot entity sets for different time steps, O = {O1, O2, …, Oh, …, Ot}, and the corresponding set of cluster counts K = {k1, k2, …, kh, …, kt};
Output: the clustering result set of all time steps, C = {C1, C2, …, Ch, …, Ct}; where h denotes the time step, h = 1, 2, …, t;
(1) for each time step h, loop over the following:
(2) use formula (14) to compute the similarity matrix Wh of the snapshot entity set Oh at the current time step h, and build the corresponding snapshot graph Gh = (Vh, Eh, Wh);
(3) initialize the cluster center point set to empty;
(4) select the initial center points: first choose the snapshot entity with the highest similarity density as the first initial center point, then select the remaining initial center points according to formula (15), where j runs in ascending order from 1 to kh and the subscript h denotes the time step;
(5) loop: assign each snapshot entity in Oh to the cluster whose center is most similar to it; update the center point of each cluster and record the clustering result Ch; repeat until the convergence condition of minimizing the objective cost function in formula (10) is met;
cumulatively update the clustering results of the different time steps;
and return the clustering results C of all time steps.
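The per-time-step loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the similarity matrix `W` is assumed precomputed (formula (14)), the density-based seeding is a simplified stand-in for formula (15), and cluster centers are represented by medoid entities rather than means.

```python
def cluster_time_step(W, k, max_iter=50):
    """One pass of steps (2)-(5) for a single time step.

    W -- n x n similarity matrix (list of lists) of snapshot graph G_h
    k -- cluster count k_h
    Returns assign, with assign[i] = cluster index of snapshot entity i.
    """
    n = len(W)
    # (4) seeding: total similarity as a simplified similarity density;
    # remaining seeds prefer dense entities dissimilar to chosen centers,
    # in the spirit of formula (15).
    density = [sum(row) for row in W]
    centers = [max(range(n), key=lambda i: density[i])]
    while len(centers) < k:
        def score(i):
            avg_sim = sum(W[i][c] for c in centers) / len(centers)
            return density[i] / (1.0 + avg_sim)
        candidates = [i for i in range(n) if i not in centers]
        centers.append(max(candidates, key=score))

    assign = [0] * n
    for it in range(max_iter):
        # 1) assign each snapshot entity to the most similar center
        new_assign = [max(range(k), key=lambda c: W[i][centers[c]])
                      for i in range(n)]
        if it > 0 and new_assign == assign:
            break                      # converged
        assign = new_assign
        # 2) update each cluster center to its most central member (medoid)
        for c in range(k):
            members = [i for i in range(n) if assign[i] == c]
            if members:
                centers[c] = max(members,
                                 key=lambda i: sum(W[i][j] for j in members))
    return assign

def evolutionary_kmeans(sim_matrices, cluster_counts):
    """Step (1): loop over all time steps and collect C_1 .. C_t."""
    return [cluster_time_step(W, k)
            for W, k in zip(sim_matrices, cluster_counts)]
```

A convergence test against formula (10) would replace the simple fixed-point check here; the sketch stops when assignments stabilize.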
Specific embodiment five:
Unlike specific embodiments one, two or four, in the entity classification method for data spaces of the present embodiment, in the case described in step four where the number of clusters changes over time or snapshot entities are added or removed over time, the process of extending the evolutionary K-Means clustering framework of step one is as follows.
In the evolutionary K-Means clustering framework described in specific embodiments one to four, the number of clusters is assumed not to change over time, and the set of snapshot entities to be clustered is identical in all time steps, i.e. no snapshot entities are added or removed. In practical applications, however, these two qualifying assumptions are too strict. This subsection therefore extends the evolutionary K-Means clustering framework proposed above to handle the following situations:
First, when the number of clusters changes over time:
When the cluster count kh at historical time step h is smaller than the cluster count kt at the current time step t, it suffices to append the corresponding columns of zeros to the joint probability matrix Ph. This is because, for a newly added cluster, the joint probability of a snapshot entity co-occurring with that cluster at historical time step h is zero. After the extension, the extended Ph and Pt are both n × kt joint probability matrices, and formula (10) is accordingly revised as:
When the cluster count kh at historical time step h is larger than the cluster count kt at the current time step t, it suffices to append the corresponding columns of zeros to the joint probability matrix Pt. This is because, for a deleted cluster, the joint probability of a snapshot entity co-occurring with that cluster at the current time step t is zero. After the extension, Ph and the extended Pt are both n × kh joint probability matrices, and formula (10) is accordingly revised as:
Second, when snapshot entities are added or removed over time:
Assume that at historical time step h, Ph is an nh × k joint probability matrix; at the current time step t, Pt is an nt × k joint probability matrix; and n0 snapshot entities occur in both time steps h and t. When snapshot entities present at historical time step h have been removed, then, for time step t, the joint probability of those removed snapshot entities co-occurring with the current clusters is 0, so the corresponding rows are appended to Pt, yielding the extended matrix. Conversely, when snapshot entities are newly added at the current time step t, then, for historical time step h, the joint probability of those newly added snapshot entities co-occurring with the historical clusters is 0, so the corresponding rows are appended to Ph, yielding the extended matrix. After the extension, both extended matrices are (nh + nt − n0) × k joint probability matrices, and formula (10) is accordingly revised as:
In the formula, the overline symbol denotes the matrix obtained after processing a matrix X with the smoothing method of formula (9), and the corresponding entries are the elements of that smoothed matrix.
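The two extensions above are pure zero-padding of the joint probability matrices. A minimal sketch, assuming matrices stored as lists of lists; how the appended rows are ordered against the union of entities across time steps is left to the caller:

```python
def pad_columns(P, k_target):
    """Append zero columns so P has k_target columns: used when the
    historical cluster count k_h differs from the current k_t.  A zero
    column encodes that no snapshot entity co-occurs with the added
    (or deleted) cluster at that time step."""
    return [row + [0.0] * (k_target - len(row)) for row in P]

def pad_rows(P, n_target):
    """Append zero rows so P has n_target rows: used when snapshot
    entities are added or removed between time steps h and t.  A zero
    row encodes that the entity does not co-occur with any cluster
    at that time step."""
    k = len(P[0])
    return P + [[0.0] * k for _ in range(n_target - len(P))]
```

After padding, both matrices have identical shape and can be compared with the KL-divergence history cost of formula (10).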
Specific embodiment six:
Unlike specific embodiment five, in the entity classification method for data spaces of the present embodiment, the snapshot entity is defined as follows: at time t, the snapshot entity can be formalized as ot = (Attr, Cont). Here, Attr denotes the structured feature information of the snapshot entity ot, such as the set of attribute names in a tuple, with Attr = {a1, a2, …, an}; Cont denotes the unstructured feature information of the snapshot entity ot, such as the set of keywords in its content, with Cont = {keyword1, keyword2, …, keywordm}; n and m denote the number of elements of the sets Attr and Cont, respectively.
Note that, at different times, the structured and unstructured feature information of an entity o may change. In reality this can have many causes, for instance an increase in the information sources from which the entity is extracted may change the entity's information. This chapter, however, is not concerned with the problem of extracting entity information at any given time. All snapshot entities at the current time step t are denoted as a set, and a schematic diagram of the snapshot graphs at adjacent time steps is shown in Figure 2.
Specific embodiment seven:
Unlike specific embodiments one, two, four or six, in the entity classification method for data spaces of the present embodiment, at time t a paper-class snapshot entity ot contains the attributes title, author and size, and also contains the unstructured content information data space, entity and classification; then ot = ({title, author, size}, {data space, entity, classification}).
Specific embodiment eight:
Unlike specific embodiment seven, in the entity classification method for data spaces of the present embodiment, the snapshot graph is defined as follows: at time step t, the snapshot graph can be formalized as Gt = (Vt, Et, Wt), where, in the graph Gt, each vertex represents a snapshot entity, each edge indicates that two snapshot entities have a similarity, and each edge weight is the similarity score between the two snapshot entities at time step t. A schematic diagram of the snapshot graphs at adjacent time steps is shown in Figure 3.
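Building Gt = (Vt, Et, Wt) from a list of snapshot entities can be sketched as below. The similarity function is supplied by the caller (it stands in for formula (14), whose exact form is defined elsewhere in the text); the edge threshold `eps` is an illustrative assumption:

```python
def build_snapshot_graph(entities, similarity, eps=0.0):
    """Build G_t = (V_t, E_t, W_t) from a list of snapshot entities.

    similarity -- function returning the similarity score of two
                  snapshot entities at time t (caller-supplied)
    Returns (V, E, W): vertex index list, weighted edge list, and the
    symmetric similarity matrix W_t.
    """
    n = len(entities)
    V = list(range(n))
    W = [[0.0] * n for _ in range(n)]
    E = []
    for i in range(n):
        W[i][i] = 1.0                    # an entity is fully similar to itself
        for j in range(i + 1, n):
            w = similarity(entities[i], entities[j])
            W[i][j] = W[j][i] = w
            if w > eps:                  # keep only edges above the threshold
                E.append((i, j, w))
    return V, E, W
```

One graph is built per time step; the sequence G1, …, Gt then drives the evolutionary clustering.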
Experiments and analysis of results:
Experimental setup:
The experiments use the DBLP data released in March 2015 as the basic dataset; the download address is http://dblp.uni-trier.de/. The extracted entity classes include paper, doctoral thesis, author, conference, journal, and university institution. The following points should be noted: (1) paper entities come from inproceedings records or from Article records whose key has the prefix "journals"; doctoral thesis entities come from phdthesis records; author entities come from WWW records or author tags; conference entities come from the booktitle tag of inproceedings records whose key has the prefix "conf"; journal entities come from the journal tag or from the booktitle tag of inproceedings records whose key has the prefix "journals"; university institutions come from the school tag. (2) Only entities produced in the time span from 2005 to 2014 are selected, with a time step of 1 year; the total number of entities after extraction is about 3M. (3) To simulate the evolutionary properties of data space entities, this chapter randomly chooses 20% of the entities in the entity set of each time step and then randomly removes some of their attribute information or content information. (4) To simulate the pay-as-you-go characteristic of data spaces, no class labels are assigned to the collected entities, i.e. there is no previously known classification information (ground truth). (5) To test the scalability of the proposed method, the number of entities of the different classes is repeatedly reduced in equal proportion, thereby generating DBLP datasets of sizes 2.5M, 2M, 1.5M and 1M.
The experimental environment is as follows: the PC host uses an Intel(R) Core(TM) i5-4570 CPU at 3.20GHz, with 4G of memory and a 1TB hard disk; the operating system is Windows 7 (64-bit), and all algorithms in the experiments are implemented in Java. Unless stated otherwise, in all experiments the default value of the parameter k in the evolutionary K-Means algorithm of this chapter is 6, and the dataset size is 3M.
Effectiveness and scalability evaluation
(1) Choice of parameters
The three groups of experiments below test the influence of different parameter values on the clustering effect, so as to determine the optimal values of the parameters λ, β and α respectively.
Experiment 1 evaluates the influence of the choice of the weight λ in the entity similarity function on the clustering effect. Since this experiment is only concerned with the influence of changes in the weight λ, the parameters are set to α = 1 and β = 1; in addition, the datasets of all time steps are aggregated, the experiment is rerun 50 times on this basis, and the average snapshot cost corresponding to each λ is recorded. In Figure 4, the abscissa represents the different values of the weight λ, and the ordinate represents the snapshot cost (see formula (7)). As can be seen from Figure 4, the snapshot cost gradually decreases as λ increases; when λ reaches 0.6, the snapshot cost is minimal (0.5 at this point), i.e. the clustering effect is optimal; afterwards, the snapshot cost gradually increases again. This shows that, for the entity self-similarity measure, the attribute feature information of an entity plays a more important role than its content information. This is mainly because, for entities of similar classes, the probability that their attribute features are similar is larger than the probability that their content is similar. The snapshot cost under different λ is shown in Figure 4.
Experiment 2 evaluates the influence of the choice of the weight β in the entity similarity function on the clustering effect. Since this experiment is only concerned with the influence of changes in the weight β, and experiment 1 showed the best results at λ = 0.6, this experiment sets the parameters α = 1 and λ = 0.6, then aggregates the datasets of all time steps, runs the experiment 50 times on this basis, and records the average snapshot cost corresponding to each β. In Figure 5, the abscissa represents the different values of the weight β, and the ordinate represents the snapshot cost (see formula (7)). As can be seen from Figure 5, the overall trend of the snapshot cost is to decrease as β increases; when β reaches 0.75, the snapshot cost is minimal (0.36 at this point), i.e. the clustering effect is optimal; afterwards, the snapshot cost gradually increases again. This shows that, for the entity similarity measure, not only the self-information of an entity but also its historical occurrence pattern information should be taken into account. Moreover, the self-information of an entity affects the quality of the entity clustering (classification) effect more than its historical occurrence pattern information does.
Experiment 3 evaluates the influence of the choice of the weight α in the objective cost function on the clustering effect. Following the conclusions of the first two experiments, this experiment sets the parameters β = 0.75 and λ = 0.6, then reruns the experiment 50 times on the DBLP datasets of the consecutive time steps (1-10), and records the corresponding mean values for all time steps. In Figure 6, the abscissa represents the time step and the ordinate represents the snapshot cost (see formula (7)). As can be seen from Figure 6, as time evolves (i.e. the time step increases), the snapshot cost at first decreases sharply and then tends to become almost stable. This is mainly because, during evolution, the self-information and historical pattern information of the entities gradually become richer until they converge, so that the clustering effect strengthens until it reaches a relatively steady state. It can further be observed that the snapshot cost gradually decreases as the value of α increases. This is mainly because the larger the value of α, the more the evolutionary clustering algorithm emphasizes the quality of the current clustering result, and hence the smaller the snapshot cost. In Figure 7, the abscissa represents the time step and the ordinate represents the history cost (see formula (9)). As can be seen from Figure 7, the history cost gradually decreases as the value of α decreases. This is mainly because the smaller the value of α, the more the evolutionary clustering algorithm emphasizes the smoothness of the historical clustering results. In Figure 8, the abscissa represents the value of the weight α and the ordinate represents the total cost (see formula (10)). It can be observed from the figure that when α is 0.9 the total cost is minimal, i.e. the clustering effect is then best. This shows that the evolutionary K-Means clustering algorithm proposed in this chapter can trade off well between the snapshot cost and the history cost. The history cost under different α is shown in Figures 6 to 8.
(2) Effect comparison of different methods
Experiment 4 compares the proposed method (Similarity Density-Based Evolutionary K-Means Clustering, SD-EKM) with other baseline methods in terms of clustering effect. The following baseline methods are designed for this experiment: (1) a naive method (Evolutionary K-Means Clustering, N-EKM), i.e. evolutionary K-Means with randomly selected initial points. It is another version of the proposed SD-EKM; the difference lies in the way the initial points are selected. (2) The classical PCM-EKM method, an evolutionary K-Means clustering method based on preserving cluster membership proposed by Yun Chi et al. Following the experimental conclusions of Yun Chi, the parameter α is likewise set to 0.9, but a small change is made in this experiment: the original similarity measure is replaced with the entity similarity measure proposed in this chapter. (3) The IND method, in which the data of each time step are run through the K-Means algorithm independently, without considering historical time steps. Since this dataset has no previously known entity classification information (that is, no theoretical ground-truth classification), and the cost functions of these methods provide no unified measure, the evaluation metric of this experiment uses the mutual information measure as a reference (interested readers may refer to Xu et al.[136] for the proposed definition of mutual information between two partitions); in essence, the higher the mutual information between two partitions, the larger the probability that they are similar. All experiments are rerun 50 times on the DBLP datasets of the consecutive time steps (1-10), and the mutual information values corresponding to all time steps are recorded. In Figure 9, the abscissa represents the time step and the ordinate represents the mutual information. As can be seen from Figure 9: (1) the mutual information of the SD-EKM, N-EKM and PCM-EKM methods is significantly better than that of the IND method, and as time evolves the mutual information values of the former remain relatively stable. This is mainly because the first three methods adopt the idea of evolutionary clustering and take the entity information of historical time steps into account. (2) The mutual information of the SD-EKM and N-EKM methods is relatively better than that of the PCM-EKM method, and their stability over consecutive time steps is also relatively better; this is mainly because the PCM-EKM method only considers the historical information of the previous time step, whereas the methods of the present invention, SD-EKM and N-EKM, consider the historical information of all historical time steps. (3) The SD-EKM method is better than N-EKM. This is mainly because the SD-EKM method of the present invention uses the similarity density criterion, which largely avoids selecting noise data (such as snapshot entities in low-density regions) as initial center points, so that the clustering result is better. A schematic comparison of the effects of the different methods is shown in Figure 9.
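The reference metric used above can be sketched directly from its definition. This is a plain (unnormalized, natural-log) mutual information between two partitions; the exact normalization used by Xu et al.[136] may differ, so treat it as an illustration of the idea rather than the paper's formula:

```python
from collections import Counter
from math import log

def mutual_information(part_a, part_b):
    """Mutual information between two partitions of the same entities.

    part_a, part_b -- cluster label per entity (same length).
    Higher values mean the two partitions agree more closely.
    """
    n = len(part_a)
    pa = Counter(part_a)                 # marginal counts of partition A
    pb = Counter(part_b)                 # marginal counts of partition B
    pab = Counter(zip(part_a, part_b))   # joint counts
    mi = 0.0
    for (a, b), nab in pab.items():
        p_ab = nab / n
        mi += p_ab * log(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi
```

Two identical partitions (up to relabeling) attain the maximal value; independent partitions give a value near zero.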
(3) Scalability
Experiment 5 tests the scalability of the proposed SD-EKM method with respect to execution time. This experiment reruns SD-EKM 50 times on datasets of different sizes and records, respectively, the average running time of each iteration and the average number of iterations executed until the method converges. In Figure 10, the abscissa represents the dataset size and the ordinate represents the average running time of each iteration. In Figure 11, the abscissa represents the dataset size and the ordinate represents the average number of iterations at convergence of the SD-EKM algorithm. From Figures 10 and 11 it can be seen that (1) the average running time of each iteration of the proposed SD-EKM is almost linear in the dataset size; (2) the number of iterations required for the algorithm to converge is almost insensitive to the dataset size, at about 650 iterations. These two groups of experiments show that the average running time of the proposed SD-EKM is linear in the dataset size, i.e. it has good scalability. The average running time of each iteration is shown in Figure 10, and a schematic diagram of the average number of iterations at convergence is shown in Figure 11.
The present invention may also have various other embodiments; without departing from the spirit and essence of the present invention, those skilled in the art may make various corresponding changes and deformations according to the present invention, but these corresponding changes and deformations shall all fall within the protection scope of the appended claims of the present invention.
Claims (8)
1. An entity classification method for data spaces, characterised in that the method is realised by the following steps:
step one, for evolving data space entities, proposing an evolutionary K-Means clustering framework, i.e. defining an objective cost function based on silhouette values and KL-divergence;
step two, designing a data space entity similarity measurement method;
step three, proposing an evolutionary K-Means clustering algorithm, and solving the initial point selection problem and the classification problem of evolving data space entities;
step four, in the case where the number of clusters changes over time or snapshot entities are added or removed over time, extending the evolutionary K-Means clustering framework of step one.
2. The entity classification method for data spaces according to claim 1, characterised in that the process described in step one, of proposing the evolutionary K-Means clustering framework, i.e. defining the objective cost function based on silhouette values and KL-divergence, is as follows:
step 1.1, defining the total objective cost function by way of a linear combination:
the cost function consists of two parts, the snapshot cost of the current time step and the history cost of the historical time steps, denoted Costsnapshot and Costtemporal respectively; the total objective cost function, which is used to evaluate the quality of the K-Means clustering of evolving entities and comprises these two parts, is defined by way of a linear combination according to the following formula:
in the formula, 0 ≤ α ≤ 1 represents the weight factor of the snapshot cost; the snapshot-cost term refers to the current time step t and the history-cost term to the historical time steps h; the decay factor e^(−(t−h)) indicates that the closer a historical time step h is to the current time step t, the heavier the weight of its history cost and the smaller its allowed deviation, i.e. the closer to the current time step t, the better the temporal smoothness of the clustering structure at historical time step h;
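The linear combination of step 1.1 can be sketched as follows. The decay factor e^(−(t−h)) and the summation over the historical time steps are a reading of the garbled source text, so this is a hedged sketch of formula (1), not its verbatim form:

```python
from math import exp

def total_cost(t, snapshot_cost_t, history_costs, alpha):
    """Sketch of formula (1): alpha * Cost_snapshot(t) plus
    (1 - alpha) * the history costs of the historical time steps h < t,
    each weighted by e^(-(t-h)) so that steps closer to t weigh more.

    history_costs -- mapping {h: Cost_temporal(h, t)}
    alpha         -- weight factor of the snapshot cost, 0 <= alpha <= 1
    """
    weighted_history = sum(exp(-(t - h)) * cost
                           for h, cost in history_costs.items())
    return alpha * snapshot_cost_t + (1 - alpha) * weighted_history
```

With alpha near 1 the current clustering quality dominates; with alpha near 0 the smoothness against history dominates, matching the trade-off discussed in the experiments.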
step 1.2, carrying out the silhouette-value-based measurement of the snapshot cost:
let the snapshot graph of the current time step t be Gt = (Vt, Et, Wt), where |Vt| = n and Wt is the similarity matrix between snapshot entities; the entity partition obtained on the basis of this snapshot graph consists of k clusters; the silhouette value criterion is used to measure the quality of the K-Means clustering result, wherein the silhouette value, also called the silhouette coefficient, is a cluster evaluation method that refers only to the data itself, without reference to a gold standard; in this cluster evaluation method each cluster is represented by one silhouette, the silhouette reflects which objects lie within a cluster and which objects are far from it, the method thereby reflects the two influence factors of cohesion and separation, and the larger the silhouette value, the better the clustering effect; the snapshot cost is defined as:
in the formula, k represents the number of clusters at the current time step t, and each cluster has a mean silhouette value; the mean silhouette value is inversely related to the snapshot cost;
step 1.3, given that each cluster comprises a group of snapshot entities, the mean silhouette value of each cluster is defined as the mean of the silhouette values of all snapshot entities in the cluster, specifically:
in the formula, the terms denote, respectively, a snapshot entity in the cluster, the number of snapshot entities in the cluster, and the silhouette value of a snapshot entity, whose measurement formula is expressed as:
wherein the first term represents the average similarity between a snapshot entity and the other snapshot entities in the cluster to which it belongs, and the second term represents the maximum average similarity between the snapshot entity and all snapshot entities in the other clusters; the larger the silhouette value, the more the average intra-cluster similarity of the snapshot entity exceeds its average inter-cluster similarity;
step 1.4, based on the physical meaning of the silhouette value measurement formula (4) for a snapshot entity described in step 1.3, the intra-cluster average similarity is defined by the formula:
and the inter-cluster maximum average similarity is defined by the formula:
in the formulas, the entities range over the snapshot entities of the respective clusters, wii′ is the similarity between snapshot entities in the same cluster, and wij is the similarity between snapshot entities in different clusters;
step 1.5, substituting formulas (3) to (6) into formula (2), the snapshot cost is rewritten as:
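The silhouette computation of steps 1.2-1.5 can be sketched over a similarity matrix. Since formula (7) itself is not recoverable from the text, the sketch uses the common similarity-based silhouette s(i) = (a(i) − b(i)) / max(a(i), b(i)) and takes cost = 1 − mean silhouette as an illustrative stand-in for "inversely related":

```python
def silhouette_snapshot_cost(W, assign, k):
    """Silhouette-style snapshot cost for a similarity matrix W and a
    cluster assignment.

    a(i): average similarity of entity i to its own cluster (cohesion);
    b(i): maximum average similarity to any other cluster (separation);
    the mean silhouette is inversely related to the snapshot cost.
    """
    n = len(W)
    clusters = [[i for i in range(n) if assign[i] == c] for c in range(k)]
    s_values = []
    for i in range(n):
        own = [j for j in clusters[assign[i]] if j != i]
        a = sum(W[i][j] for j in own) / len(own) if own else 0.0
        others = [sum(W[i][j] for j in m) / len(m)
                  for c, m in enumerate(clusters) if c != assign[i] and m]
        b = max(others) if others else 0.0
        denom = max(a, b)
        s_values.append((a - b) / denom if denom > 0 else 0.0)
    return 1.0 - sum(s_values) / n       # lower cost = better clustering
```

A well-separated partition yields a low cost, a mixed one a high cost, in line with the criterion's use in the experiments.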
step 1.6, carrying out the KL-divergence-based measurement of the history cost:
first, set up the snapshot graphs G1, G2, …, Gh, …, Gt;
at the current time step t, the entity partition based on the corresponding snapshot graph Gt is denoted Zt;
at historical time step h, the entity partition based on the corresponding historical snapshot graph Gh is denoted Zh;
second, define a measure for comparing two clusterings: a bipartite graph is used to represent the relation between entities and clusters, converting the entity partition Zt problem into a joint probability distribution problem on the bipartite graph:
let BGt = (Vt, Ct, Ft, Pt) be a bipartite graph corresponding to the snapshot graph Gt = (Vt, Et, Wt); wherein Vt is the set of snapshot entities; Ct is the set of clusters; Ft is the set of edges, the two endpoints of each edge coming from the sets Vt and Ct respectively; Pt is an n × k joint probability matrix corresponding to the edge weight matrix of the bipartite graph; it is computed with the joint probability formula, i.e. the joint probability between an entity and a cluster is determined as the product of the probability that the cluster occurs and the probability that the entity occurs given the cluster; if an entity belongs to a cluster, then pij = (nj/n)·(1/nj) = 1/n, where nj and n are respectively the number of snapshot entities in the cluster and the number of all snapshot entities; otherwise pij = 0; for the joint probability matrix Pt, for any row i there exists exactly one column j such that pij is not 0;
third, the KL-divergence method is used for the measurement:
given the bipartite graph BGt = (Vt, Ct, Ft, Pt) of the current time step t and the bipartite graph BGh = (Vh, Ch, Fh, Ph) of historical time step h, together with the entity partition Zt of the current time step t and the entity partition Zh of historical time step h, wherein BGt corresponds to Zt and BGh corresponds to Zh, the history cost of the two time steps h and t is defined as follows:
wherein n is the quantity of snapshot entities, k is the number of clusters, the first elements are those of the joint probability matrix Pt between snapshot entities and clusters at time step t, and the second are those of the joint probability matrix Ph between snapshot entities and clusters at historical time step h;
fourth, the joint probability matrix Pt or Ph is smoothed as follows: a constant ε, with ε = e^(−12), is added to each element of Pt or Ph; the elements are then re-normalised, and the probability matrices after smoothing are denoted accordingly; formula (8) is then modified to:
wherein n is the quantity of snapshot entities, k is the number of clusters, and the elements are those of the smoothed joint probability matrices between entities and clusters at time step t and at historical time step h, respectively;
fifth, substituting formula (7) and formula (8) into formula (1), the target total cost function is equivalent to:
wherein 0 ≤ α ≤ 1 is the weight factor of the snapshot cost, k represents the number of clusters at the current time step t, the partition elements are those of the entity partition Zt, wii′ or wij represents an element of Wt in the snapshot graph Gt = (Vt, Et, Wt) between snapshot entities of Gt, n represents the number of snapshot entities in Vt of the bipartite graph BGt = (Vt, Ct, Ft, Pt), and the remaining elements are those of the smoothed joint probability matrices at time steps t and h.
3. The entity classification method for data spaces according to claim 1 or claim 2, characterised in that the process of designing the data space entity similarity measurement method described in step two is as follows:
the similarity of data space entities, i.e. snapshot entities, is measured according to the self-information of the entities and the historical occurrence pattern information of the entities; that is, the similarity function of snapshot entities is composed of two parts, self-similarity and historical similarity, and its expression is defined as:
wherein 0 ≤ β ≤ 1 is the weight of the self-similarity, the entities are snapshot entities at the current time step t, and the two terms are, respectively, the self-similarity and the historical similarity between the snapshot entities;
based on the structured feature information corresponding to the attribute feature information of a snapshot entity and the unstructured feature information corresponding to its content feature information, the self-similarity between snapshot entities is defined as follows:
wherein 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity, the two terms are respectively the attribute feature similarity and the content feature similarity of the snapshot entities, and their arguments are respectively the attribute features and content features of the snapshot entities;
the classical Pearson correlation coefficient is used to measure the historical similarity, specifically:
wherein the entities are snapshot entities at the current time step t, the counts are respectively the numbers of times the snapshot entities occur at historical time step h, and the means are respectively the mean values of the occurrence counts of the snapshot entities over all historical time steps;
substituting formula (12) and formula (13) into formula (11), the similarity function of snapshot entities is rewritten as:
wherein the entities are snapshot entities at the current time step t; the counts are respectively the numbers of times the snapshot entities occur at historical time step h, and the means are the mean values of their occurrence counts over all historical time steps; the attribute features and content features are those of the snapshot entities; 0 ≤ β ≤ 1 is the weight of the self-similarity, and 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity.
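The composite similarity function of formulas (11)-(14) can be sketched as below. The claim does not specify the form of the attribute and content feature similarities, so Jaccard set overlap is used as an assumed stand-in; the Pearson part follows formula (13) over per-time-step occurrence counts:

```python
from math import sqrt

def jaccard(a, b):
    """Set-overlap similarity; an assumed stand-in for the unspecified
    attribute/content feature similarities of formula (12)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pearson(x, y):
    """Formula (13): Pearson correlation of two entities' occurrence
    counts over the historical time steps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) *
               sum((yi - my) ** 2 for yi in y))
    return num / den if den else 0.0

def entity_similarity(e1, e2, beta=0.75, lam=0.6):
    """Formula (14): beta-weighted self-similarity (itself a lam-weighted
    mix of attribute and content similarity) plus historical similarity.

    e1, e2 -- (attr_set, cont_set, history_counts) triples.
    """
    a1, c1, h1 = e1
    a2, c2, h2 = e2
    self_sim = lam * jaccard(a1, a2) + (1 - lam) * jaccard(c1, c2)
    return beta * self_sim + (1 - beta) * pearson(h1, h2)
```

The defaults beta = 0.75 and lam = 0.6 follow the optimal values reported in experiments 1 and 2.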
4. The entity classification method for data spaces according to claim 3, characterised in that the process, described in step three, of proposing the evolutionary K-Means clustering algorithm and solving the initial point selection problem and the classification problem of evolving data space entities is as follows:
first, the following related definitions are made:
definition of the η-neighbours at time t: given a snapshot graph Gt = (Vt, Et, Wt) and a parameter 0 < η ≤ 1, then, for any snapshot entity, its η-neighbours at time t are formally defined as:
wherein |Vt| is the number of vertices in the snapshot graph Gt, and the weights are elements of Wt;
definition of the similarity density at time t: given a snapshot graph Gt = (Vt, Et, Wt) and the η-neighbours at time t, then, for any snapshot entity, its similarity density at time t is formally defined as:
second, the selection principle of the first initial center point is determined to be the snapshot entity with the maximum similarity density;
the selection principle of the initial center points other than the first is determined as: exclude the snapshot entities in the η-neighbours of the initial center points already selected; have a low average similarity to all selected initial center points; and have a high similarity density as the current candidate point; this principle can be formalized as the following equation:
wherein 1 ≤ l ≤ j−1 indexes the initial center points already selected, the neighbourhood term is the union of the η-neighbours of all selected initial center points, the similarity term is the similarity between a snapshot entity and a selected initial center point, and the density term is the similarity density of the snapshot entity at time t; the constant 1 is added to the denominator to prevent it from being zero;
third, the basic idea of performing the evolutionary K-Means clustering algorithm is as follows: over all time steps up to the current time step, the K-Means clustering algorithm is executed in a loop; wherein, at each time step, the process of performing the K-Means clustering algorithm is to select initial center points based on similarity density and formula (15), and then to iteratively perform the following operations:
1) assign each snapshot entity to the cluster whose center point is most similar to it,
2) update the cluster center points, until the convergence condition of minimizing the objective cost in formula (10) is reached;
the detailed procedure of the evolutionary K-Means clustering algorithm is as follows:
Input: a sequence of snapshot entity sets for different time steps, O = {O1, O2, …, Oh, …, Ot}, and the corresponding set of cluster counts K = {k1, k2, …, kh, …, kt};
Output: the clustering result set of all time steps, C = {C1, C2, …, Ch, …, Ct}; where h denotes the time step, h = 1, 2, …, t;
(1) for each time step h, loop over the following:
(2) use formula (14) to compute the similarity matrix Wh of the snapshot entity set Oh at the current time step h, and build the corresponding snapshot graph Gh = (Vh, Eh, Wh);
(3) initialize the cluster center point set to empty;
(4) select the initial center points: first choose the snapshot entity with the highest similarity density as the first initial center point, then select the remaining initial center points according to formula (15), where j runs in ascending order from 1 to kh and the subscript h denotes the time step;
(5) loop: assign each snapshot entity in Oh to the cluster whose center is most similar to it; update the center point of each cluster and record the clustering result Ch; repeat until the convergence condition of minimizing the objective cost function in formula (10) is met;
cumulatively update the clustering results of the different time steps;
and return the clustering results C of all time steps.
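The η-neighbour and similarity-density definitions and the seeding principle of formula (15) can be sketched as follows. The exact inequality in the η-neighbour definition and the normalisation of the density are not recoverable from the text, so both are labelled as plausible readings:

```python
def eta_neighbours(W, i, eta):
    """η-neighbours of snapshot entity i at time t: vertices of the
    snapshot graph whose similarity to i is at least eta (one plausible
    reading of the definition)."""
    return {j for j in range(len(W)) if j != i and W[i][j] >= eta}

def similarity_density(W, i, eta):
    """Similarity density of entity i: total similarity to its
    η-neighbours, normalised by the number of vertices |V_t|
    (an assumed normalisation)."""
    return sum(W[i][j] for j in eta_neighbours(W, i, eta)) / len(W)

def pick_initial_centers(W, k, eta=0.5):
    """Initial-center selection in the spirit of formula (15): start with
    the densest entity, then repeatedly pick a dense entity outside the
    η-neighbours of the chosen centers that is dissimilar to them."""
    n = len(W)
    centers = [max(range(n), key=lambda i: similarity_density(W, i, eta))]
    while len(centers) < k:
        banned = set(centers)
        for c in centers:
            banned |= eta_neighbours(W, c, eta)
        pool = ([i for i in range(n) if i not in banned]
                or [i for i in range(n) if i not in centers])
        def score(i):
            avg = sum(W[i][c] for c in centers) / len(centers)
            return similarity_density(W, i, eta) / (1.0 + avg)
        centers.append(max(pool, key=score))
    return centers
```

Excluding the η-neighbours of chosen centers is what keeps seeds out of one dense region, which is the claimed advantage of SD-EKM over random seeding.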
5. according to the entity classification method in data-oriented space described in claim 1,2 or 4, it is characterised in that: described in step 4
In the case of number of clusters amount changes in time or snapshot entity adds in time or removes, spread step one develops
The process of K-Means cluster framework is,
First, when the number of clusters changes over time:
When the cluster number k_h at historical time step h is smaller than the cluster number k_t at current time step t, it suffices to add the corresponding columns, with entries 0, to the joint probability matrix P_h, thereby extending it. After the extension, the extended P_h and P_t are both n × k_t joint probability matrices; accordingly, formula (10) is revised as:
When the cluster number k_h at historical time step h is larger than the cluster number k_t at current time step t, the corresponding columns, with entries 0, are added to the joint probability matrix P_t, thereby extending it. After the extension, P_h and the extended P_t are both n × k_h joint probability matrices; accordingly, formula (10) is revised as:
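The two column-extension cases above both amount to zero-padding the narrower joint probability matrix so that the historical and current time steps compare matrices of equal width. A minimal sketch, assuming the matrices are plain NumPy arrays:

```python
import numpy as np

def pad_clusters(P_h, P_t):
    """Zero-pad columns so the historical matrix P_h (n x k_h) and the
    current matrix P_t (n x k_t) both become n x max(k_h, k_t): the
    joint probability of an entity with a cluster that does not exist
    at that time step is taken to be 0."""
    k_h, k_t = P_h.shape[1], P_t.shape[1]
    if k_h < k_t:
        P_h = np.hstack([P_h, np.zeros((P_h.shape[0], k_t - k_h))])
    elif k_t < k_h:
        P_t = np.hstack([P_t, np.zeros((P_t.shape[0], k_h - k_t))])
    return P_h, P_t
```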
Second, when snapshot entities are added or removed over time:
Assume that at historical time step h, P_h is an n_h × k joint probability matrix; that at current time step t, P_t is an n_t × k joint probability matrix; and that n_0 snapshot entities occur in both time steps h and t. When snapshot entities present at historical time step h have been removed, then for time step t the joint occurrence probability of those removed snapshot entities with the current clusters is 0, and the corresponding rows are added to P_t to obtain the extended P_t. Likewise, when snapshot entities are newly added at current time step t, then for historical time step h the joint occurrence probability of those newly added snapshot entities with the historical clusters is 0, and the corresponding rows are added to P_h to obtain the extended P_h. After the extension, the extended P_h and the extended P_t are both (n_h + n_t − n_0) × k joint probability matrices; accordingly, formula (10) is revised as:
In the formulas, a tilde over a matrix X denotes the matrix obtained after processing X with the smoothing method of formula (9), and p̃_ij denotes an element of such a smoothed matrix.
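The row-extension case can be sketched the same way: rows for entities absent from one time step are filled with zeros, so that both matrices cover the union of the (n_h + n_t − n_0) distinct entities. The entity-id bookkeeping below is illustrative and not part of the patent:

```python
import numpy as np

def align_entity_rows(ids_h, P_h, ids_t, P_t):
    """Extend P_h (n_h x k) and P_t (n_t x k) over the union of
    snapshot-entity ids, so both become (n_h + n_t - n_0) x k; an
    entity absent from a time step gets a zero row (joint occurrence
    probability 0 with every cluster of that step)."""
    union = sorted(set(ids_h) | set(ids_t))
    pos = {e: i for i, e in enumerate(union)}
    k = P_h.shape[1]

    def extend(ids, P):
        out = np.zeros((len(union), k))
        for e, row in zip(ids, P):
            out[pos[e]] = row
        return out

    return extend(ids_h, P_h), extend(ids_t, P_t)
```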
6. The entity classification method for the data space according to claim 5, characterized in that: the snapshot entity is defined as follows. At time t, a snapshot entity is formally expressed as o_t = (Attr, Cont), where Attr denotes the structured feature information of the snapshot entity o_t, with Attr = {a_1, a_2, …, a_n}, and Cont denotes the unstructured feature information of the snapshot entity o_t, with Cont = {keyword_1, keyword_2, …, keyword_m}; n and m denote the number of elements of the sets Attr and Cont respectively. All snapshot entities at the current time step t are denoted O_t.
7. The entity classification method for the data space according to claim 1, 2, 4 or 6, characterized in that: the snapshot entity is a paper; then at time t, the paper snapshot entity o_t comprises the attributes title, author and size, and also comprises the unstructured content information data space, entity and classification; thus o_t = ({title, author, size}, {data space, entity, classification}).
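For illustration, the (Attr, Cont) pair of claims 6 and 7 maps naturally onto a small data structure. A sketch in Python, with hypothetical field values (the title, author and size below are made up):

```python
from dataclasses import dataclass

@dataclass
class SnapshotEntity:
    """Snapshot entity o_t = (Attr, Cont): structured attributes plus
    unstructured keyword content, per claims 6 and 7."""
    attr: dict   # structured features a_1..a_n, here keyed by name
    cont: list   # unstructured keywords keyword_1..keyword_m

# The paper example from claim 7:
paper = SnapshotEntity(
    attr={"title": "An Example Paper", "author": "A. Author", "size": 12},
    cont=["data space", "entity", "classification"],
)
```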
8. The entity classification method for the data space according to claim 7, characterized in that: the snapshot graph is defined as follows. For a time step t, the snapshot graph at time t can be formally represented as G_t = (V_t, E_t, W_t), where in the graph G_t each vertex represents a snapshot entity, each edge indicates that the two snapshot entities it connects have a similarity, and the weight of each edge is the similarity score between those two snapshot entities at time step t.
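A snapshot graph of the shape described in claim 8 can be assembled from any pairwise similarity function; formula (14) is not reproduced in this text, so a caller-supplied scorer stands in below:

```python
import numpy as np

def build_snapshot_graph(entities, similarity, threshold=0.0):
    """Build G_t = (V_t, E_t, W_t): one vertex per snapshot entity, an
    edge wherever the similarity score exceeds `threshold`, and W_t
    holding the edge weights (the similarity scores)."""
    n = len(entities)
    V = list(range(n))
    E = []
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity(entities[i], entities[j])
            if s > threshold:
                E.append((i, j))
                W[i, j] = W[j, i] = s
    return V, E, W

# Example with a simple Jaccard keyword overlap as the scorer:
ents = [{"data space", "entity"}, {"entity", "classification"}, {"parking"}]
V, E, W = build_snapshot_graph(ents, lambda a, b: len(a & b) / len(a | b))
```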
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610348890.4A CN106067029B (en) | 2016-05-24 | 2016-05-24 | The entity classification method in data-oriented space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106067029A true CN106067029A (en) | 2016-11-02 |
CN106067029B CN106067029B (en) | 2019-06-18 |
Family
ID=57420728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610348890.4A Expired - Fee Related CN106067029B (en) | 2016-05-24 | 2016-05-24 | The entity classification method in data-oriented space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106067029B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060200431A1 (en) * | 2005-03-01 | 2006-09-07 | Microsoft Corporation | Private clustering and statistical queries while analyzing a large database |
CN102388390A (en) * | 2009-04-01 | 2012-03-21 | 微软公司 | Clustering videos by location |
CN103902699A (en) * | 2014-03-31 | 2014-07-02 | 哈尔滨工程大学 | Data space retrieval method applied to big data environments and supporting multi-format feature |
CN103984681A (en) * | 2014-03-31 | 2014-08-13 | 同济大学 | News event evolution analysis method based on time sequence distribution information and topic model |
US20140320388A1 (en) * | 2013-04-25 | 2014-10-30 | Microsoft Corporation | Streaming k-means computations |
CN104731811A (en) * | 2013-12-20 | 2015-06-24 | 北京师范大学珠海分校 | Cluster information evolution analysis method for large-scale dynamic short texts |
2016-05-24: CN application CN201610348890.4A, patent CN106067029B (en), status not active: Expired - Fee Related
Non-Patent Citations (6)
Title |
---|
GUANWEN ZHU 等: "Query Planning with Source Descriptions for Deep Web", 《PROCEEDINGS OF THE 2012 INTERNATIONAL CONFERENCE ON CYBERNETICS AND INFORMATICS》 * |
YUN CHI 等: "On evolutionary spectral clustering", 《ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA》 * |
侯薇 等: "一种基于隶属度优化的演化聚类算法", 《计算机研究与发展》 * |
祝官文 等: "基于主题和表单属性的深层网络数据源分类方法", 《电子学报》 * |
董红斌 等: "协同演化算法在聚类中的应用", 《模式识别与人工智能》 * |
高兵 等: "基于共享最近邻密度的演化数据流聚类算法", 《北京科技大学学报》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108806355A (en) * | 2018-04-26 | 2018-11-13 | 浙江工业大学 | A kind of calligraphy and painting art interactive education system |
CN108806355B (en) * | 2018-04-26 | 2020-05-08 | 浙江工业大学 | Painting and calligraphy art interactive education system |
CN108932528A (en) * | 2018-06-08 | 2018-12-04 | 哈尔滨工程大学 | Similarity measurement and method for cutting in chameleon algorithm |
CN109543712A (en) * | 2018-10-16 | 2019-03-29 | 哈尔滨工业大学 | Entity recognition method on temporal dataset |
CN110033644A (en) * | 2019-04-22 | 2019-07-19 | 泰华智慧产业集团股份有限公司 | Parking position reserves air navigation aid and system |
CN111161819A (en) * | 2019-12-31 | 2020-05-15 | 重庆亚德科技股份有限公司 | Traditional Chinese medical record data processing system and method |
CN114386422A (en) * | 2022-01-14 | 2022-04-22 | 淮安市创新创业科技服务中心 | Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction |
CN114386422B (en) * | 2022-01-14 | 2023-09-15 | 淮安市创新创业科技服务中心 | Intelligent auxiliary decision-making method and device based on enterprise pollution public opinion extraction |
CN116450830A (en) * | 2023-06-16 | 2023-07-18 | 暨南大学 | Intelligent campus pushing method and system based on big data |
CN116450830B (en) * | 2023-06-16 | 2023-08-11 | 暨南大学 | Intelligent campus pushing method and system based on big data |
Also Published As
Publication number | Publication date |
---|---|
CN106067029B (en) | 2019-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106067029A (en) | The entity classification method in data-oriented space | |
Paredes et al. | Machine learning or discrete choice models for car ownership demand estimation and prediction? | |
US20180349384A1 (en) | Differentially private database queries involving rank statistics | |
US9355367B2 (en) | System and method for using graph transduction techniques to make relational classifications on a single connected network | |
CN106817251B (en) | Link prediction method and device based on node similarity | |
Guo et al. | Local community detection algorithm based on local modularity density | |
Liu et al. | Deep learning approaches for link prediction in social network services | |
Adcock et al. | Tree decompositions and social graphs | |
Gong et al. | Identification of multi-resolution network structures with multi-objective immune algorithm | |
Chen et al. | Exploiting structural and temporal evolution in dynamic link prediction | |
Karpatne et al. | Predictive learning in the presence of heterogeneity and limited training data | |
CN107220311A (en) | A kind of document representation method of utilization locally embedding topic modeling | |
Shahbazi et al. | A survey on techniques for identifying and resolving representation bias in data | |
CN114154557A (en) | Cancer tissue classification method, apparatus, electronic device, and storage medium | |
Ban et al. | Micro-directional propagation method based on user clustering | |
CN106651461A (en) | Film personalized recommendation method based on gray theory | |
CN101241520A (en) | Model state creation method based on characteristic suppression in finite element modeling | |
Zhang et al. | Closeness degree-based hesitant trapezoidal fuzzy multicriteria decision making method for evaluating green suppliers with qualitative information | |
Rahmani Seryasat et al. | Predicting the number of comments on Facebook posts using an ensemble regression model | |
Chhabra et al. | Missing value imputation using hybrid k-means and association rules | |
CN117036781A (en) | Image classification method based on tree comprehensive diversity depth forests | |
Paul et al. | Community detection using Local Group Assimilation | |
de Sá et al. | A novel approach to estimated Boulingand-Minkowski fractal dimension from complex networks | |
Huang et al. | Community detection algorithm for social network based on node intimacy and graph embedding model | |
Ma et al. | Discover semantic topics in patents within a specific domain |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | | Granted publication date: 20190618; Termination date: 20200524 |