CN106067029A - Entity classification method for data spaces - Google Patents

Entity classification method for data spaces

Info

Publication number
CN106067029A
CN106067029A (application CN201610348890.4A)
Authority
CN
China
Prior art keywords
entity
snapshot
sigma
time step
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610348890.4A
Other languages
Chinese (zh)
Other versions
CN106067029B (en)
Inventor
王念滨
王红滨
周连科
祝官文
何鸣
王瑛琦
宋奎勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201610348890.4A
Publication of CN106067029A
Application granted
Publication of CN106067029B
Current legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An entity classification method for data spaces, belonging to the field of natural language processing. Under an evolving environment, existing methods cannot classify entities because they assume entities are in a static state. In the entity classification method for data spaces, first, for evolving data space entities, an improved, evolutionary K-Means clustering framework is proposed, i.e. an objective cost function based on silhouette value and KL divergence is defined; second, a novel data space entity similarity measurement method is designed; then, according to heuristic rules, an evolutionary K-Means clustering algorithm is proposed. In addition, the proposed evolutionary clustering framework is further extended to handle the cases where the number of clusters changes over time or snapshot entities are added or removed over time. The present invention can not only capture the current entity clustering result with high quality, but also robustly reflect the historical clustering situation.

Description

Entity classification method for data spaces
Technical field
The present invention relates to an entity classification method for data spaces.
Background
Data space integration is one of the important approaches to building a data space. A data space faces large-scale data with heterogeneous structures, complex semantic relationships, and distributed storage, so data space integration mainly involves two kinds of work: (1) entity integration; (2) entity relationship integration. At present, existing data space integration work focuses mainly on entity relationship integration and has proposed some effective strategies or methods, whereas research on entity integration [44] is relatively scarce. Research on data space integration, especially entity integration, is therefore significant. As an important step of entity integration, entity classification is widely used, for example in question answering, relation extraction, data space querying, machine translation, and text clustering. Entity classification techniques for data spaces are therefore of great significance.
At present, research on classifying (named) entities has attracted extensive attention from many scholars in the field of natural language processing (NLP). This work falls into two broad classes: coarse-grained entity classification and fine-grained entity classification. Coarse-grained entity classification aims to divide a group of entities into a small set of coarse-grained class labels; the number of classes is typically fewer than 20 and the classes have no hierarchy, e.g. person names, organization names, and place names. Common approaches include methods based on machine learning and methods based on auxiliary knowledge such as ontologies and external resources. For example, Chifu et al. use an unsupervised neural network model to classify named entities without supervision, Kliegr proposes an unsupervised Bag-of-Articles named entity classification method, and Gamallo and Garcia propose a resource-based named entity classification system. Fine-grained entity classification divides entities into finer-grained categories, with more classes and a more complex class hierarchy; for example, FIGER uses 112 Freebase types and HYENA uses 505 YAGO types. Typical methods are based on context or on grammatical features; for example, Gillick et al. propose a context-dependent fine-grained entity classification method, and Giuliano and Gliozzo propose a fine-grained entity classification method using an instance-based learning algorithm over grammatical features, thereby generating richer ontologies.
However, the entity classification methods in the NLP field above usually rely on contextual information, linguistic information, and prior entity-class knowledge such as external knowledge features, and the objects being classified are static; entity classification techniques for data spaces, by contrast, have rarely been studied. In a data space environment, entity classification is a very challenging task, mainly for the following reasons: (1) Richness of entity information. A data space entity contains not only its name information but also rich attribute feature information and content feature information; in fact, this part of the information is even more important, so a more appropriate similarity function is needed to assess the similarity between data space entities. (2) Lag of entity class knowledge. Because a data space follows a pay-as-you-go integration mode of building while using, entity class knowledge is essentially acquired gradually, so clustering is a more appropriate way to realize entity classification. (3) Dynamic evolution of entities. Traditional entity classification methods make a strict assumption: entities are static and do not evolve over time. This assumption does not hold in a data space environment, where the extracted entity information and the number of entities change constantly. Therefore, classifying entities under an evolving environment is even more challenging.
Summary of the invention
The purpose of the invention is to solve the problem that, under an evolving environment, existing methods cannot classify entities because they assume entities are in a static state, and to propose an entity classification method for data spaces.
An entity classification method for data spaces, realized by the following steps:
Step 1: for evolving data space entities, propose an evolutionary K-Means clustering framework, i.e. define an objective cost function based on silhouette value and KL divergence;
Step 2: design a data space entity similarity measurement method;
Step 3: propose an evolutionary K-Means clustering algorithm, and solve the initial-point selection problem and the classification problem of evolving data space entities;
Step 4: extend the evolutionary K-Means clustering framework of Step 1 to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.
The invention has the following benefits:
For evolving data space entities, the invention proposes an improved, evolutionary K-Means clustering framework, i.e. defines an objective cost function based on silhouette value and KL divergence; it considers not only the quality of the current clustering, i.e. the snapshot cost, but also the temporal smoothness with respect to all historical clustering structures, i.e. the history cost. The designed data space entity similarity measurement method considers not only richer entity self-information, such as the structured feature information and unstructured feature information of an entity, but also the historical occurrence pattern information between entities, thereby measuring the similarity between entities more accurately. An evolutionary K-Means clustering algorithm is proposed, solving the initial-point selection problem and the classification problem of evolving data space entities. Finally, the evolutionary K-Means clustering framework is further extended so that, in the cases where the number of clusters changes over time or snapshot entities are added or removed over time, the framework remains applicable.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is a schematic diagram of the snapshot graphs at adjacent time steps. The figure shows the snapshot graphs at time steps t-1 and t. Each snapshot graph contains six vertices, each vertex corresponding to one snapshot entity, and the number on an edge represents the similarity between snapshot entities. From time step t-1 to t, the similarities of some pairs change (e.g. snapshot entities V1 and V2, snapshot entities V3 and V4), while others do not (e.g. snapshot entities V5 and V6).
Assuming the snapshot graph $G^t$ contains n vertices, $W^t$ is the adjacency (similarity) matrix of $G^t$. As time evolves, the similarity history between snapshot entities is captured by a series of snapshot graphs $\langle G^1, G^2, \ldots, G^h, \ldots, G^t\rangle$;
Fig. 3 shows the clustering scenario for evolving entities. The figure shows the snapshot graphs at time steps t-1 and t, where each vertex represents a snapshot entity and the number on an edge represents the similarity score between snapshot entities. Clearly, at time step t-1 the six snapshot entities should be clustered according to clustering 1; at time step t the clustering result is not unique, e.g. the six snapshot entities may be clustered according to clustering 2 or clustering 3. Obviously, both clustering 2 and clustering 3 guarantee good clustering quality at the current time step t. However, according to the basic principle, clustering 2 is preferred, because it is more consistent with the clustering result of the historical time step t-1;
Fig. 4 shows the snapshot cost under different λ in Experiment 1;
Fig. 5 shows the snapshot cost under different β in Experiment 2;
Figs. 6 to 8 show the history cost under different α in Experiment 3;
Fig. 9 shows the mutual information under different evolutionary methods in Experiment 4;
Fig. 10 is a schematic diagram of the average running time per iteration of SD-EKM;
Fig. 11 is a schematic diagram of the average number of iterations at convergence.
Detailed description of embodiments
Embodiment 1:
The entity classification method for data spaces of this embodiment, shown in the flow chart of Fig. 1, is realized by the following steps:
Step 1: for evolving data space entities, propose an evolutionary K-Means clustering framework, i.e. define an objective cost function based on silhouette value and KL divergence;
Step 2: design a data space entity similarity measurement method;
Step 3: propose an evolutionary K-Means clustering algorithm, and solve the initial-point selection problem and the classification problem of evolving data space entities;
Step 4: extend the evolutionary K-Means clustering framework of Step 1 to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.
Embodiment 2:
Different from Embodiment 1, in the entity classification method for data spaces of this embodiment, the process of proposing the evolutionary K-Means clustering framework in Step 1, i.e. defining the objective cost function based on silhouette value and KL divergence, is as follows:
Step 1.1: define the total objective cost function as a linear combination:
The cost function consists of two parts, the snapshot cost of the current time step and the history cost of the historical time steps, denoted $Cost_{snapshot}$ and $Cost_{temporal}$ respectively. The former only measures the snapshot quality of the current clustering result with respect to the current entity information, reflecting the metric of the clustering algorithm; clearly, a higher snapshot cost means lower snapshot quality. The latter measures temporal smoothness according to how well the current clustering structure fits the historical clustering structures; clearly, a higher history cost means that the clustering structures of consecutive time steps are less consistent, i.e. the temporal smoothness is weaker. In addition, the history costs of different historical time steps carry different weights. The total objective cost function, used to evaluate the K-Means clustering quality of evolving entities, is defined as a linear combination of the snapshot cost of the current time step and the history cost of the historical time steps, with the following formula:
$Cost_{total}^{t} = \alpha \cdot Cost_{snapshot}^{t} + (1-\alpha) \cdot \sum_{h=1}^{t-1} e^{t-h}\, Cost_{temporal}^{h}$    (1)
In the formula, 0 ≤ α ≤ 1 is the weight factor of the snapshot cost; $Cost_{snapshot}^{t}$ is the snapshot cost of the current time step t; $Cost_{temporal}^{h}$ is the history cost of historical time step h; the factor $e^{t-h}$ indicates that the closer a historical time step is to the current time step t, the heavier the weight of its history cost $Cost_{temporal}^{h}$ and the smaller its allowed deviation; since a smaller total objective cost is better, the closer a historical time step h is to the current time step t, the better the temporal smoothness of its clustering structure should be;
Step 1.2: measure the snapshot cost based on silhouette value:
Let the snapshot graph of the current time step t be $G^t = (V^t, E^t, W^t)$, where $|V^t| = n$ and $W^t$ is the similarity matrix between snapshot entities; the concrete similarity calculation is given in formula (11). The entity partition obtained on this snapshot graph is $Z^t = \{V_1^t, V_2^t, \ldots, V_k^t\}$, where $\cup_{p=1}^{k} V_p^t = V^t$ and the clusters are disjoint. The snapshot cost measures the snapshot quality of the current clustering result with respect to the current snapshot entities, reflecting the metric of the clustering algorithm; clearly, a higher snapshot cost means lower snapshot quality. Many mature clustering evaluation criteria exist, such as contingency tables, the sum-of-squared-errors criterion, the silhouette value, and label-based precision and recall. These criteria differ in whether they rely on a gold standard, in their dependence on the similarity measure, and in their bias towards the number of clusters. Because the number of entities in this setting is large, the method depends on a specific similarity measure, and only the entities themselves are referenced, the silhouette value criterion is used to measure the quality of the K-Means clustering result. The silhouette value, also called the silhouette coefficient, is a clustering evaluation method proposed by Kaufman and Rousseeuw that refers only to the data themselves and needs no gold standard; in this evaluation method each cluster is represented by a silhouette, which reflects the objects located inside the cluster and those far from it, capturing both cohesion and separation; the larger the silhouette value, the better the clustering. The snapshot cost is defined as:
$Cost_{snapshot}^{t} = 1 - \frac{1}{k}\sum_{p=1}^{k} sil(V_p^t)$    (2)
In the formula, k is the number of clusters of the entity partition at the current time step t, $V_p^t$ is the p-th cluster (of the entity partition), and $sil(V_p^t)$ is the mean silhouette value of cluster $V_p^t$. From formula (1), the snapshot cost should be as small as possible, and a larger mean silhouette value means a better clustering, so the mean silhouette value is inversely related to the snapshot cost. The physical meaning of formula (2) is that, under the entity partition $Z^t$, the larger the mean silhouette value, the better the clustering quality and the more accurately the result reflects the characteristics of the current snapshot entities;
Step 1.3: each cluster $V_p^t$ contains a group of snapshot entities $\{o_i^t\}$; the mean silhouette value $sil(V_p^t)$ of each cluster $V_p^t$ is defined as the average of the silhouette values of all snapshot entities in the cluster, specifically:
$sil(V_p^t) = \frac{1}{|V_p^t|}\sum_{o_i^t \in V_p^t} sil(o_i^t)$    (3)
In the formula, $o_i^t$ is a snapshot entity in cluster $V_p^t$, $|V_p^t|$ is the number of snapshot entities in cluster $V_p^t$, and $sil(o_i^t)$ is the silhouette value of snapshot entity $o_i^t$, computed as:
$sil(o_i^t) = \frac{b(o_i^t) - a(o_i^t)}{\max\{a(o_i^t),\, b(o_i^t)\}}$    (4)
where $b(o_i^t)$ is the average similarity between snapshot entity $o_i^t$ and the other snapshot entities $o_{i'}^t$ in its cluster $V_p^t$, and $a(o_i^t)$ is the maximum average similarity between snapshot entity $o_i^t$ and all snapshot entities $o_j^t$ of any other cluster $V_q^t$; the larger $sil(o_i^t)$, the more the within-cluster average similarity of snapshot entity $o_i^t$ exceeds its between-cluster average similarity, and thus the more correctly snapshot entity $o_i^t$ is classified;
Step 1.4: based on the physical meaning of the silhouette value of snapshot entity $o_i^t$ in formula (4) of Step 1.3, $b(o_i^t)$ is defined as:
$b(o_i^t) = \frac{1}{|V_p^t|-1}\sum_{o_{i'}^t \in V_p^t,\, o_i^t \neq o_{i'}^t} w_{ii'}^t$    (5)
and $a(o_i^t)$ is defined as:
$a(o_i^t) = \max_{V_q^t \neq V_p^t}\left\{\frac{1}{|V_q^t|}\sum_{o_j^t \in V_q^t} w_{ij}^t\right\}$    (6)
In the formulas, $o_{i'}^t$ is a snapshot entity in cluster $V_p^t$, $o_j^t$ is a snapshot entity in cluster $V_q^t$, $w_{ii'}^t$ is the similarity between snapshot entities $o_i^t$ and $o_{i'}^t$ in the same cluster, and $w_{ij}^t$ is the similarity between snapshot entities $o_i^t$ and $o_j^t$ in different clusters;
Step 1.5: substituting formulas (3) to (6) into formula (2), the snapshot cost is rewritten as:
$Cost_{snapshot}^{t} = 1 - \frac{1}{k}\sum_{p=1}^{k} sil(V_p^t) = 1 - \frac{1}{k}\sum_{p=1}^{k}\sum_{o_i^t \in V_p^t}\frac{1}{|V_p^t|}\cdot\frac{b(o_i^t)-a(o_i^t)}{\max\{a(o_i^t),b(o_i^t)\}} = 1 - \frac{1}{k}\sum_{p=1}^{k}\sum_{o_i^t \in V_p^t}\frac{1}{|V_p^t|}\cdot\frac{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t \in V_p^t,\, o_i^t \neq o_{i'}^t} w_{ii'}^t - \max_{q \neq p}\left\{\frac{1}{|V_q^t|}\sum_{o_j^t \in V_q^t} w_{ij}^t\right\}}{\max\left\{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t \in V_p^t,\, o_i^t \neq o_{i'}^t} w_{ii'}^t,\ \max_{q \neq p}\left\{\frac{1}{|V_q^t|}\sum_{o_j^t \in V_q^t} w_{ij}^t\right\}\right\}}$    (7)
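The following Python sketch illustrates how the silhouette-based snapshot cost of formula (7) can be evaluated from a similarity matrix and a hard partition. It is an illustrative sketch only: the variable names (W, clusters) are assumed, and it assumes at least two non-empty clusters.

```python
import numpy as np

def snapshot_cost(W, clusters):
    """Formula (7): 1 minus the average silhouette over clusters.
    W: symmetric similarity matrix; clusters: list of index lists (assumes k >= 2, all non-empty)."""
    k = len(clusters)
    cluster_sils = []
    for p, members in enumerate(clusters):
        sils = []
        for i in members:
            # b(o_i): average similarity to the other members of its own cluster
            others = [j for j in members if j != i]
            b = float(np.mean([W[i, j] for j in others])) if others else 0.0
            # a(o_i): maximum average similarity to the members of any other cluster
            a = max(float(np.mean([W[i, j] for j in grp]))
                    for q, grp in enumerate(clusters) if q != p)
            denom = max(a, b)
            sils.append((b - a) / denom if denom > 0 else 0.0)
        cluster_sils.append(float(np.mean(sils)))
    return 1.0 - sum(cluster_sils) / k
```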
Step 1.6: measure the history cost based on KL divergence:
First, the history cost measures temporal smoothness according to how well the current clustering structure fits the historical clustering structures; clearly, a smaller history cost means that the clustering structures of consecutive time steps are more consistent, i.e. the temporal smoothness is stronger. For the purpose of discussion,
let the snapshot graphs be $G^1, G^2, \ldots, G^h, \ldots, G^t$;
at the current time step t, the entity partition based on the corresponding snapshot graph $G^t$ is denoted $Z^t$;
at historical time step h, the entity partition based on the corresponding historical snapshot graph $G^h$ is denoted $Z^h$.
Second, define a measure for comparing two clusterings. Inspired by the idea of graph factorization clustering, a bipartite graph is used to represent the relations between entities and clusters, converting the entity partition $Z^t$ into a joint probability distribution problem on a bipartite graph:
Let $BG^t = (V^t, C^t, F^t, P^t)$ be a bipartite graph corresponding to the snapshot graph $G^t = (V^t, E^t, W^t)$, where $V^t$ is the set of snapshot entities, $C^t$ is the set of clusters, and $F^t$ is the set of edges, each edge having one endpoint in $V^t$ and one in $C^t$; $P^t$ is an n × k joint probability matrix corresponding to the edge-weight matrix of the bipartite graph. The joint probability formula $p_{ij}^t = p(c_j^t)\, p(o_i^t \mid c_j^t)$ determines the joint probability $p_{ij}^t$ between entity $o_i^t$ and cluster $c_j^t$, where $p(c_j^t)$ is the probability of cluster $c_j^t$ and $p(o_i^t \mid c_j^t)$ is the probability of entity $o_i^t$ given cluster $c_j^t$. If entity $o_i^t$ belongs to cluster $c_j^t$, then $p_{ij}^t = \frac{n_j}{n}\cdot\frac{1}{n_j} = \frac{1}{n}$, where $n_j$ and n are the number of snapshot entities in cluster $c_j^t$ and the number of all snapshot entities respectively; otherwise $p_{ij}^t = 0$. Since the clustering in the present invention is hard rather than soft, i.e. an entity can belong to only one cluster, for any row i of the joint probability matrix $P^t$ there is exactly one column j such that $p_{ij}$ is non-zero, so each row of $P^t$ sums to $\frac{1}{n}$.
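As an illustration of the joint probability matrix just described, the following sketch builds $P^t$ from a hard cluster assignment: since $p_{ij}^t = (n_j/n)\cdot(1/n_j) = 1/n$ whenever entity i belongs to cluster j, each row has exactly one non-zero entry. The function name and signature are illustrative, not taken from the patent.

```python
import numpy as np

def joint_probability_matrix(labels, k):
    """labels[i] is the cluster index of entity i; returns the n x k matrix P."""
    n = len(labels)
    P = np.zeros((n, k))
    for i, j in enumerate(labels):
        P[i, j] = 1.0 / n      # exactly one non-zero entry per row (hard clustering)
    return P
```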
Third, in the classification and clustering literature there are many methods for comparing two clusterings, such as the centroid-difference method, the chi-square method, the correlation coefficient method, and the KL divergence method. In this setting the clustering problem is regarded as a joint probability distribution problem over entities and clusters, so measuring the difference between two clusterings is equivalent to measuring the difference between two probability distributions. Since the KL divergence (also called relative entropy) is an information-theoretic measure of the difference between two probability distributions, the KL divergence method is used:
Given the bipartite graph $BG^t = (V^t, C^t, F^t, P^t)$ of the current time step t and the bipartite graph $BG^h = (V^h, C^h, F^h, P^h)$ of historical time step h, and the entity partition $Z^t$ of the current time step t and the entity partition $Z^h$ of historical time step h, where $BG^t$ corresponds to $Z^t$ and $BG^h$ corresponds to $Z^h$, the history cost of the two time steps h and t is defined as follows:
$Cost_{temporal}^{h} = KL(P^t \,\|\, P^h) = \sum_{i=1}^{n}\sum_{j=1}^{k} p_{ij}^{t} \times \log\left(p_{ij}^{t} / p_{ij}^{h}\right)$    (8)
where n is the number of snapshot entities, k is the number of clusters, $p_{ij}^{t}$ is an element of the joint probability matrix $P^t$ between snapshot entities and clusters at time step t, and $p_{ij}^{h}$ is an element of the joint probability matrix $P^h$ between snapshot entities and clusters at historical time step h;
Fourth, from the above analysis, the joint probability matrix $P^t$ or $P^h$ is a sparse matrix, i.e. it contains zero elements, and the standard KL divergence does not support the case where $p_{ij}^{t}$ or $p_{ij}^{h}$ is 0. To this end, the joint probability matrix $P^t$ or $P^h$ is smoothed as follows: a constant ε, with $\varepsilon = e^{-12}$, is added to each element $p_{ij}^{t}$ or $p_{ij}^{h}$; the elements are then normalized, denoted $\tilde{p}_{ij}^{t}$ or $\tilde{p}_{ij}^{h}$, and the smoothed probability matrices are denoted $\tilde{P}^t$ and $\tilde{P}^h$ respectively. Formula (8) is then modified to:
$Cost_{temporal}^{h} = KL(\tilde{P}^t \,\|\, \tilde{P}^h) = \sum_{i=1}^{n}\sum_{j=1}^{k} \tilde{p}_{ij}^{t} \times \log\left(\tilde{p}_{ij}^{t} / \tilde{p}_{ij}^{h}\right)$    (9)
where n is the number of snapshot entities, k is the number of clusters, $\tilde{p}_{ij}^{t}$ is an element of the smoothed joint probability matrix $\tilde{P}^t$ between entities and clusters at time step t, and $\tilde{p}_{ij}^{h}$ is an element of the smoothed joint probability matrix $\tilde{P}^h$ between entities and clusters at historical time step h;
Fifth, substituting formulas (7) and (9) into formula (1), the total objective cost function is equivalent to:
$Cost_{total}^{t} = \alpha \cdot \left(1 - \frac{1}{k}\sum_{p=1}^{k}\sum_{o_i^t \in V_p^t}\frac{1}{|V_p^t|}\cdot\frac{b(o_i^t)-a(o_i^t)}{\max\{a(o_i^t),b(o_i^t)\}}\right) + (1-\alpha) \cdot \sum_{h=1}^{t-1} e^{t-h} \sum_{i=1}^{n}\sum_{j=1}^{k}\tilde{p}_{ij}^{t}\,\log\left(\tilde{p}_{ij}^{t}/\tilde{p}_{ij}^{h}\right)$    (10)
with $b(o_i^t)$ and $a(o_i^t)$ given by formulas (5) and (6). Here 0 ≤ α ≤ 1 is the weight factor of the snapshot cost, k is the number of clusters (entity partitions) at the current time step t, $V_p^t$ is the p-th element of the entity partition $Z^t$, $w_{ii'}^t$ and $w_{ij}^t$ are elements of $W^t$ of the snapshot graph $G^t = (V^t, E^t, W^t)$, $o_i^t$, $o_{i'}^t$ and $o_j^t$ are snapshot entities of $G^t$, n is the number of snapshot entities in $V^t$ of the bipartite graph $BG^t = (V^t, C^t, F^t, P^t)$, $\tilde{p}_{ij}^{t}$ is an element of the smoothed joint probability matrix $\tilde{P}^t$, and $\tilde{p}_{ij}^{h}$ is an element of the smoothed joint probability matrix $\tilde{P}^h$.
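A minimal sketch of the smoothed KL history cost of formula (9) and the total cost of formula (1). It assumes, as one reading of the text, that ε = e^-12 and that smoothing renormalises the whole matrix to sum to one; the exact normalisation is not spelled out above, so this detail is an assumption.

```python
import numpy as np

EPS = np.exp(-12)                      # the text gives epsilon = e^-12

def smooth(P):
    # add epsilon to every element, then renormalise (normalisation detail assumed)
    Q = P + EPS
    return Q / Q.sum()

def history_cost(P_t, P_h):
    # smoothed KL divergence, formula (9)
    Pt, Ph = smooth(P_t), smooth(P_h)
    return float(np.sum(Pt * np.log(Pt / Ph)))

def total_cost(snap_cost, hist_costs, alpha):
    # formula (1): hist_costs[h-1] is the history cost of historical time step h
    t = len(hist_costs) + 1
    weighted = sum(np.exp(t - h) * c for h, c in enumerate(hist_costs, start=1))
    return alpha * snap_cost + (1.0 - alpha) * weighted
```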
Embodiment 3:
Different from Embodiments 1 and 2, in the entity classification method for data spaces of this embodiment, the process of designing the data space entity similarity measurement method in Step 2 is as follows:
On the one hand, a snapshot entity itself contains rich information, namely structured attribute information and unstructured content information; on the other hand, in a data space environment entities may occur repeatedly, and this historical occurrence pattern information also helps to judge whether two entities are similar. To this end, the similarity of data space entities, i.e. snapshot entities, is measured according to both the entity's self-information and the entity's historical occurrence pattern information; that is, the similarity function of snapshot entities consists of two parts, self-similarity and historical similarity, and is defined as:
$w_{ij}^{t} = Sim^{t}(o_i^t, o_j^t) = \beta \cdot Sim_{self}^{t}(o_i^t, o_j^t) + (1-\beta) \cdot Sim_{his}^{t}(o_i^t, o_j^t)$    (11)
where 0 ≤ β ≤ 1 is the weight of the self-similarity, $o_i^t$ and $o_j^t$ are snapshot entities at the current time step t, $Sim_{self}^{t}(o_i^t,o_j^t)$ is the self-similarity between snapshot entities $o_i^t$ and $o_j^t$, and $Sim_{his}^{t}(o_i^t,o_j^t)$ is the historical similarity between snapshot entities $o_i^t$ and $o_j^t$;
Intuitively, entities within the same class share a higher proportion of identical or similar attribute names, while entities of different classes share a lower proportion; in addition, some entities contain only unstructured information, and two entities with similar content are, to a certain extent, likely to belong to the same class. To this end, based on the structured feature information corresponding to the snapshot entity attribute features and the unstructured feature information corresponding to the content features, the self-similarity between snapshot entities is defined as follows:
$Sim_{self}^{t}(o_i^t,o_j^t) = \lambda\cdot sim_{attr}^{t}(o_i^t,o_j^t) + (1-\lambda)\cdot sim_{cont}^{t}(o_i^t,o_j^t) = \lambda\cdot\frac{|o_i^t.Attr \cap o_j^t.Attr|}{|o_i^t.Attr \cup o_j^t.Attr|} + (1-\lambda)\cdot\frac{|o_i^t.Cont \cap o_j^t.Cont|}{|o_i^t.Cont \cup o_j^t.Cont|}$    (12)
where 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity, $sim_{attr}^{t}(o_i^t,o_j^t)$ and $sim_{cont}^{t}(o_i^t,o_j^t)$ are the snapshot entity attribute feature similarity and content feature similarity respectively, $o_i^t.Attr$ and $o_j^t.Attr$ are the snapshot entity attribute features, and $o_i^t.Cont$ and $o_j^t.Cont$ are the snapshot entity content features;
If the occurrence-count patterns of two snapshot entities over the past historical time steps are relatively consistent, then for the two snapshot entities at the current time step this historical pattern correlation indicates that they are similar. The classical Pearson correlation coefficient is used to measure the historical similarity, specifically:
$sim_{his}^{t}(o_i^t,o_j^t) = \rho(o_i^t,o_j^t) = \frac{\sum_{h=1}^{t-1}(n_i^h-\mu_i^t)(n_j^h-\mu_j^t)}{\sqrt{\sum_{h=1}^{t-1}(n_i^h-\mu_i^t)^2}\,\sqrt{\sum_{h=1}^{t-1}(n_j^h-\mu_j^t)^2}}$    (13)
where $o_i^t$ and $o_j^t$ are snapshot entities at the current time step t, $n_i^h$ and $n_j^h$ are the numbers of times snapshot entities $o_i^t$ and $o_j^t$ occur at historical time step h, and $\mu_i^t$ and $\mu_j^t$ are the mean occurrence counts of snapshot entities $o_i^t$ and $o_j^t$ over all historical time steps;
Substituting formulas (12) and (13) into formula (11), the similarity function of snapshot entities is rewritten as:
$w_{ij}^{t} = \beta\cdot Sim_{self}^{t}(o_i^t,o_j^t) + (1-\beta)\cdot Sim_{his}^{t}(o_i^t,o_j^t) = \beta\cdot\left(\lambda\cdot\frac{|o_i^t.Attr \cap o_j^t.Attr|}{|o_i^t.Attr \cup o_j^t.Attr|}+(1-\lambda)\cdot\frac{|o_i^t.Cont \cap o_j^t.Cont|}{|o_i^t.Cont \cup o_j^t.Cont|}\right) + (1-\beta)\cdot\frac{\sum_{h=1}^{t-1}(n_i^h-\mu_i^t)(n_j^h-\mu_j^t)}{\sqrt{\sum_{h=1}^{t-1}(n_i^h-\mu_i^t)^2}\,\sqrt{\sum_{h=1}^{t-1}(n_j^h-\mu_j^t)^2}}$    (14)
where $o_i^t$ and $o_j^t$ are snapshot entities at the current time step t, $n_i^h$ and $n_j^h$ are the numbers of times snapshot entities $o_i^t$ and $o_j^t$ occur at historical time step h, $\mu_i^t$ and $\mu_j^t$ are their mean occurrence counts over all historical time steps, $o_i^t.Attr$ and $o_j^t.Attr$ are their attribute features, $o_i^t.Cont$ and $o_j^t.Cont$ are their content features, 0 ≤ β ≤ 1 is the weight of the self-similarity $Sim_{self}^{t}(o_i^t,o_j^t)$, and 0 ≤ λ ≤ 1 is the weight of the attribute feature similarity.
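A minimal sketch of the snapshot-entity similarity of formula (14): Jaccard similarity of attribute-name sets and content-keyword sets for the self part, and the Pearson correlation of historical occurrence counts for the history part. The default weights β = 0.75 and λ = 0.6 follow the values found best in the experiments below; the entity representation (plain Python sets and count lists) is an assumption for illustration.

```python
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def self_similarity(attr_i, attr_j, cont_i, cont_j, lam):
    # formula (12): weighted Jaccard over attribute names and content keywords
    return lam * jaccard(attr_i, attr_j) + (1 - lam) * jaccard(cont_i, cont_j)

def history_similarity(counts_i, counts_j):
    # formula (13): Pearson correlation of occurrence counts over past time steps
    x = np.asarray(counts_i, dtype=float)
    y = np.asarray(counts_j, dtype=float)
    if len(x) < 2 or x.std() == 0.0 or y.std() == 0.0:
        return 0.0                     # degenerate cases are not specified in the text
    return float(np.corrcoef(x, y)[0, 1])

def similarity(attr_i, attr_j, cont_i, cont_j, counts_i, counts_j,
               beta=0.75, lam=0.6):
    # formula (11)/(14): weighted combination of self and historical similarity
    return (beta * self_similarity(attr_i, attr_j, cont_i, cont_j, lam)
            + (1 - beta) * history_similarity(counts_i, counts_j))
```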
Embodiment 4:
Different from Embodiment 3, in the entity classification method for data spaces of this embodiment, the process of proposing the evolutionary K-Means clustering algorithm in Step 3 and solving the initial-point selection problem and the classification problem of evolving data space entities is as follows:
First, some related definitions are given in order to better select the initial centre points, and then the evolutionary K-Means clustering algorithm is described in detail.
It is well known that the quality of the initial points strongly affects the quality of the K-Means clustering result, and the traditional random selection method easily causes problems such as slow convergence. Therefore, before solving the initial-point selection problem, the following definitions are given:
Definition of η-neighbours at time t: given a snapshot graph $G^t = (V^t, E^t, W^t)$ and a parameter 0 < η ≤ 1, for any snapshot entity $o_i^t$ the set of its η-neighbours at time t is denoted $N(o_i^t, \eta)$, where $|V^t|$ is the number of vertices in the snapshot graph $G^t$ and $w_{ij}^t$ is an element of $W^t$;
Definition of similarity density at time t: given a snapshot graph $G^t = (V^t, E^t, W^t)$ and the η-neighbours $N(o_i^t, \eta)$ at time t, for any snapshot entity $o_i^t$ the similarity density at time t is defined as:
$Density_{sim}(o_i^t) = |N(o_i^t,\eta)| \times \log\left(1 + \frac{1}{|N(o_i^t,\eta)|-1}\sum_{o_j^t \in N(o_i^t,\eta),\, o_j^t \neq o_i^t} w_{ij}^t\right)$;
From the above definition: the higher the similarity density of a snapshot entity $o_i^t$, the larger the number of its η-neighbours $N(o_i^t,\eta)$ and the higher its average similarity to the other snapshot entities in $N(o_i^t,\eta)$; and the higher the similarity density of a snapshot entity, the larger the probability that it is a cluster centre. The definition of similarity density at time t avoids selecting, as K-Means cluster centre points, snapshot entities in low-density regions, noise data such as isolated snapshot entities, or snapshot entities at the edge of a cluster;
Second, the selection principle of the first initial centre point is determined: the snapshot entity with the maximum similarity density is selected;
The selection principle of the initial centre points other than the first one is determined: the candidate must not be an η-neighbour of any already selected initial centre point, its average similarity to all already selected initial centre points should be low, and its similarity density should be high (for example, the average similarity to all selected initial centre points might be 0.3 while the similarity density of the current candidate is 10). This principle can be formalized as the following equation:
$c_j = \underset{o_i^t \in V^t,\ o_i^t \notin \cup_{l=1}^{j-1} N(o_{s_l}^t,\eta)}{\arg\max}\left\{Density_{sim}(o_i^t) \Big/ \left(1 + \frac{1}{j-1}\sum_{l=1}^{j-1} w_{s_l i}^{t}\right)\right\}$    (15)
where 1 ≤ l ≤ j-1 indexes the already selected initial centre points, $\cup_{l=1}^{j-1} N(o_{s_l}^t,\eta)$ is the union of the η-neighbours of all already selected initial centre points, $w_{s_l i}^{t}$ is the similarity between snapshot entity $o_i^t$ and the selected initial centre point $o_{s_l}^t$, and $Density_{sim}(o_i^t)$ is the similarity density of snapshot entity $o_i^t$ at time t; the constant 1 in the denominator prevents the denominator from being zero;
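A minimal sketch of similarity-density seeding per formula (15). The η-neighbourhood is implemented here as the set of entities whose similarity to $o_i^t$ is at least η; this is an assumption, since the formal definition of $N(o_i^t, \eta)$ is not reproduced above, and all function names are illustrative.

```python
import numpy as np

def eta_neighbours(W, i, eta):
    return [j for j in range(W.shape[0]) if j != i and W[i, j] >= eta]

def similarity_density(W, i, eta):
    nbrs = eta_neighbours(W, i, eta)
    if len(nbrs) < 2:
        return 0.0
    total = sum(W[i, j] for j in nbrs)
    return len(nbrs) * np.log(1.0 + total / (len(nbrs) - 1))

def select_seeds(W, k, eta):
    n = W.shape[0]
    dens = [similarity_density(W, i, eta) for i in range(n)]
    seeds = [int(np.argmax(dens))]                     # first seed: highest density
    excluded = set(eta_neighbours(W, seeds[0], eta))
    while len(seeds) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in seeds or i in excluded:
                continue
            # formula (15): density divided by 1 + average similarity to chosen seeds
            penalty = 1.0 + float(np.mean([W[i, s] for s in seeds]))
            score = dens[i] / penalty
            if score > best_score:
                best, best_score = i, score
        if best is None:                               # every entity excluded; fallback not in the text
            break
        seeds.append(best)
        excluded |= set(eta_neighbours(W, best, eta))
    return seeds
```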
Third, the basic idea of the evolutionary K-Means clustering algorithm is as follows: for every time step up to the current time step, the K-Means clustering algorithm is executed in a loop. At each time step, the initial centre points are selected based on the similarity density and formula (15), and then the following operations are performed iteratively:
1) assign each snapshot entity to the cluster centre with which it has the highest similarity;
2) update the cluster centres, until the convergence condition that the objective cost of formula (10) is minimal is reached;
The detailed procedure of the evolutionary K-Means clustering algorithm is as follows:
Input: a series of snapshot entity sets of different time steps $O = \{O^1, O^2, \ldots, O^h, \ldots, O^t\}$, and the corresponding set of cluster numbers for the different time steps $K = \{k_1, k_2, \ldots, k_h, \ldots, k_t\}$;
Output: the set of clustering results of all time steps $C = \{C^1, C^2, \ldots, C^h, \ldots, C^t\}$, where h denotes the time step, h = 1, 2, ..., t;
(1) for each time step h, loop over the following:
(2) use formula (14) to compute the similarity matrix $W^h$ of the snapshot entity set $O^h$ of the current time step h, and build the corresponding snapshot graph $G^h = (V^h, E^h, W^h)$;
(3) initialize the set of cluster centre points to empty;
(4) select the initial centre points: first select the snapshot entity with the highest similarity density as the first initial centre point $c_1^h$, then compute and select the remaining initial centre points $c_j^h$ according to formula (15), where j runs in ascending order from 1 to $k_h$ and the superscript h denotes the time step;
(5) loop: assign each snapshot entity $o_i^h$ of the snapshot entity set $O^h$ to the cluster whose centre is most similar to it; update the centre point of each cluster and record the clustering result $C^h$; until the convergence condition that the objective cost function $Cost_{total}^{h}$ of formula (10) is minimal is met;
(6) accumulate and update the clustering results of the different time steps;
(7) return the clustering results C of all time steps.
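A minimal sketch of the per-time-step clustering loop, reusing select_seeds from the sketch above. Two details are assumptions not fixed by the text: cluster centres are updated as medoids (the member with the highest average within-cluster similarity), and iteration stops when the assignments no longer change or an iteration budget is exhausted, rather than by evaluating formula (10) directly.

```python
import numpy as np

def assign(W, centres):
    # each snapshot entity goes to the centre it is most similar to
    return [int(np.argmax([W[i, c] for c in centres])) for i in range(W.shape[0])]

def update_centres(W, labels, centres):
    new_centres = []
    for p, old in enumerate(centres):
        members = [i for i, lab in enumerate(labels) if lab == p]
        if not members:
            new_centres.append(old)        # keep the old centre for an empty cluster
        else:
            # medoid update: member with the largest average within-cluster similarity
            new_centres.append(max(members,
                                   key=lambda i: float(np.mean([W[i, j] for j in members]))))
    return new_centres

def cluster_time_step(W, k, eta=0.5, max_iter=100):
    centres = select_seeds(W, k, eta)      # similarity-density seeding (previous sketch)
    labels = assign(W, centres)
    for _ in range(max_iter):
        centres = update_centres(W, labels, centres)
        new_labels = assign(W, centres)
        if new_labels == labels:           # assignments stable: stop
            break
        labels = new_labels
    return labels
```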
Embodiment 5:
Different from Embodiments 1, 2 and 4, in the entity classification method for data spaces of this embodiment, the process of extending the evolutionary K-Means clustering framework of Step 1 to the cases in Step 4, where the number of clusters changes over time or snapshot entities are added or removed over time, is as follows:
In the evolutionary K-Means clustering framework described in Embodiments 1 to 4, the number of clusters does not change over time, and the set of snapshot entities to be clustered is the same at all time steps, i.e. snapshot entities are neither added nor removed. In practical applications, however, these two assumptions are too strict. To this end, the evolutionary K-Means clustering framework proposed above is extended to handle the following situations:
First, when the number of clusters changes over time:
When the number of clusters $k_h$ at historical time step h is smaller than the number of clusters $k_t$ at the current time step t, it suffices to append the corresponding columns to the joint probability matrix $P^h$, extending it with zero entries in the new columns, because for a newly added cluster the joint probability of a snapshot entity occurring with the new cluster at historical time step h is zero. After extension, the extended $P^h$ and $P^t$ are both n × $k_t$ joint probability matrices, and formula (10) is revised accordingly.
When the number of clusters $k_h$ at historical time step h is larger than the number of clusters $k_t$ at the current time step t, it suffices to append the corresponding columns to the joint probability matrix $P^t$, extending it with zero entries in the new columns, because for a deleted cluster the joint probability of a snapshot entity occurring with the deleted cluster at the current time step t is zero. After extension, $P^h$ and the extended $P^t$ are both n × $k_h$ joint probability matrices, and formula (10) is revised accordingly.
Second, when snapshot entities are added or removed over time:
Suppose that at historical time step h, $P^h$ is an $n_h$ × k joint probability matrix, at the current time step t, $P^t$ is an $n_t$ × k joint probability matrix, and $n_0$ snapshot entities occur in both time steps h and t. When snapshot entities of historical time step h have been removed, then for time step t the joint probability of those removed snapshot entities occurring with the current clusters is 0, and the corresponding rows are appended to $P^t$. When snapshot entities are newly added at the current time step t, then for historical time step h the joint probability of those newly added snapshot entities occurring with the historical clusters is 0, and the corresponding rows are appended to $P^h$. After extension, the extended $P^h$ and the extended $P^t$ are both $(n_h + n_t - n_0)$ × k joint probability matrices, and formula (10) is revised accordingly.
In these formulas, the symbol $\tilde{X}$ denotes the matrix obtained from a matrix X after the smoothing process of formula (9), and $\tilde{x}_{ij}$ is an element of the matrix $\tilde{X}$.
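A minimal sketch of the padding step described above: zero rows and columns are appended so that the historical and current joint probability matrices have a common shape before the smoothed KL cost is computed. It assumes that the entities shared by both time steps occupy the leading rows of both matrices, which the text does not state explicitly.

```python
import numpy as np

def pad_to_common_shape(P_h, P_t):
    """Append zero rows/columns so P_h and P_t share the shape (n_h + n_t - n_0) x max(k_h, k_t)."""
    rows = max(P_h.shape[0], P_t.shape[0])
    cols = max(P_h.shape[1], P_t.shape[1])
    pad = lambda P: np.pad(P, ((0, rows - P.shape[0]), (0, cols - P.shape[1])))
    return pad(P_h), pad(P_t)          # zero-padded copies; originals are unchanged
```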
Embodiment 6:
Different from Embodiment 5, in the entity classification method for data spaces of this embodiment, the snapshot entity is defined as follows: at time t, the snapshot entity can be formalized as $o^t = (Attr, Cont)$, where Attr represents the structured feature information of snapshot entity $o^t$, such as the set of attribute names in a tuple, with $Attr = \{a_1, a_2, \ldots, a_n\}$, and Cont represents the unstructured feature information of snapshot entity $o^t$, such as the set of keywords in its content, with $Cont = \{keyword_1, keyword_2, \ldots, keyword_m\}$; n and m are the numbers of elements of the sets Attr and Cont respectively.
Note that at different moments the structured and unstructured feature information of an entity o may change; in reality this may be caused by many reasons, for example an increase in the information sources from which the entity is extracted. The entity information extraction problem at any moment is not of concern here. All the snapshot entities of the current time step t are denoted $O^t$. A schematic diagram of the snapshot graphs at adjacent time steps is shown in Fig. 2.
Embodiment 7:
Different from Embodiments 1, 2, 4 and 6, in the entity classification method for data spaces of this embodiment, at moment t a paper-class snapshot entity $o^t$ contains the attributes title, author and size, and also contains the unstructured content information "data space", "entity", "classification"; then $o^t$ = ({title, author, size}, {data space, entity, classification});
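Purely as an illustration of the formalization $o^t = (Attr, Cont)$ and the paper example above, the following sketch models a snapshot entity as two sets; the field names are illustrative, not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class SnapshotEntity:
    attr: set = field(default_factory=set)   # structured features, e.g. attribute names
    cont: set = field(default_factory=set)   # unstructured features, e.g. content keywords

# the paper-class snapshot entity from the example above
paper = SnapshotEntity(attr={"title", "author", "size"},
                       cont={"data space", "entity", "classification"})
```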
Embodiment 8:
Different from Embodiment 7, in the entity classification method for data spaces of this embodiment,
the snapshot graph is defined as follows: at time step t, the snapshot graph can be formalized as $G^t = (V^t, E^t, W^t)$, where in the graph $G^t$ each vertex $v_i^t \in V^t$ represents a snapshot entity, each edge $e_{ij}^t \in E^t$ indicates that snapshot entities $o_i^t$ and $o_j^t$ are similar, and $w_{ij}^t \in W^t$ represents the weight of edge $e_{ij}^t$, i.e. the similarity score between snapshot entities $o_i^t$ and $o_j^t$ at time step t. A schematic diagram of the snapshot graphs at adjacent time steps is shown in Fig. 3.
Experiments and analysis of results:
Experimental setup:
The experiments use the DBLP data released in March 2015 as the basic data; the download address is http://dblp.uni-trier.de/. The entity classes extracted include paper, doctoral thesis, author, conference, journal, and university institution. The following points should be noted: (1) paper entities come from inproceedings records or Article records whose key has the prefix "journals", doctoral thesis entities come from phdthesis records, author entities come from WWW records or author tags, conference entities come from the booktitle tag of inproceedings records whose key has the prefix "conf", journal entities come from journal tags or the booktitle tag of inproceedings records whose key has the prefix "journals", and university institutions come from school tags; (2) only entities produced between 2005 and 2014 are selected, one time step is one year, and the size of the extracted entities is about 3M; (3) to simulate the evolutionary nature of data space entities, 20% of the entities of each time step are randomly chosen and some of their attribute or content information is randomly removed; (4) to simulate the pay-as-you-go characteristic of a data space, none of the collected entities are given class labels, i.e. there is no previously known category information (ground truth); (5) to test the scalability of the method, the numbers of entities of the different classes are repeatedly reduced in equal proportion, generating DBLP data sets of sizes 2.5M, 2M, 1.5M and 1M.
The experimental environment is as follows: the PC host uses an Intel(R) Core(TM) i5-4570 CPU at 3.20 GHz, with 4 GB of memory and a 1 TB hard disk, running Windows 7 (64 bit); all algorithms in the experiments are implemented in Java. Unless otherwise stated, in all experiments the default value of the parameter k in the evolutionary K-Means algorithm is 6 and the data set size is 3M.
Evaluation of effectiveness and scalability
(1) Choice of parameters
Three groups of experiments test the effect of different parameter values on the clustering effect, in order to determine the optimal values of the parameters λ, β and α respectively.
Experiment 1 assesses the effect of the choice of the weight λ of the entity similarity function on the clustering effect. Since this experiment focuses only on the effect of changes in λ, the parameters are set to α = 1 and β = 1, the data sets of all time steps are aggregated, the experiment is rerun 50 times on this basis, and the average snapshot cost corresponding to each λ is recorded. In Fig. 4, the abscissa is the value of the weight λ and the ordinate is the snapshot cost (see formula (7)). Fig. 4 shows that as λ increases the snapshot cost decreases; when λ reaches 0.6 the snapshot cost is minimal (0.5), i.e. the clustering effect is optimal; after that, the snapshot cost increases again. This shows that, for the entity self-similarity measure, the attribute feature information of the entity plays a more important role than the content information, mainly because for entities of the same class the probability that their attribute features are similar is larger than the probability that their contents are similar. The snapshot cost under different λ is shown in Fig. 4.
Experiment 2 assesses the effect of the choice of the weight β of the entity similarity function on the clustering effect. Since this experiment focuses only on the effect of changes in β, and Experiment 1 shows the best effect at λ = 0.6, the parameters are set to α = 1 and λ = 0.6; the data sets of all time steps are then aggregated, the experiment is run 50 times on this basis, and the average snapshot cost corresponding to each β is recorded. In Fig. 5, the abscissa is the value of the weight β and the ordinate is the snapshot cost (see formula (7)). Fig. 5 shows that as β increases the overall trend of the snapshot cost is to decrease; when β reaches 0.75 the snapshot cost is minimal (0.36), i.e. the clustering effect is optimal; after that, the snapshot cost increases again. This shows that the entity similarity measure should consider not only the self-information of the entity but also the historical occurrence pattern information of the entity, and that the self-information of the entity affects the quality of the entity clustering (classification) effect more than the historical occurrence pattern information does.
Experiment 3 assesses the effect of the choice of the weight α of the objective cost function on the clustering effect. Based on the conclusions of the first two experiments, this experiment sets β = 0.75 and λ = 0.6, then runs the experiment 50 times on the DBLP data sets of consecutive time steps (1-10) and records the corresponding averages for all time steps. In Fig. 6, the abscissa is the time step and the ordinate is the snapshot cost (see formula (7)). Fig. 6 shows that as time evolves (i.e. the time step increases) the snapshot cost first decreases sharply and then becomes almost stable, mainly because during evolution the self-information and the historical pattern information of the entities gradually become richer until they converge, so that the clustering effect improves until it is relatively steady. It can also be observed that as α increases the snapshot cost decreases, mainly because the larger α is, the more the evolutionary clustering algorithm emphasizes the quality of the current clustering result, so the snapshot cost is smaller. In Fig. 7, the abscissa is the time step and the ordinate is the history cost (see formula (9)); Fig. 7 shows that as α decreases the history cost decreases, mainly because the smaller α is, the more the evolutionary clustering algorithm emphasizes the smoothness of the historical clustering results. In Fig. 8, the abscissa is the value of the weight α and the ordinate is the total cost (see formula (10)); the figure shows that when α is 0.9 the total cost is minimal, i.e. the clustering effect is best. This shows that the proposed evolutionary K-Means clustering algorithm achieves a good trade-off between the snapshot cost and the history cost. The history cost under different α is shown in Figs. 6 to 8.
(2) Comparison of the effectiveness of different methods
Experiment 4 compares this method (Similarity Density-Based Evolutionary K-Means Clustering, SD-EKM) with other baseline methods in terms of clustering effect. The following baseline methods are designed for this experiment: (1) a naive method (Evolutionary K-Means Clustering, N-EKM), i.e. evolutionary K-Means with random initial point selection; it is another version of the present method SD-EKM, differing only in the way the initial points are selected. (2) The classical PCM-EKM method, an evolutionary K-Means clustering method based on preserving cluster membership proposed by Yun Chi et al.; following Yun Chi's experimental conclusions, the parameter α is also set to 0.9, but a small change is made in this experiment, i.e. the original similarity measure is replaced by the entity similarity measurement method proposed here. (3) The IND method, in which the data of each time step are clustered independently by the K-Means algorithm without considering historical time steps. Since this data set has no previously known entity classification information (i.e. no theoretical entity classification result) and the cost functions of these methods are not uniform, the evaluation metric of this experiment is the mutual information measure (interested readers may refer to the mutual information between two partitions defined by Xu et al.[136]); roughly speaking, the higher the mutual information between two partitions, the more similar they are likely to be. All experiments are rerun 50 times on the DBLP data sets of consecutive time steps (1-10), and the mutual information values corresponding to all time steps are recorded. In Fig. 9, the abscissa is the time step and the ordinate is the mutual information. Fig. 9 shows that: (1) the mutual information of the SD-EKM, N-EKM and PCM-EKM methods is significantly better than that of the IND method and remains relatively stable as time evolves, mainly because the former three methods use the idea of evolutionary clustering and consider the entity information of the historical time steps; (2) the mutual information of the SD-EKM and N-EKM methods is better than that of the PCM-EKM method, and their stability over consecutive time steps is also relatively better, mainly because the PCM-EKM method considers only the historical information of the previous time step, whereas the methods of the present invention, SD-EKM and N-EKM, consider the historical information of all historical time steps; (3) the SD-EKM method is better than N-EKM, mainly because the present method SD-EKM uses the similarity density criterion and largely avoids selecting noise data (such as snapshot entities in low-density regions) as initial centre points, so that the clustering result is better. The comparison of the effectiveness of the different methods is shown in Fig. 9.
(3) Scalability
Experiment 5 tests the scalability of the present method SD-EKM with respect to execution time. This experiment reruns SD-EKM 50 times on data sets of different sizes, recording the average running time of each iteration and the average number of iterations executed when the method converges. In Fig. 10, the abscissa is the data set size and the ordinate is the average running time per iteration; in Fig. 11, the abscissa is the data set size and the ordinate is the average number of iterations at convergence of the SD-EKM algorithm. Figs. 10 and 11 show that: (1) the average running time per iteration of SD-EKM is almost linear in the data set size; (2) the number of iterations required for convergence is almost insensitive to the data set size, at about 650 iterations. These two groups of experiments show that the average running time of SD-EKM is linear in the data set size, so the method has good scalability. The average running time per iteration is shown in Fig. 10, and the average number of iterations at convergence is shown in Fig. 11.
The present invention may also have various other embodiments; without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, but these corresponding changes and modifications shall all fall within the protection scope of the appended claims of the present invention.

Claims (8)

1. An entity classification method for data spaces, characterized in that the method is realized by the following steps:
Step 1: for evolving data space entities, propose an evolutionary K-Means clustering framework, i.e. define an objective cost function based on silhouette value and KL divergence;
Step 2: design a data space entity similarity measurement method;
Step 3: propose an evolutionary K-Means clustering algorithm, and solve the initial-point selection problem and the classification problem of evolving data space entities;
Step 4: extend the evolutionary K-Means clustering framework of Step 1 to the cases where the number of clusters changes over time or snapshot entities are added or removed over time.
The entity classification method in data-oriented space the most according to claim 1, it is characterised in that propose described in step one The K-Means developed clusters framework, and i.e. defining process based on profile value and the objective cost function of KL-divergence is,
Step one by one, use the mode of linear combination to define total objective cost function:
Cost function is made up of two parts: the snapshot cost of current time step and the history cost of historical time step-length, respectively It is designated as CostsnapshotAnd Costtemporal;The mode using linear combination defines total objective cost function, is used for assessing evolution The K-Means clustering result quality of entity, total objective cost function comprises snapshot cost and the historical time step of current time step Long history cost is two-part, formula specific as follows:
Cost t o t a l t = &alpha; &CenterDot; Cost s n a p s h o t t + ( 1 - &alpha; ) &CenterDot; &Sigma; h = 1 t - 1 e t - h Cost t e m p o r a l h - - - ( 1 )
In formula, 0≤α≤1, represent the weight factor of snapshot cost;Represent the snapshot cost of current time step t;Represent the history cost of historical time step-length h;Factor et-hShow from current time step t more close to, its history generation ValencyShared weight is the heaviest, and its departure degree is the least, i.e. from current time step t more close to, historical time step-length h The time smoothing of the structure that clusters is the best;
Step one two, the tolerance of snapshot cost based on profile value of carrying out:
If the snapshot plotting of current time step t is Gt=(Vt,Et,Wt), wherein | Vt|=n, WtSimilarity for snapshot inter-entity Matrix;The entity division obtained based on this snapshot plotting isWhereinAndUsing profile value criterion tolerance K-Means to cluster the quality of result, wherein, profile value is also referred to as silhouette coefficient, Be a kind of only with reference to data itself the Cluster Evaluation method without reference to golden standard;In this Cluster Evaluation method each bunch with one Individual profile represents, the object that is positioned at bunch by profile reflection and away from bunch object, in utilizing the reflection of this Cluster Evaluation method Poly-degree and two kinds of influence factors of separating degree, and the profile value effect that clusters the most greatly is the best;Snapshot cost is defined as:
$$\mathrm{Cost}_{snapshot}^{t} = 1 - \frac{1}{k}\sum_{p=1}^{k} sil(V_p^t) \qquad (2)$$
In the formula, $k$ is the number of clusters at the current time step $t$, $V_p^t$ denotes the $p$-th cluster, and $sil(V_p^t)$ denotes the mean silhouette value of cluster $V_p^t$; the mean silhouette value is inversely related to the snapshot cost;
Step 1.3: since each cluster $V_p^t$ contains a group of snapshot entities, the mean silhouette value $sil(V_p^t)$ of each cluster $V_p^t$ is defined as the average of the silhouette values of all snapshot entities in the cluster, specifically:
$$sil(V_p^t) = \frac{1}{|V_p^t|}\sum_{o_i^t \in V_p^t} sil(o_i^t) \qquad (3)$$
In the formula, $o_i^t$ denotes a snapshot entity in cluster $V_p^t$, $|V_p^t|$ denotes the number of snapshot entities in cluster $V_p^t$, and $sil(o_i^t)$ denotes the silhouette value of snapshot entity $o_i^t$, whose measurement formula is:
$$sil(o_i^t) = \frac{b(o_i^t)-a(o_i^t)}{\max\{a(o_i^t),\,b(o_i^t)\}} \qquad (4)$$
Here $b(o_i^t)$ denotes the average similarity between snapshot entity $o_i^t$ and the other snapshot entities $o_{i'}^t$ of the cluster $V_p^t$ it belongs to, and $a(o_i^t)$ denotes the maximum average similarity between snapshot entity $o_i^t$ and all snapshot entities $o_j^t$ of the other clusters $V_q^t$; the larger the value of $sil(o_i^t)$, the more the average similarity of snapshot entity $o_i^t$ to the snapshot entities within its cluster exceeds its average similarity to the snapshot entities of other clusters;
Step 1.4: based on the physical meaning of the silhouette-value measurement formula (4) of snapshot entity $o_i^t$ described in Step 1.3, define $b(o_i^t)$ as:
$$b(o_i^t) = \frac{1}{|V_p^t|-1}\sum_{o_{i'}^t \in V_p^t,\; o_i^t \neq o_{i'}^t} w_{ii'}^t \qquad (5)$$
and define $a(o_i^t)$ as:
$$a(o_i^t) = \max_{V_q^t \neq V_p^t}\left\{\frac{1}{|V_q^t|}\sum_{o_j^t \in V_q^t} w_{ij}^t\right\} \qquad (6)$$
In the formulas, $o_{i'}^t$ is a snapshot entity in cluster $V_p^t$, $o_j^t$ is a snapshot entity in cluster $V_q^t$, $w_{ii'}^t$ is the similarity between snapshot entities $o_i^t$ and $o_{i'}^t$ of the same cluster, and $w_{ij}^t$ is the similarity between snapshot entities $o_i^t$ and $o_j^t$ of different clusters;
Step 1.5: substituting formulas (3) to (6) into formula (2), the snapshot cost is rewritten as:
$$\begin{aligned}
\mathrm{Cost}_{snapshot}^{t} &= 1 - \frac{1}{k}\sum_{p=1}^{k} sil(V_p^t)
= 1 - \frac{1}{k}\sum_{p=1}^{k}\sum_{o_i^t \in V_p^t}\frac{1}{|V_p^t|}\cdot\frac{b(o_i^t)-a(o_i^t)}{\max\{a(o_i^t),\,b(o_i^t)\}} \\
&= 1 - \frac{1}{k}\sum_{p=1}^{k}\sum_{o_i^t \in V_p^t}\frac{1}{|V_p^t|}\cdot
\frac{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t \in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'} \;-\; \max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}}
{\max\bigl\{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'},\;\max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}\bigr\}}
\end{aligned} \qquad (7)$$
Step 1.6: measure the history cost based on the KL-divergence:
First, let the snapshot graphs be $G^1,G^2,\dots,G^h,\dots,G^t$;
under the current time step $t$, the entity partition based on the corresponding snapshot graph $G^t$ is denoted $Z^t$;
under historical time step $h$, the entity partition based on the corresponding historical snapshot graph $G^h$ is denoted $Z^h$;
Second, define a measure for comparing two clusterings: a bipartite graph is used to represent the relation between entities and clusters, so that the entity partition $Z^t$ is converted into a joint probability distribution problem on the bipartite graph:
let $BG^t=(V^t,C^t,F^t,P^t)$ be the bipartite graph corresponding to the snapshot graph $G^t=(V^t,E^t,W^t)$, where $V^t$ is the set of snapshot entities, $C^t$ is the set of clusters, $F^t$ is the set of edges whose two endpoints come from the set $V^t$ and the set $C^t$ respectively, and $P^t$ is the $n\times k$ joint probability matrix corresponding to the edge-weight matrix of the bipartite graph; the joint probability formula, i.e. $p_{ij}^t = p(V_j^t)\cdot p(o_i^t \mid V_j^t)$, determines the joint probability $p_{ij}^t$ between entity $o_i^t$ and cluster $V_j^t$, where $p(V_j^t)$ is the probability that cluster $V_j^t$ occurs and $p(o_i^t \mid V_j^t)$ is the probability that entity $o_i^t$ occurs under the condition that cluster $V_j^t$ occurs; if entity $o_i^t$ belongs to cluster $V_j^t$, then $p_{ij}^t=\frac{n_j}{n}\cdot\frac{1}{n_j}=\frac{1}{n}$, where $n_j$ and $n$ are respectively the number of snapshot entities in cluster $V_j^t$ and the number of all snapshot entities; otherwise $p_{ij}^t=0$; therefore, in the joint probability matrix $P^t$, for any row $i$ there is exactly one column $j$ for which $p_{ij}$ is non-zero, so that $\sum_{i=1}^{n}\sum_{j=1}^{k}p_{ij}^t=1$;
Third, the KL-divergence is used for the measurement:
given the bipartite graph $BG^t=(V^t,C^t,F^t,P^t)$ of the current time step $t$ and the bipartite graph $BG^h=(V^h,C^h,F^h,P^h)$ of historical time step $h$, together with the entity partition $Z^t$ of the current time step $t$ and the entity partition $Z^h$ of historical time step $h$, where $BG^t$ corresponds to $Z^t$ and $BG^h$ corresponds to $Z^h$, the history cost between the two time steps $h$ and $t$ is defined as follows:
$$\mathrm{Cost}_{temporal}^{h} = KL\bigl(P^t \,\|\, P^h\bigr) = \sum_{i=1}^{n}\sum_{j=1}^{k} p_{ij}^t \times \log\bigl(p_{ij}^t / p_{ij}^h\bigr) \qquad (8)$$
where $n$ is the number of snapshot entities, $k$ is the number of clusters, $p_{ij}^t$ is an element of the joint probability matrix $P^t$ between snapshot entities and clusters at time step $t$, and $p_{ij}^h$ is an element of the joint probability matrix $P^h$ between snapshot entities and clusters at historical time step $h$;
Fourth, the joint probability matrix $P^t$ or $P^h$ is smoothed as follows: each element $p_{ij}^t$ or $p_{ij}^h$ of $P^t$ or $P^h$ is increased by a constant $\varepsilon$, with $\varepsilon=e^{-12}$; the processed elements are then re-normalized and denoted $\hat{p}_{ij}^{\,t}$ or $\hat{p}_{ij}^{\,h}$, and the smoothed probability matrices are denoted $\hat{P}^t$ and $\hat{P}^h$ respectively; formula (8) then becomes:
$$\mathrm{Cost}_{temporal}^{h} = KL\bigl(\hat{P}^t \,\|\, \hat{P}^h\bigr) = \sum_{i=1}^{n}\sum_{j=1}^{k} \hat{p}_{ij}^{\,t} \times \log\bigl(\hat{p}_{ij}^{\,t} / \hat{p}_{ij}^{\,h}\bigr) \qquad (9)$$
where $n$ is the number of snapshot entities, $k$ is the number of clusters, $\hat{p}_{ij}^{\,t}$ is an element of the smoothed joint probability matrix $\hat{P}^t$ between entities and clusters at time step $t$, and $\hat{p}_{ij}^{\,h}$ is an element of the smoothed joint probability matrix $\hat{P}^h$ between entities and clusters at historical time step $h$;
Fifth, substituting formula (7) and formula (9) into formula (1), the total objective cost function is equivalent to:
$$\begin{aligned}
\mathrm{Cost}_{total}^{t} &= \alpha \cdot \mathrm{Cost}_{snapshot}^{t} + (1-\alpha)\cdot\sum_{h=1}^{t-1} e^{t-h}\,\mathrm{Cost}_{temporal}^{h} \\
&= \alpha\left(1 - \frac{1}{k}\sum_{p=1}^{k}\sum_{o_i^t\in V_p^t}\frac{1}{|V_p^t|}\cdot
\frac{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'} - \max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}}
{\max\bigl\{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'},\;\max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}\bigr\}}\right) \\
&\quad + (1-\alpha)\sum_{h=1}^{t-1}\sum_{i=1}^{n}\sum_{j=1}^{k} e^{t-h}\times\hat{p}_{ij}^{\,t}\times\log\bigl(\hat{p}_{ij}^{\,t}\big/\hat{p}_{ij}^{\,h}\bigr)
\end{aligned} \qquad (10)$$
where $0\le\alpha\le 1$ is the weight factor of the snapshot cost, $k$ is the number of clusters at the current time step $t$, $V_p^t$ is the $p$-th element of the entity partition $Z^t$, $w_{ii'}$ and $w_{ij}$ are elements of the similarity matrix $W^t$ of the snapshot graph $G^t=(V^t,E^t,W^t)$, $o_i^t$, $o_{i'}^t$ and $o_j^t$ are snapshot entities in $G^t$, $n$ is the number of snapshot entities in $V^t$ of the bipartite graph $BG^t=(V^t,C^t,F^t,P^t)$, $\hat{p}_{ij}^{\,t}$ is an element of the smoothed joint probability matrix $\hat{P}^t$, and $\hat{p}_{ij}^{\,h}$ is an element of the smoothed joint probability matrix $\hat{P}^h$.
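Purely as an illustration of claim 2, and not as part of the claims themselves, the following is a minimal Python/NumPy sketch of the objective cost function of formulas (2), (9) and (10). The function names, the representation of the partition as a list of index lists, and the default value of α are assumptions of this sketch rather than elements of the patent; the weight $e^{t-h}$ is used literally as written in formula (1).

```python
import numpy as np

def snapshot_cost(W, clusters):
    """Silhouette-based snapshot cost of formulas (2)-(7).

    W        -- n x n similarity matrix of the current snapshot graph
    clusters -- list of index lists, one per cluster (the partition Z^t)
    """
    k = len(clusters)
    mean_sil = 0.0
    for p, Vp in enumerate(clusters):
        sil_sum = 0.0
        for i in Vp:
            others = [j for j in Vp if j != i]
            # b: average similarity to the other members of i's own cluster (formula (5))
            b = W[i, others].mean() if others else 0.0
            # a: largest average similarity to the members of any other cluster (formula (6))
            a = max(W[i, list(Vq)].mean() for q, Vq in enumerate(clusters) if q != p)
            sil_sum += (b - a) / max(a, b)               # formula (4)
        mean_sil += sil_sum / len(Vp)                    # formula (3)
    return 1.0 - mean_sil / k                            # formula (2)

def temporal_cost(P_t, P_h, eps=np.exp(-12)):
    """KL-divergence history cost with the smoothing of formula (9)."""
    Pt = (P_t + eps) / (P_t + eps).sum()
    Ph = (P_h + eps) / (P_h + eps).sum()
    return float(np.sum(Pt * np.log(Pt / Ph)))

def total_cost(W, clusters, P_t, history, t, alpha=0.5):
    """Formula (10); history is a dict {h: P^h} for h = 1, ..., t-1."""
    temporal = sum(np.exp(t - h) * temporal_cost(P_t, P_h)  # weight e^(t-h) as in formula (1)
                   for h, P_h in history.items())
    return alpha * snapshot_cost(W, clusters) + (1.0 - alpha) * temporal
```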
3. The entity classification method for data space according to claim 1 or claim 2, characterized in that the process in Step 2 of designing the data space entity similarity measurement method is as follows:
the similarity of data space entities, i.e. snapshot entities, is measured according to both the entity's own information and the entity's historical occurrence-pattern information; that is, the similarity function of snapshot entities consists of two parts, a self-similarity and a historical similarity, and its expression is defined as:
$$w_{ij}^t = \mathrm{Sim}^t(o_i^t,o_j^t) = \beta\cdot\mathrm{Sim}_{self}^t(o_i^t,o_j^t) + (1-\beta)\cdot\mathrm{Sim}_{his}^t(o_i^t,o_j^t) \qquad (11)$$
where $0\le\beta\le 1$ is the weight of the self-similarity, $o_i^t$ and $o_j^t$ are snapshot entities at the current time step $t$, $\mathrm{Sim}_{self}^t(o_i^t,o_j^t)$ is the self-similarity between snapshot entities $o_i^t$ and $o_j^t$, and $\mathrm{Sim}_{his}^t(o_i^t,o_j^t)$ is the historical similarity between snapshot entities $o_i^t$ and $o_j^t$;
Based on the structured feature information corresponding to the attribute features of snapshot entities and the unstructured feature information corresponding to their content features, the self-similarity between snapshot entities is defined as follows:
$$\mathrm{Sim}_{self}^t(o_i^t,o_j^t) = \lambda\cdot sim_{attr}^t(o_i^t,o_j^t) + (1-\lambda)\cdot sim_{cont}^t(o_i^t,o_j^t) = \lambda\cdot\frac{|o_i^t.Attr \cap o_j^t.Attr|}{|o_i^t.Attr \cup o_j^t.Attr|} + (1-\lambda)\cdot\frac{|o_i^t.Cont \cap o_j^t.Cont|}{|o_i^t.Cont \cup o_j^t.Cont|} \qquad (12)$$
where $0\le\lambda\le 1$ is the weight of the attribute-feature similarity, $sim_{attr}^t(o_i^t,o_j^t)$ and $sim_{cont}^t(o_i^t,o_j^t)$ are respectively the attribute-feature similarity and the content-feature similarity of the snapshot entities, $o_i^t.Attr$ and $o_j^t.Attr$ are the attribute features of the snapshot entities, and $o_i^t.Cont$ and $o_j^t.Cont$ are their content features;
The historical similarity is measured with the classical Pearson correlation coefficient, specifically:
$$sim_{his}^t(o_i^t,o_j^t) = \rho(o_i^t,o_j^t) = \frac{\sum_{h=1}^{t-1}\bigl(n_i^h-\mu_i^t\bigr)\bigl(n_j^h-\mu_j^t\bigr)}{\sqrt{\sum_{h=1}^{t-1}\bigl(n_i^h-\mu_i^t\bigr)^2}\,\sqrt{\sum_{h=1}^{t-1}\bigl(n_j^h-\mu_j^t\bigr)^2}} \qquad (13)$$
where $o_i^t$ and $o_j^t$ are snapshot entities at the current time step $t$, $n_i^h$ and $n_j^h$ are respectively the numbers of times that snapshot entities $o_i^t$ and $o_j^t$ occur at historical time step $h$, and $\mu_i^t$ and $\mu_j^t$ are respectively the average occurrence counts of snapshot entities $o_i^t$ and $o_j^t$ over all historical time steps;
Substituting formula (12) and formula (13) into formula (11), the similarity function of snapshot entities is rewritten as:
$$\begin{aligned}
w_{ij}^t &= \beta\cdot\mathrm{Sim}_{self}^t(o_i^t,o_j^t) + (1-\beta)\cdot\mathrm{Sim}_{his}^t(o_i^t,o_j^t) \\
&= \beta\cdot\left(\lambda\cdot\frac{|o_i^t.Attr\cap o_j^t.Attr|}{|o_i^t.Attr\cup o_j^t.Attr|} + (1-\lambda)\cdot\frac{|o_i^t.Cont\cap o_j^t.Cont|}{|o_i^t.Cont\cup o_j^t.Cont|}\right)
+ (1-\beta)\cdot\frac{\sum_{h=1}^{t-1}\bigl(n_i^h-\mu_i^t\bigr)\bigl(n_j^h-\mu_j^t\bigr)}{\sqrt{\sum_{h=1}^{t-1}\bigl(n_i^h-\mu_i^t\bigr)^2}\,\sqrt{\sum_{h=1}^{t-1}\bigl(n_j^h-\mu_j^t\bigr)^2}}
\end{aligned} \qquad (14)$$
where $o_i^t$ and $o_j^t$ are snapshot entities at the current time step $t$, $n_i^h$ and $n_j^h$ are respectively the numbers of times that snapshot entities $o_i^t$ and $o_j^t$ occur at historical time step $h$, $\mu_i^t$ and $\mu_j^t$ are respectively their average occurrence counts over all historical time steps, $o_i^t.Attr$ and $o_j^t.Attr$ are respectively the attribute features of snapshot entities $o_i^t$ and $o_j^t$, $o_i^t.Cont$ and $o_j^t.Cont$ are their content features, $0\le\beta\le 1$ is the weight of the self-similarity $\mathrm{Sim}_{self}^t(o_i^t,o_j^t)$, and $0\le\lambda\le 1$ is the weight of the attribute-feature similarity $sim_{attr}^t(o_i^t,o_j^t)$.
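As a non-limiting illustration of the similarity measure of claim 3, the following Python sketch implements formulas (11)-(14) under the assumption that each snapshot entity is given as a pair of Python sets (Attr, Cont), as in claims 6-8, and that the occurrence history of an entity is a list of per-step counts; the names jaccard, entity_similarity, counts_i and counts_j are illustrative only.

```python
import numpy as np

def jaccard(A, B):
    """|A ∩ B| / |A ∪ B| -- the set overlap used in formula (12)."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def self_similarity(ei, ej, lam=0.5):
    """Formula (12): weighted overlap of structured (Attr) and unstructured (Cont) features."""
    (attr_i, cont_i), (attr_j, cont_j) = ei, ej
    return lam * jaccard(attr_i, attr_j) + (1.0 - lam) * jaccard(cont_i, cont_j)

def history_similarity(counts_i, counts_j):
    """Formula (13): Pearson correlation of the per-step occurrence counts."""
    x = np.asarray(counts_i, dtype=float)
    y = np.asarray(counts_j, dtype=float)
    x, y = x - x.mean(), y - y.mean()
    denom = np.sqrt((x ** 2).sum()) * np.sqrt((y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def entity_similarity(ei, ej, counts_i, counts_j, beta=0.5, lam=0.5):
    """Formula (11)/(14): combination of self-similarity and historical similarity."""
    return beta * self_similarity(ei, ej, lam) + (1.0 - beta) * history_similarity(counts_i, counts_j)

# Two paper entities in the style of claim 7, with occurrence counts over three past steps:
paper_i = ({"title", "author", "size"}, {"data space", "entity", "classification"})
paper_j = ({"title", "author", "year"}, {"data space", "clustering"})
print(entity_similarity(paper_i, paper_j, [2, 0, 3], [1, 1, 2]))
```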
4. The entity classification method for data space according to claim 3, characterized in that the process in Step 3 of proposing the evolutionary K-Means clustering algorithm and solving the initial-point selection problem and the classification problem of evolving data space entities is as follows:
First, the following related definitions are given:
Definition of the η-neighbours at time $t$: given a snapshot graph $G^t=(V^t,E^t,W^t)$ and a parameter $0<\eta\le 1$, for any snapshot entity $o_i^t$, the η-neighbours at time $t$ are formally defined as:
$$N(o_i^t,\eta) = \bigl\{\,o_j^t \mid \eta \le w_{ij}^t \le 1,\; 1\le j\le |V^t|\,\bigr\},$$
where $|V^t|$ is the number of vertices in the snapshot graph $G^t$ and $w_{ij}^t$ is an element of $W^t$;
Definition of the similarity density at time $t$: given a snapshot graph $G^t=(V^t,E^t,W^t)$ and the η-neighbours $N(o_i^t,\eta)$ at time $t$, for any snapshot entity $o_i^t$, the similarity density at time $t$ is formally defined as:
$$\mathrm{Density}_{sim}(o_i^t) = \bigl|N(o_i^t,\eta)\bigr| \times \log\Bigl(1 + \frac{1}{|N(o_i^t,\eta)|-1}\sum_{o_j^t\in N(o_i^t,\eta),\; o_j^t\neq o_i^t} w_{ij}\Bigr);$$
Second, determine the selection principle of the first initial center point: the first initial center point is the snapshot entity with the maximum similarity density;
determine the selection principle of the initial center points other than the first one: exclude the snapshot entities that are η-neighbours of the already selected initial center points; the average similarity to all selected initial center points should be small; the similarity density of the candidate center point should be high; this principle is formalized as the following equation:
$$c_j = \arg\max_{o_i^t\in V^t,\; o_i^t\notin \bigcup_{l=1}^{j-1} N(o_{s_l}^t,\eta)}\left\{\mathrm{Density}_{sim}(o_i^t)\Big/\Bigl(1+\frac{1}{j-1}\sum_{l=1}^{j-1} w_{s_l i}^t\Bigr)\right\} \qquad (15)$$
where $1\le l\le j-1$ is the index of an already selected initial center point, $\bigcup_{l=1}^{j-1} N(o_{s_l}^t,\eta)$ is the union of the η-neighbours of all selected initial center points, $w_{s_l i}^t$ is the similarity between snapshot entity $o_i^t$ and the selected initial center point $o_{s_l}^t$, and $\mathrm{Density}_{sim}(o_i^t)$ is the similarity density of snapshot entity $o_i^t$ at time $t$; the constant 1 is added to prevent the denominator from being zero;
Third, the basic idea of executing the evolutionary K-Means clustering algorithm is as follows: over all time steps up to the current time step, the K-Means clustering algorithm is executed in a loop; at each time step, the execution of the K-Means clustering algorithm first selects the initial center points based on the similarity density and formula (15), and then iteratively performs the following operations:
1) assign each snapshot entity to the cluster whose center point is most similar to it,
2) update the cluster center points, until the convergence condition that the objective cost of formula (10) is minimal is reached;
The detailed procedure of the evolutionary K-Means clustering algorithm is as follows:
Input: the snapshot entity sets of a series of time steps $O=\{O^1,O^2,\dots,O^h,\dots,O^t\}$, and the corresponding set of cluster numbers for the different time steps $K=\{k_1,k_2,\dots,k_h,\dots,k_t\}$;
Output: the clustering result set of all time steps $C=\{C^1,C^2,\dots,C^h,\dots,C^t\}$, where $h$ denotes a time step, $h=1,2,\dots,t$;
(1) for each time step $h$, execute the following loop:
(2) use formula (14) to compute the similarity matrix $W^h$ of the snapshot entity set $O^h$ at the current time step $h$, and build the corresponding snapshot graph $G^h=(V^h,E^h,W^h)$;
(3) initialize the set of cluster center points to the empty set;
(4) select the initial center points: first select the snapshot entity with the highest similarity density as the first initial center point $c_1^h$; then select the remaining initial center points $c_j^h$ according to formula (15), with $j$ taken in ascending order up to $k_h$ and the superscript $h$ denoting the time step;
(5) execute the following loop: assign each snapshot entity of the snapshot entity set $O^h$ to the cluster whose center is most similar to it; update the center point of each cluster and record the clustering result $C^h$; repeat until the convergence condition that the objective cost function of formula (10) is minimal is met;
accumulate and update the clustering results of the different time steps;
and return the clustering result set $C$ of all time steps.
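The loop of claim 4 can be sketched in Python as follows; this is an illustrative simplification rather than the claimed algorithm itself: the convergence test against the cost of formula (10) is replaced by a fixed iteration budget, the center update is done medoid-style because only a similarity matrix is available here, and all function names and the default η are assumptions. evolved_kmeans_step processes a single time step; the outer loop over all time steps and the accumulation of clustering results are left to the caller.

```python
import numpy as np

def neighbours(W, i, eta):
    """Indices of the η-neighbours of entity i (similarity within [eta, 1])."""
    return np.where((W[i] >= eta) & (W[i] <= 1.0))[0]

def similarity_density(W, i, eta):
    """Similarity density of entity i as defined in claim 4."""
    N = neighbours(W, i, eta)
    others = N[N != i]
    if len(others) == 0:
        return 0.0
    return len(N) * np.log(1.0 + W[i, others].mean())

def select_centers(W, k, eta):
    """Density-biased selection of the k initial center points (formula (15))."""
    n = W.shape[0]
    dens = np.array([similarity_density(W, i, eta) for i in range(n)])
    centers = [int(np.argmax(dens))]                      # densest entity goes first
    excluded = set(neighbours(W, centers[0], eta).tolist())
    while len(centers) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in excluded or i in centers:
                continue
            score = dens[i] / (1.0 + W[i, centers].mean())
            if score > best_score:
                best, best_score = i, score
        if best is None:                                  # every candidate excluded: fall back
            best = next(i for i in range(n) if i not in centers)
        centers.append(best)
        excluded |= set(neighbours(W, best, eta).tolist())
    return centers

def evolved_kmeans_step(W, k, eta=0.3, max_iter=20):
    """One time step of the evolved K-Means loop of claim 4 on similarity matrix W."""
    centers = select_centers(W, k, eta)
    for _ in range(max_iter):
        # assign every entity to the cluster whose center is most similar to it
        labels = np.argmax(W[:, centers], axis=1)
        # update each center medoid-style: the member most similar to its cluster overall
        new_centers = []
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                new_centers.append(centers[c])
                continue
            within = W[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmax(within)]))
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```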
5. The entity classification method for data space according to claim 1, 2 or 4, characterized in that the process in Step 4 of extending the evolutionary K-Means clustering framework of Step 1 when the number of clusters changes over time or snapshot entities are added or removed over time is as follows:
First, when the number of clusters changes over time:
when the cluster number $k_h$ of historical time step $h$ is smaller than the cluster number $k_t$ of the current time step $t$, it suffices to add the corresponding columns to the joint probability matrix $P^h$, extending it to $\vec{P}^h$; after the extension, $\vec{P}^h$ and $P^t$ are both $n\times k_t$ joint probability matrices, and formula (10) is therefore revised as:
$$\begin{aligned}
\mathrm{Cost}_{total}^{t} &= \alpha \cdot \mathrm{Cost}_{snapshot}^{t} + (1-\alpha)\cdot\sum_{h=1}^{t-1} e^{t-h}\,\mathrm{Cost}_{temporal}^{h} \\
&= \alpha\left(1 - \frac{1}{k_t}\sum_{p=1}^{k_t}\sum_{o_i^t\in V_p^t}\frac{1}{|V_p^t|}\cdot
\frac{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'} - \max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}}
{\max\bigl\{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'},\;\max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}\bigr\}}\right) \\
&\quad + (1-\alpha)\sum_{h=1}^{t-1}\sum_{i=1}^{n}\sum_{j=1}^{k_t} e^{t-h}\times\hat{p}_{ij}^{\,t}\times\log\bigl(\hat{p}_{ij}^{\,t}\big/\hat{\vec{p}}_{ij}^{\,h}\bigr)
\end{aligned} \qquad (16)$$
when the cluster number $k_h$ of historical time step $h$ is larger than the cluster number $k_t$ of the current time step $t$, the corresponding columns are added to the joint probability matrix $P^t$, extending it to $\overleftarrow{P}^t$; after the extension, $P^h$ and $\overleftarrow{P}^t$ are both $n\times k_h$ joint probability matrices, and formula (10) is therefore revised as:
$$\begin{aligned}
\mathrm{Cost}_{total}^{t} &= \alpha \cdot \mathrm{Cost}_{snapshot}^{t} + (1-\alpha)\cdot\sum_{h=1}^{t-1} e^{t-h}\,\mathrm{Cost}_{temporal}^{h} \\
&= \alpha\left(1 - \frac{1}{k_t}\sum_{p=1}^{k_t}\sum_{o_i^t\in V_p^t}\frac{1}{|V_p^t|}\cdot
\frac{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'} - \max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}}
{\max\bigl\{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'},\;\max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}\bigr\}}\right) \\
&\quad + (1-\alpha)\sum_{h=1}^{t-1}\sum_{i=1}^{n}\sum_{j=1}^{k_h} e^{t-h}\times\hat{\overleftarrow{p}}_{ij}^{\,t}\times\log\bigl(\hat{\overleftarrow{p}}_{ij}^{\,t}\big/\hat{p}_{ij}^{\,h}\bigr)
\end{aligned} \qquad (17)$$
Second, when snapshot entities are added or removed over time:
assume that at historical time step $h$, $P^h$ is an $n_h\times k$ joint probability matrix, at the current time step $t$, $P^t$ is an $n_t\times k$ joint probability matrix, and $n_0$ snapshot entities occur in both time steps $h$ and $t$; when snapshot entities of historical time step $h$ have been removed, then for time step $t$ the joint occurrence probability of those removed snapshot entities with the current clusters is 0, and the corresponding rows are added to $P^t$, yielding $\underline{P}^t$; when snapshot entities are newly added at the current time step $t$, then for historical time step $h$ the joint occurrence probability of those newly added snapshot entities with the historical clusters is 0, and the corresponding rows are added to $P^h$, yielding $\underline{P}^h$; after the extension, $\underline{P}^h$ and $\underline{P}^t$ are both $(n_h+n_t-n_0)\times k$ joint probability matrices, and formula (10) is therefore revised as:
$$\begin{aligned}
\mathrm{Cost}_{total}^{t} &= \alpha \cdot \mathrm{Cost}_{snapshot}^{t} + (1-\alpha)\cdot\sum_{h=1}^{t-1} e^{t-h}\,\mathrm{Cost}_{temporal}^{h} \\
&= \alpha\left(1 - \frac{1}{k}\sum_{p=1}^{k}\sum_{o_i^t\in V_p^t}\frac{1}{|V_p^t|}\cdot
\frac{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'} - \max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}}
{\max\bigl\{\frac{1}{|V_p^t|-1}\sum_{o_{i'}^t\in V_p^t,\,o_i^t\neq o_{i'}^t} w_{ii'},\;\max\limits_{q\neq p}\bigl\{\frac{1}{|V_q^t|}\sum_{o_j^t\in V_q^t} w_{ij}\bigr\}\bigr\}}\right) \\
&\quad + (1-\alpha)\sum_{h=1}^{t-1}\sum_{i=1}^{n_h+n_t-n_0}\sum_{j=1}^{k} e^{t-h}\times\hat{\underline{p}}_{ij}^{\,t}\times\log\bigl(\hat{\underline{p}}_{ij}^{\,t}\big/\hat{\underline{p}}_{ij}^{\,h}\bigr);
\end{aligned}$$
In the formula, the symbol $\hat{X}$ denotes the matrix obtained from a matrix $X$ by the smoothing method of formula (9), and $\hat{\underline{p}}_{ij}^{\,t}$ is an element of the matrix $\hat{\underline{P}}^t$.
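As an illustration of the extensions of claim 5, the sketch below pads joint probability matrices so that two time steps become comparable. Unlike the claim, which extends only the matrix of smaller dimension, this sketch pads both matrices to the common maximum number of columns and to the union of the two entity-id lists for brevity; all function names are illustrative.

```python
import numpy as np

def smooth(P, eps=np.exp(-12)):
    """Smoothing and re-normalisation of formula (9)."""
    P = P + eps
    return P / P.sum()

def pad_columns(P, k_target):
    """Append zero columns so that P has k_target columns (cluster count changed)."""
    n, k = P.shape
    return P if k >= k_target else np.hstack([P, np.zeros((n, k_target - k))])

def align_rows(P_t, ids_t, P_h, ids_h):
    """Embed P^t and P^h into matrices over the union of the two entity-id lists.

    Rows of entities absent at a step keep joint probability 0 with that step's clusters."""
    all_ids = list(dict.fromkeys(list(ids_h) + list(ids_t)))     # ordered union
    pos_t = {eid: r for r, eid in enumerate(ids_t)}
    pos_h = {eid: r for r, eid in enumerate(ids_h)}
    Pt_ext = np.zeros((len(all_ids), P_t.shape[1]))
    Ph_ext = np.zeros((len(all_ids), P_h.shape[1]))
    for r, eid in enumerate(all_ids):
        if eid in pos_t:
            Pt_ext[r] = P_t[pos_t[eid]]
        if eid in pos_h:
            Ph_ext[r] = P_h[pos_h[eid]]
    return Pt_ext, Ph_ext

def temporal_cost_extended(P_t, ids_t, P_h, ids_h):
    """History cost between steps with different cluster counts and entity sets."""
    k = max(P_t.shape[1], P_h.shape[1])
    Pt, Ph = pad_columns(P_t, k), pad_columns(P_h, k)
    Pt, Ph = align_rows(Pt, ids_t, Ph, ids_h)
    Pt, Ph = smooth(Pt), smooth(Ph)
    return float(np.sum(Pt * np.log(Pt / Ph)))
```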
6. The entity classification method for data space according to claim 5, characterized in that the snapshot entity is defined as follows: a snapshot entity at time $t$ is formally expressed as $o^t=(Attr, Cont)$, where $Attr$ denotes the structured feature information of snapshot entity $o^t$, with $Attr=\{a_1,a_2,\dots,a_n\}$, and $Cont$ denotes the unstructured feature information of snapshot entity $o^t$, with $Cont=\{keyword_1,keyword_2,\dots,keyword_m\}$; $n$ and $m$ denote respectively the numbers of elements of the sets $Attr$ and $Cont$; all snapshot entities at the current time step $t$ are denoted $O^t$.
7. The entity classification method for data space according to claim 1, 2, 4 or 6, characterized in that when the snapshot entity is a paper, then at time $t$ the paper snapshot entity $o^t$ comprises the attributes title, author and size, and also comprises the unstructured content information data space, entity and classification; thus $o^t=(\{\text{title, author, size}\},\ \{\text{data space, entity, classification}\})$.
8. The entity classification method for data space according to claim 7, characterized in that the snapshot graph is defined as follows: for a time step $t$, the snapshot graph at time $t$ can be formally represented as $G^t=(V^t,E^t,W^t)$, where in the graph $G^t$ each vertex $o_i^t$ represents a snapshot entity, each edge $(o_i^t,o_j^t)$ indicates that snapshot entities $o_i^t$ and $o_j^t$ have a similarity, and $w_{ij}^t$ represents the weight of the edge $(o_i^t,o_j^t)$, i.e. the similarity score between snapshot entities $o_i^t$ and $o_j^t$ at time step $t$.
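For illustration of the data structures of claims 6-8, the sketch below builds the similarity matrix $W^t$ of a snapshot graph from entities in the (Attr, Cont) form; for brevity only the self-similarity part of formula (12) is used here, whereas the full measure of claim 3 would additionally include the historical term of formula (13); all names are illustrative.

```python
import numpy as np

def jaccard(A, B):
    """Set overlap |A ∩ B| / |A ∪ B| used in formula (12)."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def build_snapshot_graph(entities, lam=0.5):
    """Return the n x n similarity matrix W^t of the snapshot graph G^t."""
    n = len(entities)
    W = np.eye(n)                                  # every entity is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            (attr_i, cont_i), (attr_j, cont_j) = entities[i], entities[j]
            W[i, j] = W[j, i] = (lam * jaccard(attr_i, attr_j)
                                 + (1.0 - lam) * jaccard(cont_i, cont_j))
    return W

# Snapshot entities in the (Attr, Cont) form of claims 6 and 7:
papers = [
    ({"title", "author", "size"}, {"data space", "entity", "classification"}),
    ({"title", "author", "year"}, {"data space", "clustering"}),
]
print(build_snapshot_graph(papers))
```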
CN201610348890.4A 2016-05-24 2016-05-24 The entity classification method in data-oriented space Expired - Fee Related CN106067029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610348890.4A CN106067029B (en) 2016-05-24 2016-05-24 The entity classification method in data-oriented space

Publications (2)

Publication Number Publication Date
CN106067029A true CN106067029A (en) 2016-11-02
CN106067029B CN106067029B (en) 2019-06-18

Family

ID=57420728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610348890.4A Expired - Fee Related CN106067029B (en) 2016-05-24 2016-05-24 The entity classification method in data-oriented space

Country Status (1)

Country Link
CN (1) CN106067029B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060200431A1 (en) * 2005-03-01 2006-09-07 Microsoft Corporation Private clustering and statistical queries while analyzing a large database
CN102388390A (en) * 2009-04-01 2012-03-21 微软公司 Clustering videos by location
US20140320388A1 (en) * 2013-04-25 2014-10-30 Microsoft Corporation Streaming k-means computations
CN104731811A (en) * 2013-12-20 2015-06-24 北京师范大学珠海分校 Cluster information evolution analysis method for large-scale dynamic short texts
CN103902699A (en) * 2014-03-31 2014-07-02 哈尔滨工程大学 Data space retrieval method applied to big data environments and supporting multi-format feature
CN103984681A (en) * 2014-03-31 2014-08-13 同济大学 News event evolution analysis method based on time sequence distribution information and topic model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GUANWEN ZHU et al.: "Query Planning with Source Descriptions for Deep Web", Proceedings of the 2012 International Conference on Cybernetics and Informatics *
YUN CHI et al.: "On evolutionary spectral clustering", ACM Transactions on Knowledge Discovery from Data *
HOU WEI et al.: "An Evolutionary Clustering Algorithm Based on Membership Degree Optimization", Journal of Computer Research and Development *
ZHU GUANWEN et al.: "Deep Web Data Source Classification Method Based on Topic and Form Attributes", Acta Electronica Sinica *
DONG HONGBIN et al.: "Application of Co-evolutionary Algorithms in Clustering", Pattern Recognition and Artificial Intelligence *
GAO BING et al.: "Evolutionary Data Stream Clustering Algorithm Based on Shared Nearest Neighbor Density", Journal of University of Science and Technology Beijing *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108806355A (en) * 2018-04-26 2018-11-13 浙江工业大学 A kind of calligraphy and painting art interactive education system
CN108806355B (en) * 2018-04-26 2020-05-08 浙江工业大学 Painting and calligraphy art interactive education system
CN108932528A (en) * 2018-06-08 2018-12-04 哈尔滨工程大学 Similarity measurement and method for cutting in chameleon algorithm
CN109543712A (en) * 2018-10-16 2019-03-29 哈尔滨工业大学 Entity recognition method on temporal dataset
CN110033644A (en) * 2019-04-22 2019-07-19 泰华智慧产业集团股份有限公司 Parking position reserves air navigation aid and system
CN111161819A (en) * 2019-12-31 2020-05-15 重庆亚德科技股份有限公司 Traditional Chinese medical record data processing system and method
CN114386422A (en) * 2022-01-14 2022-04-22 淮安市创新创业科技服务中心 Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction
CN114386422B (en) * 2022-01-14 2023-09-15 淮安市创新创业科技服务中心 Intelligent auxiliary decision-making method and device based on enterprise pollution public opinion extraction
CN116450830A (en) * 2023-06-16 2023-07-18 暨南大学 Intelligent campus pushing method and system based on big data
CN116450830B (en) * 2023-06-16 2023-08-11 暨南大学 Intelligent campus pushing method and system based on big data

Also Published As

Publication number Publication date
CN106067029B (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN106067029A (en) The entity classification method in data-oriented space
Paredes et al. Machine learning or discrete choice models for car ownership demand estimation and prediction?
US20180349384A1 (en) Differentially private database queries involving rank statistics
US9355367B2 (en) System and method for using graph transduction techniques to make relational classifications on a single connected network
CN106817251B (en) Link prediction method and device based on node similarity
Guo et al. Local community detection algorithm based on local modularity density
Liu et al. Deep learning approaches for link prediction in social network services
Adcock et al. Tree decompositions and social graphs
Gong et al. Identification of multi-resolution network structures with multi-objective immune algorithm
Chen et al. Exploiting structural and temporal evolution in dynamic link prediction
Karpatne et al. Predictive learning in the presence of heterogeneity and limited training data
CN107220311A (en) A kind of document representation method of utilization locally embedding topic modeling
Shahbazi et al. A survey on techniques for identifying and resolving representation bias in data
CN114154557A (en) Cancer tissue classification method, apparatus, electronic device, and storage medium
Ban et al. Micro-directional propagation method based on user clustering
CN106651461A (en) Film personalized recommendation method based on gray theory
CN101241520A (en) Model state creation method based on characteristic suppression in finite element modeling
Zhang et al. Closeness degree-based hesitant trapezoidal fuzzy multicriteria decision making method for evaluating green suppliers with qualitative information
Rahmani Seryasat et al. Predicting the number of comments on Facebook posts using an ensemble regression model
Chhabra et al. Missing value imputation using hybrid k-means and association rules
CN117036781A (en) Image classification method based on tree comprehensive diversity depth forests
Paul et al. Community detection using Local Group Assimilation
de Sá et al. A novel approach to estimated Boulingand-Minkowski fractal dimension from complex networks
Huang et al. Community detection algorithm for social network based on node intimacy and graph embedding model
Ma et al. Discover semantic topics in patents within a specific domain

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190618

Termination date: 20200524