CN110533072A - SOAP service similarity calculation and clustering method based on Bigraph structure in a Web environment - Google Patents

Info

Publication number
CN110533072A
Authority
CN
China
Prior art keywords
data
term
bigraph
cluster
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910692227.XA
Other languages
Chinese (zh)
Other versions
CN110533072B (en)
Inventor
陆佳炜
赵伟
周焕
吴涵
张元鸣
高飞
肖刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201910692227.XA
Publication of CN110533072A
Application granted
Publication of CN110533072B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A SOAP service similarity calculation and clustering method based on the Bigraph structure in a Web environment, comprising the following steps: step 1, formal definitions; step 2, characteristic value calculation; step 3, domain weight calculation; step 4, generating the Bigraph hierarchical structure of terms; step 5, constructing the similarity matrix; step 6, service clustering; step 7, the data cells update the global optimal object according to the transport rules; step 8, each tissue cell runs as an independent execution unit and evolves in parallel. A sequence of computation steps is defined as a computation. Starting from the tissue cells that contain the initial data cell object sets, one or more evolution rules are applied to the current data cell object sets in each computation step. When the halting condition of the system is reached, the system halts automatically and the result is presented in the external environment of the system. The invention can calculate similarity more accurately and obtain better clustering results.

Description

SOAP service similarity calculation and clustering method based on Bigraph structure in a Web environment
Technical field
The present invention relates to the similarity clustering of Web services, in particular to the similarity clustering of SOAP services.
Background art
With the development of Web 2.0 technology, the number and variety of services on the Internet keep increasing. This makes it possible to develop Internet of Things applications more easily and quickly, but it also makes it difficult to find the required atomic service or service composition accurately and efficiently. Service clustering can effectively facilitate service discovery, and in recent years many different service clustering methods have been proposed for Mashup services, Web APIs, and Web services.
Existing methods mainly use the information in service descriptions, such as Mashup descriptions, API descriptions, and WSDL documents, and cluster services by treating the similarity of their descriptions as their functional similarity. Other methods further exploit the information in user-annotated tags to improve clustering performance. Service descriptions and service tags are both textual information, so these methods generally infer service similarity from semantic similarity and use it to guide the clustering of services. In fact, most of the similarity measures they propose for quantifying service descriptions and tags are based on the semantic information in the text. In addition, Pan W et al. proposed a novel Mashup service clustering method based on structural similarity and a genetic algorithm: it describes the relationships between Mashups and Web APIs with a two-mode graph, quantifies the structural similarity between each pair of Mashup services with the SimRank algorithm, and finally clusters the Mashup services. Lu Jiawei et al. coupled isolated services into a global social service network, calculated the social similarity between services, and proposed a service clustering method oriented to the global social service network; the method calculates similarity by jointly considering the service description, service domain, and QoS information, thereby improving the precision of service clustering.
Currently, most existing methods calculate the functional similarity between SOAP-based Web services from the service function description (the WSDL document) and then cluster the services. Liu et al. extracted four features of a Web service from its WSDL description text, namely content, context, host name, and service name, for Web service clustering. Elgazzar et al. analyzed WSDL documents and clustered them according to functional similarity. Yu and Rege proposed a clustering method that improves service discovery with a service community learning algorithm. Ontologies are also commonly used for semantic similarity calculation and matching between Web services, to support the clustering and discovery of services. For example, Pop et al. designed a module that assesses the degree of matching between the ontology concepts describing two semantic Web services and clustered the services with an ant-based method to achieve efficient service discovery. Nayak et al. proposed discovering and clustering Web services with additional semantics, based on an agglomerative hierarchical clustering algorithm.
Clustering methods based on functional similarity also include methods that cluster with the tag information of services. For example, Wu et al. proposed a method called WTCluster, which uses tags to support the clustering and discovery of Web services; it integrates tag data and WSDL documents with an LDA model and uses the resulting probabilistic topic distribution of each Web service to improve clustering precision. Aznag et al. proposed a non-logic-based matching method that uses a correlated topic model to extract topics from semantic service descriptions and models the correlation between the extracted topics.
Non-functional factors, such as service context, relationships between services, and quality of service (QoS), are also used by many researchers to refine and enhance service discovery and clustering. For example, Zhou et al. proposed an improved fuzzy C-means algorithm that clusters services based on the elements they provide: inputs, outputs, and semantic relationships. Skoutas et al. ranked and clustered Web services using dominance relationships over multiple criteria. Chen et al. described a hybrid QoS prediction method that alleviates the data sparsity problem of collaborative filtering. Kumara et al. proposed a cluster-based service recommendation method that clusters services by the semantic similarity and relevance between them and, through a filtering process, selects the service cluster with better QoS values to serve the currently invoked service.
Summary of the invention:
To solve the problem of clustering SOAP services in a Web environment, the present invention extracts hidden term information from WSDL documents and divides the service information into two classes, service self information and service context information, in order to calculate term characteristic values, and uses the calculated characteristic values to generate a special Bigraph hierarchical model. The similarity of SOAP services is calculated through the Bigraph hierarchical model. A density-based k-means algorithm preprocesses the data set, and a tissue P system combines the hierarchy-based Agnes algorithm, a genetic algorithm (GA), and a weighted fuzzy C-means (FCM) clustering algorithm. On this basis, a SOAP service similarity calculation and clustering method based on the Bigraph structure is proposed.
In order to solve the above technical problem, the present invention provides the following technical solution:
A SOAP service similarity calculation and clustering method based on the Bigraph structure, comprising the following steps:
Step 1: formal definitions
1.1 Term definitions:
Let $TL = \{T_1, T_2, \ldots, T_n\}$ be the set of terms in the service corpus, where $n$ is the number of terms, and let $A = \{a_1, a_2, \ldots, a_m\}$ be the atomic vocabulary that composes the terms in $TL$, i.e., words that cannot be subdivided further, where $m$ is the number of atomic words. The term frequency is the number of occurrences of term $T_i$ and is used to calculate the total number of occurrences of all terms in the corpus $TL$; the corresponding atomic word frequency is used to calculate the total number of occurrences of all words. The calculation formulas are as follows:
$Num_{TL}$ is the total number of term occurrences in $TL$, and $Num_A$ is the total number of atomic word occurrences;
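The counts defined in 1.1 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each term is a string whose atomic words are separated by spaces; the function name `term_and_word_counts` and the sample terms are hypothetical and do not come from the patent.

```python
from collections import Counter

def term_and_word_counts(term_occurrences):
    """Count term occurrences and the atomic words the terms are built from.

    Assumption: a term is a space-separated string of atomic words.
    """
    term_freq = Counter(term_occurrences)
    word_freq = Counter()
    for term, count in term_freq.items():
        for word in term.split():
            word_freq[word] += count
    num_tl = sum(term_freq.values())  # total occurrences of all terms
    num_a = sum(word_freq.values())   # total occurrences of all atomic words
    return term_freq, word_freq, num_tl, num_a

# Hypothetical term occurrences extracted from WSDL documents.
terms = ["get weather", "weather", "get weather", "city name"]
term_freq, word_freq, num_tl, num_a = term_and_word_counts(terms)
```

In practice WSDL identifiers are often written in camel case, so a real extractor would need its own tokenization step before counting.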
1.2 Tissue P system (P System) definition:
A tissue P system of degree 3 composed of data cells, i.e., with 3 tissue cells, can be formally defined as the following eight-tuple:
$\omega = (OB_1, OB_2, OB_3, OR_1, OR_2, OR_3, OR', OE_o)$
where:
$OB_1$, $OB_2$, and $OB_3$ are the object sets of the tissue cells, i.e., sets of data cells;
$OR_1$, $OR_2$, and $OR_3$ are the evolution rules of the tissue cells, representing the clustering rules based on the Agnes and k-means algorithms, on the weighted FCM algorithm, and on the GA algorithm, respectively;
$OR'$ represents the transport rules between the tissue cells of the whole P system; through the transport rules, cells can share and exchange objects with each other;
$OE_o = 0$ is the output region of the system and represents the environment;
1.3 Tissue object definition
In data clustering, the task of the tissue P system is to search for the optimal cluster centers of the data set to be clustered. The cluster centers are therefore represented by a group of objects, and a tissue cell object $T$ in the P system is defined as an $N \times d$-dimensional vector, as follows:
$T = (t_{11}, t_{12}, \ldots, t_{1d}, \ldots, t_{i1}, t_{i2}, \ldots, t_{id}, \ldots, t_{N1}, t_{N2}, \ldots, t_{Nd})$
where $N$ means that data cell $T$ has $N$ clusters $C_1, C_2, \ldots, C_N$ with corresponding cluster centers $t_1, t_2, \ldots, t_N$. Like the data points, each cluster center in the object is a $d$-dimensional vector, so $t_i$ can be written as $(t_{i1}, t_{i2}, \ldots, t_{id})$, $i = 1, 2, \ldots, N$, and $t_{id}$ is the $d$-th component of the $i$-th cluster center;
$OB_i$ represents the object set in the $i$-th evolution membrane of the P system. It contains a group of objects that evolve through the evolution mechanisms of the different tissue cells. The initial number of objects in each evolution membrane is defined as $m$, and these objects form its object set $Q$. During the evolution of the P system, the system needs a measure to evaluate the quality of the current objects. Object quality is judged by the clustering performance function $J_m$, which calculates the overall within-cluster variance of the samples, where $s_j$ denotes a data point in a cluster. The smaller $J_m$ is, the better the object. After the objects are ranked, each evolution membrane holds its own best object, the local optimal object $OB_{ibest}$, and the environment of the system holds one best object, the global optimal object $T_{best}$. When the whole system halts, the global optimal object in the environment is the required solution, i.e., the optimal cluster centers;
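The object encoding and its evaluation can be sketched as follows. The formula for $J_m$ is not reproduced in this text, so the sketch assumes the usual within-cluster sum of squared Euclidean distances, with every data point assigned to its nearest center; the helper names `split_centers`, `sq_dist`, and `jm` are hypothetical.

```python
def split_centers(obj, d):
    """Split a flat N*d object vector into N cluster centers of dimension d."""
    return [obj[i:i + d] for i in range(0, len(obj), d)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def jm(obj, d, data):
    """Assumed J_m: sum of squared distances from each point to its nearest center."""
    centers = split_centers(obj, d)
    return sum(min(sq_dist(s, t) for t in centers) for s in data)

# Toy data in d = 2 dimensions and two candidate objects with N = 2 centers each.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
good = [0.0, 1.0, 10.0, 10.0]  # centers (0, 1) and (10, 10)
bad = [5.0, 5.0, 6.0, 6.0]     # centers (5, 5) and (6, 6)
```

Under this assumption `jm(good, 2, data)` is smaller than `jm(bad, 2, data)`, so `good` would be ranked as the better object.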
Step 2: characteristic value calculation. The process is as follows:
2.1 Self-information characteristic value calculation
A term $T_i$ is looked up in the service corpus and its information content $I(P_i)$ is calculated with an information-theoretic method. On this basis, the characteristic value $Spe(T_i)$ of term $T_i$ is assigned as follows:
$Spe(T_i) = I(P_i)$ (3)
The term characteristic value is calculated from the joint probability distribution $P\{p_i, q_j\}$, where $p_i \in P$ and $q_j \in Q$; $p_i$ is a term selected from the term set $TL$ and $q_j$ is a word taken from the atomic vocabulary $A$, and $\{p_1, p_2, \ldots, p_n\}$ and $\{q_1, q_2, \ldots, q_m\}$ are represented by the random variables $P$ and $Q$ respectively. The mutual information of $p_i$ and $q_j$ is calculated by the following formula;
The characteristic value of term $p_i$ is expressed as $I(p_i, Q)$, which represents the relationship between term $p_i$ and the vocabulary $Q$. Combined with the frequencies of terms and words in the corpus, the formula for the characteristic value of $p_i$ is as follows:
$Spe(T_i) \approx I(p_i, Q)$ (5)
According to Bayes' theorem,
The self-information characteristic value $SelfSpe(T_i)$ of the SOAP service is finally calculated as follows:
Terms in a typical parsed WSDL document generally contain 1 to 2 words, so the number of words in a term is approximated as 1 in the calculation. $\theta$ is a weight set according to the information-theoretic method, with a value range of 0 to 1;
2.2 Context information characteristic value calculation
According to the information-theoretic method, the context information of a service is characterized by the entropy of the probability distribution of the term's modifiers. The entropy is calculated by the following formula;
where $NT$ is the number of modifiers of term $T_i$ and $(mod_m, T_i)$ denotes the probability that $mod_m$ modifies term $T_i$. The entropy is the average information over all $(mod_m, T_i)$. Within a specific domain the modifier distribution of a term is more concentrated, so the entropy of a term in a specific domain is lower. The context information characteristic value $ContextSpe(T_i)$ of term $T_i$ is calculated from the entropy as follows:
where $1 \le j \le K$, $K$ is the total number of modifiers with the same definition, and the symbol indexed by $j$ denotes each modifier.
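The entropy of a term's modifier distribution can be sketched in Python. The patent's own formula is not reproduced in this text, so the sketch assumes the standard Shannon entropy with base-2 logarithms and estimates the probability that a modifier modifies the term from raw counts; the function name `modifier_entropy` and the sample counts are hypothetical.

```python
import math

def modifier_entropy(modifier_counts):
    """Shannon entropy (bits) of the distribution of modifiers attached to one term.

    Assumption: the probability of each modifier is its relative frequency.
    """
    total = sum(modifier_counts.values())
    probs = [c / total for c in modifier_counts.values() if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# A term with one dominant modifier has lower entropy than a term
# whose modifiers are spread evenly.
focused = {"current": 4}
spread = {"current": 1, "daily": 1, "local": 1, "global": 1}
```

This matches the observation above that a term used within a specific domain, whose modifiers are concentrated, has lower entropy.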
2.3 Hybrid characteristic value calculation
The self-information characteristic value and the context information characteristic value calculated by formulas (7) and (9) cover both the features that the words describe and the information that the words themselves cannot describe. The hybrid characteristic value is finally obtained by formula (10):
The mixing coefficient $\alpha$ lies between 0 and 1 and is set to 0.65 according to experiments. After normalization, the self-information, context, and hybrid characteristic values of a service all lie between 0 and 1;
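Formula (10) is not reproduced in this text. The sketch below assumes that the hybrid value is a convex combination of the two characteristic values with mixing coefficient $\alpha$, and that normalization is a min-max rescaling; both choices, and the names `normalize` and `hybrid_spe`, are illustrative assumptions rather than the patent's definition.

```python
def normalize(values):
    """Min-max normalization to [0, 1] (assumed form of the normalization step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_spe(self_spe, context_spe, alpha=0.65):
    """Assumed formula (10): alpha * SelfSpe + (1 - alpha) * ContextSpe."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * self_spe + (1.0 - alpha) * context_spe

value = hybrid_spe(0.8, 0.4)  # uses the experimentally chosen alpha = 0.65
```

With inputs already in [0, 1], a convex combination also stays in [0, 1], which is consistent with the value range stated above.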
Step 3: domain weight calculation. The process is as follows:
3.1 Domain weight calculation
The weight is reflected by the sibling terms: the greater the structural similarity, the greater the weight of the sibling terms. The calculation is as follows.
The formula uses the set of sibling terms of a new term $T_n$; $HybridSpe(T_s)$ and $HybridSpe(T_n)$ are the characteristic values of each sibling term and of the new term, respectively. If the newly added term has no sibling terms, the weight is directly set to 0.5. $G_i$ is the current Bigraph structure. A bigraph (Bigraph) is a pair $B = \langle BP, BL \rangle$, proposed by Turing Award winner Milner, where $BP$ and $BL$ are the place graph and the link graph, respectively. $BP$ is a triple $BP = \langle V, E, P \rangle$ consisting of the node set $V$ of the graph, the edge set $E$, and the interface $P$; nested nodes in the place graph have a parent-child relationship, and the nesting between nodes is expressed as branches. Like $BP$, $BL$ is a triple consisting of the node set $V$, the edge set $E$, and the interface $P$; $BL$ represents the connections between nodes;
3.2 Term weight calculation
The similarity of two terms is calculated by comparing the words they contain, as follows:
The formula uses the numbers of words that compose terms $T_i$ and $T_n$ and the number of words the two terms share. The more similar terms the related substructure of a new term contains, the higher the weight. The term weight is obtained from the term similarity by the following formula.
where $NP$ is the set of the parent, sibling, and child terms of the term, and $T_i$ is one of these terms;
Step 4: generating the Bigraph hierarchical structure of terms:
A Bigraph hierarchical structure is constructed for the different terms, similar to the place graph of a Bigraph. Each node of the Bigraph represents a term object, and the value of the node is the characteristic value of that term object. The Bigraph hierarchical structure is constructed from the top down, as follows:
4.1 The hybrid characteristic values of the terms extracted from the WSDL documents and from Google are calculated according to formula (10), put into array A, and sorted in ascending order. The first 3 term objects are selected as three nodes of the Bigraph to form the initial Bigraph structure $T$;
4.2 Each remaining term $T_n$ in array A is added to the existing Bigraph hierarchy. If $T_x$ satisfies $HybridSpe(T_n) - 0.3 < HybridSpe(T_x) < HybridSpe(T_n) + 0.3$, then $T_x$ is marked as a target node, where $T_x$ is a term already in the Bigraph hierarchy. These target nodes determine the target substructure position of $T_n$ and thereby the candidate substructures;
4.3 By jointly considering the domain weight $W_{DS}(G_i)$ and the term weight $W_{TS}(G_i)$ of the new term and the candidate substructure, the final node weight is calculated by formula (14) to find the optimal substructure;
$W_f(G_i) = \omega W_{DS}(G_i) + (1 - \omega) W_{TS}(G_i)$ (14)
where $\omega$ is a coefficient in the range 0 to 1. Steps 4.2 and 4.3 are iterated until all terms have been added to the Bigraph hierarchy;
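Steps 4.2 and 4.3 can be sketched as follows. The window of 0.3 and formula (14) come from the text above; the data layout (a dictionary of characteristic values and a dictionary of weight pairs per candidate substructure), the default $\omega = 0.5$, and all function names are assumptions made for illustration.

```python
def candidate_nodes(hierarchy_values, new_value, window=0.3):
    """Step 4.2: existing terms whose hybrid value lies within +/- window of the new term."""
    return [name for name, value in hierarchy_values.items()
            if new_value - window < value < new_value + window]

def final_weight(w_ds, w_ts, omega=0.5):
    """Formula (14): W_f = omega * W_DS + (1 - omega) * W_TS."""
    return omega * w_ds + (1.0 - omega) * w_ts

def best_substructure(weights, omega=0.5):
    """Step 4.3: candidate substructure with the highest final weight.

    weights maps a substructure id to its (W_DS, W_TS) pair.
    """
    return max(weights, key=lambda g: final_weight(weights[g][0], weights[g][1], omega))

# Hypothetical hybrid values of terms already in the hierarchy.
existing = {"a": 0.2, "b": 0.55, "c": 0.95}
targets = candidate_nodes(existing, 0.6)
chosen = best_substructure({"g1": (0.5, 0.2), "g2": (0.3, 0.9)})
```

The value of $\omega$ controls whether the domain weight or the term weight dominates; with `omega=1.0` the same example selects `g1` instead of `g2`.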
Step 5: constructing the similarity matrix:
The similarity is calculated with the following formula:
where $D$ is the maximum number of levels of the Bigraph hierarchical structure formed by the terms and $dis(T_1, T_2)$ is the shortest distance between terms $T_1$ and $T_2$ in the Bigraph hierarchical structure. This gives the similarity of SOAP services on one feature. The similarity of each feature of the SOAP services is calculated, the sum of the feature similarities is taken as the service similarity, and the similarity relationships between services are assembled into a similarity matrix;
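The shortest distance $dis(T_1, T_2)$ in the hierarchy can be found with a breadth-first search, as sketched below. The similarity formula itself is not reproduced in this text, so the conversion $1 - dis / (2D)$ used in `hierarchy_similarity` is only an assumed, illustrative choice; the edge list and function names are hypothetical as well.

```python
from collections import deque

def shortest_distance(edges, a, b):
    """Number of edges on the shortest path between a and b, or None if unreachable."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def hierarchy_similarity(edges, a, b, depth):
    """Assumed similarity: 1 - dis / (2 * D), clipped at 0 (not the patent's formula)."""
    dis = shortest_distance(edges, a, b)
    if dis is None:
        return 0.0
    return max(0.0, 1.0 - dis / (2 * depth))

# Hypothetical three-level hierarchy: root -> x -> x1 and root -> y.
edges = [("root", "x"), ("root", "y"), ("x", "x1")]
```

In a hierarchy with $D$ levels, two nodes are at most $2(D - 1)$ edges apart, so the assumed expression always stays within 0 and 1.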
Step 6: service clustering
Selecting the cluster centers requires calculating the overall within-cluster variance over the points of the data set. However, the data set contains many points that are not suitable candidates, such as noise and isolated edge points. These points not only affect the selection of the cluster centers but also add computing cost, and the number of clusters must be specified manually in advance. In view of these shortcomings, the present invention proposes a density-based improvement of the K-means algorithm that calculates the density of each point and takes high-density data points as cluster centers. The improved K-means algorithm performs a preliminary clustering of the initial data set $S$ to be clustered, where $S$ consists of $M$ data points of dimension $d$. The point density of the density-based K-means algorithm is calculated as follows:
where $Density(S_i)$ is the total number of points within radius $R$ of $S_i$, and the distance is calculated with $sim(S_i, S_j)$, the similarity of services $S_i$ and $S_j$.
The clustering process based on the density K-means algorithm is as follows:
6.1 The data are preprocessed with the density-based K-means algorithm: the distances between the data points $S_i$ are calculated, the data are divided into clusters according to the radius $R$, the $K$ points $S_i$ with the highest density $Density(S_i)$ are chosen as cluster centers, and the data are finally clustered by similarity. The process is as follows:
6.1.1 According to formula (16), the distance between each data point $S_i$ and each cluster center in the tissue object $Q$ is calculated, the number of points around $S_i$ in each cluster is determined, and the data set is sorted by density;
6.1.2 The first $K$ points $S_k$ with the highest density, i.e., with the most points within radius $R$, are chosen as the new cluster centers $C_k$;
6.1.3 According to the distances between the divided clusters, the similarity $sim(S_i, C_k)$ of each $S_i$ and $C_k$ is obtained. Given the average similarity $Avesim$, if $sim(C_k, S_i) > Avesim$, then $S_i$ is assigned to cluster $C_k$, finally yielding $N$ clusters;
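Steps 6.1.1 and 6.1.2 can be sketched on a precomputed similarity matrix. Formula (16) is not reproduced in this text; the sketch assumes that the distance between two services is `1 - sim` and that a point counts toward the density of $S_i$ when this distance is at most $R$. The function names and the sample matrix are hypothetical.

```python
def densities(sim, radius):
    """For each point, count the other points whose assumed distance 1 - sim is <= radius."""
    n = len(sim)
    return [sum(1 for j in range(n) if j != i and 1.0 - sim[i][j] <= radius)
            for i in range(n)]

def pick_centers(sim, radius, k):
    """Indices of the k densest points (ties keep the original order)."""
    dens = densities(sim, radius)
    order = sorted(range(len(sim)), key=lambda i: -dens[i])
    return order[:k]

# Hypothetical similarity matrix: services 0-2 are alike, services 3-4 are alike.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.2, 0.9, 1.0],
]
dens = densities(sim, 0.25)
```

Taking the densest points as centers can place several centers in the same dense region, which is one reason the later Agnes merging step in 6.2 is useful.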
6.2 Evolution rule of tissue cell $O_1$
$O_1$ uses Agnes as its evolution rule to guide the evolution of the objects in the cell. According to the preset inter-cluster similarity threshold $Cs$, the $N$ initial clusters obtained by the density k-means algorithm are merged by the Agnes algorithm as follows:
6.2.1 According to the average similarity $dis(C_i, C_j)$ of the data in any two clusters $C_i$, $C_j$, the similarity matrix $D$ is constructed
where $S_X$ is a data point in cluster $C_i$, $S_Y$ is a data point in cluster $C_j$, and $U$ and $V$ are the numbers of data points in $C_i$ and $C_j$, respectively;
6.2.2 The clusters $C_i$, $C_j$ with the largest $dis(C_i, C_j)$ are selected; if $dis(C_i, C_j) > Cs$, clusters $C_i$ and $C_j$ are merged;
6.2.3 Step 6.2.2 is repeated until all clusters satisfy the similarity threshold requirement;
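The merging procedure of steps 6.2.1 to 6.2.3 can be sketched as below. The formula for $dis(C_i, C_j)$ is not reproduced in this text; the sketch assumes the plain average of all pairwise similarities between the two clusters, which matches the description of $S_X$, $S_Y$, $U$, and $V$. The function names and the sample matrix are hypothetical.

```python
def average_similarity(sim, ci, cj):
    """Average pairwise similarity between the points of clusters ci and cj."""
    return sum(sim[x][y] for x in ci for y in cj) / (len(ci) * len(cj))

def agnes_merge(sim, clusters, threshold):
    """Repeatedly merge the most similar pair of clusters while it exceeds the threshold Cs."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = average_similarity(sim, clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best <= threshold:
            break  # every remaining pair is below the similarity threshold
        i, j = pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Hypothetical similarity matrix: services 0-2 are alike, services 3-4 are alike.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.2, 0.9, 1.0],
]
merged = agnes_merge(sim, [[0], [1], [2], [3], [4]], threshold=0.5)
```

Because merging stops at the threshold rather than at a fixed cluster count, the number of final clusters does not have to be chosen in advance.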
6.3 Evolution rule of tissue cell $O_2$
$O_2$ uses a sample-weighted FCM algorithm as its evolution rule to guide the evolution of the objects in the cell. The objective function and the cluster center calculation of the traditional FCM algorithm ignore the differences between samples and treat all samples equally. This makes the algorithm sensitive to isolated points and noise in the data set and reduces the contribution of important samples to the clustering, which lowers clustering accuracy. To reduce the influence of sample differences on the clustering result, the present invention proposes a sample-weighted FCM clustering algorithm that improves the clustering result by suitably weighting the objective function and the cluster center function;
For the data set $S = \{s_1, s_2, \ldots, s_n\}$,
6.3.1 The FCM membership is calculated according to the following formula:
where $u_{ij}$ is the membership value of the $i$-th data point in the $j$-th cluster; the $i$-th data point is assigned to the cluster $j$ with the largest membership. $\|s_i - t_j\|$ is the Euclidean distance from data point $s_i$ to cluster center $t_j$, and $n$ is the number of data points. The memberships of each data point over all clusters sum to 1;
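The membership formula is not reproduced in this text. The sketch below uses the standard fuzzy C-means membership, $u_{ij} = 1 / \sum_k (\|s_i - t_j\| / \|s_i - t_k\|)^{2/(m-1)}$, as an assumption about its form; note that this standard form requires a weighting exponent $m > 1$. The function name `memberships` and the sample data are hypothetical.

```python
import math

def memberships(data, centers, m=2.0):
    """Standard FCM memberships u[i][j] of point i in cluster j (assumed form)."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for s in data:
        dists = [math.dist(s, t) for t in centers]
        if 0.0 in dists:
            # The point coincides with one or more centers: split membership among them.
            hits = [1.0 if dist == 0.0 else 0.0 for dist in dists]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]
        u.append(row)
    return u

# Toy data on a line with centers at both ends.
data = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = memberships(data, centers)
```

Each row of `u` sums to 1, which is the membership constraint stated above; the point at distance 1 and 3 from the two centers receives memberships of about 0.9 and 0.1.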
6.3.2 Calculate the weights and entropy information
Thermodynamic entropy represents the degree of disorder of information. Based on the definition of entropy, the present invention analyzes the data memberships and applies sample weighting to the FCM objective function. First, the entropy variable $E_i$ is defined to represent the effectiveness of the memberships $u_{ij}$, and the weight $w_i$ is calculated to measure the influence of data point $s_i$ on the current clustering. Their calculation formulas are as follows:
6.3.3 Calculate the new objective function according to $E_i$ and $w_i$
The weight coefficients $w_i$ satisfy a normalization constraint, and the objective function $F(S, t)$ of FCM is then redefined as formula (22):
$m$ is the weighting exponent, an integer greater than or equal to 1. To find the extremum of the objective function under the constraints, the Lagrange multiplier method is used to construct the following function from the new objective function:
The optimality conditions for the extremum of the objective function are as follows:
The new cluster centers $t_j$ are calculated as:
The memberships $u_{ij}$ are updated, and the $i$-th data point is assigned to the cluster with the largest membership
6.3.4 If $|F(S, t)_{i-1} - F(S, t)_i|$ is greater than the set threshold, step 6.3.3 is repeated; otherwise the algorithm terminates and outputs the result. $F(S, t)_i$ denotes the FCM objective function value obtained in the $i$-th iteration;
6.4 Evolution rule of tissue cell $O_3$
$O_3$ uses the three genetic operations of GA (selection, crossover, and mutation) as its evolution rules to guide the evolution of each object in the cell. The evolution steps are as follows:
6.4.1 $O_3$ merges the $m$ objects in its own cell with the objects transported from the other two tissue cells into a new object evolution pool $P$;
6.4.2 $O_3$ performs selection, crossover, and mutation on the new object evolution pool $P$. Selection uses a best-preservation strategy; crossover and mutation use integer-form crossover and single-point mutation, as follows:
6.4.2.1 The evaluation value $p_k$ of each object $k$ is calculated, where $N$ is the number of clusters and $t_i$ is the center of the $i$-th cluster. The smaller $p_k$ is, the more suitable the partition and the more likely the object is to be passed to the next generation.
6.4.2.2 The fitness function $fitness_k$ of each object $k$ is defined:
$fitness_k = \alpha (1 - \alpha)^{index - 1}$ (30)
where $\alpha$ is a preset parameter with a value range of 0 to 1 and $index$ is the number of iterations.
6.4.2.3 Selection is performed according to the proportion of the object's fitness
where $u$ is the total number of objects in the object pool. For each object, a random number $p$ is generated in a loop; if $p < Cif_k$, the object is passed to the next generation;
6.4.2.4 The crossover position is determined by the crossover probability $Pc$. Two objects are selected at random from the evolution pool for crossover. Each component of the objects is traversed and a random number $p$ is generated in the loop; if $p < Pc$, the components of the two objects after that position are exchanged and the traversal ends;
6.4.2.5 The mutation probability $P_m$ is defined. For each object, a random probability $p$ is generated; if $p$ is less than the mutation probability $P_m$, let $z$ be the mutation point (i.e., a component) of the object determined by $P_m$ and let $z_\theta$ be its value after mutation. The mutated object is expressed as:
where $\delta \in [0, 1]$ is a randomly generated number, and the sign $+$ or $-$ is chosen according to a probability;
6.4.3 Steps 6.4.1 to 6.4.2 are repeated. To keep the number of objects in the evolution pool stable, $O_3$ screens the objects after evolution, eliminates objects according to their fitness, and retains the $m$ objects with the highest fitness to form a new object evolution pool $P'$;
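Formula (30) and the crossover of step 6.4.2.4 can be sketched as follows. The selection and mutation formulas are not reproduced in this text and are therefore left out. The function names, the injectable random source `rng`, and the sample objects are assumptions made for illustration.

```python
import random

def fitness(alpha, index):
    """Formula (30): fitness = alpha * (1 - alpha) ** (index - 1)."""
    return alpha * (1.0 - alpha) ** (index - 1)

def single_point_crossover(a, b, pc, rng=random):
    """Step 6.4.2.4: at the first position where a random draw falls below pc,
    exchange the components of the two objects after that position, then stop."""
    a, b = list(a), list(b)
    for pos in range(len(a)):
        if rng.random() < pc:
            a[pos + 1:], b[pos + 1:] = b[pos + 1:], a[pos + 1:]
            break
    return a, b

# With pc = 1.0 the exchange always happens at the first position;
# with pc = 0.0 the objects are returned unchanged.
child_a, child_b = single_point_crossover([1, 2, 3], [4, 5, 6], pc=1.0)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing parameter settings such as $Pc$.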
Step 7: the data cells update the global optimal object according to the transport rules
Transport channels exist between the cell membranes of the tissue cells in the system, and objects are shared and exchanged between tissue cells with the support of the transport rules defined by the system. In the tissue P system designed here, the transport rules that guide the exchange of information between tissue cells are as follows:
$(x, T_1, T_2, \ldots, T_m / T'_1, T'_2, \ldots, T'_m, y)$, $x \ne y$, $x, y = 1, 2, 3$.
This transport rule means that tissue cell $x$ and tissue cell $y$ can transport objects in both directions, where $T_1, T_2, \ldots, T_m$ are the $m$ objects of tissue cell $x$ and $T'_1, T'_2, \ldots, T'_m$ are the $m$ objects of tissue cell $y$. The rule has the following effects:
7.1) the $m$ objects $T_1, T_2, \ldots, T_m$ in tissue cell $x$ are transported into tissue cell $y$,
7.2) the $m$ objects $T'_1, T'_2, \ldots, T'_m$ in tissue cell $y$ are transported into tissue cell $x$;
$(x, T_{xbest} / T_{best}, OE_o)$, $x \ne y$, $x, y = 1, 2, 3$.
This transport rule means that tissue cell $x$ exchanges objects with the system environment, where $T_{xbest}$ is the local optimal object in the currently computing tissue cell $x$ and $T_{best}$ is the global optimal object in the current environment. Through this rule, the best object in tissue cell $x$ is transported into the environment and the global optimal object of the environment is updated at the same time;
Step 8: halting and output
Each tissue cell in the system runs as an independent execution unit and evolves in parallel, so the system is parallel and distributed. In this system, a sequence of computation steps is defined as a computation. A computation starts from the tissue cells that contain the initial data cell object sets, and in each computation step one or more evolution rules are applied to the current data cell object sets. When the halting condition of the system is reached, the system halts automatically and the result is presented in the external environment of the system.
To reduce the complexity of the system, a simple halting condition based on the maximum number of executed computations is used: the tissue P system halts when it reaches the preset maximum number of computations and outputs the global optimal object in the current environment.
The beneficial effects of the invention are as follows. Hidden term information is extracted from WSDL documents to generate a dedicated Bigraph hierarchical model. Service information is divided, according to its constituent words, into two classes, the self information of the service and the context information of the service, and a new method for calculating term feature values is introduced. Most terms are compound terms with a group of qualifiers, and self information is an important set of internal features representing the information in the domain corpus. Context information helps make up for the shortcomings of the self information of a service. The final feature value is calculated by combining self information and context information, so that similarity can be calculated more accurately.
At the same time, density-based k-means is used together with a tissue P system in which the hierarchical Agnes algorithm, the genetic algorithm (GA) and the weighted fuzzy clustering (FCM) algorithm serve as evolution rules. This effectively combines the advantages of the three clustering algorithms and yields a better clustering result.
Description of the drawings:
Fig. 1 is a flow chart of the SOAP service similarity calculation based on the Bigraph structure.
Fig. 2 is an example of a term set.
Specific embodiment
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, in a SOAP service similarity calculation and clustering method based on a Bigraph structure in a Web environment, the self information of a service comprises the term information made up of words and the internal structure of the terms. The most important terms are compound terms, which help express the meaning of a term. The internal structure of each component word or term determines the feature value of the service: if a term contains several words, the specificity of the term is always greater than that of any of its component words. It follows that a term containing more semantic components has higher specificity.
For example, consider three compound terms: T1 = NovelAuthor, T2 = FictionNovelAuthor and T3 = ScienceFictionNovelAuthor. In these three terms, Novel, Fiction and Science are all qualifiers and Author is the defined term. Compound term T1 is the combination of the qualifier Novel and the defined term Author, so Author and its qualifier can be regarded as having a subordinate hierarchical relationship. Similarly, T2 is formed by adding a new qualifier, Fiction, to T1, and T3 is formed by adding Science to T2. As the number of qualifiers grows, a compound term has a more specific meaning and therefore a higher feature value, so the feature values of T1, T2 and T3 are ordered as T1 < T2 < T3.
With reference to Fig. 1, the specific implementation steps are as follows:
First step: formal definitions
1.1, term definition:
Let TL = {T_1, T_2, ..., T_n} be a set of terms in the service corpus; Fig. 2 shows an example of TL, and n is the number of terms. A = {a_1, a_2, ..., a_m} is the set of atomic words that make up the terms in TL, i.e. words that cannot be subdivided further, and m is the number of all atomic words. The term frequency is defined from the number of occurrences of term T_i and the total number of occurrences of all terms in the corpus TL; the corresponding atomic word frequency is defined from the total number of occurrences of all words. The calculation formulas are as follows:
Num_TL is the total number of terms in TL, and Num_A is the total number of occurrences of all atomic words. In Fig. 2, Num_TL = 8 and Num_A = 21;
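As a rough illustration of the counts defined in 1.1, the sketch below splits CamelCase terms into atomic words and computes Num_TL, Num_A and frequencies. The splitting regex, the function names, the sample terms and the use of relative frequency (count divided by total) are all assumptions made for illustration; the frequency formulas of the patent itself are not reproduced in this text.

```python
import re
from collections import Counter

def split_atomic(term):
    # Assumed tokenisation: split a CamelCase term into its atomic words.
    return re.findall(r"[A-Z][a-z]*", term)

def corpus_counts(terms):
    # terms: list of term occurrences in the service corpus TL.
    term_counts = Counter(terms)
    word_counts = Counter(w for t in terms for w in split_atomic(t))
    num_tl = sum(term_counts.values())  # Num_TL: total term occurrences
    num_a = sum(word_counts.values())   # Num_A: total atomic word occurrences
    # Assumed definition: relative frequency = count / total.
    term_freq = {t: c / num_tl for t, c in term_counts.items()}
    word_freq = {w: c / num_a for w, c in word_counts.items()}
    return num_tl, num_a, term_freq, word_freq

# Hypothetical corpus built from the example terms in the description.
terms = ["NovelAuthor", "FictionNovelAuthor",
         "ScienceFictionNovelAuthor", "NovelAuthor"]
num_tl, num_a, term_freq, word_freq = corpus_counts(terms)
```

With this toy corpus the word Author occurs once in every term, so it receives the highest atomic word frequency.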
1.2, tissue P system (P System) definition:
A tissue P system of degree 3, i.e. one made up of 3 data cells, can be formally defined as the following eight-tuple:
$\omega = (OB_1, OB_2, OB_3, OR_1, OR_2, OR_3, OR', OE_o)$
where:
OB_1, OB_2 and OB_3 are the object sets of the tissue cells, i.e. the sets of data cells;
OR_1, OR_2 and OR_3 are the evolution rules of the tissue cells, representing the clustering rules based on the Agnes and k-means algorithms, on the weighted FCM algorithm, and on the GA algorithm, respectively;
OR' represents the transport rules of the tissue cells in the whole P system; through the transport rules, objects can be shared and exchanged between cells;
OE_o = 0 is the output region of the system and represents the environment;
1.3, tissue object definition
In a data clustering algorithm, the function of the tissue P system is to search for the optimal cluster centres for the data set being clustered. A group of objects can therefore be used to represent the cluster centres of the data. The invention defines a tissue cell object T in the P system as an N*d-dimensional vector, as follows:
$T = (t_{11}, t_{12}, \ldots, t_{1d}, \ldots, t_{i1}, t_{i2}, \ldots, t_{id}, \ldots, t_{N1}, t_{N2}, \ldots, t_{Nd})$
where N indicates that data cell T has N clusters, and the centres of these N clusters C_1, C_2, ..., C_N are t_1, t_2, ..., t_N. Like a data point, each cluster centre in an object is a d-dimensional vector, so t_i can be written as (t_i1, t_i2, ..., t_id), i = 1, 2, ..., N, where t_id is the d-th component of the centre of the i-th data cluster;
OB_i denotes the object set in the i-th evolution membrane of the P system. It contains a group of objects, which evolve through the evolution mechanisms of the different tissue cells. The initial number of objects in each evolution membrane is defined as m, and together they form its object set Q. During the evolution of the P system, the system needs a measure to evaluate the quality of the current objects. The invention judges the quality of an object by calculating the clustering performance function J_m, the overall within-cluster variance of the samples, where s_j denotes a data point in a data cluster. The smaller the value of J_m, the better the object. By ranking the objects in this way, each evolution membrane holds its own optimal object, i.e. the local optimal object OB_ibest, and the environment of the system stores one optimal object, i.e. the global optimal object, denoted T_best. When the whole system reaches the halting state, the global optimal object in the environment is the required solution, i.e. the optimal cluster centres;
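The object encoding and its quality measure can be sketched as follows. The formula for J_m is not reproduced in this text, so the sketch assumes the common form of overall within-cluster variance: the sum of squared Euclidean distances from each sample to its nearest cluster centre. The function names and sample data are hypothetical.

```python
import math

def unpack_centres(obj, n_clusters, dim):
    # Object T is a flat vector (t_11, ..., t_1d, ..., t_N1, ..., t_Nd).
    return [obj[i * dim:(i + 1) * dim] for i in range(n_clusters)]

def j_m(obj, data, n_clusters, dim):
    # Assumed performance function: sum of squared Euclidean distances
    # from each sample to its nearest cluster centre (smaller is better).
    centres = unpack_centres(obj, n_clusters, dim)
    return sum(min(math.dist(s, t) ** 2 for t in centres) for s in data)

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0.0, 1.0, 10.0, 1.0]  # centres (0, 1) and (10, 1)
bad = [5.0, 1.0, 5.0, 1.0]    # both centres at (5, 1)
```

Under this assumption the object `good` scores lower than `bad`, so it would be preferred as the local optimal object of its membrane.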
Second step: feature value calculation
The corresponding features are extracted from the WSDL documents, for example the five features "service name", "port name", "operation name", "input information" and "output information" shown in Fig. 1, to build the service corpus used for feature value calculation. The process is as follows:
2.1 unique characteristics values calculate
For a term T_i found in the service corpus, its information content I(p_i) can be calculated by information-theoretic methods. On this basis, the feature value Spe(T_i) of term T_i is assigned as follows:
$Spe(T_i) = I(p_i)$
The feature value of a term is calculated from the joint probability distribution P{p_i, q_j}, where p_i ∈ P and q_j ∈ Q; p_i is a term selected from the term set TL and q_j is a word taken from the atomic vocabulary A, and {p_1, p_2, ..., p_n} and {q_1, q_2, ..., q_m} are represented by the random variables P and Q respectively. The mutual information of p_i and q_j is calculated by the following formula:
The feature value of term p_i is expressed as I(p_i, Q), which represents the relationship between term p_i and the vocabulary Q. Combining the frequencies of terms and words in the corpus, the feature value of p_i is calculated as follows:
$Spe(T_i) \approx I(p_i, Q)$
According to Bayes' theorem,
The final self-information feature value SelfSpe(T_i) of the SOAP service is calculated as follows:
Terms in a typical parsed WSDL document generally contain 1 to 2 words, so the quantity representing the words in a term is approximated as 1 in the calculation; θ is a weight set on the basis of information theory, with a value range of 0 to 1;
2.2 contextual information characteristic values calculate
According to information theory, the context information feature of a service is the entropy of the probability distribution of the words that modify a term. The entropy is calculated by the following formula:
where NT is the number of qualifiers of term T_i, and (mod_m, T_i) denotes the probability that mod_m modifies term T_i; the entropy is the average information over all (mod_m, T_i). Within a specific domain, the qualifier distribution of a term is more concentrated, so the entropy of a term within a specific domain is lower. The context information feature value ContextSpe(T_i) of term T_i can be calculated from the entropy as follows:
where 1 ≤ j ≤ K, K is the total number of qualifiers with the same definition, and each such qualifier is indexed by j;
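To make the entropy step concrete, the following sketch computes the Shannon entropy of the empirical qualifier distribution of one term. It is only a stand-in under stated assumptions: the entropy formula of the patent and the mapping from entropy to ContextSpe(T_i) are not reproduced in this text, and the function name and the choice of base-2 logarithm are hypothetical.

```python
import math
from collections import Counter

def modifier_entropy(modifiers):
    # modifiers: list of qualifier occurrences observed for one term.
    counts = Counter(modifiers)
    total = sum(counts.values())
    # Shannon entropy (in bits) of the empirical qualifier distribution.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

concentrated = modifier_entropy(["Novel", "Novel", "Novel", "Fiction"])
spread = modifier_entropy(["Novel", "Fiction", "Science", "Book"])
```

A term whose qualifiers are concentrated on a few words gets a lower entropy than one whose qualifiers are spread evenly, which matches the observation above about terms within a specific domain.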
2.3 composite character values calculate
The self feature value and the context information feature value calculated by formulas (7) and (9) cover both the features of the describing words and the information that the words themselves cannot describe. The hybrid feature value is finally obtained by formula (10) as follows:
The mixing coefficient α takes a value between 0 and 1 and is set to 0.65 according to experiments. After normalization, the self feature value, the context feature value and the hybrid feature value of a service all lie between 0 and 1.
Third step: field weight calculation:
3.1 Field weight calculation
A weight based on the domain feature values is needed when generating the Bigraph structure. The size of the weight is reflected by the sibling terms: the greater the structural similarity, the greater the weight of the sibling terms. The calculation method is as follows.
where the formula ranges over the set of sibling terms of a new term T_n; HybridSpe(T_s) and HybridSpe(T_n) are the feature values of each sibling term and of the new term respectively; if the newly added term has no sibling terms, its weight is directly defined as 0.5; G_i is the current Bigraph structure;
3.2 Term weight calculation
The similarity of terms is calculated by comparing the words of the two terms, as follows:
where the formula uses the number of component words in term T_i, the number of component words in term T_n, and the number of words the two terms have in common. By definition, the more similar related sub-structure terms a new term contains, the higher its weight. The term weight can be obtained from the term similarity by the following formula:
where NP is the full set of parent, sibling and child terms of the term, and T_i is one of these terms;
4th step: generating the Bigraph hierarchical structure of the terms:
The invention proposes a Bigraph hierarchy construction algorithm for terms, which builds a Bigraph hierarchical structure over the different terms, similar to the place graph of a Bigraph. Each node of the Bigraph represents a term object, and the value of the node is the feature value of that term object. The Bigraph hierarchical structure is built from top to bottom in the following steps:
4.1, calculate, according to formula (10), the hybrid feature values of the terms extracted from the WSDL documents and from Google, put them into array A and sort them in ascending order, then select the first 3 term objects as three Bigraph nodes to form the initial Bigraph structure T;
4.2, add each remaining term T_n in array A to the existing Bigraph hierarchy. If a term T_x already in a Bigraph level satisfies HybridSpe(T_n) - 0.3 < HybridSpe(T_x) < HybridSpe(T_n) + 0.3, mark T_x as a target node. These target nodes determine the target sub-structure position of T_n and thereby the candidate Bigraph structures;
4.3, considering both the field weight W_DS(G_i) and the term weight W_TS(G_i) of the new term with respect to each candidate Bigraph structure, calculate the final node weight by the following formula to find the best Bigraph structure;
$W_f(G_i) = \omega\, W_{DS}(G_i) + (1 - \omega)\, W_{TS}(G_i)$
where ω is a coefficient in the range 0 to 1. Steps 4.2-4.3 are iterated until all terms have been added to the Bigraph hierarchy;
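Steps 4.2 and 4.3 can be sketched as below. The target-node window of 0.3 and the weighted combination follow the text; the function names, the dictionary layout of the candidates and the default ω = 0.5 are hypothetical choices made only for the example, and the weights W_DS and W_TS are taken as given inputs.

```python
def target_nodes(hybrid_new, hybrid_existing, window=0.3):
    # Step 4.2: existing terms whose hybrid feature value lies strictly
    # within +/- window of the new term's value become target nodes.
    return [name for name, value in hybrid_existing.items()
            if hybrid_new - window < value < hybrid_new + window]

def final_weight(w_ds, w_ts, omega):
    # Step 4.3: W_f(G_i) = omega * W_DS(G_i) + (1 - omega) * W_TS(G_i)
    return omega * w_ds + (1 - omega) * w_ts

def best_candidate(candidates, omega=0.5):
    # candidates: {structure_id: (W_DS, W_TS)}; omega=0.5 is a placeholder.
    return max(candidates,
               key=lambda g: final_weight(candidates[g][0], candidates[g][1], omega))

existing = {"Author": 0.1, "NovelAuthor": 0.45, "FictionNovelAuthor": 0.7}
targets = target_nodes(0.5, existing)
best = best_candidate({"G1": (0.9, 0.2), "G2": (0.6, 0.7)})
```

Here the candidate with the larger combined weight is chosen as the sub-structure that receives the new term.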
5th step: constructing the similarity matrix:
The similarity is calculated using the following formula:
where D is the maximum number of levels of the Bigraph hierarchy formed by the terms, and dis(T_1, T_2) is the shortest distance between the two terms T_1 and T_2 in the Bigraph hierarchy. This gives the similarity of SOAP services on one feature. The similarity of the SOAP services is calculated for each feature, the sum of the feature similarities is taken as the similarity of the services, and the similarity relations between services are assembled into a similarity matrix;
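The quantity dis(T_1, T_2) can be computed with a breadth-first search over the hierarchy, as in the minimal sketch below. The edge list, the term names and the decision to treat parent-child links as undirected are assumptions for illustration; the similarity formula that combines D and dis is not reproduced in this text and is therefore not implemented here.

```python
from collections import deque

def shortest_distance(edges, a, b):
    # edges: parent-child pairs of the term hierarchy, treated as undirected.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # the two terms are not connected

# Hypothetical hierarchy built from the example terms.
edges = [("Author", "NovelAuthor"),
         ("NovelAuthor", "FictionNovelAuthor"),
         ("Author", "BookAuthor")]
```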
6th step: service clustering
Selecting the cluster centre points requires computing the overall within-cluster variance for the points in the data set. However, the data set contains many non-candidate points, such as noise points and isolated points at the edges; these points not only affect the choice of cluster centres but also add computational cost, and the number of data clusters has to be specified manually in advance. In view of these shortcomings, an improved density-based K-means algorithm is proposed. It calculates the density of each point and extracts high-density data points as cluster centres. The improved K-means algorithm is used to pre-cluster the initial data set S to be clustered, where S consists of M data points of dimension d. The point density of the density-based K-means algorithm is calculated as follows:
where Density(S_i) is the total number of points within radius R of S_i, and the distance sim(S_i, S_j) is taken to be the similarity between services S_i and S_j;
To this end, the clustering process of the density-based K-means algorithm is as follows:
6.1, the data are pre-processed with the density-based K-means algorithm: the distances between the data points S_i are calculated, the data are divided into different clusters according to the radius R, the K points S_i with the highest density Density(S_i) are chosen as cluster centres, and the data are finally clustered by similarity. The process is as follows:
6.1.1 According to formula (16), calculate the distance from each data point S_i to each data cluster centre in the object set Q, determine the number of points of each data cluster around S_i, and sort the data set by density;
6.1.2 Choose the top K points S_k with the highest density, i.e. with the most points within range R, as the new data cluster centres C_k;
6.1.3 According to the distances between the divided clusters, obtain the similarity sim(S_i, C_k) between each S_i and C_k. Given the average similarity Avesim, if sim(C_k, S_i) > Avesim, assign S_i to data cluster C_k. N data clusters are finally obtained;
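The density-based choice of initial centres in steps 6.1.1 and 6.1.2 can be illustrated as follows. The sketch assumes a precomputed pairwise distance matrix `dist` (for instance derived from the service similarity matrix), counts neighbours with distance at most R, and breaks ties by original order; these choices and all names are assumptions rather than details stated in the text.

```python
def density(dist, i, radius):
    # Number of other points within `radius` of point i.
    return sum(1 for j, d in enumerate(dist[i]) if j != i and d <= radius)

def density_centres(dist, radius, k):
    # Step 6.1.2: pick the k densest points as initial cluster centres.
    order = sorted(range(len(dist)),
                   key=lambda i: density(dist, i, radius), reverse=True)
    return order[:k]

# Hypothetical one-dimensional data turned into a distance matrix.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 50.0]
dist = [[abs(p - q) for q in points] for p in points]
```

The isolated point at 50.0 has density 0 and is therefore never chosen as a centre, which is the motivation given above for the density-based improvement.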
6.2 Evolution rule of tissue cell O1
O1 uses Agnes as its evolution rule to guide the evolution of the objects in the cell. According to a preset inter-cluster similarity threshold Cs, the N initial clusters obtained by the density-based k-means algorithm are merged by the Agnes algorithm as follows:
6.2.1 Construct the similarity matrix D from the average similarity dis(C_i, C_j) of the data in any two data clusters C_i, C_j
where S_X is a data point in data cluster C_i, S_Y is a data point in data cluster C_j, and U and V are the numbers of data points in C_i and C_j respectively;
6.2.2 Select the data clusters C_i, C_j with the largest dis(C_i, C_j); according to the inter-cluster similarity threshold Cs, if dis(C_i, C_j) > Cs, merge data clusters C_i and C_j;
6.2.3 Repeat step 6.2.2 until the similarity threshold requirement is met between all data clusters;
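Steps 6.2.1 to 6.2.3 amount to average-linkage agglomerative merging with a similarity threshold. The sketch below assumes that dis(C_i, C_j) is the mean of the pairwise similarities sim(S_X, S_Y) over the U*V cross-cluster pairs, which is consistent with the symbols defined above but is not confirmed by a formula in this text; the names and the sample matrix are hypothetical.

```python
def avg_similarity(sim, ci, cj):
    # Assumed dis(C_i, C_j): mean pairwise similarity over the U*V pairs.
    return sum(sim[x][y] for x in ci for y in cj) / (len(ci) * len(cj))

def agnes_merge(sim, clusters, cs):
    # Repeatedly merge the most similar pair while its similarity exceeds cs.
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        pairs = [(avg_similarity(sim, clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        best, a, b = max(pairs)
        if best <= cs:
            break  # all remaining pairs meet the threshold requirement
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Hypothetical similarity matrix for four services.
sim = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.8],
       [0.1, 0.1, 0.8, 1.0]]
merged = agnes_merge(sim, [[0], [1], [2], [3]], cs=0.5)
```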
6.3 Evolution rule of tissue cell O2
O2 uses the FCM algorithm based on sample weighting as its evolution rule to guide the evolution of the objects in the cell. The objective function and cluster centre calculation of the traditional FCM algorithm do not take differences between samples into account and treat all samples equally. This tends to amplify the influence of isolated points or noise in the data set and to reduce the contribution of some important samples to the clustering, which lowers clustering accuracy. To reduce the effect of sample differences on the clustering result, an FCM clustering algorithm based on sample weighting is proposed, which improves the clustering result by applying reasonable weights to the objective function and the cluster centre function;
For the data set S = {s_1, s_2, ..., s_n},
6.3.1 Calculate the FCM membership degrees according to the following formula:
where u_ij is the membership degree of the i-th data point in the j-th cluster, and the i-th data point is assigned to the data cluster j with the largest membership degree; ||s_i - t_j|| is the Euclidean distance from data point s_i to cluster centre t_j, and n is the number of data points. The membership degrees of each data point sum to 1, i.e. the normalization constraint is satisfied for j = 1, 2, ..., n;
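For orientation, the sketch below implements the textbook FCM membership update, u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)) with d_ij = ||s_i - t_j||. This is the standard formula and is assumed, not confirmed, to match the one used in step 6.3.1; note that it requires a weighting exponent m > 1. The function name and sample data are hypothetical.

```python
import math

def fcm_membership(data, centres, m=2.0):
    # Standard FCM update: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).
    exponent = 2.0 / (m - 1.0)
    u = []
    for s in data:
        d = [math.dist(s, t) for t in centres]
        if 0.0 in d:
            # The sample coincides with a centre: full membership there.
            hit = d.index(0.0)
            u.append([1.0 if j == hit else 0.0 for j in range(len(d))])
            continue
        u.append([1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d])
    return u

data = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centres = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_membership(data, centres)
```

Each row of `u` sums to 1, and a sample would be assigned to the cluster with the largest entry in its row.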
6.3.2 Calculate the weight and entropy information
Thermodynamic entropy represents the degree of disorder of information. Based on the definition of entropy, the membership degrees of the data are analysed and the FCM objective function is weighted by sample. An entropy variable E_i is first defined to represent the effectiveness of the membership degrees u_ij, and a weight w_i is calculated to measure the degree of influence of data point s_i on the current clustering. Their calculation formulas are as follows:
6.3.3 Calculate the new objective function from E_i and w_i
The weight coefficients w_i are subject to a constraint, and the newly defined FCM objective function F(S, t) is then as follows:
m is the weighting exponent, an integer greater than or equal to 1. To find the extremum of the objective function under the constraints, the method of Lagrange multipliers is used to construct the following function from the new objective function:
The optimality conditions for the extremum of the objective function are as follows:
The new cluster centres t_j are calculated as:
Update the membership degrees u_ij and assign the i-th data point to the data cluster with the largest membership degree;
6.3.4 If |F(S, t)_{i-1} - F(S, t)_i| is greater than the preset threshold, repeat step 6.3.3; otherwise terminate the algorithm and output the result. F(S, t)_i denotes the FCM objective function value obtained at the i-th iteration;
6.4 Evolution rule of tissue cell O3
O3 uses the three genetic operations of GA, namely selection, crossover and mutation, as its evolution rule to guide the evolution of the objects in the cell. The evolution steps are as follows:
6.4.1 O3 merges the m objects in its own cell with the objects transported from the other two tissue cells into a new object evolution pool P;
6.4.2 O3 performs selection, crossover and mutation on the new object evolution pool P. Selection uses a best-preserving (elitist) strategy, and crossover and mutation use integer-form crossover and single-point mutation. The specific method is as follows:
6.4.2.1 Calculate the evaluation value p_k of each object k, where N is the number of data clusters and t_i is the centre of the i-th data cluster. The smaller p_k is, the more suitable the classification and the more easily the object is passed on to the next generation.
6.4.2.2 Define the fitness function fitness_k of each object k
$fitness_k = \alpha (1 - \alpha)^{index - 1}$
where α is a preset parameter with a value range of 0 to 1, and index is the number of iterations;
6.4.2.3 Selection operation, based on each object's share of the fitness
where u is the total number of objects in the object pool. For each object, a random number p is generated in a loop; if p < Cif_k, the object is passed on to the next generation;
6.4.2.4 The crossover position in the crossover operation is determined by the crossover probability Pc. Two objects are selected at random from the evolution pool for crossover. Each component of the objects is traversed and a random number p is generated in the loop; if p < Pc, the components of the two objects after that position are exchanged and the traversal ends;
6.4.2.5 Define the mutation probability Pm. For each object, a random probability p is generated; if p is less than the mutation probability Pm, let z be the mutation point (i.e. some component) of the object determined by the mutation probability Pm, and let z_θ be the value after mutation. The mutated object is expressed as:
where δ ∈ [0, 1] is a randomly generated number, and the + and - signs are chosen according to a probability;
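The fitness formula of step 6.4.2.2 and the crossover of step 6.4.2.4 can be sketched as follows. The crossover follows the description literally: the components are traversed, and at the first position where a random draw falls below Pc the tails after that position are swapped and the traversal stops. The function names and the injected random generator `rng` are hypothetical; selection and mutation are not shown because their formulas are not reproduced in this text.

```python
import random

def rank_fitness(alpha, index):
    # fitness_k = alpha * (1 - alpha) ** (index - 1), as in step 6.4.2.2.
    return alpha * (1 - alpha) ** (index - 1)

def single_point_crossover(a, b, pc, rng):
    # Step 6.4.2.4: walk the components; at the first position where a
    # random draw falls below pc, swap the tails after it and stop.
    a, b = list(a), list(b)
    for pos in range(len(a)):
        if rng.random() < pc:
            a[pos + 1:], b[pos + 1:] = b[pos + 1:], a[pos + 1:]
            break
    return a, b

rng = random.Random(0)
child_a, child_b = single_point_crossover([1, 2, 3], [7, 8, 9], 1.0, rng)
```

With Pc = 1.0 the swap always happens at the first position, and with Pc = 0.0 the two objects are returned unchanged, which makes the boundary behaviour easy to check.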
6.4.3 Repeat steps 6.4.1-6.4.2. To keep the number of objects in the evolution pool stable, O3 screens the evolved objects, eliminates objects according to their fitness, and retains the m objects with the highest fitness to form the new object evolution pool P';
7th step: the data cells update the global optimal object according to the transport rules
Transport channels exist between the cell membranes of the tissue cells in the system, so that objects can be shared and exchanged between different tissue cells. This requires the support of transport rules defined by the system. In the designed tissue P system, the following transport rules are defined to guide the exchange of information between tissue cells:
$(x,\ T_1, T_2, \ldots, T_m \,/\, T'_1, T'_2, \ldots, T'_m,\ y),\quad x \neq y,\ x, y = 1, 2, 3.$
This transport rule states that tissue cell x and tissue cell y can transport objects in both directions, where T_1, T_2, ..., T_m are m objects of tissue cell x and, likewise, T'_1, T'_2, ..., T'_m are m objects of tissue cell y. The rule has the following effects:
7.1) the m objects T_1, T_2, ..., T_m in tissue cell x are transported into tissue cell y,
7.2) the m objects T'_1, T'_2, ..., T'_m in tissue cell y are transported into tissue cell x;
$(x,\ T_{xbest} / T_{best},\ OE_o),\quad x \neq y,\ x, y = 1, 2, 3.$
This transport rule describes transport between tissue cell x and the system environment, where T_xbest is the local optimal object in the currently computing tissue cell x and T_best is the global optimal object in the current environment. Through this rule, the optimal object in tissue cell x is transported into the environment and the global optimal object of the environment is updated at the same time;
8th step: halting and output
Each tissue cell in the system evolves as an independent execution unit in a parallel structure, so the system is parallel and distributed. Within the system, a sequence of computation steps is defined as one computation. A computation starts from the tissue cells containing the initial data cell object sets, and in each computation step one or more evolution rules are applied to the current data cell object sets. When the halting condition of the system is reached, the system halts automatically and the result is presented in the external environment of the system.
To reduce the complexity of the system, a simple halting condition based on the maximum number of executions is used: the tissue P system halts when it reaches the preset maximum number of computations and outputs the global optimal object in the current environment.
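The interplay of the cell-to-environment transport rule and the halting condition can be sketched as a simple loop. This is a sequential simplification under stated assumptions: the cells are evolved one after another rather than in parallel, the cell-to-cell exchange of m objects is omitted, and the names `run_tissue_p_system`, `rules` and `cost` are hypothetical, with `cost` standing in for a quality measure such as J_m.

```python
def run_tissue_p_system(cells, rules, cost, max_steps):
    # cells: list of object lists, one per tissue cell.
    # rules: one evolution function per cell, mapping objects -> objects.
    # cost: quality measure of an object (smaller is better, like J_m).
    global_best = None  # the object held in the environment
    for _ in range(max_steps):  # halting condition: maximum number of steps
        # Each cell applies its own evolution rule (conceptually in parallel).
        cells = [rule(objs) for rule, objs in zip(rules, cells)]
        for objs in cells:
            local_best = min(objs, key=cost)
            # Transport rule (x, T_xbest / T_best, OE_o): send the local
            # best to the environment and update the global best.
            if global_best is None or cost(local_best) < cost(global_best):
                global_best = local_best
    return global_best

# Toy usage with numbers as objects and three stand-in evolution rules.
rules = [lambda o: [x / 2 for x in o], lambda o: o, lambda o: [x - 1 for x in o]]
result = run_tissue_p_system([[5.0, 9.0], [4.0], [7.0]], rules, abs, max_steps=3)
```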

Claims (10)

1. A SOAP service similarity calculation and clustering method based on a Bigraph structure, characterized in that the method comprises the following steps:
First step: formal definitions;
Second step: feature value calculation;
Third step: field weight calculation;
4th step: generating the Bigraph hierarchical structure of the terms:
A Bigraph hierarchical structure is built over the different terms, similar to the place graph of a Bigraph, wherein each node of the Bigraph represents a term object, the value of the node is the feature value of that term object, and the Bigraph hierarchical structure is built from top to bottom;
5th step: constructing the similarity matrix:
The similarity is calculated using the following formula:
where D is the maximum number of levels of the Bigraph hierarchy formed by the terms, and dis(T_1, T_2) is the shortest distance between the two terms T_1 and T_2 in the Bigraph hierarchy; this gives the similarity of SOAP services on one feature; the similarity of the SOAP services is calculated for each feature, the sum of the feature similarities is taken as the similarity of the services, and the similarity relations between services are assembled into a similarity matrix;
6th step: service clustering
Selecting the cluster centre points requires computing the overall within-cluster variance for the points in the data set; however, the data set contains many non-candidate points, such as noise points and isolated points at the edges, and these points not only affect the choice of cluster centres but also add computational cost, while the number of data clusters has to be specified manually in advance; in view of these shortcomings, the invention proposes an improved density-based K-means algorithm, which calculates the density of each point and extracts high-density data points as cluster centres; the improved K-means algorithm is used to pre-cluster the initial data set S to be clustered, where S consists of M data points of dimension d, and the point density of the density-based K-means algorithm is calculated as follows:
where Density(S_i) is the total number of points within radius R of S_i, and the distance sim(S_i, S_j) is taken to be the similarity between services S_i and S_j;
7th step: the data cells update the global optimal object according to the transport rules
Transport channels exist between the cell membranes of the tissue cells in the system, so that objects can be shared and exchanged between different tissue cells; this requires the support of transport rules defined by the system; in the designed tissue P system, the following transport rules are defined to guide the exchange of information between tissue cells:
$(x,\ T_1, T_2, \ldots, T_m \,/\, T'_1, T'_2, \ldots, T'_m,\ y),\quad x \neq y,\ x, y = 1, 2, 3.$
This transport rule states that tissue cell x and tissue cell y can transport objects in both directions, where T_1, T_2, ..., T_m are m objects of tissue cell x and, likewise, T'_1, T'_2, ..., T'_m are m objects of tissue cell y; the rule has the following effects:
7.1) the m objects T_1, T_2, ..., T_m in tissue cell x are transported into tissue cell y,
7.2) the m objects T'_1, T'_2, ..., T'_m in tissue cell y are transported into tissue cell x;
$(x,\ T_{xbest} / T_{best},\ OE_o),\quad x \neq y,\ x, y = 1, 2, 3.$
This transport rule describes transport between tissue cell x and the system environment, where T_xbest is the local optimal object in the currently computing tissue cell x and T_best is the global optimal object in the current environment; through this rule, the optimal object in tissue cell x is transported into the environment and the global optimal object of the environment is updated at the same time;
8th step: halting and output
Each tissue cell in the system evolves as an independent execution unit in a parallel structure, so the system is parallel and distributed; within the system, a sequence of computation steps is defined as one computation; a computation starts from the tissue cells containing the initial data cell object sets, and in each computation step one or more evolution rules are applied to the current data cell object sets; when the halting condition of the system is reached, the system halts automatically and the result is presented in the external environment of the system.
2. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 1, characterized in that, in the first step, the formal definitions are as follows:
1.1, term definition:
Let TL = {T_1, T_2, ..., T_n} be a set of terms in the service corpus, where n is the number of terms; A = {a_1, a_2, ..., a_m} is the set of atomic words that make up the terms in TL, i.e. words that cannot be subdivided further, and m is the number of all atomic words; the term frequency is defined from the number of occurrences of term T_i and the total number of occurrences of all terms in the corpus TL; the corresponding atomic word frequency is defined from the total number of occurrences of all words; the calculation formulas are as follows:
Num_TL is the total number of terms in TL, and Num_A is the total number of occurrences of all atomic words;
1.2, tissue P system definition:
A tissue P system of degree 3, i.e. one made up of 3 data cells, can be formally defined as the following eight-tuple:
$\omega = (OB_1, OB_2, OB_3, OR_1, OR_2, OR_3, OR', OE_o)$
where:
OB_1, OB_2 and OB_3 are the object sets of the tissue cells, i.e. the sets of data cells;
OR_1, OR_2 and OR_3 are the evolution rules of the tissue cells, representing the clustering rules based on the Agnes and k-means algorithms, on the weighted FCM algorithm, and on the GA algorithm, respectively;
OR' represents the transport rules of the tissue cells in the whole P system; through the transport rules, objects can be shared and exchanged between cells;
OE_o = 0 is the output region of the system and represents the environment;
1.3, tissue object definition
In a data clustering algorithm, the function of the tissue P system is to search for the optimal cluster centres for the data set being clustered; a group of objects is therefore used to represent the cluster centres of the data, and a tissue cell object T in the P system is defined as an N*d-dimensional vector, as follows:
$T = (t_{11}, t_{12}, \ldots, t_{1d}, \ldots, t_{i1}, t_{i2}, \ldots, t_{id}, \ldots, t_{N1}, t_{N2}, \ldots, t_{Nd})$
where N indicates that data cell T has N clusters, and the centres of these N clusters C_1, C_2, ..., C_N are t_1, t_2, ..., t_N; like a data point, each cluster centre in an object is a d-dimensional vector, so t_i can be written as (t_i1, t_i2, ..., t_id), i = 1, 2, ..., N, where t_id is the d-th component of the centre of the i-th data cluster;
OB_i denotes the object set in the i-th evolution membrane of the P system; it contains a group of objects, which evolve through the evolution mechanisms of the different tissue cells; the initial number of objects in each evolution membrane is defined as m, and together they form its object set Q; during the evolution of the P system, the system needs a measure to evaluate the quality of the current objects; the quality of an object is judged by calculating the clustering performance function J_m, the overall within-cluster variance of the samples, where s_j denotes a data point in a data cluster; the smaller the value of J_m, the better the object; by ranking the objects in this way, each evolution membrane holds its own optimal object, i.e. the local optimal object OB_ibest, and the environment of the system stores one optimal object, i.e. the global optimal object, denoted T_best; when the whole system reaches the halting state, the global optimal object in the environment is the required solution, i.e. the optimal cluster centres;
3. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in claim 2, characterized in that, in the second step, the feature values are calculated as follows:
2.1 Self feature value calculation
A term $T_i$ is found in the service corpus and its information content $I(P_i)$ is calculated by an information-theoretic method. On this basis, the feature value $Spe(T_i)$ of term $T_i$ is assigned as follows:
$Spe(T_i) = I(P_i)$ (3)
The term feature value is calculated from the joint probability distribution $P\{p_i, q_j\}$, where $p_i \in P$ and $q_j \in Q$; $p_i$ is a word selected from the term set TL and $q_j$ is a word taken from the atomic vocabulary A; $\{p_1, p_2, \ldots, p_n\}$ and $\{q_1, q_2, \ldots, q_m\}$ are represented by the random variables P and Q respectively. The mutual information of $p_i$ and $q_j$ is calculated by the following formula:
The feature value of term $p_i$ is expressed as $I(p_i, Q)$, which describes the relationship between term $p_i$ and the vocabulary Q. Combining the frequencies of terms and words in the corpus, the feature value of $p_i$ is calculated as follows:
$Spe(T_i) \approx I(p_i, Q)$ (5)
According to Bayes' theorem,
The final self-information feature value $SelfSpe(T_i)$ of the SOAP service is calculated as follows:
Terms parsed from a typical WSDL document generally contain 1 to 2 words, so the quantity representing the number of words in a term is approximated as 1 in the calculation; θ is a weight set according to the information-theoretic method, with a value range of 0 to 1;
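As a rough illustration of the information-theoretic quantity $I(p_i, Q)$ used in 2.1, the sketch below applies the textbook definition of mutual information between one term and the whole vocabulary. Formulas (4), (6) and (7) are not reproduced in the text, so this is an assumption rather than the patented calculation; the co-occurrence table `counts` and the function name are hypothetical.

```python
from math import log2


def term_word_information(counts, term):
    """Assumed I(p_i, Q) = sum_j P(p_i, q_j) * log2(P(p_i, q_j) / (P(p_i) * P(q_j))).

    counts maps (term, word) pairs to co-occurrence counts in the corpus.
    """
    total = sum(counts.values())
    p_term = sum(c for (t, _), c in counts.items() if t == term) / total
    info = 0.0
    for (t, w), c in counts.items():
        if t != term or c == 0:
            continue
        p_joint = c / total
        p_word = sum(c2 for (_, w2), c2 in counts.items() if w2 == w) / total
        info += p_joint * log2(p_joint / (p_term * p_word))
    return info
```

A term that always co-occurs with the same words scores higher than a term spread evenly over the vocabulary, which matches the intuition that specific terms carry more information.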
2.2 Context-information feature value calculation
According to information theory, the context-information feature of a service is the entropy of the probability distribution of the words that modify a term. This entropy is calculated by the following formula;
where NT is the number of modifiers of term $T_i$ and $(mod_m, T_i)$ is the probability that $mod_m$ modifies term $T_i$. The entropy is the average information over all $(mod_m, T_i)$. Within a specific domain the modifier distribution of a term is more concentrated, so the entropy of a term within a specific domain is lower. The context-information feature value $ContextSpe(T_i)$ of term $T_i$ is calculated from the entropy as follows:
where $1 \le j \le K$, K is the total number of all modifiers with the same definition, and the remaining symbol represents each modifier;
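The entropy of a term's modifier distribution can be sketched with the standard Shannon definition. Formula (8) is not reproduced in the text, so the base-2 logarithm and the function name `modifier_entropy` are assumptions, and the mapping from entropy to $ContextSpe(T_i)$ in formula (9) is deliberately left out.

```python
from collections import Counter
from math import log2


def modifier_entropy(modifiers):
    """Shannon entropy (in bits) of the observed modifiers of one term."""
    counts = Counter(modifiers)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A term that is always modified by the same word has entropy 0, while a term modified by many different words with equal frequency has the highest entropy, consistent with the statement that domain-specific terms have lower entropy.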
2.3 Hybrid feature value calculation
The self feature value and the context-information feature value calculated by formulas (7) and (9) cover both the features of the descriptor and the information that the word itself cannot describe. The hybrid feature value is finally obtained by formula (10):
The mixing coefficient α takes a value between 0 and 1 and is set to 0.65 based on experiments. After normalization, the self feature value, the context feature value and the hybrid feature value of a service all lie between 0 and 1.
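A hedged sketch of the mixing step follows. Formula (10) is not reproduced in the text, so the convex combination below, with α applied to the self feature value, is an assumption, as is the min-max normalization; only the value α = 0.65 comes from the text. Both function names are hypothetical.

```python
def normalize(values):
    """Min-max normalization to [0, 1] (an assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_spe(self_spe, context_spe, alpha=0.65):
    """Assumed HybridSpe = alpha * SelfSpe + (1 - alpha) * ContextSpe."""
    return alpha * self_spe + (1 - alpha) * context_spe
```

Because both inputs are normalized to [0, 1] and α lies in [0, 1], the result of this combination also stays in [0, 1].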
4. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in claim 3, characterized in that, in the third step, the domain weight is calculated as follows:
3.1 Domain weight calculation
The weight is reflected by sibling terms: the greater the structural similarity, the greater the weight of the sibling terms. It is calculated as follows:
where the set in the formula is the set of sibling terms of a new term $T_n$, and $HybridSpe(T_s)$ and $HybridSpe(T_n)$ are the feature values of each sibling term and of the new term respectively. If the newly added term has no sibling terms, its weight is directly defined as 0.5. $G_i$ is the current Bigraph structure. A bigraph (Bigraph) is a pair B = <BP, BL>, proposed by Turing Award winner Milner, where BP and BL are the place graph and the link graph respectively. BP is a triple BP = <V, E, P> consisting of the node set V of the graph, the edge set E and the interface P. Nested nodes in the place graph are in a parent-child relationship, and nesting between nodes is expressed as a branch relationship. Like BP, BL is a triple consisting of the node set V, the edge set E and the interface P; BL expresses the connection relationships between nodes;
3.2 Term weight calculation
The similarity of two terms is calculated by comparing the words they contain, as follows:
where the first two quantities in the formula are the numbers of constituent words in terms $T_i$ and $T_n$ respectively, and the third is the number of words that the two terms share. The more similar terms the related substructure of a new term contains, the higher the weight. The term weight is obtained from the term similarity by the following formula:
where NP is the set of all parent, sibling and child terms of the term, and $T_i$ is one of these terms.
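The word-overlap comparison in 3.2 can be illustrated as below. Formulas (12) and (13) are not reproduced in the text, so two assumptions are made and should be read as such: similarity is taken as the Dice coefficient over the word sets of the two terms, and the term weight is taken as the mean similarity over the neighbouring terms in NP. The function names are hypothetical.

```python
def term_similarity(words_i, words_n):
    """Assumed Dice coefficient: 2 * shared words / (words in T_i + words in T_n)."""
    set_i, set_n = set(words_i), set(words_n)
    if not set_i and not set_n:
        return 0.0
    return 2 * len(set_i & set_n) / (len(set_i) + len(set_n))


def term_weight(new_term, neighbours):
    """Assumed W_TS: mean similarity between the new term and each term in NP."""
    if not neighbours:
        return 0.0
    return sum(term_similarity(t, new_term) for t in neighbours) / len(neighbours)
```

For example, the terms `["get", "weather"]` and `["get", "forecast"]` share one of their two words each, giving a similarity of 0.5 under this assumption.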
5. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in claim 4, characterized in that, in the fourth step, the Bigraph hierarchical structure of the terms is generated as follows:
4.1 The hybrid feature values of the terms extracted from the WSDL documents and from Google are calculated and placed in an array A in ascending order. The first 3 term objects are selected as three nodes of the Bigraph to form the initial Bigraph structure T;
4.2 Each remaining term $T_n$ in array A is added to the existing Bigraph hierarchy. If $T_x$ satisfies $HybridSpe(T_n) - 0.3 < HybridSpe(T_x) < HybridSpe(T_n) + 0.3$, then $T_x$ is marked as a target node, where $T_x$ is a term already in the Bigraph hierarchy. These target nodes determine the target substructure position of $T_n$ and thereby the candidate node substructures;
4.3 Taking into account both the domain weight $W_{DS}(G_i)$ and the term weight $W_{TS}(G_i)$ of the new term and the candidate substructure, the final node weight is calculated by formula (14) in order to find the optimal substructure;
$W_f(G_i) = \omega W_{DS}(G_i) + (1 - \omega) W_{TS}(G_i)$ (14)
where ω is a coefficient in the range 0 to 1. Steps 4.2-4.3 are run iteratively until all terms have been added to the Bigraph hierarchy.
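Steps 4.2 and 4.3 can be sketched as follows: select target nodes inside the ±0.3 window, then rank candidate substructures by formula (14). The dictionary layouts, the function names and the default ω = 0.5 are illustrative assumptions; the text only states that ω lies between 0 and 1.

```python
def target_nodes(new_value, existing, window=0.3):
    """Return existing terms whose HybridSpe lies strictly within +/- window of the new term.

    existing maps term name -> HybridSpe value.
    """
    return [t for t, v in existing.items()
            if new_value - window < v < new_value + window]


def final_weight(w_ds, w_ts, omega=0.5):
    """Formula (14): W_f = omega * W_DS + (1 - omega) * W_TS."""
    return omega * w_ds + (1 - omega) * w_ts


def best_substructure(candidates, omega=0.5):
    """Pick the candidate substructure with the largest final weight.

    candidates maps substructure name -> (W_DS, W_TS).
    """
    return max(candidates, key=lambda g: final_weight(*candidates[g], omega))
```

A larger ω favours substructures whose sibling terms have similar feature values, while a smaller ω favours substructures whose terms share words with the new term.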
6. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in any one of claims 1 to 5, characterized in that, in the sixth step, the clustering process based on the density K-means algorithm is as follows:
6.1 The data are preprocessed with the density-based K-means algorithm: the distances between the different data $S_i$ are calculated, the data are divided into different clusters according to the radius R, the K data $S_i$ with the highest density $Density(S_i)$ are selected as cluster centers, and the data are finally clustered by similarity;
6.2 Evolution rule of tissue cell $O_1$
$O_1$ uses Agnes as its evolution rule to guide the evolution of the objects in the cell: according to the set inter-cluster similarity threshold Cs, the N initial clusters obtained by the density K-means algorithm are merged by the Agnes algorithm;
6.3 Evolution rule of tissue cell $O_2$
$O_2$ uses a sample-weighted FCM algorithm as its evolution rule to guide the evolution of the objects in the cell. The objective function and the cluster-center calculation of the traditional FCM algorithm ignore the differences between samples and treat all samples equally. This makes the algorithm vulnerable to isolated points or noisy data in the data set, which reduces the contribution of significant samples to the clustering and lowers the clustering accuracy. To reduce the effect of sample differences on the clustering result, a sample-weighted FCM clustering algorithm is proposed, which improves the clustering result by applying reasonable weighting to the objective function and the cluster-center function;
6.4 Evolution rule of tissue cell $O_3$
$O_3$ uses the three genetic operations of GA (selection, crossover and mutation) as its evolution rule to guide the evolution of each object in the cell.
7. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in claim 6, characterized in that the process of 6.1 is as follows:
6.1.1 According to formula (16), calculate the distance between each data item $S_i$ in the tissue object Q and each data cluster center, determine the number of points in the cluster around each $S_i$, and sort the data set by density;
6.1.2 Select the top K points $S_k$ with the highest density, i.e. with the most points within the range R, as the new data cluster centers $C_k$;
6.1.3 According to the distances between the divided clusters, obtain the similarity $sim(S_i, C_k)$ between each $S_i$ and $C_k$. Given the average similarity Avesim, if $sim(C_k, S_i) > Avesim$, assign $S_i$ to data cluster $C_k$. N data clusters are finally obtained;
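The density-based choice of initial centers in 6.1.1 and 6.1.2 can be sketched as below, taking the density of a point to be the number of other points within radius R, as the text suggests. Formula (16) is not reproduced, so Euclidean distance is assumed, and the final step assigns each point to its nearest center, which is a simplification of the Avesim rule in 6.1.3. All function names are hypothetical.

```python
from math import dist


def densities(data, radius):
    """Density of each point: number of other points within the given radius."""
    return [sum(1 for j, q in enumerate(data) if j != i and dist(p, q) <= radius)
            for i, p in enumerate(data)]


def pick_centers(data, radius, k):
    """Choose the k densest points as initial cluster centers (ties keep input order)."""
    dens = densities(data, radius)
    order = sorted(range(len(data)), key=lambda i: dens[i], reverse=True)
    return [data[i] for i in order[:k]]


def assign(data, centers):
    """Simplified assignment: put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in data:
        nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        clusters[nearest].append(p)
    return clusters
```

Note that picking the K densest points can place several centers in the same dense region; the later merge step by tissue cell $O_1$ is what combines such clusters.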
8. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in claim 7, characterized in that, in 6.2, the process of the evolution rule of tissue cell $O_1$ is as follows:
6.2.1 Construct the similarity matrix D from the average similarity $dis(C_i, C_j)$ of the data in any two data clusters $C_i$, $C_j$:
where $S_X$ is a data point in data cluster $C_i$, $S_Y$ is a data point in data cluster $C_j$, and U and V are the numbers of data points in $C_i$ and $C_j$ respectively;
6.2.2 Select the data clusters $C_i$, $C_j$ with the largest $dis(C_i, C_j)$. According to the inter-cluster similarity threshold Cs, if $dis(C_i, C_j) > Cs$, merge data clusters $C_i$ and $C_j$;
6.2.3 Repeat step 6.2.2 until all data clusters meet the inter-cluster similarity threshold requirement.
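The merge loop of 6.2.1 to 6.2.3 can be sketched as an average-linkage agglomeration. The formula for $dis(C_i, C_j)$ is not reproduced in the text; the sketch assumes it is the mean pairwise similarity over all U × V point pairs, and it uses a hypothetical point similarity of 1 / (1 + Euclidean distance), which the patent does not specify.

```python
from itertools import combinations
from math import dist


def pair_similarity(x, y):
    """Hypothetical point similarity in (0, 1]: closer points are more similar."""
    return 1.0 / (1.0 + dist(x, y))


def average_similarity(ci, cj):
    """Assumed dis(C_i, C_j): mean similarity over all U * V cross-cluster pairs."""
    return sum(pair_similarity(x, y) for x in ci for y in cj) / (len(ci) * len(cj))


def agnes_merge(clusters, cs):
    """Repeatedly merge the most similar pair of clusters while it exceeds threshold cs."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_similarity(clusters[ij[0]], clusters[ij[1]]))
        if average_similarity(clusters[i], clusters[j]) <= cs:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters
```

Recomputing all pairwise similarities on every pass is quadratic in the number of clusters, which is acceptable for a sketch but would normally be replaced by an incrementally updated similarity matrix D.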
9. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in claim 7, characterized in that, in 6.3, for a data set $S = \{s_1, s_2, \ldots, s_n\}$, the process is as follows:
6.3.1 Calculate the FCM membership according to the following formula:
where $u_{ij}$ is the membership value of the i-th data item in the j-th cluster, and the i-th data item is assigned to the data cluster j with the largest membership; $\|s_i - t_j\|$ is the Euclidean distance from data item $s_i$ to cluster center $t_j$, and n is the number of data items. The memberships of each data item over all clusters sum to 1, i.e. $\sum_{j} u_{ij} = 1$
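The membership calculation in 6.3.1 can be illustrated with the standard fuzzy c-means update, $u_{ij} = 1 / \sum_k (\|s_i - t_j\| / \|s_i - t_k\|)^{2/(m-1)}$. The patent's own formula is not reproduced in the text, so using the textbook form is an assumption, as is the default m = 2 (this form requires m > 1). The function name is hypothetical.

```python
from math import dist


def memberships(data, centers, m=2):
    """Standard FCM membership matrix: one row per sample, one column per cluster."""
    u = []
    for s in data:
        d = [dist(s, t) for t in centers]
        if 0.0 in d:
            # A sample lying exactly on a center belongs fully to that center.
            row = [1.0 if x == 0.0 else 0.0 for x in d]
            hits = sum(row)
            row = [r / hits for r in row]
        else:
            power = 2.0 / (m - 1)
            row = [1.0 / sum((dj / dk) ** power for dk in d) for dj in d]
        u.append(row)
    return u
```

Each row sums to 1 by construction, which is the constraint stated above, and the hard assignment of a sample is the column with the largest value in its row.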
6.3.2 Calculate the weight and entropy information
Thermodynamic entropy represents the degree of disorder of information. Based on the definition of entropy, the present invention analyzes the data memberships effectively and applies sample weighting to the FCM objective function. First, the entropy variable $E_i$ is defined to represent the effectiveness of the membership $u_{ij}$, and the weight $w_i$ is calculated to measure the influence of data item $s_i$ on the current clustering. They are calculated as follows:
6.3.3 Calculate the new objective function from $E_i$ and $w_i$
The weight coefficient $w_i$ satisfies the following condition, and the FCM objective function F(S, t) is then newly defined as in formula (22):
m is the weighting exponent, an integer greater than or equal to 1. To find the extremum of the objective function under the constraints, the Lagrange multiplier method is used to construct the following function from the new objective function:
The optimality conditions for the extremum of the objective function are as follows:
The new cluster center $t_j$ is calculated as:
Update the membership $u_{ij}$ and assign the i-th data item to the data cluster with the largest membership;
6.3.4 If $|F(S, t)_{i-1} - F(S, t)_i|$ is greater than the set threshold, repeat step 6.3.3; otherwise terminate the algorithm and output the result, where $F(S, t)_i$ denotes the FCM objective function value obtained in the i-th iteration.
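The iteration in 6.3.3 and 6.3.4 can be sketched as a sample-weighted fuzzy c-means loop. The formulas for $E_i$, $w_i$ and formula (22) are not reproduced in the text, so the sketch takes the weights `w` as a given input and assumes the common weighted forms $F = \sum_i \sum_j w_i u_{ij}^m \|s_i - t_j\|^2$ and $t_j = \sum_i w_i u_{ij}^m s_i / \sum_i w_i u_{ij}^m$. The names, the default m = 2 and the tolerance are hypothetical.

```python
from math import dist


def memberships(data, centers, m=2):
    """Standard FCM memberships; tiny floor on distances avoids division by zero."""
    rows = []
    for s in data:
        d = [max(dist(s, t), 1e-12) for t in centers]
        rows.append([1.0 / sum((dj / dk) ** (2.0 / (m - 1)) for dk in d) for dj in d])
    return rows


def objective(data, centers, u, w, m=2):
    """Assumed weighted objective F(S, t)."""
    return sum(w[i] * u[i][j] ** m * dist(s, t) ** 2
               for i, s in enumerate(data) for j, t in enumerate(centers))


def update_centers(data, u, w, n_clusters, m=2):
    """Assumed weighted center update: weighted mean of the samples."""
    dim = len(data[0])
    centers = []
    for j in range(n_clusters):
        coef = [w[i] * u[i][j] ** m for i in range(len(data))]
        total = sum(coef)
        centers.append(tuple(sum(c * s[k] for c, s in zip(coef, data)) / total
                             for k in range(dim)))
    return centers


def weighted_fcm(data, centers, w, m=2, tol=1e-6, max_iter=100):
    """Iterate until the change in the objective is no greater than tol."""
    previous = None
    for _ in range(max_iter):
        u = memberships(data, centers, m)
        centers = update_centers(data, u, w, len(centers), m)
        value = objective(data, centers, u, w, m)
        if previous is not None and abs(previous - value) <= tol:
            break
        previous = value
    labels = [max(range(len(centers)), key=lambda j: row[j]) for row in u]
    return centers, labels
```

Because each weight $w_i$ multiplies every term of sample i, the membership update keeps its unweighted form under these assumptions; only the objective and the center update change. Down-weighting an outlier therefore reduces its pull on the centers without changing how it is assigned.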
10. The SOAP service similarity calculation and clustering method based on the Bigraph structure as claimed in claim 9, characterized in that, in 6.4, the evolution steps are as follows:
6.4.1 $O_3$ merges the m objects in its own cell with the objects transported from the other two tissue cells into a new object evolution pool P;
6.4.2 $O_3$ performs selection, crossover and mutation operations on the new object evolution pool P. Selection uses the elitist (best-preservation) strategy; crossover and mutation use integer-form crossover and single-point mutation. The specific method is as follows:
6.4.2.1 Calculate the evaluation value $p_k$ of each object k, where N is the number of data clusters and $t_i$ is the center of the i-th data cluster. The smaller the evaluation value, the more suitable the classification and the more easily the object is inherited by the next generation;
6.4.2.2 Define the fitness function $fitness_k$ of each object k
$fitness_k = \alpha (1 - \alpha)^{index - 1}$ (30)
where α is a set parameter with a value range of 0 to 1, and index is the number of iterations;
6.4.2.3 Selection operation, according to the proportion of each object's fitness
where u is the total number of objects in the object pool. For each object, a random number p is generated in a loop; if $p < Cif_k$, the object is inherited by the next generation;
6.4.2.4 The crossover position in the crossover operation is determined by the crossover probability $P_c$. Two objects are selected at random from the evolution pool for crossover and each component of the objects is traversed; if the random number p generated in the loop satisfies $p < P_c$, the components of the two objects after that position are exchanged and the traversal ends;
6.4.2.5 Define the mutation probability $P_m$. For each object, generate a random probability p. If p is less than the mutation probability $P_m$, let z be the mutation point (i.e. a certain component) of the object determined according to $P_m$; the value after mutation is $z_\theta$, and the mutated object is expressed as:
where $\delta \in [0, 1]$ is a randomly generated number, and the + and − signs occur according to a probability;
6.4.3 Repeat steps 6.4.1-6.4.2. To keep the number of objects in the evolution pool stable, $O_3$ screens the evolved objects and eliminates objects according to their fitness, retaining the m objects with the highest fitness to reconstitute the object evolution pool P'.
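The genetic operations of $O_3$ can be sketched as below. Only formula (30) is taken directly from the text. The cumulative share used for selection, the tail-swap crossover, and the mutation step (which shifts one component by ±δ·step) are assumptions, because the corresponding formulas are not reproduced. Treating `index` as the 1-based rank of an object after sorting is also an assumption; the text describes it as an iteration count. All names are hypothetical.

```python
import random


def rank_fitness(alpha, index):
    """Formula (30): fitness = alpha * (1 - alpha) ** (index - 1)."""
    return alpha * (1 - alpha) ** (index - 1)


def cumulative_shares(fitness):
    """Assumed Cif_k: running sum of each object's share of the total fitness."""
    total = sum(fitness)
    shares, running = [], 0.0
    for f in fitness:
        running += f / total
        shares.append(running)
    return shares


def crossover(a, b, pc, rng):
    """Swap the tails of a and b after the first position where a random draw is below pc."""
    for pos in range(len(a)):
        if rng.random() < pc:
            return a[:pos + 1] + b[pos + 1:], b[:pos + 1] + a[pos + 1:]
    return list(a), list(b)


def mutate(obj, pm, step, rng):
    """Hypothetical single-point mutation: shift one component by +/- delta * step."""
    obj = list(obj)
    if rng.random() < pm:
        pos = rng.randrange(len(obj))
        delta = rng.random()
        sign = 1 if rng.random() < 0.5 else -1
        obj[pos] += sign * delta * step
    return obj


def keep_best(pool, evaluate, m):
    """Elitist screening: keep the m objects with the smallest evaluation value."""
    return sorted(pool, key=evaluate)[:m]
```

A caller would pass a seeded `random.Random` instance as `rng` to make runs reproducible, and would use the clustering performance function as `evaluate` so that `keep_best` rebuilds the pool P' of size m after each generation.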
CN201910692227.XA 2019-07-30 2019-07-30 SOAP service similarity calculation and clustering method based on Bigraph structure in Web environment Active CN110533072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910692227.XA CN110533072B (en) 2019-07-30 2019-07-30 SOAP service similarity calculation and clustering method based on Bigraph structure in Web environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910692227.XA CN110533072B (en) 2019-07-30 2019-07-30 SOAP service similarity calculation and clustering method based on Bigraph structure in Web environment

Publications (2)

Publication Number Publication Date
CN110533072A true CN110533072A (en) 2019-12-03
CN110533072B CN110533072B (en) 2022-09-23

Family

ID=68660492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910692227.XA Active CN110533072B (en) 2019-07-30 2019-07-30 SOAP service similarity calculation and clustering method based on Bigraph structure in Web environment

Country Status (1)

Country Link
CN (1) CN110533072B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628225A (en) * 2021-08-24 2021-11-09 合肥工业大学 Fuzzy C-means image segmentation method and system based on structural similarity and image region block
CN114362973A (en) * 2020-09-27 2022-04-15 中国科学院软件研究所 K-means and FCM clustering combined flow detection method and electronic device
WO2022156328A1 (en) * 2021-01-19 2022-07-28 青岛科技大学 Restful-type web service clustering method fusing service cooperation relationships
CN115148330A (en) * 2022-05-24 2022-10-04 中国医学科学院北京协和医院 POP treatment scheme forming method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120023141A1 (en) * 2007-08-06 2012-01-26 Atasa Ltd. System and method for representing, organizing, storing and retrieving information
CN107135092A (en) * 2017-03-15 2017-09-05 浙江工业大学 A kind of Web service clustering method towards global social interaction server net
CN109005049A (en) * 2018-05-25 2018-12-14 浙江工业大学 Service combining method based on Bigraph consistency algorithm under a kind of internet environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DOMINIK WACHHOLDER et al.: "Bigraph-Ensured Interoperability for System(-of-Systems) Emergence", OTM 2014 WORKSHOPS, LNCS 8842 *
WU HAIHUA et al.: "Blog similarity analysis based on the new clustering algorithm Increase K-Means", Journal of Xiamen University (Natural Science) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114362973A (en) * 2020-09-27 2022-04-15 中国科学院软件研究所 K-means and FCM clustering combined flow detection method and electronic device
CN114362973B (en) * 2020-09-27 2023-02-28 中国科学院软件研究所 K-means and FCM clustering combined flow detection method and electronic device
WO2022156328A1 (en) * 2021-01-19 2022-07-28 青岛科技大学 Restful-type web service clustering method fusing service cooperation relationships
CN113628225A (en) * 2021-08-24 2021-11-09 合肥工业大学 Fuzzy C-means image segmentation method and system based on structural similarity and image region block
CN113628225B (en) * 2021-08-24 2024-02-20 合肥工业大学 Fuzzy C-means image segmentation method and system based on structural similarity and image region block
CN115148330A (en) * 2022-05-24 2022-10-04 中国医学科学院北京协和医院 POP treatment scheme forming method and system
CN115148330B (en) * 2022-05-24 2023-07-25 中国医学科学院北京协和医院 POP treatment scheme forming method and system

Also Published As

Publication number Publication date
CN110533072B (en) 2022-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant