CN110533072A

CN110533072A - SOAP service similarity calculation and clustering method based on Bigraph structure in a Web environment

Info

Publication number: CN110533072A
Application number: CN201910692227.XA
Authority: CN
Inventors: 陆佳炜; 赵伟; 周焕; 吴涵; 张元鸣; 高飞; 肖刚
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2019-12-03
Anticipated expiration: 2039-07-30
Also published as: CN110533072B

Abstract

A SOAP service similarity calculation and clustering method based on the Bigraph structure in a Web environment, comprising the following steps: first step, formal definitions; second step, characteristic value calculation; third step, domain weight calculation; fourth step, generating the Bigraph hierarchical structure of terms; fifth step, constructing the similarity matrix; sixth step, service clustering; seventh step, the data cells update the global optimal object according to the transport rules; eighth step, each tissue cell runs as an independent execution unit and evolves in a parallel structure. A series of calculation steps is defined as a computation. Starting from the tissue cells containing the initial data cell object sets, each computation applies one or more evolutionary rules to the current data cell object sets. When the halting condition of the system is reached, the system halts automatically and the result is presented in the external environment of the system. The invention calculates similarity more accurately and obtains better clustering results.

Description

SOAP service similarity calculation and clustering method based on Bigraph structure in a Web environment

Technical field

The present invention relates to the similarity clustering of Web services, in particular the similarity clustering of SOAP services.

Background art

With the development of Web 2.0 technology, the number and variety of services on the Internet keep growing. This makes it possible to develop Internet of Things applications more easily and quickly, but it also makes accurately and efficiently finding the required atomic service or service composition a problem. Service clustering can effectively facilitate service discovery, and in recent years many different service clustering methods have been proposed for clustering Mashup services, Web APIs, and Web services.

Existing methods mainly use the information in service descriptions, such as Mashup descriptions, API descriptions, and WSDL documents, and treat the similarity of service descriptions as the functional similarity of services when clustering. Other methods further exploit the information in user-annotated tags to improve clustering performance. Service descriptions and service tags are both textual information, so these methods generally infer service similarity from semantic similarity and use it to guide clustering. In fact, most of the similarity measures they propose for quantifying the similarity of service descriptions and tags are based on the semantic information in the text. In addition, Pan W et al. proposed a Mashup service clustering method based on structural similarity and a genetic algorithm: it describes the relationships between Mashups and Web APIs with a two-mode graph, quantifies the structural similarity between each pair of Mashup services with the SimRank algorithm, and finally clusters the Mashup services effectively. Lu Jiawei et al. linked isolated services into a global social service network, calculated the social similarity between services, and proposed a service clustering method oriented to the global social service network, which computes similarity by jointly considering the description, service domain, and QoS information of a service, thereby improving clustering precision.

Currently, most existing methods cluster SOAP services by using the service function description (the WSDL document) to calculate the functional similarity between Web services. Liu et al. extracted four features of a Web service from its WSDL description text (content, context, host name, and Web service name) to cluster Web services. Elgazzar et al. analyzed WSDL documents and clustered them by functional similarity. Yu and Rege proposed a clustering method that improves service discovery with a service community learning algorithm. Ontologies are also commonly used for semantic similarity calculation and matching between Web services, to support service clustering and discovery. For example, Pop et al. designed a metric to assess the degree of matching between the ontology concepts describing two Semantic Web services, and clustered the services with an ant-based method to achieve efficient service discovery. Nayak et al. proposed discovering and clustering Web services with additional semantics, based on an agglomerative hierarchical clustering algorithm.

Clustering methods based on functional similarity also include those that use the tag information of services. For example, Wu et al. proposed a method called WTCluster, which uses tags to improve Web service clustering and discovery, and which integrates tag data and WSDL documents with an LDA model to obtain the probabilistic topic distribution of Web services, improving clustering precision. Aznag et al. proposed a matching method that uses a correlated topic model to extract topics from semantic service descriptions and to model the correlation between the extracted topics.

Non-functional factors, such as service context, relationships between services, and quality of service (QoS), are also used by many researchers to refine and enhance service discovery and clustering. For example, Zhou et al. proposed an improved fuzzy C-means algorithm for clustering based on the elements a service provides (inputs, outputs, and semantic relationships). Skoutas et al. ranked and clustered Web services using multi-criteria dominance relationships. Chen et al. described a hybrid QoS prediction method that alleviates the data sparsity problem of collaborative filtering. Kumara et al. proposed a cluster-based service recommendation method that clusters services by the semantic similarity and relevance between them and, through a filtering process, selects service clusters with better QoS values to serve the currently invoked service.

Summary of the invention:

To solve the problem of clustering SOAP services in a Web environment, the present invention extracts hidden term information from WSDL documents and divides service information into two classes, service self information and service context information, in order to calculate term characteristic values, and generates a special Bigraph hierarchical model from the calculated characteristic values. The similarity of SOAP services is calculated through the Bigraph hierarchical model. A density-based k-means algorithm preprocesses the data set, and a tissue P system combines the hierarchical Agnes algorithm, a genetic algorithm (GA), and a weighted fuzzy C-means (FCM) clustering algorithm. Together these form a SOAP service similarity calculation and clustering method based on the Bigraph structure.

In order to solve the above technical problem, the present invention provides the following technical solutions:

A kind of SOAP service similarity calculation and clustering method based on Bigraph structure, comprising the following steps:

First step: formal definitions

1.1, term definition:

Let TL = {T₁, T₂, …, T_n} be the set of terms in the service corpus, where n is the number of terms, and let A = {a₁, a₂, …, a_m} be the atomic vocabulary that makes up the terms in TL, i.e. words that cannot be subdivided further, where m is the number of atomic words. Define the term frequency f_{T_i} as the number of occurrences of term T_i, used to compute the total number of occurrences of all terms in the corpus TL, and the atomic word frequency f_{a_j} correspondingly, used to compute the total number of occurrences of all words. The calculation formulas are as follows:

$$Num_{TL} = \sum_{i=1}^{n} f_{T_i}, \qquad Num_{A} = \sum_{j=1}^{m} f_{a_j}$$

Num_TL is the total number of term occurrences in TL, and Num_A is the total number of atomic word occurrences;
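The counting in step 1.1 can be illustrated with a minimal sketch. It assumes, purely for illustration, that terms are CamelCase identifiers as in the NovelAuthor example later in the description; the helper names `split_term` and `corpus_counts` are hypothetical and not part of the patent.

```python
from collections import Counter
import re

def split_term(term):
    """Split a CamelCase term such as 'NovelAuthor' into atomic words.
    Assumption: every atomic word starts with one capital letter."""
    return re.findall(r"[A-Z][a-z0-9]*", term)

def corpus_counts(term_occurrences):
    """Return (term frequencies, word frequencies, Num_TL, Num_A)."""
    term_freq = Counter(term_occurrences)
    word_freq = Counter()
    for term, count in term_freq.items():
        for word in split_term(term):
            word_freq[word] += count
    num_tl = sum(term_freq.values())   # total term occurrences
    num_a = sum(word_freq.values())    # total atomic word occurrences
    return term_freq, word_freq, num_tl, num_a

terms = ["NovelAuthor", "FictionNovelAuthor", "NovelAuthor"]
term_freq, word_freq, num_tl, num_a = corpus_counts(terms)
```

Acronyms such as SOAP would be split letter by letter by this simple pattern, so a real tokenizer would need additional rules.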

1.2, tissue P system (P System) definition:

A tissue P system of degree 3, i.e. one with 3 data cells, is formally defined as the following 8-tuple:

ω=(OB₁,OB₂,OB₃,OR₁,OR₂,OR₃,OR',OE_o)

Wherein:

OB₁, OB₂ and OB₃ are the object sets of the tissue cells, i.e. the data cell collections;

OR₁, OR₂ and OR₃ are the evolutionary rules of the tissue cells, representing clustering rules based on the Agnes and k-means algorithms, on the weighted FCM algorithm, and on the GA, respectively;

OR' denotes the transport rules of the tissue cells in the whole P system; through the transport rules, cells can share and exchange objects with one another;

OE_o = 0 is the output region of the system and denotes the environment;

1.3, tissue object definition

In data clustering, the function of the tissue P system is to search for the optimal cluster centers of the data set to be clustered. The cluster centers of the data are therefore represented by one object: the tissue cell object T in the P system is defined as an N*d-dimensional vector, as follows:

T=(t₁₁, t₁₂..., t_1d..., t_i1, t_i2..., t_id..., t_N1, t_N2..., t_Nd)

where N indicates that the data cell T has N clusters C₁, C₂, …, C_N, whose cluster centers are t₁, t₂, …, t_N. Like a data point, each cluster center in the object is a d-dimensional vector, so t_i can be written as (t_i1, t_i2, …, t_id), i = 1, 2, …, N, where t_id is the d-th component of the i-th cluster center;

OB_i denotes the object set in the i-th evolution membrane of the P system. It contains a group of objects, which evolve through the evolutionary mechanisms of the different tissue cells. The initial number of objects in each evolution membrane is defined as m, and these objects form its object set Q. During the evolution of the P system, the system needs a measure to evaluate the quality of the current objects. Object quality is judged by the clustering performance function J_m, which computes the overall within-cluster variance of the samples:

$$J_m = \sum_{i=1}^{N} \sum_{s_j \in C_i} \lVert s_j - t_i \rVert^2$$

where s_j denotes a data point in cluster C_i. The smaller J_m is, the better the object. Ranking the objects by this measure gives each evolution membrane an optimal object, the local optimal object OB_ibest, and one optimal object is kept in the environment of the system, the global optimal object, denoted T_best. When the whole system halts, the global optimal object in the environment is the required solution, i.e. the optimal cluster centers;
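The object encoding and its evaluation can be sketched as follows. This is a minimal illustration, assuming J_m is the sum of squared Euclidean distances from each data point to its nearest cluster center (the usual within-cluster variance); the function name `within_cluster_variance` and the sample data are hypothetical.

```python
import math

def within_cluster_variance(obj, points, n_clusters):
    """Evaluate one object T, a flat list of n_clusters * d center coordinates.
    Assumption: each point belongs to the cluster of its nearest center."""
    d = len(obj) // n_clusters
    centers = [obj[i * d:(i + 1) * d] for i in range(n_clusters)]
    total = 0.0
    for p in points:
        # each point contributes its squared distance to the nearest center
        total += min(math.dist(p, c) ** 2 for c in centers)
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0.0, 1.0, 10.0, 1.0]   # centers (0, 1) and (10, 1)
bad = [5.0, 1.0, 5.0, 1.0]     # both centers at (5, 1)
j_good = within_cluster_variance(good, points, 2)
j_bad = within_cluster_variance(bad, points, 2)
```

A smaller value marks the better object, so `good` would be preferred over `bad` when ranking objects inside a membrane.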

Second step: characteristic value calculation, as follows:

2.1 Self-information characteristic value calculation

For a term T_i in the service corpus, its information content I(P_i) is calculated by information theory; on this basis, the characteristic value Spe(T_i) of the term T_i is assigned as follows:

Spe(T_i)=I (P_i) (3)

The term characteristic value is calculated from the joint probability distribution P{p_i, q_j}, where p_i ∈ P and q_j ∈ Q; p_i is a term selected from the term set TL and q_j is a word taken from the atomic vocabulary A, and {p₁, p₂, …, p_n} and {q₁, q₂, …, q_m} are represented by the random variables P and Q respectively. The mutual information of p_i and q_j is calculated by the following formula;

The characteristic value of term p_i is expressed as I(p_i, Q), which describes the relationship between term p_i and the vocabulary Q. Combined with the frequencies of terms and words in the corpus, the formula for the characteristic value of p_i is as follows:

Spe(T_i) ≈ I(P_i, Q) (5)

According to Bayes' theorem,

Finally, the self-information characteristic value SelfSpe(T_i) of a SOAP service is calculated as follows

Terms in a typical parsed WSDL document contain 1 to 2 words, so the quantity representing the words in a term is approximated as 1 in the calculation. θ is a weight set according to information theory, with a value range of 0 to 1;

2.2 Context information characteristic value calculation

According to information theory, the context information of a service is characterized by the entropy of the probability distribution of the term's modifiers. Its entropy is therefore calculated by the following formula;

where NT is the number of modifiers of term T_i, and (mod_m, T_i) is the probability that mod_m modifies term T_i. The entropy is the average information over all (mod_m, T_i). Within a specific domain the modifier distribution of a term is more concentrated, so the entropy of a term in a specific domain is lower. The context information characteristic value ContextSpe(T_i) of term T_i is calculated from the entropy as follows:

where 1 ≤ j ≤ K, K is the total number of modifiers sharing the same definition, and each modifier is indexed by j.

2.3 Hybrid characteristic value calculation

The self-information characteristic value and the context information characteristic value calculated by formulas (7) and (9) cover the features of the descriptive words as well as the information that the words themselves cannot describe. The hybrid characteristic value is finally obtained by formula (10) as follows:

The mixing coefficient α takes a value between 0 and 1 and is set to 0.65 according to experiments. After normalization, the self-information characteristic value, the context characteristic value, and the hybrid characteristic value of a service all lie between 0 and 1;
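Formula (10) is not reproduced in this text, so the following is only a sketch under a stated assumption: that the hybrid value is a convex combination in which α weights the self-information value and 1-α weights the context value, and that normalization is min-max scaling. Both choices, and the names `min_max_normalize` and `hybrid_spe`, are illustrative guesses rather than the patent's definition.

```python
def min_max_normalize(values):
    """Scale a list of values into [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_spe(self_spe, context_spe, alpha=0.65):
    """Assumed form of formula (10): alpha * SelfSpe + (1 - alpha) * ContextSpe."""
    return alpha * self_spe + (1.0 - alpha) * context_spe

self_values = min_max_normalize([2.0, 4.0, 6.0])
context_values = min_max_normalize([1.0, 1.5, 3.0])
hybrid_values = [hybrid_spe(s, c) for s, c in zip(self_values, context_values)]
```

With inputs already in [0, 1], any convex combination also stays in [0, 1], which matches the range stated above.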

Third step: domain weight calculation, as follows:

3.1 Domain weight calculation

The weight is reflected by the sibling terms: the greater the structural similarity, the greater the weight of the sibling terms. The calculation method is as follows.

where the sibling term set is that of a new term T_n, and HybridSpe(T_s) and HybridSpe(T_n) are the characteristic values of each sibling term and of the new term respectively. If the newly added term has no sibling terms, the weight is directly defined as 0.5. G_i is the current Bigraph structure. A bigraph (Bigraph), proposed by Turing Award winner Milner, is a pair B = <BP, BL>, where BP and BL are the place graph and the link graph respectively. BP is a triple BP = <V, E, P> composed of the node set V of the graph, the edge set E, and the interface P; in the place graph, nested nodes are in a parent-child relationship, and the nesting between nodes is expressed by branch relationships. Like BP, BL is a triple composed of the node set V, the edge set E, and the interface P, and it expresses the connection relationships between nodes;

3.2 Term weight calculation

The similarity of two terms is calculated by comparing their constituent words, as follows:

where the two counts denote the number of constituent words in terms T_i and T_n respectively, and the third denotes the number of words the two terms have in common. The more similar terms the related substructure of a new term contains, the higher the weight. The term weight is obtained from the term similarity, and the calculation formula is as follows.

where NP is the full set of the parent, sibling, and child terms of the term, and T_i is one of these terms;

Fourth step: generating the Bigraph hierarchical structure of terms:

A Bigraph hierarchical structure of the different terms is constructed, similar to the place graph of a Bigraph, in which each node represents a term object and the value of the node is the characteristic value of that term object. The Bigraph hierarchical structure is constructed from the top down, in the following steps:

4.1, the hybrid characteristic values of the terms in the WSDL documents and of the terms extracted from Google are calculated according to formula (10), put into an array A, and sorted in ascending order; the first 3 term objects are selected as three nodes of the Bigraph to form the initial Bigraph structure T;

4.2, each remaining term T_n in array A is added to the existing Bigraph hierarchical structure. If T_x, a term already in the Bigraph hierarchy, satisfies HybridSpe(T_n) - 0.3 < HybridSpe(T_x) < HybridSpe(T_n) + 0.3, then T_x is marked as a target node. These target nodes determine the target substructure position of T_n and thereby the candidate substructures;

4.3, by jointly considering the domain weight W_DS(G_i) and the term weight W_TS(G_i) of the new term and each candidate substructure, the final node weight is calculated by formula (14) to find the optimal substructure;

W_f(G_i)=ω W_DS(G_i)+(1-ω)W_TS(G_i) (14)

where ω is a coefficient ranging from 0 to 1. Steps 4.2-4.3 are run iteratively until all terms have been added to the Bigraph hierarchy;
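Steps 4.2 and 4.3 can be sketched as a window filter followed by the weighted sum of formula (14). The sketch assumes that the optimal substructure is the candidate with the largest W_f; the function names, the example values, and the precomputed (W_DS, W_TS) pairs are hypothetical.

```python
def target_nodes(new_value, hierarchy_values, window=0.3):
    """Step 4.2: terms already in the hierarchy whose HybridSpe lies
    strictly within +/- window of the new term's HybridSpe."""
    return [name for name, value in hierarchy_values.items()
            if new_value - window < value < new_value + window]

def final_weight(w_ds, w_ts, omega=0.5):
    """Formula (14): W_f = omega * W_DS + (1 - omega) * W_TS."""
    return omega * w_ds + (1.0 - omega) * w_ts

def best_substructure(candidates, omega=0.5):
    """Step 4.3, assuming the candidate with the largest W_f is optimal.
    candidates maps a substructure name to its (W_DS, W_TS) pair."""
    return max(candidates, key=lambda g: final_weight(*candidates[g], omega))

hierarchy = {"Author": 0.2, "NovelAuthor": 0.5, "ScienceFictionNovelAuthor": 0.95}
targets = target_nodes(0.6, hierarchy)
chosen = best_substructure({"G1": (0.8, 0.2), "G2": (0.6, 0.6)})
```

The value of ω trades off the two weights; 0.5 here is only a placeholder, since the text gives its range but not a specific setting.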

Fifth step: constructing the similarity matrix:

Similarity is calculated using the following formula:

where D is the maximum number of levels of the Bigraph hierarchical structure formed by the terms, and dis(T₁, T₂) is the shortest distance between the two terms T₁ and T₂ in the Bigraph hierarchical structure. This gives the similarity of SOAP services on one feature. The similarity of each feature of the SOAP services is calculated, the sum of the feature similarities is taken as the service similarity, and the similarity relationships between services are assembled into a similarity matrix;
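The similarity formula itself is not reproduced in this text, so the sketch below only computes its two inputs: the shortest distance dis(T₁, T₂) between two terms and the number of levels D of the hierarchy. It assumes the hierarchy is a tree stored as a child-to-parent mapping, which is an illustrative representation; the function names and the example hierarchy are hypothetical.

```python
from collections import deque

def build_adjacency(parent):
    """Turn a child -> parent mapping (root maps to None) into an undirected graph."""
    adj = {}
    for child, par in parent.items():
        adj.setdefault(child, set())
        if par is not None:
            adj.setdefault(par, set()).add(child)
            adj[child].add(par)
    return adj

def shortest_distance(parent, a, b):
    """Breadth-first search for the number of edges between terms a and b."""
    adj = build_adjacency(parent)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two terms are not connected

def max_levels(parent):
    """Number of levels D, counting the root as level 1."""
    def level(node):
        return 1 if parent[node] is None else 1 + level(parent[node])
    return max(level(node) for node in parent)

hierarchy = {
    "Author": None,
    "NovelAuthor": "Author",
    "FictionNovelAuthor": "NovelAuthor",
    "BookTitle": "Author",
}
dis = shortest_distance(hierarchy, "FictionNovelAuthor", "BookTitle")
D = max_levels(hierarchy)
```

Any similarity built from these two quantities would then be summed over features and written into the service similarity matrix, as the step describes.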

Sixth step: service clustering

Selecting cluster centers requires computing the overall within-cluster variance over the points of the data set. However, the data set contains many points that cannot serve as candidates, such as noise points and isolated points at the edges; these points not only affect the choice of cluster centers but also add computational cost, and the number of clusters must be specified manually in advance. In view of these shortcomings, the present invention proposes an improved, density-based K-means algorithm that computes the density of each point and takes high-density data points as cluster centers. The improved K-means algorithm performs a preprocessing clustering of the initial data set S to be clustered, where S consists of M data points of dimension d. The point density of the density-based K-means algorithm is calculated as follows:

where Density(S_i) is the total number of points within radius R of S_i, and the distance calculation sim(S_i, S_j) uses the similarity of services S_i and S_j.

The clustering process based on the density-based K-means algorithm is therefore as follows:

6.1, the data is preprocessed with the density-based K-means algorithm: the distances between the data points S_i are calculated, the data is divided into different clusters according to the radius R, the K points S_i with the highest density Density(S_i) are chosen as cluster centers, and the data is finally clustered by similarity. The process is as follows:

6.1.1 according to formula (16), the distance between each data point S_i in the organization object Q and each cluster center is calculated, the number of points around S_i in each cluster is determined, and the data set is sorted by density;

6.1.2 the K points S_k with the highest density, i.e. with the most points within radius R, are chosen as the new cluster centers C_k.

6.1.3 according to the distances between the divided clusters, the similarity sim(S_i, C_k) between each S_i and C_k is obtained. Given the average similarity Avesim, if sim(C_k, S_i) > Avesim, S_i is assigned to cluster C_k, finally yielding N clusters;
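Steps 6.1.1 to 6.1.3 can be sketched on a precomputed similarity matrix. Formula (16) is not reproduced in this text, so the sketch assumes that a point counts as a neighbour when its similarity is at least R, that Avesim is the mean over all distinct pairs, and that points not similar enough to any center are left out as noise. All names and the sample matrix are hypothetical.

```python
def make_sim(n, pairs, default=0.1):
    """Build a symmetric similarity matrix from a {(i, j): similarity} mapping."""
    sim = [[default] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0
    for (i, j), s in pairs.items():
        sim[i][j] = sim[j][i] = s
    return sim

def density_kmeans_preprocess(sim, radius, k):
    n = len(sim)
    # 6.1.1: density = number of other points whose similarity reaches the radius
    density = [sum(1 for j in range(n) if j != i and sim[i][j] >= radius)
               for i in range(n)]
    # 6.1.2: the k densest points become cluster centers
    centers = sorted(range(n), key=lambda i: density[i], reverse=True)[:k]
    # 6.1.3: assign a point to its most similar center if above the average similarity
    pair_sims = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    avesim = sum(pair_sims) / len(pair_sims)
    clusters = {c: [c] for c in centers}
    for i in range(n):
        if i in clusters:
            continue
        best = max(centers, key=lambda c: sim[i][c])
        if sim[i][best] > avesim:
            clusters[best].append(i)
    return density, centers, clusters

sim = make_sim(6, {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.4,
                   (3, 4): 0.8, (3, 5): 0.8, (4, 5): 0.4})
density, centers, clusters = density_kmeans_preprocess(sim, radius=0.5, k=2)
```

Because the centers are simply the densest points, two of them can fall in the same dense region; the subsequent Agnes merging in step 6.2 is what would reconcile such neighbouring clusters.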

6.2 Evolutionary rule of tissue cell O₁

O₁ uses Agnes as its evolutionary rule to guide the evolution of the objects in the cell. According to the preset inter-cluster similarity threshold Cs, the N initial clusters obtained by the density-based k-means algorithm are merged by the Agnes algorithm, as follows:

6.2.1 the similarity matrix D is constructed from the average similarity dis(C_i, C_j) of the data in any two clusters C_i and C_j

$$dis(C_i, C_j) = \frac{1}{U V} \sum_{S_X \in C_i} \sum_{S_Y \in C_j} sim(S_X, S_Y)$$

where S_X is a data point in cluster C_i, S_Y is a data point in cluster C_j, and U and V are the numbers of data points in C_i and C_j respectively;

6.2.2 the clusters C_i and C_j with the largest dis(C_i, C_j) are selected; according to the inter-cluster similarity threshold Cs, if dis(C_i, C_j) > Cs, clusters C_i and C_j are merged;

6.2.3 step 6.2.2 is repeated until the similarity threshold requirement is met between all clusters;
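The merging loop of steps 6.2.1 to 6.2.3 can be sketched as average-linkage agglomeration on a similarity matrix. The sketch assumes clusters are lists of point indices and that merging stops once no pair exceeds Cs; the function names and the sample matrix are hypothetical.

```python
def average_similarity(sim, ci, cj):
    """Mean pairwise similarity between the points of two clusters."""
    return sum(sim[x][y] for x in ci for y in cj) / (len(ci) * len(cj))

def agnes_merge(sim, clusters, cs):
    """Repeatedly merge the most similar pair of clusters while it exceeds cs."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best, (i, j) = max(
            (average_similarity(sim, clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best <= cs:
            break  # every remaining pair meets the threshold requirement
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
merged = agnes_merge(sim, [[0], [1], [2], [3]], cs=0.5)
```

Recomputing all pairwise averages on every pass keeps the sketch short; a practical implementation would cache the similarity matrix D between merges.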

6.3 Evolutionary rule of tissue cell O₂

O₂ uses the FCM algorithm based on sample weighting as its evolutionary rule to guide the evolution of the objects in the cell. The objective function and cluster center calculation of the traditional FCM algorithm ignore the differences between samples and treat all samples equally. This makes the algorithm prone to amplifying the influence of isolated points or noise in the data set, which reduces the contribution of important samples to the clustering and lowers clustering accuracy. To reduce the effect of sample differences on clustering, the present invention proposes an FCM clustering algorithm based on sample weighting, which improves the clustering by reasonably weighting the objective function and the cluster center function;

For the data set S = {s₁, s₂, …, s_n}:

6.3.1 the FCM membership degree is calculated according to the following formula:

where u_ij is the membership degree of the i-th data point in the j-th cluster (the i-th data point is assigned to the cluster j with the largest membership degree), ||s_i - t_j|| is the Euclidean distance from data point s_i to cluster center t_j, and n is the number of data points. The membership degrees sum to 1, i.e. the normalization constraint is satisfied, j = 1, 2, …, n;

6.3.2 the weight and entropy information are calculated

Thermodynamic entropy represents the degree of disorder of information. Based on the definition of entropy, the present invention analyzes the membership degrees of the data and applies sample weighting to the FCM objective function. First, the entropy variable E_i is defined to represent the validity of the membership degrees u_ij, and the weight w_i is calculated to measure the influence of data point s_i on this clustering. Their calculation formulas are as follows:

6.3.3 the new objective function is calculated from E_i and w_i

The weight coefficient w_i satisfies its constraint, and the objective function F(S, t) of FCM is then redefined as in formula (22):

m is the weighting exponent, an integer greater than or equal to 1. To find the extremum of the objective function under the constraints, the method of Lagrange multipliers is used to construct the following function from the new objective function:

The optimality conditions for the extremum of the objective function are as follows:

The new cluster center t_j is calculated as:

The membership degree u_ij is updated, and the i-th data point is assigned to the cluster with the largest membership degree

6.3.4 if |F(S, t)_{i-1} - F(S, t)_i| is greater than the set threshold, step 6.3.3 is repeated; otherwise the algorithm terminates and outputs the result. F(S, t)_i denotes the FCM objective function value obtained in the i-th iteration;
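The membership formula of step 6.3.1 is not reproduced in this text. As an illustration, the sketch below uses the standard (unweighted) FCM membership update, u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)), which is an assumption about what the missing formula contains. It does not implement the entropy E_i, the weights w_i, or the weighted objective of steps 6.3.2 and 6.3.3. The default m = 2 is also an assumption, chosen because this standard form is undefined for m = 1.

```python
import math

def fcm_memberships(points, centers, m=2):
    """Standard FCM membership degrees: one row per point, one column per center."""
    exponent = 2.0 / (m - 1)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # the point coincides with a center: give that center full membership
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(row)
            row = [r / total for r in row]
        else:
            row = [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]
        memberships.append(row)
    return memberships

points = [(0.0, 0.0), (4.0, 0.0), (1.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships(points, centers)
# hard assignment: each point goes to the cluster with the largest membership
labels = [row.index(max(row)) for row in u]
```

In the weighted variant described above, these memberships would feed the entropy and weight calculation before the cluster centers are updated.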

6.4 Evolutionary rule of tissue cell O₃

O₃ uses the three genetic operations of the GA (selection, crossover, and mutation) as its evolutionary rule to guide the evolution of the objects in the cell. The evolution steps are as follows:

6.4.1 O₃ merges the m objects in its own cell with the objects transported from the other two tissue cells into a new object evolution pool P;

6.4.2 O₃ performs selection, crossover, and mutation on the new object evolution pool P. Selection uses the elitist (best-preserving) strategy, and crossover and mutation use integer-form crossover and single-point mutation. The specific method is as follows:

6.4.2.1 the evaluation value p_k of each object k is calculated, where N is the number of clusters and t_i is the center of the i-th cluster. The smaller p_k is, the more suitable the classification, and the more easily the object is passed on to the next generation.

6.4.2.2 the fitness function fitness_k of each object k is defined as

fitness_k = α(1-α)^(index-1) (30)

where α is a preset parameter with a value range of 0 to 1, and index is the number of iterations.

6.4.2.3 selection: based on the proportion of each object's fitness

where u is the total number of objects in the object pool. For each object, a random number p is generated in a loop; if p < Cif_k, the object is passed on to the next generation;

6.4.2.4 the crossover position is determined by the crossover probability Pc. Two objects are selected at random from the evolution pool for crossover, and each component of the objects is traversed: if the random number p generated in the loop satisfies p < Pc, the components of the two objects after that position are exchanged and the traversal ends;

6.4.2.5 the mutation probability P_m is defined. For each object a random probability p is generated; if p is less than the mutation probability P_m, let z be the mutation point (i.e. some component) of the object determined by P_m. The value after mutation is z_θ, and the mutated object is expressed as:

where δ ∈ [0, 1] is a randomly generated number, and the + and - signs occur at random according to a probability;

6.4.3 steps 6.4.1-6.4.2 are repeated. To keep the number of objects in the evolution pool stable, O₃ screens the evolved objects and eliminates them according to their fitness, retaining the m objects with the highest fitness to form the new object evolution pool P';
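Two of the operations above are specified closely enough to sketch: the fitness of formula (30) and the single-point crossover of step 6.4.2.4. The selection proportion Cif_k and the mutation formula are not reproduced in this text and are therefore left out. The function names are hypothetical, and the explicit `rng` argument is an illustrative choice that makes runs repeatable.

```python
import random

def fitness(alpha, index):
    """Formula (30): fitness = alpha * (1 - alpha) ** (index - 1)."""
    return alpha * (1.0 - alpha) ** (index - 1)

def crossover(a, b, pc, rng):
    """Step 6.4.2.4: walk over the components; at the first position where a
    random draw falls below pc, swap everything after that position and stop."""
    a, b = list(a), list(b)
    for pos in range(len(a)):
        if rng.random() < pc:
            a[pos + 1:], b[pos + 1:] = b[pos + 1:], a[pos + 1:]
            break
    return a, b

rng = random.Random(0)
always = crossover([1, 2, 3], [7, 8, 9], pc=1.0, rng=rng)
never = crossover([1, 2, 3], [7, 8, 9], pc=0.0, rng=rng)
```

With pc = 1.0 the swap always happens at the first position, and with pc = 0.0 the objects pass through unchanged, which gives two easy boundary checks for an implementation.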

Seventh step: the data cells update the global optimal object according to the transport rules

Transport channels exist between the cell membranes of the tissue cells in the system, through which objects are shared and exchanged between different tissue cells. This requires support from the transport rules defined by the system. In the designed tissue P system, transport rules are defined to guide the exchange of information between tissue cells, as follows:

(x, T₁, T₂, …, T_m / T'₁, T'₂, …, T'_m, y), x ≠ y, x, y = 1, 2, 3.

This transport rule means that tissue cell x and tissue cell y can transport objects in both directions, where T₁, T₂, …, T_m are m objects of tissue cell x and T'₁, T'₂, …, T'_m are m objects of tissue cell y. The rule achieves the following effects:

7.1) the m objects T₁, T₂, …, T_m in tissue cell x are transported into tissue cell y,

7.2) the m objects T'₁, T'₂, …, T'_m in tissue cell y are transported into tissue cell x;

(x, T_xbest / T_best, OE_o), x ≠ y, x, y = 1, 2, 3.

This transport rule means that tissue cell x transports to the system environment, where T_xbest is the local optimal object in the currently computing tissue cell x and T_best is the global optimal object in the current environment. Through this rule, the optimal object in tissue cell x is transported to the environment and the global optimal object of the environment is updated at the same time;

Eighth step: halting and output

Each tissue cell in the system runs as an independent execution unit and evolves in a parallel structure, so the system is parallel and distributed. In this system, a series of calculation steps is defined as a computation. A computation starts from the tissue cells containing the initial data cell object sets, and in each computation one or more evolutionary rules are applied to the current data cell object sets. When the halting condition of the system is reached, the system halts automatically and the result is presented in the external environment of the system.

To reduce the complexity of the system, a simple halting condition based on the maximum number of computations is used: the tissue P system halts when it has executed the preset maximum number of computations, and outputs the global optimal object set in the current environment.
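The environment update of the seventh step and the halting condition of the eighth step can be sketched as a simple driver loop. This is a sequential stand-in for the parallel system: the cells are visited one after another, the object exchange between cells is omitted, and the evolutionary rules are passed in as plain functions. The name `run_tissue_p_system` and the toy rules are hypothetical.

```python
def run_tissue_p_system(cells, rules, evaluate, max_computations):
    """cells: one list of objects per tissue cell.
    rules: one function per cell, mapping its objects to evolved objects.
    evaluate: quality measure of an object, smaller is better (like J_m)."""
    global_best = None  # the object kept in the environment
    for _ in range(max_computations):  # halt after the preset number of computations
        for idx, rule in enumerate(rules):
            cells[idx] = rule(cells[idx])
            local_best = min(cells[idx], key=evaluate)
            # transport rule: send the local best to the environment if it is better
            if global_best is None or evaluate(local_best) < evaluate(global_best):
                global_best = local_best
    return global_best

# Toy example: objects are numbers, and the best object is the one closest to 3.
cells = [[0], [10], [5]]
rules = [
    lambda objs: [o + 1 for o in objs],
    lambda objs: [o - 1 for o in objs],
    lambda objs: list(objs),
]
best = run_tissue_p_system(cells, rules, evaluate=lambda o: abs(o - 3),
                           max_computations=4)
```

In the method described above, the three rule functions would be the Agnes, weighted FCM, and GA steps, and `evaluate` would be the within-cluster variance of an object.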

The beneficial effects of the invention are as follows. A special Bigraph hierarchical model is generated by extracting hidden term information from WSDL documents. Based on the constituent words, service information is divided into two classes, service self information and service context information, and a new method for calculating term characteristic values is introduced. Most terms are compound terms with a group of modifiers; self information is an important group of internal features representing the information of the domain corpus, and context information helps compensate for the shortcomings of the service's self information. The final characteristic value is calculated from the combination of self information and context information, so similarity can be calculated more accurately.

At the same time, a density-based k-means algorithm is used together with a tissue P system that takes the hierarchical Agnes algorithm, the genetic algorithm (GA), and the weighted fuzzy C-means (FCM) clustering algorithm as evolutionary rules, which effectively combines the advantages of these three clustering algorithms and obtains better clustering results.

Brief description of the drawings:

Fig. 1 is the SOAP service similarity calculation flow chart based on Bigraph structure.

Fig. 2 is term set.

Specific embodiment

The invention will be further described below in conjunction with the accompanying drawings

Referring to Fig. 1 and Fig. 2, a SOAP service similarity calculation and clustering method based on the Bigraph structure in a Web environment. The self information of a service comprises the words that make up a term and the internal structure of the term. The most important terms are compound terms, which help express the meaning of a term. The internal structure of each constituent word or term determines the characteristic value of the service: if a term contains several words, the overall specificity of the term is greater than that of any of its internal words. A term that carries more semantics therefore has higher specificity.

For example, consider three compound terms: T₁ = NovelAuthor, T₂ = FictionNovelAuthor and T₃ = ScienceFictionNovelAuthor. In these three terms, Novel, Fiction, and Science are modifiers and Author is the head word being modified. The compound term T₁ is the combination of the modifier Novel and the head word Author, so Author and Novel can be considered to have a subordinate hierarchical relationship. Similarly, T₂ is formed by adding a new modifier Fiction to T₁, and T₃ by adding Science to T₂. As the number of modifiers grows, a compound term has a more specific meaning and hence a higher characteristic value, so the characteristic values of T₁, T₂, and T₃ are ordered as T₁ < T₂ < T₃.

It is as follows in conjunction with Fig. 1 specific implementation step:

First step: formal definitions

1.1 Term definition:

Let TL = {T₁, T₂, …, T_n} be the set of terms in the service corpus (Fig. 2 shows an example of TL), where n is the number of terms. Let A = {a₁, a₂, …, a_m} be the atomic vocabulary from which the terms in TL are formed, i.e. words that cannot be subdivided further, where m is the number of atomic words. The frequency of a term T_i is defined from the number of occurrences of T_i and the total number of occurrences of all terms in the corpus TL; the frequency of an atomic word is defined correspondingly from the total number of occurrences of all words. The formulas are as follows:

Num_TL is the total number of terms in TL, and Num_A is the total number of occurrences of atomic words. In Fig. 2, Num_TL = 8 and Num_A = 21;
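The frequency formulas themselves are not reproduced in this text, so the following is only a sketch under a stated assumption: each frequency is taken to be a relative frequency, i.e. a count divided by the corresponding total (Num_TL for terms, Num_A for atomic words). The function name `frequencies` and the CamelCase splitting are hypothetical.

```python
import re
from collections import Counter

def frequencies(term_occurrences):
    """Return (term_freq, word_freq, num_tl, num_a) for a list of term occurrences.

    Assumption: frequency = occurrences / total occurrences.
    """
    term_counts = Counter(term_occurrences)
    # Atomic words are obtained by splitting CamelCase terms (assumed convention).
    word_counts = Counter(
        w for t in term_occurrences for w in re.findall(r"[A-Z][a-z0-9]*", t)
    )
    num_tl = sum(term_counts.values())
    num_a = sum(word_counts.values())
    term_freq = {t: c / num_tl for t, c in term_counts.items()}
    word_freq = {w: c / num_a for w, c in word_counts.items()}
    return term_freq, word_freq, num_tl, num_a

term_freq, word_freq, num_tl, num_a = frequencies(
    ["NovelAuthor", "FictionNovelAuthor", "NovelAuthor"]
)
```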

1.2 Tissue P system definition:

ω = (OB₁, OB₂, OB₃, OR₁, OR₂, OR₃, OR′, OE_o)

Wherein:

OE_o = 0 is the output region of the system and represents the environment;

1.3 Tissue object definition

In a data clustering algorithm, the function of the tissue P system is to search for the optimal cluster centers of the data set being clustered. The cluster centers of the data can therefore be represented by a group of objects. The present invention defines a tissue cell object T in the P system as a vector of dimension N*d, as follows:

T=(t₁₁, t₁₂..., t_1d..., t_i1, t_i2..., t_id..., t_N1, t_N2..., t_Nd)

OB_i denotes the object set in the i-th evolution membrane of the P system. It contains a group of objects, which evolve through the evolutionary mechanisms of the different tissue cells. The initial number of objects in each evolution membrane is defined as m, and these objects form its object set Q. During the evolution of the P system, the system needs a metric to evaluate how good the current objects are. The present invention judges the quality of an object with the clustering performance function J_m, which computes the overall within-cluster variance of the samples, where s_j denotes a data set in a data cluster. The smaller the value of J_m, the better the object. By ranking the objects in this way, each evolution membrane has its own best object, the local optimum object OB_ibest, and the environment of the system stores one best object, the global optimum object, denoted T_best. When the whole system reaches the halting state, the global optimum object in the environment is the required solution and the optimal set of cluster centers;
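As an illustration of how an object encodes cluster centers and how objects can be ranked, here is a minimal sketch. The formula for J_m is not reproduced in this text, so the code assumes the usual within-cluster sum of squared Euclidean distances from each sample to its nearest center; the function names are hypothetical.

```python
import numpy as np

def unpack_centers(obj, n_clusters, dim):
    """Reshape a flat N*d object vector into N cluster centers of dimension d."""
    return np.asarray(obj, dtype=float).reshape(n_clusters, dim)

def within_cluster_variance(obj, data, n_clusters):
    """Assumed J_m: sum of squared distances of each sample to its nearest center."""
    data = np.asarray(data, dtype=float)
    centers = unpack_centers(obj, n_clusters, data.shape[1])
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def best_object(objects, data, n_clusters):
    """Local optimum of a membrane: the object with the smallest J_m."""
    return min(objects, key=lambda o: within_cluster_variance(o, data, n_clusters))

data = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 1, 10, 1]   # centers (0, 1) and (10, 1)
bad = [0, 0, 1, 1]     # centers (0, 0) and (1, 1)
```

With this data, `best_object([bad, good], data, 2)` selects `good`, since its centers sit in the middle of the two groups of points.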

Second step: characteristic value calculation

The corresponding features are extracted from the WSDL, for example the five features "service name", "port name", "operation name", "input information" and "output information" shown in Fig. 1, and a service corpus is constructed. The characteristic values are calculated as follows:

2.1 Self characteristic value calculation

Find a term T_i in the service corpus; its information content I(P_i) can be calculated by information theory. On this basis, the characteristic value Spe(T_i) of term T_i is assigned as follows:

Spe(T_i)=I (P_i)

The characteristic value of a term is calculated from the joint probability distribution P{p_i, q_j}, where p_i ∈ P and q_j ∈ Q; p_i is a term selected from the term set TL and q_j is a word taken from the atomic vocabulary A, and {p₁, p₂, …, p_n} and {q₁, q₂, …, q_m} are represented by the random variables P and Q respectively. The mutual information of p_i and q_j is calculated by the following formula:

Spe(T_i) ≈ I(p_i, Q)

According to Bayes' theorem,

2.2 Contextual information characteristic value calculation

According to information theory, the contextual information feature of a service is the entropy of the probability distribution of the words that modify a term. The entropy is calculated by the following formula:

where NT is the number of modifiers of term T_i and (mod_m, T_i) is the probability that mod_m modifies term T_i; the entropy is the average information computed over all (mod_m, T_i). Within a specific domain the modifier distribution of a term is more concentrated, so the entropy of a term within a specific domain is lower. The contextual information characteristic value ContextSpe(T_i) of term T_i can be calculated from the entropy as follows:

where 1 ≤ j ≤ K, K is the total number of qualifiers sharing the same defined term, and the indexed symbol denotes each qualifier;
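The entropy step can be sketched as follows. The exact formula is not reproduced in this text, so the code assumes the standard Shannon entropy over the empirical modifier distribution, with base-2 logarithms; the mapping from entropy to ContextSpe(T_i) is left out because its form is not given here. The function name `modifier_entropy` is hypothetical.

```python
import math
from collections import Counter

def modifier_entropy(modifiers):
    """Shannon entropy (bits) of the modifiers observed for one term.

    `modifiers` lists every observed modifier occurrence for the term;
    the probability of each modifier is estimated by its relative count.
    """
    counts = Counter(modifiers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

concentrated = modifier_entropy(["Novel", "Novel", "Novel", "Novel"])
spread = modifier_entropy(["Novel", "Fiction", "Science", "Travel"])
```

A term that is always modified by the same word has entropy 0, while a term with four equally likely modifiers has entropy 2 bits, which reflects the statement that a concentrated modifier distribution gives a lower entropy.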

2.3 Hybrid characteristic value calculation

The self characteristic value and the contextual information characteristic value calculated by formulas (7) and (9) cover both the features that describe the word and the information that the word itself cannot describe. The hybrid characteristic value is finally obtained by formula (10) as follows:

The mixing coefficient α takes a value between 0 and 1 and is set to 0.65 based on experiment. After normalization, the self characteristic value, the contextual characteristic value and the hybrid characteristic value of a service all lie between 0 and 1.
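Formula (10) is not reproduced in this text. A plausible reading, shown here purely as an assumption, is a convex combination in which α weights the self characteristic value and 1 - α weights the contextual one, followed by min-max normalization. Both function names are hypothetical.

```python
def hybrid_spe(spe, context_spe, alpha=0.65):
    """Assumed form of the hybrid value: alpha * Spe + (1 - alpha) * ContextSpe."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * spe + (1.0 - alpha) * context_spe

def min_max_normalize(values):
    """Scale values into [0, 1]; one possible normalization, assumed here."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Because both inputs are normalized to [0, 1] and α lies in [0, 1], a combination of this form also stays in [0, 1], which is consistent with the range stated in the text.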

Third step: domain weight calculation

3.1 Domain weight calculation

A weight based on the domain characteristic value is needed while the Bigraph structure is generated. The size of the weight is reflected by sibling terms: the greater the structural similarity, the greater the weight of the sibling terms. The calculation is as follows:

where the sibling term set is that of a new term T_n, and HybridSpe(T_s) and HybridSpe(T_n) are the characteristic values of each sibling term and of the new term respectively. If the newly added term has no sibling terms, the weight is directly defined as 0.5. G_i is the current Bigraph structure;

3.2 Term weight calculation

where the two word counts are the numbers of component words in terms T_i and T_n respectively, and the shared count is the number of words the two terms have in common. The more similar related-substructure terms a new term contains, the higher the weight. The term weight can be obtained from the term similarity, with the following formula:

Fourth step: generate the Bigraph hierarchy of terms:

The present invention proposes a term Bigraph hierarchy construction algorithm that builds the Bigraph hierarchy of the different terms, similar to the place graph of a Bigraph. Each node of the Bigraph represents a term object, and the value of the node represents the characteristic value of that term object. The Bigraph hierarchy is constructed from top to bottom, with the following steps:

4.1 Calculate, according to formula (10), the hybrid characteristic values of the terms extracted from the WSDL documents and from Google, put them into array A and sort them in ascending order; select the first 3 term objects as three Bigraph nodes to form the initial Bigraph structure T;

4.2 Each remaining term T_n in array A is added to the existing Bigraph hierarchy. If T_x satisfies HybridSpe(T_n) - 0.3 < HybridSpe(T_x) < HybridSpe(T_n) + 0.3, mark T_x as a target node, where T_x is a term in the existing Bigraph hierarchy; these target nodes determine the target substructure position of T_n and hence the candidate Bigraph structures;

4.3 Taking into account both the domain weight W_DS(G_i) and the term weight W_TS(G_i) of the new term with respect to a candidate Bigraph structure, the final node weight is calculated by the following formula in order to find the best Bigraph structure;

W_f(G_i)=ω W_DS(G_i)+(1-ω)W_TS(G_i)
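Steps 4.2 and 4.3 can be sketched as follows. The ±0.3 window and the weighted sum come from the text; the data layout (a dictionary of terms to hybrid values, and a dictionary of candidate structures to weight pairs), the default ω = 0.5 and the function names are assumptions made for illustration.

```python
def target_nodes(new_value, existing, window=0.3):
    """Terms T_x whose hybrid value lies strictly within +/- window of the new term's."""
    return [
        term for term, value in existing.items()
        if new_value - window < value < new_value + window
    ]

def final_weight(w_ds, w_ts, omega=0.5):
    """W_f(G_i) = omega * W_DS(G_i) + (1 - omega) * W_TS(G_i)."""
    return omega * w_ds + (1.0 - omega) * w_ts

def best_structure(candidates, omega=0.5):
    """Pick the candidate structure with the highest final weight.

    `candidates` maps a structure name to its (W_DS, W_TS) pair.
    """
    return max(candidates, key=lambda g: final_weight(*candidates[g], omega))

existing = {"A": 0.1, "B": 0.4, "C": 0.79, "D": 0.9}
targets = target_nodes(0.5, existing)
choice = best_structure({"G1": (0.8, 0.4), "G2": (0.5, 0.9)})
```

In this example only B and C fall inside the window around 0.5, and G2 is preferred because its combined weight (0.7) exceeds that of G1 (0.6).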

Fifth step: construct the similarity matrix:

The similarity is calculated using the following formula:

Sixth step: service clustering

Selecting cluster center points requires computing the overall within-cluster variance for the points in the data set. However, the data set contains many points that are not suitable candidates, such as noisy data and isolated points at the edges. These points not only affect the choice of cluster centers but also add computational cost, and the number of data clusters has to be specified manually in advance. In view of these drawbacks, an improved K-means algorithm based on density is proposed. It calculates the density of each point and extracts the high-density data points as cluster centers. The improved K-means algorithm is used to pre-cluster the initial data set S to be clustered, where S consists of M data points of dimension d. The point density of the density-based K-means algorithm is calculated as follows:

where Density(S_i) is the total number of points within radius R of S_i, and the distance is computed from sim(S_i, S_j), the similarity between services S_i and S_j;

6.1.1 According to formula (16), calculate the distance from each data point S_i to each data cluster center in the tissue object Q, determine the number of points around S_i in each data cluster, and sort the data set by density;

6.1.2 Select the K points S_k with the highest density, i.e. with the most points within radius R, as the new data cluster centers C_k;
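The density computation and the choice of the K densest points can be sketched as follows. The density formula is not reproduced in this text; the code assumes that the distance between two services is 1 minus their similarity and that the similarity matrix has ones on its diagonal. The function names are hypothetical.

```python
import numpy as np

def densities(sim, radius):
    """Density(S_i): number of other points whose distance to S_i is within radius.

    Assumption: distance = 1 - similarity, with sim[i][i] == 1.
    """
    dist = 1.0 - np.asarray(sim, dtype=float)
    within = dist <= radius
    return within.sum(axis=1) - 1  # exclude the point itself

def top_k_centers(sim, radius, k):
    """Indices of the k densest points, used as initial cluster centers."""
    dens = densities(sim, radius)
    order = np.argsort(-dens, kind="stable")
    return order[:k].tolist()

sim = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.5, 0.1],
    [0.9, 0.5, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
centers = top_k_centers(sim, radius=0.2, k=2)
```

Point 3 has no neighbors within the radius, so it behaves like the isolated points the text wants to keep out of the set of cluster centers.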

6.2 Evolutionary rule of tissue cell O₁

O₁ uses Agnes as its evolutionary rule to guide the evolution of the objects inside the cell. Given a preset inter-cluster similarity threshold Cs, the N initial clusters obtained by the density-based k-means algorithm are merged by the Agnes algorithm. The process is as follows:
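A minimal sketch of threshold-based agglomerative merging is given here. It follows the procedure set out in the claims: compute the average similarity between every pair of clusters, merge the most similar pair if its similarity exceeds Cs, and repeat. The function names and the list-of-index-lists representation are assumptions.

```python
import numpy as np

def average_similarity(sim, a, b):
    """Mean pairwise similarity between the points of cluster a and cluster b."""
    return float(sim[np.ix_(a, b)].mean())

def agnes_merge(sim, clusters, cs):
    """Merge clusters while the most similar pair has average similarity above cs."""
    sim = np.asarray(sim, dtype=float)
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = average_similarity(sim, clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best <= cs:
            break  # no remaining pair exceeds the threshold
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
merged = agnes_merge(sim, [[0], [1], [2], [3]], cs=0.5)
```

The pairwise search is quadratic in the number of clusters, which is acceptable for a sketch because the input is the small set of initial clusters produced by the pre-clustering step.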

6.3 Evolutionary rule of tissue cell O₂

O₂ uses the FCM algorithm based on sample weighting as its evolutionary rule to guide the evolution of the objects inside the cell. The objective function and cluster center calculation of the traditional FCM algorithm do not consider differences between samples and treat all samples equally. This makes it easy to amplify the influence of isolated points or noisy data in the data set, which reduces the contribution of important samples to the clustering and lowers clustering accuracy. To reduce the effect of sample differences on the clustering result, an FCM clustering algorithm based on sample weighting is proposed, which improves the clustering result by applying reasonable weights to the objective function and to the cluster center function;

For the data set S = {s₁, s₂, …, s_n},

where u_ij is the membership degree of the i-th data point in the j-th cluster, and the i-th data point is assigned to the data cluster j in which its membership is greatest; ||s_i - t_j|| is the Euclidean distance from data point s_i to cluster center t_j, and n is the number of data points. The membership degrees of each data point sum to 1, that is, they satisfy the normalization constraint, j = 1, 2, …, n;
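For orientation, the textbook (unweighted) FCM membership update is sketched here. The membership formula of the patent is not reproduced in this text, and the sample-weighted variant it proposes may differ, so this is an assumption that only illustrates the normalization property and the assignment of a point to its highest-membership cluster. The fuzzifier default m = 2 and the function name are also assumptions.

```python
import numpy as np

def fcm_memberships(data, centers, m=2.0):
    """Standard FCM membership: u_ij proportional to ||s_i - t_j||^(-2/(m-1)).

    Each row is normalized so that the memberships of one data point sum to 1.
    """
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    dist = np.maximum(dist, 1e-12)  # avoid division by zero when a point sits on a center
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

u = fcm_memberships([[0, 0], [1, 0], [9, 0]], [[0, 0], [10, 0]])
labels = u.argmax(axis=1).tolist()
```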

6.3.2 Calculate the weights and entropy information

Entropy represents the degree of disorder of information. Based on the definition of entropy, the data membership degrees are analyzed and the FCM objective function is weighted by sample. First the entropy variable E_i is defined to represent the effectiveness of the membership degree u_ij, and the weight w_i is calculated to measure the influence of data point s_i on this clustering. Their formulas are as follows:

6.3.3 Calculate the new objective function from E_i and w_i

With the weight coefficients w_i satisfying their constraint, the newly defined FCM objective function F(S, t) is as follows:

The optimality conditions for the extremum of the objective function are as follows:

The new cluster center t_j is calculated as:

Update the membership degree u_ij and assign the i-th data point to the data cluster in which its membership is greatest;

6.4 Evolutionary rule of tissue cell O₃

6.4.2.2 Define the fitness function fitness_k of each object k

fitness_k = α(1-α)^(index-1)

where α is a preset parameter with a value in the range 0 to 1, and index is the number of iterations;
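The fitness function can be written directly from the formula above. The function name and the input checks are additions for illustration; the text only states that α lies between 0 and 1.

```python
def fitness(alpha, index):
    """fitness_k = alpha * (1 - alpha) ** (index - 1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if index < 1:
        raise ValueError("index must be at least 1")
    return alpha * (1.0 - alpha) ** (index - 1)

values = [fitness(0.5, i) for i in (1, 2, 3)]
```

For a fixed α strictly between 0 and 1 the value decays geometrically as index grows, so objects with a larger index receive a smaller share in the selection step.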

6.4.2.3 Selection operation, based on each object's share of the total fitness

Seventh step: the data cells update the global optimum object according to the transport rules

Transport channels exist between the cell membranes of the tissue cells in the system, and different objects are shared and exchanged between different tissue cells. This requires the support of transport rules defined by the system. In the tissue P system designed here, transport rules are defined to guide the exchange of information between tissue cells. The specific rules are as follows:

(x, T₁, T₂, …, T_m / T′₁, T′₂, …, T′_m, y), x ≠ y, x, y = 1, 2, 3.

This transport rule states that tissue cell x and tissue cell y can transport objects in both directions, where T₁, T₂, …, T_m are m objects of tissue cell x and T′₁, T′₂, …, T′_m are m objects of tissue cell y. The transport rule achieves the following effects:

7.1) the m objects T₁, T₂, …, T_m in tissue cell x are transported into tissue cell y,

7.2) the m objects T′₁, T′₂, …, T′_m in tissue cell y are transported into tissue cell x,

(x, T_xbest / T_best, OE_o), x ≠ y, x, y = 1, 2, 3.
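The two transport rules can be sketched as operations on plain Python containers. The sketch assumes that the cell-to-cell rule swaps the first m objects of the two cells and that a lower score means a better object, in line with the J_m criterion; the text does not say which m objects are moved, so that choice and the function names are hypothetical.

```python
def exchange(cells, x, y, m):
    """Cell-to-cell rule: swap the first m objects of cell x and cell y (assumed choice)."""
    if x == y:
        raise ValueError("x and y must be different cells")
    cells[x][:m], cells[y][:m] = cells[y][:m], cells[x][:m]

def update_environment(cells, x, score, global_best=None):
    """Cell-to-environment rule: send the local optimum of cell x to the environment.

    Returns the new global optimum; lower score is better.
    """
    local_best = min(cells[x], key=score)
    if global_best is None or score(local_best) < score(global_best):
        return local_best
    return global_best

cells = {1: ["a", "b"], 2: ["c", "d"]}
exchange(cells, 1, 2, m=1)

numeric_cells = {1: [3, 1, 2]}
new_best = update_environment(numeric_cells, 1, score=lambda v: v, global_best=2)
```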

Eighth step: halting and output

Claims

1. A SOAP service similarity calculation and clustering method based on a Bigraph structure, characterized in that the method comprises the following steps:

First step: formal definitions;

Second step: characteristic value calculation;

Third step: domain weight calculation;

Fourth step: generate the Bigraph hierarchy of terms:

The Bigraph hierarchy of the different terms is constructed, similar to the place graph of a Bigraph, wherein each node of the Bigraph represents a term object, the value of the node represents the characteristic value of that term object, and the Bigraph hierarchy is constructed from top to bottom;

Fifth step: construct the similarity matrix:

The similarity is calculated using the following formula:

where D is the maximum number of layers of the Bigraph hierarchy formed by the terms, and dis(T₁, T₂) is the shortest distance between the two terms T₁ and T₂ in the Bigraph hierarchy; this gives the similarity of SOAP services on one feature. The similarity of each feature of the SOAP services is calculated, the sum of the feature similarities is taken as the similarity of the services, and the similarity relations between services are assembled into a similarity matrix;

Sixth step: service clustering

Selecting cluster center points requires computing the overall within-cluster variance for the points in the data set. However, the data set contains many points that are not suitable candidates, such as noisy data and isolated points at the edges. These points not only affect the choice of cluster centers but also add computational cost, and the number of data clusters has to be specified manually in advance. In view of these drawbacks, the present invention proposes an improved K-means algorithm based on density, which calculates the density of each point and extracts the high-density data points as cluster centers. The improved K-means algorithm is used to pre-cluster the initial data set S to be clustered, where S consists of M data points of dimension d. The point density of the density-based K-means algorithm is calculated as follows:

where Density(S_i) is the total number of points within radius R of S_i, and the distance is computed from sim(S_i, S_j), the similarity between services S_i and S_j;

Transport channels exist between the cell membranes of the tissue cells in the system, and different objects are shared and exchanged between different tissue cells. This requires the support of transport rules defined by the system. In the tissue P system designed here, transport rules are defined to guide the exchange of information between tissue cells. The rules are as follows:

(x, T₁, T₂, …, T_m / T′₁, T′₂, …, T′_m, y), x ≠ y, x, y = 1, 2, 3.

This transport rule states that tissue cell x and tissue cell y can transport objects in both directions, where T₁, T₂, …, T_m are m objects of tissue cell x and T′₁, T′₂, …, T′_m are m objects of tissue cell y. The transport rule achieves the following effects:

7.1) the m objects T₁, T₂, …, T_m in tissue cell x are transported into tissue cell y,

7.2) the m objects T′₁, T′₂, …, T′_m in tissue cell y are transported into tissue cell x;

(x, T_xbest / T_best, OE_o), x ≠ y, x, y = 1, 2, 3.

This transport rule states that tissue cell x exchanges objects with the system environment, where T_xbest is the local optimum object in the tissue cell x currently being computed and T_best is the global optimum object in the current environment. Through this transport rule, the optimum object in tissue cell x is transported into the environment and the global optimum object of the environment is updated at the same time;

Eighth step: halting and output

Each tissue cell in the system runs as an independent execution unit and evolves in a parallel structure, so the system is parallel and distributed. Within the system, a series of computation steps is defined as one computation. A computation starts from the tissue cells containing the initial data cell object sets, and in each computation one or more evolutionary rules are applied to the current data cell object sets. When the halting condition of the system is reached, the system halts automatically and the result of the computation is presented in the external environment of the system.

2. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 1, characterized in that, in the first step, the formal definitions are as follows:

1.1 Term definition:

Let TL = {T₁, T₂, …, T_n} be the set of terms in the service corpus, where n is the number of terms. Let A = {a₁, a₂, …, a_m} be the atomic vocabulary from which the terms in TL are formed, i.e. words that cannot be subdivided further, where m is the number of atomic words. The frequency of a term T_i is defined from the number of occurrences of T_i and the total number of occurrences of all terms in the corpus TL; the frequency of an atomic word is defined correspondingly from the total number of occurrences of all words. The formulas are as follows:

1.2 Tissue P system definition:

ω = (OB₁, OB₂, OB₃, OR₁, OR₂, OR₃, OR′, OE_o)

Wherein:

OR₁, OR₂ and OR₃ are the evolutionary rules of the respective tissue cells, representing the clustering rules based on the Agnes and k-means algorithms, on the weighted FCM algorithm, and on the GA algorithm, respectively;

OR′ represents the transport rules of the tissue cells in the whole P system; through the transport rules, objects can be shared and exchanged between cells;

OE_o = 0 is the output region of the system and represents the environment;

1.3 Tissue object definition

In a data clustering algorithm, the function of the tissue P system is to search for the optimal cluster centers of the data set being clustered. The cluster centers of the data are therefore represented by a group of objects, and a tissue cell object T in the P system is defined as a vector of dimension N*d, as follows:

T = (t₁₁, t₁₂, …, t_1d, …, t_i1, t_i2, …, t_id, …, t_N1, t_N2, …, t_Nd)

where N indicates that the data cell T has N clusters; the cluster centers corresponding to these N clusters C₁, C₂, …, C_N are t₁, t₂, …, t_N. Like a data point, each cluster center in the object is a d-dimensional vector, so t_i can be expressed as (t_i1, t_i2, …, t_id), i = 1, 2, …, N, where t_id is the d-th component of the i-th data cluster center;

OB_i denotes the object set in the i-th evolution membrane of the P system. It contains a group of objects, which evolve through the evolutionary mechanisms of the different tissue cells. The initial number of objects in each evolution membrane is defined as m, and these objects form its object set Q. During the evolution of the P system, the system needs a metric to evaluate how good the current objects are; the quality of an object is judged with the clustering performance function J_m, which computes the overall within-cluster variance of the samples, where s_j denotes a data set in a data cluster. The smaller the value of J_m, the better the object. By ranking the objects in this way, each evolution membrane has its own best object, the local optimum object OB_ibest, and the environment of the system stores one best object, the global optimum object, denoted T_best. When the whole system reaches the halting state, the global optimum object in the environment is the required solution and the optimal set of cluster centers;

3. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 2, characterized in that, in the second step, the characteristic value calculation process is as follows:

2.1 Self characteristic value calculation

Find a term T_i in the service corpus and calculate its information content I(P_i) by information theory; on this basis, the characteristic value Spe(T_i) of term T_i is assigned as follows:

Spe(T_i) = I(P_i) (3)

The characteristic value of a term is calculated from the joint probability distribution P{p_i, q_j}, where p_i ∈ P and q_j ∈ Q; p_i is a term selected from the term set TL and q_j is a word taken from the atomic vocabulary A, and {p₁, p₂, …, p_n} and {q₁, q₂, …, q_m} are represented by the random variables P and Q respectively. The mutual information of p_i and q_j is calculated by the following formula:

The characteristic value of term p_i is expressed as I(p_i, Q), which represents the relationship between the term p_i and the vocabulary Q. Combining the frequencies of terms and words in the corpus, the formula for the characteristic value of p_i is as follows:

Spe(T_i) ≈ I(p_i, Q) (5)

According to Bayes' theorem,

Analysis of typical WSDL documents shows that a term generally contains 1 to 2 words, so the quantity representing the words in a term is approximated as 1 in the calculation; θ is a weight value set according to information theory, with a value range of 0 to 1;

2.2 Contextual information characteristic value calculation

According to information theory, the contextual information feature of a service is the entropy of the probability distribution of the words that modify a term. The entropy is calculated by the following formula;

where NT is the number of modifiers of term T_i and (mod_m, T_i) is the probability that mod_m modifies term T_i; the entropy is the average information computed over all (mod_m, T_i). Within a specific domain the modifier distribution of a term is more concentrated, so the entropy of a term within a specific domain is lower. The contextual information characteristic value ContextSpe(T_i) of term T_i is calculated from the entropy as follows:

2.3 Hybrid characteristic value calculation

The self characteristic value and the contextual information characteristic value calculated by formulas (7) and (9) cover both the features that describe the word and the information that the word itself cannot describe. The hybrid characteristic value is finally obtained by formula (10) as follows:

The mixing coefficient α takes a value between 0 and 1 and is set to 0.65 based on experiment. After normalization, the self characteristic value, the contextual characteristic value and the hybrid characteristic value of a service all lie between 0 and 1.

4. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 3, characterized in that, in the third step, the domain weight calculation is as follows:

3.1 Domain weight calculation

The size of the weight is reflected by sibling terms: the greater the structural similarity, the greater the weight of the sibling terms. The calculation is as follows:

where the sibling term set is that of a new term T_n, and HybridSpe(T_s) and HybridSpe(T_n) are the characteristic values of each sibling term and of the new term respectively. If the newly added term has no sibling terms, the weight is directly defined as 0.5. G_i is the current Bigraph structure. A bigraph (Bigraph) is a pair B = <BP, BL>, proposed by the Turing Award winner Milner, where BP and BL are the place graph and the link graph respectively. BP is a triple BP = <V, E, P> consisting of the node set V of the graph, the edge set E and the interface P; nested nodes in the place graph are in a parent-child relationship, and nesting between nodes is represented by branch relationships. BL, like BP, is a triple consisting of the node set V of the graph, the edge set E and the interface P, and is used to represent the connection relationships between nodes;

3.2 Term weight calculation

where the two word counts are the numbers of component words in terms T_i and T_n respectively, and the shared count is the number of words the two terms have in common. The more similar related-substructure terms a new term contains, the higher the weight. The term weight is obtained from the term similarity, with the following formula:

where NP is the combined set of the superordinate, sibling and subordinate terms of the term, and T_i is one of these terms.

5. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 4, characterized in that, in the fourth step, the Bigraph hierarchy of terms is generated as follows:

4.1 Calculate the hybrid characteristic values of the terms extracted from the WSDL documents and from Google, put them into array A and sort them in ascending order; select the first 3 term objects as three Bigraph nodes to form the initial Bigraph structure T;

4.2 Each remaining term T_n in array A is added to the existing Bigraph hierarchy. If T_x satisfies HybridSpe(T_n) - 0.3 < HybridSpe(T_x) < HybridSpe(T_n) + 0.3, mark T_x as a target node, where T_x is a term in the existing Bigraph hierarchy; these target nodes determine the target substructure position of T_n and hence the candidate node substructures;

4.3 Taking into account both the domain weight W_DS(G_i) and the term weight W_TS(G_i) of the new term with respect to a candidate substructure, the final node weight is calculated by formula (14) in order to find the best substructure;

W_f(G_i) = ωW_DS(G_i) + (1-ω)W_TS(G_i) (14)

where ω is a coefficient in the range 0 to 1; steps 4.2 and 4.3 are run iteratively until all terms have been added to the Bigraph hierarchy.

6. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to any one of claims 1 to 5, characterized in that, in the sixth step, the clustering process based on the density-based K-means algorithm is as follows:

6.1 Preprocess the data with the density-based K-means algorithm: calculate the distances between the data points S_i, divide the data into different clusters according to radius R, select the K points S_i with the highest density, i.e. the highest Density(S_i), as cluster centers, and finally cluster the data by similarity;

6.2 Evolutionary rule of tissue cell O₁

O₁ uses Agnes as its evolutionary rule to guide the evolution of the objects inside the cell. Given a preset inter-cluster similarity threshold Cs, the N initial clusters obtained by the density-based k-means algorithm are merged by the Agnes algorithm;

6.3 Evolutionary rule of tissue cell O₂

O₂ uses the FCM algorithm based on sample weighting as its evolutionary rule to guide the evolution of the objects inside the cell. The objective function and cluster center calculation of the traditional FCM algorithm do not consider differences between samples and treat all samples equally. This makes it easy to amplify the influence of isolated points or noisy data in the data set, which reduces the contribution of important samples to the clustering and lowers clustering accuracy. To reduce the effect of sample differences on the clustering result, an FCM clustering algorithm based on sample weighting is proposed, which improves the clustering result by applying reasonable weights to the objective function and to the cluster center function;

6.4 Evolutionary rule of tissue cell O₃

O₃ uses the three genetic operations of GA (selection, crossover and mutation) as its evolutionary rule to guide the evolution of the objects inside the cell.

7. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 6, characterized in that the process of 6.1 is as follows:

6.1.1 According to formula (16), calculate the distance from each data point S_i to each data cluster center in the tissue object Q, determine the number of points around S_i in each data cluster, and sort the data set by density;

6.1.3 According to the distances between the divided clusters, obtain the similarity sim(S_i, C_k) between each S_i and C_k; given the average similarity Avesim, if sim(C_k, S_i) > Avesim, assign S_i to data cluster C_k, finally obtaining N data clusters;

8. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 7, characterized in that, in 6.2, the process of the evolutionary rule of tissue cell O₁ is as follows:

6.2.1 Construct the similarity matrix D from the average similarity dis(C_i, C_j) of the data in any two data clusters C_i, C_j

where S_X is a data point in data cluster C_i, S_Y is a data point in data cluster C_j, and U and V are the numbers of data points in C_i and C_j respectively;

6.2.2 Select the data clusters C_i, C_j with the largest dis(C_i, C_j); given the inter-cluster similarity threshold Cs, if dis(C_i, C_j) > Cs, merge data clusters C_i and C_j;

6.2.3 Repeat step 6.2.2 until the similarity threshold requirement is met between all data clusters.

9. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 7, characterized in that, in 6.3, for the data set S = {s₁, s₂, …, s_n}, the process is as follows:

where u_ij is the membership degree of the i-th data point in the j-th cluster, and the i-th data point is assigned to the data cluster j in which its membership is greatest; ||s_i - t_j|| is the Euclidean distance from data point s_i to cluster center t_j, and n is the number of data points. The membership degrees of each data point sum to 1, that is, they satisfy the normalization constraint

6.3.2 Calculate the weights and entropy information

Entropy represents the degree of disorder of information. Based on the definition of entropy, the present invention analyzes the data membership degrees and weights the FCM objective function by sample. First the entropy variable E_i is defined to represent the effectiveness of the membership degree u_ij, and the weight w_i is calculated to measure the influence of data point s_i on this clustering. Their formulas are as follows:

6.3.3 Calculate the new objective function from E_i and w_i

m is the weighting exponent, an integer greater than or equal to 1. To find the extremum of the objective function under the constraints, the Lagrange multiplier method is used to construct the following new objective function:

The optimality conditions for the extremum of the objective function are as follows:

The new cluster center t_j is calculated as:

6.3.4 If |F(S, t)_(i-1) - F(S, t)_i| is greater than the set threshold, repeat step 6.3.3; otherwise terminate the algorithm and output the result. F(S, t)_i denotes the FCM objective function value obtained in the i-th iteration.

10. The SOAP service similarity calculation and clustering method based on a Bigraph structure according to claim 9, characterized in that, in 6.4, the evolution steps are as follows:

6.4.1 O₃ merges the m objects in its own cell with the objects transported from the other two tissue cells into a new object evolution pool P;

6.4.2 O₃ performs selection, crossover and mutation on the new object evolution pool P, where selection uses the elitist preservation strategy, and crossover and mutation use integer-form crossover and single-point mutation. The specific method is as follows:

6.4.2.1 Calculate the evaluation value p_k of each object k, where N is the number of data clusters and t_i is the center of the i-th data cluster; the smaller p_k is, the more suitable the classification and the more likely the object is to be passed on to the next generation;

6.4.2.2 Define the fitness function fitness_k of each object k

fitness_k = α(1-α)^(index-1) (30)

6.4.2.3 Selection operation, based on each object's share of the total fitness

where u is the total number of objects in the object pool; for each object, a random number p is generated in a loop, and if p < Cif_k the object is passed on to the next generation;

6.4.2.4 The crossover position in the crossover operation is determined by the crossover probability Pc: two objects are selected at random from the evolution pool for crossover, and each component of the objects is traversed; at each position a random number p is generated in the loop, and if p < Pc, the components of the two objects after that position are exchanged and the traversal ends

6.4.2.5 Define the mutation probability P_m. For each object, generate a random probability p; if p is less than the mutation probability P_m, let z be the mutation point (i.e. a component) of the object determined by P_m; the value after mutation is z_θ, and the mutated object is expressed as:

6.4.3 Repeat steps 6.4.1-6.4.2. To keep the number of objects in the evolution pool stable, O₃ screens the evolved objects, eliminating objects according to their fitness and retaining the m objects with the highest fitness to form the new object evolution pool P′.