CN102202012A

CN102202012A - Group dividing method and system of communication network

Info

Publication number: CN102202012A
Application number: CN201110141970XA
Authority: CN
Inventors: 郭世泽; 陈哲; 王小娟; 陆哲明; 段榕; 赵建鹏; 杨云
Original assignee: No54 Inst Headquarters Of General Staff P L A
Current assignee: No54 Inst Headquarters Of General Staff P L A
Priority date: 2011-05-30
Filing date: 2011-05-30
Publication date: 2011-09-28
Anticipated expiration: 2031-05-30
Also published as: CN102202012B

Abstract

The invention provides a group dividing method for a communication network, comprising the following steps: preprocessing the communication data; creating a communication relationship network from the preprocessing result, thereby obtaining nodes that represent the communication senders and receivers in the communication network and edges that represent the communication relationships between them; constructing a demand text vector and a communication text vector according to query words provided by the user; calculating the node centrality of each node in the communication relationship network; calculating, for the nodes that have a communication relationship, the communication relationship strength between the nodes, the similarity between edges, and the user's degree of satisfaction with each edge; performing an edge clustering operation on the edges of the communication relationship network to generate multiple groups; finding the core members of each group according to node centrality and communication theme; expanding the members of each group; and dividing the expanded members to generate new groups.

Description

Community (group) division method and system for a communication network

Technical field

The present invention relates to the field of data mining, and in particular to a community division method and system for a communication network.

Background technology

Communication tools such as Fetion, e-mail, MSN, and QQ have gradually become important means of exchanging information, and their convenience makes their use increasingly widespread. A communication network is the embodiment of a social network on the Internet, and communication data provides research samples for discovering social patterns. Finding the communities and core members that interest a user by analyzing communication data according to the user's demand is called community division. The division result maps onto groups in the real world and therefore has practical significance.

For community division of communication networks, the prior art mainly falls into two categories:

One category divides the communication network based on complex network theory, for example spectral methods, hierarchical methods, and modularity-based methods. Community division of complex networks focuses on the network topology, and the division result reflects the topology well. However, a communication network contains a large amount of irrelevant data. On the one hand, this data limits the speed of community division; on the other hand, although the members of a resulting community belong together topologically, the communication content of that community may not be what the user cares about. To obtain a community division that meets user demand, the communication data must be screened according to that demand.

The other category divides the communication network into communities based on communication content, for example with the k-means algorithm or Bayesian methods, grouping similar communication content into one community. The communities obtained this way have similar communication content and can be screened to meet user demand, but a community that shares the same communication content may correspond to several different "groups" in real society.

To divide communities while considering both communication content and user demand, one must consider, on the one hand, the demands that daily updates of communication data place on algorithm speed and, on the other hand, the influence of communication text on node and edge attributes, so that the analysis result meets user demand.

Summary of the invention

The object of the present invention is to overcome the defects that existing community division methods are biased toward one aspect during division, cannot meet user demand, and cannot reflect the features of nodes well.

To achieve the above object, the invention provides a community division method for a communication network, comprising:

Step 1), preprocessing the communication data to obtain information about the communication data, including a communication data ID, sender information, recipient information, communication time, and communication content;

Step 2), creating, according to the preprocessing result of step 1), a communication relationship network that reflects the structure of the communication network, and obtaining from the communication relationship network nodes that represent the communication senders and receivers in the communication network and edges that represent the communication relationships between the senders and receivers;

Step 3), constructing a demand text vector and a communication text vector according to query words provided by the user;

Step 4), calculating the node centrality of each node in the communication relationship network; the node centrality comprises node betweenness, node closeness, and node degree;

Step 5), calculating, for the nodes in the communication relationship network that have a communication relationship, the communication relationship strength between the nodes, the similarity between the edges, and the user's satisfaction with the edges;

Step 6), performing an edge clustering operation on the edges in the communication relationship network based on the communication content, generating a plurality of communities;

Step 7), finding the core members of each community according to the node centrality and the communication theme;

Step 8), expanding the members of each community on the basis of the core members;

Step 9), dividing the expanded members of the communities to generate new communities.

In the above technical solution, step 6) comprises:

Step 6-1), determining the number of communities that the edge clustering will generate;

Step 6-2), generating an initial core for each community;

Step 6-3), for each edge in the communication network, calculating in turn its similarity to the initial core of each community;

Step 6-4), according to the result of step 6-3), adding each edge in the communication network to the community whose initial core has the greatest similarity to it;

Step 6-5), adjusting the cluster center of each community;

Step 6-6), repeating steps 6-3) to 6-5) until a stop condition is satisfied.
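Steps 6-1) to 6-6) resemble a k-means-style loop over edges. The following is a minimal Python sketch of that loop, not the patented implementation. It assumes that each edge is represented by a numeric vector, that similarity is cosine similarity, that a center is adjusted to the mean of its member vectors, and that the stop condition is an unchanged assignment or an iteration limit. None of these choices is stated in the text, and the names `cosine` and `cluster_edges` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_edges(edge_vectors, initial_centers, max_iter=100):
    """Assign each edge to its most similar center, then re-average the centers.

    edge_vectors: dict mapping an edge id to its vector.
    initial_centers: list of vectors, one per community (result of step 6-2).
    Returns a dict mapping edge id -> community index.
    """
    centers = [list(c) for c in initial_centers]
    assignment = {}
    for _ in range(max_iter):
        # Steps 6-3 and 6-4: choose the center with maximum similarity.
        new_assignment = {
            e: max(range(len(centers)), key=lambda c: cosine(vec, centers[c]))
            for e, vec in edge_vectors.items()
        }
        # Step 6-6: stop when no edge changes community (assumed stop condition).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 6-5: move each center to the mean of its member vectors (assumption).
        for c in range(len(centers)):
            members = [edge_vectors[e] for e, k in assignment.items() if k == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

The number of communities from step 6-1) is implied here by the length of `initial_centers`.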

In the above technical solution, step 6-2) comprises:

Step 6-2-1), according to the similarity between the edges, if s_{ij} = 0, storing the pair formed by edge i and edge j in a set A;

Step 6-2-2), for each pair in set A, calculating the class degree value of edge i and the class degree value of edge j, and judging whether both class degree values are greater than a preset threshold; only when both class degree values are less than the threshold are edge i and edge j regarded as a pair of isolated edges, and such pairs of isolated edges are deleted from set A;

Step 6-2-3), performing a bitwise AND operation on edge i and edge j of each pair in set A, and storing the edge i and edge j that yield the minimum value in the cluster center center = (i, j);

Step 6-2-4), searching for the edge k that has the minimum similarity to all edges in the cluster center center as a new cluster center; if no such k exists, returning the cluster centers found, which are the initial cluster centers; if there are several such k, storing them in the set center and then re-executing step 6-2-3).

In the above technical solution, step 7) comprises:

Step 7-1), calculating the node centrality of each member of the community;

Step 7-2), calculating a node weight for each member of the community based on the communication theme;

Step 7-3), sorting the nodes by node centrality and node weight, and obtaining the core members from the ranking.

In the above technical solution, step 8) comprises:

Step 8-1), taking m nodes whose shortest distance to node i is greater than 2 to form a node set {v_1, v_2, ..., v_m}, and using a variable fnum to record the number of times a node is judged to belong to the same community as node i;

Step 8-2), selecting an unprocessed subset from the node set produced in the previous step, and judging whether the nodes in this subset belong to the same community as node i;

Step 8-3), repeating step 8-2), and calculating the frequency p of each node from its fnum; if the frequency p is greater than a further threshold, the node is considered to belong to the same community as node i; otherwise it is not.

In the above technical solution, step 9) comprises:

Step 9-1), dividing the communication network into n communities, each node being an independent community; wherein the initial modularity value Q = 0, and the initial a_{i} and the intermediate variable b_{ij} satisfy:

a_{i} = \frac{\sum_{j} w_{ij} e_{ij}}{2 \sum_{i, j} w_{ij}}

b_{ij} = \frac{w_{ij} e_{ij}}{2 \sum_{i, j} w_{ij}}

where e_{ij} = 1 when there is an edge between node i and node j, and e_{ij} = 0 when there is none; w_{ij} is the weight of edge e_{ij}; the elements of the modularity increment matrix initially satisfy:

\Delta Q_{ij} = b_{ij} + b_{ji} - 2 a_{i} a_{j} = \frac{w_{ij} e_{ij}}{\sum_{i, j} w_{ij}} - \frac{(\sum_{k} w_{ik} e_{ik}) (\sum_{k} w_{jk} e_{jk})}{2 (\sum_{i, j} w_{ij})^{2}}

Step 9-2), selecting the maximum \Delta Q_{ij} from the max-heap H, merging the corresponding communities i and j, and labeling the merged community j; and updating the modularity increment \Delta Q_{ij}, the max-heap H, and the auxiliary vector a_{i}; this step comprises:

Step 9-2-1), updating \Delta Q_{ij}: deleting the elements of row i and column i, and updating the elements of row j and column j, thereby obtaining the updated modularity increment matrix;

Step 9-2-2), updating the max-heap H: each time \Delta Q_{ij} is updated, updating the maximum element of the corresponding row and column in the heap;

Step 9-2-3), updating the auxiliary vector:

a'_{j} = a_{i} + a_{j}

a'_{i} = 0

and at the same time recording the modularity value after merging, Q + \Delta Q_{ij};

Step 9-3), repeating step 9-2) until the merge termination condition is satisfied.
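The initialization in step 9-1) and the vector update in step 9-2-3) can be illustrated with a short Python sketch. It follows the formulas for a_{i}, b_{ij}, and \Delta Q_{ij} given above, but it is only a sketch under assumptions: the weight matrix is taken to be symmetric, increments are stored only for connected pairs with i ≠ j, and a plain `max` over a dictionary stands in for the max-heap H. The function names are hypothetical, and the row and column update of step 9-2-1) is not shown because its formula is not reproduced in the text.

```python
def initial_increments(w, e):
    """Compute the auxiliary vector a and the initial modularity increments.

    w: symmetric weight matrix (list of lists); e: 0/1 adjacency matrix.
    Returns (a, dq) where dq maps a connected pair (i, j) to its increment.
    """
    n = len(w)
    total = sum(w[i][j] for i in range(n) for j in range(n))  # sum over i, j of w_ij
    a = [sum(w[i][j] * e[i][j] for j in range(n)) / (2 * total) for i in range(n)]
    dq = {}
    for i in range(n):
        for j in range(n):
            if i != j and e[i][j]:  # assumption: only connected pairs are candidates
                b_ij = w[i][j] * e[i][j] / (2 * total)
                b_ji = w[j][i] * e[j][i] / (2 * total)
                dq[(i, j)] = b_ij + b_ji - 2 * a[i] * a[j]
    return a, dq

def merge_best(a, dq):
    """Pick the pair with the largest increment and apply step 9-2-3) in place."""
    i, j = max(dq, key=dq.get)  # stands in for popping the max-heap H
    gain = dq[(i, j)]
    a[j] = a[i] + a[j]          # a'_j = a_i + a_j
    a[i] = 0.0                  # a'_i = 0
    return i, j, gain
```

For a three-node path with unit weights, the sum of all weights is 4, so the middle node has a = 0.25 and each connected pair has an initial increment of 0.1875.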

The present invention also provides a community division system for a communication network, comprising: a data preprocessing module, a communication relationship network construction module, a text vector construction module, a node centrality calculation module, an edge attribute calculation module, an edge clustering module, a core member search module, a member expansion module, and a member division module; wherein,

the data preprocessing module preprocesses the communication data to obtain information about the communication data, including a communication data ID, sender information, recipient information, communication time, and communication content;

the communication relationship network construction module creates, according to the preprocessing result, a communication relationship network that reflects the structure of the communication network, and obtains from the communication relationship network nodes that represent the communication senders and receivers in the communication network and edges that represent the communication relationships between the senders and receivers;

the text vector construction module constructs a demand text vector and a communication text vector according to query words provided by the user;

the node centrality calculation module calculates the node centrality of each node in the communication relationship network; the node centrality comprises node betweenness, node closeness, and node degree;

the edge attribute calculation module calculates, for the nodes in the communication relationship network that have a communication relationship, the communication relationship strength between the nodes, the similarity between the edges, and the user's satisfaction with the edges;

the edge clustering module performs an edge clustering operation on the edges in the communication relationship network based on the communication content, generating a plurality of communities;

the core member search module finds the core members of each community according to the node centrality and the communication theme;

the member expansion module expands the members of each community on the basis of the core members;

the member division module divides the expanded members of the communities to generate new communities.

The advantages of the invention are:

1. The method and system of the present invention extract rich information from the communication network, including the nodes representing communication senders and receivers, the edges representing the communication relationships between them, the node centrality, the communication relationship strength between nodes, the similarity between edges, and the user's satisfaction with the edges, thereby providing technical support for the subsequent mining and analysis of communication data.

2. The method and system of the present invention realize community division through two clusterings and one diffusion (namely the clustering performed during edge clustering, the clustering performed during community division, and the diffusion performed during member expansion), so the division result is accurate and reliable.

Description of drawings

Fig. 1 is a flow chart of the community division method of the present invention in one embodiment;

Fig. 2 is a schematic diagram of the tables used in one embodiment to store the preprocessing results;

Fig. 3 is a flow chart of expanding the members of a community in the community division method of the present invention;

Fig. 4 is a schematic diagram of the community division system of the present invention in one embodiment.

Embodiment

The present invention is described below with reference to the drawings and specific embodiments.

Before the embodiments are described in detail, the concepts involved in the present invention are first explained.

1. Node set N

Node set N is the set of all communication nodes in the communication network.

2. Edge set E

Edge set E records the communication relationships between communication nodes acting as senders and communication nodes acting as recipients. It is usually represented as a 0-1 matrix, where e_{ij} = 1 indicates that there is an edge between node i and node j, and e_{ij} = 0 indicates that there is none.

3. User demand Q

Because a communication network is very large, the user needs to provide a demand text to lock onto the target range and thereby improve accuracy. For example, a user who wants to lock onto information about "securities" provides keywords such as "securities" and "stock" as the demand text for the query, and everyone who has discussed these words is locked onto. User demand usually takes the form of words. Note that even a clear user demand can be ambiguous because of inconsistent word usage: for example, the abbreviation translated here as "National People's Congress" may mean either the People's University or the People's Congress. The demand text therefore also needs to be expanded in order to build the user query vector Q.

4. Node attribute set L_N

The attribute set L_N of node i comprises the following three items:

1) Communication account:

Records the mapping between nodes and communication accounts.

2) Neighbor node information table:

If there is an edge between node i and node j, node i is called a neighbor of node j. Each node has its own neighbor node information table, which stores the information of that node's neighbors.

3) Node centrality C:

Each node has a different status in the communication network because of its topological position. Node centrality C is an index of the importance of a communication node that jointly considers node closeness, betweenness, and degree; it is usually represented as a matrix.

5. Edge attribute set L_E

The attribute set L_E of edge e_{ij} comprises the following three items:

1) Communication strength matrix W

In a communication network, the communication strength between nodes needs to be assessed. If two nodes communicate directly, the communication strength reflects the strength of their contact in reality; if they do not communicate directly, it reflects the likelihood that they exchange information in reality. The communication strength matrix W can be built by jointly considering information such as communication time, communication frequency, and topology.

2) Similarity matrix S

Each edge is represented as a vector that carries semantics, and the similarity between edges is calculated from these vectors. The similarity matrix S supports the cluster analysis.

3) User satisfaction CE

Each edge can be given a user satisfaction CE according to the user demand text. User satisfaction is used to judge whether the edge lies within the user's area of interest.

The above explains the concepts involved in the present invention. The following embodiment takes a mail network as an example and describes how to mine the information in the mail network and thereby realize community division. In other embodiments, information mining and community division for communication networks such as landline telephone networks and mobile terminal networks can be carried out with reference to the same process.

Analyzing a mail network necessarily requires the relevant mail communication data. This data can be obtained from a communication network such as the Internet using the prior art, which is not repeated here. With reference to Fig. 1, the following describes how information is mined from the communication network according to the mail communication data and how community division is then realized.

Step 10, preprocessing the mail communication data.

Preprocessing the mail communication data mainly obtains the following information:

1) Communication data ID

Communication data is numbered, and the ID is the unique identifier that distinguishes one item of communication data from another. In this embodiment, each mail is generally given one ID. In other embodiments, such as instant messaging with MSN or QQ, each conversation is given one ID.

2) Sender information

The information of the sender in the communication data. In this embodiment, the sender information can be the sender's e-mail address; in other embodiments, it can also be the sender's account, IP address, and so on, as long as it uniquely identifies the sender.

3) Recipient information

The information of the recipient in the communication data. In this embodiment, the recipient information can be the recipient's e-mail address; in other embodiments, it can also be the recipient's account, IP address, and so on, as long as it uniquely identifies the recipient.

4) Communication time

The time at which the communication occurred. In this embodiment, the communication time can be the time at which the sender sent the mail or the time at which the recipient received it. In other embodiments, such as instant messaging, the communication time can be identified in other ways, for example by taking the start time of a chat session as the communication time.

5) Communication content

The communication content is the text content of the communication data, such as the subject and body of an e-mail. In this embodiment, the information in e-mail attachments is not treated as communication content. In other embodiments, the text in an attachment can also be read by suitable software and treated as communication content. Because Chinese has no obvious boundary between words, as a preferred implementation the text content of the communication data is segmented into words, yielding communication content composed of multiple words.

Each communication in the communication network yields the above five items of information. Putting together the information of all or some of the communications in the whole network over a period of time forms the basic data that describes the mail communication network. As a preferred implementation, this basic data can be classified, and the classification results can be stored in several separate tables.

In this embodiment, with reference to Fig. 2, the classified data is stored in the following tables:

A. Mapping table: the node name corresponding to a communication account can be found by querying this table;

B. Mail information table (message): this is the communication content table. The "mail number" field mid is the primary key of this table, and every communication has a unique "mail number" mid as its identifier. For mail, the table mainly records the subject and body; for other communication formats, it records the chat log;

C. Recipient information table (recipient info): this is the communication content reception information table. The basic information in the mail information table (message) can be queried through the "mail number" field mid in this table;

D. Relation information table: this is the contact table, which records the sending and receiving information between communication accounts;

E. Weight table: this table holds the weight information of the contacts between communication accounts;

F. Interaction information table: this table holds the interaction information between communication accounts, including the text information vector and the user satisfaction.

Step 20, creating the communication relationship network according to the preprocessing result of the previous step.

The previous step obtained data from actual mail communication, but this data does not by itself show the overall state of the mail network. This step therefore builds the communication relationship network from the mail data.

To build the communication relationship network, a communication node is created for each communication account, and the content of the tables obtained from preprocessing then decides whether an edge is created between two communication nodes. If two communication accounts have a communication relationship, an edge exists between the corresponding communication nodes; otherwise, there is no such edge.

Building the communication relationship network from the mail communication data yields the node set N and the edge set E. Their composition and data structure are explained above and are not repeated here.
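As an illustration of step 20, the following Python sketch derives a node set and a 0-1 edge matrix from preprocessed records. It is a sketch under assumptions: the `CommRecord` type and its field names are hypothetical stand-ins for the five items obtained in step 10, the edge matrix is treated as symmetric, and self-addressed mail is ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommRecord:
    """One preprocessed communication (hypothetical field names)."""
    comm_id: str
    sender: str
    receiver: str
    time: str        # for example an ISO date string; the format is an assumption
    content: tuple   # segmented words of the subject and body

def build_network(records):
    """Return (accounts, e): the node list and the 0-1 edge matrix E."""
    accounts = sorted({r.sender for r in records} | {r.receiver for r in records})
    index = {acc: i for i, acc in enumerate(accounts)}
    n = len(accounts)
    e = [[0] * n for _ in range(n)]
    for r in records:
        i, j = index[r.sender], index[r.receiver]
        if i != j:
            # Assumption: any communication in either direction creates
            # one undirected edge between the two accounts.
            e[i][j] = e[j][i] = 1
    return accounts, e
```

With one mail from account a to b and one from b to c, the result is three nodes and edges a-b and b-c, with no edge between a and c.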

Step 30, constructing the communication text vector and the demand text vector.

As mentioned for the preprocessing of step 10, the text information of the communication process (that is, the communication content) is obtained and segmented into words during preprocessing. This text information is then processed as follows.

Step 31, building an inverted index

On the basis of the word segmentation result, an inverted index is built using an index dictionary and a stop-word list. The index dictionary, the stop-word list, and the process of building an inverted index from them are common knowledge in the art and are not repeated here.

Step 32, creating the demand text vector and the communication text vector

Communication text contains content on many topics, including content related to the user demand, which is provided by the user and usually expressed as query words. The text related to the user demand is called the demand text, and the vector created from the demand text is called the demand text vector. The demand text vector Q has the following form:

{(t_1, tw_1), (t_2, tw_2), ..., (t_m, tw_m)}

where t_1, t_2, ..., t_m are the query terms, arranged in ascending order, and tw_1, tw_2, ..., tw_m are weights describing the importance of the query terms in the user's eyes.

From the query terms of the demand text, a communication text vector {(t_1, tw_1), (t_2, tw_2), ..., (t_m, tw_m)} can be constructed. The weights of the query terms can be calculated by the following formula, which gives the weight tw_{ji} of the query term t_i in mail j:

tw_{ji} = f_{ij} \times \log \frac{N}{f_{i}}

where f_{ij} is the number of occurrences of the term t_i in mail j, f_{i} is the number of texts in the communication text collection that contain the term t_i, and N is the total number of texts in the communication text collection.

After the weights tw_{ji} are calculated by the above formula, the weights tw_1, tw_2, ..., tw_m of the query terms t_1, t_2, ..., t_m over the whole communication text collection can be obtained by weighted calculation. Note that although the weights of the query terms are written as tw in both the demand text vector and the communication text vector, in the demand text vector the weight reflects the importance of the corresponding query term in the user's mind, whereas in the communication text vector it relates to the frequency with which the query term appears in the communication text.
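The weight formula above has the shape of a TF-IDF weight. The following Python sketch computes tw_{ji} for every text and query term. It assumes that f_{i} is the number of texts containing the term, that the logarithm is the natural logarithm, and that a term occurring in no text receives weight 0; these are assumptions made for illustration, and `term_weights` is a hypothetical name.

```python
import math
from collections import Counter

def term_weights(texts, query_terms):
    """Return one dict per text, mapping each query term t_i to tw_ji.

    texts: list of token lists (segmented communication texts).
    """
    n = len(texts)  # N: number of texts in the collection
    # f_i: number of texts containing the term (assumed reading of f_i).
    df = {t: sum(1 for doc in texts if t in doc) for t in query_terms}
    weights = []
    for doc in texts:
        counts = Counter(doc)  # f_ij: occurrences of each term in this text
        weights.append({
            t: counts[t] * math.log(n / df[t]) if df[t] else 0.0
            for t in query_terms
        })
    return weights
```

A term that appears in every text gets weight 0 under this formula, because log(N / f_i) is then log 1.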

Step 33, expanding the demand text

Users vary in the query words they use. For example, in a query about computer information, some users use one word for "computer" while others use a synonym. To make the query result more accurate and complete, the demand text needs to be expanded.

When the demand text is expanded, related terms are added according to a certain strategy so that the expanded text fully describes the implied concept or theme.

Expanding the demand text can comprise the following steps:

Step 33-1, first calculating the co-occurrence frequency of a term t and a query term q in text j:

cof(t, q | j) = \log(tf(t, j) + 1.0) \times \log(tf(q, j) + 1.0)

where tf(t, j) and tf(q, j) denote the number of occurrences of the word t and the word q in text j, respectively.
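The co-occurrence frequency of step 33-1 can be written directly as a small Python function. This sketch assumes that a text is a list of segmented tokens and that the logarithm is the natural logarithm, neither of which is specified in the text.

```python
import math

def cof(tokens, t, q):
    """Co-occurrence frequency of term t and query term q in one text.

    tokens: list of segmented words of text j.
    """
    tf_t = tokens.count(t)  # tf(t, j)
    tf_q = tokens.count(q)  # tf(q, j)
    return math.log(tf_t + 1.0) * math.log(tf_q + 1.0)
```

Because log(0 + 1.0) is 0, the value is 0 whenever either word is absent from the text, so only texts containing both words contribute.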

Step 33-2, after the co-occurrence frequency of a term and a query term is obtained, further calculating the degree of association between the term and the query terms.

Assuming that the words in the initial demand text Q are mutually independent, the degree of association between a term t and Q can be measured by the product of the co-occurrence degrees of t with each word of Q in the local text set S. The degree of association between t and Q in S is defined as:

cohd(t, Q | S) = \prod_{q \in Q} (cood(t, q | S) + 1.0)^{idf(q | C) \, idf(t | C)}

where idf(\cdot | C) is defined as:

idf(\cdot | C) = \frac{\log(N)}{\log(df(\cdot | C) + \mu)}

df(\cdot | C) denotes the number of texts in the corpus C in which a given term appears, and \mu is an adjustable parameter greater than 0, with a default value of 100.

Step 33-3, calculating an evaluation function from the degree of association, and judging from its result whether the term t is to be added to the demand text.

Taking the logarithm of both sides of the above degree-of-association formula gives the evaluation function score(t):

score(t) = \sum_{q \in Q} idf(q | C) \, idf(t | C) \log(cood(t, q | S) + 1.0)

lodd_{Q, C}(t, q | S) is defined below as the local dependence degree (Local Dependence Degree) of the term t and the query word q in the local document set S, given the global text set C and the user demand text vector Q. It is calculated as:

lodd_{Q, C}(t, q | S) = idf(q | C) \, idf(t | C) \log(cood(t, q | S) + 1.0)

The evaluation function above can then be simplified to:

score(t) = \sum_{q \in Q} lodd_{Q, C}(t, q | S)

Once the scores of the evaluation function are obtained, the terms with higher scores can be selected to expand the demand text. On the one hand, terms that frequently co-occur in the local text set S with the words of the query vector Q receive higher scores; on the other hand, terms that occur with high frequency in the global mail set are penalized to some degree (the degree of penalty is adjusted by the parameter μ in the idf formula). As a result, the highest-scoring terms finally selected are highly relevant to the theme of the user demand text.
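The evaluation function of step 33-3 can be sketched in Python as follows. The text does not define how cood(t, q | S) is obtained, so this sketch takes it as a caller-supplied function; that parameter, the `df` dictionary of document frequencies, the natural logarithm, and the function names are all assumptions made for illustration.

```python
import math

def idf(df_value, n_texts, mu=100.0):
    """idf(. | C) = log(N) / log(df(. | C) + mu), with mu defaulting to 100."""
    return math.log(n_texts) / math.log(df_value + mu)

def score(t, query_terms, cood, df, n_texts, mu=100.0):
    """Sum of the local dependence degrees lodd(t, q | S) over all q in Q.

    cood: function (t, q) -> co-occurrence degree in the local set S
          (supplied by the caller; not defined in the source text).
    df:   dict mapping a term to its document frequency in the corpus C.
    """
    idf_t = idf(df.get(t, 0), n_texts, mu)
    return sum(
        idf(df.get(q, 0), n_texts, mu) * idf_t * math.log(cood(t, q) + 1.0)
        for q in query_terms
    )
```

Candidate terms can then be ranked by `score` and the top few added to the demand text. A larger `mu` flattens the idf values and so weakens the penalty on globally frequent terms.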

Step 40, computing node centrad.

Definitional part at preamble is mentioned, and the node center degree comprises node intermediary degree, node tightness and three indexs of node contact degree, with regard to how calculating these indexs describes respectively below.

Step 41, computing node intermediary degree

The mean value of the shortest path number by node k is called intermediary's degree coefficient of node k, is designated as C _A(k), then:

C_{A} (k) = \frac{Σ_{i}^{n} Σ_{j}^{n} g_{ij} (k)}{{(n - 1)}^{2}}

where g_{ij}(k) is a binary variable indicating whether the shortest path between nodes i and j passes through node k: it is 1 if so and 0 otherwise.

Step 42: compute the node degree

The average number of nodes directly connected to node k is called the degree coefficient of node k, denoted C_B(k):

C_{B} (k) = \frac{Σ_{i = 1}^{n} a (i, k)}{(n - 1)}

where n is the number of nodes in the network and a(i, k) is a binary variable: 1 means that nodes i and k are directly connected, and 0 means that they are not.

Step 43: compute the node closeness

The average of the sum of the shortest-path lengths between node k and all nodes in the network is called the closeness coefficient of k, denoted C_C(k):

C_{C} (k) = \frac{Σ_{i}^{k} l (i, k)}{{(n - 1)}^{2}}

where l(i, k) is the shortest-path length between nodes i and k.

Once the node betweenness, node closeness, and node degree have been obtained, the centrality vector of node k can be computed: C(k) = (C_A(k), C_B(k), C_C(k)).
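The three indices can be sketched for a small unweighted, undirected network as follows. This is an illustration under stated assumptions: the adjacency-dictionary representation and the function names are hypothetical, g_{ij}(k) is taken to be 1 when at least one shortest path between i and j passes through k, and the closeness sum runs over all nodes reachable from k, as the prose definition describes.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (edge counts) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centrality_vector(adj, k):
    """Return (C_A(k), C_B(k), C_C(k)) for node k.

    adj maps each node to the set of its neighbors (assumed undirected).
    """
    nodes = list(adj)
    n = len(nodes)
    dist = {u: bfs_distances(adj, u) for u in nodes}

    # Betweenness: count ordered pairs (i, j) whose shortest path can pass through k.
    g_sum = 0
    for i in nodes:
        for j in nodes:
            if i == j or k in (i, j):
                continue
            d_ij, d_ik, d_kj = dist[i].get(j), dist[i].get(k), dist[k].get(j)
            if None not in (d_ij, d_ik, d_kj) and d_ik + d_kj == d_ij:
                g_sum += 1
    c_a = g_sum / (n - 1) ** 2

    # Degree: number of direct neighbors, normalized by n - 1.
    c_b = len(adj[k]) / (n - 1)

    # Closeness: sum of shortest-path lengths from k, normalized as in the formula.
    c_c = sum(dist[k].values()) / (n - 1) ** 2
    return c_a, c_b, c_c
```

For the three-node path 0 - 1 - 2, the middle node lies on the shortest path of both ordered pairs (0, 2) and (2, 0), so its betweenness coefficient is 2 / (3 - 1)^2 = 0.5.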

Step 50: compute the communication relationship strength matrix W

The assessment of the communication relationship strength between nodes i and j uses four indices: number of communications, communication time span, shortest-path length, and number of shared neighbors. The calculation of each index is described below.

Step 51: compute the number of communications

The more communications there are between two nodes, the more frequent their contact and the closer their relationship. The number of communications between nodes i and j is computed as:

comm_num_{ij} = send_{ij} + receive_{ij}

where send_{ij} is the number of communications that node i initiated to node j, and receive_{ij} is the number of communications that node i received from node j.

Step 52: compute the communication time span

The longer the communication time span between two nodes, the longer their contact history and the closer their relationship. The communication time span of nodes i and j is:

dur_day_{ij} = latest_day_{ij} - earliest_day_{ij}

where latest_day_{ij} is the time of the most recently observed communication between nodes i and j, and earliest_day_{ij} is the time of their first communication.

Step 53: compute the shortest-path length

The shorter the shortest path between two nodes, the more direct their contact and the closer their relationship. The shortest-path length between nodes i and j is denoted shortest_len_{ij}; it is the number of edges on the path with the fewest edges among all paths from node i to node j.

Step 54: compute the number of shared neighbors

The more neighbors two nodes share, the more likely they are to be in the same social circle and the closer their relationship. Scanning the neighbor sets of nodes i and j gives the number of shared neighbors:

sharenode_num_{ij} = |neighbor_i ∩ neighbor_j|

Step 55: once the number of communications, the communication time span, the shortest-path length, and the number of shared neighbors have been computed, the function closeness(i, j), which assesses the communication relationship strength of two nodes, can be calculated. The values of closeness(i, j) form the communication relationship strength matrix W. closeness(i, j) is computed as:

closeness (i, j) = k_{1} \times \frac{comm\_num_{ij}}{Max\_num} + k_{2} \times \frac{dur\_day_{ij}}{Max\_day} + k_{3} \times \frac{sharenode\_num_{ij}}{Max\_node} + k_{4} \times (1 - \frac{shortest\_len_{ij}}{Max\_len})

where Max_num is the maximum number of communications between any two nodes; Max_day is the maximum communication time span between any two nodes; Max_node is the maximum number of shared neighbors between any two nodes; Max_len is the longest shortest path between any two nodes; and k_i are weight coefficients.
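The four indices and their weighted combination can be sketched as below. The equal weights `k = (0.25, 0.25, 0.25, 0.25)` are an assumption made only for illustration, since the text does not fix the weight coefficients; the use of `datetime.date` for communication times and all function signatures are likewise hypothetical.

```python
from datetime import date

def comm_num(send_ij, receive_ij):
    """Step 51: total number of communications between i and j."""
    return send_ij + receive_ij

def dur_day(latest_day, earliest_day):
    """Step 52: communication time span in days."""
    return (latest_day - earliest_day).days

def sharenode_num(neighbor_i, neighbor_j):
    """Step 54: number of neighbors shared by i and j (inputs are sets)."""
    return len(neighbor_i & neighbor_j)

def closeness(num, days, shared, shortest_len,
              max_num, max_day, max_node, max_len,
              k=(0.25, 0.25, 0.25, 0.25)):
    """Step 55: weighted strength of the relationship between two nodes.

    The weights k are illustrative; the source only says they are weight coefficients.
    """
    k1, k2, k3, k4 = k
    return (k1 * num / max_num
            + k2 * days / max_day
            + k3 * shared / max_node
            + k4 * (1 - shortest_len / max_len))
```

Note that the shortest-path term enters as `1 - shortest_len / max_len`, so a shorter path raises the strength while the other three terms raise it as they grow.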

Step 60: compute the similarity matrix S

Step 61: use the vector space model to give the edges a unified representation, in which every edge is a vector. The edge vector between node i and node j is defined as the mean of all communication text vectors between node i and node j. That is:

e_{i} = (a_{1}^{i}, a_{2}^{i}, \cdots, a_{n}^{i})

where

a_{j}^{i} = \frac{Σ_{k = 1}^{r} E_{w}\text{-}ID_{w} (m_{k}, t_{j})}{r}, 1 \leq j \leq n

E_w-ID_w(m_k, t_j) denotes the weight of feature word t_j in communication text m_k.

Step 62: compute the similarity between any two edges

The cosine formula is used to compute the similarity between the vectors e_i and e_j of any two edges:

s_{ij} = \cos (e_{i}, e_{j}) = \frac{e_{i} \cdot e_{j}}{\sqrt{{(e_{i})}^{2}} \times \sqrt{{(e_{j})}^{2}}} = \frac{Σ_{k = 1}^{n} (a_{k}^{i} \times a_{k}^{j})}{\sqrt{Σ_{k = 1}^{n} {(a_{k}^{i})}^{2}} \times \sqrt{Σ_{k = 1}^{n} {(a_{k}^{j})}^{2}}}

The larger the value of s_{ij}, the smaller the angle and the higher the similarity. If s_{ij} ≥ ∂, then e_i and e_j are considered similar; otherwise they are dissimilar. Here ∂ is the similarity threshold.

Step 63: construct the similarity matrix S

After the pairwise similarity of the edges has been computed according to the preceding steps, the similarity matrix S is obtained.

Given the threshold ∂, two edges are similar if s_{ij} ≥ ∂ and dissimilar otherwise. The filtered matrix S is obtained accordingly, where

s_{ij} = \begin{cases} 1 & s_{ij} \geq \partial \\ 0 & s_{ij} < \partial \end{cases}
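Steps 61 to 63 can be sketched as follows. The function names are hypothetical, vectors are plain Python lists, and r is assumed to be the number of communication texts on the edge, which is how the averaging in step 61 reads.

```python
import math

def edge_vector(text_vectors):
    """Step 61: the edge vector is the mean of the r text vectors on the edge."""
    r = len(text_vectors)
    return [sum(column) / r for column in zip(*text_vectors)]

def cosine(u, v):
    """Step 62: cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(edge_vectors, threshold):
    """Step 63: binary matrix S with 1 where the cosine reaches the threshold."""
    return [[1 if cosine(u, v) >= threshold else 0 for v in edge_vectors]
            for u in edge_vectors]
```

With edge vectors `[1, 0]`, `[0, 1]`, `[1, 0]` and a threshold of 0.5, the first and third edges are marked similar to each other and neither is similar to the second.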

Step 70: compute the user satisfaction CE

Expanding the user demand text allows the communication content to be taken into account. The detailed process is as follows:

Step 71: compute the weights of the demand text

To obtain the user's satisfaction, the weight that each query term carries in the user's eyes must first be determined. Before the weights of the demand text are computed, the following definitions are made:

R denotes the set of texts that satisfy the user demand;

C denotes the set of all texts;

N_C denotes the number of all texts in the set;

N_sim denotes the number of texts in the set that satisfy the user demand.

The weights of the demand text can be computed with techniques from the prior art. In this embodiment, following Rocchio's relevance feedback experiments, the demand text is taken as the query vector, and the value on each dimension of the ideal query vector \vec{Q}_{opt}, which separates the texts that satisfy the demand from the texts that do not, is used as the weight of the demand text. The ideal query vector is computed as:

\vec{Q}_{opt} = \frac{1}{N\_sim} \underset{d_{j} \in R}{Σ} \frac{\vec{d}_{j}}{| \vec{d}_{j} |} - \frac{1}{N\_C - N\_sim} \underset{d_{j} \in C - R}{Σ} \frac{\vec{d}_{j}}{| \vec{d}_{j} |}

where d_j denotes the j-th dimension of the corresponding vector, and \vec{d}_j denotes the value of the j-th dimension of the corresponding vector;

In practice, the number of texts that satisfy the demand cannot be known in advance. The actual computation therefore first constructs an initial query vector, in which the user gives each term a value in [0, 1] to represent its importance, and then revises it step by step according to the demand-satisfying texts specified by the user until an ideal result is reached. The classic algorithm proposed by Rocchio is as follows:

\vec{Q}_{opt} = α \times \vec{q}_{initial} + β \times \underset{d_{j} \in R}{Σ} \frac{\vec{d}_{j}}{| \vec{d}_{j} |} - γ \times \underset{d_{j} \in C - R}{Σ} \frac{\vec{d}_{j}}{| \vec{d}_{j} |}

where α, β, and γ are three adjustable constants, and \vec{q}_{initial} denotes the initial query vector.
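One pass of the Rocchio update above can be sketched as follows. The default values of `alpha`, `beta`, and `gamma` are illustrative assumptions, since the text only calls them adjustable constants; the function names and the list-based vectors are hypothetical as well.

```python
import math

def unit(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def rocchio(q_initial, satisfying, non_satisfying,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q_initial + beta*sum(unit(d) for d in R) - gamma*sum(unit(d) for d in C-R).

    satisfying plays the role of R and non_satisfying the role of C - R.
    """
    q = [alpha * x for x in q_initial]
    for d in satisfying:
        for idx, x in enumerate(unit(d)):
            q[idx] += beta * x
    for d in non_satisfying:
        for idx, x in enumerate(unit(d)):
            q[idx] -= gamma * x
    return q
```

Repeating the call with the texts the user marks after each round moves the query vector toward the satisfying texts and away from the others, which is the stepwise revision the paragraph describes.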

Step 72: compute the user satisfaction of text m

The satisfaction s_m of text m is expressed as the similarity between the vector T_m of text m and the user demand text vector T_Q:

s_{m} = \cos (T_{m}, T_{Q}) = \frac{T_{m} \cdot T_{Q}}{\sqrt{{(T_{m})}^{2}} \times \sqrt{{(T_{Q})}^{2}}} = \frac{Σ_{k = 1}^{n} (t_{k}^{m} \times t_{k}^{Q})}{\sqrt{Σ_{k = 1}^{n} {(t_{k}^{m})}^{2}} \times \sqrt{Σ_{k = 1}^{n} {(t_{k}^{Q})}^{2}}}

Step 73: compute the edge user satisfaction

The mean satisfaction of all texts communicated between node i and node j is called the edge user satisfaction CE:

CE = \frac{1}{N_{k}} Σ_{i = 1}^{N_{k}} s_{i}

where N_k is the number of texts communicated between node i and node j.
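Steps 72 and 73 amount to a cosine similarity followed by a mean, which can be sketched as below. The function names are hypothetical and the vectors are plain lists over the same term dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def text_satisfaction(text_vector, demand_vector):
    """Step 72: satisfaction s_m of one text against the demand vector T_Q."""
    return cosine(text_vector, demand_vector)

def edge_satisfaction(text_vectors, demand_vector):
    """Step 73: CE is the mean satisfaction over the N_k texts on the edge."""
    scores = [text_satisfaction(t, demand_vector) for t in text_vectors]
    return sum(scores) / len(scores)
```

For an edge carrying one text that matches the demand vector exactly and one that is orthogonal to it, CE is (1 + 0) / 2 = 0.5.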

The preceding steps described how the relevant information in the mail communication network is mined. This information can then be used to carry out the community division.

Step 80: perform edge clustering based on the communication content.

Edge clustering divides all the edges in the communication network into several communities such that, in terms of communication content, edges in different communities differ markedly while edges in the same community are close to one another. The purpose of edge clustering in the communication network is to quickly narrow down the scope of the user demand. Edge clustering can be implemented in many ways, such as hierarchical methods, partitioning methods, and grid-based methods; this embodiment can use the k-means method. The specific steps of the k-means edge clustering used in this embodiment are described below.

Step 81: determine the number of communities to be generated by edge clustering, and denote this number n;

Step 82: generate an initial core for each community;

Step 83: for every edge in the communication network, compute its similarity to the initial core of each community in turn;

Step 84: according to the results of step 83, add each edge in the communication network to the community whose initial core it is most similar to;

Step 85: adjust the cluster centers. In this step, the cluster centers can be adjusted with methods common in the prior art, such as computing the mean of the members of each cluster and using that mean as the new cluster center;

Step 86: repeat steps 83 to 85 until the stop condition is satisfied; the communities obtained at that point are the result of edge clustering. Several stop conditions are possible in this step, for example that, while the cluster centers are being adjusted, the difference between the centers of a cluster before and after the adjustment falls below a pre-specified threshold.
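Steps 83 to 86 can be sketched as follows, taking the initial cores of step 82 as an input. This is an illustration under assumptions: cosine similarity is used to compare an edge with a core, the center shift is measured with Euclidean distance, and `tol`, `max_iter`, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kmeans_edges(edge_vectors, initial_cores, tol=1e-6, max_iter=100):
    """Cluster edge vectors around the given initial cores."""
    centers = [list(core) for core in initial_cores]
    assignment = []
    for _ in range(max_iter):
        # Steps 83-84: put each edge in the community with the most similar core.
        assignment = [
            max(range(len(centers)), key=lambda c: cosine(vec, centers[c]))
            for vec in edge_vectors
        ]
        # Step 85: the new center is the mean of the community's members.
        new_centers = []
        for c, old in enumerate(centers):
            members = [v for v, a in zip(edge_vectors, assignment) if a == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty community's center
        # Step 86: stop when no center moved by more than tol.
        shift = max(math.dist(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return assignment, centers
```

The quality of the result depends on the initial cores, which is why the description devotes a separate procedure to choosing them.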

Step 82 above involves generating the initial cores. The creation of the initial cores is explained below, starting with three criteria.

Criterion 1: the similarity between the initial cores should be as small as possible, so that the similarity between the communities containing them is as small as possible.

Criterion 2: to ensure that an initial core vector is not an isolated edge, count the edges similar to it and require that this number exceed a given threshold.

Criterion 3: the less overlap there is between the edges related to two selected cluster centers, the better.

The initial cores are selected as follows:

Step 82-1: in the similarity matrix S, if s_{ij} = 0, store the pair formed by edge i and edge j in the set A;

Step 82-2: for every pair in the set A, compute the class degree value of edge i and the class degree value of edge j, and judge whether both class degree values are greater than a pre-specified threshold (for example 2). Only when both class degree values are less than the threshold is the pair formed by edge i and edge j regarded as isolated edges, and such a pair is deleted from the set A.

Step 82-3: perform a bitwise AND operation on edge i and edge j in the set A, and store the edges i and j that give the minimum value in the cluster center set center = (i, j).

Step 82-4: search for the edge k that has the minimum similarity to all edges in the cluster center set center, and take it as a new cluster center. If no such k exists, return the cluster centers found so far; these are the initial cluster centers. If there are several such k, store them in the set center and then re-execute step 82-3.
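Steps 82-1 to 82-3 can be sketched for the first pair of cores as follows. Two points are assumptions, because the corresponding formulas are not reproduced in the text: the class degree value of an edge is taken to be the number of ones in its row of the binary matrix S, and the result of the bitwise AND is scored by counting its ones. The function name and `degree_threshold` are hypothetical.

```python
def pick_initial_pair(S, degree_threshold=2):
    """Pick the first two initial cores from the binary similarity matrix S.

    Returns a pair (i, j) of edge indices, or None if no pair qualifies.
    """
    n = len(S)
    # Assumed class degree value: how many edges each edge is similar to.
    degree = [sum(row) for row in S]
    best_pair, best_overlap = None, None
    for i in range(n):
        for j in range(i + 1, n):
            # Step 82-1: only dissimilar pairs (s_ij = 0) enter the set A.
            if S[i][j] != 0:
                continue
            # Step 82-2: drop the pair only when both edges are below the threshold.
            if degree[i] < degree_threshold and degree[j] < degree_threshold:
                continue
            # Step 82-3: bitwise AND of the two rows; fewer ones means less overlap.
            overlap = sum(a & b for a, b in zip(S[i], S[j]))
            if best_overlap is None or overlap < best_overlap:
                best_pair, best_overlap = (i, j), overlap
    return best_pair
```

Choosing the dissimilar pair with the least overlap serves criteria 1 and 3, while the class degree filter serves criterion 2.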

Step 90: find the core members in each community.

The core members are found as follows.

Step 91: judge, based on node centrality, whether each member of the community is a core member.

The composition and calculation of node centrality were explained in detail in step 40 above and are not repeated here. The degree component of node centrality reflects how active a node is in the network; a very high degree means that the node is likely to be a server. Betweenness measures the extent to which a particular node lies between other nodes. Closeness measures how near a node is to the other nodes and reflects how quickly it can reach all of them. Node centrality combines these three indices to describe how central node k is in the network.

Step 92: judge, based on the communication topic, whether each community member is a core member.

Whether a community member is a core member can be judged from the communication topic in several ways; this embodiment uses the HITS algorithm. HITS is a web link analysis algorithm proposed by Kleinberg of IBM. Its basic principle is to determine the authority pages for a given broad topic through link analysis. Taking the characteristics of communication behavior into account, this algorithm is used to find core members as follows:

Step 92-1: determine the node set on which the HITS algorithm operates:

Step a): from the query result set obtained from the user demand text, take the top t ranked results and put them in the result set R_σ (called the Root Set).

Step b): expand the result set R_σ. The expansion has two parts. First, every node that a node in R_σ actively communicates with is added to R_σ. Second, among the nodes that communicate passively with (point to) each node in R_σ, any d nodes are added to the original result set R_σ, forming S_σ (called the Base Set).

The expanded set S_σ satisfies three properties well: S_σ is relatively small; S_σ is rich in relevant nodes; and S_σ contains most of the most valuable authoritative nodes.

Step 92-2: compute the hub weight and the authority weight of each node.

The node set S_σ with its communication relationships is represented as a directed graph, in which the directed edge (p, q) indicates that node p communicates with node q. A good hub points to many good authorities, and a good authority is in turn pointed to by many good hubs. For any node p, A(p) denotes the authority weight of p and H(p) denotes the hub weight of p, satisfying the normalization conditions:

\underset{p \in S_{σ}}{Σ} A^{2} (p) = 1

and

\underset{p \in S_{σ}}{Σ} H^{2} (p) = 1

Kleinberg divides the propagation of node weights into two operations, the I operation and the O operation:

The I operation propagates weight from hubs to authorities:

A (p) \leftarrow \underset{q \in Q}{Σ} H (q)

where Q = {q | (q, p) ∈ E}, that is, the nodes that point to p;

The O operation propagates weight from authorities to hubs:

H (p) \leftarrow \underset{q \in Q}{Σ} A (q)

where Q = {q | (p, q) ∈ E}. The final weights of all nodes are obtained by iterating these operations.
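The I and O operations with normalization can be sketched as follows. The edge-list input, the fixed iteration count, and the function names are assumptions made for illustration; a directed edge (p, q) means that p communicates with q, so authority flows to q from the hubs that point to it.

```python
import math

def _normalize(weights):
    """Scale weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {n: (w / norm if norm else w) for n, w in weights.items()}

def hits(edges, iterations=50):
    """Return (authority, hub) weight dictionaries for a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # I operation: A(q) sums the hub weights of the nodes pointing to q.
        new_auth = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_auth[q] += hub[p]
        # O operation: H(p) sums the authority weights of the nodes p points to.
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub
```

When two nodes both communicate with a third and nothing else happens, the third node receives all the authority weight and the two senders share the hub weight equally.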

Step 93: once every node has a centrality and node weights, the two values are considered together and the nodes are sorted in descending order; the nodes ranked highest overall are the core members.

Step 100: expand the members of each community.

After the core members of a community have been found in the previous step, the members of the community are expanded on the basis of these core members. Member expansion can be realized by judging whether a member belongs to the same community as a core member. Nodes belonging to the same community are connected relatively tightly compared with nodes outside the community. Likewise, for the information held by node i, nodes in the same community receive more of it than nodes outside the community. Therefore, the amount of information that each node receives during information propagation can be used to judge whether two nodes belong to the same community. With reference to Figure 3, the detailed process is as follows:

Step 101: take m nodes whose shortest distance to node i is greater than 2 to form the node set {v_1, v_2, ..., v_m}. The variable fnum records the number of times a node is judged to belong to the same community as node i; its initial value is 0.

Step 102: choose an unprocessed subset from the node set produced in step 101, and judge whether the nodes in this subset belong to the same community as node i. This step comprises:

Step 102-1: choose a node j in the node subset as the terminal node and take the aforementioned node i as the source node; initialize M_i = 1, M_j = 0, and the set {M_k = 0}, where M_i, M_j, and M_k denote the amount of useful information at nodes i, j, and k respectively; k ∈ {1, 2, ..., n_node} with k ≠ i, j; n_node is the number of nodes in the network;

Step 102-2: update the values of M_k in ascending order; the information value of node k is:

M_{k} = \underset{i}{Σ} M_{i} \frac{w_{ik} e_{ik}}{\underset{j}{Σ} w_{ij} e_{ki}}

where e_{ij} = 1 when there is an edge between node i and node j, and e_{ij} = 0 when there is none; w_{ij} = 1 is the weight corresponding to edge e_{ij}.

Step 102-3: repeat step 102-2 until the information values of the nodes no longer change appreciably.

Step 102-4: select the information threshold for the division according to the redundancy.

Every node has an information value. Cutting the network at the place that is close to the middle position and has the largest information difference divides the whole network into two communities. "Close to the middle position" is where the redundancy comes in. For example, with n network nodes and a redundancy of 20%, the approximate size of a community is constrained accordingly, so it suffices to search the nodes positioned in (0.3n, 0.7n) for the two adjacent nodes whose information values differ the most.
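The cut selection of step 102-4 can be sketched as follows, assuming that the information values M_k have already converged. The dictionary input, the descending ranking, the rounding of the window bounds, and the function name are assumptions; the window of cut positions runs from about (0.5 - redundancy)n to (0.5 + redundancy)n, which gives 0.3n to 0.7n for a redundancy of 20%.

```python
def split_by_largest_gap(info, redundancy=0.2):
    """Split nodes into two groups at the largest information gap near the middle.

    info maps each node to its information value; at least two nodes are expected.
    """
    ranked = sorted(info, key=info.get, reverse=True)
    n = len(ranked)
    # Candidate cut positions: cut c separates ranked[c - 1] from ranked[c].
    low = max(1, round((0.5 - redundancy) * n))
    high = min(n - 1, round((0.5 + redundancy) * n))
    cut = max(range(low, high + 1),
              key=lambda c: info[ranked[c - 1]] - info[ranked[c]])
    return ranked[:cut], ranked[cut:]
```

The first returned group holds the nodes with the higher information values, that is, the side of the source node.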

Step 102-5: after the threshold has been selected, if the information value of node k is greater than a threshold (70% in this embodiment), node k belongs to the same community as the source node and fnum is incremented by 1; if it is less than the threshold, node k does not belong to the same community.

Step 103: repeat step 102 and compute the frequency p of each node from its fnum. If p is greater than another threshold (0.6 in this embodiment), the node is considered to belong to the same community as node i; otherwise it is not.

Step 110: perform community division based on the complex network.

Community division can use divisive methods or agglomerative methods. This embodiment takes the agglomerative method as an example to describe the community division process.

This step involves the concept of modularity, which is explained first:

Modularity: suppose the network is divided into k communities, and define a symmetric k × k matrix E = (e_{ij}), where the element e_{ij} is the fraction of all edges in the network that connect nodes of two different communities, the two nodes lying in community i and community j respectively. Modularity is denoted Q and computed as:

Q = \underset{i}{Σ} (e_{ii} - a_{i}^{2}) = \mathrm{Tr}\, e - \| e^{2} \|

where ||e^2|| denotes the sum of all elements of the matrix e^2.
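The modularity formula can be checked numerically with a short sketch. It assumes that a_i is the sum of row i of the matrix e, which is the usual convention for this formula and is not stated at this point in the text; the function name is hypothetical.

```python
def modularity(e):
    """Q = sum_i (e_ii - a_i^2), with a_i assumed to be the sum of row i of e."""
    a = [sum(row) for row in e]
    return sum(e[i][i] - a[i] ** 2 for i in range(len(e)))
```

For e = [[0.4, 0.1], [0.1, 0.4]], each a_i is 0.5, so Q = 2 × (0.4 - 0.25) = 0.3. The second form gives the same value: Tr e = 0.8 and the elements of e^2 sum to 0.5.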

This step also involves the following three data structures:

(1) The modularity increment matrix ΔQ_{ij}. Each of its rows is stored both as a balanced binary tree and as a max-heap.

(2) The max-heap H. This heap contains the largest element of each row of the modularity increment matrix ΔQ_{ij}, together with the labels i and j of the two communities corresponding to that element.

(3) The auxiliary vector a_i.

With the related concepts and data structures explained above, the specific steps of community division by the agglomerative method are as follows:

Step 111: divide the communication network into n communities, each node being an independent community. At this point the initial modularity value is Q = 0. The initial a_i and the intermediate variable b_{ij} satisfy:

a_{i} = \frac{\underset{j}{Σ} w_{ij} e_{ij}}{2 \underset{i, j}{Σ} w_{ij}}

b_{ij} = \frac{w_{ij} e_{ij}}{2 \underset{i, j}{Σ} w_{ij}}

where e_{ij} = 1 when node i and node j are connected by an edge and e_{ij} = 0 when they are not; w_{ij} is the weight corresponding to edge e_{ij}. Initially, the elements of the modularity increment matrix satisfy:

Δ Q_{ij} = b_{ij} + b_{ji} - 2 a_{i} a_{j} = \frac{w_{ij} e_{ij}}{\underset{i, j}{Σ} w_{ij}} - \frac{(\underset{k}{Σ} w_{ik} e_{ik}) (\underset{k}{Σ} w_{jk} e_{jk})}{2 {(\underset{i, j}{Σ} w_{ij})}^{2}}

Step 112: select the largest ΔQ_{ij} from the max-heap H, merge the corresponding communities i and j, and label the merged community j; then update the modularity increment ΔQ_{ij}, the max-heap H, and the auxiliary vector a_i:

Step 112-1: update ΔQ_{ij}: delete the elements of row i and column i, and update the elements of row j and column j, thereby obtaining the updated matrix.

Step 112-2: update the max-heap H: each time ΔQ_{ij} is updated, update the largest elements of the corresponding rows and columns in the heap.

Step 112-3: update the auxiliary vector:

a′_j = a_i + a_j

a′_i = 0

At the same time, record the modularity value after the merge, Q + ΔQ_{ij}.

Step 113: repeat step 112 until the merge termination condition is satisfied. Several merge termination conditions are possible. In one embodiment, the merge termination condition is that all nodes belong to a single community. In another embodiment, considering that the modularity Q has only one peak, the merging can stop once the largest element of the modularity increment matrix changes from positive to negative, and this can also be used as the merge termination condition.
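The merging loop of steps 111 to 113 can be sketched as follows. This is a simplified illustration, not the data-structure design described above: it scans all community pairs instead of using the balanced binary trees and the max-heap H, so it is slower but easier to follow. It assumes an undirected weighted edge list, reads the sum of w_{ij} in the formulas as running once over each edge, stops when the largest increment is no longer positive, and reports the accumulated increment starting from 0. All names are hypothetical.

```python
def greedy_merge(n, weighted_edges):
    """Agglomerative community division on nodes 0..n-1.

    weighted_edges is a list of (i, j, w) tuples for undirected edges.
    Returns (communities, accumulated modularity increment).
    """
    total = sum(w for _, _, w in weighted_edges)
    # b[i][j]: share of edge weight joining community i to community j.
    b = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        b[i][j] += w / (2 * total)
        b[j][i] += w / (2 * total)
    a = [sum(row) for row in b]          # auxiliary vector a_i
    members = {i: [i] for i in range(n)}  # step 111: one community per node
    gain = 0.0
    while len(members) > 1:
        # Step 112: find the connected pair with the largest increment.
        best = None
        for i in members:
            for j in members:
                if i < j and b[i][j] > 0:
                    dq = b[i][j] + b[j][i] - 2 * a[i] * a[j]
                    if best is None or dq > best[0]:
                        best = (dq, i, j)
        # Step 113: stop once no merge increases the modularity.
        if best is None or best[0] <= 0:
            break
        dq, i, j = best
        # Merge community i into community j (row i, then column i).
        for k in range(n):
            b[j][k] += b[i][k]
            b[i][k] = 0.0
        for k in range(n):
            b[k][j] += b[k][i]
            b[k][i] = 0.0
        a[j] += a[i]                      # step 112-3: a'_j = a_i + a_j
        a[i] = 0.0                        # step 112-3: a'_i = 0
        members[j].extend(members.pop(i))
        gain += dq
    return sorted(sorted(group) for group in members.values()), gain
```

On two triangles joined by a single bridge edge, the loop merges each triangle into one community and then stops, because merging across the bridge would lower the modularity.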

The present invention also provides a community division system for a communication network. With reference to Figure 4, it comprises: a data preprocessing module, a communication relationship network construction module, a text vector construction module, a node centrality computing module, an edge attribute computing module, an edge clustering module, a core member search module, a member expansion module, and a member division module; wherein

the edge attribute computing module computes the communication relationship strength between the nodes that have a communication relationship in the communication relationship network, the similarity between the edges among the nodes, and the user's satisfaction with those edges;

the edge clustering module performs edge clustering on the edges in the communication relationship network based on the communication content, generating a plurality of communities;

With the above method and system, the different communities in a communication network can be divided, so that the members are classified according to their attributes or features.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution of the present invention that do not depart from its spirit and scope shall all be covered by the scope of the claims of the present invention.

Claims

1. A community division method for a communication network, comprising:

Step 4): compute the node centrality of each node in the communication relationship network; the node centrality comprises node betweenness, node closeness, and node degree;

Step 5): compute the communication relationship strength between the nodes that have a communication relationship in the communication relationship network, the similarity between the edges among the nodes, and the user's satisfaction with those edges;

2. The community division method for a communication network according to claim 1, characterized in that step 6) comprises:

Step 6-2): generate an initial core for each community;

Step 6-5): adjust the cluster center of each community;

3. The community division method for a communication network according to claim 2, characterized in that step 6-2) comprises:

Step 6-2-1): according to the similarity between the edges among the nodes, if the similarity s_{ij} = 0, store the pair formed by edge i and edge j in the set A;

and the class degree value of edge j

Step 6-2-3): perform a bitwise AND operation on edge i and edge j in the set A, and store the edges i and j that give the minimum value in the cluster center set center = (i, j);

4. The community division method for a communication network according to claim 1, characterized in that step 7) comprises:

Step 7-1): compute the node centrality of each member of the community;

5. The community division method for a communication network according to claim 1, characterized in that step 8) comprises:

6. The community division method for a communication network according to claim 1, characterized in that step 9) comprises:

Step 9-1): divide the communication network into n communities, each node being an independent community; the initial modularity value is Q = 0, and the initial auxiliary vector a_i and the intermediate variable b_{ij} satisfy:

a_{i} = \frac{\underset{j}{Σ} w_{ij} e_{ij}}{2 \underset{i, j}{Σ} w_{ij}}

b_{ij} = \frac{w_{ij} e_{ij}}{2 \underset{i, j}{Σ} w_{ij}}

where e_{ij} = 1 when node i and node j are connected by an edge and e_{ij} = 0 when they are not; w_{ij} is the weight corresponding to edge e_{ij}; initially, the element ΔQ_{ij} of the modularity increment matrix satisfies:

Δ Q_{ij} = b_{ij} + b_{ji} - 2 a_{i} a_{j} = \frac{w_{ij} e_{ij}}{\underset{i, j}{Σ} w_{ij}} - \frac{(\underset{k}{Σ} w_{ik} e_{ik}) (\underset{k}{Σ} w_{jk} e_{jk})}{2 {(\underset{i, j}{Σ} w_{ij})}^{2}}

Step 9-2): select the largest ΔQ_{ij} from the max-heap H, merge the corresponding communities i and j, and label the merged community j; then update ΔQ_{ij}, the max-heap H, and the auxiliary vector a_i. This step comprises:

Step 9-2-3): update the auxiliary vector:

a′_j = a_i + a_j

a′_i = 0

At the same time, record the modularity value after the merge, Q + ΔQ_{ij};

Step 9-3): repeat step 9-2) until the merge termination condition is satisfied.

7. A community division system for a communication network, characterized in that it comprises: a data preprocessing module, a communication relationship network construction module, a text vector construction module, a node centrality computing module, an edge attribute computing module, an edge clustering module, a core member search module, a member expansion module, and a member division module; wherein