CN104899657A - Method for predicting association fusion events

Info

Publication number: CN104899657A
Application number: CN201510314273.8A
Authority: CN
Inventors: 唐晓晟; 李巍; 胡铮; 张诗悦
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2015-06-09
Filing date: 2015-06-09
Publication date: 2015-09-09

Abstract

The invention discloses a method for predicting community fusion events, comprising: step one, dividing raw network data into preset time slices and selecting the data of several time slices as training data; step two, partitioning the training data into static communities and dynamic communities; step three, extracting key factor indices between any two communities based on the training data; and step four, performing supervised training on the key factor indices and determining, according to the learning result of the supervised training, whether the two communities will merge. The method predicts whether any two or more communities will merge, offers high prediction reliability, and is applicable to the analysis of most weighted and unweighted networks.

Description

Method for predicting community fusion events

Technical field

The present invention relates to data mining, and in particular to a method for predicting community fusion events.

Background technology

Complex networks are ubiquitous in everyday life; their common features are large scale and complicated structure. A social network, for example, is a typical complex network formed by human relationships in real life: its nodes represent network users or people in the real world, and the connections between nodes represent friend relationships between network users or interpersonal relationships in the real world. The structure formed by the nodes of a social network and the connections between them is called the network topology, and it exhibits different features in social networks of different types and at different stages.

A community is a kind of sub-network that represents an important feature of a complex network. Communities also have a network topology, and the community structure exhibits different features as the community evolves and as key events occur. Key events in a complex network reflect the behavioral tendencies of certain groups. For example, communities in a social network represent various social circles, such as circles of friends, family circles and circles of colleagues. Such communities may indicate the formation of certain interest factors or social factors. Predicting key events helps to uncover these factors in advance, put them to use, and further guide network behavior. Predicting the key events of community evolution is therefore of great significance in both research and application.

The key events of community evolution include the extinction, birth, shrinkage, expansion, splitting and fusion of communities. Some research exists on predicting these events, but it is limited to predicting the evolution trend of a single community. A community fusion event involves multiple communities; previous work only predicts whether a single community tends to merge, and offers no explicit method for predicting which communities will merge with one another within a future period.

In summary, a method is urgently needed that predicts in more detail which communities will merge.

Summary of the invention

One of technical matters to be solved by this invention is to provide a kind of method and carries out more detailed prediction to by the corporations occurring to merge.

In order to solve the problems of the technologies described above, the embodiment of the application provides the Forecasting Methodology of a kind of corporations fusion event, comprising: step one, the timeslice of network raw data according to setting split, and therefrom chooses multiple timeslice data as training data; Step 2, training data carried out to the division of static corporations and dynamic corporations; Step 3, extract the key factor index between any Liang Ge corporations based on training data; Step 4, exercise supervision to described key factor index training, and judge whether any Liang Ge corporations can merge according to the learning outcome of supervised training.

Preferably, key factor index comprises the external structure index of similarity of the inner structure index between Liang Ge corporations, the single order change indicator of described inner structure index, second order change indicator and Liang Ge corporations.

Preferably, following expression is utilized to extract inner structure index between described Liang Ge corporations:

B_{d} (i, j) = \frac{E_{i, j}}{(E_{i} / N_{i} + E_{j} / N_{j}) / 2}

In the formula, B_d(i, j) is the internal structure index between community i and community j, E_{i,j} is the number of connections between community i and community j, E_i and E_j are the numbers of connections inside community i and community j respectively, and N_i and N_j are the numbers of nodes in community i and community j respectively.

Preferably, the external structure similarity index of the two communities is extracted using the following expression:

Sim (i, j) = \frac{\sum_{k = 1, k \neq i, j}^{m} (w_{i, k} \times w_{j, k})}{\sqrt{\sum_{k = 1, k \neq i, j}^{m} w_{i, k}^{2}} + \sqrt{\sum_{k = 1, k \neq i, j}^{m} w_{j, k}^{2}} - \sum_{k = 1, k \neq i, j}^{m} (w_{i, k} \times w_{j, k})}

In the formula, Sim(i, j) is the external structure similarity index of community i and community j; w_{i,k} and w_{j,k} are the weights between community i and community k and between community j and community k respectively, where w_{i,k} = E_{i,k}^2 / (N_i × N_k) and w_{j,k} = E_{j,k}^2 / (N_j × N_k); E_{i,k} and E_{j,k} are the numbers of connections between community i and community k and between community j and community k respectively; N_i, N_j and N_k are the numbers of nodes in community i, community j and community k respectively; and m is the number of communities.

Preferably, step four comprises the following steps: building a prediction model from the key factor indices obtained from the training data, and determining the threshold at which communities merge; substituting the key factor indices obtained from the data of the time slice nearest to the time point to be predicted into the prediction model, and comparing the resulting prediction with the threshold to judge whether the communities will merge.

Preferably, the prediction model is built using the following expression:

\begin{aligned} Td_{t = t_{0}} = {} & P_{B} (R (i, j)_{t = t_{0}} = 1 \mid B_{d} (i, j)_{t = t_{0} - \Delta t}) \times \left( 1 + \log_{1 + \max | \Delta B_{d} (i, j)_{t = t_{0} - \Delta t} |} (1 + \Delta B_{d} (i, j)_{t = t_{0} - \Delta t}) \right) \\ & \times \left( 1 + \log_{1 + \max | \Delta\Delta B_{d} (i, j)_{t = t_{0} - \Delta t} |} (1 + \Delta\Delta B_{d} (i, j)_{t = t_{0} - \Delta t}) \right) + Sim (i, j)_{t = t_{0} - \Delta t} \end{aligned}

In the formula, Td_{t = t_0} is the tendency of community i and community j to merge,

P_{B} (R (i, j)_{t = t_{0}} = 1 \mid B_{d} (i, j)_{t = t_{0} - \Delta t})

is the probability fitting function, and

\Delta B_{d} (i, j)_{t = t_{0} - \Delta t}, \quad \Delta\Delta B_{d} (i, j)_{t = t_{0} - \Delta t}, \quad Sim (i, j)_{t = t_{0} - \Delta t}

are respectively the first-order change index and the second-order change index of the internal structure index between community i and community j, and their external structure similarity index; t_0 and t_0 - Δt denote different time points, and Δt is the time interval.

Preferably, the step of determining the threshold at which communities merge comprises: normalizing the tendency values predicted by the prediction model; building a reference function from the normalized tendency values and the community fusion outcomes extracted from the training data; and taking the tendency value at which the reference function reaches its maximum as the threshold at which communities merge.

Preferably, the reference function is built according to the following expression:

F = \frac{2 αβ}{α + β}

In the formula, F is the reference function, α and β are parameters, and TD_0 is the tendency value corresponding to the community pairs that merge, as extracted from the training data.

Preferably, step four comprises the following steps: substituting the vectors formed from the key factor indices obtained from the training data into an SVM prediction model for training, so as to determine a classifier for community fusion; and substituting the vector formed from the key factor indices obtained from the data of the time slice nearest to the time point to be predicted into the SVM prediction model, and judging whether the communities will merge according to the resulting classification.

Preferably, before step three the method further comprises: predicting each community individually, based on the static communities and the dynamic communities, to obtain the set of communities that will take part in a fusion; in step three, the key factor indices between any two communities in that set are extracted based on the training data.

Compared with the prior art, one or more embodiments of the above scheme may have the following advantages or beneficial effects:

By extracting key factor indices between two communities, the method predicts whether any two or more communities will merge. Its prediction reliability is high, and it is generally applicable to the analysis of most weighted or unweighted complex networks.

Other advantages, objects and features of the present invention will be set forth to some extent in the description that follows, and to some extent will become apparent to those skilled in the art upon examination of what follows, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structure particularly pointed out in the following description, the claims and the accompanying drawings.

Brief description of the drawings

The accompanying drawings provide a further understanding of the technical scheme of the present application or of the prior art, and form part of the description. The drawings illustrating embodiments of the present application serve, together with the embodiments, to explain the technical scheme of the application, but do not limit it.

Fig. 1 is a schematic flowchart of the method for predicting community fusion events according to an embodiment of the present application;

Fig. 2 shows the cumulative distribution curves of the internal structure index;

Fig. 3 shows the cumulative distribution curves of the external structure similarity index;

Fig. 4 is a schematic flowchart of supervised training using the key factor indices according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of the method for predicting community fusion events according to an embodiment of the present application.

Embodiments

Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way in which the invention applies technical means to solve the technical problem and achieve the corresponding technical effect can be fully understood and implemented. The embodiments of the present application and the features within them may be combined with one another provided they do not conflict, and the resulting technical schemes all fall within the protection scope of the present invention.

In addition, the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

Some terms of the field that appear in this application are explained below:

Network topology: the structure formed by the nodes of a network and the connections between them.

Network community: complex networks are studied with tools such as graph theory. Given a graph G, a network community is a subgraph G' whose nodes are closely connected. The most intuitive way to quantify community structure is that the internal density of a network community is greater than its external density.

Time slice: a snapshot of the network taken at a fixed time point, that is, a slice at a certain time point of a continuously evolving network.

Static community: a community obtained by partitioning a single time slice.

Dynamic community: the evolution path formed by linking, in chronological order, the static communities obtained from a series of time slices.

Community fusion event: the nodes of two or more communities are detected, at some future time, as belonging to a single community.

Supervised training: iterative computation on the input and output data of a training set to obtain a model for classification, prediction and the like.

Training data: the historical data used to train the model.

Fig. 1 is a schematic flowchart of the method for predicting community fusion events according to an embodiment of the present application. The method comprises: step S110, dividing the raw network data into time slices of a set length and selecting the data of several time slices as training data; step S120, partitioning the training data into static communities and dynamic communities; step S130, extracting the key factor indices between any two communities based on the training data; step S140, performing supervised training on the key factor indices, and judging whether any two communities will merge according to the learning result of the supervised training.

Building on the idea of node-level link prediction, the present invention proposes a community-level link prediction method for judging community fusion events. Node-level link prediction uses information such as the known network structure to predict how likely it is that a connection will form, within some future period, between any two currently unconnected nodes of the network. Specifically, a moment that lies before the time point to be predicted and close to it is chosen and defined as the current time t_0; a time point t' before t_0 (t' < t_0) is chosen; the raw data in the time range [t', t_0], together with the index data extracted from the raw data in the time range [t', t_0), serve as training data; and the raw data in this time range are divided into time slices using the selected time interval.

In an embodiment of the present application, historical data close to the time point to be predicted are preferred when screening the raw data. This is because real network behavior usually shows short-term regularity: the state a network presents at some future moment depends mainly on factors in the period close to that moment. Therefore, the raw data of the 4 to 7 time slices preceding the time point to be predicted are generally taken as training data. Limiting the range of the raw data avoids introducing large amounts of invalid data that would degrade the prediction, and ensures the validity of the data. For example, to predict the establishment of mutual-follow relationships on Sina Weibo, the slicing interval can be user-defined within the range of 3 to 20 days. Specifically, the time point to be predicted is May 15, 2013, and the raw data comprise 33,000 user IDs and 3.65 million mutual-follow relationship records between users, each recording the time at which a Sina Weibo mutual-follow relationship was formed. The raw data span November 2010 to November 2013; the data from March 15, 2013 to May 1, 2013 are selected as training data, and the time interval is chosen as Δt = 15 days. May 1, 2013, one time interval before the time point to be predicted, is set as the current time t_0. Dividing the training data in this time range into time slices using Δt yields the time slices May 1, 2013, April 15, 2013, April 1, 2013 and March 15, 2013.
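The slicing step of the example can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes that each time slice is a cumulative snapshot containing every relationship formed on or before the slice date, and the function name `build_snapshots` and the toy records are hypothetical.

```python
from datetime import date


def build_snapshots(edges, slice_dates):
    """Return {slice_date: set of (u, v)} for every edge formed on or before that date.

    edges: iterable of (u, v, formed_on), with formed_on a datetime.date.
    Assumption: a time slice is a cumulative snapshot of the network.
    """
    snapshots = {}
    for cutoff in sorted(slice_dates):
        snapshots[cutoff] = {
            (min(u, v), max(u, v))  # store undirected edges in a canonical order
            for u, v, formed_on in edges
            if formed_on <= cutoff
        }
    return snapshots


# The four slice dates named in the Sina Weibo example.
SLICE_DATES = [date(2013, 3, 15), date(2013, 4, 1), date(2013, 4, 15), date(2013, 5, 1)]

# Toy mutual-follow records (hypothetical data, for illustration only).
toy_edges = [
    ("a", "b", date(2013, 3, 1)),
    ("b", "c", date(2013, 4, 10)),
    ("c", "d", date(2013, 4, 20)),
]

snapshots = build_snapshots(toy_edges, SLICE_DATES)
sizes = [len(snapshots[d]) for d in SLICE_DATES]
```

Each snapshot can then be passed to the static community partitioning step on its own.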

Next, static communities and dynamic communities are mined from the selected training data. The purpose of static community partitioning is to determine the community state of each time slice. The static partition is used for the subsequent extraction of dynamic communities, and it also facilitates extracting the parameter relationships between communities.

In an embodiment of the present application, the Fast-Unfolding algorithm is used to mine static communities. Fast-Unfolding is a modularity-based static community mining algorithm. It exploits the self-similarity of complex networks and treats the communities produced by each pass as one level of a hierarchy, so it can reveal a complete hierarchical community structure in a very short time. Its output contains the community mining result after each iteration, so that the result of a particular iteration can be selected as required. The algorithm proceeds as follows. First, each node in the network is regarded as an independent community and numbered. Then, for each node, each of its neighbor nodes is considered in turn: the modularity obtained by removing the node from its own community and adding it to the neighbor's community is calculated, the modularity values are compared, and the node is placed in the community that yields the largest modularity gain. This process is applied to every node in sequence and repeated until modularity can no longer be improved. Next, a new network is built by treating the mined communities as nodes, and the above iteration continues on it. The weight between two new nodes is the sum of the weights of the connections between the communities they represent, and connections between nodes of the same community become self-loops on the corresponding node of the new network. The iteration stops when modularity no longer increases, that is, when the community structure of the network no longer changes. The Fast-Unfolding algorithm is highly practical, and because its complexity is linear it runs very fast.
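The quantity that the node-moving step tries to increase is modularity. The sketch below computes the standard Newman modularity of a given partition for an unweighted, undirected graph; it only illustrates the objective and is not the Fast-Unfolding algorithm itself. The function name `modularity` and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (toy graph).
toy_graph = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
merged = {node: "A" for node in range(1, 7)}

q_split = modularity(toy_graph, split)
q_merged = modularity(toy_graph, merged)
```

On this toy graph the two-triangle partition scores higher than the single-community partition, which is the kind of comparison the algorithm makes when deciding where to place a node.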

It should be noted that other embodiments of the present application may use other modularity-based static community mining methods, such as the Newman greedy algorithm or the Newman fast algorithm. Static community partitioning yields a set of static communities for each time slice.

Key events of community evolution cause the structure of a community to take different states in different time slices. A dynamic community represents the set of states of a given community over a series of time slices, with the states ordered along the evolution path of the dynamic community. Dynamic communities are generally mined from the static partition results. Specifically, for any two adjacent time slices t and t+1, the static communities C_t of time slice t are matched against the static communities C_{t+1} of time slice t+1, and each C_{t+1} that satisfies a given similarity condition is appended to the dynamic community evolution sequence containing C_t; repeating this process extracts the evolution paths of the dynamic communities.

In an embodiment of the present application, the Louvain algorithm is used to mine dynamic communities. When initializing the set of dynamic communities, the algorithm creates, for each community in the static community set of the earliest time slice of the training data, a dynamic community evolution sequence with that community as its initial community. When two static communities are matched, the Jaccard similarity coefficient is used to judge whether they match. When the Jaccard similarity coefficient is greater than the selected matching threshold (for example 0.3), the two communities are considered to match, and the newly matched community is appended to the end of the dynamic community evolution sequence that contains the community it matched. When the Jaccard similarity coefficient is less than or equal to the threshold, the two communities are considered not to match, and a new evolution sequence is created for the community that found no match. If two or more communities match the same evolution sequence, a new evolution sequence is generated whose members before the matching moment are identical to those of the original sequence.
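One matching step between two adjacent time slices can be sketched as follows. This is a simplified illustration under stated assumptions: communities are sets of node numbers, an evolution sequence that finds no match is carried over unchanged (the text does not say what happens to it), and the names `jaccard` and `extend_timelines` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two node sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extend_timelines(timelines, next_communities, threshold=0.3):
    """Match the last community of each evolution sequence against the next slice.

    timelines: list of evolution sequences, each a list of frozensets of nodes.
    next_communities: list of frozensets found in the following time slice.
    A match requires a Jaccard coefficient strictly greater than the threshold.
    """
    extended = []
    matched = set()
    for timeline in timelines:
        hits = [
            idx for idx, community in enumerate(next_communities)
            if jaccard(timeline[-1], community) > threshold
        ]
        if not hits:
            extended.append(list(timeline))  # assumption: keep unmatched sequences as they are
        for idx in hits:
            # Several hits produce several sequences that share the same history.
            extended.append(timeline + [next_communities[idx]])
            matched.add(idx)
    for idx, community in enumerate(next_communities):
        if idx not in matched:
            extended.append([community])  # a community with no match starts a new sequence
    return extended


slice_t = [[frozenset({1, 2, 3})], [frozenset({4, 5, 6})]]
slice_t_plus_1 = [frozenset({1, 2, 3, 4, 5, 6}), frozenset({7, 8})]
result = extend_timelines(slice_t, slice_t_plus_1)
```

In this toy input both earlier communities have a Jaccard coefficient of 0.5 with the six-node community, so both sequences are extended with it, while the community {7, 8} starts a new sequence.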

For example, community mining on the Sina Weibo training data yields static community data files named "merge_yyyymmdd.comm", with the data of each time slice stored in its own file. Each line of such a file represents one community and consists of the numbers of the nodes belonging to that community. It also yields a dynamic community data file named "merge_yyyyddmm_4/15.timeline", where 4 indicates that this timeline, with t_0 as the current time, covers 4 time slices in total, and 15 indicates that the time slice interval Δt is 15 days. The data in this file have the format "dynamic community number 1: time slice number 1=static community number 1, time slice number 2=static community number 2, ...".

It should be noted that other embodiments of the present application may use different dynamic community extraction methods, such as the Louvain algorithm or the FEDN algorithm. From the mined static and dynamic communities, the community fusion events at time t_0 can be extracted easily. Specifically, if two different dynamic communities D_i and D_j at time slice t_0 - Δt are matched to the same community at time slice t_0, the two dynamic communities are said to have merged. Prediction indices can likewise be extracted easily from the training data on the basis of the mined communities.
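The fusion rule just stated can be sketched directly on timeline records of the form "time slice = static community number". The dictionary layout, the slice labels and the function name `find_fusions` below are assumptions chosen for illustration, not the file format of the application.

```python
from itertools import combinations


def find_fusions(timelines, prev_slice, curr_slice):
    """Return pairs of dynamic communities that merge between two slices.

    timelines: dict mapping a dynamic community number to a dict
               {slice label: static community number}.
    Two dynamic communities merge when they map to different static
    communities at prev_slice and to the same static community at curr_slice.
    """
    fusions = []
    for (id_a, states_a), (id_b, states_b) in combinations(sorted(timelines.items()), 2):
        present = all(
            label in states
            for label in (prev_slice, curr_slice)
            for states in (states_a, states_b)
        )
        if not present:
            continue  # a pair is only comparable if both exist in both slices
        if (states_a[prev_slice] != states_b[prev_slice]
                and states_a[curr_slice] == states_b[curr_slice]):
            fusions.append((id_a, id_b))
    return fusions


toy_timelines = {
    1: {"20130415": 10, "20130501": 20},
    2: {"20130415": 11, "20130501": 20},
    3: {"20130415": 12, "20130501": 21},
}
fusions = find_fusions(toy_timelines, "20130415", "20130501")
```

The resulting pairs provide the positive labels used later in supervised training.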

The prediction indices should capture as much as possible of the regularity contained in the training data. In the prior art, prediction indices are extracted for single communities, so prediction with those indices only tells whether a single community takes part in a fusion event; it cannot tell which communities will merge with one another. To solve this problem, the present invention extracts prediction indices for every pair of communities in the network. The extracted indices relate to the network topology of the pair, and the prediction result is whether the pair will merge.

Specifically, in an embodiment of the present application, several prediction indices that affect whether two communities will merge are extracted and collectively defined as key factor indices. According to how they influence fusion, they are divided into direct factor indices and indirect factor indices. The direct factor indices comprise the internal structure index between the two communities and the first-order and second-order change indices of that index; the indirect factor index is the external structure similarity index of the two communities. Each is introduced below.

The internal structure index B_d between community i and community j is defined as:

B_{d} (i, j) = \frac{E_{i, j}}{(E_{i} / N_{i} + E_{j} / N_{j}) / 2} \tag{1}

In the formula, E_{i,j} is the number of connections between community i and community j, E_i and E_j are the numbers of internal connections of community i and community j respectively, and N_i and N_j are their numbers of internal nodes.

Specifically, from the definition of a community and its network structure it can be preliminarily judged that, for two given communities, the more connections there are between them, the more likely they are to merge, and the lower their internal densities, the more likely they are to merge.
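Expression (1) can be computed from an edge list as in the sketch below. It assumes two disjoint, non-empty communities given as node sets and an undirected edge list; the handling of the case where neither community has internal connections is an assumption of this sketch, since the text does not cover it, and the function name is hypothetical.

```python
def internal_structure_index(edges, comm_i, comm_j):
    """B_d(i, j) = E_ij / ((E_i / N_i + E_j / N_j) / 2), following expression (1).

    edges: list of undirected (u, v) pairs.
    comm_i, comm_j: disjoint, non-empty sets of nodes.
    """
    e_between = e_i = e_j = 0
    for u, v in edges:
        if u in comm_i and v in comm_i:
            e_i += 1
        elif u in comm_j and v in comm_j:
            e_j += 1
        elif (u in comm_i and v in comm_j) or (u in comm_j and v in comm_i):
            e_between += 1
    mean_density = (e_i / len(comm_i) + e_j / len(comm_j)) / 2
    if mean_density == 0:
        # Assumption: the source does not define this case.
        return float("inf") if e_between else 0.0
    return e_between / mean_density


triangle_a = {1, 2, 3}
triangle_b = {4, 5, 6}
inner = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]

one_bridge = internal_structure_index(inner + [(3, 4)], triangle_a, triangle_b)
two_bridges = internal_structure_index(inner + [(3, 4), (2, 5)], triangle_a, triangle_b)
```

Each triangle has 3 internal connections over 3 nodes, so the mean internal density is 1 and the index equals the number of connections between the two communities.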

Fig. 2 shows the cumulative distribution curves of the internal structure index. At an arbitrarily chosen time point t_0 - Δt, the internal structure index between every two communities of that time slice is extracted, and whether each pair has merged at time point t_0 is observed. The cumulative distribution function (CDF) of the internal structure index is then plotted separately for pairs that merge and pairs that do not. In the figure, curve 1 is the cumulative distribution curve of the internal structure index for community pairs that do not merge, and curve 2 is that for community pairs that merge. As Fig. 2 shows, the values of the internal structure index are concentrated in the range 1 to 15 for merging pairs and in the range 0 to 0.5 for non-merging pairs. The internal structure index therefore distinguishes merging pairs from non-merging pairs effectively.

Further, a larger value of B_d indicates a higher likelihood that the communities will merge. The first-order change ΔB_d represents the growth rate of B_d over time: the larger the growth rate, the more pronounced the trend toward fusion. The same reasoning applies to the second-order change ΔΔB_d. Therefore, the first-order change index ΔB_d and the second-order change index ΔΔB_d are defined from B_d, as shown in expressions (2) and (3):

\Delta B_{d} (i, j)_{t_{0} - \Delta t} = B_{d} (i, j)_{t_{0} - \Delta t} - B_{d} (i, j)_{t_{0} - 2 \Delta t} \tag{2}

\Delta\Delta B_{d} (i, j)_{t_{0} - \Delta t} = \Delta B_{d} (i, j)_{t_{0} - \Delta t} - \Delta B_{d} (i, j)_{t_{0} - 2 \Delta t} \tag{3}

In the formulas,

B_{d} (i, j)_{t_{0} - \Delta t}, \quad \Delta B_{d} (i, j)_{t_{0} - \Delta t}, \quad \Delta\Delta B_{d} (i, j)_{t_{0} - \Delta t}

are respectively the internal structure index between community i and community j at time t_0 - Δt and its first-order and second-order change indices. The second-order change index can be expanded further into the form of expression (4):

\begin{aligned} \Delta\Delta B_{d} (i, j)_{t_{0} - \Delta t} &= \Delta B_{d} (i, j)_{t_{0} - \Delta t} - \Delta B_{d} (i, j)_{t_{0} - 2 \Delta t} \\ &= B_{d} (i, j)_{t_{0} - \Delta t} - 2 B_{d} (i, j)_{t_{0} - 2 \Delta t} + B_{d} (i, j)_{t_{0} - 3 \Delta t} \end{aligned} \tag{4}

Therefore, in an embodiment of the present application, applying the above direct factor indices for prediction requires the data of four time slices as training data, namely t_0, t_0 - Δt, t_0 - 2Δt and t_0 - 3Δt.
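The two change indices follow from three consecutive values of B_d, which is why three slices before t_0 are needed in addition to t_0 itself (the slice that supplies the fusion outcome). A minimal sketch, with a hypothetical function name and made-up input values:

```python
def change_indices(b_t1, b_t2, b_t3):
    """First- and second-order change of B_d at t0 - dt.

    b_t1, b_t2, b_t3: values of B_d(i, j) at t0 - dt, t0 - 2*dt and t0 - 3*dt.
    Returns (delta_b, delta_delta_b) as in expressions (2) and (3).
    """
    delta_b = b_t1 - b_t2               # expression (2) at t0 - dt
    delta_b_prev = b_t2 - b_t3          # expression (2) at t0 - 2*dt
    delta_delta_b = delta_b - delta_b_prev  # expression (3)
    return delta_b, delta_delta_b


# Made-up index values for one community pair.
first_order, second_order = change_indices(5.0, 3.0, 2.0)
```

The second value agrees with the expanded form of expression (4): 5.0 - 2 × 3.0 + 2.0 = 1.0.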

The indirect factor index is extracted by adapting a method for measuring the local structural similarity between nodes. The index used to predict how likely two nodes n_i and n_j are to become connected is their structural similarity, as shown in expression (5):

Sim (n_{i}, n_{j}) = \frac{\vec{n}_{i} \cdot \vec{n}_{j}}{| \vec{n}_{i} | + | \vec{n}_{j} | - \vec{n}_{i} \cdot \vec{n}_{j}} = \frac{\sum_{k = 1}^{m} (w_{ik} \times w_{jk})}{\sqrt{\sum_{k = 1}^{m} w_{ik}^{2}} + \sqrt{\sum_{k = 1}^{m} w_{jk}^{2}} - \sum_{k = 1}^{m} (w_{ik} \times w_{jk})} \tag{5}

In the formula, w_{ij} is the weight of the connection between node n_i and node n_j; in an unweighted network this weight is 1. If there is no connection between node n_i and node n_j, w_{ij} is 0.

In an embodiment of the present application, the external structure similarity index between two communities is extracted by defining a weight between communities. Specifically, the weight w_{i,j} between community i and community j is given by expression (6):

w_{i, j} = \frac{E_{i, j}^{2}}{N_{i} \times N_{j}} \tag{6}

In the formula, E_{i,j} is the number of connections between community i and community j, and N_i and N_j are the numbers of nodes in community i and community j respectively.

The weight w_{i,j} measures the strength of the relationship between two communities while ignoring their detailed internal structure. The more connections there are between two communities, the larger the weight; and for a fixed number of connections between them, the fewer nodes the communities contain, the larger the weight.

The resulting external structure similarity index between community i and community j is given by expression (7), where m is the number of communities:

Sim (i, j) = \frac{\sum_{k = 1, k \neq i, j}^{m} (w_{i, k} \times w_{j, k})}{\sqrt{\sum_{k = 1, k \neq i, j}^{m} w_{i, k}^{2}} + \sqrt{\sum_{k = 1, k \neq i, j}^{m} w_{j, k}^{2}} - \sum_{k = 1, k \neq i, j}^{m} (w_{i, k} \times w_{j, k})}, \quad w_{i, k} = \frac{E_{i, k}^{2}}{N_{i} \times N_{k}}, \quad w_{j, k} = \frac{E_{j, k}^{2}}{N_{j} \times N_{k}} \tag{7}

Fig. 3 shows the cumulative distribution curves of the external structure similarity index, constructed in the same way as the CDF curves of the internal structure index. In the figure, curve 1 is the cumulative distribution curve of the external structure similarity index for community pairs that do not merge, and curve 2 is that for community pairs that merge. As Fig. 3 shows, the values of the external structure similarity index are concentrated between 0.00015 and 0.006 for merging pairs, and are mostly 0 for non-merging pairs. That is, the larger the external structure similarity index, the more likely the communities are to merge, and the index distinguishes merging pairs from non-merging pairs effectively.
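The external structure similarity index can be sketched as follows. The code follows expressions (6) and (7) as written, including the unsquared norms in the denominator; the input layout (a dictionary of community sizes and a dictionary of connection counts keyed by unordered community pairs), the zero-denominator fallback and the function name are assumptions of this sketch.

```python
import math


def external_similarity(i, j, sizes, between):
    """Sim(i, j) following expressions (6) and (7).

    sizes: dict mapping a community label to its number of nodes.
    between: dict mapping frozenset({a, b}) to the number of connections
             between communities a and b (missing pairs count as 0).
    """
    def weight(a, k):
        # Expression (6): w = E^2 / (N_a * N_k)
        return between.get(frozenset((a, k)), 0) ** 2 / (sizes[a] * sizes[k])

    others = [k for k in sizes if k not in (i, j)]
    dot = sum(weight(i, k) * weight(j, k) for k in others)
    norm_i = math.sqrt(sum(weight(i, k) ** 2 for k in others))
    norm_j = math.sqrt(sum(weight(j, k) ** 2 for k in others))
    denominator = norm_i + norm_j - dot
    # Assumption: return 0 when neither community has any outside neighbour.
    return dot / denominator if denominator else 0.0


toy_sizes = {"A": 2, "B": 2, "C": 2}

shared = {frozenset(("A", "C")): 2, frozenset(("B", "C")): 2}
sim_shared = external_similarity("A", "B", toy_sizes, shared)

unshared = {frozenset(("A", "C")): 2}
sim_unshared = external_similarity("A", "B", toy_sizes, unshared)
```

When A and B are both connected to the same third community C the index is positive, and when only A is connected to C it is 0, which matches the "common friends" intuition behind the index.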

Above-mentioned each direct factor index and indirect factor index, be all through the result of coincidence theory analysis that verification experimental verification obtains and practical operation.Both taken into full account the major influence factors between corporations and minor effect factor, and be easy to again calculate and realize.Wherein, direct factor index defines based on the architectural characteristic of link completely, has universality, can directly apply in the analysis of complex network.Indirect factor Index Establishment has on the basis of friend relation attribute at social networks.If two strangers have much common friend, then these two strangers become the possibility of friend can be larger, the prediction therefore for the fusion event of social networks can obtain good effect.

Next, supervised training is performed on the above key factor indices to judge whether any two corporations will merge. In one embodiment of the application, the supervised training process using the key factor indices is shown in Fig. 4 and comprises: step S410, building a forecast model from the key factor indices obtained from training data, and determining the separatrix value at which corporations merge; step S420, substituting the key factor indices obtained from the data of the timeslice nearest the time point to be predicted into the forecast model, and comparing the prediction result with the separatrix value to judge whether the corporations will merge.

Specifically, a probability simulation function is first established between the inner structure index and the probability that corporations merge. The function R(i, j) represents the fusion status of the corporations: R(i, j) is 1 when corporations i and j merge at time t₀, and 0 when they do not, as given by expression (8):

R(i, j)_{t = t_{0}} = \begin{cases} 1, & \text{corporations } i \text{ and } j \text{ merge at time } t_{0} \\ 0, & \text{otherwise} \end{cases} \qquad (8)

Then, when the inner structure index B_d(i, j) at time t₀ − Δt equals BD₀, the probability that corporations i and j will merge at time t₀ is given by expression (9):

P(R(i, j)_{t = t_{0}} = 1 \mid B_{d}(i, j)_{t = t_{0} - \Delta t} = BD_{0}) = \frac{\sum_{i, j \in m,\, i \neq j} R(i, j)_{t = t_{0}}}{N(i, j)_{t = t_{0}}} \,\Big|\, B_{d}(i, j)_{t = t_{0} - \Delta t} = BD_{0} \qquad (9)

In the formula,

\sum_{i, j \in m,\, i \neq j} R(i, j)_{t = t_{0}} \,\Big|\, B_{d}(i, j)_{t = t_{0} - \Delta t} = BD_{0}

is the number of corporation pairs that merge at time t₀ given that B_d(i, j) at time t₀ − Δt equals BD₀, and N(i, j)_{t = t₀} under the same condition is the total number of corporation pairs at time t₀. A series of BD₀ values yields a series of conditional probability values; fitting a probability function to them gives the probability simulation function P_B. For example, when the function is fitted to historical data from Sina Weibo, the resulting probability simulation function has the form shown in expression (10):

P_{B}(R(i, j)_{t = t_{0}} = 1 \mid B_{d}(i, j)_{t = t_{0} - \Delta t}) = b \times \ln [B_{d}(i, j)_{t = t_{0} - \Delta t} - a] + T_{0} \qquad (10)

Here a, b and T₀ are the parameters of the probability function produced by fitting on the training data, with a = −0.5, b = 0.038 and T₀ = 0.03.
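The two steps above, estimating conditional merge probabilities and evaluating the fitted function of expression (10), can be sketched in Python as follows. Grouping the inner structure index into bins of width `bin_width` is an assumption made so that each group contains enough pairs; the text conditions on the index value BD₀ itself. The function names are hypothetical, and the default parameter values are those reported for the Sina Weibo data.

```python
import math
from collections import defaultdict


def empirical_merge_probability(samples, bin_width=0.5):
    """Estimate P(R = 1 | B_d) per bin, in the spirit of expression (9).

    samples: iterable of (bd, merged) pairs, merged being 1 or 0.
    Returns {bin_start: fraction of pairs in that bin that merged}.
    """
    merged_count = defaultdict(int)
    total_count = defaultdict(int)
    for bd, merged in samples:
        key = math.floor(bd / bin_width) * bin_width  # assumed binning
        total_count[key] += 1
        merged_count[key] += merged
    return {k: merged_count[k] / total_count[k] for k in sorted(total_count)}


def p_b(bd, a=-0.5, b=0.038, T0=0.03):
    """Fitted probability simulation function b * ln(bd - a) + T0 (expression (10))."""
    return b * math.log(bd - a) + T0
```

The logarithm requires bd > a; with a = −0.5 this holds for any non-negative inner structure index.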

Further, the forecast model is built from the probability simulation function obtained above, combined with the first-order change indicator and second-order change indicator of the inner structure index and the external structure similarity index of the two corporations, as shown in expression (11):

\begin{aligned} Td_{t = t_{0}} = {} & P_{B}(R(i, j)_{t = t_{0}} = 1 \mid B_{d}(i, j)_{t = t_{0} - \Delta t}) \times \left(1 + \log_{1 + \max |\Delta B_{d}(i, j)_{t = t_{0} - \Delta t}|} (1 + \Delta B_{d}(i, j)_{t = t_{0} - \Delta t})\right) \\ & \times \left(1 + \log_{1 + \max |\Delta\Delta B_{d}(i, j)_{t = t_{0} - \Delta t}|} (1 + \Delta\Delta B_{d}(i, j)_{t = t_{0} - \Delta t})\right) + Sim(i, j)_{t = t_{0} - \Delta t} \end{aligned} \qquad (11)

In the formula, Td_{t = t₀} represents the tendency degree of corporations i and j to merge at time t₀. The relationship between the forecast model and B_d, ΔB_d and ΔΔB_d follows from the probability function established above between the inner structure index and corporation fusion, and from the earlier analysis showing that the larger ΔB_d and ΔΔB_d are, the more likely the corporations are to merge. Moreover, for a given inner structure index, the larger the external structure similarity index, the more likely the corporations are to merge. The forecast model for measuring the fusion tendency degree of corporations is constructed on this basis. Note that the forecast model is not unique.
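Expression (11) can be sketched in Python as below. The sketch assumes that the maxima of |ΔB_d| and |ΔΔB_d| are taken over all corporation pairs in the timeslice and are passed in by the caller, and that each change value is greater than −1 so that the logarithm is defined. Treating the factor as 1 when every change is zero is a further assumption of this sketch. All names are hypothetical.

```python
import math


def change_factor(delta, max_abs_delta):
    """Return 1 + log base (1 + max|delta|) of (1 + delta).

    Assumes delta > -1. When all changes are zero the logarithm base
    would be 1, so the factor is taken as 1 (assumption of this sketch).
    """
    if max_abs_delta == 0:
        return 1.0
    return 1.0 + math.log(1.0 + delta, 1.0 + max_abs_delta)


def tendency_degree(p_merge, d_bd, dd_bd, sim, max_abs_d, max_abs_dd):
    """Tendency degree Td of expression (11).

    p_merge: value of the probability simulation function P_B
    d_bd:    first-order change of the inner structure index
    dd_bd:   second-order change of the inner structure index
    sim:     external structure similarity index
    """
    return (p_merge
            * change_factor(d_bd, max_abs_d)
            * change_factor(dd_bd, max_abs_dd)
            + sim)
```

A pair whose first-order change equals the largest observed change doubles its base probability, while a pair with no change keeps the base probability unchanged.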

Determining the separatrix value at which corporations merge, using the forecast model and the training data, comprises: normalizing the tendency degree values predicted by the forecast model; establishing a reference function from the normalized tendency degree values and the corporation fusion situation extracted from the training data; and taking the tendency degree value at which the reference function reaches its maximum as the separatrix value at which corporations merge.

Concrete, the tendency degree of expression formula (11) is normalized and can obtains expression formula (12):

Td_{merge} |_{t = t_{0}} = \frac{Td_{t = t_{0}} - \min (Td_{t = t_{0}})}{\max (Td_{t = t_{0}}) - \min (Td_{t = t_{0}})} \qquad (12)
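Expression (12) is a min-max normalization over all corporation pairs of one timeslice. A minimal Python sketch follows; the function name and the handling of the degenerate case in which all values are equal are assumptions.

```python
def normalize_tendency(values):
    """Min-max normalize tendency degree values to [0, 1] (expression (12))."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread, every pair is mapped to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```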

The prediction results obtained for timeslice t₀ are sorted in descending order of normalized tendency degree and compared with the actual corporation fusions at time t₀ to establish the reference function. For the tendency degree value TD₀ of each pair correctly predicted to merge, the following is calculated:

In the formula, TD₀ is the tendency degree value of a merging corporation pair extracted from the training data. The reference function is then established according to expression (13), and the value of TD₀ at which expression (13) reaches its maximum is taken as the separatrix value div₀ for judging whether corporations will merge.

F = \frac{2 \alpha \beta}{\alpha + \beta} \qquad (13)

When choosing div₀, the values of α and β corresponding to the chosen value should be relatively balanced; the reference function F is therefore computed as the harmonic mean of α and β.
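The selection of div₀ can be sketched in Python as follows. The sketch assumes that α is the precision and β the recall obtained when every pair whose normalized tendency degree is at least a candidate TD₀ is predicted to merge; the text does not reproduce the defining expressions for α and β, so this reading is an assumption, as are the function and variable names.

```python
def choose_separatrix(td_values, merged):
    """Pick the candidate TD0 that maximizes F = 2*alpha*beta / (alpha + beta).

    td_values: normalized tendency degrees, one per corporation pair.
    merged:    1 if the pair actually merged at t0, else 0.
    Returns (div0, best_f); div0 is None if no pair merged.
    """
    positives = sum(merged)
    best_div, best_f = None, 0.0
    # Candidates are the tendency values of pairs that really merged.
    candidates = sorted({td for td, r in zip(td_values, merged) if r})
    for cand in candidates:
        predicted = [td >= cand for td in td_values]  # assumed decision rule
        true_pos = sum(1 for p, r in zip(predicted, merged) if p and r)
        alpha = true_pos / sum(predicted)  # assumed: precision
        beta = true_pos / positives        # assumed: recall
        f = 2 * alpha * beta / (alpha + beta)
        if f > best_f:
            best_div, best_f = cand, f
    return best_div, best_f
```

Because each candidate is the value of a pair that actually merged, every candidate yields at least one true positive, so α + β is never zero.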

Once the forecast model and the separatrix value have been obtained, the corporation fusions at time t₀ + Δt can be predicted from the data at the current time t₀. Specifically, the key factor indices are extracted from the data at time t₀ and substituted into the forecast model of expression (11) as its input data; the prediction result, normalized according to expression (12), is then compared with the fusion separatrix value div₀ to judge whether the corporations will merge. When Td_merge > div₀, the corresponding corporation pair is predicted to merge; otherwise it is predicted not to merge.

According to an embodiment of the application, the corporations that make up a specific fusion event can further be derived from the corporation pairs predicted to merge. For example, if corporations i and j are predicted to merge, corporations j and k are predicted to merge, and corporations k and i are predicted to merge, then corporations i, j and k constitute one fusion event. Predicting the fusion behavior between any two corporations effectively narrows the observation scope and reduces the difficulty of corporation fusion analysis on large data.
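One way to derive fusion events from predicted pairs is sketched below in Python. It groups corporations into connected components with a union-find structure, which is a simplifying assumption: the example in the text has every pair among i, j and k predicted to merge, whereas this sketch also groups corporations that are linked only through a chain of predicted pairs. The function name is hypothetical.

```python
def fusion_events(predicted_pairs):
    """Group corporations into fusion events from predicted merging pairs.

    predicted_pairs: iterable of (a, b) corporation identifiers.
    Returns a list of sorted member lists, one per event.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in predicted_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

For the pairs (i, j), (j, k) and (k, i) this returns a single event containing i, j and k.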

For example, in the Sina Weibo prediction above, calculating α, β and F shows that F reaches its maximum of 0.13 when TD₀ is 0.1247, so 0.1247 is taken as the separatrix value for judging whether corporations will merge. When the Td_merge obtained from the forecast model is greater than 0.1247, the corresponding corporation pair is predicted to merge; when Td_merge is less than or equal to 0.1247, the pair is predicted not to merge.

Predicting the fusion behavior of corporations with a forecast model built from training data is a supervised training method. Experimental data show that its accuracy rate and recall rate are good, that it can support relevant practical applications, and that it facilitates further research on corporation evolution.

Of course, when supervised training is used to judge whether any two corporations will merge, an existing forecast model can also be used for training and prediction. In other embodiments of the application, an SVM model is used for prediction. The procedure is as follows: first, the vectors formed from the key factor indices obtained from training data are substituted into the SVM forecast model for training, to determine the classifier for corporation fusion; then the vector formed from the key factor indices obtained from the data of the timeslice nearest the time point to be predicted is substituted into the SVM forecast model, and whether the corporations will merge is judged from the classification result. Specifically, for corporations C_i and C_j in timeslice t₀ − Δt, the direct factor indices are the inner structure index B_d(C_i, C_j), its first-order change ΔB_d(C_i, C_j) and its second-order change ΔΔB_d(C_i, C_j), and the indirect factor index is the external structure similarity Sim(C_i, C_j) of the corporations. These four parameters are each normalized, and the n-th training-set input variable used by this module is

X_{n} = (B_{d}^{'} {(C_{i}, C_{j})}_{t_{0} - Δt}, Δ B_{d}^{'} {(C_{i}, C_{j})}_{t_{0} - Δt}, {ΔΔB}_{d}^{'} {(C_{i}, C_{j})}_{t_{0} - Δt}, {Sim}^{'} {(C_{i}, C_{j})}_{t_{0} - Δt})^{T}

The output variable of the n-th training-set record is

Further, all training-set input and output variables are substituted into the classifier, which is trained to obtain the classification model and the hyperplane. The normalized prediction input data of timeslice t₀ are then substituted in, where the n-th prediction input variable is

X_{n} = (B_{d}^{'} {(C_{i}, C_{j})}_{t_{0}}, Δ B_{d}^{'} {(C_{i}, C_{j})}_{t_{0}}, {ΔΔB}_{d}^{'} {(C_{i}, C_{j})}_{t_{0}}, {Sim}^{'} {(C_{i}, C_{j})}_{t_{0}})^{T} .

Substituting it into the classification model gives a result of +1 or −1, which shows on which side of the hyperplane the data lie; this is the prediction of whether the corporations will merge at time t₀ + Δt.
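The preparation of SVM inputs described above can be sketched in Python as follows. The sketch assumes min-max normalization for each of the four parameters (the text only says they are normalized) and labels of +1 for merging pairs and −1 for non-merging pairs, in line with the +1 or −1 classification result. The function names are hypothetical, and the SVM training itself is left to whichever implementation is used.

```python
def build_vectors(rows):
    """Normalize each key factor index column and return 4-dimensional vectors.

    rows: list of (bd, d_bd, dd_bd, sim) tuples, one per corporation pair.
    """
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        # Assumption: min-max scaling; a constant column maps to 0.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in column])
    return [list(vector) for vector in zip(*scaled_columns)]


def build_labels(merged_flags):
    """Map the observed fusion status to +1 (merged) or -1 (not merged)."""
    return [1 if merged else -1 for merged in merged_flags]
```

The resulting vectors and labels could then be passed to any SVM classifier, for example the `SVC` class of scikit-learn, for training and prediction.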

Note that in real social networks, non-fusion events far outnumber fusion events, so the initial training set contains far more non-merging samples than merging samples. The tendency degree model established in the embodiments of the application yields a fusion tendency degree value for any key factor index values, and the values obtained discriminate clearly between the two cases. The predictions of an SVM model, by contrast, lean toward the class with more samples, and the number of corporations predicted to merge may be zero. If the samples are resampled so that the two classes are balanced, the quality of the sample data drops sharply and the trained model is seriously distorted.

In addition, the forecasting methods above all directly predict whether any two corporations will merge. Clearly, however, predicting which two corporations, or which several corporations, will merge presupposes that each corporation itself is likely to merge. Therefore, in another embodiment of the application, each single corporation is first judged for its likelihood of participating in fusion, which further narrows the scope of prediction. As shown in Fig. 5, the method comprises: step S510, splitting the network raw data according to the set timeslices and choosing the data of multiple timeslices as training data; step S520, dividing the training data into static corporations and dynamic corporations; step S530, predicting each corporation separately based on the static corporations and the dynamic corporations, to obtain the set of corporations that will participate in fusion; step S540, extracting from the training data the key factor indices between any two corporations in the set; step S550, performing supervised training on the key factor indices, and judging from the learning result of the supervised training whether any two corporations will merge.

Specifically, the step of predicting each corporation separately based on the static corporations and the dynamic corporations can use any prior-art method for predicting whether a single corporation participates in fusion. For example, using the existing SVM (support vector machine) classification algorithm, three basic indices are extracted for each corporation of timeslice t₀ − Δt: the corporation size, the ratio of the corporation's internal edges to the sum of its node degrees (In-Degree), and the Jaccard similarity coefficient of the corporation between timeslice t₀ − Δt and timeslice t₀ − 2Δt, together with the first-order change indicator and second-order change indicator of these three indices. Whether the corporation merges with other corporations in timeslice t₀ is then observed (fusion is labeled 1, non-fusion −1). These indices and fusion results are substituted into the SVM classifier for training to obtain the forecast model. Likewise, the indices of the corporations in timeslice t₀ are substituted into the forecast model: a classification result of "1" predicts that the corporation is about to merge with other corporations, and a classification result of "−1" predicts that it will not. This determines the set of corporations that will participate in fusion; building the key factor indices on this set greatly reduces the workload and improves forecasting efficiency.
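Two of the single-corporation indices mentioned above can be sketched in Python as follows. The Jaccard similarity coefficient is computed on the member sets of the same corporation in two timeslices. The change indicators are taken here as plain finite differences over consecutive timeslices, which is an assumption of this sketch. The function names are hypothetical.

```python
def jaccard(members_a, members_b):
    """Jaccard similarity coefficient of two member sets."""
    a, b = set(members_a), set(members_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def change_indicators(series):
    """First-order and second-order change of one index.

    series: index values at consecutive timeslices, oldest first,
            with at least three entries.
    Assumption: changes are simple differences between timeslices.
    """
    first = series[-1] - series[-2]
    second = first - (series[-2] - series[-3])
    return first, second
```

For a corporation whose size grows from 1 to 2 to 4 over three timeslices, the first-order change is 2 and the second-order change is 1.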

The corporation fusion forecasting method of the embodiments of the application can be applied to weighted networks as well as unweighted networks, and therefore has a more general range of application. By defining key factor indices between any two corporations, it achieves effective prediction of the fusion behavior between two corporations or among multiple corporations.

Although the embodiments disclosed by the present invention are as above, the content described is only embodiments adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which the invention belongs may, without departing from the spirit and scope disclosed by the invention, make any modification or change in the form and details of implementation; the scope of patent protection of the invention, however, shall still be as defined by the appended claims.

Claims

1. A method for forecasting corporation fusion events, comprising:

Step 1: splitting network raw data according to set timeslices, and choosing the data of multiple timeslices as training data;

Step 2: dividing the training data into static corporations and dynamic corporations;

Step 3: extracting key factor indices between any two corporations based on the training data;

Step 4: performing supervised training on the key factor indices, and judging whether any two corporations will merge according to the learning result of the supervised training.

2. The method according to claim 1, characterized in that

the key factor indices comprise the inner structure index between two corporations, the first-order change indicator and second-order change indicator of the inner structure index, and the external structure similarity index of the two corporations.

3. The method according to claim 2, characterized in that the inner structure index between the two corporations is extracted using the following expression:

B_{d} (i, j) = \frac{E_{i, j}}{(E_{i} / N_{i} + E_{j} / N_{j}) / 2}

4. The method according to claim 2, characterized in that the external structure similarity index of the two corporations is extracted using the following expression:

Sim(i, j) = \frac{\sum_{k=1,\, k \neq i,j}^{m} (w_{i,k} \times w_{j,k})}{\sqrt{\sum_{k=1,\, k \neq i,j}^{m} w_{i,k}^{2}} + \sqrt{\sum_{k=1,\, k \neq i,j}^{m} w_{j,k}^{2}} - \sum_{k=1,\, k \neq i,j}^{m} (w_{i,k} \times w_{j,k})}

5. The method according to claim 2, characterized in that step 4 comprises the following steps:

building a forecast model from the key factor indices obtained from the training data, and determining the separatrix value at which corporations merge;

substituting the key factor indices obtained from the data of the timeslice nearest the time point to be predicted into the forecast model, and comparing the prediction result with the separatrix value to judge whether the corporations will merge.

6. The method according to claim 5, characterized in that the forecast model is built using the following expression:

\begin{matrix} {Td}_{t = t_{0}} = P_{B} (R {(i, j)}_{t = t_{0}} = 1 | B_{d} {(i, j)}_{t = t_{0} - Δt}) \times (1 + \log_{1 + \max | {ΔB}_{d} {(i, j)}_{t = t_{0} - Δt} |} (1 + {ΔB}_{d} {(i, j)}_{t = t_{0} - Δt})) \\ \times (1 + \log_{1 + \max | {ΔΔB}_{d} {(i, j)}_{t = t_{0} - Δt} |} (1 + {ΔΔB}_{d} {(i, j)}_{t = t_{0} - Δt})) + Sim {(i, j)}_{t = t_{0} - Δt} \end{matrix}

where

P_{B}(R(i, j)_{t = t_{0}} = 1 \mid B_{d}(i, j)_{t = t_{0} - \Delta t})

is the probability simulation function; ΔB_d(i, j), ΔΔB_d(i, j) and Sim(i, j), each taken at t = t₀ − Δt, are respectively the first-order change indicator and second-order change indicator of the inner structure index between corporations i and j and their external structure similarity index; t₀ and t₀ − Δt denote different time points, and Δt is the time interval.

7. The method according to claim 5 or 6, characterized in that the step of determining the separatrix value at which corporations merge comprises:

normalizing the tendency degree values predicted by the forecast model;

establishing a reference function from the normalized tendency degree values and the corporation fusion situation extracted from the training data;

taking the tendency degree value at which the reference function reaches its maximum as the separatrix value at which corporations merge.

8. The method according to claim 7, characterized in that the reference function is established according to the following expression:

F = \frac{2 αβ}{α + β}

9. The method according to claim 1, characterized in that step 4 comprises the following steps:

substituting the vectors formed from the key factor indices obtained from the training data into an SVM forecast model for training, to determine the classifier for corporation fusion;

substituting the vector formed from the key factor indices obtained from the data of the timeslice nearest the time point to be predicted into the SVM forecast model, and judging from the classification result whether the corporations will merge.

10. The method according to claim 1, characterized in that, before step 3, the method further comprises:

predicting each corporation separately based on the static corporations and the dynamic corporations, to obtain the set of corporations that will participate in fusion;

and in step 3, extracting from the training data the key factor indices between any two corporations in the set.