Disclosure of Invention
Traditional analysis of bidding companies relies mainly on manual statistics and manual checking of bid contents, which makes analysis of the company relationship network very inefficient. To address this problem, the invention provides a method for predicting the network relationships of bidding companies based on community structure, comprising the following steps:
collecting information on the companies participating in bidding, together with cooperation relationship data among companies in the bidding field to which they belong, and preprocessing the collected data;
taking each company as a node and connecting two companies with an edge whenever they bid on the same project, thereby constructing a company relationship network;
extracting local topological features of each company node from the relationship between the node and its neighbor nodes, and using these features as the local similarity of the nodes;
detecting the community structure of the company nodes with a community discovery algorithm to obtain a community partition;
calculating the link closeness between different communities from the structural information of the partitioned communities;
fusing the local similarity of two nodes with the link closeness between the communities in which the two nodes are located to obtain the similarity of each node pair, using the similarity as the probability that an edge will form between the nodes, then sorting all the probabilities and outputting the candidate edges with the highest similarity between nodes of the bidding company data set.
Further, in the company relationship network, each company node is represented by a unique, non-repeating ID, and the local similarity calculation is then performed on the nodes; when a node is visualized, its shape and size represent its degree: the more companies that have bid on projects together with a company, that is, the more edges connected to its node, the greater the degree of the node.
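The network construction described above can be sketched as follows. This is a minimal illustration, assuming the bidding records have already been reduced to a mapping from project to bidders; the function name `build_company_network` and the toy records are hypothetical and do not come from the Company data set.

```python
from collections import defaultdict
from itertools import combinations

def build_company_network(project_bidders):
    """Return (company_to_id, adjacency) for an undirected co-bidding graph."""
    company_to_id = {}
    adjacency = defaultdict(set)
    for bidders in project_bidders.values():
        # Give each company a unique, non-repeating integer ID on first sight.
        for name in bidders:
            company_to_id.setdefault(name, len(company_to_id))
        # Every pair of companies bidding on the same project shares an edge.
        for a, b in combinations(sorted(set(bidders)), 2):
            ia, ib = company_to_id[a], company_to_id[b]
            adjacency[ia].add(ib)
            adjacency[ib].add(ia)
    return company_to_id, dict(adjacency)

# Hypothetical toy records: project name -> list of bidders.
records = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "D"],
    "P3": ["B", "C"],
}
ids, adj = build_company_network(records)
# The degree of a node is the number of distinct co-bidders.
degree = {node: len(neighbors) for node, neighbors in adj.items()}
```

Storing neighbors as sets means that two companies that meet on several projects still share a single edge, which matches the unweighted network described here.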
As an alternative implementation, the Common Neighbor algorithm is used to capture the topological relationship between company nodes, and the local similarity of an unconnected node pair is expressed as:
CN(x,y)=|Γ(x)∩Γ(y)|;
wherein Γ(x) and Γ(y) denote the sets of neighbor nodes connected to nodes x and y, respectively.
As an alternative implementation, the Adamic-Adar algorithm is used to capture the topological relationship between company nodes, and the local similarity of an unconnected node pair is expressed as:
AA(x,y)=Σ_{z∈Γ(x)∩Γ(y)} 1/log k(z);
wherein Γ(x) and Γ(y) denote the sets of neighbor nodes connected to nodes x and y, respectively, and k(z) denotes the degree of node z.
As an alternative implementation, the Resource Allocation algorithm is used to capture the topological relationship between company nodes, and the local similarity of an unconnected node pair is expressed as:
RA(x,y)=Σ_{z∈Γ(x)∩Γ(y)} 1/k(z);
wherein Γ(x) and Γ(y) denote the sets of neighbor nodes connected to nodes x and y, respectively, and k(z) denotes the degree of node z.
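The three local similarity indices can be sketched in a few lines. The example below assumes the network is held as a dictionary mapping each node to its neighbor set, and uses the standard definitions of Common Neighbors, Adamic-Adar (natural logarithm) and Resource Allocation; the function names and the toy graph are illustrative assumptions.

```python
import math

def common_neighbors(adj, x, y):
    """CN(x, y): number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])

def adamic_adar(adj, x, y):
    """AA(x, y): each shared neighbor z contributes 1 / log k(z)."""
    # A neighbor shared by two distinct nodes has k(z) >= 2, so log k(z) > 0.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])

def resource_allocation(adj, x, y):
    """RA(x, y): each shared neighbor z contributes 1 / k(z)."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

# Hypothetical toy graph: B and D are not connected but share neighbor A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
cn = common_neighbors(adj, "B", "D")
aa = adamic_adar(adj, "B", "D")
ra = resource_allocation(adj, "B", "D")
```

Both AA and RA down-weight shared neighbors of high degree; RA penalizes them more strongly because it divides by k(z) rather than by its logarithm.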
Further, obtaining the corresponding community structure with the Louvain algorithm comprises the following steps:
traversing all nodes in the network and, for each node, calculating the modularity gain of moving the node into the community of each of its neighbor nodes, then assigning the node to the community with the largest positive gain;
reconstructing the network by merging all nodes of the same community into a single node of a new network;
updating the internal weight of each node in the new network to the sum of the internal weights of the node set merged into it, and updating the weight of each edge between nodes in the new network to the sum of the weights on the edges connecting the two corresponding communities;
repeating the above steps until the modularity no longer changes, and outputting the networks and community partitions constructed at the different granularities.
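The first step above (moving each node to the neighboring community with the largest positive gain) can be sketched as follows. For clarity, this illustration recomputes the modularity from scratch for every trial move instead of using the incremental gain formula of a production Louvain implementation, so it is only suitable for small graphs; the function names and the toy graph are assumptions.

```python
def modularity(adj, community):
    """Q = (1/2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j)."""
    two_m = sum(len(neighbors) for neighbors in adj.values())
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m

def local_moving_pass(adj, community):
    """One sweep over all nodes; returns (new_partition, whether_any_node_moved)."""
    community = dict(community)
    moved = False
    for node in sorted(adj):
        base = modularity(adj, community)
        best_gain, best_comm = 0.0, community[node]
        for target in sorted({community[n] for n in adj[node]}):
            if target == community[node]:
                continue
            trial = dict(community)
            trial[node] = target
            gain = modularity(adj, trial) - base
            if gain > best_gain + 1e-12:  # keep only strictly positive gains
                best_gain, best_comm = gain, target
        if best_comm != community[node]:
            community[node] = best_comm
            moved = True
    return community, moved

# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community = {node: node for node in adj}  # start with one community per node
moved = True
while moved:
    community, moved = local_moving_pass(adj, community)
q_final = modularity(adj, community)
```

On this toy graph the sweeps settle on the two triangles as communities. The aggregation step that follows in the Louvain procedure is not shown here.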
Further, the modularity Q is expressed as:
Q=(1/(2m)) Σ_{i,j} [A_ij − k_i k_j/(2m)] δ(c_i,c_j);
wherein k_i is the degree of node i; c_i and c_j denote the communities in which node i and node j are located; δ(c_i,c_j) indicates whether node i and node j belong to the same community, taking the value 1 if they do and 0 otherwise; m is the total number of edges in the network; and A_ij is an element of the adjacency matrix of the network, expressed as:
A_ij=1 if node i and node j are connected by an edge, and A_ij=0 otherwise.
Further, the link closeness between communities is expressed as:
wherein CCI(c_i,c_j) denotes the link index between community c_i and community c_j; Λ(c_i) represents the nodes of community c_i; and (c_i) represents the nodes that are not in c_i but are linked to c_i.
Further, the link probability of a node pair is obtained by fusing the link closeness between communities with the local similarity, and is expressed as:
CCI(i,j)=S(i,j)+CCI(c_i,c_j);
wherein CCI(i,j) is the link probability of the node pair (i,j) obtained by fusing the link closeness between communities with the local similarity; S(i,j) is the local similarity of node i and node j; and CCI(c_i,c_j) is the link index between community c_i, in which node i is located, and community c_j, in which node j is located.
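The fusion and ranking step can be sketched as follows. The community link index is treated as a precomputed input, and the example assumes a value of 0 when no index is supplied for a pair of communities (including two nodes in the same community), which the text does not specify. The function name, the similarity lookup and all numbers are hypothetical.

```python
def rank_candidate_links(nodes, edges, community, local_sim, community_cci):
    """Score every unconnected pair with S(i, j) + CCI(c_i, c_j), highest first."""
    existing = {frozenset(edge) for edge in edges}
    scored = []
    for idx, i in enumerate(nodes):
        for j in nodes[idx + 1:]:
            if frozenset((i, j)) in existing:
                continue  # only unconnected pairs are candidates
            key = frozenset((community[i], community[j]))
            # Assumption: a missing community link index counts as 0.
            score = local_sim(i, j) + community_cci.get(key, 0.0)
            scored.append(((i, j), score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# Hypothetical toy inputs.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("C", "D")]
community = {"A": 0, "B": 0, "C": 1, "D": 1}
sims = {("A", "C"): 1.0, ("B", "D"): 2.0}   # stand-in for CN, AA or RA values
community_cci = {frozenset((0, 1)): 0.5}    # stand-in for CCI(c_i, c_j)

ranked = rank_candidate_links(
    nodes, edges, community,
    local_sim=lambda i, j: sims.get((i, j), 0.0),
    community_cci=community_cci,
)
```

The pairs at the head of `ranked` are the candidate edges with the highest fused score.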
The invention also provides a system for predicting the network relationships of bidding companies based on community structure, comprising a data acquisition and cleaning module, a network modeling module, a relationship prediction module and a risk assessment module, wherein:
the data acquisition and cleaning module is used for collecting information on bidding companies and cooperation relationship data among companies in the bidding field to which they belong, extracting simple keywords from the collected data on the server side, and deleting redundant information;
the network modeling module is used for reading the data from the data acquisition and cleaning module and applying a relationship prediction algorithm to obtain a similarity matrix between nodes;
the relationship prediction module is used for sorting the similarity matrix obtained by the network modeling module and returning the top-ranked results to the client interface as required;
and the risk assessment module is used for fusing all the output information of the nodes and judging the risk level of a company according to the node similarity results.
Compared with traditional statistics-based analysis of bidding data, the system on the one hand saves the personnel involved a great deal of time. On the other hand, such personnel traditionally analyze events that have already occurred rather than analyzing in advance to prevent them. With modern big data technology, the invention can analyze multi-party collusion in the bidding field, where a large amount of collusive bidding takes place among several different small groups. By calculating the link closeness between the discovered communities, the invention estimates the probability that collusive bidding will occur, helps prevent serious collusion incidents, and to some extent provides a relatively fair guarantee for ordinary bidding projects.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the invention, without inventive effort, fall within the scope of protection of the invention.
With reference to Figs. 1 and 2, the invention provides a method for predicting the network relationships of bidding companies based on community structure, comprising the following steps:
collecting information on the companies participating in bidding, together with cooperation relationship data among companies in the bidding field to which they belong, and preprocessing the collected data;
taking each company as a node and connecting two companies with an edge whenever they bid on the same project, thereby constructing a company relationship network;
extracting local topological features of each company node from the relationship between the node and its neighbor nodes, and using these features as the local similarity of the nodes;
detecting the community structure of the company nodes with a community discovery algorithm to obtain a community partition;
calculating the link closeness between different communities from the structural information of the partitioned communities;
fusing the local similarity of two nodes with the link closeness between the communities in which the two nodes are located to obtain the similarity of each node pair, using the similarity as the probability that an edge will form between the nodes, then sorting all the probabilities and outputting the candidate edges with the highest similarity between nodes of the bidding company data set.
The invention can take as input the information of any company together with the cooperation relationship data among companies in the bidding field to which that company belongs. Historical data on companies named in published judgment documents is selected here only to better verify the effectiveness of the invention.
In this embodiment, bidding transaction data from the construction field of a certain city for 2007-2017 (referred to as the Company data set) is selected as the research object. The data set contains 19,156 records in total, involving 2,851 companies and 9,090 company transaction relationships. Each record includes 15 indicators, such as the tenderer (bidding agency), the bidder, the bid evaluation method, the bid-winning status, the bidding time, the project name, the regional attribute and the enterprise qualification attribute.
An enterprise has limited business capacity, so the number of bids it takes part in within a given period also falls within a certain range (special cases excepted). The degree of each vertex in the complex network (each vertex represents an enterprise) quantifies how frequently that enterprise participates in bidding. The degree distribution is shown in Fig. 3, where the abscissa is the vertex degree (value) and the ordinate is the number of enterprises (count). The average degree is 6.378; that is, under normal conditions the quantified bidding frequency of an enterprise is about 6.378. Most vertices have a degree within 10, but a few exceed 10, and a small number deviate greatly from the average, meaning that the enterprises they represent bid unusually often. The enterprises represented by such vertices may therefore be professional accompanying bidders.
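The degree statistics described above can be reproduced with a short sketch. It assumes the network is available as a list of undirected edges; the function name, the toy edges and the threshold used here are illustrative, and the threshold of 10 mentioned in the text would be substituted for real data.

```python
from collections import Counter

def degree_distribution(edges):
    """Return (degree per node, count of nodes per degree, average degree)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    distribution = Counter(degree.values())  # degree value -> number of nodes
    average = sum(degree.values()) / len(degree)
    return degree, distribution, average

# Hypothetical toy edge list (company IDs).
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
degree, distribution, average = degree_distribution(edges)

# Flag nodes whose degree exceeds a chosen threshold as unusually frequent bidders.
threshold = 2
frequent = sorted(node for node, k in degree.items() if k > threshold)
```

Because every edge contributes to the degree of two nodes, the average degree equals twice the number of edges divided by the number of nodes.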
Thus, in the company relationship network of this embodiment, each company node is represented by a unique, non-repeating ID, and the local similarity calculation is then performed on the nodes; when a node is visualized, its shape and size represent its degree: the more companies that have bid on projects together with a company, that is, the more edges connected to its node, the greater the degree of the node. In addition, the local similarity between nodes can be used to screen out companies engaged in accompanying bidding. The invention provides three ways of calculating the local similarity; the local similarity may of course be any one of the three, or a linear or nonlinear fusion of several of them. The three calculations are as follows:
The Common Neighbor algorithm (CN for short) is used to capture the topological relationship between company nodes, and the local similarity of an unconnected node pair is expressed as:
CN(x,y)=|Γ(x)∩Γ(y)|;
wherein Γ(x) and Γ(y) denote the sets of neighbor nodes connected to nodes x and y, respectively.
The Adamic-Adar algorithm (AA for short) is used to capture the topological relationship between company nodes, and the local similarity of an unconnected node pair is expressed as:
AA(x,y)=Σ_{z∈Γ(x)∩Γ(y)} 1/log k(z);
wherein Γ(x) and Γ(y) denote the sets of neighbor nodes connected to nodes x and y, respectively, and k(z) denotes the degree of node z.
The Resource Allocation algorithm (RA for short) is used to capture the topological relationship between company nodes, and the local similarity of an unconnected node pair is expressed as:
RA(x,y)=Σ_{z∈Γ(x)∩Γ(y)} 1/k(z);
wherein Γ(x) and Γ(y) denote the sets of neighbor nodes connected to nodes x and y, respectively, and k(z) denotes the degree of node z.
In implementation, the local similarity S(i,j) of node i and node j is any one of the topological relationships captured by the three methods above, or a linear or nonlinear fusion of several of them.
The purpose of community detection is to find alliances of companies, that is, community structure information, in the constructed enterprise relationship network. This embodiment uses the Louvain algorithm to obtain the corresponding community structure, comprising the following steps:
traversing all nodes in the network and, for each node, calculating the modularity gain of moving the node into the community of each of its neighbor nodes, then assigning the node to the community with the largest positive gain;
reconstructing the network by merging all nodes of the same community into a single node of a new network;
updating the internal weight of each node in the new network to the sum of the weights within the node set merged into it, that is, the sum of the weights of the edges formed by pairs of nodes merged into that node, taken over the edge set E_{t-1} of the network at granularity t-1;
updating the weight of each edge between nodes in the new network to the sum of the weights on the edges connecting the two corresponding communities;
repeating the above steps until the modularity no longer changes, and outputting the networks and community partitions constructed at the different granularities.
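The network reconstruction step can be sketched as follows. The example assumes edge weights are stored in a dictionary keyed by `frozenset` node pairs, with a single-element key standing for the internal weight (self-loop) of an already merged node; this storage layout and the function name `aggregate` are illustrative choices, and implementations differ on how internal weights are counted.

```python
from collections import defaultdict

def aggregate(edge_weights, community):
    """Collapse each community into one node of the coarser network.

    edge_weights maps frozenset({u, v}) -> weight; frozenset({u}) is a self-loop.
    Returns (internal, between): internal[c] is the summed weight inside
    community c, and between[frozenset({c, d})] is the summed weight of the
    edges running between communities c and d.
    """
    internal = defaultdict(float)
    between = defaultdict(float)
    for pair, weight in edge_weights.items():
        comms = frozenset(community[node] for node in pair)
        if len(comms) == 1:
            (c,) = comms
            internal[c] += weight      # edge lies inside one community
        else:
            between[comms] += weight   # edge links two communities
    return dict(internal), dict(between)

# Hypothetical toy graph: two unit-weight triangles joined by the edge 2-3.
edge_list = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
edge_weights = {frozenset(e): 1.0 for e in edge_list}
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

internal, between = aggregate(edge_weights, community)
```

Feeding the result back in as a new weighted network (internal weights as self-loops) gives the next, coarser granularity.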
Further, the modularity Q is expressed as:
Q=(1/(2m)) Σ_{i,j} [A_ij − k_i k_j/(2m)] δ(c_i,c_j);
wherein k_i is the degree of node i; c_i and c_j denote the communities in which node i and node j are located; δ(c_i,c_j) indicates whether node i and node j belong to the same community, taking the value 1 if they do and 0 otherwise; m is the total number of edges in the network; and A_ij is an element of the adjacency matrix of the network, expressed as:
A_ij=1 if node i and node j are connected by an edge, and A_ij=0 otherwise.
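As a worked check of the modularity formula, consider a hypothetical toy network (not taken from the Company data set) of two triangles joined by a single edge, partitioned into the two triangles. Grouping the double sum by community gives the equivalent form below, where L_c is the number of edges inside community c and d_c is the sum of the degrees of its nodes; both symbols are introduced here only for illustration.

```latex
m = 7,\qquad L_c = 3,\qquad d_c = 2 + 2 + 3 = 7 \quad\text{for each triangle},
\\[4pt]
Q = \sum_{c}\left[\frac{L_c}{m} - \left(\frac{d_c}{2m}\right)^{2}\right]
  = 2\left[\frac{3}{7} - \left(\frac{7}{14}\right)^{2}\right]
  = \frac{5}{14} \approx 0.357
```

Placing all six nodes in a single community instead gives Q = 7/7 − (14/14)² = 0, so the two-triangle partition is preferred.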
Fig. 4 shows the size distribution of the communities in the community-structure-based bidding company network, where the abscissa is the modularity class (community) and the ordinate is the size, that is, the number of nodes in the community. As shown in Fig. 4, the Louvain algorithm gives a network modularity of 0.587 and finds 76 communities in total. Many communities have about 50 nodes, which suggests that clusters form easily when fewer than 50 bidding companies participate in the same project; by comparison, very few communities have more than 200 nodes, indicating that the probability of forming a cluster of that size is low.
To address the multi-party collusion that exists in practice, the link closeness between different communities can be obtained by calculating the community link index, and the risk of related events can be judged from the magnitude of that index. In this embodiment, calculating the community link index between communities shows that communities with fewer than 50 nodes have higher link closeness, that is, they are more closely connected to one another. The link closeness between communities is expressed as:
wherein CCI(c_i,c_j) denotes the link index between community c_i and community c_j; Λ(c_i) represents the nodes of community c_i; and (c_i) represents the nodes that are not in c_i but are linked to c_i.
The link closeness between communities is fused with the local similarity to obtain the link probability of each node pair, and thus the nodes most likely to form an edge. The link probability of a node pair is:
CCI(i,j)=S(i,j)+CCI(c_i,c_j);
wherein CCI(i,j) is the link probability of the node pair (i,j) obtained by fusing the link closeness between communities with the local similarity; S(i,j) is the local similarity of node i and node j; and CCI(c_i,c_j) is the link index between community c_i, in which node i is located, and community c_j, in which node j is located. The larger the link probability of a node pair, the higher the probability of accompanying bidding between the two companies, that is, between node i and node j. When predicting the network relationships of accompanying-bidding companies, the two companies of a node pair with a larger link probability are taken as a pair likely to engage in accompanying bidding.
To verify the validity of the proposed method, the invention uses the Company data set, which is an undirected graph. 10% of the edges in the data set were randomly removed to serve as the test set, and the resulting prediction accuracy (AUC) is shown in Table 1 below:
Table 1. AUC on the Company data set
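The text does not spell out how the AUC is computed. A common definition in link prediction compares the score of each removed (test) edge with the score of each nonexistent edge, counting a win as 1 and a tie as 0.5. The sketch below follows that assumption and compares all pairs exhaustively; the function name and the scores are hypothetical.

```python
def link_prediction_auc(test_scores, nonexistent_scores):
    """AUC = (n_higher + 0.5 * n_equal) / n over all (test, nonexistent) pairs."""
    higher = equal = 0
    for t in test_scores:
        for s in nonexistent_scores:
            if t > s:
                higher += 1   # the removed edge outranks the nonexistent edge
            elif t == s:
                equal += 1    # ties count for half
    total = len(test_scores) * len(nonexistent_scores)
    return (higher + 0.5 * equal) / total

# Hypothetical similarity scores for removed edges and for nonexistent edges.
auc = link_prediction_auc([0.9, 0.4], [0.1, 0.4, 0.7])
```

A value of 0.5 corresponds to random scoring, and values closer to 1 mean the removed edges are consistently scored above nonexistent ones. On large networks the comparison is usually sampled rather than exhaustive.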
It can be seen that the proposed method (shown in bold in Table 1) improves the prediction accuracy by 10% compared with the other methods. In addition, an experiment was performed on the Ranking Score (RS), with 909 edges randomly selected from the data set as the test set. RS measures the position of each selected edge in the final ranking: the higher an edge is ranked, the better the method predicts, so a smaller RS is better. The RS results are shown in Table 2:
Table 2. Ranking Score (RS) on the Company data set
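The Ranking Score can be sketched as follows, assuming the usual definition in which the rank of each test edge among all candidate (unobserved) edges is divided by the number of candidates and then averaged over the test edges. The function name and the scores are hypothetical; ties keep their original order here.

```python
def ranking_score(scores, test_edges):
    """Mean of rank / number_of_candidates over the test edges (rank 1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    position = {edge: rank for rank, edge in enumerate(ordered, start=1)}
    total_rank = sum(position[edge] for edge in test_edges)
    return total_rank / (len(test_edges) * len(ordered))

# Hypothetical scores for all candidate edges; two of them are test edges.
scores = {
    ("A", "C"): 0.9,
    ("A", "D"): 0.2,
    ("B", "C"): 0.5,
    ("B", "D"): 0.7,
}
rs = ranking_score(scores, [("A", "C"), ("B", "C")])
```

Here the test edges are ranked 1st and 3rd of 4 candidates, so the score is (1 + 3) / (2 × 4); a method that ranks the test edges nearer the top yields a smaller value.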
It can be seen that the node pairs predicted by the proposed method (shown in bold in Table 2) also rank near the top, with a lower score. The proposed method improves substantially on both accuracy and similarity ranking, largely because the nodes have community attributes and some nodes in the network are active across communities. The method can therefore be used, first, to determine whether a community structure exists in the company bidding behavior network and to find the link closeness between communities; second, to predict the probability that any two company nodes will next cooperate and to identify potential partners. The experimental results reveal the hidden patterns behind the transaction data and characterize the transaction trajectories and behavioral features of collusive bidding and bid encirclement.
In practical application, whether a company engages in collusive bidding can be predicted from the information of that company and the cooperation relationship data among companies in the bidding field to which it belongs.
The invention also provides a system for predicting the network relationships of bidding companies based on community structure, comprising a data acquisition and cleaning module, a network modeling module, a relationship prediction module and a risk assessment module, wherein:
the data acquisition and cleaning module is used for collecting information on bidding companies and cooperation relationship data among companies in the bidding field to which they belong, preprocessing the collected data, extracting simple keywords from the collected data on the server side, and deleting redundant information;
the network modeling module is used for reading the data from the data acquisition and cleaning module and applying a relationship prediction algorithm to obtain a similarity matrix between nodes;
the relationship prediction module is used for sorting the similarity matrix obtained by the network modeling module and returning the top-ranked results to the client interface as required;
and the risk assessment module is used for fusing all the output information of the nodes and judging the risk level of a company according to the node similarity results.
In the prediction method and the prediction system, the network modeling module and the relationship prediction module are updated using historical data, namely the information of the companies named in published judgment documents and the cooperation relationship data among companies in the bidding field to which they belong; the published judgment documents are used only to verify the predictions produced by the relationship prediction module.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.