CN110335165B

CN110335165B - Link prediction method and device

Info

Publication number: CN110335165B
Application number: CN201910576954.XA
Authority: CN
Inventors: 高俊杰
Original assignee: JD Digital Technology Holdings Co Ltd
Current assignee: JD Digital Technology Holdings Co Ltd; Jingdong Technology Holding Co Ltd
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2021-03-30
Anticipated expiration: 2039-06-28
Also published as: CN110335165A

Abstract

The invention discloses a link prediction method and a link prediction device, and relates to the technical field of computers. One embodiment of the method comprises: generating a first training set according to existing relationship network data, wherein the first training set comprises class labels of all edges of an existing relationship network, the edges represent the relationship between two users in the relationship network, and the class labels of the edges indicate whether the relationship between the two users exists or not; performing parameter optimization on the selected model, and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; training a link prediction model using the second training set; and performing link prediction on the input relational network data by using the trained link prediction model. The method and the device can improve the accuracy of link prediction and improve the effect of relation network data mining.

Description

Link prediction method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a link prediction method and apparatus.

Background

Link prediction, a branch of network data mining, uses relationships that users have already expressed, such as social and financial transaction relationships, to mine or predict relationships that users have not yet expressed (such relationships are also called edges). Link prediction technology is widely applied in fields such as social friend recommendation and abnormal transaction monitoring.

In the prior art, edges are mostly predicted from the observed adjacency matrix: an entry of 0 in the adjacency matrix is taken to mean that the edge is absent, and an entry of 1 is taken to mean that the edge is present. In a real network, however, not all edges can be observed completely and correctly. For example, an edge that is not currently observed may actually exist, or an edge that is observed may be false. These conditions greatly reduce prediction accuracy.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

the link prediction accuracy of the existing link prediction scheme is low.

Disclosure of Invention

In view of this, embodiments of the present invention provide a link prediction method and apparatus, which can improve accuracy of link prediction and improve the effect of relational network data mining.

To achieve the above object, according to an aspect of an embodiment of the present invention, a link prediction method is provided.

A link prediction method, comprising: generating a first training set according to existing relationship network data, wherein the first training set comprises class labels of all edges of an existing relationship network, the edges represent the relationship between two users in the relationship network, and the class labels of the edges indicate whether the relationship between the two users exists or not; performing parameter optimization on the selected model, and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; training a link prediction model using the second training set; and performing link prediction on the input relational network data by using the trained link prediction model.

Optionally, the step of performing parameter optimization on the selected model includes: initializing a plurality of sets of parameters of the selected model; under each group of parameters of the selected model, correcting the class labels of all edges in the first training set by using the selected model to obtain a third training set, performing first training on the link prediction model by using the third training set, performing link prediction on a pre-generated test set by using the link prediction model after the first training, and calculating the prediction accuracy of the link prediction model under the group of parameters according to a prediction result; establishing a functional relation between the selected model parameters and the prediction accuracy of the link prediction model according to the multiple groups of parameters and the prediction accuracy of the link prediction model under each group of parameters; and optimizing the selected model parameters in the functional relation by using a genetic algorithm to obtain the parameters of the selected model after the parameters are optimized.

Optionally, the selected model is used to generate a predicted value for each edge, and correcting the class labels of the edges in the first training set includes: replacing the class label of each edge in the first training set with the predicted value of that edge.

Optionally, the step of generating the first training set according to the existing relationship network data includes: calculating feature sets of all edges of the existing relationship network according to preset indexes and existing relationship network data; obtaining a data set according to the feature set of each edge and the category label of each edge; dividing the data set into two parts according to a preset proportion, and taking one part of the two parts as the first training set; the test set is generated by: taking the part of the data set except the first training set as a first test set; and sampling the first test set for multiple times, stopping sampling until the total number of the samples obtained by accumulative sampling is the same as the number of the samples in the first test set, and taking the set of all the samples obtained by sampling as the test set.

Optionally, the step of optimizing the selected model parameters in the functional relation by using a genetic algorithm to obtain the parameters of the selected model after parameter optimization includes: initializing a population, wherein each chromosome in the population is obtained by concatenating end to end the binary codes of a set of parameter values of the selected model; for the initialized population, calculating the prediction accuracy corresponding to each chromosome in the population and judging whether the highest prediction accuracy in the population meets a preset requirement; if so, decoding the binary codes of the chromosome corresponding to the highest prediction accuracy into a set of decimal parameters, which are used as the parameters of the selected model after parameter optimization; if not, performing chromosome selection, crossover and mutation operations on the population to obtain a new population, and repeating this process until the highest prediction accuracy corresponding to the chromosomes in the latest population meets the preset requirement, and then decoding the binary codes of the chromosome corresponding to the highest prediction accuracy into a set of decimal parameters, which are used as the parameters of the selected model after parameter optimization; wherein the prediction accuracy corresponding to a chromosome is the prediction accuracy of the link prediction model under the set of selected model parameters obtained by decoding the binary codes forming the chromosome, and is calculated according to the functional relation.

Optionally, the selected model is a logistic regression model or a feed forward neural network model.

According to another aspect of the embodiments of the present invention, a link prediction apparatus is provided.

A link prediction apparatus comprising: the first training set generating module is used for generating a first training set according to existing relationship network data, wherein the first training set comprises class labels of all edges of the existing relationship network, the edges represent the relationship between two users in the relationship network, and the class labels of the edges indicate whether the relationship between the two users exists or not; the second training set generation module is used for performing parameter optimization on the selected model and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; a link prediction model training module for training a link prediction model using the second training set; and the link prediction module is used for performing link prediction on the input relational network data by using the trained link prediction model.

Optionally, the second training set generating module includes a parameter optimization submodule configured to: initializing a plurality of sets of parameters of the selected model; under each group of parameters of the selected model, correcting the class labels of all edges in the first training set by using the selected model to obtain a third training set, performing first training on the link prediction model by using the third training set, performing link prediction on a pre-generated test set by using the link prediction model after the first training, and calculating the prediction accuracy of the link prediction model under the group of parameters according to a prediction result; establishing a functional relation between the selected model parameters and the prediction accuracy of the link prediction model according to the multiple groups of parameters and the prediction accuracy of the link prediction model under each group of parameters; and optimizing the selected model parameters in the functional relation by using a genetic algorithm to obtain the parameters of the selected model after the parameters are optimized.

Optionally, the first training set generating module is further configured to: calculating feature sets of all edges of the existing relationship network according to preset indexes and existing relationship network data; obtaining a data set according to the feature set of each edge and the category label of each edge; dividing the data set into two parts according to a preset proportion, and taking one part of the two parts as the first training set; the device also comprises a test set generation module for generating the test set by the following modes: taking the part of the data set except the first training set as a first test set; and sampling the first test set for multiple times, stopping sampling until the total number of the samples obtained by accumulative sampling is the same as the number of the samples in the first test set, and taking the set of all the samples obtained by sampling as the test set.

Optionally, the parameter optimization submodule includes a parameter optimization execution unit, configured to: initialize a population, wherein each chromosome in the population is obtained by concatenating end to end the binary codes of a set of parameter values of the selected model; for the initialized population, calculate the prediction accuracy corresponding to each chromosome in the population and judge whether the highest prediction accuracy in the population meets a preset requirement; if so, decode the binary codes of the chromosome corresponding to the highest prediction accuracy into a set of decimal parameters, which are used as the parameters of the selected model after parameter optimization; if not, perform chromosome selection, crossover and mutation operations on the population to obtain a new population, and repeat this process until the highest prediction accuracy corresponding to the chromosomes in the latest population meets the preset requirement, and then decode the binary codes of the chromosome corresponding to the highest prediction accuracy into a set of decimal parameters, which are used as the parameters of the selected model after parameter optimization; wherein the prediction accuracy corresponding to a chromosome is the prediction accuracy of the link prediction model under the set of selected model parameters obtained by decoding the binary codes forming the chromosome, and is calculated according to the functional relation.

According to yet another aspect of an embodiment of the present invention, an electronic device is provided.

An electronic device, comprising: one or more processors; a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the link prediction method provided by the present invention.

According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.

A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the link prediction method provided by the invention.

One embodiment of the above invention has the following advantages or benefits: generating a first training set according to the existing relationship network data, wherein the first training set comprises class labels of all edges of the existing relationship network; performing parameter optimization on the selected model, and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; training the link prediction model using a second training set; and performing link prediction on the input relational network data by using the trained link prediction model. The accuracy of link prediction can be improved, and the effect of relation network data mining can be improved.

Further effects of the above non-conventional optional implementations will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

Fig. 1 is a schematic diagram of the main steps of a link prediction method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a link prediction flow according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the main modules of a link prediction apparatus according to an embodiment of the present invention;

Fig. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

Fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

Link prediction refers to using relationships that users have already expressed, such as social and financial transaction relationships, to mine or predict relationships that users have not yet expressed; such a relationship is also called an edge. For example, in social relationship mining, if user A and user B are friends and user B and user C are friends, the purpose of relationship mining is to study whether user A and user C are likely to become friends.

Fig. 1 is a schematic diagram of the main steps of a link prediction method according to an embodiment of the present invention.

As shown in fig. 1, the link prediction method according to the embodiment of the present invention mainly includes the following steps S101 to S104.

Step S101: a first training set is generated from existing relational network data.

The existing relational network data may be a network data graph or an adjacency matrix. A network graph of the relationships between users is obtained; it is composed of nodes and edges, where a node represents a user and an edge represents the relationship between two users in the relational network. If a relationship between two users is observed, the corresponding edge exists and is marked as 1; if no edge is observed, no relationship is considered to exist between the two users and the entry is marked as 0. In this way the relationships between users are converted into a network data graph or an adjacency matrix.

For example, the adjacency matrix takes the following form (here the number of users is 3; 1 indicates that an edge exists and 0 indicates that it does not):

0 1 0
1 0 1
0 1 0

Feature sets of all edges of the existing relational network (including observed and unobserved edges) are calculated from the existing relational network data according to preset indexes. The indexes are calculated for each edge and organized into a data format suitable for training the link prediction model. The data consists of rows and columns. Each row represents one edge; for example, if there are 3 users in the network, $C_3^2 = 3$ edges are generated. Each column represents one feature of the edge, each feature being one of the index values calculated above. The set of index values of an edge forms its feature set, so the feature set of every edge of the existing relationship network is obtained.

Each edge has a respective category label, the category label of the edge indicates whether a relationship between two users exists, a category label of 1 indicates that a relationship exists between two users, and a category label of 0 indicates that a relationship does not exist between two users.

A data set is obtained from the feature set and the class label of each edge of the existing relationship network: the last column of the data set is the class label of each edge, and the preceding columns are the features of each edge.

The data set is divided into two parts according to a preset proportion, wherein one part is used as a first training set, and the other part is used as a first testing set. The first training set includes feature sets and category labels of edges of the existing relationship network, each row represents an edge, each column (except the last column) corresponding to each edge is a feature of the edge, and the last column is a category label of the edge.
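The data set construction and split described above can be sketched as follows. This is a minimal illustration only: it uses just two of the possible indexes (common neighbors and the degree product) as features, and the function names `build_dataset` and `split_dataset`, the 0.7 split ratio and the random seed are assumptions made for the example, not values given in the embodiment.

```python
import random
from itertools import combinations


def build_dataset(adj):
    """Return one row per node pair: [common_neighbors, degree_product, label]."""
    n = len(adj)
    neighbors = [{j for j in range(n) if adj[i][j] == 1} for i in range(n)]
    rows = []
    for x, y in combinations(range(n), 2):  # C(n, 2) candidate edges
        cn = len(neighbors[x] & neighbors[y])       # common neighbors
        pa = len(neighbors[x]) * len(neighbors[y])  # product of node degrees
        rows.append([cn, pa, adj[x][y]])            # last column = class label
    return rows


def split_dataset(rows, train_ratio=0.7, seed=0):
    """Shuffle the rows and split them by an assumed preset proportion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# The 3-user adjacency matrix from the example above.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
dataset = build_dataset(adj)
first_training_set, first_test_set = split_dataset(dataset)
```

For the 3-user matrix this yields three rows, one for each of the pairs (0, 1), (0, 2) and (1, 2).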

The preset indexes may include Common Neighbors (CN), Average Commute Time (ACT), Random Walk with Restart (RWR), Preferential Attachment (PA) similarity, the Local Path (LP) index, the Katz index, and the like. The number of neighbors of a node in the relational network is the number of other nodes connected to that node. Two nodes are more likely to form an edge if they have more common neighbors; for example, if A has an edge with B and B has an edge with C, the common neighbor of A and C is B. For Average Commute Time, the average first arrival time $m(x, y)$ is defined as the average number of steps a random-walk particle needs to travel from node $v_x$ to node $v_y$, and the average commute time between nodes $v_x$ and $v_y$ is $m(x, y) + m(y, x)$. The Local Path (LP) index builds on the common neighbor index and additionally considers the contribution of third-order neighbors (i.e., paths connecting two nodes via three edges). The Katz index considers all possible paths connecting two nodes. Preferential Attachment (PA) similarity means that, in a relational network, the probability that a new edge connects to node x is proportional to the degree k(x) of node x (the degree being the number of neighbors of the node), so the probability that a new edge connects nodes x and y is proportional to the product of the two node degrees. Random Walk with Restart (RWR) assumes that the random-walk particle returns to its initial position with a certain probability at each step, which gives the probability that a particle starting from a given node reaches each node at step t.
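As a rough illustration of the path-based indexes above, the sketch below computes the Local Path score as A^2 + epsilon * A^3 and a Katz score truncated at a maximum path length, both from a plain adjacency matrix. The function names, the weights `epsilon` and `beta`, and the truncation length `max_len` are assumptions made for this example; the embodiment does not specify them.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def local_path(adj, epsilon=0.01):
    """LP score matrix: A^2 + epsilon * A^3 (epsilon is an assumed weight)."""
    a2 = matmul(adj, adj)   # a2[x][y] = number of common neighbors of x and y
    a3 = matmul(a2, adj)    # a3[x][y] = number of three-edge paths from x to y
    n = len(adj)
    return [[a2[i][j] + epsilon * a3[i][j] for j in range(n)] for i in range(n)]


def katz_truncated(adj, beta=0.1, max_len=10):
    """Katz score truncated at max_len: sum over l of beta^l * A^l."""
    n = len(adj)
    power = [row[:] for row in adj]          # A^1
    score = [[0.0] * n for _ in range(n)]
    for length in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                score[i][j] += (beta ** length) * power[i][j]
        power = matmul(power, adj)           # A^(length + 1)
    return score


adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
lp = local_path(adj)
katz = katz_truncated(adj)
```

In the 3-user example, users 0 and 2 share one common neighbor and have no three-edge path between them, so their LP score is 1.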

Step S102: parameter optimization is performed on the selected model, and the class labels of the edges in the first training set are corrected by using the selected model after parameter optimization, to obtain a second training set.

The selected model is used to generate a predicted value for each edge in the first training set, i.e., one predicted value (0 or 1) is generated for each row of features in the first training set; the selected model itself is not trained.

The selected model may specifically be a logistic regression model or a feedforward neural network model.

After the parameters of the selected model are optimized, the optimized parameters are substituted into the selected model, each edge in the first training set is predicted again by the selected model after parameter optimization to generate a predicted value of the edge, and the class label of each edge in the first training set is replaced with its predicted value. The resulting training set is the second training set.

The step of performing parameter optimization on the selected model may specifically include: initializing a plurality of groups of parameters of the selected model; under each group of parameters of the selected model, modifying the class labels of all edges in the first training set by using the selected model to obtain a third training set, performing first training on the link prediction model by using the third training set, performing link prediction on a pre-generated test set by using the link prediction model after the first training, and calculating the prediction accuracy of the link prediction model under the group of parameters according to the prediction result; establishing a functional relation between the selected model parameters and the prediction accuracy of the link prediction model according to the multiple groups of parameters and the prediction accuracy of the link prediction model under each group of parameters; and optimizing the selected model parameters in the functional relation by using a genetic algorithm to obtain the parameters of the selected model after parameter optimization.

Modifying the class label of each edge in the first training set specifically comprises replacing the class label of each edge in the first training set with the predicted value of that edge. Given a set of parameters, the selected model generates new class labels for the edges in the first training set, which make up a new training set (i.e., the third training set).
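A minimal sketch of this label correction, assuming the selected model is a logistic regression whose weights and bias are supplied directly rather than learned. The 0.5 decision threshold, the function names and the example parameter values are assumptions for illustration.

```python
import math


def logistic_predict(features, weights, bias):
    """Return 0 or 1 from a logistic model with given (untrained) parameters."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    prob = 1.0 / (1.0 + math.exp(-z))
    return 1 if prob >= 0.5 else 0  # assumed decision threshold


def correct_labels(training_set, weights, bias):
    """Replace the last column (class label) of every row with the prediction."""
    corrected = []
    for row in training_set:
        features = row[:-1]
        corrected.append(features + [logistic_predict(features, weights, bias)])
    return corrected


# Rows are [feature_1, feature_2, class_label]; values are illustrative only.
first_training_set = [[0, 2, 1], [1, 1, 0], [0, 2, 1]]
third_training_set = correct_labels(first_training_set, weights=[1.0, -0.2], bias=0.0)
```

The features are left untouched; only the label column changes, so the same routine can produce the third training set (under a candidate parameter set) or the second training set (under the optimized parameters).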

The link prediction model may be a machine learning model, such as a support vector machine, GBDT (gradient boosting decision tree), or random forest, or it may be a deep learning model.

The pre-generated test set on which the link prediction model after the first training performs link prediction is generated by sampling the first test set. Specifically, the first test set is sampled multiple times; sampling stops when the cumulative number of samples drawn equals the number of samples in the first test set, and the set of all samples drawn is used as the test set.
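One possible reading of this sampling procedure is sketched below. The text does not say how many samples are drawn per round or whether sampling is with replacement; this sketch assumes small batches drawn with replacement, and the name `resample_test_set`, the `batch_size` and the seed are hypothetical.

```python
import random


def resample_test_set(first_test_set, batch_size=2, seed=0):
    """Sample repeatedly until as many rows are drawn as the first test set holds."""
    rng = random.Random(seed)
    target = len(first_test_set)
    sampled = []
    while len(sampled) < target:
        take = min(batch_size, target - len(sampled))
        # Assumption: each round draws with replacement from the first test set.
        sampled.extend(rng.choices(first_test_set, k=take))
    return sampled


first_test_set = [[0, 2, 1], [1, 1, 0], [0, 3, 1], [2, 1, 0], [1, 2, 1]]
test_set = resample_test_set(first_test_set)
```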

Calculating the prediction accuracy of the link prediction model under the set of parameters according to the prediction result specifically comprises comparing the prediction results with the class labels of the test set; the prediction accuracy under the set of parameters is the ratio of the number of prediction results that agree with the class labels to the total number of prediction results.
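The accuracy calculation above amounts to a simple ratio, sketched here; the function name and the handling of empty input are assumptions.

```python
def prediction_accuracy(predicted, actual):
    """Fraction of predictions that agree with the test-set class labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0  # assumed convention for an empty test set
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(predicted)


accuracy = prediction_accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 agree
```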

As can be seen from the above, given a set of parameter values $(x_1, x_2, \ldots)$ of the selected model $f(x_1, x_2, \ldots)$, where $f$ denotes the selected model, the prediction accuracy (denoted $y$) under that set of parameter values is obtained. From multiple sets of parameter values and their corresponding prediction accuracies, the functional relation between the selected model parameters and the prediction accuracy of the link prediction model is obtained: $y = g(f(x_1, x_2, \ldots))$, where $g$ is the accuracy function. The relation consists of two parts, the selected model and the measurement of prediction accuracy. It is not differentiable, so the parameter values cannot be obtained with conventional optimization methods. The embodiment therefore wraps this step as a function whose input is the selected model parameters and whose output is the prediction accuracy.
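The wrapped function can be sketched as a black box that takes the selected model parameters and returns a prediction accuracy. In this sketch the link prediction model is replaced by a deliberately simple nearest-centroid classifier so that the example stays self-contained; that stand-in, the logistic form of the selected model, and all names and data values are assumptions, not the models of the embodiment.

```python
import math


def selected_model(features, params):
    """Logistic-style selected model; params = weights followed by a bias."""
    z = sum(w * x for w, x in zip(params[:-1], features)) + params[-1]
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0


def train_centroid_model(rows):
    """Stand-in link prediction model: one feature centroid per class."""
    centroids = {}
    for label in (0, 1):
        members = [r[:-1] for r in rows if r[-1] == label]
        if members:
            centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_predict(centroids, features):
    """Predict the class whose centroid is closest to the features."""
    return min(centroids,
               key=lambda c: sum((f - m) ** 2 for f, m in zip(features, centroids[c])))


def accuracy_of_params(params, first_training_set, test_set):
    """y = g(f(x1, x2, ...)): selected model parameters in, accuracy out."""
    # Correct the labels with the selected model to get a third training set.
    third = [r[:-1] + [selected_model(r[:-1], params)] for r in first_training_set]
    # First training of the (stand-in) link prediction model.
    model = train_centroid_model(third)
    # Predict on the non-empty test set and measure agreement with its labels.
    hits = sum(1 for r in test_set if centroid_predict(model, r[:-1]) == r[-1])
    return hits / len(test_set)


first_training_set = [[0, 2, 1], [1, 1, 0], [0, 3, 1], [2, 1, 0]]
test_set = [[0, 2, 1], [1, 1, 0]]
y_good = accuracy_of_params([-1.0, 1.0, -1.0], first_training_set, test_set)
y_bad = accuracy_of_params([1.0, -1.0, 1.0], first_training_set, test_set)
```

Because the output depends on thresholded predictions and a count of matches, small changes in the parameters usually leave the accuracy unchanged and occasionally make it jump, which is why a derivative-free search such as a genetic algorithm is applied to it.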

The step of optimizing the selected model parameters in the functional relation using a genetic algorithm may comprise: initializing a population, wherein each chromosome in the population is obtained by concatenating end to end the binary codes of a set of parameter values of the selected model; for the initialized population, calculating the prediction accuracy corresponding to each chromosome in the population and judging whether the highest prediction accuracy in the population meets a preset requirement; if so, decoding the binary codes of the chromosome corresponding to the highest prediction accuracy into a set of decimal parameters, which are used as the parameters of the selected model after parameter optimization; if not, performing chromosome selection, crossover and mutation operations on the population to obtain a new population, and repeating this process until the highest prediction accuracy corresponding to the chromosomes in the latest population meets the preset requirement, and then decoding the binary codes of the chromosome corresponding to the highest prediction accuracy into a set of decimal parameters, which are used as the parameters of the selected model after parameter optimization. The prediction accuracy corresponding to a chromosome is the prediction accuracy of the link prediction model under the set of selected model parameters obtained by decoding the binary codes forming the chromosome, and is calculated according to the functional relation.

Before the selected model parameters in the functional relation are optimized with the genetic algorithm, the precision, interval and coding mode are selected in advance. Specifically, the precision of the selected model parameters may be 3 decimal places, and the parameter interval may be [-1, 1], meaning that all parameters must lie in this interval. Binary coding is selected as the coding mode. It works by interval division: the bounded interval is divided into small segments. For example, if [0, 1] is divided into 10 segments with a precision of 0.1, the 10 segments can be represented by a 4-bit binary number, since 2^4 > 10. For example, 0.637197 may be encoded as 1000101110110101000111.
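A sketch of this interval-division coding is given below. The helper names `bits_needed`, `encode` and `decode` are hypothetical. Note that the 22-bit string quoted above is reproduced by this scheme only under an assumed interval of [-1, 2] with 22 bits; that interval is an assumption chosen for the sketch and differs from the [-1, 1] interval mentioned in the text.

```python
def bits_needed(lower, upper, decimals):
    """Smallest n with 2^n greater than the number of segments in [lower, upper]."""
    segments = round((upper - lower) * 10 ** decimals)
    return segments.bit_length()


def encode(value, lower, upper, n_bits):
    """Map a value in [lower, upper] to an n_bits-long binary string."""
    levels = (1 << n_bits) - 1
    index = round((value - lower) / (upper - lower) * levels)
    return format(index, "0{}b".format(n_bits))


def decode(bits, lower, upper):
    """Map a binary string back to a decimal value in [lower, upper]."""
    levels = (1 << len(bits)) - 1
    return lower + int(bits, 2) * (upper - lower) / levels


n_example = bits_needed(0.0, 1.0, 1)      # 10 segments  -> 4 bits
n_text = bits_needed(-1.0, 1.0, 3)        # 2000 segments -> 11 bits
# Assumed interval [-1, 2] and 22 bits, chosen to match the quoted string.
gene = encode(0.637197, lower=-1.0, upper=2.0, n_bits=22)
value = decode(gene, lower=-1.0, upper=2.0)
```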

The number of chromosomes in the initialized population may be set to 50, i.e., 50 sets of selected model parameter values are randomly initialized. Each set of parameter values is binary coded and the codes are concatenated end to end. Taking a randomly initialized set of parameters [0.637197, 0.637197] as an example, the values are first binary coded as [1000101110110101000111, 1000101110110101000111], and the codes are then concatenated end to end as:

10001011101101010001111000101110110101000111

The concatenated value is called a chromosome, and each binary code before concatenation (here 1000101110110101000111) is called a gene. In this way, the binary codes of each of the 50 sets of parameter values are concatenated end to end, giving 50 concatenated values; the set of these 50 values is called the initialized population.
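Building chromosomes and an initial population along these lines might look as follows. The encoding helper, the interval [-1, 2], the 22-bit gene length and the seed are assumptions chosen so that the example chromosome matches the one quoted above.

```python
import random


def encode(value, lower, upper, n_bits):
    """Map a value in [lower, upper] to an n_bits-long binary string."""
    levels = (1 << n_bits) - 1
    index = round((value - lower) / (upper - lower) * levels)
    return format(index, "0{}b".format(n_bits))


def make_chromosome(params, lower, upper, n_bits):
    """Concatenate the binary genes of one parameter set end to end."""
    return "".join(encode(p, lower, upper, n_bits) for p in params)


def init_population(size, n_params, lower, upper, n_bits, seed=0):
    """Randomly initialize `size` parameter sets and encode each as a chromosome."""
    rng = random.Random(seed)
    return [
        make_chromosome([rng.uniform(lower, upper) for _ in range(n_params)],
                        lower, upper, n_bits)
        for _ in range(size)
    ]


chromosome = make_chromosome([0.637197, 0.637197], -1.0, 2.0, 22)
population = init_population(50, 2, -1.0, 2.0, 22)
```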

Calculating the prediction accuracy corresponding to each chromosome in the population specifically comprises: splitting each chromosome into gene segments (i.e., genes), converting each gene segment from binary back to decimal to obtain the decimal parameter values, and substituting the parameter values into the functional relation between the selected model parameters and the prediction accuracy of the link prediction model to obtain the corresponding prediction accuracy. The 50 sets of parameter values correspond to 50 chromosomes, which yield 50 prediction accuracies.
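The evaluation of a single chromosome can be sketched as below. The function `toy_accuracy` is a hypothetical stand-in for the functional relation y = g(f(x1, x2, ...)), and the interval and gene length are the same assumptions used to reproduce the quoted 22-bit gene.

```python
def decode(bits, lower, upper):
    """Map a binary string back to a decimal value in [lower, upper]."""
    levels = (1 << len(bits)) - 1
    return lower + int(bits, 2) * (upper - lower) / levels


def split_genes(chromosome, n_bits):
    """Split a chromosome into consecutive genes of n_bits each."""
    return [chromosome[i:i + n_bits] for i in range(0, len(chromosome), n_bits)]


def chromosome_accuracy(chromosome, n_bits, lower, upper, accuracy_fn):
    """Decode every gene and pass the parameter values to the accuracy function."""
    params = [decode(g, lower, upper) for g in split_genes(chromosome, n_bits)]
    return accuracy_fn(params)


def toy_accuracy(params):
    """Hypothetical accuracy surface in (0, 1], highest when all params equal 0.5."""
    return 1.0 / (1.0 + sum((p - 0.5) ** 2 for p in params))


chromosome = "1000101110110101000111" * 2
genes = split_genes(chromosome, 22)
y = chromosome_accuracy(chromosome, 22, -1.0, 2.0, toy_accuracy)
```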

When chromosome selection is performed on the population, a certain number of chromosomes are selected according to the calculated prediction accuracy of each chromosome. For example, if there are two chromosomes whose prediction accuracies are [0.6, 0.3], the probability that the first chromosome is selected is 0.6/(0.3+0.6) ≈ 0.67 and the probability that the second is selected is 0.3/(0.3+0.6) ≈ 0.33. Sampling is then performed according to these selection probabilities, so that chromosomes with higher prediction accuracy are more likely to be selected; for example, 25 chromosomes may be selected from 50.
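This accuracy-proportional selection can be sketched with the standard library's weighted sampling. Sampling with replacement, the function names and the seed are assumptions; the sketch also assumes at least one accuracy is greater than zero.

```python
import random


def selection_probabilities(accuracies):
    """Selection probability of each chromosome, proportional to its accuracy."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def select_chromosomes(population, accuracies, n_select, seed=0):
    """Draw n_select chromosomes, weighted by accuracy (with replacement)."""
    rng = random.Random(seed)
    return rng.choices(population, weights=accuracies, k=n_select)


probs = selection_probabilities([0.6, 0.3])
selected = select_chromosomes(["000011001101", "001101010101"], [0.6, 0.3], n_select=25)
```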

A crossover operation is then performed on the selected chromosomes. Specifically, two of the selected chromosomes are chosen each time and crossed to obtain two new chromosomes, until the number of new chromosomes equals the number of original chromosomes in the population; for example, 25 crossover operations on the selected 25 chromosomes yield 50 new chromosomes.

The crossover operation is illustrated with the following two chromosomes. Before crossover:

0000 1100 1101

0011 0101 0101

After crossover:

0000 0101 1101

0011 1100 0101

Here the middle 4 bits of each 12-bit chromosome code form the crossover segment: exchanging "1100" and "0101" between the two chromosomes gives two new chromosomes.
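The segment exchange in this example can be written as a short function. The name `crossover` and the use of zero-based, half-open `start` and `end` positions are conventions assumed for the sketch.

```python
def crossover(parent_a, parent_b, start, end):
    """Swap the segment [start, end) between two equal-length chromosomes."""
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b


# The two 12-bit chromosomes from the example, exchanging the middle 4 bits.
child_a, child_b = crossover("000011001101", "001101010101", 4, 8)
# child_a == "000001011101", child_b == "001111000101"
```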

Carrying out mutation operation on the new chromosomes obtained after the crossover operation, specifically, selecting any m (the value of m is manually set in advance) small nodes for each chromosome of the new chromosomes obtained after the crossover operation to carry out mutation operation, wherein the mutation operation is to change the original node '1' into '0' and change the original node '0' into '1'. For example, before mutation:

0001 0110 0101

after mutation:

0001 0111 0101

after mutation is applied to the 8th bit (node), the original "0" is changed to "1".
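The bit flip in this example can be sketched as follows. The function name `mutate` and the 1-based position convention (chosen to match the "8th bit" wording) are assumptions.

```python
def mutate(chromosome, positions):
    """Flip the bits at the given 1-based positions of a binary chromosome."""
    bits = list(chromosome.replace(" ", ""))
    for pos in positions:
        bits[pos - 1] = "1" if bits[pos - 1] == "0" else "0"
    return "".join(bits)


print(mutate("0001 0110 0101", [8]))  # 000101110101
```

In the method described here the m positions would be drawn at random for each chromosome; they are passed in explicitly above so that the example is reproducible.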

After the chromosomes in the current population have undergone selection, crossover, and mutation, a new population is obtained. The prediction accuracy corresponding to each chromosome is recalculated for the new population, and it is judged whether the highest prediction accuracy in the new population meets a preset requirement. If it does not, the new population is taken as the current population, and chromosome selection, crossover, and mutation are performed again to obtain another new population. These steps are repeated until the highest prediction accuracy of the chromosomes in the most recently obtained population meets the preset requirement, at which point the loop stops and the parameters of the selected model after parameter optimization are output. These parameters are obtained by converting the binary code of the chromosome corresponding to the highest prediction accuracy back to decimal numbers.
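The overall loop can be sketched as below. This is a simplified illustration under several assumptions: the `fitness` callable stands in for the functional relation between parameters and prediction accuracy, half of the population is selected each round, crossover uses two random cut points, and a fixed accuracy target plus a generation cap serve as the stopping rule. All names are hypothetical.

```python
import random


def run_ga(fitness, n_bits=12, pop_size=50, n_mutations=1,
           target=0.99, max_generations=200, seed=0):
    """Evolve binary chromosomes until the best fitness reaches the target."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]
    best, best_score = population[0], fitness(population[0])
    for _ in range(max_generations):
        scores = [fitness(c) for c in population]
        best_score = max(scores)
        best = population[scores.index(best_score)]
        if best_score >= target:
            break
        # Selection: draw half the population, proportional to fitness.
        weights = [s + 1e-9 for s in scores]  # avoid an all-zero weight vector
        parents = rng.choices(population, weights=weights, k=pop_size // 2)
        # Crossover: each pair of parents yields two children.
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            lo, hi = sorted(rng.sample(range(n_bits + 1), 2))
            children.append(a[:lo] + b[lo:hi] + a[hi:])
            children.append(b[:lo] + a[lo:hi] + b[hi:])
        # Mutation: flip n_mutations randomly chosen bits in every child.
        population = []
        for child in children[:pop_size]:
            bits = list(child)
            for pos in rng.sample(range(n_bits), n_mutations):
                bits[pos] = "1" if bits[pos] == "0" else "0"
            population.append("".join(bits))
    # Returns the best chromosome of the last evaluated population.
    return best, best_score


# Toy fitness (assumption): the fraction of 1-bits in the chromosome.
best, score = run_ga(lambda c: c.count("1") / len(c), max_generations=20)
```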

The highest prediction accuracy may be regarded as meeting the preset requirement when either of the following conditions is satisfied: the highest prediction accuracy of the current population remains the highest over the next n rounds of optimization (the value of n is set as needed); or the prediction accuracy corresponding to a chromosome in the current population reaches a preset accuracy value.
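One way to read these two stopping conditions is sketched below. The function name `should_stop`, the parameter names, and the interpretation of "remains the highest" as "no later generation exceeds it" are assumptions.

```python
def should_stop(best_history, n, target=None):
    """Decide whether to stop, given the best accuracy of each generation.

    best_history: best accuracy per generation, oldest first.
    n: number of later generations that must fail to beat an earlier best.
    target: optional preset accuracy value.
    """
    if not best_history:
        return False
    # Condition 2: a chromosome has reached the preset accuracy value.
    if target is not None and best_history[-1] >= target:
        return True
    # Condition 1: the last n generations did not improve on the earlier best.
    if len(best_history) > n:
        earlier_best = max(best_history[:-n])
        return max(best_history[-n:]) <= earlier_best
    return False


print(should_stop([0.70, 0.80, 0.80, 0.79], n=2))  # True
print(should_stop([0.70, 0.80, 0.85], n=2))        # False
```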

Step S103: the link prediction model is trained using a second training set.

The second training set comprises feature sets of edges of the existing relationship network and the modified class labels. And taking the feature set of each edge of the existing relational network as the input of the link prediction model, taking the modified class label as a training target, and training the link prediction model to obtain the trained link prediction model.

Step S104: and performing link prediction on the input relational network data by using the trained link prediction model.

According to the input relational network data, the feature set of each edge (including observed edges and unobserved edges) of the input relational network can be calculated according to a preset index. And inputting the feature set of each edge of the input relation network into the trained link prediction model to determine whether each edge (the relation between users) of the relation network exists or not.

In the embodiment of the invention, when the relationship network contains a large number of edges that actually exist but cannot be observed, as well as spurious edges, the link prediction effect is markedly improved compared with the prior art.

Fig. 2 is a schematic diagram of a link prediction flow according to an embodiment of the present invention.

In this embodiment, the selected model is a logistic regression model, and the link prediction model is a support vector machine model.

In many application scenarios of link prediction, a large number of relationships between nodes cannot be observed, or the observed relationships between nodes are spurious. For example, consider the financial relationships among users A, B, and C: the transfer relationship between users A and B can be found through the users' transfer data, but B and C transact in cash, so their relationship cannot be observed. The prior art assumes that, because no relationship was observed between B and C, no financial relationship exists between them, which degrades the model's ability to mine relationships. The present embodiment uses a logistic regression model to correct such erroneous relationships and performs modeling after the correction, thereby improving the link prediction effect of the final support vector machine model.

As shown in fig. 2, the link prediction process of the embodiment of the present invention includes steps S201 to S208 as follows.

Step S201: and acquiring a network graph of the relationship between the users, and converting the relationship between the users into an adjacency matrix.

Step S202: according to the adjacency matrix, calculating indexes of each edge of the existing relationship network, such as the common neighbors (CN) index, the average commute time (ACT), the random walk with restart (RWR) index, the preferential attachment (PA) similarity, the local path (LP) index, and the Katz index, so as to obtain a feature set of each edge.
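Two of these indexes can be computed directly from the adjacency matrix, as the following sketch shows. It uses the standard definitions of common neighbors (number of shared neighbors) and preferential attachment (product of the two node degrees); the 4-user matrix and the function names are made up for illustration.

```python
def common_neighbors(adj, i, j):
    """Number of nodes adjacent to both node i and node j."""
    return sum(1 for a, b in zip(adj[i], adj[j]) if a and b)


def preferential_attachment(adj, i, j):
    """Product of the degrees of node i and node j."""
    return sum(adj[i]) * sum(adj[j])


# Hypothetical symmetric adjacency matrix for 4 users.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

# Feature set for the (unobserved) edge between users 0 and 3.
features = [common_neighbors(adj, 0, 3), preferential_attachment(adj, 0, 3)]
print(features)  # [1, 2]
```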

Step S203: a data set is generated in a data format suitable for link prediction model training.

The dataset includes a feature set for each edge of the existing relationship network and a category label for each edge.

Step S204: initializing a logistic regression model and initializing a support vector machine model, and dividing a data set into a first training set and a first testing set to establish a function form with input as logistic regression parameters and output as prediction accuracy of the support vector machine model.

Specifically, multiple sets of parameters (namely multiple sets of logistic regression parameters) of a logistic regression model are initialized, under each set of parameters, the logistic regression model is used for correcting class labels of all edges in a first training set to obtain a third training set, the third training set is used for performing first training on the initialized support vector machine model, the support vector machine model after the first training is used for performing link prediction on a pre-generated test set, and the prediction accuracy of the support vector machine model under the set of parameters is calculated according to the prediction result; and establishing a functional relation between the logistic regression parameters and the prediction accuracy of the support vector machine model according to the multiple groups of parameters and the prediction accuracy of the support vector machine model under each group of parameters, wherein the input of the functional relation is the logistic regression parameters, and the output of the functional relation is the prediction accuracy of the support vector machine model. The test set is generated according to the first test set, and the specific generation method is described above and is not described here again.
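The evaluation of one parameter set can be sketched as a single function, shown below. The label-correction and training routines are passed in as callables because their details depend on the chosen models; the toy stand-ins used in the example are assumptions and are not the logistic regression or support vector machine models themselves.

```python
def accuracy_for_params(params, first_training_set, test_set,
                        correct_labels, train_model):
    """Return the prediction accuracy obtained under one parameter set.

    correct_labels(training_set, params) -> corrected (third) training set
    train_model(training_set) -> predictor mapping features to a label
    """
    third_training_set = correct_labels(first_training_set, params)
    model = train_model(third_training_set)
    hits = sum(1 for features, label in test_set if model(features) == label)
    return hits / len(test_set)


# Toy stand-ins (assumptions): labels are corrected by thresholding the first
# feature, and "training" learns the smallest positive feature value.
def toy_correct_labels(training_set, threshold):
    return [(f, 1 if f[0] >= threshold else 0) for f, _ in training_set]


def toy_train_model(training_set):
    cut = min(f[0] for f, label in training_set if label == 1)
    return lambda f: 1 if f[0] >= cut else 0


train = [([1], 0), ([2], 0), ([3], 1)]
test = [([1], 0), ([2], 1), ([3], 1), ([0], 1)]
print(accuracy_for_params(2, train, test, toy_correct_labels, toy_train_model))
# 0.75
```

Evaluating this function for each initialized parameter set gives the (parameters, accuracy) pairs from which the functional relation is established.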

Step S205: the logistic regression parameters in the functional form are optimized using genetic algorithms.

Initializing a population, wherein each chromosome in the population is obtained by connecting the binary codes of a group of parameter values of a logistic regression model end to end; calculating the prediction accuracy corresponding to each chromosome in the population for the initialized population, judging whether the highest prediction accuracy in the population meets the preset requirement, if so, reducing the binary codes of the chromosomes corresponding to the highest prediction accuracy into a group of decimal parameters, and using the decimal parameters as logistic regression parameters after parameter optimization; if not, carrying out chromosome selection, crossing and mutation operations on the population to obtain a new population, and continuously repeating the process until the highest prediction accuracy corresponding to the chromosomes in the latest population obtained at a certain time reaches a preset requirement, reducing the binary codes of the chromosomes corresponding to the highest prediction accuracy into a group of decimal parameters serving as logistic regression parameters after parameter optimization; the prediction accuracy corresponding to the chromosome is the prediction accuracy of the support vector machine model under a set of parameters after the binary codes forming the chromosome are reduced into the set of parameters of the logistic regression model, and the prediction accuracy corresponding to the chromosome is calculated according to the functional relation.

Step S206: and under the optimized logistic regression parameters, utilizing the logistic regression model to predict each edge in the first training set again to generate a predicted value of each edge, and using the predicted value to replace the class label of each edge in the first training set to obtain a second training set.
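The label replacement in this step can be sketched as follows, assuming a standard logistic regression with a 0.5 decision threshold. The weights, bias, threshold, and feature values are made-up illustration values, and the function names are hypothetical.

```python
import math


def predict_edge(weights, bias, features, threshold=0.5):
    """Logistic regression prediction: 1 if the edge is predicted to exist."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1 if probability >= threshold else 0


def build_second_training_set(first_training_set, weights, bias):
    """Replace the class label of every edge with the model's predicted value."""
    return [(features, predict_edge(weights, bias, features))
            for features, _ in first_training_set]


first_training_set = [([3.0, 1.0], 0), ([0.0, 0.0], 1)]
second_training_set = build_second_training_set(
    first_training_set, weights=[1.0, 0.5], bias=-2.0)
print([label for _, label in second_training_set])  # [1, 0]
```

In this toy example the first edge was labeled as absent but is predicted to exist (an unobserved relationship), and the second edge was labeled as present but is predicted to be spurious.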

Step S207: the support vector machine model is trained using a second training set.

Step S208: and inputting the feature set of each edge of a relation network into the trained support vector machine model to determine whether each edge of the relation network exists.

The feature set of each edge of the relational network is obtained by calculating the input relational network data according to a preset index.

In this embodiment, the prediction accuracy of the support vector machine is expressed in a functional form that can be optimized by a genetic algorithm. The optimized functional form is used to improve the accuracy of link prediction, which overcomes the poor relationship network data mining results obtained when the network contains a large number of edges that actually exist but cannot be observed, as well as spurious edges.

Fig. 3 is a schematic diagram of main blocks of a link prediction apparatus according to an embodiment of the present invention.

As shown in fig. 3, the link prediction apparatus 300 mainly includes: a first training set generation module 301, a second training set generation module 302, a link prediction model training module 303, and a link prediction module 304.

A first training set generating module 301, configured to generate a first training set according to existing relationship network data, where the first training set includes category labels of edges of an existing relationship network, where an edge represents a relationship between two users in the relationship network, and a category label of an edge indicates whether a relationship between two users exists.

The first training set generating module may specifically be configured to: calculating feature sets of all edges of the existing relationship network according to preset indexes according to the existing relationship network data; obtaining a data set according to the feature set of each edge and the category label of each edge; the data set is divided into two parts according to a preset proportion, and one part of the two parts is used as a first training set.

The link prediction apparatus 300 may further include a test set generation module for generating a test set by: taking the part of the data set except the first training set as a first test set; and sampling the first test set for multiple times, stopping sampling until the total number of the samples obtained by accumulative sampling is the same as the number of the samples in the first test set, and taking the set of all the samples obtained by sampling as the test set.
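The test set generation described here resembles bootstrap resampling, and can be sketched as below. The sketch assumes that one sample is drawn with replacement at a time; the function name and the seed parameter are hypothetical.

```python
import random


def generate_test_set(first_test_set, seed=None):
    """Resample the first test set until the sample count matches its size.

    Assumption: each draw takes one sample, with replacement.
    """
    rng = random.Random(seed)
    samples = []
    while len(samples) < len(first_test_set):
        samples.append(rng.choice(first_test_set))
    return samples


first_test_set = [([1, 2], 1), ([0, 3], 0), ([2, 2], 1), ([0, 0], 0)]
test_set = generate_test_set(first_test_set, seed=0)
print(len(test_set) == len(first_test_set))  # True
```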

And a second training set generating module 302, configured to perform parameter optimization on the selected model, and modify the class label of each edge in the first training set by using the selected model after parameter optimization to obtain a second training set.

And selecting a model for generating a predicted value of each edge. The model can be a logistic regression model or a feedforward neural network model.

Modifying the class label of each edge in the first training set, comprising: and replacing the class label of each edge in the first training set with the predicted value of each edge.

The second training set generation module 302 may include a parameter optimization submodule for: initializing a plurality of groups of parameters of the selected model; under each group of parameters of the selected model, modifying the class labels of all edges in the first training set by using the selected model to obtain a third training set, performing first training on the link prediction model by using the third training set, performing link prediction on a pre-generated test set by using the link prediction model after the first training, and calculating the prediction accuracy of the link prediction model under the group of parameters according to the prediction result; establishing a functional relation between the selected model parameters and the prediction accuracy of the link prediction model according to the multiple groups of parameters and the prediction accuracy of the link prediction model under each group of parameters; and optimizing the selected model parameters in the functional relation by using a genetic algorithm to obtain the parameters of the selected model after parameter optimization.

The parameter optimization submodule may include a parameter optimization execution unit to: initializing a population, wherein each chromosome in the population is obtained by connecting the binary codes of a group of parameter values of a selected model end to end; calculating the prediction accuracy corresponding to each chromosome in the population for the initialized population, judging whether the highest prediction accuracy in the population meets the preset requirement, if so, reducing the binary codes of the chromosomes corresponding to the highest prediction accuracy into a group of decimal parameters, and using the decimal parameters as the parameters of the selected model after parameter optimization; if not, carrying out chromosome selection, crossing and mutation operations on the population to obtain a new population, and continuously repeating the process until the highest prediction accuracy corresponding to the chromosomes in the latest population obtained at a certain time reaches a preset requirement, reducing the binary codes of the chromosomes corresponding to the highest prediction accuracy into a group of decimal parameters serving as the parameters of the selected model after parameter optimization; and the prediction accuracy corresponding to the chromosome is the prediction accuracy of the link prediction model under the set of parameters after the binary codes forming the chromosome are reduced into the set of parameters of the selected model, and the prediction accuracy corresponding to the chromosome is calculated according to the functional relation.

A link prediction model training module 303, configured to train a link prediction model using the second training set;

and the link prediction module 304 is configured to perform link prediction on the input relational network data by using the trained link prediction model.

In addition, the detailed implementation of the link prediction device in the embodiment of the present invention has been described in detail in the above link prediction method, and therefore, the repeated content will not be described again.

Fig. 4 shows an exemplary system architecture 400 to which the link prediction method or the link prediction apparatus of the embodiments of the present invention may be applied.

As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.

A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The background management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, target push information or product information, by way of example only) to the terminal device.

It should be noted that the link prediction method provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the link prediction apparatus is generally disposed in the server 405.

It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a terminal device or server of an embodiment of the present application is shown. The terminal device or the server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 501.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a first training set generation module, a second training set generation module, a link prediction model training module, and a link prediction module. Where the names of these modules do not in some cases constitute a limitation of the module itself, for example, the first training set generating module may also be described as a "module for generating a first training set from existing relational network data".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: generating a first training set according to existing relationship network data, wherein the first training set comprises class labels of all edges of an existing relationship network, the edges represent the relationship between two users in the relationship network, and the class labels of the edges indicate whether the relationship between the two users exists or not; performing parameter optimization on the selected model, and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; training a link prediction model using the second training set; and performing link prediction on the input relational network data by using the trained link prediction model.

According to the technical scheme of the embodiment of the invention, a first training set is generated according to the existing relationship network data, wherein the first training set comprises class labels of all edges of the existing relationship network; performing parameter optimization on the selected model, and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; training the link prediction model using a second training set; and performing link prediction on the input relational network data by using the trained link prediction model. The accuracy of link prediction can be improved, and the effect of relation network data mining can be improved.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of link prediction, comprising:

generating a first training set according to existing relationship network data, wherein the first training set comprises class labels of all edges of an existing relationship network, the edges represent the relationship between two users in the relationship network, and the class labels of the edges indicate whether the relationship between the two users exists or not;

performing parameter optimization on the selected model, and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; a step of performing parameter optimization on the selected model, comprising: initializing a plurality of sets of parameters of the selected model; under each group of parameters of the selected model, correcting the class labels of all edges in the first training set by using the selected model to obtain a third training set, performing first training on a link prediction model by using the third training set, performing link prediction on a pre-generated test set by using the link prediction model after the first training, and calculating the prediction accuracy of the link prediction model under the group of parameters according to a prediction result; establishing a functional relation between the selected model parameters and the prediction accuracy of the link prediction model according to the multiple groups of parameters and the prediction accuracy of the link prediction model under each group of parameters; optimizing the selected model parameters in the functional relation by using a genetic algorithm to obtain the parameters of the selected model after the parameters are optimized;

training a link prediction model using the second training set;

and performing link prediction on the input relational network data by using the trained link prediction model.

2. The method of claim 1, wherein the selected model is used to generate a predicted value for the edges,

modifying the class label of each edge in the first training set, including: and replacing the class label of each edge in the first training set with the predicted value of each edge.

3. The method of claim 1, wherein the step of generating the first training set from existing relationship network data comprises:

calculating feature sets of all edges of the existing relationship network according to preset indexes and existing relationship network data;

obtaining a data set according to the feature set of each edge and the category label of each edge;

dividing the data set into two parts according to a preset proportion, and taking one part of the two parts as the first training set;

the test set is generated by:

taking the part of the data set except the first training set as a first test set;

and sampling the first test set for multiple times, stopping sampling until the total number of the samples obtained by accumulative sampling is the same as the number of the samples in the first test set, and taking the set of all the samples obtained by sampling as the test set.

4. The method of claim 1, wherein the step of using a genetic algorithm to optimize the selected model parameters in the functional relationship to obtain the parameters of the selected model after the parameter optimization comprises:

initializing a population, wherein each chromosome in the population is obtained by connecting the binary codes of a group of parameter values of the selected model end to end;

calculating the prediction accuracy corresponding to each chromosome in the population for the initialized population, judging whether the highest prediction accuracy in the population meets the preset requirement, if so, reducing the binary codes of the chromosomes corresponding to the highest prediction accuracy into a group of decimal parameters, and using the decimal parameters as the parameters of the selected model after the parameters are optimized; if not, carrying out chromosome selection, crossing and mutation operations on the population to obtain a new population, and continuously repeating the process until the highest prediction accuracy corresponding to the chromosomes in the latest population obtained at a certain time reaches the preset requirement, reducing the binary codes of the chromosomes corresponding to the highest prediction accuracy into a group of decimal parameters which are used as the parameters of the selected model after the parameters are optimized; wherein,

and the prediction accuracy corresponding to the chromosome is the prediction accuracy of the link prediction model under the set of parameters after the binary codes forming the chromosome are reduced into the set of parameters of the selected model, and the prediction accuracy corresponding to the chromosome is calculated according to the functional relation.

5. The method of claim 1, wherein the selected model is a logistic regression model or a feed forward neural network model.

6. A link prediction apparatus, comprising:

the first training set generating module is used for generating a first training set according to existing relationship network data, wherein the first training set comprises class labels of all edges of the existing relationship network, the edges represent the relationship between two users in the relationship network, and the class labels of the edges indicate whether the relationship between the two users exists or not;

the second training set generation module is used for performing parameter optimization on the selected model and correcting the class labels of all edges in the first training set by using the selected model after parameter optimization to obtain a second training set; wherein the second training set generation module comprises a parameter optimization submodule configured to: initializing a plurality of sets of parameters of the selected model; under each group of parameters of the selected model, correcting the class labels of all edges in the first training set by using the selected model to obtain a third training set, performing first training on a link prediction model by using the third training set, performing link prediction on a pre-generated test set by using the link prediction model after the first training, and calculating the prediction accuracy of the link prediction model under the group of parameters according to a prediction result; establishing a functional relation between the selected model parameters and the prediction accuracy of the link prediction model according to the multiple groups of parameters and the prediction accuracy of the link prediction model under each group of parameters; optimizing the selected model parameters in the functional relation by using a genetic algorithm to obtain the parameters of the selected model after the parameters are optimized;

a link prediction model training module for training a link prediction model using the second training set;

and the link prediction module is used for performing link prediction on the input relational network data by using the trained link prediction model.

7. An electronic device, comprising:

one or more processors;

a memory for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-5.

8. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-5.