CN110717043A - Academic team construction method based on network representation learning training - Google Patents


Info

Publication number
CN110717043A
CN110717043A
Authority
CN
China
Prior art keywords
node
academic
network
data
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910930765.8A
Other languages
Chinese (zh)
Inventor
李微
陈瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Three Helix Big Data Technology (kunshan) Co Ltd
Original Assignee
Three Helix Big Data Technology (kunshan) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Three Helix Big Data Technology (kunshan) Co Ltd filed Critical Three Helix Big Data Technology (kunshan) Co Ltd
Priority to CN201910930765.8A priority Critical patent/CN110717043A/en
Publication of CN110717043A publication Critical patent/CN110717043A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention discloses an academic team construction method based on network representation learning training, which comprises the following steps. Step one: reading scholar and scientific research data from the database. Step two: training an author topic model to obtain the author topic probability distribution. Step three: constructing an initial academic network. Step four: training a network representation learning model to obtain scholar vectors. Step five: clustering the scholar vectors with a machine learning clustering method. Step six: outputting clusters that meet a preset threshold as academic teams. The method builds teams efficiently, the constructed teams have high topic similarity, and communities of different granularities can be obtained by changing the number of clusters as needed.

Description

Academic team construction method based on network representation learning training
[ technical field ]
The invention belongs to the technical field of social network analysis, and particularly relates to an academic team construction method based on network representation learning training.
[ background of the invention ]
With the development of scientific research, wide cooperation among researchers has formed complex academic networks: the scale of academic teams has grown, and the relations among team members are complex. Deeply understanding and mining the composition of academic teams helps enterprises quickly grasp information about university research groups in industry-university-research cooperation, and also helps scientific research management departments identify talented researchers and research teams, promoting the development of disciplines.
The task of dividing academic teams can be completed with community discovery techniques. Most existing methods are based on network topology information, mainly including clustering-based methods, modularity-based methods, spectral clustering, stochastic block models, and the like. Prior-art patent application No. 201810851399.2 also discloses a team construction method based on an academic network that can divide communities. However, in an academic network the scholars corresponding to nodes carry a large amount of text information, such as their research directions and paper data; division methods based purely on network topology ignore this text information, so topic cohesion of the resulting scholar communities is hard to guarantee. Moreover, existing community discovery methods cannot control the scale of the divided communities, and modularity-optimization methods easily produce very large, ineffectively divided communities.
Therefore, there is a need to provide a new academic team construction method based on network representation learning training to solve the above technical problems.
[ summary of the invention ]
The invention mainly aims to provide an academic team construction method based on network representation learning training which builds teams efficiently, produces teams with high topic similarity, and can divide communities of different granularities by changing the number of clusters as required.
The invention realizes the purpose through the following technical scheme: an academic team construction method based on network representation learning training comprises the following steps.
Step one: reading scholar and scientific research data from the database;
step two: training an author topic model to obtain the author topic probability distribution;
step three: constructing an initial academic network;
step four: training a network representation learning model to obtain scholar vectors;
step five: clustering the scholar vectors with a machine learning clustering method;
step six: outputting clusters that meet a preset threshold as academic teams.
Compared with the prior art, the academic team construction method based on network representation learning training has the following beneficial effects: in the community discovery process, not only is the physical topology of the academic network considered, but the author topic probability distribution obtained by author topic model training also blends the scholars' text data into the process, so the resulting academic teams have higher topic cohesion; in addition, when the scholar vectors are clustered with a machine learning clustering method, the number of clusters can be changed, so the number and scale of the academic teams can be flexibly controlled.
[ description of the drawings ]
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an algorithmic process according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an AT model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an author topic and topic probability distribution generated by an AT model according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an author topic probability distribution in an embodiment of the present invention;
fig. 6 is a schematic diagram of an academic network according to an embodiment of the present invention.
[ detailed description ]
Embodiment:
referring to fig. 1, the present embodiment is a method for constructing an academic team based on network representation learning training, which includes the following steps:
step one: reading scholar and scientific research data from the database;
step two: training an author topic model (Author-Topic model, hereinafter the AT model) to obtain the scholars' topic probability distributions;
step three: constructing an initial academic network;
step four: training a network representation learning model to obtain scholar vectors;
step five: clustering the scholar vectors with a machine learning clustering method;
step six: outputting clusters that meet a preset threshold as academic teams.
Specifically, the specific method of the above steps is as follows.
Step one: read the scholar and scientific research data from the database.
Read the relevant data from a scholar database, including:
scholar information, including ID, name, school, college;
paper data, including ID, title, author, abstract, issuing institution;
project data, including ID, title, participants;
patent data, including ID, title, inventor, applicant organization.
Wherein:
the scholar information corresponds to the nodes in the academic network;
the authors of papers, participants of projects, and inventors of patents are used to extract collaboration data, i.e. the edges in the academic network;
the abstracts of papers are used for AT model training to represent the scholars' research topics, so that text information can be integrated into the training of the scholar vectors.
The ID in the related data is the primary key of the database and serves as the unique identifier of a data item.
The scholar database may be an existing database such as CNKI (China National Knowledge Infrastructure), Wanfang Data, the National Intellectual Property Administration, Weipu (VIP), or Baidu Scholar, or a database preset by the system.
Specifically:
1) An example of a scholar information query is as follows:
{ 'id': 74347, 'name': 'Du XX', 'school': 'China Agricultural University', 'ins': 'College of Information and Electrical Engineering' }
2) For each scholar, query his or her scientific research data, including paper data, project data, and patent data; examples are as follows:
(Example research-data records; shown as images in the original publication.)
3) Save the scientific research data as documents for later operations. Specifically, each document may be stored as a txt file; for example, a file 74347.txt in the current directory holds the paper data of the scholar with ID 74347, hereinafter called a "document". The patent and project data are handled similarly.
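The per-scholar document files described above can be written with a short Python sketch. This is a minimal illustration: the function name `save_documents`, the input shape (a mapping from scholar ID to abstract strings), and the output directory are assumptions for the example, not part of the original disclosure.

```python
import os

def save_documents(records, out_dir):
    """Write each scholar's concatenated abstracts to an <ID>.txt document.

    `records` maps scholar ID -> list of abstract strings (assumed shape).
    Returns the sorted list of files created, for inspection."""
    for scholar_id, abstracts in records.items():
        path = os.path.join(out_dir, f"{scholar_id}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(abstracts))
    return sorted(os.listdir(out_dir))
```

Calling `save_documents({74347: [...]}, some_dir)` would then produce the `74347.txt` document used in the later AT-model steps.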
Examples of the role of each kind of data:
(1) The scholar information corresponds to the nodes in the academic network:
(Example node record; shown as an image in the original publication.)
(2) The authors of papers, participants of projects, and inventors of patents are used to extract the collaboration data, i.e. the edges in the academic network:
(Example edge records; shown as an image in the original publication.)
(3) The abstract of each paper is used for AT model training to characterize the scholar's research topics, represented as a probability distribution:
{ '74347-Du XX': [(5, 0.8992357556293058), (11, 0.099451539260517502)] }
wherein:
'74347' is the scholar ID;
'Du XX' is the name;
(5, 0.8992357556293058) indicates that the probability on topic 5 is 0.8992;
(11, 0.099451539260517502) indicates that the probability on topic 11 is 0.0995.
The probability on the remaining topics is zero (or negligible) and is ignored; the scholar's probabilities over all topics sum to 1. This distribution is later used to calculate the topic similarity between scholars.
Step two: train the author topic model to obtain the scholars' topic probability distributions.
Using the AT model (Author-Topic Model) and the abstracts of all the papers from step one, calculate each scholar's topic probability distribution. The probabilistic structure of the AT model is shown in FIG. 3: x denotes an author, z denotes a topic, θ denotes the author-topic probability distribution generated from the Dirichlet prior α, φ denotes the topic-word probability distribution generated from the Dirichlet prior β, A is the total number of authors, T is the total number of topics, w_d is the word set of document d, and a_d is the author set of document d. The solving method of the AT model is prior art in the field and is not detailed in this embodiment; the AT model function is called directly during programming, as follows:
model=gensim.models.atmodel.AuthorTopicModel(corpus, num_topics=theme_num,author2doc=author2doc,id2word=dictionary)
During calculation the paper data exist in the following form:
C = {(w_1, a_1), (w_2, a_2), ......, (w_M, a_M)},
where M is the total number of documents.
Before calculating with the AT model, the paper-data documents saved in step one need to be processed as follows:
1) create the author-to-document mapping table, e.g.
{ 'Zhang San': [1], 'Li Si': [2, 3, 4], 'Wang Wu': [5] };
2) create the word-to-ID mapping table, e.g.
{ 0: 'computer', 1: 'data mining', ... };
3) encode each document as a bag of words, in the form [[(0, 1), ..., (6, 1)], [(9, 2), ...]], where each pair is (word ID, word count).
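The three preprocessing structures above — author-to-document map, word-to-ID dictionary, and bag-of-words corpus — can be built with a minimal pure-Python sketch. The function name `preprocess` and the input shape (author name mapped to tokenised abstracts) are illustrative assumptions; in practice gensim's corpus utilities would typically be used instead.

```python
from collections import Counter

def preprocess(docs_by_author):
    """Build author2doc, the word->ID dictionary, and the bag-of-words corpus.

    `docs_by_author` maps author name -> list of tokenised documents (assumed input)."""
    author2doc, dictionary, corpus = {}, {}, []
    doc_idx = 0
    for author, docs in docs_by_author.items():
        author2doc[author] = []
        for tokens in docs:
            author2doc[author].append(doc_idx)
            for w in tokens:                      # assign IDs in first-seen order
                dictionary.setdefault(w, len(dictionary))
            counts = Counter(dictionary[w] for w in tokens)
            corpus.append(sorted(counts.items()))  # [(word_id, count), ...]
            doc_idx += 1
    return author2doc, dictionary, corpus
```

The resulting `author2doc`, `dictionary`, and `corpus` match the argument names of the `AuthorTopicModel` call shown above.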
The training process is schematically illustrated in fig. 4-5.
Step three: an initial academic network is constructed.
Construct the node and edge data required by the academic network as follows:
1) Establish the nodes in the academic network, where the data of each node comprises the scholar ID, name, school, college, and the author topic probability obtained in step two; for example:
(Example node record; shown as an image in the original publication.)
2) Establish the edges in the academic network: extract the collaboration data from the authors of papers, participants of projects, and inventors of patents to obtain the edges; for example:
(Example edge records; shown as an image in the original publication.)
3) Construct the initial academic network from the node and edge data. The model is shown schematically in fig. 6; it is essentially an undirected weighted graph G = (V, E, W), where V denotes the node set, i.e. the set of all scholar nodes (see the node data above), E denotes the edge set, i.e. the set of all scholar relations (see source and target in the edge data above), and W denotes the set of edge weights, i.e. the weight field of each edge.
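The graph assembly in step three can be sketched as follows. This is a hedged illustration: the function name and data shapes are assumptions, and the weight rule (number of joint papers/projects/patents per pair) follows the description of W above.

```python
from collections import defaultdict

def build_network(nodes, collaborations):
    """Assemble the undirected weighted graph G = (V, E, W).

    `nodes` maps scholar ID -> attribute dict (name, school, topic distribution);
    `collaborations` lists (source, target) pairs, one per joint paper/project/patent.
    The weight of an edge is the number of collaborations between the pair."""
    V = dict(nodes)
    W = defaultdict(int)
    for u, v in collaborations:
        key = (min(u, v), max(u, v))   # undirected: store each pair once
        W[key] += 1
    E = set(W)
    return V, E, dict(W)
```

Repeated collaborations between the same pair simply accumulate in the edge weight, matching the "sum of cooperation times" rule.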
Step four: train the scholar vectors based on a network representation learning method.
By improving the calculation of the transition probability in the node2vec random walk, the text information of the nodes can be integrated into the feature-sequence extraction, yielding an improved node2vec algorithm; the implementation is described below.
(1) Calculate the topic similarity between the nodes of the academic network obtained in step three.
The topic similarity between node i and node j is calculated with the cosine similarity:
sim(P_i, P_j) = Σ_{t=1}^{T} p_it · p_jt / ( sqrt(Σ_{t=1}^{T} p_it^2) · sqrt(Σ_{t=1}^{T} p_jt^2) ),
where, in the graph G = (V, E, W), P_i = (p_i1, p_i2, ......, p_iT) is the topic probability distribution of node i and P_j = (p_j1, p_j2, ......, p_jT) is the topic probability distribution of node j.
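The cosine similarity above can be computed directly; a small sketch, with plain Python lists standing in for the topic distributions:

```python
import math

def topic_similarity(p_i, p_j):
    """Cosine similarity between two topic probability distributions."""
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm = math.sqrt(sum(a * a for a in p_i)) * math.sqrt(sum(b * b for b in p_j))
    return dot / norm if norm else 0.0   # guard against all-zero vectors
```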
(2) Generate the neighbor sequences of the nodes using topic-similarity-optimized random walks.
First, simulate random walks of fixed length L. The random walk transition probability from node v_{i-1} to the next node v_i is:
P(v_i = j | v_{i-1} = i) = π_ij / Z, if (i, j) ∈ E; 0, otherwise,
where π_ij is the unnormalized transition probability from node i to node j and Z is the normalizing constant, the sum of the transition probabilities over all candidate nodes. π_ij is calculated as:
π_ij = α_pq(i, j) · w_ij · sim(P_i, P_j),
where p and q are the two parameters controlling the random walk, w_ij is the weight of the edge between node i and node j, and α_pq(i, j) is defined, as in node2vec, by:
α_pq(i, j) = 1/p, if d_ij = 0; 1, if d_ij = 1; 1/q, if d_ij = 2,
where d_ij denotes the shortest-path distance between node i and node j.
This completes the calculation of the random walk transition probabilities. The set of probability values is passed to an Alias Method sampler to select the neighboring nodes, each node being selected with probability P(v_i = x | v_{i-1} = v). Walks of the set length L are performed, yielding a number of walk paths; these paths are the random walk sequences.
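The biased walk described above can be sketched as follows. This is a hedged illustration: plain weighted sampling (`random.choices`) stands in for the alias-table sampler, and the function names and the `adj`/`sim` data shapes are assumptions for the example.

```python
import random

def alpha_pq(d, p, q):
    """node2vec search bias, from the shortest-path distance d (0, 1 or 2)
    between the walk's previous node and the candidate node."""
    if d == 0:
        return 1.0 / p
    if d == 1:
        return 1.0
    return 1.0 / q

def walk(adj, sim, start, length, p=1.0, q=1.0, rng=random):
    """Topic-similarity-biased random walk of the given length.

    `adj[u]` maps neighbour -> edge weight; `sim(u, v)` is the topic similarity."""
    path = [start]
    prev = None
    for _ in range(length - 1):
        cur = path[-1]
        nbrs = list(adj[cur])
        if not nbrs:
            break
        weights = []
        for nxt in nbrs:
            if prev is None:
                d = 1                      # first step: no previous node
            elif nxt == prev:
                d = 0                      # returning to the previous node
            elif nxt in adj[prev]:
                d = 1                      # neighbour of the previous node
            else:
                d = 2
            weights.append(alpha_pq(d, p, q) * adj[cur][nxt] * sim(cur, nxt))
        prev, cur = cur, rng.choices(nbrs, weights=weights)[0]
        path.append(cur)
    return path
```

Running `walk` once per node, several times over, yields the random walk sequences fed to training in the next step.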
(3) Train on the random walk sequences with stochastic gradient descent to finally obtain the scholar vectors. Stochastic gradient descent is a conventional method in the art and is not detailed in this embodiment.
The algorithm pseudo-code is shown below.
(Pseudo-code shown as images in the original publication.)
Step five: cluster the scholar vectors based on a machine learning clustering method.
(1) Calculate node centrality.
Node centrality is measured with the PageRank algorithm; the larger the PageRank value, the higher the node's centrality:
P = βMP + (1 - β)e/n,
where β is the jump parameter, usually 0.8 or 0.9, M is the transition matrix of the network, e is an n-dimensional unit vector, and n is the number of nodes. βMP represents jumping to the next node with probability β during the random walk, and (1 - β)e/n represents a random jump with probability (1 - β). Experiments show that with continued iteration the PageRank values of all nodes converge to stable values. This centrality formula is also prior art and is not detailed in this embodiment.
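The power iteration for P = βMP + (1 - β)e/n can be sketched on a dictionary-based weighted graph. This is a minimal illustration (the function name and the `adj` shape are assumptions); a library routine such as NetworkX's `pagerank` would normally be used.

```python
def pagerank(adj, beta=0.85, iters=100):
    """Power iteration for PageRank on a weighted graph.

    `adj[u]` maps neighbour -> edge weight; each node spreads its score
    to its neighbours in proportion to the edge weights."""
    nodes = sorted(adj)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - beta) / n for u in nodes}   # random-jump term
        for u in nodes:
            total = sum(adj[u].values())
            if not total:                            # dangling node: no outflow
                continue
            for v, w in adj[u].items():
                nxt[v] += beta * pr[u] * w / total   # walk term beta*M*P
        pr = nxt
    return pr
```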
(2) Calculate node dispersion.
Node dispersion is measured with the minimum distance δ_i (i = 1, 2, ..., n): for each node, compute the distances to all nodes with higher centrality and take the minimum as δ_i:
δ_i = min_{j: PR_j > PR_i} d_ij,
where d_ij is the shortest-path distance between nodes i and j. If two nodes have the same centrality, they are ordered by node ID. In addition, since the node with the highest centrality is necessarily a cluster center, its minimum distance is defined as max(δ_i).
(3) Calculate the node F-statistic index CV(i).
(The CV formula is given as an image in the original publication; it combines the centrality and dispersion computed above.)
The top K nodes with the largest CV values are taken as the cluster centers.
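The center-selection procedure can be sketched as follows. Note the loud assumption: the patent's exact CV formula survives only as an image, so this sketch uses the product of centrality and dispersion (the common density-peaks-style choice) purely as a stand-in; the tie-breaking by node ID follows the description above.

```python
def select_centers(pr, dist, k):
    """Pick K cluster centers from centrality `pr` and distance function `dist`.

    ASSUMPTION: CV(i) = PR_i * delta_i is used here as a stand-in for the
    patent's image-only formula. Requires at least two nodes."""
    order = sorted(pr, key=lambda u: (-pr[u], u))    # by centrality, ties by ID
    delta = {}
    for rank, u in enumerate(order):
        higher = order[:rank]                        # nodes of higher centrality
        delta[u] = min(dist(u, v) for v in higher) if higher else None
    # the most central node is necessarily a center: give it max(delta)
    delta[order[0]] = max(v for v in delta.values() if v is not None)
    cv = {u: pr[u] * delta[u] for u in delta}
    return sorted(cv, key=lambda u: (-cv[u], u))[:k]
```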
Note that both the centrality and the dispersion are calculated on the graph G = (V, E, W).
(4) Clustering.
For each node, calculate the distance d_ic from the node to each cluster center, assign the node to the cluster center with the smallest d_ic, update each cluster center to the mean of its cluster, and repeat until the cluster-center means are stable. Finally, some nodes gather around each cluster center; these groups, usually called "clusters", are the academic teams we want.
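The assign-and-update loop just described can be sketched as a small K-means-style routine on the scholar vectors (function name and data shapes are assumptions for the example):

```python
def cluster(vectors, centers, iters=20):
    """Assign each node to the nearest center, recompute centers as cluster
    means, and repeat until the assignment is stable.

    `vectors` maps node -> list of floats; `centers` is a list of center vectors."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean

    assign = {}
    for _ in range(iters):
        new_assign = {u: min(range(len(centers)),
                             key=lambda c: dist2(v, centers[c]))
                      for u, v in vectors.items()}
        if new_assign == assign:          # stable: center means no longer move
            break
        assign = new_assign
        for c in range(len(centers)):
            members = [vectors[u] for u in assign if assign[u] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    clusters = {}
    for u, c in assign.items():
        clusters.setdefault(c, []).append(u)
    return clusters
```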
The algorithm pseudo code is as follows.
(Pseudo-code shown as images in the original publication.)
Step six: clusters meeting the preset threshold are output as an academic team.
The scale of an academic team is generally more than 3 people, so the size threshold is set to 3, and the clusters from step five whose node count is greater than or equal to the threshold 3 are output as academic teams.
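The final filtering step is a one-liner; a sketch with an assumed `clusters` shape (cluster label mapped to member list):

```python
def output_teams(clusters, threshold=3):
    """Keep only the clusters whose size meets the team-size threshold."""
    return {c: members for c, members in clusters.items() if len(members) >= threshold}
```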
Thus, the construction of an academic team is completed.
For example:
TABLE 1 results of division of the Dongda college of computers
(Table 1 is shown as an image in the original publication.)
To verify the advantages of the technical scheme in constructing academic teams with high community topic similarity, several existing methods are used below as experimental baselines for result analysis.
By adding topic similarity and optimizing the selection of cluster centers, optimized versions of the node2vec and K-means algorithms are obtained; the community discovery algorithm based on network representation learning proposed herein is denoted NK. This embodiment compares NK with the conventional community discovery algorithms LPA, CNM, and Louvain in three respects: community quality, community topic, and community distribution.
Data preparation
The publicly available academic network data sets, such as CA-HepPh and DBLP, do not contain paper text data and cannot be used, so the experiments are conducted on self-built academic network data sets. To obtain relatively comprehensive results, the experiments use four data sets of different scope: a college data set, a school data set, a subject data set, and a national data set; details are given in Table 2. These data come from real society, are real network data sets, and their ground-truth community division is unknown.
TABLE 2 academic network data set
(Table 2 is shown as an image in the original publication.)
Evaluation method
Community topic similarity mainly measures the cohesion of the research directions of the scholars in a community. The data sets do not label scholars' research directions, so an F value cannot be computed, and manual judgment is subject to personal cognitive bias. Therefore the cosine similarity formula is used to measure the topic similarity between scholars, and the community topic similarity is obtained as the average of the similarities between all pairs of scholars in the community.
Table 3 gives the community topic similarity of each algorithm. Community topic similarity is high in the college and subject networks and lower in the school and national networks, mainly because in the former the research fields of the scholar groups are closer. In every network, NK clearly achieves better community topic similarity than the other methods.
TABLE 3 Community topic similarity contrast
(Table 3 is shown as an image in the original publication.)
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the inventive concept, and all such changes and modifications fall within the protection scope of the invention.

Claims (8)

1. An academic team construction method based on network representation learning training, characterized in that it comprises the following steps:
step one: reading scholar and scientific research data from the database;
step two: training an author topic model to obtain the author topic probability distribution;
step three: constructing an initial academic network;
step four: training a network representation learning model to obtain scholar vectors;
step five: clustering the scholar vectors with a machine learning clustering method;
step six: outputting clusters that meet a preset threshold as academic teams.
2. The academic team construction method based on network representation learning training of claim 1, wherein step one comprises reading relevant data from a scholar database, the relevant data comprising:
scholar information, including ID, name, school, college;
paper data, including ID, title, author, abstract, issuing institution;
project data, including ID, title, participants;
patent data, including ID, title, inventor, applicant organization;
the ID being a primary key in the database.
3. The academic team construction method based on network representation learning training of claim 2, wherein step two comprises:
2-1) reading the abstract documents in the paper data;
2-2) establishing the author-to-document mapping table;
2-3) encoding the documents as bag-of-words models;
2-4) training the author topic model to obtain the author topic probability distribution t.
4. The academic team construction method based on network representation learning training of claim 3, wherein step three comprises:
3-1) establishing the nodes V in the academic network, where the data of each node comprises the scholar ID, name, school, college, and the author topic probability obtained in step two;
3-2) establishing the edges E in the academic network: extracting collaboration data from the authors of papers, participants of projects, and inventors of patents to obtain the edges;
3-3) constructing the initial academic network G = (V, E, W) from the node and edge data, where W denotes the edge weights, a weight being the total number of paper, project, and patent collaborations between the two scholars.
5. The academic team construction method based on network representation learning training of claim 4, wherein step four comprises:
4-1) calculating the topic similarity among the nodes of the academic network obtained in step three;
4-2) generating the neighbor sequences of the nodes using topic-similarity-optimized random walks, obtaining a plurality of random walk sequences;
4-3) training on the random walk sequences with stochastic gradient descent to obtain the scholar vectors.
6. The academic team construction method based on network representation learning training of claim 5, wherein the topic similarity sim(P_i, P_j) between node i and node j is calculated as:
sim(P_i, P_j) = Σ_{t=1}^{T} p_it · p_jt / ( sqrt(Σ_{t=1}^{T} p_it^2) · sqrt(Σ_{t=1}^{T} p_jt^2) ),
wherein:
P_i = (p_i1, p_i2, ......, p_iT) is the topic probability distribution of node i,
P_j = (p_j1, p_j2, ......, p_jT) is the topic probability distribution of node j.
7. The academic team construction method based on network representation learning training of claim 5, wherein obtaining the random walk sequences in step 4-2 comprises:
4-2-1) calculating the random walk transition probability P(v_i = j | v_{i-1} = i) from node v_{i-1} to the next node v_i, obtaining a set of random walk transition probability values, where:
P(v_i = j | v_{i-1} = i) = π_ij / Z, if (i, j) ∈ E; 0, otherwise,
where π_ij is the unnormalized transition probability from node i to node j and Z is the normalizing constant, the sum of the transition probabilities over all candidate nodes, and π_ij is calculated as:
π_ij = α_pq(i, j) · w_ij · sim(P_i, P_j),
where p and q are the two parameters controlling the random walk, w_ij is the weight of the edge between node i and node j, and α_pq(i, j) is:
α_pq(i, j) = 1/p, if d_ij = 0; 1, if d_ij = 1; 1/q, if d_ij = 2,
where d_ij denotes the shortest-path distance between node i and node j;
4-2-2) selecting neighbor nodes in the sampler using the set of random walk transition probability values, and walking with the set walk length L to obtain a plurality of walk paths as the random walk sequences.
8. The academic team construction method based on network representation learning training of claim 5, wherein step five comprises:
5-1) calculating node centrality:
P = βMP + (1 - β)e/n,
where β is the jump parameter, M is the transition matrix of the network, e is an n-dimensional unit vector, and n is the number of nodes; βMP represents jumping to the next node with probability β during the random walk, and (1 - β)e/n represents a random jump with probability (1 - β);
5-2) calculating node dispersion: the dispersion of a node is measured by the minimum distance δ_i (i = 1, 2, ..., n), i.e. the minimum of the shortest-path distances from the node to all nodes of higher centrality:
δ_i = min_{j: PR_j > PR_i} d_ij,
where d_ij is the shortest-path distance between nodes i and j;
5-3) calculating the node F-statistic index CV(i) (the formula is given as an image in the original publication), and taking the top K nodes with the largest CV values as the cluster centers;
5-4) clustering: calculating the distance d_ic from each node to each cluster center, assigning the node to the cluster center with the smallest d_ic, updating the cluster-center means, and repeating until the cluster-center means are stable; finally, the nodes gathered around each cluster center, called "clusters", are the academic teams finally output.
CN201910930765.8A 2019-09-29 2019-09-29 Academic team construction method based on network representation learning training Pending CN110717043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910930765.8A CN110717043A (en) 2019-09-29 2019-09-29 Academic team construction method based on network representation learning training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910930765.8A CN110717043A (en) 2019-09-29 2019-09-29 Academic team construction method based on network representation learning training

Publications (1)

Publication Number Publication Date
CN110717043A true CN110717043A (en) 2020-01-21

Family

ID=69212052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910930765.8A Pending CN110717043A (en) 2019-09-29 2019-09-29 Academic team construction method based on network representation learning training

Country Status (1)

Country Link
CN (1) CN110717043A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793501A (en) * 2014-01-20 2014-05-14 惠州学院 Theme community discovery method based on social network
CN106778894A (en) * 2016-12-29 2017-05-31 大连理工大学 A kind of method of author's cooperative relationship prediction in academic Heterogeneous Information network
CN109902203A (en) * 2019-01-25 2019-06-18 北京邮电大学 The network representation learning method and device of random walk based on side
WO2019153551A1 (en) * 2018-02-12 2019-08-15 平安科技(深圳)有限公司 Article classification method and apparatus, computer device and storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595713A (en) * 2018-05-14 2018-09-28 中国科学院计算机网络信息中心 The method and apparatus for determining object set
CN108595713B (en) * 2018-05-14 2020-09-29 中国科学院计算机网络信息中心 Method and device for determining object set
CN115630141A (en) * 2022-11-11 2023-01-20 杭州电子科技大学 Scientific and technological expert retrieval method based on community query and high-dimensional vector retrieval
CN115630141B (en) * 2022-11-11 2023-04-25 杭州电子科技大学 Scientific and technological expert retrieval method based on community query and high-dimensional vector retrieval

Similar Documents

Publication Publication Date Title
Zhou et al. Discovering temporal communities from social network documents
CN106991127B (en) Knowledge subject short text hierarchical classification method based on topological feature expansion
Gui et al. A community discovery algorithm based on boundary nodes and label propagation
Yang et al. Identifying influential spreaders in complex networks based on network embedding and node local centrality
CN107391542A (en) A kind of open source software community expert recommendation method based on document knowledge collection of illustrative plates
CN110674318A (en) Data recommendation method based on citation network community discovery
WO2021128158A1 (en) Method for disambiguating between authors with same name on basis of network representation and semantic representation
Xu et al. Finding overlapping community from social networks based on community forest model
CN110717043A (en) Academic team construction method based on network representation learning training
Sun et al. Overlapping community detection based on information dynamics
Shekhawat et al. A classification technique using associative classification
Cheng et al. Dynamic embedding on textual networks via a gaussian process
CN111339258B (en) University computer basic exercise recommendation method based on knowledge graph
Shao et al. Research on a new automatic generation algorithm of concept map based on text clustering and association rules mining
CN105205075B (en) From the name entity sets extended method of extension and recommended method is inquired based on collaboration
Wang et al. An improved clustering method for detection system of public security events based on genetic algorithm and semisupervised learning
Ma et al. Composing knowledge graph embeddings via word embeddings
Agarwal et al. WGSDMM+ GA: A genetic algorithm-based service clustering methodology assimilating dirichlet multinomial mixture model with word embedding
Han et al. A semantic community detection algorithm based on quantizing progress
Liu et al. An improved k-means clustering algorithm based on semantic model
Wu Data association rules mining method based on improved apriori algorithm
Qiao et al. Improving stochastic block models by incorporating power-law degree characteristic
Chen et al. Community detection based on deepwalk model in large-scale networks
Kubota et al. Assignment strategies for ground truths in the crowdsourcing of labeling tasks
Liu et al. Overlapping community detection method based on network representation learning and density peaks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200121