CN110737837A

CN110737837A - Scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform

Info

Publication number: CN110737837A
Application number: CN201910981032.7A
Authority: CN
Inventors: 张鹏程; 邵孟巧; 金惠颖; 张雅玲; 于佳男
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2020-01-31
Anticipated expiration: 2039-10-16
Also published as: CN110737837B

Abstract

The invention discloses a scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform. The method measures the association between a researcher and other researchers along three dimensions: the text similarity of the papers the researchers have published on ResearchGate, the social association between researchers, and each researcher's own influence. It then constructs a collaborator recommendation model by linear combination and performs Top-N recommendation, providing researchers with a personalized collaborator recommendation service.

Description

Scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform

Technical Field

The invention relates to a scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform, and belongs to the technical fields of software engineering recommendation systems, data mining and text mining.

Background

Finding researchers with the same or similar research interests with whom to hold useful exchanges is a difficult task that consumes a great deal of researchers' time. One of the main problems in achieving scientific collaboration is therefore identifying collaborators with similar research interests.

ResearchGate, currently the most popular social networking platform for researchers, was founded in 2008 and aims to promote scientific cooperation worldwide. Using ResearchGate effectively as a catalyst for contact between scientists can greatly promote scientific exchange and progress. It would therefore be meaningful if the platform could help researchers find other researchers with the same or similar research interests.

Disclosure of Invention

The method measures the association between a researcher and other researchers along three dimensions (the text similarity of the researchers' published papers, the social association degree between researchers, and each researcher's own influence) and constructs a collaborator recommendation model by linear combination, thereby providing researchers with a personalized collaborator recommendation service.

Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:

A scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform, comprising the following steps:

(1) collecting data on researchers' published papers, social associations and own influence on the ResearchGate platform, and preprocessing the data;

(2) calculating the text similarity of papers between researchers using the Doc2Vec deep text representation model;

(3) establishing a relation matrix among researchers according to the existing social network, the elements of which mark who follows whom, and calculating the social association degree between researchers from their proportion of common friends based on the relation matrix;

(4) summing and averaging the data related to a researcher's influence to calculate that researcher's own influence;

(5) combining the three dimensional features (paper text similarity, social association degree and own influence) to construct a linear combination recommendation model;

(6) performing recommendation with the recommendation model: calculating the comprehensive similarity score between each candidate collaborator and a given researcher, ranking the candidates, and generating a collaborator recommendation list for the given researcher.

Further, step (1) includes:

(11) collecting data on the papers published by researchers on the ResearchGate platform, including the paper title and abstract fields;

(12) collecting data on the social associations among researchers on the ResearchGate platform, including the following and follower fields;

(13) collecting data on researchers' own influence on the ResearchGate platform, including the interest value, citation count, recommendation count, read count and paper count fields;

(14) collecting data on researchers' self-description information on the ResearchGate platform, including the professional skill and subject fields;

(15) cleaning, denoising, deduplicating and standardizing the collected data.

Further, step (2) includes:

(21) reading the English corpus consisting of paper titles and abstracts and preprocessing it (normalizing letter case, checking for spelling errors and the like), with punctuation marks treated as invalid words;

(22) treating each paper's title and abstract as a paragraph, and representing each paragraph and each word in it as a vector using the Distributed Memory model (PV-DM) of the paragraph-vector model Doc2Vec;

(23) after the vector space is generated, using cosine similarity to calculate the cosine of the angle between two paragraph vectors, which represents the similarity of papers between researchers; the cosine similarity is calculated as:

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$

where A and B are the vector representations of two paragraphs. The closer the cosine value is to 1, the closer the angle is to 0 and the more similar the two vectors are; the closer the cosine value is to 0, the closer the angle is to 90 degrees and the less similar the two vectors are.
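The cosine similarity of step (23) can be illustrated with a short pure-Python sketch. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for illustration; the text does not say how degenerate vectors are handled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length paragraph vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Parallel vectors give a value of 1 and orthogonal vectors give 0, which matches the interpretation of the cosine value given above.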

Further, step (3) includes:

(31) establishing a relation matrix A among n researchers according to the existing social network, where A is an n × n matrix: if researcher u follows researcher v, then a_uv = 1; otherwise a_uv = 0;

(32) calculating the social association degree between researchers according to their proportion of common friends, where a higher proportion of common friends indicates that the two researchers are more similar; the calculation formula is as follows:

where out(u) is the set of friends that researcher u points to in the social network association graph, out(v) is the set of friends that researcher v points to, out(u) ∩ out(v) is the intersection of the two sets, and |out(u)| and |out(v)| are the numbers of friends followed in the sets out(u) and out(v) respectively.
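As a hedged illustration of step (32), the sketch below computes a common-friend proportion from the relation matrix. The normalization used here, |out(u) ∩ out(v)| divided by the square root of |out(u)| · |out(v)|, is an assumption: it is one common way to turn the quantities defined above into a proportion, and the patent's own formula may differ. The function names are hypothetical.

```python
import math
from typing import List, Set


def out_set(A: List[List[int]], u: int) -> Set[int]:
    """Indices of the researchers that researcher u follows (a_uv == 1)."""
    return {v for v, flag in enumerate(A[u]) if flag == 1}


def social_association(A: List[List[int]], u: int, v: int) -> float:
    """Common-friend proportion between researchers u and v.

    Assumed normalization: |out(u) & out(v)| / sqrt(|out(u)| * |out(v)|).
    """
    out_u, out_v = out_set(A, u), out_set(A, v)
    if not out_u or not out_v:
        return 0.0  # no followed researchers, so no common friends
    common = len(out_u & out_v)
    return common / math.sqrt(len(out_u) * len(out_v))
```

With this normalization the value lies between 0 (no common friends) and 1 (identical followed sets).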

Further, in step (4), the interest value, citation count, recommendation count, read count and paper count of a researcher are summed and averaged to obtain that researcher's own influence.

Further, step (5) includes:

(51) normalizing the text similarity, the social association degree and the influence separately, where normalization takes the ratio of each feature value to the maximum value of that feature, so that the features operate on the same order of magnitude and the weight parameters are easier to adjust;

(52) linearly combining the three normalized dimensions to calculate the similarity between researchers, with the specific formula:

score(u₁,u₂)＝α*nor(paper(u₁,u₂))+β*nor(social(u₁,u₂))+γ*nor(popular(u₂))

The formula integrates the three dimensions, where paper(u₁,u₂) denotes the text similarity between the papers of researchers u₁ and u₂, social(u₁,u₂) denotes the social association degree between u₁ and u₂, popular(u₂) denotes the own influence of researcher u₂, and α, β and γ are weight parameters;

(53) setting α, β and γ manually and then adjusting them based on the similarity of researchers' professional skills and subject fields, thereby optimizing the model and finally obtaining the similarity score between a candidate researcher u₂ and the given researcher u₁.
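Steps (51) and (52) can be sketched as follows. The helper names `normalize_by_max` and `combined_scores` are hypothetical, and the sketch assumes non-negative feature values held in parallel lists, one entry per candidate researcher u₂ for a fixed researcher u₁.

```python
from typing import List, Sequence


def normalize_by_max(values: Sequence[float]) -> List[float]:
    """nor(.): ratio of each feature value to the maximum value of that feature."""
    peak = max(values) if values else 0.0
    if peak <= 0.0:
        # Assumption: an all-zero feature contributes nothing to the score.
        return [0.0 for _ in values]
    return [v / peak for v in values]


def combined_scores(
    paper: Sequence[float],
    social: Sequence[float],
    popular: Sequence[float],
    alpha: float,
    beta: float,
    gamma: float,
) -> List[float]:
    """score = alpha*nor(paper) + beta*nor(social) + gamma*nor(popular), per candidate."""
    if not (len(paper) == len(social) == len(popular)):
        raise ValueError("feature lists must have the same length")
    nor_paper = normalize_by_max(paper)
    nor_social = normalize_by_max(social)
    nor_popular = normalize_by_max(popular)
    return [
        alpha * p + beta * s + gamma * q
        for p, s, q in zip(nor_paper, nor_social, nor_popular)
    ]
```

Dividing by the maximum keeps every normalized feature in the range 0 to 1, so a raw influence value in the thousands cannot swamp a cosine similarity below 1.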

Beneficial effects: the personalized scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform measures the association between a researcher and other researchers along three dimensions (the text similarity of the researchers' published papers, the social association degree between researchers, and each researcher's own influence) and constructs a collaborator recommendation model by linear combination. It thereby provides researchers with a personalized collaborator recommendation service and addresses the difficulty of finding scientific research collaborators.

Drawings

FIG. 1 is a schematic diagram of the personalized scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform according to the present invention;

FIG. 2 is a diagram of the Doc2Vec network architecture.

Detailed Description

The personalized scientific research collaborator recommendation method based on multi-dimensional features mainly comprises the following six steps. Step one: collecting data on researchers' published papers, social associations and own influence on the ResearchGate platform, and preprocessing the data. Step two: calculating the text similarity of papers between researchers using the Doc2Vec deep text representation model. Step three: establishing a relation matrix among researchers according to the existing social network, the elements of which mark who follows whom, and calculating the social association degree between researchers from their proportion of common friends based on the relation matrix. Step four: summing and averaging the data related to a researcher's influence to calculate that researcher's own influence. Step five: combining the three dimensional features (paper text similarity, social association degree and own influence) to construct a linear combination recommendation model. Step six: performing a recommendation test with the recommendation model, i.e. calculating the comprehensive similarity score between each candidate collaborator and a given researcher, ranking the candidates, and generating a collaborator recommendation list for the given researcher.

The specific implementation process of each step is described in detail as follows:

Step one: collecting data on researchers' published papers, social associations and own influence on the ResearchGate platform, and preprocessing the data.

The method specifically comprises the following steps:

Step 11, collecting data on the papers published by researchers on the ResearchGate platform, including the paper title and abstract fields, denoted Title and Abstract;

Step 12, collecting data on the social associations among researchers on the ResearchGate platform, including the following and follower fields, denoted Following and Followers;

Step 13, collecting data on researchers' own influence on the ResearchGate platform, including the interest value, citation count, recommendation count, read count and paper count fields, denoted InterestValue, CiteCount, RecomCount, ReadCount and ItemCount respectively;

Step 14, collecting data on researchers' self-description information on the ResearchGate platform, including the professional skill and subject fields, denoted Skills and Topics; these fields can be used to tune the parameters of the linear combination recommendation model;

Step 15, cleaning, denoising, deduplicating and standardizing the collected data.
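The cleaning and deduplication of Step 15 is not specified in detail. The sketch below shows one plausible reading for paper records carrying the Title and Abstract fields of Step 11: whitespace is collapsed, records missing either field are dropped as noise, and duplicates are detected by case-insensitive title. The function name `clean_papers` and all three rules are assumptions for illustration.

```python
import re
from typing import Dict, Iterable, List, Optional


def clean_papers(records: Iterable[Dict[str, Optional[str]]]) -> List[Dict[str, str]]:
    """Clean and deduplicate paper records with Title and Abstract fields."""
    seen_titles = set()
    cleaned: List[Dict[str, str]] = []
    for rec in records:
        # Standardize: collapse runs of whitespace and trim both ends.
        title = re.sub(r"\s+", " ", rec.get("Title") or "").strip()
        abstract = re.sub(r"\s+", " ", rec.get("Abstract") or "").strip()
        # Denoise (assumed rule): drop records missing a title or an abstract.
        if not title or not abstract:
            continue
        # Deduplicate (assumed rule): same title, ignoring letter case.
        key = title.lower()
        if key in seen_titles:
            continue
        seen_titles.add(key)
        cleaned.append({"Title": title, "Abstract": abstract})
    return cleaned
```

The first occurrence of each title is kept, so the order of the input records decides which duplicate survives.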

Step two: calculating the text similarity of papers between researchers using the Doc2Vec deep text representation model.

This step takes the similarity of two researchers' papers as the similarity between the researchers themselves: the more similar their papers, the more similar their research interests.

The method specifically comprises the following steps:

Step 21, reading the English corpus consisting of paper titles and abstracts and preprocessing it (normalizing letter case, checking for spelling errors and the like), with punctuation marks treated as invalid words;

Step 22, treating each paper's title and abstract as a paragraph, and representing each paragraph and each word in it as a vector using the Distributed Memory model (PV-DM) of the paragraph-vector model Doc2Vec;

The procedure is explained with reference to FIG. 2. A PV-DM model improved with hierarchical Softmax is used here, built as a three-layer neural network: an input layer, a projection layer (hidden layer) and an output layer. Assume a sample (Context(w), w), where Context(w) consists of the c words before and the c words after a center word w and serves as the input sample train_X, and the center word w serves as the output value train_Y.

First layer, input layer: the input of this layer is a randomly initialized K-dimensional paragraph vector V(Context(para)) and the 2c word vectors V(Context(w)_1), V(Context(w)_2), …, V(Context(w)_2c) of Context(w); all vectors have the same length;

K is set to 300; the larger K is, the higher the dimension of the space into which paragraphs and words are mapped, and the stronger the expressive power.

Choice of 2c: fixed-length contexts within the paragraph are generated with a sliding window; the larger 2c is, the stronger the expressive power, but the slower the convergence. The paragraph vector is shared across these contexts and can be regarded as the topic of the paragraph.

Second layer, projection layer: this layer sums the 2c + 1 vectors of the input layer and takes their average, giving the K-dimensional intermediate vector X_w that serves as the input to the hierarchical Softmax output layer;

Third layer, output layer: this layer is a Huffman tree in which the leaf nodes are the words of the vocabulary and the non-leaf nodes (the colored nodes) play the role of the hidden-to-output parameters W' of the original DNN model. The weight of a non-leaf node is denoted P_i and is a vector; the root node receives the output X_w of the projection layer.
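The input and projection layers described above can be sketched in a few lines. The function names are hypothetical, and skipping positions near the paragraph edges that lack a full window of c words on each side is an assumption; the text does not say how edges are treated.

```python
from typing import List, Sequence, Tuple


def context_windows(tokens: Sequence[str], c: int) -> List[Tuple[List[str], str]]:
    """Slide a window over a paragraph and emit (Context(w), w) samples.

    Context(w) is the c words before and the c words after the center word w.
    Assumption: positions without a full window on both sides are skipped.
    """
    samples = []
    for t in range(c, len(tokens) - c):
        context = list(tokens[t - c:t]) + list(tokens[t + 1:t + c + 1])
        samples.append((context, tokens[t]))
    return samples


def project(paragraph_vec: Sequence[float],
            word_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Projection layer: average the paragraph vector and the 2c word vectors."""
    vectors = [paragraph_vec, *word_vecs]
    k = len(paragraph_vec)
    if any(len(v) != k for v in vectors):
        raise ValueError("all vectors must have the same length K")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(k)]
```

Each sample produced by `context_windows` corresponds to one (train_X, train_Y) pair, and `project` yields the intermediate vector X_w passed to the output layer.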

PV-DM predicts the word within the window given the paragraph vector and the context word vectors. The specific training process is as follows:

1. Initializing a K-dimensional vector for each paragraph and each word in the paragraphs, and constructing a Huffman tree according to word frequency;

2. Training the paragraphs in sequence. Taking one paragraph as an example: the paragraph vector and the 2c word vectors in the context window of a center word w are fed into the model, and the projection layer sums and averages them to obtain the K-dimensional intermediate vector X_w. X_w is the input of the hierarchical Softmax output layer and enters at the root node of the Huffman tree; from there it travels along some path in the tree to a leaf node, which is the predicted current word;

3. Since w is known, the correct path from the root node to its leaf node is determined by the Huffman code of w, and so is the prediction that each classifier (non-leaf node) on the path should make. For example, if the code of w is "01101", then starting from the root node of the Huffman tree we want the hierarchical Softmax computation at the root to assign the intermediate vector to branch 0 with probability close to 1, the second-level node to assign it to branch 1 with probability close to 1, and so on until the leaf node is reached;

4. Continuing from step 3, the probabilities computed along the path are multiplied to obtain the probability P of w under the current network, leaving a residual of (1 - P). Stochastic gradient descent is then used to adjust the parameters of the non-leaf nodes on the path (the gradients are obtained by back-propagation) so that the actual path moves toward the correct path. After n iterations, once training has converged, the vector representation of each paragraph and each word in it is obtained; with the paragraph vectors in hand, the similarity between papers is calculated as the cosine similarity.
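The path probability of steps 3 and 4 can be made concrete with a small sketch. It treats every non-leaf node as a logistic classifier on X_w; the convention that the sigmoid output is the probability of branch 0 is an assumption (implementations differ on which branch it denotes), and the function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def path_probability(x_w: Sequence[float],
                     node_params: Sequence[Sequence[float]],
                     code: str) -> float:
    """Probability P of reaching the leaf whose Huffman code is `code`.

    node_params[i] is the parameter vector of the i-th non-leaf node on the
    path from the root. Assumed convention: sigmoid(theta . x_w) is the
    probability of taking branch 0 at that node.
    """
    if len(node_params) != len(code):
        raise ValueError("need one parameter vector per code bit")
    prob = 1.0
    for theta, bit in zip(node_params, code):
        p_zero = sigmoid(sum(t * x for t, x in zip(theta, x_w)))
        prob *= p_zero if bit == "0" else 1.0 - p_zero
    return prob
```

With untrained (all-zero) node parameters every branch has probability 0.5, so a five-bit code such as "01101" has P = 0.5^5; training moves P toward 1 for the observed word, shrinking the residual 1 - P.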

In every training step, back-propagation and stochastic gradient descent update the weights of the non-leaf nodes in the Huffman tree, i.e. the hidden-to-output weight parameters of the DNN model, and the paragraph vectors and word vectors are updated as well. With continued training, the PV-DM model parameters become increasingly accurate, and so do the paragraph and word vectors.

In each training step, PV-DM slides a window over a paragraph to take a small number of its words, and the paragraph vector is shared across all training steps drawn from the same paragraph; a paragraph therefore yields several training steps, and the input of each includes its paragraph vector.

The advantage of the PV-DM model improved with hierarchical Softmax is that it uses a Huffman tree, in which words with larger weights (i.e. more frequent words) sit at shallower leaves and receive shorter codes, so that more frequent words are reached at lower cost.
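To illustrate why frequent words get shorter codes, the sketch below builds Huffman codes from word frequencies with a priority queue. The function name `huffman_codes` and the assignment of "0" to the lighter subtree are assumptions; only the code lengths matter for the argument.

```python
import heapq
from itertools import count
from typing import Dict


def huffman_codes(freqs: Dict[str, int]) -> Dict[str, str]:
    """Build a Huffman code (word -> bit string) from word frequencies."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {word: "0" for word in freqs}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # lightest subtree
        f2, _, right = heapq.heappop(heap)  # second lightest subtree
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]
```

For frequencies such as {"the": 50, "model": 20, "vector": 10, "huffman": 5}, the most frequent word ends up one step from the root while the rarest words sit three levels deep.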

Given a training word sequence w₁, w₂, w₃, …, w_T, the objective is to maximize the average log-likelihood:

$$\frac{1}{T}\sum_{t=c}^{T-c}\log p\left(w_t \mid w_{t-c},\ldots,w_{t+c}\right)$$

where p is the probability of correctly predicting the center word w_t, T is the length of the training word sequence, and c is the size of the context window, i.e. the number of context words taken on each side of the center word w_t.

Step 23, after the vector space is generated, using cosine similarity to calculate the cosine of the angle between two paragraph vectors A and B, which represents the similarity of papers between researchers; the cosine similarity is calculated as:

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$

Step three: establishing a relation matrix among researchers according to the existing social network, the elements of which mark who follows whom, and calculating the social association degree between researchers from their proportion of common friends based on the relation matrix.

This step starts from the existing social network among researchers and recommends new collaborators on that basis, on the assumption that two users with a higher proportion of common friends are more similar.

The method specifically comprises the following steps:

Step 31, establishing a relation matrix A among n researchers according to the existing social network, where A is an n × n matrix: if researcher u follows researcher v, then a_uv = 1; otherwise a_uv = 0;

For example, the friend list of researcher u in matrix A is the row vector u_a = (a_u1, a_u2, a_u3, …, a_un).

Step 32, calculating the social association degree between researchers according to their proportion of common friends, i.e. the size of out(u) ∩ out(v) relative to |out(u)| and |out(v)|, where out(u) and out(v) are the sets of researchers followed by u and v; the higher the proportion of common friends, the more similar the two researchers are considered to be.
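Both steps operate on the relation matrix A, so it may help to see how A could be assembled from the collected Following data. This is a minimal sketch: the function name `build_relation_matrix`, the dictionary layout of the input, and the decision to ignore self-follows and unknown names are all assumptions.

```python
from typing import Dict, List, Sequence


def build_relation_matrix(researchers: Sequence[str],
                          following: Dict[str, Sequence[str]]) -> List[List[int]]:
    """Build the n x n relation matrix A with a_uv = 1 if u follows v, else 0.

    `following` maps a researcher to the researchers in their Following list.
    Assumption: names outside `researchers` and self-follows are ignored.
    """
    index = {name: i for i, name in enumerate(researchers)}
    n = len(researchers)
    A = [[0] * n for _ in range(n)]
    for u, followed in following.items():
        if u not in index:
            continue
        for v in followed:
            if v in index and v != u:
                A[index[u]][index[v]] = 1
    return A
```

Row `A[index[u]]` is then the friend list u_a of researcher u, and the positions holding 1 in that row form the set out(u).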

Step four: summing and averaging the data related to a researcher's influence to calculate that researcher's own influence.

It is generally considered that recommending more influential collaborators to a given researcher is better received.

The method specifically comprises the following steps: the interest value, citation count, recommendation count, read count and paper count of a researcher are summed and averaged to obtain that researcher's own influence.

The specific calculation formula is as follows:

$$\mathrm{popular}(u) = \frac{\mathrm{InterestValue} + \mathrm{CiteCount} + \mathrm{RecomCount} + \mathrm{ReadCount} + \mathrm{ItemCount}}{5}$$

where InterestValue is the interest value, CiteCount is the citation count, RecomCount is the recommendation count, ReadCount is the read count, and ItemCount is the paper count.
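The averaging of step four is simple enough to show directly. The `InfluenceStats` container and its snake_case attribute names are hypothetical stand-ins for the five influence fields named in the text; the computation itself is the plain arithmetic mean described above.

```python
from dataclasses import dataclass


@dataclass
class InfluenceStats:
    """Hypothetical container for the five influence fields of one researcher."""
    interest_value: float  # interest value
    cite_count: float      # citation count
    recom_count: float     # recommendation count
    read_count: float      # read count
    item_count: float      # paper count


def popular(stats: InfluenceStats) -> float:
    """Own influence: the five fields summed and averaged."""
    total = (stats.interest_value + stats.cite_count + stats.recom_count
             + stats.read_count + stats.item_count)
    return total / 5
```

Because the five fields are on very different scales, read counts will usually dominate this plain average; the max-based normalization in step five rescales only the resulting influence value, not its individual components.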

Step five: combining the three dimensional features (paper text similarity, social association degree and own influence) to construct a linear combination recommendation model.

The method specifically comprises the following steps:

Step 51, in order to control the influence of the three features (text similarity, social association degree and influence) on the final result, each feature is normalized separately, where normalization takes the ratio of each feature value to the maximum value of that feature, so that the features operate on the same order of magnitude and the parameters are easier to adjust;

Step 52, linearly combining the three normalized dimensions to provide a method for calculating the similarity between researchers, where the final score is expressed by the following formula:

score(u₁,u₂)＝α*nor(paper(u₁,u₂))+β*nor(social(u₁,u₂))+γ*nor(popular(u₂))

The formula integrates the three dimensions, where paper(u₁,u₂) denotes the text similarity between the papers of researchers u₁ and u₂, social(u₁,u₂) denotes the social association degree between u₁ and u₂, popular(u₂) denotes the own influence of researcher u₂, and α, β and γ are weight parameters;

Step 53, setting α, β and γ manually and then adjusting them through result feedback, thereby optimizing the model and finally obtaining the similarity score between a candidate researcher u₂ and the given researcher u₁.

Specifically, the parameters are adjusted from result feedback as follows: on a test data set, the similarity between two researchers is first calculated from the Skills (professional skills) and Topics (subject) fields of their self-description data and is taken as the real similarity; the parameters α, β and γ are then adjusted so that the similarity calculated by the linear combination recommendation model comes closer to this real similarity.
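The text does not say how the Skills/Topics similarity is computed or how the weights are searched. The sketch below is therefore only one possible reading: it assumes Jaccard overlap of the combined Skills and Topics sets as the reference similarity, weights constrained to be non-negative and to sum to 1, and a coarse grid search minimizing mean squared error. All function names are hypothetical.

```python
from itertools import product
from typing import Iterable, List, Sequence, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Assumed reference similarity: overlap of two Skills/Topics sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def tune_weights(features: Sequence[Tuple[float, float, float]],
                 targets: Sequence[float],
                 steps: int = 10) -> Tuple[float, float, float]:
    """Grid-search (alpha, beta, gamma) summing to 1 that best fit the targets.

    features[i] holds the normalized (paper, social, popular) values of pair i;
    targets[i] is the reference similarity of the same pair.
    """
    if not targets or len(features) != len(targets):
        raise ValueError("features and targets must be non-empty and aligned")
    best_err, best_weights = float("inf"), (1.0, 0.0, 0.0)
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        alpha, beta, gamma = i / steps, j / steps, (steps - i - j) / steps
        err = sum((alpha * p + beta * s + gamma * q - t) ** 2
                  for (p, s, q), t in zip(features, targets)) / len(targets)
        if err < best_err:
            best_err, best_weights = err, (alpha, beta, gamma)
    return best_weights
```

A finer grid or a least-squares fit would serve the same purpose; the grid is used here only because it keeps the manual "set, then adjust" procedure of Step 53 easy to follow.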

Step six: performing a recommendation test with the recommendation model: calculating the comprehensive similarity score between each candidate collaborator and the given researcher, ranking the candidates, generating a collaborator recommendation list for the given researcher, and returning the Top-N ranked collaborators to that researcher.
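The ranking of step six can be sketched as a sort over the comprehensive scores. The function name `top_n`, the `exclude` parameter (for example, to leave out researchers the given researcher already follows) and the alphabetical tie-break are assumptions added for illustration.

```python
from typing import Dict, Iterable, List


def top_n(scores: Dict[str, float], n: int,
          exclude: Iterable[str] = ()) -> List[str]:
    """Return the n candidates with the highest comprehensive score.

    `scores` maps each candidate u2 to score(u1, u2) for a fixed researcher u1.
    Ties are broken alphabetically so the result is deterministic.
    """
    skipped = set(exclude)
    ranked = sorted(
        ((name, s) for name, s in scores.items() if name not in skipped),
        key=lambda item: (-item[1], item[0]),
    )
    return [name for name, _ in ranked[:n]]
```

For example, `top_n({"a": 0.6, "b": 0.85, "c": 0.4}, 2)` returns `["b", "a"]`.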

Claims

1. A scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform, characterized by comprising the following steps:

2. The scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform as claimed in claim 1, wherein step (1) comprises:

3. The scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform as claimed in claim 1, wherein step (2) comprises:

4. The scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform as claimed in claim 1, wherein step (3) comprises:

(31) establishing a relation matrix A among n researchers according to the existing social network, A being an n × n matrix in which a_uv = 1 if researcher u follows researcher v, and a_uv = 0 otherwise;

5. The scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform as claimed in claim 1, wherein in step (4) the interest value, citation count, recommendation count, read count and paper count of a researcher are summed and averaged to obtain that researcher's own influence.

6. The scientific research collaborator recommendation method based on multi-dimensional features on the ResearchGate platform as claimed in claim 1, wherein step (5) comprises:

score(u₁,u₂)＝α*nor(paper(u₁,u₂))+β*nor(social(u₁,u₂))+γ*nor(popular(u₂))

The formula integrates the three dimensions, where paper(u₁,u₂) denotes the text similarity between the papers of researchers u₁ and u₂, social(u₁,u₂) denotes the social association degree between u₁ and u₂, popular(u₂) denotes the own influence of researcher u₂, and α, β and γ are weight parameters;