CN109558494A

CN109558494A - A scholar name disambiguation method based on heterogeneous network embedding

Info

Publication number: CN109558494A
Application number: CN201811267181.9A
Authority: CN
Inventors: 杜; 杜一; 乔子越; 周园春
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2018-10-29
Filing date: 2018-10-29
Publication date: 2019-04-02

Abstract

The invention discloses a scholar name disambiguation method based on heterogeneous network embedding. The steps are: 1) specify multiple author names to be disambiguated, collect all papers related to those names, and build a paper-relationship heterogeneous network from the authors and semantic information of the collected papers; 2) on this network, use a meta-path-based random walk strategy to generate paths that capture the text information of each paper node's neighbor nodes, and save these paths as a training corpus; 3) train a Skip-gram model on the training corpus to produce a representation vector for each paper; 4) for each author name specified in step 1), retrieve the representation vectors of that name's papers from the vectors obtained; 5) cluster the vectors obtained in step 4) into several clusters, thereby disambiguating the author name.

Description

A scholar name disambiguation method based on heterogeneous network embedding

Technical field

The present invention relates to the fields of big data, knowledge graphs, entity disambiguation, and heterogeneous network embedding. Specifically, it is an unsupervised technique for scholar name disambiguation that learns representation vectors for the nodes of a heterogeneous network through meta-path-based random walks.

Background art

When building a knowledge base of scientific literature, the problem of author name disambiguation arises frequently. Among the vast number of documents in a knowledge base, many authors share the same name, which reduces the accuracy of name retrieval, relationship mining between people, and person-similarity analysis. For example, a search for an author name returns the papers written by all authors with that name. To solve this problem, clustering is generally used to assign the retrieved papers to different author entities. Clustering can use information such as the papers' co-author relationships, the names of the journals in which they were published, and title similarity as paper features, so that partitioning the papers separates the different authors who share a name. The question is how to make good use of this feature information.

Researchers have proposed many solutions to the name disambiguation problem. The most common approach builds a representation vector for each paper from its feature information and distinguishes papers by the distribution of these vectors. Going further, a paper network can be constructed and its structural information used to project the paper feature vectors into a latent space with stronger representational power, so that in the new vector space highly similar papers lie closer together, while dissimilar or unrelated papers lie farther apart.

Summary of the invention

To address the shortcomings of existing author name disambiguation methods for scientific literature knowledge bases, the present invention provides an author-name entity disambiguation method based on network embedding learned through meta-path random walks on a heterogeneous paper network. The method uses each paper's authors and publication venue together with text information such as title, keywords, and abstract. It builds a heterogeneous network to establish the structural relationships between papers, learns a representation vector for each paper through network embedding of the heterogeneous network, and clusters the papers by these vectors to disambiguate academic author names.

The present invention specifically includes the following steps:

Step 1: Collect from the paper library all papers related to the author names to be disambiguated, and build a paper-relationship heterogeneous network from the papers' authors, the names of the journals in which they were published, and their semantic information (including title, keywords, and abstract).

Step 2: On the paper-relationship heterogeneous network generated in Step 1, use a meta-path-based random walk strategy to generate paths that capture the text information of each paper node's neighbor nodes, and save these paths as the training corpus for the Skip-gram model in the next step.

Step 3: Using the corpus of random walk paths generated in Step 2, learn paper representation vectors with the Skip-gram model.

Step 4: For each author name to be disambiguated in Step 1, collect the representation vectors of its papers and, given a number of clusters K, cluster these vectors using agglomerative hierarchical clustering. Each resulting cluster represents the set of papers written by one of the distinct authors who share the name, which achieves the disambiguation of the author name.

Compared with prior related techniques, the main advantages and contributions of the scholar name disambiguation method based on heterogeneous network embedding are:

1. A method based on heterogeneous network representation learning is proposed. A paper-relationship heterogeneous network is constructed, a meta-path-based random walk strategy generates paths that capture the text information of paper nodes' neighbor nodes, and from the training corpus formed by these paths a Skip-gram model efficiently learns, for each paper node, a vector representation in a low-dimensional latent space. As a result, papers with more common authors, the same journal, and higher title similarity lie closer together in the space, while papers that do not satisfy these conditions lie farther apart.

2. By constructing the heterogeneous relation network of papers and applying meta-path-based random walks and the Skip-gram model, the method preserves both the structural information of the paper network and the semantic information of paper attributes (title, abstract, keywords, and so on). Compared with earlier algorithms, it additionally exploits the similarity between text fields such as titles, abstracts, and keywords, which improves the representativeness of the paper representation vectors.

3. Tests on a benchmark dataset show that, while maintaining a high computation speed, the method improves clustering performance by 20% to 40% relative to most compared methods.

Brief description of the drawings

Fig. 1 is a flow diagram of the method of the present invention;

Fig. 2 is a schematic diagram of the heterogeneous network;

Fig. 3 is a schematic diagram of a meta-path;

Fig. 4 is a schematic diagram of path generation.

Detailed description of the embodiments

The present invention is further explained below with reference to the drawings and an embodiment.

The present invention performs scholar name disambiguation with an unsupervised method that learns representation vectors for heterogeneous network nodes through meta-path random walks. In the following embodiment, a benchmark name disambiguation paper database is chosen as the paper library, and the invention is further described with reference to the drawings.

Step 1: Collect from the paper library all papers related to the author names to be disambiguated, and build a paper-relationship heterogeneous network from the papers' authors, the names of the journals in which they were published, and information such as title, keywords, and abstract.

Each paper is a node in the heterogeneous network. If two papers have a common author, an edge with the relation name CoAuthor is built between them. The edge carries an attribute giving the number of common authors: with 1 common author the attribute is 1, with 2 common authors it is 2, and so on.

If two papers come from the same journal, an edge with the relation name CoVenue is built between them. Because a paper normally belongs to only one journal, the attribute value of this relation is always 1.

If the titles, keywords, and abstracts of two papers share a word that is not a stop word, a CoWord edge is built between them. This edge also carries a count attribute: with one co-occurring word the attribute value is 1, with two co-occurring words it is 2, and so on.

This yields a paper heterogeneous network with one node type and three relation types, two of which carry attributes. A schematic of the network is shown in Fig. 2.
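The edge-building rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `build_paper_network`, the nested-dictionary layout, the whitespace tokenization, the tiny stop-word list, and the sample papers are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical stop-word list; the description does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to", "with"}


def build_paper_network(papers):
    """Return {relation: {paper_id: {neighbor_id: attribute_value}}}."""
    network = {rel: defaultdict(dict) for rel in ("CoAuthor", "CoVenue", "CoWord")}
    for a, b in combinations(papers, 2):
        # CoAuthor attribute: number of common authors.
        shared_authors = len(set(a["authors"]) & set(b["authors"]))
        # CoVenue attribute: always 1 when both papers share a known venue.
        same_venue = 1 if a["venue"] and a["venue"] == b["venue"] else 0
        # CoWord attribute: number of shared non-stop words.
        words_a = set(a["text"].lower().split()) - STOP_WORDS
        words_b = set(b["text"].lower().split()) - STOP_WORDS
        shared_words = len(words_a & words_b)
        for rel, value in (("CoAuthor", shared_authors),
                           ("CoVenue", same_venue),
                           ("CoWord", shared_words)):
            if value > 0:  # build an edge only when the relation holds
                network[rel][a["id"]][b["id"]] = value
                network[rel][b["id"]][a["id"]] = value
    return network


# Made-up sample data; "text" stands in for title + keywords + abstract.
papers = [
    {"id": "p1", "authors": ["Author X", "Author Y"], "venue": "Venue A",
     "text": "name disambiguation in digital libraries"},
    {"id": "p2", "authors": ["Author X", "Author Y"], "venue": "Venue A",
     "text": "graph embedding for name disambiguation"},
    {"id": "p3", "authors": ["Author X"], "venue": "Venue B",
     "text": "protein folding simulation"},
]
network = build_paper_network(papers)
```

In this sample, p1 and p2 share two authors, one venue, and two non-stop words, so all three edge types connect them, while p3 is linked to the others only through CoAuthor edges with attribute 1. Comparing every pair of papers is quadratic in the number of papers; an inverted index per author, venue, and word would be a natural refinement for a large library.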

In this step, besides CoAuthor (common author), CoWord (common keyword), and CoVenue (common journal or conference), relations can also be built from other publication information, such as citation relationships between papers, common author countries, or shared subject terms obtained after subject classification of the full text. That is, several relations and their corresponding relation attributes are defined first; if one of the defined relations holds between two papers, an edge is built between the two corresponding nodes, named according to the relation, with its attribute value set according to the relation's attribute.

Step 2: On the paper heterogeneous network generated in Step 1, select an arbitrary node as the start node and perform a random walk along the edges.

The random walk is guided by a meta-path. A meta-path contains edges with several different relation names and fixes the order in which they appear. For example, the walk may follow the meta-path p1-CoAuthor-p2-CoWord-p3-CoVenue-p4. (The randomness lies in the choice of node: on reaching a given relation, the walk randomly picks one of the nodes connected to the current node through that relation.) At each step, according to the edge type prescribed by the current position in the meta-path, a random selection rule picks a node connected to the current node by an edge of that type as the next node. That is, a paper node is first chosen at random as the starting path point; the selection rule then picks a node connected to it by a CoAuthor edge as the next path point, then a node connected to that path point by a CoWord edge, and finally a node connected to that path point by a CoVenue edge. This constitutes one walk sequence along the meta-path.

The last node of this sequence then serves as the start node of a new meta-path sequence generated by the same steps. After N such iterations a long path is produced, in which each path node stores the identifier (id) of a paper. M such long paths are generated iteratively; each time a long path is generated, a node of the network is selected in order as its start node. Each long path is stored on its own line, with the path node ids separated by a delimiter (such as a space or tab), producing the training corpus.
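The corpus generation just described can be sketched as follows. This is an illustrative sketch only: the names `long_walk` and `build_corpus`, the toy network, and the parameter values are assumptions, neighbors are chosen uniformly at random (edge attributes are ignored here), and the walk simply stops if the current node has no edge of the required type. Start nodes are taken in order and cycled, which is one reading of "selected in order".

```python
import random

# Edge-type order taken from the example meta-path in the description.
META_PATH = ["CoAuthor", "CoWord", "CoVenue"]


def long_walk(network, start, n_rounds, rng):
    """Follow META_PATH n_rounds times; each round resumes from the last node."""
    path = [start]
    current = start
    for _ in range(n_rounds):
        for relation in META_PATH:
            neighbors = list(network.get(relation, {}).get(current, {}))
            if not neighbors:
                return path  # simplification for this sketch: stop at a dead end
            current = rng.choice(neighbors)  # uniform choice among neighbors
            path.append(current)
    return path


def build_corpus(network, nodes, n_rounds, m_paths, seed=0):
    """Generate m_paths long paths, one space-separated line of paper ids each."""
    rng = random.Random(seed)
    lines = []
    for i in range(m_paths):
        start = nodes[i % len(nodes)]  # start nodes taken in order, cycling
        lines.append(" ".join(long_walk(network, start, n_rounds, rng)))
    return lines


# Toy network (hypothetical): three papers connected by every relation type.
nodes = ["p1", "p2", "p3"]
full = {a: {b: 1 for b in nodes if b != a} for a in nodes}
network = {"CoAuthor": full, "CoWord": full, "CoVenue": full}

corpus = build_corpus(network, nodes, n_rounds=2, m_paths=6)
```

Each element of `corpus` is one line of the training corpus; with N = 2 rounds over a three-edge meta-path, every line holds 1 + 3 × 2 = 7 paper ids. Writing the lines to a text file, one per line, gives the on-disk format the description mentions.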

A schematic of the meta-path is shown in Fig. 3.

In addition, during the meta-path-guided random walk, when the walk is at some current node and moves along the edge type prescribed by the meta-path, the attribute of the relation is taken into account. The attribute acts as the edge weight: the larger the weight, the closer the relationship between the two nodes, so the larger an edge's attribute value, the higher the probability that the walk jumps along that edge. For example, in Fig. 2, if p1 is the current node and the relation for the next hop is CoAuthor, the two nodes that have this relation with p1 are p4 and p2. According to the attribute values of these relations, the probability of walking from p1 to p4 is 1/3, and the probability of walking to p2 is 2/3.
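The weighted jump can be illustrated with a short sketch. The function names and the dictionary layout are hypothetical, and the CoAuthor attribute values 1 (p1 to p4) and 2 (p1 to p2) are assumed here because they reproduce the probabilities stated in the example; Fig. 2 itself is not reproduced in this text.

```python
import random
from fractions import Fraction


def transition_probabilities(network, node, relation):
    """Probability of jumping to each neighbor, proportional to the edge attribute."""
    neighbors = network[relation][node]
    total = sum(neighbors.values())
    return {nbr: Fraction(weight, total) for nbr, weight in neighbors.items()}


def weighted_step(network, node, relation, rng):
    """Sample the next node along `relation`, weighting by the edge attribute."""
    neighbors = network[relation][node]
    return rng.choices(list(neighbors), weights=list(neighbors.values()), k=1)[0]


# Assumed attribute values chosen to match the 1/3 and 2/3 example.
network = {"CoAuthor": {"p1": {"p4": 1, "p2": 2}}}

probs = transition_probabilities(network, "p1", "CoAuthor")
# probs["p4"] is 1/3 and probs["p2"] is 2/3, as in the example.
next_node = weighted_step(network, "p1", "CoAuthor", random.Random(0))
```

Normalizing the attribute values of the candidate edges is the simplest rule consistent with the example; the description does not rule out other monotone weightings.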

In some cases a paper lacks a certain relation. For example, if none of the words in a paper's title appears in the title of any other paper, that paper has no CoWord relation. When this happens a more flexible strategy is used: the walk proceeds according to the relation that follows the missing relation in the meta-path. For the paper just described, the walk continues along its CoVenue relation.
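A minimal sketch of this fallback is shown below. The function name `step_with_fallback`, the return convention, and the toy network are hypothetical. Wrapping around to the start of the meta-path when the last relation is also missing, and returning `None` for a node with no edges at all, are assumptions that the description does not spell out.

```python
import random

META_PATH = ["CoAuthor", "CoWord", "CoVenue"]


def step_with_fallback(network, node, position, rng):
    """Take one step from `node`, starting at META_PATH[position].

    If the node has no edge of the required type, try the following relation
    in the meta-path instead. Returns (next_node, next_position), or
    (None, position) when the node has no edge of any type.
    """
    for offset in range(len(META_PATH)):
        idx = (position + offset) % len(META_PATH)  # assumed wrap-around
        neighbors = network.get(META_PATH[idx], {}).get(node, {})
        if neighbors:
            nxt = rng.choices(list(neighbors),
                              weights=list(neighbors.values()), k=1)[0]
            return nxt, (idx + 1) % len(META_PATH)
    return None, position


# Toy network: p2 has no CoWord edge but does have a CoVenue edge to p3.
network = {
    "CoAuthor": {"p1": {"p2": 1}, "p2": {"p1": 1}},
    "CoWord": {},
    "CoVenue": {"p2": {"p3": 1}, "p3": {"p2": 1}},
}

# The meta-path asks for CoWord (position 1); the walk falls back to CoVenue.
result = step_with_fallback(network, "p2", 1, random.Random(0))
# result is ("p3", 0): the walk moved to p3 and the meta-path restarts.
```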

A schematic of path generation is shown in Fig. 4.

This walk strategy is not fixed; a new walk strategy can be designed by redesigning the meta-path. For example, in a heterogeneous network of the type above, the meta-path can be designed as p1-CoAuthor-p2-CoVenue-p3-CoWord-p4. Such a design generates new random walk paths, which in turn form a new corpus.

The design of the heterogeneous network can likewise vary. For example, when the paper library includes citation information, a new type of edge can be built in the heterogeneous network above, giving a heterogeneous network with one node type and four relation types; a random walk path corpus for this network can then be generated by designing a new meta-path. Similarly, when the papers in the paper library lack a certain kind of feature information, the relation based on that feature can simply be left out.

Step 3: Using the corpus of random walk paths generated in Step 2, learn paper vectors with the Skip-gram model.

The corpus of random walk paths generated in Step 2 is used to train a Skip-gram model, specifically using either the word2vec implementation in the gensim library for Python or Google's open-source C-language word2vec tool.

The Skip-gram model treats each node id as a word and the sequence of connected nodes in a path as the word's context. Training produces a vector for each node id, and thus a representation vector for the corresponding paper.
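To make the "node id as word, path as context" idea concrete, the sketch below lists the (center, context) pairs that a Skip-gram model would be trained to predict from one corpus line. It only illustrates how training examples arise; it does not train any vectors. The function name and the window size of 1 are assumptions. In practice the corpus would be handed to a word2vec implementation in Skip-gram mode (in gensim, `Word2Vec` with `sg=1`).

```python
def skipgram_pairs(walk, window):
    """List (center, context) pairs for node ids within `window` positions."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs


# One line of the training corpus: paper ids separated by spaces.
line = "p1 p2 p3 p4"
pairs = skipgram_pairs(line.split(), window=1)
# pairs == [("p1", "p2"), ("p2", "p1"), ("p2", "p3"),
#           ("p3", "p2"), ("p3", "p4"), ("p4", "p3")]
```

Papers that often appear near each other in walks share many such pairs, which is why their learned vectors end up close together.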

Step 4: For each author name to be disambiguated, collect the representation vectors, learned in Steps 1 to 3, of all its papers in the existing database and, given a number of clusters K, cluster these vectors using agglomerative hierarchical clustering. Each resulting cluster represents the set of papers written by a different author, which achieves the disambiguation of the author name.
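A small pure-Python sketch of agglomerative hierarchical clustering with a given K follows. The description does not fix the linkage criterion or the distance measure, so average linkage and cosine distance are assumptions here, as are the function names and the two-dimensional toy vectors. A library implementation would normally be used on real data.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative_clusters(vectors, k):
    """Start with one cluster per paper and merge the closest pair until k remain."""
    clusters = [[pid] for pid in vectors]

    def linkage(c1, c2):
        # Average linkage (assumed): mean pairwise distance between members.
        total = sum(cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        candidates = ((i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        i, j = min(candidates, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy paper vectors for one ambiguous name (made-up values).
vectors = {
    "p1": [1.0, 0.1], "p2": [0.9, 0.2],
    "p3": [0.1, 1.0], "p4": [0.2, 0.9],
}
clusters = agglomerative_clusters(vectors, k=2)
# p1 and p2 end up in one cluster, p3 and p4 in the other.
```

Each returned cluster is read as the paper set of one distinct person who bears the ambiguous name.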

The experiment uses the paper datasets from two publications (Jie Tang, A.C.M. Fong, Bo Wang, and Jing Zhang. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, Volume 24, Issue 6, 2012, Pages 975-987; and Xuezhi Wang, Jie Tang, Hong Cheng, and Philip S. Yu. ADANA: Active Name Disambiguation. In Proceedings of the 2011 IEEE International Conference on Data Mining, pp. 794-803). The data contain 100 author names to be disambiguated and 7447 papers in total. Paper titles and author information are complete; 4% of the papers lack a journal name.

First, all papers in the dataset are built into a single heterogeneous network, and the method above learns an embedding vector for each paper. Then, for each author name to be disambiguated, the papers associated with that name are clustered together, assuming the number of classes K is known.

Clustering uses HAC (agglomerative hierarchical clustering) or K-Means. The clustering results are evaluated with the Pairwise Precision, Pairwise Recall, and Pairwise F1 metrics, and the scores are averaged. Alternatively, the number of clusters K need not be specified in advance, in which case a clustering algorithm such as DBSCAN is used.
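The pairwise metrics can be computed as sketched below, assuming the usual definition in the name disambiguation literature: a pair of papers counts as predicted-positive when both fall in the same cluster and as true-positive when they also share the same real author. The function name and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_scores(predicted, truth):
    """predicted / truth map paper id -> cluster label / true author label."""
    ids = sorted(predicted)
    same_pred = {(a, b) for a, b in combinations(ids, 2) if predicted[a] == predicted[b]}
    same_true = {(a, b) for a, b in combinations(ids, 2) if truth[a] == truth[b]}
    correct = len(same_pred & same_true)
    precision = correct / len(same_pred) if same_pred else 0.0
    recall = correct / len(same_true) if same_true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example for one ambiguous name with two real authors, A and B.
predicted = {"p1": 0, "p2": 0, "p3": 0, "p4": 1}
truth = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
precision, recall, f1 = pairwise_scores(predicted, truth)
# Three predicted pairs, two true pairs, one in common:
# precision = 1/3, recall = 1/2, F1 = 0.4
```

Computing these scores once per ambiguous name and then averaging them matches the averaging step mentioned above.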

The baseline methods used are LINE, DNGR, and metapath2vec. All three are network embedding methods: a paper network is constructed and the representation vectors of papers are learned with the corresponding embedding method. "LINE with title similarity" means that, when the homogeneous paper network is built, the weight of the edge connecting two papers is increased if their titles have a certain similarity, and network embedding is then learned with LINE. The table below shows the disambiguation results of the different methods.

Method	Prec	Rec	F1
Our approach	79.68	80.14	78.85
LINE	61.22	49.96	53.02
LINE with title similarity	79.29	58.69	64.98
metapath2vec	64.44	67.75	64.40
DNGR	44.62	70.21	51.65

It can be seen that the method of the invention clearly outperforms the other methods. Because it uses a heterogeneous network embedding learning method, it preserves as much of the relation information among papers as possible, so the learned paper vectors have stronger representational power and the disambiguation results improve.

The above embodiment merely illustrates the technical solution of the present invention and does not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention. The scope of protection of the invention is defined by the claims.

Claims

1. A scholar name disambiguation method based on heterogeneous network embedding, comprising the steps of:

1) specifying multiple author names to be disambiguated, collecting all papers related to the specified author names, and generating a paper-relationship heterogeneous network from the authors and semantic information of the collected papers;

2) on the paper-relationship heterogeneous network, generating paths that capture the text information of paper nodes' neighbor nodes using a meta-path-based random walk strategy, and saving these paths as the training corpus of a Skip-gram model;

3) training the Skip-gram model on the training corpus to generate a paper representation vector for each paper;

4) for an author name specified in step 1), obtaining the paper representation vectors of that name's papers from the vectors obtained in step 3);

5) clustering the paper representation vectors obtained in step 4) into several clusters, and treating each cluster as a set of papers written by a distinct person bearing the author's name, thereby disambiguating the author name.

2. The method of claim 1, wherein the paper-relationship heterogeneous network is generated by: taking each paper as a node in the heterogeneous network; defining several relations and their corresponding relation attributes; and, if one of the defined relations holds between two papers, building an edge between the two corresponding nodes, naming the edge according to the relation, and setting the edge's attribute value according to the relation's attribute.

3. The method of claim 2, wherein the relations include but are not limited to one or more of the following: having a common author, containing the same keyword, belonging to a common journal or conference, having a citation relationship, and having a common author country.

4. The method of claim 1 or 2, wherein the training corpus is generated by: selecting an arbitrary node in the paper-relationship heterogeneous network as the start node and walking under the guidance of a meta-path to generate a long path; changing the start node and continuing to generate long paths; and storing each long path on its own line with the path node ids separated by a delimiter, thereby generating the training corpus.

5. The method of claim 4, wherein the walk under the guidance of the meta-path comprises:

51) selecting path edges in the order of edge appearance prescribed by the meta-path, and, if several qualifying edges lead from the current node to a next node, choosing one qualifying edge to determine the next node connected to the current node; wherein the meta-path contains edges with several different relation names and fixes the order in which these edges appear;

52) repeating step 51) a set number N of times to obtain a long path.

6. The method of claim 5, wherein in step 51) one qualifying edge is chosen according to the edge weights to determine the next node connected to the current node, an edge with a larger weight having a higher probability of being selected.

7. The method of claim 1, wherein the Skip-gram model treats each path node id in a path as a word and the sequence of connected nodes in the path as the word's context, and training produces a vector for each node id, which is the paper representation vector of the paper corresponding to that node id.

8. The method of claim 1 or 7, wherein the Skip-gram model is the word2vec implementation in the gensim library for Python or Google's open-source C-language word2vec tool.

9. The method of claim 1, wherein the paper-relationship heterogeneous network is generated from the authors, journal names, and semantic information of the collected papers; the semantic information includes but is not limited to one or more of the following: author, title, keywords, and abstract.

10. The method of claim 1, wherein, given a number of clusters K, the paper representation vectors obtained in step 4) are clustered using agglomerative hierarchical clustering.