CN109753662B - Duplicate name writer identification method based on hierarchical network - Google Patents


Publication number
CN109753662B
CN109753662B CN201910030797.2A CN201910030797A CN109753662B CN 109753662 B CN109753662 B CN 109753662B CN 201910030797 A CN201910030797 A CN 201910030797A CN 109753662 B CN109753662 B CN 109753662B
Authority
CN
China
Prior art keywords
nodes
author
link edge
hierarchical
duplicate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910030797.2A
Other languages
Chinese (zh)
Other versions
CN109753662A (en
Inventor
高建良
蒋志怡
杜宏亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Application filed by Central South University
Priority to CN201910030797.2A
Publication of CN109753662A
Application granted
Publication of CN109753662B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a duplicate-name author identification method based on a hierarchical network, comprising the following steps. Step 1: for a given document data set, divide the documents into subsets according to publication time; build an author collaboration network G_i for each subset to generate a hierarchical network G; connect link edges between the duplicate-name author nodes in G; and, if the nodes at the two ends of a link edge are determined to be the same person, merge them. Step 2: calculate the similarity score of the duplicate-name author nodes at the two ends of each link edge and assign it as the weight of that link edge. Step 3: find the link edge with the maximum weight and judge whether the weight is greater than a set threshold; if so, merge the nodes at the two ends of the link edge, update the link edge weights according to the method of step 2, and repeat step 3 until the maximum link edge weight is less than or equal to the set threshold; then output the identification result, in which the authors corresponding to merged nodes are the same person. The invention takes into account features such as document publication time and the existence of duplicate names among collaborators, and efficiently and accurately solves the duplicate-name author problem in document data sets.

Description

Duplicate author identification method based on hierarchical network
Technical Field
The invention belongs to the field of hierarchical network construction and duplicate-name identification algorithms, and particularly relates to a duplicate-name author identification method based on a hierarchical network.
Background
When a document data set is searched with an author name as the search criterion, a list of all documents under that name is typically returned. The duplicate-name author problem arises when multiple authors in a document data set share the same name and it cannot be determined whether they are the same person. This problem lowers the accuracy of document retrieval.
An effective duplicate-name author identification algorithm brings convenience to scientific research management, academic research and research evaluation, for example when review experts are sought during the establishment, review and management of research projects, when researchers look up scholars in a certain field, when journal editors seek reviewers or plan topics, and when academic conference organizers seek keynote speakers. In addition, the identification results bring additional value, such as building citation networks, collaboration networks and author profiles, discovering changes in the research directions and fields of researchers, and discovering trends in researchers' changes of affiliation.
Traditional manual processing of the duplicate-name author problem inevitably falls short and cannot cope with name disambiguation and document attribution in the large and growing volume of scientific and technical literature. With the rapid growth in the number and quality of international papers published by Chinese researchers, their international influence is increasing. At the same time, the duplicate-name problem for Chinese researchers in English-language scientific and technical literature data sets is becoming more and more serious. The main reasons are that abbreviated name formats differ when Chinese names are converted into English names, and that different Chinese names written with different characters can share the same pronunciation.
Researchers in China and abroad have proposed duplicate-name author identification algorithms from different angles, such as social networks, machine learning and probability models. In existing approaches, the goal of the algorithm is typically to partition the list of documents returned when a document data set is searched for a certain author name. The GHOST method constructs a collaboration network for the duplicate-name authors, in which duplicate-name authors are represented by different nodes and collaborators with the same name are represented by the same node; it then finds valid paths between the duplicate-name author nodes to represent the similarity scores between them, and finally uses a clustering method to partition the duplicate-name authors. Attempts to solve the duplicate-name author problem with graph data mining methods have been made in recent years, but they currently remain at the level of graph connectivity and graph fusion methods. Prior-art methods ignore the publication time feature. Existing methods that solve the duplicate-name author problem by constructing a network also ignore the fact that the collaborators of duplicate-name authors may themselves have duplicate names. In addition, scientific publishers have proposed their own solution, requiring each contributing researcher to register an ID and to accurately label his or her earlier documents. However, many researchers have not yet been troubled by the duplicate-name problem, so their enthusiasm for registration is low.
Disclosure of Invention
The invention aims to provide, in view of the defects of the prior art, a duplicate-name author identification method based on a hierarchical network. The method divides the document data according to publication time and constructs a hierarchical network, takes into account the duplicate-name problem among the collaborators of duplicate-name authors, and iteratively calculates the similarity scores of author nodes in the hierarchical network, thereby solving the duplicate-name author problem in a document database and accurately and efficiently identifying whether duplicate-name authors in a given document data set are the same person.
To solve the above technical problem, the technical scheme adopted by the invention is as follows:
A duplicate-name author identification method based on a hierarchical network, characterized by comprising the following steps:
Step 1, constructing a hierarchical network G: first, a given document data set P is divided according to the publication time of each document, so that the documents published in each time period form a subset, giving P = {P_1, P_2, P_3, …, P_n}; then, for each P_i, a corresponding author collaboration network G_i is constructed, generating the hierarchical network G = {G_1, G_2, G_3, …, G_n}, where i = 1, 2, …, n; a node in an author collaboration network represents a document author, an edge represents a collaboration relationship between authors, and the node attributes include the level to which the node belongs and the document information; then, link edges are connected between the duplicate-name author nodes in G, that is, between nodes with the same author name; finally, each link edge is traversed, and if the authors corresponding to the nodes at its two ends have accurate information and are determined to be the same person, the two nodes are merged, the attributes of the merged node including all hierarchical information before merging;
Step 2, calculating the similarity score Simscore of the duplicate-name nodes in the hierarchical network: each link edge in the hierarchical network G is traversed, the similarity score Simscore of the duplicate-name author nodes at its two ends is calculated, and the calculated Simscore is assigned as the weight of that link edge; Simscore is obtained by using hierarchical information to weight several sub-similarity scores, including the affiliation similarity score, the document text similarity score, the hierarchical information similarity score, the collaboration similarity score and the like;
Step 3, judging the duplicate-name authors: among the link edge weights in the hierarchical network G, the link edge with the maximum weight is found (this maximum weight is denoted max(link)) and it is judged whether that weight is greater than a set threshold; if so, the nodes at the two ends of the link edge with the maximum weight are merged, the attributes of the merged node including all hierarchical data before merging, and the link edge weights of the merged node and its neighbor nodes are updated according to the method of step 2; step 3 is then iterated until the maximum link edge weight is less than or equal to the set threshold, at which point the duplicate-name author identification result is output, in which the authors corresponding to nodes merged in the hierarchical network are the same person.
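The iteration in step 3 amounts to a greedy merge loop. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name `greedy_merge`, the representation of a node as a frozenset of labels, and the `score` callback (a stand-in for the Simscore of step 2) are all hypothetical.

```python
from itertools import combinations


def greedy_merge(nodes, score, threshold):
    """Repeatedly merge the highest-scoring pair of same-name nodes.

    nodes:     iterable of frozensets of node labels (initially singletons).
    score:     hypothetical callback (group_a, group_b) -> float that stands
               in for the Simscore of step 2.
    threshold: merging stops once the best score is <= threshold.
    """
    groups = set(nodes)
    while len(groups) > 1:
        # Find the link edge with the maximum weight.
        a, b = max(combinations(groups, 2), key=lambda pair: score(*pair))
        if score(a, b) <= threshold:
            break
        # Merge both end nodes; the merged node keeps every original label.
        groups -= {a, b}
        groups.add(a | b)
        # Weights touching the merged node are recomputed on the next pass,
        # because score() is evaluated again for every remaining pair.
    return groups
```

Recomputing every pair on each pass keeps the sketch short; an implementation that only refreshes the weights around the merged node, as the method describes, would avoid the repeated work.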
Preferably, in step 1, when the corresponding author collaboration network G_i is constructed for each P_i, the nodes for each document p in P_i are labeled with the author name plus a subscript i. For example, for an author name A with a duplicate-name problem in the document data set P, {A_1, A_2, …, A_m} denotes author A in different documents, where m is the number of occurrences of author name A in P. G_i is composed of multiple complete graphs, the number of which equals the number of documents in P_i.
Preferably, in step 1, the accurate information includes email address information and/or the document list on the author's personal homepage.
Preferably, in step 2, for a duplicate-name author name r, let the duplicate-name author nodes at the two ends of a link edge be r_i and r_j, their affiliation similarity score be S_aff(r_i, r_j), their document text similarity score be S_txt(r_i, r_j), their hierarchical information similarity score be S_h(r_i, r_j), and their collaboration similarity score be S_num(co_adj), wherein:
The affiliation similarity score S_aff(r_i, r_j) and the document text similarity score S_txt(r_i, r_j) are obtained by representing the affiliation and the document text of nodes r_i and r_j as word-frequency vectors. S_aff(r_i, r_j) is the cosine similarity of the affiliation word-frequency vectors V_aff(r_i) and V_aff(r_j), and S_txt(r_i, r_j) is the cosine similarity of the document text word-frequency vectors V_txt(r_i) and V_txt(r_j). The calculation formulas are as follows:
$$S_{aff}(r_i, r_j) = \frac{V_{aff}(r_i) \cdot V_{aff}(r_j)}{\lVert V_{aff}(r_i) \rVert \, \lVert V_{aff}(r_j) \rVert}$$
$$S_{txt}(r_i, r_j) = \frac{V_{txt}(r_i) \cdot V_{txt}(r_j)}{\lVert V_{txt}(r_i) \rVert \, \lVert V_{txt}(r_j) \rVert}$$
The value of the hierarchical information similarity score S_h(r_i, r_j) depends on the hierarchical information of nodes r_i and r_j, and S_h(r_i, r_j) is used to weight the other similarity scores. As nodes are merged, the hierarchical information of a node may come to include two or more levels. When S_h(r_i, r_j) is calculated, the minimum difference h between the levels of r_i and the levels of r_j is used. When the hierarchical information of r_i and r_j contains a common level, S_h(r_i, r_j) is largest; the larger the level difference, the smaller S_h(r_i, r_j).
In the collaboration similarity score S_num(co_adj), co_adj denotes the number of neighbor nodes shared by r_i and r_j; the more shared neighbor nodes, the larger the value of S_num(co_adj).
The similarity score Simscore(r_i, r_j) of the duplicate-name author nodes r_i and r_j at the two ends of a link edge in the hierarchical network G is given by the formula reproduced in the original only as an image (Figure BDA00019441600000000525), wherein ω_m denotes the coefficient set for the similarity score S_m, and S_m is S_aff or S_txt.
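The cosine similarity of two word-frequency vectors, which underlies both the affiliation score and the document text score, can be illustrated as follows. This is a minimal sketch: the function name `cosine_tf` and the whitespace tokenization are assumptions made for illustration (Chinese text would need a word segmenter instead).

```python
import math
from collections import Counter


def cosine_tf(text_a, text_b):
    """Cosine similarity between the word-frequency vectors of two strings."""
    # Assumption: words are separated by whitespace and case is ignored.
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Dot product over the words the two vectors have in common.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty text has a zero vector; treat its similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two identical affiliation strings score 1, and two strings with no word in common score 0.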
Compared with the prior art, the invention has the following beneficial effects. It is the first to construct a hierarchical network for calculating similarity scores: hierarchical information is used to weight the several similarity scores of duplicate-name authors, and the collaboration similarity score is calculated only after the identity of the collaborators of the duplicate-name author nodes has been judged. Features ignored by existing methods, such as document publication time and duplicate names among collaborators, are incorporated into the similarity calculation, and the hierarchical network is constructed according to publication time. As a result, the similarity scores of duplicate-name authors are calculated more accurately, the duplicate-name author problem in document data sets is solved efficiently and accurately, and the precision and recall of document retrieval are improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a diagram of the hierarchical network constructed in step 1 of the method of the present invention.
FIG. 3 is a flow chart of calculating, in step 2 of the method of the present invention, the similarity score of the duplicate-name author nodes at the two ends of a link edge.
FIG. 4 is a flowchart of updating link edge weights after merging nodes in step 3 of the method of the present invention.
Fig. 5 is a hierarchical network diagram after the nodes are merged for the first time in step 3.
Fig. 6 is a hierarchical network diagram after the nodes are merged for the second time in step 3.
Fig. 7 is a hierarchical network diagram after the nodes are merged for the third time in step 3.
Fig. 8 is a hierarchical network diagram after the fourth node merging in step 3.
Detailed Description
The duplicate name author identification method based on the hierarchical network according to the present invention is further described in detail with reference to the flowchart and the implementation example.
This embodiment shows how the hierarchical network-based method of the present invention identifies the duplicate-name authors "Liu Jing" and "Liu Fang" in the partial literature data shown in Table 1, and thereby details an implementation of the invention. First, a hierarchical network is constructed from the document data set, and link edges are connected between the duplicate-name author nodes in the hierarchical network. Then each link edge is traversed, and the similarity score of the duplicate-name author nodes at its two ends is calculated and used as the weight of the link edge. Finally, the link edge with the maximum weight in the hierarchical network is found; if its weight is greater than the threshold, the two end nodes are judged to be the same person and are merged, and the link edge weights among the neighbor nodes of the merged node are updated. The search for the link edge with the maximum weight is repeated until that maximum weight is no longer greater than the set threshold, at which point the judgment of duplicate-name authors is complete and the final identification result is output, in which merged nodes correspond to the same person.
Table 1 exemplary literature data
Figure BDA0001944160000000071
As shown in fig. 1, the present invention comprises the steps of:
Step 1, constructing a hierarchical network G: first, the document data set shown in Table 1 is input and divided according to publication time in order to construct a hierarchical author collaboration network. For a given document data set P, P is divided according to the publication time of each document, so that the documents published in each time period form a subset, giving P = {P_1, P_2, P_3, …, P_n}. In this embodiment, the document data set is divided according to the publication years {{2003, 2004}, {2008, 2009}, {2013, 2014}} to obtain the document data subsets.
Then, for each P_i, a corresponding author collaboration network G_i is constructed, generating the hierarchical network G = {G_1, G_2, G_3, …, G_n}, where i = 1, 2, …, n. Nodes in an author collaboration network represent document authors, edges represent collaboration relationships between authors, and the node attributes include the level to which the node belongs and the document information. For each document p in P_i, an author collaboration network is constructed, and the nodes are labeled with the author name plus a subscript i. In this embodiment, the author name "Liu Jing" is denoted in the form "A_i"; the mapping between the other author names and their labels is shown in Table 2. G_i is composed of multiple complete graphs, the number of which equals the number of documents in P_i.
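The construction described above (one layer per time period, one complete graph per document) can be sketched as follows. This is an illustration under assumptions: the function name `build_layers`, the dictionary keys `year` and `authors`, and the `name_k` label format are hypothetical, and the subscript is taken to be a running count of occurrences of the name, consistent with the labels in Table 2.

```python
from itertools import combinations


def build_layers(documents, periods):
    """Build one author collaboration network per publication period.

    documents: list of dicts with hypothetical keys 'year' and 'authors'.
    periods:   list of sets of years, one per layer, in chronological order.
    Returns a list of (nodes, edges) pairs, one per layer.
    """
    counters = {}  # running subscript for each author name
    layers = []
    for level, years in enumerate(periods, start=1):
        nodes, edges = {}, set()
        for doc in documents:
            if doc["year"] not in years:
                continue
            labels = []
            for name in doc["authors"]:
                counters[name] = counters.get(name, 0) + 1
                label = f"{name}_{counters[name]}"
                # Node attributes: author name, level(s) and source document.
                nodes[label] = {"name": name, "levels": {level}, "doc": doc}
                labels.append(label)
            # The co-authors of one document form a complete graph.
            edges.update(combinations(labels, 2))
        layers.append((nodes, edges))
    return layers
```

For the embodiment, `periods` would be `[{2003, 2004}, {2008, 2009}, {2013, 2014}]`.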
Then, the duplicate-name author nodes are traversed, and link edges are connected between the duplicate-name author nodes in G, that is, between nodes with the same author name. Finally, each link edge in G is traversed and nodes with confirmed information are merged: if the authors corresponding to the nodes at the two ends of a link edge have accurate information and are determined to be the same person, the two nodes are merged, and the attributes of the merged node include all hierarchical information before merging. The accurate information includes email address information, the document list on the author's personal homepage, and the like. In this embodiment, a query of personal homepage information shows that D_2 and D_3 represent the same person, so nodes D_2 and D_3 are merged. The attributes of the merged node include all hierarchical data before merging. The constructed hierarchical network is shown in FIG. 2.
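Connecting link edges and separating out the pairs that accurate information already confirms can be sketched as below. The function name `link_edges` and the `name` and `email` keys are hypothetical, and matching on an identical email address is only one possible form of "accurate information"; the homepage document list mentioned above is not modeled.

```python
from itertools import combinations


def link_edges(nodes):
    """Split same-name node pairs into confirmed and candidate link edges.

    nodes: dict mapping a node label to a dict with hypothetical keys
           'name' (author name) and 'email' (str or None).
    Returns (confirmed, candidates): pairs that share an email address and
    can be merged immediately, and pairs left to be scored in step 2.
    """
    confirmed, candidates = [], []
    for a, b in combinations(sorted(nodes), 2):
        if nodes[a]["name"] != nodes[b]["name"]:
            continue  # link edges only join nodes with the same author name
        email_a, email_b = nodes[a].get("email"), nodes[b].get("email")
        if email_a and email_a == email_b:
            confirmed.append((a, b))
        else:
            candidates.append((a, b))
    return confirmed, candidates
```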
TABLE 2 Node labels
Author name      Node labels
Liu Jing         A_1, A_2, A_3, A_4, A_5, A_6, A_7
Zhong Weicai     B_1, B_2
Liu Fang         C_1, C_2, C_3
Jiao Licheng     D_1, D_2, D_3
Lu Hanqing       E_1
Li Zhu Shu       F_1
Hu Kang          G_1
Wang Jingrun     H_1
Zhai Suodi       I_1
Chen Xiaohong    J_1, J_2
Yin Ling         K_1
Step 2, calculating the similarity score Simscore of the duplicate-name nodes in the hierarchical network: each link edge in the hierarchical network G is traversed, the similarity score Simscore between the duplicate-name author nodes at its two end points is calculated, and the calculated Simscore is assigned as the weight of that link edge. Simscore is obtained by using hierarchical information to weight several sub-similarity scores, including the affiliation similarity score, the document text similarity score, the hierarchical information similarity score, the collaboration similarity score and the like.
For a duplicate-name author name r, let the duplicate-name author nodes at the two ends of a link edge be r_i and r_j, their affiliation similarity score be S_aff(r_i, r_j), their document text similarity score be S_txt(r_i, r_j), their hierarchical information similarity score be S_h(r_i, r_j), and their collaboration similarity score be S_num(co_adj), wherein:
The affiliation similarity score S_aff(r_i, r_j) and the document text similarity score S_txt(r_i, r_j) are obtained by representing the affiliation and the document text of nodes r_i and r_j as word-frequency vectors. S_aff(r_i, r_j) is the cosine similarity of the affiliation word-frequency vectors V_aff(r_i) and V_aff(r_j), and S_txt(r_i, r_j) is the cosine similarity of the document text word-frequency vectors V_txt(r_i) and V_txt(r_j). The calculation formulas are as follows:
$$S_{aff}(r_i, r_j) = \frac{V_{aff}(r_i) \cdot V_{aff}(r_j)}{\lVert V_{aff}(r_i) \rVert \, \lVert V_{aff}(r_j) \rVert}$$
$$S_{txt}(r_i, r_j) = \frac{V_{txt}(r_i) \cdot V_{txt}(r_j)}{\lVert V_{txt}(r_i) \rVert \, \lVert V_{txt}(r_j) \rVert}$$
The value of the hierarchical information similarity score S_h(r_i, r_j) depends on the hierarchical information of nodes r_i and r_j, and S_h(r_i, r_j) is used to weight the other similarity scores. As nodes are merged, the hierarchical information of a node may come to include two or more levels. When S_h(r_i, r_j) is calculated, the minimum difference h between the levels of r_i and the levels of r_j is used. When the hierarchical information of r_i and r_j contains a common level, S_h(r_i, r_j) is largest; the larger the level difference, the smaller S_h(r_i, r_j).
In the collaboration similarity score S_num(co_adj), co_adj denotes the number of neighbor nodes shared by r_i and r_j; the more shared neighbor nodes, the larger the value of S_num(co_adj).
The similarity score Simscore(r_i, r_j) of the duplicate-name author nodes r_i and r_j at the two ends of a link edge in the hierarchical network G is given by the formula reproduced in the original only as an image (Figure BDA00019441600000001031), wherein ω_m denotes the coefficient, set as required, of the similarity score S_m, and S_m is S_aff or S_txt.
In this embodiment, the hierarchical information similarity score takes the following values: h = 0, S_h = 1.0; h = 1, S_h = 0.9; h = 2, S_h = 0.8. The collaboration similarity score depends on the number of shared neighbor nodes and takes the following values: co_adj = 1, S_num(co_adj) = 0.1; co_adj = 2, S_num(co_adj) = 0.2; co_adj = 3, S_num(co_adj) = 0.3. The affiliation strings and document texts of the duplicate-name authors are converted into vector representations using term frequency-inverse document frequency (TF-IDF), and the cosine similarity of the vectors gives the final S_aff and S_txt. After the individual similarity scores are obtained, the similarity score Simscore of the nodes at the two ends of each link edge is calculated following the flow shown in FIG. 3; the results are shown in Table 3.
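The two lookup-style scores of this embodiment can be written down directly. The sketch below uses the values quoted above; the function names are hypothetical, the `0.1 * co_adj` rule is an assumption that merely reproduces the three listed values (and gives 0 for no shared neighbor), and level differences greater than 2 are not covered because the embodiment gives no value for them.

```python
# Values of the hierarchical information similarity score in the embodiment.
S_H = {0: 1.0, 1: 0.9, 2: 0.8}


def hierarchy_score(levels_a, levels_b):
    """Score from the smallest level difference h between two nodes.

    Each argument is the set of levels a (possibly merged) node belongs to.
    Raises KeyError for h > 2, which the embodiment does not define.
    """
    h = min(abs(x - y) for x in levels_a for y in levels_b)
    return S_H[h]


def partnership_score(neighbors_a, neighbors_b):
    """0.1 per shared neighbor node (assumed linear rule, see lead-in)."""
    co_adj = len(set(neighbors_a) & set(neighbors_b))
    return round(0.1 * co_adj, 10)  # rounding hides floating-point noise
```

A merged node that spans levels 1 and 2 therefore scores 1.0 against a node in level 2, because the minimum difference is 0.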
TABLE 3 Link edge weights generated by the first calculation
Simscore  A_1   A_2   A_3   A_4   A_5   A_6   A_7   C_1   C_2   C_3
A_1       -     -     -     -     -     -     -     0     0     0
A_2       0.56  -     -     -     -     -     -     0     0     0
A_3       0.00  0.00  -     -     -     -     -     0     0     0
A_4       0.11  0.21  0.00  -     -     -     -     0     0     0
A_5       0.00  0.00  0.00  0.00  -     -     -     0     0     0
A_6       0.00  0.00  0.00  0.00  0.00  -     -     0     0     0
A_7       0.00  0.00  0.00  0.00  0.00  0.53  -     0     0     0
C_1       0     0     0     0     0     0     0     -     -     -
C_2       0     0     0     0     0     0     0     0.53  -     -
C_3       0     0     0     0     0     0     0     0.00  0.00  -
Step 3, judging the duplicate authors:
(I) Among the link edge weights in the hierarchical network G, the link edge with the largest weight is found; its value is max(link). This value is compared with the set threshold. If the weight is greater than the set threshold, operation (II) is executed; otherwise, the duplicate-name author identification result is output, in which merged nodes correspond to the same person. In this embodiment, the threshold is set to 0.20. In the hierarchical network G, as shown in Table 3, the link edge with the largest weight is <A_1, A_2>, with weight 0.56. Comparing the weight with the set threshold gives 0.56 > 0.20.
(II) As shown in FIG. 4, the nodes at the two ends of the link edge that has the largest weight, which is greater than the threshold, are merged. The attributes of the merged node include all hierarchical data before merging, and the link edge weights of its neighbor nodes are updated. After nodes are merged, the collaboration similarity score S_num(co_adj) within the similarity score of the link edges between neighbor nodes must be updated, and the Simscore value updated accordingly. Merging also changes the attributes of the merged node: its hierarchical information may change, and once the affiliation and document text information are combined they are re-encoded with the vector representation method, so the similarity scores between the node and its duplicate-name nodes change. The similarity scores between the merged node and its duplicate-name nodes are therefore recalculated, and the weights of the link edges connected to the node are updated. In this embodiment, the nodes of <A_1, A_2> are merged into A_{1,2}, and the similarity scores between A_{1,2} and its duplicate-name nodes are calculated. The weight of the link edge <C_1, C_2>, which exists between neighbor nodes of the merged node, is also updated. The updated similarity scores Simscore are shown in Table 4, and the updated hierarchical network G is shown in FIG. 5.
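The first iteration can be checked against the nonzero entries of Table 3. The snippet below is only a worked check of the selection and threshold test; the dictionary layout and variable names are assumptions, and the weights are copied from Table 3.

```python
# Nonzero link edge weights read from Table 3 (first calculation).
table3 = {
    ("A1", "A2"): 0.56,
    ("A1", "A4"): 0.11,
    ("A2", "A4"): 0.21,
    ("A6", "A7"): 0.53,
    ("C1", "C2"): 0.53,
}
THRESHOLD = 0.20  # threshold used in the embodiment

# Pick the link edge with the largest weight and apply the threshold test.
best_edge = max(table3, key=table3.get)
should_merge = table3[best_edge] > THRESHOLD
print(best_edge, table3[best_edge], should_merge)  # ('A1', 'A2') 0.56 True
```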
TABLE 4 Link edge weights after the second calculation (following the first node merge)
Simscore  A_{1,2}  A_3      A_4      A_5      A_6      A_7      C_1      C_2      C_3
A_{1,2}   -        -        -        -        -        -        0        0        0
A_3       0.00     -        -        -        -        -        0        0        0
A_4       0.22     0.00     -        -        -        -        0        0        0
A_5       0.00     0.00     0.00     -        -        -        0        0        0
A_6       0.00     0.00     0.00     0.00     -        -        0        0        0
A_7       0.00     0.00     0.00     0.00     0.53     -        0        0        0
C_1       0        0        0        0        0        0        -        -        -
C_2       0        0        0        0        0        0        0.63     -        -
C_3       0        0        0        0        0        0        0.00     0.00     -
In Table 4, the link edge with the largest current weight is <C_1, C_2>. Comparing its weight with the set threshold (0.20) gives 0.63 > 0.20, so nodes C_1 and C_2 at the two ends of the link edge are merged into C_{1,2}; the link edge weights between the merged node and its duplicate-name nodes are updated, as are the link edge weights among the neighbor nodes of the merged node. In this embodiment, only the link edge weight between C_{1,2} and C_3 needs to be updated. The updated similarity scores Simscore are shown in Table 5, and the updated hierarchical network G is shown in FIG. 6.
TABLE 5 Link edge weights after the third calculation (following the second node merge)
Simscore  A_{1,2}  A_3      A_4      A_5      A_6      A_7      C_{1,2}  C_3
A_{1,2}   -        -        -        -        -        -        0        0
A_3       0.00     -        -        -        -        -        0        0
A_4       0.22     0.00     -        -        -        -        0        0
A_5       0.00     0.00     0.00     -        -        -        0        0
A_6       0.00     0.00     0.00     0.00     -        -        0        0
A_7       0.00     0.00     0.00     0.00     0.53     -        0        0
C_{1,2}   0        0        0        0        0        0        -        -
C_3       0        0        0        0        0        0        0.00     -
In Table 5, the link edge with the largest current weight is <A_6, A_7>. Comparing its weight with the set threshold (0.20) gives 0.53 > 0.20, so nodes A_6 and A_7 at the two ends of the link edge are merged into A_{6,7}; the link edge weights between the merged node and its duplicate-name nodes are updated, as are the link edge weights among the neighbor nodes of the merged node. In this embodiment, the updated similarity scores Simscore are shown in Table 6, and the updated hierarchical network G is shown in FIG. 7.
TABLE 6 Link edge weights after the fourth calculation (following the third node merge)
Simscore  A_{1,2}  A_3      A_4      A_5      A_{6,7}  C_{1,2}  C_3
A_{1,2}   -        -        -        -        -        0        0
A_3       0.00     -        -        -        -        0        0
A_4       0.22     0.00     -        -        -        0        0
A_5       0.00     0.00     0.00     -        -        0        0
A_{6,7}   0.00     0.00     0.00     0.00     -        0        0
C_{1,2}   0        0        0        0        0        -        -
C_3       0        0        0        0        0        0.00     -
In Table 6, the link edge with the largest weight is <A_{1,2}, A_4>. Comparing its weight with the set threshold (0.20) gives 0.22 > 0.20, so nodes A_{1,2} and A_4 at the two ends of the link edge are merged into A_{1,2,4}; the link edge weights between the merged node and its duplicate-name nodes are updated, as are the link edge weights among the neighbor nodes of the merged node. In this embodiment, the updated similarity scores Simscore are shown in Table 7, and the updated hierarchical network G is shown in FIG. 8.
TABLE 7 Link edge weights after the fifth calculation (following the fourth node merge)
Simscore   A_{1,2,4}  A_3        A_5        A_{6,7}    C_{1,2}    C_3
A_{1,2,4}  -          -          -          -          0          0
A_3        0.00       -          -          -          0          0
A_5        0.00       0.00       -          -          0          0
A_{6,7}    0.00       0.00       0.00       -          0          0
C_{1,2}    0          0          0          0          -          -
C_3        0          0          0          0          0.00       -
In Table 7, the link edge with the largest current weight has weight 0, which is smaller than the set threshold, so the final identification result is output, in which merged nodes correspond to the same person. As shown in FIG. 8, the merged nodes are determined to be the same person.
The final identification result is: 'Liu Jing': {{A_1, A_2, A_4}, {A_6, A_7}, {A_3}, {A_5}}; 'Liu Fang': {{C_1, C_2}, {C_3}}. The accuracy of the identification result is 100%.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. A duplicate name author identification method based on a hierarchical network is characterized by comprising the following steps:
step 1, firstly, for a given document data set P, dividing P according to the publication time of each document in P, so that the documents published in each time period form a subset, and obtaining P = {P_1, P_2, P_3, …, P_n}; then, for each P_i, building a corresponding author collaboration network G_i, and generating a hierarchical network G = {G_1, G_2, G_3, …, G_n}, wherein the nodes in an author collaboration network represent document authors, the edges in an author collaboration network represent collaboration relationships between authors, and the node attributes contain the information of the hierarchy to which the node belongs and the document information; then, connecting link edges between the duplicate-name author nodes in G; finally, traversing each link edge, and if accurate information exists for the authors corresponding to the nodes at the two ends of the link edge and they are determined to be the same person, merging the nodes at the two ends of the link edge, wherein the attributes of the merged node comprise all hierarchical information before merging;
step 2, traversing each link edge in the hierarchical network G, calculating a similarity score Simscore of the duplicate author nodes at two ends of the link edge, and assigning the calculated similarity score Simscore as a corresponding link edge weight; the similarity score Simscore is obtained by weighting a plurality of sub-similarity scores by hierarchical information, wherein the sub-similarity scores comprise the similarity score of the affiliated organization, the similarity score of the document text, the similarity score of the affiliated hierarchical information and the similarity score of the cooperative relationship;
step 3, finding, among all link edge weights in the hierarchical network G, the link edge with the maximum weight, and judging whether that weight is greater than a set threshold; if so, merging the nodes at the two ends of the link edge with the maximum weight, wherein the attributes of the merged node comprise all hierarchical information before merging, updating the link edge weights of the merged node and its neighbor nodes according to the method in step 2, and then iteratively executing step 3 until the maximum link edge weight is less than or equal to the set threshold; then outputting the duplicate-name author identification result, wherein, in the output identification result, the authors corresponding to nodes merged in the hierarchical network are the same person.
2. The hierarchical network-based duplicate name author identification method as claimed in claim 1, wherein in step 1, in the process of building the corresponding author collaboration network G_i for each P_i, for each document p in P_i, the nodes are labeled in the form of author name plus subscript; G_i is composed of multiple complete graphs, and the number of complete graphs is equal to the number of documents in P_i.
3. The method for identifying a duplicate name author based on a hierarchical network as set forth in claim 1, wherein in step 1, the accurate information includes email information or the document list information of an author's personal homepage.
4. The method as claimed in claim 1, wherein in step 2, for a duplicate author name r, the duplicate-name author nodes at the two ends of a link edge are denoted [formula image FDA0001944159990000021] and [formula image FDA0001944159990000022]; the affiliated-organization similarity score is [formula image FDA0001944159990000023], the document text similarity score is [formula image FDA0001944159990000024], the hierarchical information similarity score is [formula image FDA0001944159990000025], and the cooperative relationship similarity score is [formula image FDA0001944159990000026]; wherein:
for the affiliated-organization similarity score [formula image FDA0001944159990000027] and the document text similarity score [formula image FDA0001944159990000028], word-frequency vectors are used to represent the affiliated organizations and the document texts of node [formula image FDA0001944159990000029] and node [formula image FDA00019441599900000210], respectively; the cosine similarity of the word-frequency vectors [formula image FDA00019441599900000211] and [formula image FDA00019441599900000212] is calculated to give [formula image FDA00019441599900000213], and the cosine similarity of the word-frequency vectors [formula image FDA00019441599900000214] and [formula image FDA00019441599900000215] is calculated to give [formula image FDA00019441599900000216]; the calculation formulas are as follows:
[formula image FDA00019441599900000217]
[formula image FDA0001944159990000031]
when calculating the hierarchical information similarity score [formula image FDA0001944159990000034] of [formula image FDA0001944159990000032] and [formula image FDA0001944159990000033], the hierarchical information of node [formula image FDA0001944159990000035] and node [formula image FDA0001944159990000036] is selected, and the value with the smallest difference is found: [formula image FDA0001944159990000037]
in the cooperative relationship similarity score [formula image FDA0001944159990000038], [formula image FDA0001944159990000039] represents the number of identical neighbor nodes of node [formula image FDA00019441599900000310] and node [formula image FDA00019441599900000311]; the more identical neighbor nodes there are, the larger the value of [formula image FDA00019441599900000312];
the similarity score of the duplicate-name author nodes [formula image FDA00019441599900000313] and [formula image FDA00019441599900000314] at the two ends of a link edge in the hierarchical network G is given by:
[formula image FDA00019441599900000315]
[formula image FDA00019441599900000316]
wherein ω_m denotes the set coefficient of the similarity score S_m, and S_m is S_aff or S_txt.
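The quantities that claim 4 names in words can be illustrated with a short Python sketch. It is offered only as a hedged illustration: the exact scoring formulas and the final weighted combination are in the patent's formula images and are not reproduced here, and the function names `cosine_similarity`, `shared_neighbor_count`, and `min_level_gap` are hypothetical. The sketch shows the standard cosine similarity of two word-frequency vectors, the count of identical neighbor nodes, and the smallest difference between two nodes' hierarchy levels (assumed here to be integer level indices).

```python
import math
from collections import Counter
from typing import Collection, Iterable, Set


def cosine_similarity(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Cosine similarity of the word-frequency vectors of two token sequences."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both vectors contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def shared_neighbor_count(neighbors_a: Set[str], neighbors_b: Set[str]) -> int:
    """Number of identical neighbor (collaborator) nodes of two nodes."""
    return len(neighbors_a & neighbors_b)


def min_level_gap(levels_a: Collection[int], levels_b: Collection[int]) -> int:
    """Smallest difference between the hierarchy levels of two nodes.

    Both collections must be non-empty; a merged node may carry several levels.
    """
    return min(abs(x - y) for x in levels_a for y in levels_b)


# Hypothetical inputs: affiliation strings split into words, collaborator
# labels, and the time-slice indices in which each node appears.
aff = cosine_similarity("central south university".split(),
                        "central south university changsha".split())
common = shared_neighbor_count({"B1", "D1"}, {"B1", "E1"})
gap = min_level_gap([1, 2], [4, 5])
print(round(aff, 3), common, gap)
```

How these raw quantities are turned into the sub-similarity scores and weighted by ω_m is defined by the claim's formulas, so any combination rule added on top of this sketch would be a further assumption.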

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910030797.2A CN109753662B (en) 2019-01-14 2019-01-14 Duplicate name writer identification method based on hierarchical network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910030797.2A CN109753662B (en) 2019-01-14 2019-01-14 Duplicate name writer identification method based on hierarchical network

Publications (2)

Publication Number Publication Date
CN109753662A CN109753662A (en) 2019-05-14
CN109753662B true CN109753662B (en) 2023-01-06

Family

ID=66404608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910030797.2A Active CN109753662B (en) 2019-01-14 2019-01-14 Duplicate name writer identification method based on hierarchical network

Country Status (1)

Country Link
CN (1) CN109753662B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889467A (en) * 2019-12-20 2020-03-17 中国建设银行股份有限公司 Company name matching method and device, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653590A (en) * 2015-12-21 2016-06-08 青岛智能产业技术研究院 Name duplication disambiguation method of Chinese literature authors
CN106294677A (en) * 2016-08-04 2017-01-04 浙江大学 A kind of towards the name disambiguation method of China author in english literature
JP2018088093A (en) * 2016-11-28 2018-06-07 日本電信電話株式会社 Device, method, and program of object candidate region extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant