CN109753662B - Duplicate name writer identification method based on hierarchical network

Info

Publication number: CN109753662B
Application number: CN201910030797.2A
Authority: CN
Inventors: 高建良; 蒋志怡; 杜宏亮
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-01-14
Filing date: 2019-01-14
Publication date: 2023-01-06
Anticipated expiration: 2039-01-14
Also published as: CN109753662A

Abstract

The invention discloses a duplicate-name author identification method based on a hierarchical network, comprising the following steps. Step 1: for a given literature data set, divide it into subsets according to publication time; build an author collaboration network G_i for each subset, generating a hierarchical network G; connect link edges between the duplicate-name author nodes in G; if the nodes at the two ends of a link edge are determined to be the same person, merge them. Step 2: calculate the similarity score of the duplicate-name author nodes at the two ends of each link edge and assign it as the weight of that link edge. Step 3: find the link edge with the maximum weight and judge whether the weight is greater than a set threshold; if so, merge the nodes at the two ends of the link edge, update the link edge weights according to the method in step 2, and execute step 3 iteratively until the maximum link edge weight is less than or equal to the set threshold; then output the identification result, in which the authors corresponding to merged nodes are the same person. The invention takes into account features such as document publication time and the existence of duplicate names among collaborators, and solves the duplicate-name author problem in document data sets efficiently and accurately.

Description

Duplicate-name author identification method based on a hierarchical network

Technical Field

The invention belongs to the field of hierarchical network construction and duplicate-name identification algorithms, and particularly relates to a duplicate-name author identification method based on a hierarchical network.

Background

When a document data set is searched with an author name as the search criterion, a list of all documents under that name is typically returned. The duplicate-name author problem is that multiple author entries in a document data set share the same name, but it cannot be determined whether they refer to the same person. This problem lowers the accuracy of document retrieval.

When research projects are established, reviewed, and managed and review experts are sought, when researchers look up scholars in a given field, when journal editors look for expert reviewers or plan special topics, and when academic conference organizers look for keynote speakers, an effective duplicate-name author identification algorithm brings convenience to research management, academic research, and research evaluation. In addition, the identification results bring further value, such as building citation networks, collaboration networks, and author profiles, discovering changes in the research directions and fields of researchers, and discovering trends in how researchers move between institutions.

Traditional manual processing inevitably runs into the duplicate-name author problem and cannot cope with author name disambiguation and document classification across the large and growing body of scientific and technical literature. With the rapid growth in the number and quality of international papers published from China, the international influence of Chinese researchers is increasing. At the same time, the duplicate-name problem for Chinese researchers in English-language scientific literature data sets is becoming more and more serious. The main causes are that name abbreviation formats differ when Chinese names are converted into English names, and that different Chinese names written with different characters can share the same pronunciation.

Researchers in China and abroad have proposed duplicate-name author identification algorithms from different angles, such as social networks, machine learning, and probabilistic models. In existing approaches, the goal of the algorithm is typically to partition the list of documents returned when a document data set is searched for a given author name. The GHOST method constructs a collaborator network for the duplicate-name authors, in which the duplicate-name authors are represented by different nodes and collaborators with the same name are represented by a single node; it then finds valid paths between the duplicate-name author nodes to express the similarity score between them, and finally uses a clustering method to partition the duplicate-name authors. Attempts to solve the duplicate-name author problem with graph data mining methods have continued in recent years, but they currently remain at the level of graph connectivity and graph fusion methods. Existing methods ignore the publication time feature. Existing network-based methods also ignore the fact that the collaborators of duplicate-name authors may themselves have duplicate names. In addition, scientific publishers have proposed their own solution, requiring each contributing researcher to register an ID and to label their earlier documents accurately. However, because many researchers are not currently troubled by the duplicate-name problem, their enthusiasm for registering is low.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a duplicate-name author identification method based on a hierarchical network. The method divides the document data according to publication time and constructs a hierarchical network, takes into account duplicate names among the collaborators of duplicate-name authors, and iteratively calculates the similarity scores of author nodes in the hierarchical network. It thereby solves the duplicate-name author problem in a document database and identifies accurately and efficiently whether duplicate-name authors in a given document data set are the same person.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

a duplicate name author identification method based on a hierarchical network is characterized by comprising the following steps:

step 1, constructing a hierarchical network G: first, for a given document data set P, P is divided according to the publication time of each document, such that the documents published in each time period form a subset, giving P = {P_1, P_2, P_3, …, P_n}; then, for each P_i, a corresponding author collaboration network G_i is constructed, generating the hierarchical network G = {G_1, G_2, G_3, …, G_n}, where i = 1, 2, …, n; a node in an author collaboration network represents a document author, an edge represents a collaborator relationship between authors, and the node attributes comprise the hierarchy to which the node belongs and document information; then, link edges are connected between the duplicate-name author nodes in G, namely the nodes with the same author name; finally, each link edge is traversed, and if the authors corresponding to the nodes at its two ends have accurate information and are determined to be the same person, the nodes at the two ends of the link edge are merged, the attributes of the merged node comprising all hierarchical information before merging;

step 2, calculating a similarity score Simscore of the renamed nodes in the hierarchical network: traversing each link edge in the hierarchical network G, calculating the similarity score Simscore of the duplicate author nodes at the two ends of the link edge, and assigning the calculated similarity score Simscore as a corresponding link edge weight; the similarity score Simscore is obtained by weighting a plurality of sub-similarity scores by hierarchical information, wherein the sub-similarity scores comprise the similarity score of the affiliated organization, the similarity score of the document text, the similarity score of the affiliated hierarchical information, the similarity score of the cooperative relationship and the like;

step 3, judging the duplicate authors: for each link edge weight in the hierarchical network G, finding out the link edge with the maximum weight (the maximum weight corresponding to the link edge is max) and judging whether the link edge weight corresponding to the link edge is larger than a set threshold, if so, combining the nodes at two ends of the link edge with the maximum weight, wherein the combined node attribute comprises all hierarchical data before combination, updating the link edge weights of the combined node and the adjacent nodes thereof according to the method in the step 2, and then iterating the step 3 until the link edge weight with the maximum weight is smaller than or equal to the set threshold, outputting a duplicate author identification result, wherein in the output identification result, authors corresponding to the nodes combined in the hierarchical network are the same person.

Preferably, in step 1, in the process of constructing the corresponding author collaboration network G_i for each P_i, for each document p in P_i the nodes are labeled in the form of author name plus subscript i. For example, for an author name A with a duplicate-name problem in document data set P, {A_1, A_2, …, A_m} denotes author A in different documents, where m is the number of occurrences of author name A in P. G_i is composed of multiple complete graphs, and the number of complete graphs equals the number of documents in P_i.

In a preferable mode, in step 1, the accurate information includes mailbox information and/or document list information of a homepage of an author.

As a preferred mode, in step 2, for a duplicate author name r, let the duplicate-name author nodes at the two ends of a link edge be denoted here as r_i and r_j. The affiliated-organization similarity score is S_aff(r_i, r_j), the document text similarity score is S_txt(r_i, r_j), the hierarchical information similarity score is S_h(r_i, r_j), and the cooperative relationship similarity score is S_num(co_adj).

Wherein:

For the affiliated-organization similarity score S_aff and the document text similarity score S_txt, node r_i and node r_j are each represented by word-frequency vectors computed from their affiliated organizations and their document texts. S_aff is the cosine similarity of the two affiliation vectors, and S_txt is the cosine similarity of the two document text vectors. For two word-frequency vectors u and v, the calculation formula is as follows:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

The value of the hierarchical information similarity score S_h depends on the hierarchical information of node r_i and node r_j, and S_h is used to weight the other information similarity scores. In the process of merging nodes, the hierarchical information of a node may come to include two or more layers. When S_h is calculated for r_i and r_j, the minimum difference h between the hierarchical information of the two nodes is selected, and S_h is determined from h. When the hierarchical information of node r_i and node r_j contains the same layer, S_h is at its maximum; the larger the hierarchical difference, the smaller S_h.

In the cooperative relationship similarity score S_num(co_adj), co_adj denotes the number of identical neighbor nodes of node r_i and node r_j; the more identical neighbor nodes there are, the larger the value of S_num(co_adj).

The similarity score Simscore of the duplicate-name author nodes r_i and r_j at the two ends of a link edge in the hierarchical network G is obtained by weighting these sub-similarity scores with the hierarchical information, wherein ω_m denotes the set coefficient of the similarity score S_m, and S_m is S_aff or S_txt.

Compared with the prior art, the invention has the following beneficial effects: a hierarchical network is constructed and similarity scores are calculated on it for the first time; the multiple similarity scores of a duplicate-name author are weighted using hierarchical information, and the cooperative relationship similarity score is calculated only once the collaborators of the duplicate-name author nodes have been judged; features ignored by existing methods, such as document publication time and duplicate names among collaborators, are added to the similarity score calculation; and the hierarchical network is constructed according to publication time, so that the similarity scores of duplicate-name authors are calculated more accurately. The duplicate-name author problem in document data sets is thereby solved efficiently and accurately, which helps improve the precision and recall of document retrieval.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Fig. 2 is a diagram of the hierarchical network constructed in step 1 of the method of the present invention.

FIG. 3 is a flow chart of the method of the present invention for calculating the similarity score of rename author nodes at both ends of a link edge in step 2.

FIG. 4 is a flowchart of updating link edge weights after merging nodes in step 3 of the method of the present invention.

Fig. 5 is a hierarchical network diagram after the nodes are merged for the first time in step 3.

Fig. 6 is a hierarchical network diagram after the nodes are merged for the second time in step 3.

Fig. 7 is a hierarchical network diagram after the nodes are merged for the third time in step 3.

Fig. 8 is a hierarchical network diagram after the fourth node merging in step 3.

Detailed Description

The duplicate name author identification method based on the hierarchical network according to the present invention is further described in detail with reference to the flowchart and the implementation example.

The embodiment shows how the hierarchical-network-based method of the invention identifies the duplicate author names "Liu Jing" and "Liu Fang" in the partial literature data shown in Table 1, and describes the embodiments of the invention in detail. First, a hierarchical network is constructed from the document data set, and link edges are connected between the duplicate-name author nodes in the hierarchical network. Then each link edge is traversed, and the similarity score of the duplicate-name author nodes at its two ends is calculated and used as the weight of the link edge. Finally, the link edge with the maximum weight in the hierarchical network is found; if its weight is greater than a threshold, the nodes at its two ends are judged to be the same person and merged, and the link edge weights between the neighbor nodes of the merged node are updated. The link edge with the maximum weight is then found and judged again, iteratively, until the maximum link edge weight is no greater than the set threshold. The judgment of duplicate-name authors is then complete, and the final identification result is output, in which each merged node corresponds to one person.

Table 1 exemplary literature data

As shown in fig. 1, the present invention comprises the steps of:

step 1, constructing a hierarchical network G: first, the document data set shown in Table 1 is input and divided according to publication time in order to construct a hierarchical author collaboration network. For a given document data set P, P is divided according to the publication time of each document such that the documents published in each time period form a subset, giving P = {P_1, P_2, P_3, …, P_n}. In the embodiment, the document data set is divided by publication time into the periods {{2003, 2004}, {2008, 2009}, {2013, 2014}} to obtain the document data subsets.
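The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the record layout (`id`, `year`), the function name `partition_by_period`, and the toy records are assumptions made for the example, and only the three period boundaries come from the embodiment.

```python
from collections import defaultdict

# Period boundaries as in the embodiment's split of publication years.
PERIODS = [(2003, 2004), (2008, 2009), (2013, 2014)]

def partition_by_period(documents, periods=PERIODS):
    """Group documents into subsets P_1..P_n keyed by 1-based layer index."""
    subsets = defaultdict(list)
    for doc in documents:
        for layer, (start, end) in enumerate(periods, start=1):
            if start <= doc["year"] <= end:
                subsets[layer].append(doc["id"])
                break  # each document belongs to exactly one period
    return dict(subsets)

# Toy records for illustration only; they are not the contents of Table 1.
toy_docs = [
    {"id": "p1", "year": 2003},
    {"id": "p2", "year": 2009},
    {"id": "p3", "year": 2004},
    {"id": "p4", "year": 2013},
]
subsets = partition_by_period(toy_docs)
print(subsets)
```

Documents whose year falls outside every period are silently skipped in this sketch; a real data set would need a policy for them.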

Then, for each P_i, a corresponding author collaboration network G_i is built, generating the hierarchical network G = {G_1, G_2, G_3, …, G_n}, where i = 1, 2, …, n. The nodes in an author collaboration network represent document authors, the edges represent collaborator relationships between authors, and the node attributes include the hierarchy to which the node belongs and document information. For each document p in P_i, an author collaboration network is constructed, and nodes are labeled in the form of author name plus subscript i. In this embodiment, the author name "Liu Jing" is shown in the form "A_i"; the correspondence between the other author names and labels is shown in Table 2. G_i is composed of multiple complete graphs, and the number of complete graphs equals the number of documents in P_i.
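As a rough sketch of how one layer G_i could be assembled, the following Python builds one complete graph per document and gives every occurrence of an author name its own node. The data layout, the `(name, occurrence)` label tuples, and the function name `build_layer_network` are assumptions for illustration; the two toy documents are hypothetical and are not taken from Table 1.

```python
from itertools import combinations

def build_layer_network(layer, documents, occurrence):
    """Build G_i: one complete graph per document in P_i.

    `occurrence` counts how often each author name has been seen so far,
    so repeated names get distinct labels such as ("A", 2).
    """
    nodes, edges = {}, set()
    for doc in documents:
        labels = []
        for name in doc["authors"]:
            occurrence[name] = occurrence.get(name, 0) + 1
            label = (name, occurrence[name])
            # Node attributes: the layer(s) and document(s) the node belongs to.
            nodes[label] = {"layers": {layer}, "docs": {doc["id"]}}
            labels.append(label)
        # Every pair of co-authors of one document is connected.
        edges.update(frozenset(pair) for pair in combinations(labels, 2))
    return nodes, edges

# Toy layer with two hypothetical documents.
layer_docs = [
    {"id": "p1", "authors": ["A", "B", "D"]},
    {"id": "p3", "authors": ["A", "C"]},
]
nodes, edges = build_layer_network(1, layer_docs, occurrence={})
print(len(nodes), len(edges))
```

Passing the same `occurrence` dictionary to successive layers keeps the subscripts unique across the whole hierarchical network.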

Then, the duplicate-name author nodes are traversed, and link edges are connected between the duplicate-name author nodes in G, namely the nodes with the same author name. Finally, each link edge in G is traversed and nodes with confirmed information are merged: if the authors corresponding to the nodes at the two ends of a link edge have accurate information and are determined to be the same person, the two nodes are merged, and the attributes of the merged node include all hierarchical information before merging. The accurate information includes mailbox information, the document list on an author's personal homepage, and the like. In this embodiment, a query of personal homepage information shows that the authors represented by D_2 and D_3 are the same person, so nodes D_2 and D_3 are merged. The constructed hierarchical network is shown in FIG. 2.
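The link-edge step can be illustrated with a short Python sketch. It assumes nodes are identified by `(name, index)` tuples and that node attributes are dictionaries of sets; both are choices made for this example, as are the function names and the sample attribute values passed to `merge_attributes`.

```python
from collections import defaultdict
from itertools import combinations

def connect_link_edges(labels):
    """Return a link edge for every pair of nodes that share an author name."""
    by_name = defaultdict(list)
    for name, index in labels:
        by_name[name].append((name, index))
    return {
        frozenset(pair)
        for group in by_name.values()
        for pair in combinations(group, 2)
    }

def merge_attributes(attr_a, attr_b):
    """Attributes of a merged node keep all layer and document information."""
    return {
        "layers": attr_a["layers"] | attr_b["layers"],
        "docs": attr_a["docs"] | attr_b["docs"],
    }

# Node labels for the two duplicate names examined in the embodiment.
labels = [("A", i) for i in range(1, 8)] + [("C", i) for i in range(1, 4)]
links = connect_link_edges(labels)
print(len(links))  # 21 pairs for A plus 3 pairs for C
```

The count of 24 link edges matches the number of same-name cells in the lower triangle of the embodiment's first weight table.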

TABLE 2 node labels

Author name	Node label
Liu Jing	A_1, A_2, A_3, A_4, A_5, A_6, A_7
Zhong Weicai	B_1, B_2
Liu Fang	C_1, C_2, C_3
Jiao Licheng	D_1, D_2, D_3
Lu Hanqing	E_1
Li Zhu Shu	F_1
Hu Kang	G_1
Wang Jingrun	H_1
Zhai Suodi	I_1
Chen Xiaohong	J_1, J_2
Yin Ling	K_1

Step 2, calculating the similarity score Simscore of the duplicate-name nodes in the hierarchical network: each link edge in the hierarchical network G is traversed, the similarity score Simscore between the duplicate-name author nodes at its two ends is calculated, and the calculated Simscore is assigned as the weight of that link edge. The similarity score Simscore is obtained by weighting a plurality of sub-similarity scores by hierarchical information, wherein the sub-similarity scores comprise the affiliated-organization similarity score, the document text similarity score, the hierarchical information similarity score, the cooperative relationship similarity score, and the like.

For a duplicate author name r, let the duplicate-name author nodes at the two ends of a link edge be denoted here as r_i and r_j. The affiliated-organization similarity score is S_aff(r_i, r_j), the document text similarity score is S_txt(r_i, r_j), the hierarchical information similarity score is S_h(r_i, r_j), and the cooperative relationship similarity score is S_num(co_adj).

Wherein:

For the affiliated-organization similarity score S_aff and the document text similarity score S_txt, node r_i and node r_j are each represented by word-frequency vectors computed from their affiliated organizations and their document texts. S_aff is the cosine similarity of the two affiliation vectors, and S_txt is the cosine similarity of the two document text vectors. For two word-frequency vectors u and v, the calculation formula is as follows:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

The value of the hierarchical information similarity score S_h depends on the hierarchical information of node r_i and node r_j, and S_h is used to weight the other information similarity scores. In the process of merging nodes, the hierarchical information of a node may come to include two or more layers. When S_h is calculated for r_i and r_j, the minimum difference h between the hierarchical information of the two nodes is selected. When the hierarchical information of node r_i and node r_j contains the same layer, S_h is at its maximum; the larger the hierarchical difference, the smaller S_h.

In the cooperative relationship similarity score S_num(co_adj), co_adj denotes the number of identical neighbor nodes of node r_i and node r_j; the more identical neighbor nodes there are, the larger the value of S_num(co_adj).

The similarity score Simscore of the duplicate-name author nodes r_i and r_j at the two ends of a link edge in the hierarchical network G is obtained by weighting these sub-similarity scores with the hierarchical information, wherein ω_m denotes the coefficient of the similarity score S_m, set on demand, and S_m is S_aff or S_txt.

In this embodiment, the hierarchical information similarity score takes the following values: h = 0, S_h = 1.0; h = 1, S_h = 0.9; h = 2, S_h = 0.8. The value of the cooperative relationship similarity score depends on the number of identical neighbor nodes: co_adj = 1, S_num(co_adj) = 0.1; co_adj = 2, S_num(co_adj) = 0.2; co_adj = 3, S_num(co_adj) = 0.3. The affiliation strings and document texts of the duplicate-name authors are converted into vector representations using term frequency-inverse document frequency (TF-IDF), and the cosine similarity of the vectors gives the final S_aff and S_txt. After the similarity score of each part is obtained, the similarity score Simscore of the nodes at the two ends of each link edge is calculated as shown in FIG. 3; the results are shown in Table 3.

TABLE 3 Link weights generated by first calculation

Simscore	A_1	A_2	A_3	A_4	A_5	A_6	A_7	C_1	C_2	C_3
A_1								0	0	0
A_2	0.56							0	0	0
A_3	0.00	0.00						0	0	0
A_4	0.11	0.21	0.00					0	0	0
A_5	0.00	0.00	0.00	0.00				0	0	0
A_6	0.00	0.00	0.00	0.00	0.00			0	0	0
A_7	0.00	0.00	0.00	0.00	0.00	0.53		0	0	0
C_1	0	0	0	0	0	0	0
C_2	0	0	0	0	0	0	0	0.53
C_3	0	0	0	0	0	0	0	0.00	0.00
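The sub-scores of step 2 can be sketched as follows. Several parts of this example are assumptions rather than the patent's definitions: the way the sub-scores are combined (`S_h * (w_aff * S_aff + w_txt * S_txt) + S_num`) is one plausible reading of "weighted by hierarchical information", since the combining formula is not reproduced in this text; the coefficients 0.5 are placeholders for ω_m; `S_NUM[0] = 0.0` is assumed; and raw term frequency is used in place of the embodiment's TF-IDF to keep the block dependency-free. Only the S_h and S_num lookup values for h = 0..2 and co_adj = 1..3 come from the embodiment.

```python
import math
from collections import Counter

S_H = {0: 1.0, 1: 0.9, 2: 0.8}            # layer difference h -> S_h
S_NUM = {0: 0.0, 1: 0.1, 2: 0.2, 3: 0.3}  # shared neighbours -> S_num (0 is assumed)

def cosine(text_a, text_b):
    """Cosine similarity of raw term-frequency vectors (TF-IDF stand-in)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

def layer_difference(layers_a, layers_b):
    """Minimum difference h between the layer sets of two (merged) nodes."""
    return min(abs(a - b) for a in layers_a for b in layers_b)

def simscore(aff_a, aff_b, txt_a, txt_b, layers_a, layers_b, co_adj,
             w_aff=0.5, w_txt=0.5):
    """Assumed combination: S_h weights the text scores, S_num is added."""
    h = layer_difference(layers_a, layers_b)
    weighted = w_aff * cosine(aff_a, aff_b) + w_txt * cosine(txt_a, txt_b)
    return S_H[h] * weighted + S_NUM[co_adj]

same_layer = simscore("central south university", "central south university",
                      "author name network", "author name network",
                      {1}, {1, 2}, co_adj=1)
far_layers = simscore("central south university", "central south university",
                      "author name network", "author name network",
                      {1}, {3}, co_adj=0)
print(round(same_layer, 2), round(far_layers, 2))
```

With identical texts, the two calls differ only through the layer difference and the shared-neighbour count, which makes the effect of S_h and S_num easy to see in isolation.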

Step 3, judging the duplicate authors:

(I) For each link edge weight in the hierarchical network G, find the link edge with the largest weight; its value is max(link). Compare this value with the set threshold. If it is greater than the set threshold, perform the operation in (II); otherwise, output the duplicate-name author identification result, in which merged nodes correspond to the same person. In this embodiment, the threshold is set to 0.20. In the hierarchical network G, as shown in Table 3, the link edge with the largest weight is <A_1, A_2>, with weight 0.56. Comparing the weight with the set threshold: 0.56 > 0.20.

(II) As shown in FIG. 4, the nodes at the two ends of the link edge that has the largest weight, greater than the threshold, are merged. The attributes of the merged node include all hierarchical data before merging. The link edge weights of its neighbor nodes are then updated. After the nodes are merged, the cooperative relationship similarity score S_num within the similarity score of the link edges between its neighbor nodes needs to be updated, which updates their Simscore values. After a node is merged, its node attributes change: its hierarchical information may change, and, for the other similarity scores, once the affiliated organizations and document texts are combined they are represented again as vectors, so the similarity score between the node and its duplicate-name nodes changes. The similarity scores between the merged node and its duplicate-name nodes are recalculated, and the weights of the link edges connected to the node are updated. In this embodiment, the nodes of <A_1, A_2> are merged into A_{1,2}, and the similarity scores between A_{1,2} and its duplicate-name nodes are calculated. After the merge, a link edge <C_1, C_2> exists between neighbor nodes, and its weight is updated. The updated similarity scores Simscore are shown in Table 4. The updated hierarchical network G is shown in FIG. 5.
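The effect of a merge on the shared-neighbour count can be shown with a small adjacency-set sketch. The collaboration structure below is hypothetical (it is not taken from Table 1), and the names `merge_nodes` and `co_adj` as functions are introduced only for this illustration. It shows one way a merge of two duplicate-name nodes can raise the number of identical neighbours of another pair, which is consistent with the weight of <C_1, C_2> rising by 0.10 between the two tables, though the actual cause in the embodiment cannot be confirmed from this text.

```python
def merge_nodes(adj, attrs, a, b, merged):
    """Merge nodes a and b into `merged`, rewiring collaboration edges."""
    neighbours = (adj.pop(a) | adj.pop(b)) - {a, b}
    adj[merged] = neighbours
    for n in neighbours:
        adj[n] = (adj[n] - {a, b}) | {merged}
    # The merged node keeps all layer and document information.
    attrs[merged] = {
        "layers": attrs[a]["layers"] | attrs[b]["layers"],
        "docs": attrs[a]["docs"] | attrs[b]["docs"],
    }
    del attrs[a], attrs[b]

def co_adj(adj, u, v):
    """Number of identical neighbour nodes of u and v."""
    return len(adj[u] & adj[v])

# Hypothetical structure: C1 co-authored with A1, and C2 with A2.
adj = {"A1": {"C1"}, "A2": {"C2"}, "C1": {"A1"}, "C2": {"A2"}}
attrs = {
    "A1": {"layers": {1}, "docs": {"p1"}},
    "A2": {"layers": {2}, "docs": {"p2"}},
}
before = co_adj(adj, "C1", "C2")
merge_nodes(adj, attrs, "A1", "A2", "A12")
after = co_adj(adj, "C1", "C2")
print(before, after)
```

Only link edges touching the merged node or its neighbours can change, which is why the update in step 3 is restricted to those edges.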

TABLE 4 Link weights after second calculation of merged nodes

Simscore	A_{1,2}	A_3	A_4	A_5	A_6	A_7	C_1	C_2	C_3
A_{1,2}							0	0	0
A_3	0.00						0	0	0
A_4	0.22	0.00					0	0	0
A_5	0.00	0.00	0.00				0	0	0
A_6	0.00	0.00	0.00	0.00			0	0	0
A_7	0.00	0.00	0.00	0.00	0.53		0	0	0
C_1	0	0	0	0	0	0
C_2	0	0	0	0	0	0	0.63
C_3	0	0	0	0	0	0	0.00	0.00

In Table 4, the link edge with the largest current weight is <C_1, C_2>. Comparing its weight with the set threshold (0.20): 0.63 > 0.20. The nodes C_1 and C_2 at the two ends of the link edge are therefore merged into C_{1,2}; the link edge weights between the merged node and its duplicate-name nodes are updated, as are the link edge weights among the neighbor nodes of the merged node. In this embodiment, only the link edge weight between C_{1,2} and C_3 needs to be updated. The updated similarity scores Simscore are shown in Table 5. The updated hierarchical network G is shown in FIG. 6.

TABLE 5 Link weights after the third calculation of the merged nodes

Simscore	A_{1,2}	A_3	A_4	A_5	A_6	A_7	C_{1,2}	C_3
A_{1,2}							0	0
A_3	0.00						0	0
A_4	0.22	0.00					0	0
A_5	0.00	0.00	0.00				0	0
A_6	0.00	0.00	0.00	0.00			0	0
A_7	0.00	0.00	0.00	0.00	0.53		0	0
C_{1,2}	0	0	0	0	0	0
C_3	0	0	0	0	0	0	0.00

In Table 5, the link edge with the largest current weight is <A_6, A_7>. Comparing its weight with the set threshold (0.20): 0.53 > 0.20. The nodes A_6 and A_7 at the two ends of the link edge are therefore merged into A_{6,7}; the link edge weights between the merged node and its duplicate-name nodes are updated, as are the link edge weights among the neighbor nodes of the merged node. In this embodiment, the updated similarity scores Simscore are shown in Table 6. The updated hierarchical network G is shown in FIG. 7.

TABLE 6 fourth calculation of link weights after node merging

Simscore	A_{1,2}	A_3	A_4	A_5	A_{6,7}	C_{1,2}	C_3
A_{1,2}						0	0
A_3	0.00					0	0
A_4	0.22	0.00				0	0
A_5	0.00	0.00	0.00			0	0
A_{6,7}	0.00	0.00	0.00	0.00		0	0
C_{1,2}	0	0	0	0	0
C_3	0	0	0	0	0	0.00

In Table 6, the link edge with the largest current weight is <A_{1,2}, A_4>. Comparing its weight with the set threshold (0.20): 0.22 > 0.20. The nodes A_{1,2} and A_4 at the two ends of the link edge are therefore merged into A_{1,2,4}; the link edge weights between the merged node and its duplicate-name nodes are updated, as are the link edge weights among the neighbor nodes of the merged node. In this embodiment, the updated similarity scores Simscore are shown in Table 7. The updated hierarchical network G is shown in FIG. 8.

TABLE 7 Link weights after fifth calculation of merged nodes

Simscore	A_{1,2,4}	A_3	A_5	A_{6,7}	C_{1,2}	C_3
A_{1,2,4}					0	0
A_3	0.00				0	0
A_5	0.00	0.00			0	0
A_{6,7}	0.00	0.00	0.00		0	0
C_{1,2}	0	0	0	0
C_3	0	0	0	0	0.00

In Table 7, the link edge with the largest current weight has weight 0, which is below the set threshold, so the final identification result is output, in which merged nodes correspond to the same person. As shown in FIG. 8, the merged nodes are determined to be the same person.

The final recognition result is: "Liu Jing": {{A_1, A_2, A_4}, {A_6, A_7}, {A_3}, {A_5}}; "Liu Fang": {{C_1, C_2}, {C_3}}. The accuracy of the identification result is 100%.
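The iteration of step 3 can be sketched as a greedy merge loop. The non-zero starting weights below are the ones listed in the first weight table of the embodiment. The rescoring rule is an assumption: instead of re-running step 2 on the merged node's combined attributes, this sketch takes the largest starting weight between any two members of the clusters, and it rescans all pairs on every pass rather than updating only neighbouring link edges. Under those stand-in rules the loop ends with the same grouping as the embodiment, although the intermediate weights differ from the embodiment's tables.

```python
from itertools import combinations

# Non-zero link-edge weights from the first calculation; all other pairs are 0.
BASE = {
    frozenset({"A1", "A2"}): 0.56,
    frozenset({"A1", "A4"}): 0.11,
    frozenset({"A2", "A4"}): 0.21,
    frozenset({"A6", "A7"}): 0.53,
    frozenset({"C1", "C2"}): 0.53,
}

def score(a, b):
    # Stand-in for re-running step 2: best base weight between any two members.
    return max(BASE.get(frozenset({x, y}), 0.0) for x in a for y in b)

def greedy_merge(clusters, score, threshold):
    """Merge the highest-weight pair until no weight exceeds the threshold."""
    clusters = list(clusters)
    while len(clusters) > 1:
        best, i, j = max(
            (score(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best <= threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

node_labels = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "C1", "C2", "C3"]
result = greedy_merge([frozenset({n}) for n in node_labels], score, threshold=0.20)
groups = sorted(sorted(c) for c in result)
print(groups)
```

Because the stopping test is `best <= threshold`, a pair whose weight equals the threshold is not merged, matching the "less than or equal to" condition in step 3.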

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A duplicate name author identification method based on a hierarchical network is characterized by comprising the following steps:

step 1, firstly, for a given literature data set P, dividing P according to the publication time of each document in P, so that the documents published in each time period form a subset, obtaining P = {P_1, P_2, P_3, …, P_n}; then, for each P_i, building a corresponding author collaboration network G_i, generating a hierarchical network G = {G_1, G_2, G_3, …, G_n}, where i = 1, 2, …, n, wherein the nodes in an author collaboration network represent document authors, the edges represent collaborator relationships between authors, and the node attributes comprise the hierarchy to which the node belongs and document information; then, connecting link edges between the duplicate-name author nodes in G; finally, traversing each link edge, and if the authors corresponding to the nodes at the two ends of the link edge have accurate information and are determined to be the same person, merging the nodes at the two ends of the link edge, wherein the attributes of the merged node comprise all hierarchical information before merging;

step 2, traversing each link edge in the hierarchical network G, calculating a similarity score Simscore of the duplicate author nodes at two ends of the link edge, and assigning the calculated similarity score Simscore as a corresponding link edge weight; the similarity score Simscore is obtained by weighting a plurality of sub-similarity scores by hierarchical information, wherein the sub-similarity scores comprise the similarity score of the affiliated organization, the similarity score of the document text, the similarity score of the affiliated hierarchical information and the similarity score of the cooperative relationship;

step 3, for each link edge weight in the hierarchical network G, finding the link edge with the maximum weight and judging whether its weight is greater than a set threshold; if so, merging the nodes at the two ends of the link edge with the maximum weight, wherein the attributes of the merged node comprise all hierarchical data before merging, updating the link edge weights of the merged node and its neighbor nodes according to the method in step 2, and then executing step 3 iteratively until the maximum link edge weight is less than or equal to the set threshold, and outputting the duplicate-name author identification result, wherein in the output identification result the authors corresponding to the nodes merged in the hierarchical network are the same person.

2. The hierarchical network-based duplicate-name author identification method as claimed in claim 1, wherein in step 1, in the process of building the corresponding author collaboration network G_i for each P_i, for each document p in P_i the nodes are labeled in the form of author name plus subscript, and G_i is composed of multiple complete graphs, the number of complete graphs being equal to the number of documents in P_i.

3. The method for identifying a duplicate author based on a hierarchical network as set forth in claim 1, wherein in the step 1, the accurate information includes mailbox information or document list information of an author's personal homepage.

4. The method as claimed in claim 1, wherein in the step 2, for the rename author name r, the rename author nodes at two ends of the link edge are set as