CN109753662B - Duplicate name writer identification method based on hierarchical network - Google Patents
- Publication number
- CN109753662B (application CN201910030797.2A)
- Authority
- CN
- China
- Prior art keywords
- nodes
- author
- link edge
- hierarchical
- duplicate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a duplicate-name author identification method based on a hierarchical network, comprising the following steps. Step 1: for a given document data set, divide it into subsets according to publication time; build an author collaboration network $G_i$ for each subset and thereby generate a hierarchical network $G$; connect link edges between the duplicate-name author nodes in $G$; if the nodes at the two ends of a link edge are determined to be the same person, merge them. Step 2: calculate the similarity score of the duplicate-name author nodes at the two ends of each link edge and assign it as the weight of that link edge. Step 3: find the link edge with the maximum weight and judge whether the weight is greater than a set threshold; if so, merge the nodes at the two ends of the link edge, update the link edge weights according to the method in step 2, and execute step 3 iteratively until the maximum link edge weight is less than or equal to the set threshold; then output the identification result, in which the authors corresponding to merged nodes are the same person. The invention takes into account features such as document publication time and duplicate names among collaborators, and solves the duplicate-name author problem in document data sets efficiently and accurately.
Description
Technical Field
The invention belongs to the field of hierarchical network construction and duplicate-name identification algorithms, and particularly relates to a duplicate-name author identification method based on a hierarchical network.
Background
When a document data set is searched with an author name as the search criterion, a list of all documents under that name is typically returned. The duplicate-name author problem arises when multiple author entries in a document data set carry the same name but it cannot be determined whether they are the same person. This problem lowers the accuracy of document retrieval.
An effective duplicate-name author identification algorithm brings convenience to scientific research management, academic research, and research evaluation, for example when research project managers look for review experts during project approval, review, and management; when researchers search for information on scholars in a given field; when journal editors look for manuscript reviewers and plan special topics; and when academic conference organizers look for keynote speakers. In addition, the identification results bring added value, such as building citation networks, collaboration networks, and author profiles, discovering changes in researchers' research directions and fields, and discovering trends in researchers' changes of institution.
Traditional manual processing of the duplicate-name author problem inevitably cannot cope with author name disambiguation and document classification in the large and growing body of scientific and technical literature. With the rapid growth in the number and quality of international papers published by China, the international influence of Chinese researchers is increasing. At the same time, the duplicate-name problem for Chinese researchers in English-language scientific literature data sets is becoming more and more serious. The main reasons are that name abbreviation formats differ when Chinese names are converted into English names, and that Chinese names written with different characters can share the same pronunciation.
Researchers at home and abroad have proposed duplicate-name author identification algorithms from different angles, such as social networks, machine learning, and probabilistic models. In existing approaches, the goal of the algorithm is typically to partition the list of documents returned by searching a document data set for a given author name. The GHOST method constructs a collaborator network for the duplicate-name authors, in which duplicate-name authors are represented by different nodes and collaborators with the same name are represented by a single node; it then finds valid paths between the duplicate-name author nodes to represent the similarity scores between them, and finally uses a clustering method to partition the duplicate-name authors. In recent years there have been continuing attempts to solve the duplicate-name author problem with graph data mining methods, but these remain at the level of graph connectivity and graph fusion methods. Existing methods ignore the publication time feature. Existing methods that solve the problem by constructing a network also ignore the fact that the collaborators of duplicate-name authors may themselves have duplicate names. In addition, scientific publishers have proposed their own solution, which requires each contributing researcher to register an ID and accurately label his or her previous publications. Many researchers are not currently troubled by the duplicate-name problem, however, so their enthusiasm for registering is low.
Disclosure of Invention
The invention aims to remedy the shortcomings of the prior art by providing a duplicate-name author identification method based on a hierarchical network. The method divides the document data according to publication time and constructs a hierarchical network, takes into account the duplicate-name problem among the collaborators of duplicate-name authors, and iteratively calculates the similarity scores of author nodes in the hierarchical network, thereby solving the duplicate-name author problem in document databases and accurately and efficiently identifying whether duplicate-name authors in a given document data set are the same person.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A duplicate-name author identification method based on a hierarchical network, characterized by comprising the following steps:
step 1, constructing the hierarchical network: first, for a given document data set $P$, divide $P$ according to the publication time of each document so that the documents published in each time period form a subset, obtaining $P = \{P_1, P_2, P_3, \dots, P_n\}$; then, for each $P_i$, build a corresponding author collaboration network $G_i$, generating the hierarchical network $G = \{G_1, G_2, G_3, \dots, G_n\}$, where $i = 1, 2, \dots, n$, the nodes in an author collaboration network represent document authors, the edges represent collaboration relationships between authors, and the node attributes include the hierarchical information to which the node belongs and document information; then, connect link edges between the duplicate-name author nodes in $G$; finally, traverse each link edge, and if the authors corresponding to the two end nodes of a link edge have accurate information and are determined to be the same person, merge the two end nodes, the attributes of the merged node including all hierarchical information from before the merge;
step 2, calculating the similarity score Simscore of the duplicate-name nodes in the hierarchical network: traverse each link edge in the hierarchical network $G$, calculate the similarity score Simscore of the duplicate-name author nodes at the two ends of the link edge, and assign the calculated Simscore as the weight of that link edge; Simscore is obtained by weighting a plurality of sub-similarity scores with hierarchical information, the sub-similarity scores including the affiliation similarity score, the document text similarity score, the hierarchical information similarity score, the collaboration similarity score, and the like;
step 3, judging the duplicate-name authors: among the link edge weights in the hierarchical network $G$, find the link edge with the maximum weight (denoting that weight as max) and judge whether it is greater than a set threshold; if so, merge the nodes at the two ends of that link edge, the attributes of the merged node including all hierarchical data from before the merge, update the weights of the link edges of the merged node and its neighbor nodes according to the method in step 2, and then iterate step 3 until the maximum link edge weight is less than or equal to the set threshold; then output the duplicate-name author identification result, in which the authors corresponding to nodes merged in the hierarchical network are the same person.
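The iteration in step 3 can be sketched as a greedy merging loop. The following is a minimal illustration only, not the patented implementation: the `score` callback is a hypothetical stand-in for the Simscore calculation of step 2, and `toy_score` is a made-up function used purely to exercise the loop.

```python
from itertools import combinations


def greedy_merge(nodes, score, threshold):
    """Merge duplicate-name nodes until no link edge exceeds the threshold.

    nodes:     labels of the nodes that share one author name.
    score:     hypothetical callback; takes two frozensets of labels and
               returns their Simscore (a stand-in for step 2).
    threshold: the set threshold from step 3.
    """
    clusters = [frozenset([n]) for n in nodes]
    while len(clusters) > 1:
        # Find the link edge with the maximum weight. Calling score() again
        # on every pass plays the role of updating the weights after a merge.
        a, b = max(combinations(clusters, 2), key=lambda pair: score(*pair))
        if score(a, b) <= threshold:
            break  # maximum weight <= threshold: output the result
        # Merge the two end nodes; the merged node keeps all members.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


def toy_score(a, b):
    """Made-up score: high when both clusters hold labels of equal parity."""
    return 0.9 if {x % 2 for x in a} == {x % 2 for x in b} else 0.1


result = greedy_merge([1, 2, 3, 4], toy_score, threshold=0.5)
```

Recomputing every pairwise score on each pass is the simplest way to keep weights current; the method described here only updates the link edges of the merged node and its neighbors, which avoids that repeated work.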
Preferably, in step 1, in the process of building the corresponding author collaboration network $G_i$ for each $P_i$, for each document $p$ in $P_i$ the nodes are labeled in the form of author name plus subscript; for example, for an author name $A$ with a duplicate-name problem in the document data set $P$, $\{A_1, A_2, \dots, A_m\}$ denotes author $A$ in different documents, where $m$ is the number of occurrences of the author name $A$ in $P$. $G_i$ is composed of multiple complete graphs, and the number of complete graphs equals the number of documents in $P_i$.
Preferably, in step 1, the accurate information includes e-mail address information and/or the document list on the author's personal homepage.
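The pre-merge on accurate information can be illustrated with a small union-find sketch. It assumes that each node carries an optional e-mail address and that two nodes with the same author name and the same address are the same person; the record layout, the function name `merge_by_email`, and the address below are all hypothetical.

```python
def merge_by_email(nodes):
    """Group node labels whose author name and e-mail address both match.

    nodes: dict mapping a node label to {"name": str, "email": str or None}.
    Returns a sorted list of sorted label groups.
    """
    parent = {label: label for label in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for label, attrs in nodes.items():
        if attrs.get("email") is None:
            continue  # no accurate information: leave the node alone
        key = (attrs["name"], attrs["email"])
        if key in first_seen:
            parent[find(label)] = find(first_seen[key])
        else:
            first_seen[key] = label

    groups = {}
    for label in nodes:
        groups.setdefault(find(label), []).append(label)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical records; the address is invented for illustration.
example = {
    "D1": {"name": "Jiao Licheng", "email": None},
    "D2": {"name": "Jiao Licheng", "email": "jiao@example.org"},
    "D3": {"name": "Jiao Licheng", "email": "jiao@example.org"},
}
groups = merge_by_email(example)
```

Nodes without accurate information stay in their own groups and are left for the similarity-based judgment in steps 2 and 3.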
Preferably, in step 2, for a duplicate author name $r$, let the duplicate-name author nodes at the two ends of a link edge be denoted $r_i$ and $r_j$, with affiliation similarity score $S_{aff}$, document text similarity score $S_{txt}$, hierarchical information similarity score $S_h$, and collaboration similarity score $S_{num}(co\_adj)$, wherein:
the affiliation similarity score $S_{aff}$ and the document text similarity score $S_{txt}$ are obtained by representing the affiliations and the document texts of nodes $r_i$ and $r_j$ as word-frequency vectors and calculating the cosine similarity of the two affiliation vectors and of the two document text vectors, respectively. For two word-frequency vectors $u$ and $v$, the cosine similarity is calculated as:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
the value of the hierarchical information similarity score $S_h$ depends on the hierarchical information of nodes $r_i$ and $r_j$, and $S_h$ is used to weight the other similarity scores. As nodes are merged, the hierarchical information of a node may come to include two or more levels. When calculating $S_h$ for $r_i$ and $r_j$, the minimum difference $h$ between the levels of node $r_i$ and the levels of node $r_j$ is selected and $S_h$ is determined from it. When the hierarchical information of nodes $r_i$ and $r_j$ contains a common level, $S_h$ is at its maximum; the larger the level difference, the smaller $S_h$;
in the collaboration similarity score $S_{num}(co\_adj)$, $co\_adj$ denotes the number of identical neighbor nodes shared by nodes $r_i$ and $r_j$; the more identical neighbor nodes there are, the larger the value of $S_{num}(co\_adj)$;
[Formula for Simscore: not preserved in this text]
wherein $\omega_m$ denotes the set coefficient of the similarity score $S_m$, and $S_m$ is $S_{aff}$ or $S_{txt}$.
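The cosine similarity of two word-frequency vectors, as used for the affiliation and document text scores, can be sketched with the standard library alone. This is an illustrative sketch under simple assumptions: whitespace tokenization, lowercase folding, and raw term counts. The function name and the affiliation strings are invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the word-frequency vectors of two strings."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treated as dissimilar
    return dot / (norm_a * norm_b)


# Invented affiliation strings, for illustration only.
same = cosine_similarity("Institute of Example Studies", "institute of example studies")
partial = cosine_similarity("Institute of Example Studies", "Example University")
none = cosine_similarity("Institute of Example Studies", "Other Place")
```

Identical token sets give a score of 1, disjoint token sets give 0, and partial overlap falls in between.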
Compared with the prior art, the invention has the following beneficial effects: a hierarchical network is constructed and similarity scores are calculated on it for the first time; the multiple similarity scores of a duplicate-name author are weighted using hierarchical information; and the collaboration similarity score is calculated only after the judgment of the collaborators of the duplicate-name author nodes has been completed. Features that existing methods ignore, such as document publication time and duplicate names among collaborators, are incorporated into the similarity score calculation, and the hierarchical network is constructed according to publication time, so the similarity scores of duplicate-name authors are calculated more accurately. The duplicate-name author problem in document data sets is thus solved efficiently and accurately, which helps improve the accuracy and recall of document retrieval.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a diagram of the hierarchical network constructed in step 1 of the method of the present invention.
FIG. 3 is a flow chart of the method of the present invention for calculating the similarity score of rename author nodes at both ends of a link edge in step 2.
FIG. 4 is a flowchart of updating link edge weights after merging nodes in step 3 of the method of the present invention.
Fig. 5 is a hierarchical network diagram after the nodes are merged for the first time in step 3.
Fig. 6 is a hierarchical network diagram after the nodes are merged for the second time in step 3.
Fig. 7 is a hierarchical network diagram after the nodes are merged for the third time in step 3.
Fig. 8 is a hierarchical network diagram after the fourth node merging in step 3.
Detailed Description
The duplicate-name author identification method based on a hierarchical network according to the present invention is described in further detail below with reference to the flow charts and an implementation example.
This embodiment shows the process of identifying the duplicate author names "Liu Jing" and "Liu Fang" in the part of the document data shown in Table 1, using the hierarchical network-based duplicate-name author identification method of the present invention, and explains the implementation in detail. First, a hierarchical network is constructed from the document data set, and link edges are connected between the duplicate-name author nodes in the hierarchical network. Then each link edge is traversed, and the similarity score of the duplicate-name author nodes at its two ends is calculated and used as the weight of the link edge. Finally, the link edge with the maximum weight in the hierarchical network is found; if its weight is greater than a threshold, the two end nodes are judged to be the same person and merged, and the weights of the link edges between the neighbor nodes of the merged node are updated. The search for the link edge with the maximum weight and the judgment are repeated until the maximum link edge weight is less than or equal to the set threshold, at which point the judgment process ends and the final duplicate-name author identification result is output, in which merged nodes correspond to the same person.
Table 1 exemplary literature data
As shown in fig. 1, the present invention comprises the steps of:
Step 1, constructing the hierarchical network: first, the given document data set $P$ is divided according to the publication time of each document, so that the documents published in each time period form a subset, giving $P = \{P_1, P_2, P_3, \dots, P_n\}$. Then, for each $P_i$, a corresponding author collaboration network $G_i$ is built, generating the hierarchical network $G = \{G_1, G_2, G_3, \dots, G_n\}$, where $i = 1, 2, \dots, n$. The nodes in an author collaboration network represent document authors, the edges represent collaboration relationships between authors, and the node attributes include the hierarchical information to which the node belongs and document information. For each document $p$ in $P_i$, an author collaboration network is constructed, and the nodes are labeled in the form of author name plus subscript. In this embodiment, the author name "Liu Jing" is represented in the format "$A_i$"; the correspondence between the other author names and their labels is shown in Table 2. $G_i$ is composed of multiple complete graphs, and the number of complete graphs equals the number of documents in $P_i$.
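The division by publication time and the per-document complete graphs can be sketched as follows. The text does not state how wide each time period is, so the fixed-width `period` parameter is an assumption, as are the function name `build_layers`, the data layout, and the publication years in the example (Table 1 is not reproduced here). Node labels are (author name, occurrence index) pairs, mirroring the $A_1, A_2, \dots$ labeling.

```python
from collections import defaultdict
from itertools import combinations


def build_layers(documents, period=1):
    """Split documents into time layers and build one clique per document.

    documents: list of (year, [author names]) tuples.
    period:    width of one time period in years (an assumption).
    Returns {layer index: [{"nodes": [...], "edges": [...]}, ...]}.
    """
    first_year = min(year for year, _ in documents)
    occurrences = defaultdict(int)  # how often each name has appeared so far
    layers = defaultdict(list)
    for year, authors in sorted(documents, key=lambda d: d[0]):
        layer = (year - first_year) // period + 1
        labels = []
        for name in authors:
            occurrences[name] += 1
            labels.append((name, occurrences[name]))
        # All co-authors of one document are pairwise connected, which
        # yields one complete graph per document.
        layers[layer].append({"nodes": labels,
                              "edges": list(combinations(labels, 2))})
    return dict(layers)


# Author names come from Table 2; the years are invented for illustration.
docs = [
    (2003, ["Liu Jing", "Zhong Weicai"]),
    (2003, ["Liu Jing", "Jiao Licheng", "Zhong Weicai"]),
    (2005, ["Liu Jing", "Liu Fang"]),
]
layers = build_layers(docs)
```

With these inputs, layer 1 holds two complete graphs and layer 3 holds one; a period with no documents simply has no layer entry.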
Next, the duplicate-name author nodes are traversed, and link edges are connected in $G$ between the duplicate-name author nodes, that is, between nodes with the same author name. Finally, each link edge in $G$ is traversed and nodes with determinate information are merged: if the authors corresponding to the two end nodes of a link edge have accurate information and are determined to be the same person, the two end nodes are merged, and the attributes of the merged node include all hierarchical information from before the merge. The accurate information includes e-mail address information, the document list on the author's personal homepage, and the like. In this embodiment, personal homepage information shows that the authors represented by $D_2$ and $D_3$ are the same person, so nodes $D_2$ and $D_3$ are merged. The constructed hierarchical network is shown in FIG. 2.
Table 2 Node labels
Author name | Node labels |
---|---|
Liu Jing | $A_1, A_2, A_3, A_4, A_5, A_6, A_7$ |
Zhong Weicai | $B_1, B_2$ |
Liu Fang | $C_1, C_2, C_3$ |
Jiao Licheng | $D_1, D_2, D_3$ |
Lu Hanqing | $E_1$ |
Li Zhu Shu | $F_1$ |
Hu Kang | $G_1$ |
Wang Jingrun | $H_1$ |
Zhai Suodi | $I_1$ |
Chen Xiaohong | $J_1, J_2$ |
Yin Ling | $K_1$ |
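Connecting link edges between nodes with the same author name amounts to enumerating every pair of same-name nodes. The sketch below applies this to the node counts of Table 2, before any merge on accurate information; the dictionary `TABLE_2` and the function `link_edges` are illustrative names, not part of the method.

```python
from itertools import combinations

# Number of nodes per label letter, read from Table 2.
TABLE_2 = {"A": 7, "B": 2, "C": 3, "D": 3, "E": 1, "F": 1,
           "G": 1, "H": 1, "I": 1, "J": 2, "K": 1}


def link_edges(counts):
    """Return every pair of nodes that carry the same author name."""
    edges = []
    for letter, m in counts.items():
        labels = [f"{letter}{k}" for k in range(1, m + 1)]
        # An author name with m occurrences contributes m*(m-1)/2 link edges.
        edges.extend(combinations(labels, 2))
    return edges


edges = link_edges(TABLE_2)
```

Names that occur only once, such as $E_1$, produce no link edges, while the seven "Liu Jing" nodes alone account for 21 candidate pairs.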
Step 2, calculating the similarity score Simscore of the duplicate-name nodes in the hierarchical network: each link edge in the hierarchical network $G$ is traversed, the similarity score Simscore between the duplicate-name author nodes at its two end points is calculated, and the calculated Simscore is assigned as the weight of that link edge. Simscore is obtained by weighting a plurality of sub-similarity scores with hierarchical information; the sub-similarity scores include the affiliation similarity score, the document text similarity score, the hierarchical information similarity score, the collaboration similarity score, and the like.
For a duplicate author name $r$, let the duplicate-name author nodes at the two ends of a link edge be denoted $r_i$ and $r_j$, with affiliation similarity score $S_{aff}$, document text similarity score $S_{txt}$, hierarchical information similarity score $S_h$, and collaboration similarity score $S_{num}(co\_adj)$, wherein:
the affiliation similarity score $S_{aff}$ and the document text similarity score $S_{txt}$ are obtained by representing the affiliations and the document texts of nodes $r_i$ and $r_j$ as word-frequency vectors and calculating the cosine similarity of the two affiliation vectors and of the two document text vectors, respectively. For two word-frequency vectors $u$ and $v$, the cosine similarity is calculated as:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
the value of the hierarchical information similarity score $S_h$ depends on the hierarchical information of nodes $r_i$ and $r_j$, and $S_h$ is used to weight the other similarity scores. As nodes are merged, the hierarchical information of a node may come to include two or more levels. When calculating $S_h$ for $r_i$ and $r_j$, the minimum difference $h$ between the levels of node $r_i$ and the levels of node $r_j$ is selected and $S_h$ is determined from it. When the hierarchical information of nodes $r_i$ and $r_j$ contains a common level, $S_h$ is at its maximum; the larger the level difference, the smaller $S_h$;
in the collaboration similarity score $S_{num}(co\_adj)$, $co\_adj$ denotes the number of identical neighbor nodes shared by nodes $r_i$ and $r_j$; the more identical neighbor nodes there are, the larger the value of $S_{num}(co\_adj)$;
[Formula for Simscore: not preserved in this text]
wherein $\omega_m$ denotes the coefficient, set as required, of the similarity score $S_m$, and $S_m$ is $S_{aff}$ or $S_{txt}$.
In this embodiment, the hierarchical information similarity score takes the following values: $h = 0$, $S_h = 1.0$; $h = 1$, $S_h = 0.9$; $h = 2$, $S_h = 0.8$. The collaboration similarity score depends on the number of identical neighbor nodes and takes the following values: $co\_adj = 1$, $S_{num}(co\_adj) = 0.1$; $co\_adj = 2$, $S_{num}(co\_adj) = 0.2$; $co\_adj = 3$, $S_{num}(co\_adj) = 0.3$. The affiliation strings and document texts of the duplicate-name authors are converted into vector representations using term frequency-inverse document frequency (TF-IDF), and the cosine similarity of the vectors gives the final $S_{aff}$ and $S_{txt}$. After the sub-scores have been obtained, the similarity score Simscore of the nodes at the two ends of each link edge is calculated as shown in FIG. 3; the results are shown in Table 3.
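The two lookups used in this embodiment can be sketched as small tables. The values for $h = 0, 1, 2$ and $co\_adj = 1, 2, 3$ are the ones listed above; the entry for $co\_adj = 0$ is an assumption, and values outside the listed ranges are not given in the text, so the sketch raises `KeyError` for them rather than guessing. How the sub-scores are combined into Simscore is not shown here.

```python
# Values listed for this embodiment.
S_H = {0: 1.0, 1: 0.9, 2: 0.8}
# co_adj = 0 -> 0.0 is an assumption; the text lists only 1, 2 and 3.
S_NUM = {0: 0.0, 1: 0.1, 2: 0.2, 3: 0.3}


def hierarchy_score(levels_a, levels_b):
    """S_h from the minimum level difference h between two nodes.

    A merged node may belong to several levels, so each argument is a
    collection of level indices.
    """
    h = min(abs(a - b) for a in levels_a for b in levels_b)
    return S_H[h]  # KeyError for h > 2: no value is given in the text


def cooperation_score(neighbors_a, neighbors_b):
    """S_num(co_adj), where co_adj counts the identical neighbor nodes."""
    co_adj = len(set(neighbors_a) & set(neighbors_b))
    return S_NUM[co_adj]  # KeyError for co_adj > 3: not given in the text


shared_level = hierarchy_score({1, 2}, {2, 4})   # common level 2, so h = 0
two_apart = hierarchy_score({1}, {3})            # h = 2
one_shared = cooperation_score({"B1", "D1"}, {"B1"})
```

Because a merged node keeps every level it has belonged to, taking the minimum difference means that two merged nodes sharing any one level receive the maximum $S_h$.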
Table 3 Link edge weights from the first calculation
Simscore | $A_1$ | $A_2$ | $A_3$ | $A_4$ | $A_5$ | $A_6$ | $A_7$ | $C_1$ | $C_2$ | $C_3$ |
---|---|---|---|---|---|---|---|---|---|---|
$A_1$ | | | | | | | | 0 | 0 | 0 |
$A_2$ | 0.56 | | | | | | | 0 | 0 | 0 |
$A_3$ | 0.00 | 0.00 | | | | | | 0 | 0 | 0 |
$A_4$ | 0.11 | 0.21 | 0.00 | | | | | 0 | 0 | 0 |
$A_5$ | 0.00 | 0.00 | 0.00 | 0.00 | | | | 0 | 0 | 0 |
$A_6$ | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | | | 0 | 0 | 0 |
$A_7$ | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.53 | | 0 | 0 | 0 |
$C_1$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | | |
$C_2$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.53 | | |
$C_3$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | |
Step 3, judging the duplicate-name authors:
(I) Among the link edge weights in the hierarchical network $G$, the link edge with the largest weight is found, its value being max(link), and this value is compared with the set threshold. If the weight is greater than the set threshold, operation (II) is performed; otherwise, the duplicate-name author identification result is output, in which merged nodes correspond to the same person. In this embodiment the threshold is set to 0.20. In the hierarchical network $G$, as shown in Table 3, the link edge with the largest weight is $\langle A_1, A_2 \rangle$, with weight 0.56. Comparing the weight with the set threshold: 0.56 > 0.20.
(II) As shown in FIG. 4, the nodes at the two ends of the link edge that has the largest weight and exceeds the threshold are merged. The attributes of the merged node include all hierarchical data from before the merge, and the link edge weights involving its neighbor nodes are updated. After the nodes are merged, the collaboration similarity score $S_{num}(co\_adj)$ within the similarity score of the link edges between neighbor nodes needs to be updated, which in turn updates Simscore. After a node is merged, its attributes change: its hierarchical information may change, and, for the other similarity scores, once the affiliation and document text information have been combined, the affiliation and document text are represented again as vectors, so the similarity scores between the node and the other nodes with the same name change. The similarity scores between the merged node and the nodes with the same name are therefore recalculated, and the weights of the link edges connected to the merged node are updated. In this embodiment, the nodes of $\langle A_1, A_2 \rangle$ are merged into $A_{1,2}$, and the similarity scores between $A_{1,2}$ and the nodes with the same name are calculated. The weight of the link edge $\langle C_1, C_2 \rangle$, which exists between neighbor nodes of the merged node, is also updated. The updated similarity scores Simscore are shown in Table 4, and the updated hierarchical network $G$ is shown in FIG. 5.
Table 4 Link edge weights from the second calculation, after merging nodes
Simscore | $A_{1,2}$ | $A_3$ | $A_4$ | $A_5$ | $A_6$ | $A_7$ | $C_1$ | $C_2$ | $C_3$ |
---|---|---|---|---|---|---|---|---|---|
$A_{1,2}$ | | | | | | | 0 | 0 | 0 |
$A_3$ | 0.00 | | | | | | 0 | 0 | 0 |
$A_4$ | 0.22 | 0.00 | | | | | 0 | 0 | 0 |
$A_5$ | 0.00 | 0.00 | 0.00 | | | | 0 | 0 | 0 |
$A_6$ | 0.00 | 0.00 | 0.00 | 0.00 | | | 0 | 0 | 0 |
$A_7$ | 0.00 | 0.00 | 0.00 | 0.00 | 0.53 | | 0 | 0 | 0 |
$C_1$ | 0 | 0 | 0 | 0 | 0 | 0 | | | |
$C_2$ | 0 | 0 | 0 | 0 | 0 | 0 | 0.63 | | |
$C_3$ | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | |
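Why merging one pair of nodes can raise the weight of a different link edge can be illustrated with neighbor sets. The adjacency below is hypothetical (Table 1 is not reproduced here): it assumes that $A_1$ co-authored with $C_1$ and $A_2$ with $C_2$. Under that assumption, merging $A_1$ and $A_2$ gives $C_1$ and $C_2$ one identical neighbor, which would be consistent with the weight of $\langle C_1, C_2 \rangle$ rising from 0.53 in Table 3 to 0.63 in Table 4, one $S_{num}$ step of 0.1.

```python
def merge_nodes(adjacency, a, b, merged):
    """Return a new adjacency map in which a and b are replaced by merged."""
    def rename(n):
        return merged if n in (a, b) else n

    result = {}
    for node, neighbors in adjacency.items():
        # The merged node inherits the neighbors of both original nodes.
        result.setdefault(rename(node), set()).update(rename(n) for n in neighbors)
    result[merged].discard(merged)  # drop a self-loop if a and b were neighbors
    return result


def co_adj(adjacency, x, y):
    """Number of identical neighbor nodes shared by x and y."""
    return len(adjacency[x] & adjacency[y])


# Hypothetical co-authorship edges, for illustration only.
before = {"A1": {"C1"}, "A2": {"C2"}, "C1": {"A1"}, "C2": {"A2"}}
after = merge_nodes(before, "A1", "A2", "A1,2")

shared_before = co_adj(before, "C1", "C2")
shared_after = co_adj(after, "C1", "C2")
```

This is also why the collaboration similarity score is only meaningful once the collaborators themselves have been judged: before the merge, $A_1$ and $A_2$ count as different neighbors.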
In Table 4, the link edge with the largest current weight is $\langle C_1, C_2 \rangle$. Comparing its weight with the set threshold (0.20): 0.63 > 0.20. The nodes $C_1$ and $C_2$ at the two ends of the link edge are therefore merged into $C_{1,2}$, the link edge weights between the merged node and the nodes with the same name are updated, and the link edge weights among the neighbor nodes of the merged node are updated. In this embodiment, only the link edge weight between $C_{1,2}$ and $C_3$ needs to be updated. The updated similarity scores Simscore are shown in Table 5, and the updated hierarchical network $G$ is shown in FIG. 6.
Table 5 Link edge weights from the third calculation, after merging nodes
Simscore | $A_{1,2}$ | $A_3$ | $A_4$ | $A_5$ | $A_6$ | $A_7$ | $C_{1,2}$ | $C_3$ |
---|---|---|---|---|---|---|---|---|
$A_{1,2}$ | | | | | | | 0 | 0 |
$A_3$ | 0.00 | | | | | | 0 | 0 |
$A_4$ | 0.22 | 0.00 | | | | | 0 | 0 |
$A_5$ | 0.00 | 0.00 | 0.00 | | | | 0 | 0 |
$A_6$ | 0.00 | 0.00 | 0.00 | 0.00 | | | 0 | 0 |
$A_7$ | 0.00 | 0.00 | 0.00 | 0.00 | 0.53 | | 0 | 0 |
$C_{1,2}$ | 0 | 0 | 0 | 0 | 0 | 0 | | |
$C_3$ | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | |
In Table 5, the link edge with the largest current weight is $\langle A_6, A_7 \rangle$. Comparing its weight with the set threshold (0.20): 0.53 > 0.20. The nodes $A_6$ and $A_7$ at the two ends of the link edge are therefore merged into $A_{6,7}$, the link edge weights between the merged node and the nodes with the same name are updated, and the link edge weights among the neighbor nodes of the merged node are updated. In this embodiment, the updated similarity scores Simscore are shown in Table 6, and the updated hierarchical network $G$ is shown in FIG. 7.
Table 6 Link edge weights from the fourth calculation, after merging nodes
Simscore | $A_{1,2}$ | $A_3$ | $A_4$ | $A_5$ | $A_{6,7}$ | $C_{1,2}$ | $C_3$ |
---|---|---|---|---|---|---|---|
$A_{1,2}$ | | | | | | 0 | 0 |
$A_3$ | 0.00 | | | | | 0 | 0 |
$A_4$ | 0.22 | 0.00 | | | | 0 | 0 |
$A_5$ | 0.00 | 0.00 | 0.00 | | | 0 | 0 |
$A_{6,7}$ | 0.00 | 0.00 | 0.00 | 0.00 | | 0 | 0 |
$C_{1,2}$ | 0 | 0 | 0 | 0 | 0 | | |
$C_3$ | 0 | 0 | 0 | 0 | 0 | 0.00 | |
In Table 6, the link edge with the largest weight is $\langle A_{1,2}, A_4 \rangle$. Comparing its weight with the set threshold (0.20): 0.22 > 0.20. The nodes $A_{1,2}$ and $A_4$ at the two ends of the link edge are therefore merged into $A_{1,2,4}$, the link edge weights between the merged node and the nodes with the same name are updated, and the link edge weights among the neighbor nodes of the merged node are updated. In this embodiment, the updated similarity scores Simscore are shown in Table 7, and the updated hierarchical network $G$ is shown in FIG. 8.
Table 7 Link edge weights from the fifth calculation, after merging nodes
Simscore | $A_{1,2,4}$ | $A_3$ | $A_5$ | $A_{6,7}$ | $C_{1,2}$ | $C_3$ |
---|---|---|---|---|---|---|
$A_{1,2,4}$ | | | | | 0 | 0 |
$A_3$ | 0.00 | | | | 0 | 0 |
$A_5$ | 0.00 | 0.00 | | | 0 | 0 |
$A_{6,7}$ | 0.00 | 0.00 | 0.00 | | 0 | 0 |
$C_{1,2}$ | 0 | 0 | 0 | 0 | | |
$C_3$ | 0 | 0 | 0 | 0 | 0.00 | |
In Table 7, the link edge with the largest current weight has weight 0, which is smaller than the set threshold, so the final identification result is output, in which merged nodes correspond to the same person. As shown in FIG. 8, the merged nodes are determined to be the same person.
The final identification result is: "Liu Jing": $\{\{A_1, A_2, A_4\}, \{A_6, A_7\}, \{A_3\}, \{A_5\}\}$; "Liu Fang": $\{\{C_1, C_2\}, \{C_3\}\}$. The accuracy of the identification result is 100%.
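The merge sequence of this embodiment can be replayed from the tabulated weights. The sketch below does not recompute any similarity score: it reads the nonzero link edge weights of Tables 3 to 7 as given, applies the threshold of 0.20 round by round, and checks that the resulting groups match the final identification result. The helper names `merge_label` and `replay` are illustrative.

```python
# Nonzero link edge weights per round, copied from Tables 3 to 7.
# All other same-name pairs have weight 0.00 and never exceed the threshold.
TABLES = [
    {("A1", "A2"): 0.56, ("A1", "A4"): 0.11, ("A2", "A4"): 0.21,
     ("A6", "A7"): 0.53, ("C1", "C2"): 0.53},                       # Table 3
    {("A1,2", "A4"): 0.22, ("A6", "A7"): 0.53, ("C1", "C2"): 0.63},  # Table 4
    {("A1,2", "A4"): 0.22, ("A6", "A7"): 0.53},                      # Table 5
    {("A1,2", "A4"): 0.22},                                          # Table 6
    {},                                                              # Table 7
]


def merge_label(a, b):
    """Combine two labels such as 'A1,2' and 'A4' into 'A1,2,4'."""
    indices = sorted(int(i) for i in (a[1:] + "," + b[1:]).split(","))
    return a[0] + ",".join(str(i) for i in indices)


def replay(tables, nodes, threshold=0.20):
    """Apply one merge per table while the maximum weight exceeds threshold."""
    clusters, merges = set(nodes), []
    for table in tables:
        if not table:
            break  # every remaining weight is 0
        (a, b), weight = max(table.items(), key=lambda item: item[1])
        if weight <= threshold:
            break
        clusters -= {a, b}
        clusters.add(merge_label(a, b))
        merges.append((a, b, weight))
    return clusters, merges


nodes = [f"A{k}" for k in range(1, 8)] + ["C1", "C2", "C3"]
clusters, merges = replay(TABLES, nodes)
```

The replay performs four merges, in the order $\langle A_1, A_2 \rangle$, $\langle C_1, C_2 \rangle$, $\langle A_6, A_7 \rangle$, $\langle A_{1,2}, A_4 \rangle$, and stops at Table 7.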
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (4)
1. A duplicate name author identification method based on a hierarchical network is characterized by comprising the following steps:
step 1, first, for a given document data set $P$, dividing $P$ according to the publication time of each document in $P$, so that the documents published in each time period form a subset, obtaining $P = \{P_1, P_2, P_3, \dots, P_n\}$; then, for each $P_i$, building a corresponding author collaboration network $G_i$, generating a hierarchical network $G = \{G_1, G_2, G_3, \dots, G_n\}$, where $i = 1, 2, \dots, n$, the nodes in the author collaboration network represent document authors, the edges in the author collaboration network represent collaboration relationships between authors, and the node attributes include the hierarchical information to which the node belongs and document information; then, connecting link edges between the duplicate-name author nodes in $G$; finally, traversing each link edge, and if the authors corresponding to the two end nodes of a link edge have accurate information and are determined to be the same person, merging the nodes at the two ends of the link edge, wherein the attributes of the merged node include all hierarchical information from before the merge;
step 2, traversing each link edge in the hierarchical network G, calculating a similarity score Simscore of the duplicate author nodes at two ends of the link edge, and assigning the calculated similarity score Simscore as a corresponding link edge weight; the similarity score Simscore is obtained by weighting a plurality of sub-similarity scores by hierarchical information, wherein the sub-similarity scores comprise the similarity score of the affiliated organization, the similarity score of the document text, the similarity score of the affiliated hierarchical information and the similarity score of the cooperative relationship;
step 3, among the link edge weights in the hierarchical network $G$, finding the link edge with the maximum weight and judging whether its weight is greater than a set threshold; if so, merging the nodes at the two ends of the link edge with the maximum weight, wherein the attributes of the merged node include all hierarchical data from before the merge, updating the link edge weights of the merged node and its neighbor nodes according to the method in step 2, and then executing step 3 iteratively until the maximum link edge weight is less than or equal to the set threshold, and outputting the duplicate-name author identification result, in which the authors corresponding to the nodes merged in the hierarchical network are the same person.
2. The hierarchical network-based duplicate-name author identification method as claimed in claim 1, wherein in step 1, in the process of building the corresponding author collaboration network $G_i$ for each $P_i$, for each document $p$ in $P_i$ the nodes are labeled in the form of author name plus subscript; $G_i$ is composed of multiple complete graphs, and the number of complete graphs equals the number of documents in $P_i$.
3. The method for identifying a duplicate author based on a hierarchical network as set forth in claim 1, wherein in the step 1, the accurate information includes mailbox information or document list information of an author's personal homepage.
4. The method as claimed in claim 1, wherein in step 2, for a duplicate author name $r$, the duplicate-name author nodes at the two ends of a link edge are denoted $r_i$ and $r_j$, with affiliation similarity score $S_{aff}$, document text similarity score $S_{txt}$, hierarchical information similarity score $S_h$, and collaboration similarity score $S_{num}(co\_adj)$, wherein:
the affiliation similarity score $S_{aff}$ and the document text similarity score $S_{txt}$ are obtained by representing the affiliations and the document texts of nodes $r_i$ and $r_j$ as word-frequency vectors and calculating the cosine similarity of the two affiliation vectors and of the two document text vectors, respectively; for two word-frequency vectors $u$ and $v$, the cosine similarity is calculated as:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
when calculating the hierarchical information similarity score $S_h$ of $r_i$ and $r_j$, the minimum difference $h$ between the hierarchical information of node $r_i$ and that of node $r_j$ is selected, and the value of $S_h$ is determined from it;
in the collaboration similarity score $S_{num}(co\_adj)$, $co\_adj$ denotes the number of identical neighbor nodes shared by nodes $r_i$ and $r_j$; the more identical neighbor nodes there are, the larger the value of $S_{num}(co\_adj)$;
[Formula for Simscore: not preserved in this text]
wherein $\omega_m$ denotes the set coefficient of the similarity score $S_m$, and $S_m$ is $S_{aff}$ or $S_{txt}$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910030797.2A CN109753662B (en) | 2019-01-14 | 2019-01-14 | Duplicate name writer identification method based on hierarchical network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109753662A (en) | 2019-05-14 |
CN109753662B (en) | 2023-01-06 |
Family
ID=66404608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910030797.2A Active CN109753662B (en) | 2019-01-14 | 2019-01-14 | Duplicate name writer identification method based on hierarchical network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109753662B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110889467A (en) * | 2019-12-20 | 2020-03-17 | 中国建设银行股份有限公司 | Company name matching method and device, terminal equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105653590A (en) * | 2015-12-21 | 2016-06-08 | 青岛智能产业技术研究院 | Name duplication disambiguation method of Chinese literature authors |
CN106294677A (en) * | 2016-08-04 | 2017-01-04 | 浙江大学 | Name disambiguation method for Chinese authors in English literature |
JP2018088093A (en) * | 2016-11-28 | 2018-06-07 | 日本電信電話株式会社 | Device, method, and program of object candidate region extraction |
Also Published As
Publication number | Publication date |
---|---|
CN109753662A (en) | 2019-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111191466B (en) | | Homonymous author disambiguation method based on network characterization and semantic characterization |
CN110688474B (en) | | Embedded representation obtaining and citation recommending method based on deep learning and link prediction |
US10133807B2 (en) | | Author disambiguation and publication assignment |
Xie et al. | | Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb |
CN102508859A (en) | | Advertisement classification method and device based on webpage characteristic |
CN107590128B (en) | | Paper homonymy author disambiguation method based on high-confidence characteristic attribute hierarchical clustering method |
CN108647322B (en) | | Method for identifying similarity of mass Web text information based on word network |
CN107194672B (en) | | Review distribution method integrating academic expertise and social network |
CN110264372B (en) | | Topic community discovery method based on node representation |
CN111222318A (en) | | Trigger word recognition method based on two-channel bidirectional LSTM-CRF network |
CN113158041B (en) | | Article recommendation method based on multi-attribute features |
CN115563313A (en) | | Knowledge graph-based document book semantic retrieval system |
CN109753662B (en) | | Duplicate name writer identification method based on hierarchical network |
CN116244497A (en) | | Cross-domain paper recommendation method based on heterogeneous data embedding |
CN112836008B (en) | | Index establishing method based on decentralized storage data |
CN114896514B (en) | | Web API label recommendation method based on graph neural network |
Hu et al. | | Cnn-iets: A cnn-based probabilistic approach for information extraction by text segmentation |
CN114385927B (en) | | Scientific research partner recommendation method based on multi-similarity fusion |
JP2009176072A (en) | | System, method and program for extracting element group |
CN115730599A (en) | | Chinese patent key information identification method based on structBERT, computer equipment, storage medium and program product |
CN110727833B (en) | | Multi-view learning-based graph data retrieval result optimization method |
CN112925839A (en) | | Incremental data set-oriented knowledge discovery method and discovery device |
CN111984776B (en) | | Mechanism name standardization method based on word vector model |
Yin et al. | | Heterogeneous information network model for equipment-standard system |
CN113641783B (en) | | Content block retrieval method, device, equipment and medium based on key sentences |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||