CN115688737A - Paper cold start disambiguation method based on feature extraction and fusion - Google Patents

Paper cold start disambiguation method based on feature extraction and fusion

Info

Publication number: CN115688737A
Application number: CN202211382121.8A
Authority: CN (China)
Prior art keywords: model, features, paper, training, semantic
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 张日崇, 刘德志, 赵智安, 周安吉
Current assignee: Beihang University
Original assignee: Beihang University
Priority and filing date: 2022-11-07
Application filed by Beihang University; priority to CN202211382121.8A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention realizes a paper cold-start disambiguation method based on feature extraction and fusion. The method is divided into three parts. First, a deep fine-grained paper semantic feature extraction model is constructed: BERT is introduced, and semantic features are extracted through a BERT variant and distillation model, providing sufficient information support for the whole semantic representation framework. Second, a structural feature extraction model is constructed that can model the discriminative-power differences of features and the relationships between them. Finally, a feature fusion model is constructed that can fully exploit multiple structured features while avoiding over-smoothed network-node representations. The method designs a paper author disambiguation algorithm framework with high accuracy, high universality, and extensibility for large-scale data sets and practical engineering applications.

Description

Paper cold start disambiguation method based on feature extraction and fusion
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a paper cold-start disambiguation method based on feature extraction and fusion.
Background
As carriers that help researchers and institutions share, disseminate, organize, and utilize literature and talent resources, online academic management and retrieval systems such as Google Scholar, dblp, and AMiner face an increasingly severe author-name-ambiguity problem as digital resources and the research workforce keep expanding. Statistically, there are 151,671 distinct common surnames and 5,163 distinct common first names in the U.S., and the 300 most common name combinations cover about 78.74% of the U.S. population. By the birthday paradox, for common name combinations the presence of same-named researchers at the same university is almost inevitable. The situation is equally common for Chinese researchers and is not effectively alleviated by the large number of Chinese characters and rich name combinations: to ensure wider dissemination and sharing of literature, a large number of papers are published in English, with Chinese researchers' names rendered as Pinyin or Pinyin abbreviations, which greatly increases the probability of identical names. Meanwhile, the high mobility of academic talent and the scarcity and inconsistency of structured information in scientific literature add great challenges to the paper author disambiguation task. The incompleteness of data fields on researchers' personal homepages, the heterogeneity of web page structures, and infrequent updates also greatly reduce the availability of exogenous data.
An accurate and complete set of a researcher's publications is a prerequisite for quantitative evaluation, precise retrieval, and effective management. Although industry and academia have devoted much research to the same-name disambiguation task, the problem remains far from solved at the ever-growing scale of academic platform data. Owing to the limitations of existing paper assignment algorithms, large-scale academic management systems contain a large number of assignment errors. According to official statistics, AMiner has about 130,000,000 author profiles and over 200,000,000 papers, which makes same-name situations extremely complicated. Many traditional structural features, typified by the number of shared co-authors, fail to varying degrees on large-scale literature data sets, while shallow coarse-grained semantic representation models cannot capture deep features such as a paper's research sub-field or its sentence-level expression patterns; as a result, existing paper representation and author disambiguation methods achieve low accuracy on large-scale real data sets. To ensure algorithm reliability in practical engineering applications, scientific literature is often clustered and assigned with manually crafted high-quality features and complex rules, which severely increases expert labor and later maintenance costs and makes the universality of the algorithm hard to guarantee.
Current research on the paper author disambiguation task falls into two sub-categories: cold-start disambiguation and incremental disambiguation. The cold-start disambiguation task clusters all papers under the same name into several collections, each written by a different author entity, addressing the situation where a specific name has a large number of unassigned papers in the early stage of building an academic management system. The incremental disambiguation task assigns newly added literature to existing paper author entities on the basis of the cold-start result; that is, cold-start disambiguation is a clustering problem and incremental disambiguation a classification problem. Research currently focuses mainly on cold-start disambiguation: on one hand, it is the foundation of incremental disambiguation; on the other hand, both tasks center on obtaining better paper representations from existing data features so that high-accuracy clustering or classification can be achieved in the feature space, so their research content overlaps heavily, and the clustering task evaluates the quality of paper representations more thoroughly.
In recent years, with the rapid growth in the number of scientific documents and the advance of digitization in scientific research, applications such as scientific literature retrieval systems, academic talent management systems, and academic network mining have made the paper author disambiguation problem both harder to solve and more urgent. After studying large-scale data sets such as AMiner-18, the invention finds that existing author disambiguation methods mainly suffer from the following three problems:
(1) Deep semantic feature information, such as research sub-field and sentence-pattern expression features, cannot be extracted well. Existing paper author disambiguation frameworks generally extract semantic features with word2vec or doc2vec plus manually defined rules, which can exploit unsupervised information in the validation and test sets and is simple and effective on small-scale data sets. However, the extracted semantics are shallow and coarse, ignoring the influence of context and word order within sentences; only a paper's general research field can be modeled, while fine-grained research directions and syntactic sentence patterns are hard to extract, leaving little room for model optimization. On large-scale data sets, same-named authors may work in similar research fields, and an author's institutional affiliation may change frequently due to talent mobility, so the fine-grained research directions and the writing and expression characteristics of the paper content become important evidence for judging whether two papers share an author.
(2) The discriminative-power differences of structured features and the relationships between features are not modeled. In method frameworks based on graph-structure feature fusion, semantic representation vectors serve as nodes and structural features between papers serve as edges. However, to avoid over-smoothing during graph convolution or random walks, edge weights are usually thinned to 0 or 1, indicating only whether an edge exists. This ignores the discriminative-power differences of different structured features, effectively assuming all structured features are equally discriminative, which is clearly not the case. Moreover, the logical relationships between structured features are not represented in the adjacency matrix: when papers are connected by two or more structured relations, the edge weights are simply summed or overwritten. The absence of these two kinds of information harms the accuracy of subsequent feature fusion and clustering; such methods also rely heavily on manually constructed high-precision heterogeneous graphs and adjacency matrices, so recall is hard to guarantee and they lack universality and transferability across data sets.
(3) Training-set category information is not used during feature fusion, the accuracy of the manually constructed network structure is relied on excessively, and the number of structural features that can be introduced while avoiding over-smoothing is limited. In existing frameworks, feature fusion is mostly realized by learning network representations: with semantic features as nodes and structural features as edges, a graph representation model is trained to fuse the two. However, both meta-path-based random walks and prior graph-convolution training constructions are unsupervised and make poor use of the prior knowledge in the training set. Experiments show that such algorithms are highly sensitive to the accuracy of the network structure: introducing low-precision but high-recall structural features degrades both the learned representations and the final clustering. In practice, therefore, only one or two high-precision structural features can be constructed and used, and it is difficult to exploit diverse paper features comprehensively when many traditional features fail to varying degrees on large-scale data sets, which is a severe limitation.
Disclosure of Invention
Therefore, the invention first proposes a paper cold-start disambiguation method based on feature extraction and fusion, which is divided into three parts:
first, a deep fine-grained paper semantic feature extraction model is constructed: taking as input the set of papers under a same author name to be disambiguated, BERT is introduced and semantic features are extracted through a BERT variant and distillation model to provide sufficient information support for the whole semantic representation framework; downstream learning tasks are constructed in pairwise (two-tuple) and triplet forms respectively for training, so that the model learns semantic features that are relevant to the author disambiguation task and strongly discriminative; adversarial training is combined to avoid overfitting; and the network structure, training process, and optimization target are optimized specifically for the characteristics of the paper author disambiguation task;
then, a structural feature extraction model is constructed that can model the discriminative-power differences of features and the relationships between them: the clustering task is converted into a binary classification task to construct training samples, decision-tree-based ensemble learning and other machine learning methods are used for structural representation training, and the category information under each author name to be disambiguated in the training set is used so that the model learns and models the contribution differences of the paper structural features and the logical relationships between them;
finally, a feature fusion model is constructed that can fully exploit multiple structured features while avoiding over-smoothed network-node representations: based on existing unsupervised graph convolution methods, prior knowledge of the training set is introduced through pairwise and triplet downstream task constructions to train a graph convolutional neural network, yielding a usable model; the papers are finally divided into per-author paper clusters, with the papers in each cluster belonging to the same author.
The BERT variant and distillation model are implemented concretely as follows: the 6-layer MiniLM model is selected, and its goals are achieved through three design choices: first, unlike layer-to-layer one-to-one distillation, the Student model distills only the self-attention distribution of the Teacher model's last Transformer layer rather than the complete distributions; second, scaled dot-product operations are added among the queries, keys, and values; third, a teaching-assistant mechanism is introduced to bridge the distillation from a large-scale pre-trained Teacher model to a very small Student model.
The downstream learning task is designed as follows: based on the clustering information in the training set, training samples are constructed with the following algorithm. For each paper $p_i^a$ under the name of the author to be disambiguated, a training sample triplet $i$ is constructed with $p_i^a$ as the anchor sample: a paper written by the same author as $p_i^a$ is first randomly selected as the positive sample $p_i^+$, then a paper written by a different author is randomly selected as the negative sample $p_i^-$, forming the $i$-th triplet training sample $(p_i^a, p_i^+, p_i^-)$;
a corresponding model architecture is built for the triplet downstream task: using a structure similar to the twin (Siamese) network, three completely identical sub-neural networks are constructed, sharing the model parameters $W$ during training;
in the design of the loss function, the neural network formed by the BERT and pooling layers is defined as $f$, the three sub-networks share the weights $W$, the cosine distance is used to measure the semantic representation vectors of two papers, and on this basis the optimization target $\mathrm{Loss}_{tri}$ under the triplet downstream task construction is computed as

$\mathrm{Loss}_{tri} = \max\big(\operatorname{cosdist}(f(p_i^a), f(p_i^+)) - \operatorname{cosdist}(f(p_i^a), f(p_i^-)) + \varepsilon,\ 0\big)$

where $\varepsilon$ is the margin.
The adversarial training method is as follows: from the perspective of optimization theory, a gradient-based adversarial training scheme is designed. The original sample is fed through the model in a forward pass to compute the loss; the gradient is then back-propagated to the original input by the chain rule, a perturbation is generated along the positive gradient of the sample input vector with respect to the loss, and the perturbation is added to the original sample to obtain an adversarial sample; the adversarial sample vector is then fed through a forward pass and added to the optimization target, which finally updates the model parameters by gradient descent.
The structural feature extraction model first selects structural features, dividing the general fields usable for the paper author disambiguation task into two broad categories: unstructured semantic features and structured relational features. Unstructured semantic features are text features with strong semantic information, comprising three types: paper titles, paper abstracts, and domain keywords; they are represented as semantic feature vectors by the semantic representation model, either separately or after concatenation. Structured relational features are fields whose text carries little value on its own and shows its value only when the corresponding fields of two papers are compared, comprising four types: author names, author institutions, conference names, and publication years;
before structural feature extraction, the IDF of each field value is computed, i.e., derived from the frequency with which the value appears in the corresponding field across all documents, indicating how discriminative that value is; structural features are then extracted, with relation weights described quantitatively by continuous values based on the structured relational features, and the severe over-smoothing problem of traditional methods is resolved in the subsequently designed feature fusion method;
for the graph-structure-based feature fusion algorithm designed below, papers' semantic representation vectors serve as nodes and structured relational features serve as edges. The computation of each inter-paper relation feature is defined first; when the computed relation feature between two papers is non-zero, an edge of that relation type is constructed between the two papers' nodes, with the computed value as the edge weight. When more than one type of relation feature is constructed, different types of edges exist between nodes, i.e., the paper network built by the structural feature extraction algorithm is a heterogeneous graph.
The structural representation training method is as follows: a LightGBM (LGBM) model is introduced to model the differing contributions of the structured features and the relationships between them. By defining a binary-classification downstream task, a series of gradient-boosted trees is constructed to predict whether two papers are written by the same author from the subgraph structures obtained by splitting the paper relation heterogeneous graph network by edge type; the contribution differences and relationships among different structural features are modeled in the process, and the model output is the strength of the relation between the two papers.
The feature fusion model is constructed as follows: based on the paper relation network obtained from semantic feature extraction and relational feature extraction, with paper semantic representation vectors as nodes and inter-paper relation weights as edges, new node representation vectors that preserve the original graph structure in the feature space are trained, fusing the papers' unstructured semantic features and structured relational features; final paper clustering is performed in the resulting node representation space, with the distance between paper nodes in that space as the clustering basis;
the specific method comprises the following steps: assume that the set of all authors to be disambiguated is
Figure BDA0003928770780000051
Given the name a of the author to be disambiguated, N papers are shared to form a paper relation network
Figure BDA0003928770780000052
Wherein
Figure BDA0003928770780000053
A paper node set in the graph, epsilon is a set of edges representing the relationship between papers in the graph, and a paper node semantic representation matrix is obtained according to a semantic feature extraction model
Figure BDA0003928770780000054
Wherein
Figure BDA0003928770780000055
D is the dimension of the semantic feature space, and the relationship matrix between the papers is obtained according to the structural feature extraction model
Figure BDA0003928770780000061
First passes a threshold value epsilon s Thinning the adjacent matrix to obtain a thinned adjacent matrix
Figure BDA0003928770780000062
I.e. for any two papers
Figure BDA0003928770780000063
And thesis
Figure BDA0003928770780000064
Comprises the following steps:
Figure BDA0003928770780000065
then $Y$ and $A^*$ are fed into a two-layer GCN graph convolution model $g$, with first convolution layer parameters $W^{(1)} \in \mathbb{R}^{D \times D_1}$ and second convolution layer parameters $W^{(2)} \in \mathbb{R}^{D_1 \times D_2}$; ReLU is used as the activation function between them, and a dropout layer is added after each graph convolution during training to slow the onset of overfitting, finally yielding the feature-fused paper nodes;
in a specific implementation, a pairwise (two-tuple) downstream task construction is adopted: paper pairs $p_i^a$ and $p_j^a$ are selected on the training set to construct training samples $(p_i^a, p_j^a, t_{i,j})$, where $t_{i,j}$ is the label indicating whether the two papers are written by the same author entity, $t_{i,j} = 1$ if the authors are the same and $t_{i,j} = 0$ otherwise; the loss of a single sample is the binary cross-entropy

$\mathcal{L}_{i,j} = -\big[t_{i,j}\log \hat{t}_{i,j} + (1 - t_{i,j})\log(1 - \hat{t}_{i,j})\big]$

where $\hat{t}_{i,j}$ is the predicted probability that the two papers share an author.
The technical effects realized by the invention are as follows:
By constructing a deep fine-grained paper semantic feature extraction model, a structural feature extraction model that can model the discriminative-power differences of features and the relationships between them, and a feature fusion model that can fully exploit multiple structured features while avoiding over-smoothed network-node representations, the invention researches and realizes a paper author disambiguation algorithm framework with high accuracy, high universality, and extensibility for large-scale data sets and practical engineering applications.
Drawings
FIG. 1 Model architectures under the pairwise and triplet downstream tasks
FIG. 2 Gradient-based adversarial training method
FIG. 3 Paper structured feature extraction and relation network construction
FIG. 4 Ensemble-learning-based aggregation of multiple structured features
FIG. 5 GCN-based supervised feature fusion training procedure
Detailed Description
The following is a preferred embodiment of the present invention, further described with reference to the accompanying drawings; the present invention is not, however, limited to this embodiment.
The invention provides a paper cold start disambiguation method based on feature extraction and fusion. The method mainly comprises three parts:
(1) Constructing a deep fine-grained paper semantic feature extraction model. By introducing BERT and its variants as deep neural network models, downstream tasks are constructed and trained in pairwise and triplet forms respectively, and the model's network structure, training process, and optimization target are optimized specifically for the paper author disambiguation task. Deep semantic characteristics such as a paper's research sub-field and sentence-pattern expression features are thereby extracted and represented, improving the discriminability of semantic features on large-scale data sets and under complex same-name conditions; a clear decision boundary is expected in the feature space, so that the heterogeneous graph structure has high-quality node representations, guaranteeing the accuracy of subsequent feature fusion.
(2) Constructing a structural feature extraction model that can model the discriminative-power differences of features and the relationships between them. The clustering task is converted into a binary classification task to construct training samples; decision-tree-based ensemble learning and other machine learning methods are used, and the category information under each author name to be disambiguated in the training set lets the model learn and model the contribution differences of the paper structural features and the logical relationships between them, thereby constructing a dense heterogeneous graph network structure that can flexibly adapt to different data sets and fuse multiple structured features, used for the subsequent fusion of paper semantic and structural features.
(3) Constructing a feature fusion model that can fully exploit multiple structural features while avoiding over-smoothed network-node representations. Based on existing unsupervised graph convolution methods, prior knowledge of the training set is introduced through pairwise and triplet downstream task constructions to train the graph convolutional neural network.
Deep fine-grained paper semantic feature extraction model
Papers often contain deep semantic features that support author disambiguation, including research sub-field information, sentence-pattern characteristics, and expression logic. However, current mainstream frameworks are weak at recognizing a paper's research sub-field.
The invention proposes a model based on a pre-trained model combined with downstream tasks to extract unstructured semantic features: semantic features are extracted through a BERT variant and distillation model to provide sufficient information support for the whole semantic representation framework; the designed downstream tasks then teach the model semantic features that are relevant to the author disambiguation task and strongly discriminative, and adversarial training is combined to avoid overfitting.
BERT variants and distillation models
The invention uses a BERT variant and distillation model to extract deep fine-grained semantic features of paper content; the deep, effective network structures and pre-training methods developed in NLP help capture the deep semantic features in papers, providing sufficient information support for the whole semantic representation framework. The 6-layer MiniLM model is selected after experimental comparison.
MiniLM is a compression method for Transformer-based pre-trained models, proposed to address the slow inference, large memory footprint, and tendency to overfit of basic BERT-style pre-trained models on small-scale data sets and data sets with large feature spans. It achieves this mainly through three designs. First, unlike layer-to-layer one-to-one distillation, the Student model does not learn the Teacher model's complete self-attention distributions but distills only the attention distribution of the last Transformer layer, removing the one-to-one mapping constraint between Teacher and Student so that the Student's structure and layer count can be more flexible. Second, to let the Student imitate the Teacher's self-attention behavior more deeply, scaled dot-product operations are added among the queries, keys, and values, converting hidden layers of different sizes into relation matrices of the same size without introducing extra parameters, further increasing the flexibility of the Student's hidden-layer dimension. Third, a teaching-assistant mechanism is introduced to bridge the distillation from a large-scale pre-trained Teacher model to a very small Student model, further improving performance even when the number of Transformer layers and the hidden-layer dimension are both reduced by more than half.
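As a concrete illustration of this step, the following is a minimal sketch of encoding paper text with a 6-layer MiniLM checkpoint via Hugging Face transformers; the checkpoint name, the title-plus-abstract input, and mean pooling are our assumptions, not choices fixed by the invention.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed 6-layer MiniLM checkpoint; the patent only specifies "6-layer MiniLM".
MODEL_NAME = "nreimers/MiniLM-L6-H384-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode_papers(texts):
    """Encode paper title+abstract strings into semantic vectors (mean pooling)."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state         # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # average real tokens only

vecs = encode_papers(["Author disambiguation with graph convolutional networks. ..."])
print(vecs.shape)  # e.g. torch.Size([1, 384])
```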
Downstream learning task design
After the model has acquired general semantic and linguistic features through pre-training and distillation, a supervised downstream task must be defined on the basis of the training-set clustering task so that the model learns semantic features that are relevant to the author disambiguation task and more strongly discriminative.
The invention adopts a twin (Siamese) network in model training; the model structure, shown on the left of FIG. 1, is a coupled architecture built on a pair of neural networks with identical structure and parameters.
The construction of the downstream learning task on the right of FIG. 1 follows the treatment of the face recognition task in the image field. In the paper semantic feature disambiguation task, one author may work in several sub-fields while different authors may share a sub-field, so the semantic similarity between papers written by the same author may be smaller than that between papers written by different authors. This closely resembles face recognition, where the similarity between frontal and profile images of one person may be smaller than that between frontal images of two different people.
Therefore, when constructing the downstream learning task, training samples are constructed from the clustering information in the training set with the following algorithm. For each paper $p_i^a$ under the name of the author to be disambiguated, a training sample triplet $i$ is constructed with $p_i^a$ as the anchor sample: a paper written by the same author as $p_i^a$ is first randomly selected as the positive sample $p_i^+$, then a paper written by a different author is randomly selected as the negative sample $p_i^-$, forming the $i$-th triplet training sample $(p_i^a, p_i^+, p_i^-)$.
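A minimal sketch of this sampling algorithm follows; the cluster data structure and variable names are our assumptions for illustration.

```python
import random

def build_triplets(clusters):
    """clusters: author-entity id -> list of paper ids under one ambiguous name.
    For every paper p_i^a, sample an (anchor, positive, negative) triplet."""
    triplets = []
    authors = list(clusters)
    for author, papers in clusters.items():
        if len(papers) < 2 or len(authors) < 2:
            continue  # need a same-author positive and a different-author negative
        for anchor in papers:
            positive = random.choice([p for p in papers if p != anchor])
            other = random.choice([a for a in authors if a != author])
            negative = random.choice(clusters[other])
            triplets.append((anchor, positive, negative))
    return triplets

toy = {"entity_1": ["p1", "p2", "p3"], "entity_2": ["p4", "p5"]}
print(build_triplets(toy)[:2])
```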
The model architecture corresponding to the triplet downstream task is shown on the right of FIG. 1. Similar to the twin network, the three sub-neural networks in the figure have completely identical structures and share the model parameters $W$ during training.
The design of the loss function follows the triplet loss used for the analogous problem in the image field. The neural network formed by the BERT and pooling layers is defined as $f$, and the three sub-networks in the network model share the weights $W$. The cosine distance is used to measure the semantic representation vectors of two papers, and on this basis the optimization target $\mathrm{Loss}_{tri}$ under the triplet downstream task construction is computed as

$\mathrm{Loss}_{tri} = \max\big(\operatorname{cosdist}(f(p_i^a), f(p_i^+)) - \operatorname{cosdist}(f(p_i^a), f(p_i^-)) + \varepsilon,\ 0\big)$

where $\varepsilon$ is the margin.
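A sketch of this objective in PyTorch follows, assuming the standard margin form of the triplet loss; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_cosine_loss(za, zp, zn, margin=0.3):
    """za/zp/zn: (B, D) embeddings f(p^a), f(p^+), f(p^-) from the weight-sharing subnetworks."""
    d_pos = 1.0 - F.cosine_similarity(za, zp, dim=-1)  # cosine distance to positive
    d_neg = 1.0 - F.cosine_similarity(za, zn, dim=-1)  # cosine distance to negative
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

za, zp, zn = (torch.randn(8, 384) for _ in range(3))
print(triplet_cosine_loss(za, zp, zn).item())
```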
Confrontational training method
Because scientific literature spans a very large number of research sub-fields, the feature spans of successive training batches differ greatly, making training unstable. The coexistence of multiple sub-fields per author and shared sub-fields across authors exacerbates this and causes early overfitting. Therefore, to stabilize training results on the validation and test sets and minimize overfitting, the invention introduces an adversarial training method to address these two problems.
Discussion in existing research and empirical experiments in related work show that although adversarial training in image processing aims to defend against gradient-based attacks on neural networks and reduces the trained model's generalization on non-adversarial samples, in natural language processing it can improve both generalization and robustness, acting similarly to regularization and thus having application value.
The adversarial training method introduced by the invention is a gradient-based scheme designed from the perspective of optimization theory.
The adversarial sample construction and training procedure is shown in FIG. 2. First, the original sample is fed through the model in a forward pass and the loss is computed; the gradient is then back-propagated to the original input by the chain rule, a perturbation is generated along the positive gradient of the sample input vector with respect to the loss, and the perturbation is added to the original sample to obtain an adversarial sample; the adversarial sample vector is then fed through a forward pass and added to the optimization target, which finally updates the model parameters by gradient descent.
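The described procedure matches the FGM style of adversarial training on the embedding layer; a minimal sketch under that assumption follows (the perturbation size eps, the embedding-layer target, and the classification-style model interface are our assumptions).

```python
import torch

def adversarial_step(model, loss_fn, batch, labels, eps=1e-2):
    """One FGM-style step: clean forward/backward, perturb the embeddings along
    the gradient, adversarial forward/backward, then restore the embeddings."""
    emb = model.get_input_embeddings().weight
    loss = loss_fn(model(**batch).logits, labels)
    loss.backward()                              # gradients now populate emb.grad
    grad = emb.grad.detach()
    delta = eps * grad / (grad.norm() + 1e-12)   # L2-normalized perturbation
    emb.data.add_(delta)                         # move input to the adversarial point
    adv_loss = loss_fn(model(**batch).logits, labels)
    adv_loss.backward()                          # accumulate adversarial gradients
    emb.data.sub_(delta)                         # restore original embeddings
    return loss.item(), adv_loss.item()
```

The optimizer then steps on the accumulated (clean plus adversarial) gradients, matching the combined optimization target described above.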
Thesis structure feature extraction model for feature relationship modeling
Structured feature selection
Although scientific literature published in different periodicals, conferences, and academic journals differs in the available field types, field formats, and field value ranges, their intersection allows the fields usable for the paper author disambiguation task to be divided into two broad categories: unstructured semantic features and structured relational features, as shown in the following table.
Field                 Value type         Category
Paper title           string             unstructured semantic feature
Paper abstract        string             unstructured semantic feature
Domain keywords       list of strings    unstructured semantic feature
Author names          list of strings    structured relational feature
Author institutions   list of strings    structured relational feature
Conference name       string             structured relational feature
Publication year      integer            structured relational feature

TABLE 1 General field information of scientific literature and its classification
The table lists the literature features common to most scientific literature system data sets; apart from individual missing field values, most data sets have these fields. For fairness of comparison, prior research in the author disambiguation field has also been developed on the features shown above. The paper title, paper abstract, and conference name fields are string-typed, one string per document; the domain keywords, author names, and author institutions are lists of strings, with a variable number of keywords (usually about five) and a widely varying author-list size, the author names and author institutions corresponding one-to-one by element order, i.e., each author has one institution; the publication year is an integer, typically four digits.
Unstructured semantic features are text features with strong semantic information; the table lists three types: paper titles, paper abstracts, and domain keywords. These can be represented as semantic feature vectors by the semantic representation model, separately or after concatenation, and are the basic features for judging whether papers are written by the same author, since an author's research fields are normally correlated and extremely large semantic spans rarely occur. Structured relational features are fields whose text carries little value on its own and shows its value only when the corresponding fields of two papers are compared; the table lists four types: author names, author institutions, conference names, and publication years. Taking author names as an example, no valuable semantic information can be extracted from a name string alone, but when the same name appears in the author lists of two papers, the papers very likely share a co-author, raising the probability that both were written by the same author to be disambiguated. These features are therefore classified as structured relational features to establish relations between papers. The classification in the table is only one possible scheme; some fields can belong to both categories, e.g., conference names and author institutions may carry semantic information indicating a paper's field and can also serve as semantic features.
Structured feature extraction
Based on the four structured relational features selected above, association relations among papers can be constructed. To model the strength of inter-paper relations as accurately as possible, the invention describes relation weights with continuous values, and the severe over-smoothing problem of traditional methods is resolved in the subsequently designed feature fusion method.
For the graph-structure-based feature fusion algorithm designed by the invention, the extraction of structured relational features is exactly the construction of the graph structure, as shown in FIG. 3. Papers' semantic representation vectors serve as nodes and structured relational features serve as edges. The computation of each inter-paper relation feature is defined first; when the computed relation feature between two papers is non-zero, an edge of that relation type is constructed between the two papers' nodes, with the computed value as the edge weight. When more than one type of relation feature is constructed, different types of edges exist between nodes in the graph network, i.e., the paper network built by the structural feature extraction algorithm is a heterogeneous graph.
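A sketch of this edge-construction protocol for one relation type follows; representing the relation as IDF-weighted co-author overlap is our illustrative assumption (the IDF computation itself is described next).

```python
from collections import defaultdict

def build_edges(papers, idf):
    """papers[i]['authors']: list of co-author names; idf: name -> IDF weight.
    Returns relation-type -> list of (i, j, weight) edges; an edge is added
    only when the computed relation feature is non-zero."""
    edges = defaultdict(list)
    for i in range(len(papers)):
        for j in range(i + 1, len(papers)):
            shared = set(papers[i]["authors"]) & set(papers[j]["authors"])
            w = sum(idf.get(name, 0.0) for name in shared)
            if w != 0.0:                     # materialize only non-zero relations
                edges["coauthor"].append((i, j, w))
    return edges
```

Other relation types (shared institution, shared venue, close publication years) would each contribute their own edge type to the heterogeneous graph in the same manner.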
Before structural feature extraction, the invention first computes the IDF of each field value from the frequency with which the value appears in the corresponding field across all documents, indicating how discriminative the value is for that field. The motivation is that the same word carries different semantics, and different power to discriminate whether two papers share an author, depending on the field it appears in: the word "conference", for example, is usually meaningless for judging a paper's field when it appears in the "VENUE" field, but when it appears in the "TITLE" field it probably indicates research related to conference-schedule formulation, online-conference software development, and the like, and is valuable for constructing relations between papers.
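A sketch of this field-value IDF computation follows; the exact log-scaled formula is our assumption, since the text only specifies grading a value's discriminative power by its document frequency.

```python
import math
from collections import Counter

def field_idf(papers, field):
    """idf(v) = log(N / df(v)), where df(v) counts documents whose `field` contains v."""
    n_docs = len(papers)
    df = Counter()
    for p in papers:
        values = p.get(field, [])
        if isinstance(values, str):
            values = [values]
        df.update(set(values))               # count each value once per document
    return {v: math.log(n_docs / c) for v, c in df.items()}

idf = field_idf([{"authors": ["Li Wei", "J. Smith"]},
                 {"authors": ["Li Wei"]}], "authors")
print(idf)  # {'Li Wei': 0.0, 'J. Smith': 0.69...} - rarer values score higher
```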
LGBM-based structured representation training
Existing methods perform graph-structure-based feature fusion with entirely hand-built adjacency matrices, i.e., the selected and extracted structural features are combined and simply superimposed. Data analysis readily shows, however, that different structural features differ greatly in how strongly they discriminate whether two papers are written by the same author, and that the judgment must examine multiple structural features jointly. Existing methods treat all structural features as equally important and logically independent, so building the graph network from all structural features leads to extremely high error rates in feature fusion and subsequent clustering.
To solve these two problems and escape the heavy dependence on manual feature engineering, so that the algorithm framework attains stronger universality and accuracy across data sets, the LightGBM (LGBM) model is introduced to model the differing contributions of the constructed structural features and the relationships between them.
FIG. 4 shows the LGBM-based structured representation training and application process. By defining a binary-classification downstream task, the model constructs a series of gradient-boosted trees that predict whether two papers are written by the same author from the subgraph structures corresponding to the different relation features obtained by splitting the paper relation heterogeneous graph network by edge type; the contribution differences and relationships among different structural features are modeled in the process, and the model output is the strength of the relation between the two papers.
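A minimal sketch of this pairwise classifier with LightGBM follows; the per-relation-type feature layout and hyperparameters are illustrative assumptions.

```python
import numpy as np
import lightgbm as lgb

# Each row holds the per-relation-type edge weights for one paper pair, e.g.
# [coauthor_w, institution_w, venue_w, year_w]; y is the same-author label.
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = rng.integers(0, 2, size=1000)

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X, y)

# Predicted probability of "same author" = fused relation strength of the pair.
pair_strength = clf.predict_proba(X[:5])[:, 1]
print(pair_strength)
```

The gradient-boosted trees split on individual relation features and on their combinations, which is how the differing discriminative power of each feature and the logical relations among features are captured.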
Graph-structure-based fusion of paper features
This part studies a graph embedding method based on the graph network structure: new node representation vectors that preserve the original graph structure in the feature space are trained, fusing the papers' unstructured semantic features with their structured relational features. Finally, papers are clustered in the resulting node representation space, with the distance between paper nodes in that space as the clustering basis.
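The invention does not fix a particular clustering algorithm for this final step; a sketch using agglomerative clustering with cosine distance, a common choice in this setting, follows (the distance threshold is an assumption).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

Z = np.random.rand(50, 64)           # fused paper-node representations (from the GCN)
clu = AgglomerativeClustering(
    n_clusters=None,                 # let the distance threshold decide cluster count
    distance_threshold=0.4,          # assumed cut-off on cosine distance
    metric="cosine",
    linkage="average",
)
labels = clu.fit_predict(Z)          # papers sharing a label -> one author entity
print(labels)
```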
When a graph convolutional network (GCN) is used for graph embedding, training of the network's parameters must be guided by a suitable downstream task and optimization target. In the prior art, GCN-based feature fusion uses the manually constructed sparse adjacency matrix $A^*$ itself as the training target, in a manner similar to an autoencoder, so the GCN parameters are effectively tuned directly on the test set. To inject more training-set information and let the model learn prior knowledge from the training set, an intuitive idea is to replace the optimization target $A^*$ with the adjacency matrix $A_{truth}$ constructed from the true clustering results in the training set, and to transfer the parameters learned on the training set to the test set.
Assume the set of all authors to be disambiguated is $\mathcal{A}$. Given the name $a$ to be disambiguated with $N$ papers in total, a paper relation network $\mathcal{G}^a = (\mathcal{V}^a, \mathcal{E}^a)$ is formed, where $\mathcal{V}^a$ is the set of paper nodes in the graph and $\mathcal{E}^a$ is the set of edges representing inter-paper relations. The semantic feature extraction model yields the paper-node semantic representation matrix $Y \in \mathbb{R}^{N \times D}$, whose row $y_i \in \mathbb{R}^{D}$ represents paper $p_i^a$ and where $D$ is the dimension of the semantic feature space. The structural feature extraction model yields the inter-paper relation matrix $A \in \mathbb{R}^{N \times N}$, which is first sparsified with a threshold $\varepsilon_s$ to obtain the sparsified adjacency matrix $A^*$, i.e., for any two papers $p_i^a$ and $p_j^a$:

$A^*_{i,j} = \begin{cases} A_{i,j}, & A_{i,j} \ge \varepsilon_s \\ 0, & A_{i,j} < \varepsilon_s \end{cases}$
Then $Y$ and $A^*$ are fed into a two-layer GCN graph convolution model $g$, with first convolution layer parameters $W^{(1)} \in \mathbb{R}^{D \times D_1}$ and second convolution layer parameters $W^{(2)} \in \mathbb{R}^{D_1 \times D_2}$. ReLU is used as the activation function between them, and a dropout layer is added after each graph convolution during training to slow the onset of overfitting, finally yielding the feature-fused paper nodes.
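A minimal PyTorch sketch of this two-layer model follows; the symmetric normalization of $A^*$ with self-loops and the layer dimensions are standard choices we assume here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, d_in, d_hid, d_out, p_drop=0.5):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hid, bias=False)   # first-layer parameters W(1)
        self.w2 = nn.Linear(d_hid, d_out, bias=False)  # second-layer parameters W(2)
        self.p_drop = p_drop

    def forward(self, y, a_star):
        a_hat = a_star + torch.eye(a_star.size(0))                   # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
        h = F.dropout(F.relu(self.w1(a_norm @ y)), self.p_drop, self.training)
        z = F.dropout(self.w2(a_norm @ h), self.p_drop, self.training)
        return z                                                     # fused paper nodes

g = TwoLayerGCN(d_in=384, d_hid=128, d_out=64)
Z = g(torch.randn(50, 384), torch.rand(50, 50))  # Y and the sparsified A*
print(Z.shape)  # torch.Size([50, 64])
```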
In a specific implementation, a pairwise (two-tuple) downstream task construction can be adopted: paper pairs $p_i^a$ and $p_j^a$ are selected on the training set to construct training samples $(p_i^a, p_j^a, t_{i,j})$, where $t_{i,j}$ is the label indicating whether the two papers are written by the same author entity, $t_{i,j} = 1$ if the authors are the same and $t_{i,j} = 0$ otherwise. The loss of a single sample is the binary cross-entropy:

$\mathcal{L}_{i,j} = -\big[t_{i,j}\log \hat{t}_{i,j} + (1 - t_{i,j})\log(1 - \hat{t}_{i,j})\big]$

where $\hat{t}_{i,j}$ is the predicted probability that the two papers share an author.
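A sketch of this supervised pairwise objective over the fused node vectors follows; scoring a pair by scaled cosine similarity squashed through a sigmoid is our assumption for how $\hat{t}_{i,j}$ is produced.

```python
import torch
import torch.nn.functional as F

def pairwise_bce(z, pairs, t, scale=5.0):
    """z: (N, D) fused node vectors; pairs: (B, 2) index pairs (i, j); t: (B,) labels."""
    zi, zj = z[pairs[:, 0]], z[pairs[:, 1]]
    sim = F.cosine_similarity(zi, zj, dim=-1)    # in [-1, 1]
    p = torch.sigmoid(scale * sim)               # assumed same-author probability t_hat
    return F.binary_cross_entropy(p, t.float())

z = torch.randn(50, 64, requires_grad=True)      # e.g. the GCN output Z
pairs = torch.randint(0, 50, (32, 2))
t = torch.randint(0, 2, (32,))
loss = pairwise_bce(z, pairs, t)
loss.backward()                                  # gradients flow back into the GCN
print(loss.item())
```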
the model structure and training process are shown in fig. 5.

Claims (7)

1. A paper cold-start disambiguation method based on feature extraction and fusion, characterized in that the method is divided into three parts:
first, a deep fine-grained paper semantic feature extraction model is constructed: taking as input the set of papers under a same author name to be disambiguated, BERT is introduced and semantic features are extracted through a BERT variant and distillation model to provide sufficient information support for the whole semantic representation framework; downstream learning tasks are constructed in pairwise (two-tuple) and triplet forms respectively for training, so that the model learns semantic features that are relevant to the author disambiguation task and strongly discriminative; adversarial training is combined to avoid overfitting; and the network structure, training process, and optimization target are optimized specifically for the characteristics of the paper author disambiguation task;
then, a structural feature extraction model is constructed that can model the discriminative-power differences of features and the relationships between them: the clustering task is converted into a binary classification task to construct training samples, decision-tree-based ensemble learning and other machine learning methods are used for structural representation training, and the category information under each author name to be disambiguated in the training set is used so that the model learns and models the contribution differences of the paper structural features and the logical relationships between them;
finally, a feature fusion model is constructed that can fully exploit multiple structured features while avoiding over-smoothed network-node representations: based on existing unsupervised graph convolution methods, prior knowledge of the training set is introduced through pairwise and triplet downstream task constructions to train a graph convolutional neural network, yielding a usable model; the papers are finally divided into per-author paper clusters, with the papers in each cluster belonging to the same author.
2. The paper cold-start disambiguation method based on feature extraction and fusion as recited in claim 1, wherein the BERT variant and distillation model are implemented concretely as follows: the 6-layer MiniLM model is selected, and its goals are achieved through three design choices: first, unlike layer-to-layer one-to-one distillation, the Student model distills only the self-attention distribution of the Teacher model's last Transformer layer rather than the complete distributions; second, scaled dot-product operations are added among the queries, keys, and values; third, a teaching-assistant mechanism is introduced to bridge the distillation from a large-scale pre-trained Teacher model to a very small Student model.
3. The paper cold-start disambiguation method based on feature extraction and fusion as recited in claim 2, wherein the downstream learning task is designed as follows: based on the clustering information in the training set, training samples are constructed with the following algorithm: for each paper $p_i^a$ under the name of the author to be disambiguated, a training sample triplet $i$ is constructed with $p_i^a$ as the anchor sample, a paper written by the same author as $p_i^a$ being first randomly selected as the positive sample $p_i^+$ and a paper written by a different author then randomly selected as the negative sample $p_i^-$, forming the $i$-th triplet training sample $(p_i^a, p_i^+, p_i^-)$;
a corresponding model architecture is built for the triplet downstream task: using a structure similar to the twin network, three completely identical sub-neural networks are constructed, sharing the model parameters $W$ during training;
in the design of the loss function, the neural network formed by the BERT and pooling layers is defined as $f$, the three sub-networks in the network model share the weights $W$, the cosine distance is used to measure the semantic representation vectors of two papers, and on this basis the optimization target $\mathrm{Loss}_{tri}$ under the triplet downstream task construction is computed as

$\mathrm{Loss}_{tri} = \max\big(\operatorname{cosdist}(f(p_i^a), f(p_i^+)) - \operatorname{cosdist}(f(p_i^a), f(p_i^-)) + \varepsilon,\ 0\big)$

where $\varepsilon$ is the margin.
4. The paper cold-start disambiguation method based on feature extraction and fusion as recited in claim 3, wherein the adversarial training method is as follows: from the perspective of optimization theory, a gradient-based adversarial training scheme is designed: the original sample is fed through the model in a forward pass to compute the loss; the gradient is then back-propagated to the original input by the chain rule, a perturbation is generated along the positive gradient of the sample input vector with respect to the loss, and the perturbation is added to the original sample to obtain an adversarial sample; the adversarial sample vector is then fed through a forward pass and added to the optimization target, which finally updates the model parameters by gradient descent.
5. The paper cold-start disambiguation method based on feature extraction and fusion as recited in claim 4, wherein the structural feature extraction model first selects structural features, dividing the general fields usable for the paper author disambiguation task into two broad categories: unstructured semantic features and structured relational features; unstructured semantic features are text features with strong semantic information, comprising three types: paper titles, paper abstracts, and domain keywords, represented as semantic feature vectors by the semantic representation model, either separately or after concatenation; structured relational features are fields whose text carries little value on its own and shows its value only when the corresponding fields of two papers are compared, comprising four types: author names, author institutions, conference names, and publication years;
before structural feature extraction, the IDF of each field value is computed, i.e., derived from the frequency with which the value appears in the corresponding field across all documents, indicating how discriminative that value is; structural features are then extracted, with relation weights described quantitatively by continuous values based on the structured relational features, and the severe over-smoothing problem of traditional methods is resolved in the subsequently designed feature fusion method;
for the graph-structure-based feature fusion algorithm designed below, papers' semantic representation vectors serve as nodes and structured relational features serve as edges; the computation of each inter-paper relation feature is defined first; when the computed relation feature between two papers is non-zero, an edge of that relation type is constructed between the two papers' nodes, with the computed value as the edge weight; when more than one type of relation feature is constructed, different types of edges exist between nodes in the graph network, i.e., the paper network built by the structural feature extraction algorithm is a heterogeneous graph.
6. The paper cold-start disambiguation method based on feature extraction and fusion as recited in claim 5, wherein the structural representation training method is as follows: a LightGBM (LGBM) model is introduced to model the differing contributions of the structured features and the relationships between them; by defining a binary-classification downstream task, a series of gradient-boosted trees is constructed to predict whether two papers are written by the same author from the subgraph structures obtained by splitting the paper relation heterogeneous graph network by edge type; the contribution differences and relationships among different structural features are modeled in the process, and the model output is the strength of the relation between the two papers.
7. The paper cold start disambiguation method based on feature extraction and fusion as recited in claim 6, wherein: the construction of the feature fusion model comprises: based on the paper relation network obtained from semantic feature extraction and relational feature extraction, with paper semantic representation vectors as nodes and inter-paper relation weights as edges, training new node representation vectors that preserve the original graph structure in the feature space, thereby fusing the papers' unstructured semantic features with their structured relational features; finally, the papers are clustered according to the resulting node representation space, with the distance between paper nodes in that space serving as the clustering basis of the clustering algorithm;
the specific method comprises the following steps: assume that the set of all authors to be disambiguated is
Figure FDA0003928770770000031
Given the name a of the author to be disambiguated, N papers are shared to form a paper relation network
Figure FDA0003928770770000032
Wherein
Figure FDA0003928770770000033
The method is characterized in that the method is a paper node set in a graph, epsilon is a set of edges representing relations between papers in the graph, and a paper node semantic representation matrix is obtained according to a semantic feature extraction model
Figure FDA0003928770770000034
Wherein
Figure FDA0003928770770000035
D is the dimension of the semantic feature space, and the relationship matrix between the papers is obtained according to the structural feature extraction model
Figure FDA0003928770770000036
First passes a threshold value epsilon s Thinning the adjacent matrix to obtain a thinned adjacent matrix
Figure FDA0003928770770000037
I.e. for any two papers
Figure FDA0003928770770000038
And thesis
Figure FDA0003928770770000039
Comprises the following steps:
Figure FDA00039287707700000310
then Y is reacted with
Figure FDA00039287707700000311
A GCN map convolution model g with two layers being accessed, the first convolution layer parameter being
Figure FDA00039287707700000312
Figure FDA00039287707700000313
The second convolution layer parameter is
Figure FDA00039287707700000314
ReLU is used as an activation function in the middle, and a dropout layer is added after convolution of each layer of graph in training to slow down the occurrence of an overfitting phenomenon, and finally a characteristic-fused thesis node is obtained;
in specific implementation, a two-tuple downstream task construction mode is adopted to select a paper pair on a training set
Figure FDA00039287707700000315
And
Figure FDA00039287707700000316
structural training sample
Figure FDA00039287707700000317
Wherein t is i,j Whether two papers are tags written by the same author entity, if the authors are the same, t i,j =1, author different then t i,j =0, the loss function of a single sample is the cross entropy loss under two classes:
Figure FDA00039287707700000318
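A minimal PyTorch sketch of the two-layer GCN and the pairwise cross-entropy loss reconstructed above. The hidden and output sizes, the dropout rate, the use of the sparsified adjacency without renormalization, and the dot-product similarity head are assumptions where the claim is silent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int, dropout: float = 0.5):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hidden, bias=False)   # first-layer W(1)
        self.w2 = nn.Linear(d_hidden, d_out, bias=False)  # second-layer W(2)
        self.p = dropout

    def forward(self, y: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = F.relu(adj @ self.w1(y))                 # first graph convolution + ReLU
        h = F.dropout(h, self.p, self.training)      # dropout to slow overfitting
        z = adj @ self.w2(h)                         # second graph convolution
        return F.dropout(z, self.p, self.training)   # fused paper-node vectors Z

def pair_loss(z: torch.Tensor, i: int, j: int, t: float) -> torch.Tensor:
    """Binary cross-entropy on one training sample (v_i, v_j, t_ij)."""
    logit = (z[i] * z[j]).sum()                      # assumed dot-product head
    return F.binary_cross_entropy_with_logits(logit, torch.tensor(t))
```

Once trained, the fused vectors $Z$ can be clustered by pairwise distance in the representation space, which is the clustering basis named in claim 7.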
CN202211382121.8A 2022-11-07 2022-11-07 Paper cold start disambiguation method based on feature extraction and fusion Pending CN115688737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211382121.8A CN115688737A (en) 2022-11-07 2022-11-07 Paper cold start disambiguation method based on feature extraction and fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211382121.8A CN115688737A (en) 2022-11-07 2022-11-07 Paper cold start disambiguation method based on feature extraction and fusion

Publications (1)

Publication Number Publication Date
CN115688737A true CN115688737A (en) 2023-02-03

Family

ID=85049300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211382121.8A Pending CN115688737A (en) 2022-11-07 2022-11-07 Paper cold start disambiguation method based on feature extraction and fusion

Country Status (1)

Country Link
CN (1) CN115688737A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556050A (en) * 2024-01-12 2024-02-13 长春吉大正元信息技术股份有限公司 Data classification and classification method and device, electronic equipment and storage medium
CN117556050B (en) * 2024-01-12 2024-04-12 长春吉大正元信息技术股份有限公司 Data classification and classification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Pham et al. Learning multiple layers of knowledge representation for aspect based sentiment analysis
Priyadarshini et al. A novel LSTM–CNN–grid search-based deep neural network for sentiment analysis
Hidayat et al. Sentiment analysis of twitter data related to Rinca Island development using Doc2Vec and SVM and logistic regression as classifier
Zhang et al. Taxogen: Unsupervised topic taxonomy construction by adaptive term embedding and clustering
Li et al. Discriminative deep random walk for network classification
Liu et al. Research of fast SOM clustering for text information
Dos Santos et al. Multilabel classification on heterogeneous graphs with gaussian embeddings
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN113962293A (en) LightGBM classification and representation learning-based name disambiguation method and system
Jiang et al. Candidate region aware nested named entity recognition
Chen et al. Text similarity semantic calculation based on deep reinforcement learning
CN115688737A (en) Paper cold start disambiguation method based on feature extraction and fusion
CN110569355A (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
Thijs et al. The contribution of the lexical component in hybrid clustering, the case of four decades of “Scientometrics”
Ding et al. The research of text mining based on self-organizing maps
Tahrat et al. Text2geo: from textual data to geospatial information
Ramaprabha et al. Survey on sentence similarity evaluation using deep learning
Li et al. Inferring user profiles in online social networks based on convolutional neural network
Wang et al. Extracting discriminative keyphrases with learned semantic hierarchies
Selvi et al. Topic categorization of Tamil news articles
Sivalingam et al. CRF-MEM: Conditional Random Field Model Based Modified Expectation Maximization Algorithm for Sarcasm Detection in Social Media
Yun et al. Combining vector space features and convolution neural network for text sentiment analysis
Liu et al. Identifying scholarly communities from unstructured texts
Wu et al. Facet annotation by extending CNN with a matching strategy
Li et al. An entity linking model based on candidate features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination