CN112215007B - Organization named entity normalization method and system based on LEAM model - Google Patents

Organization named entity normalization method and system based on LEAM model

Info

Publication number
CN112215007B
CN112215007B
Authority
CN
China
Prior art keywords
model
data
module
training
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011141040.XA
Other languages
Chinese (zh)
Other versions
CN112215007A (en)
Inventor
亓杰星 (Qi Jiexing)
彭金波 (Peng Jinbo)
傅洛伊 (Fu Luoyi)
王新兵 (Wang Xinbing)
陈贵海 (Chen Guihai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202011141040.XA priority Critical patent/CN112215007B/en
Publication of CN112215007A publication Critical patent/CN112215007A/en
Application granted granted Critical
Publication of CN112215007B publication Critical patent/CN112215007B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an organization named entity normalization method and system based on the LEAM model, comprising the following steps: step S1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions; step S2: removing noise from the screened data with regular expressions; step S3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion; step S4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization; step S5: feeding the test set into the trained model, testing the model's performance and fine-tuning it. The invention can be used to count the number of papers published by each academic institution, so that the academic ability of a given institution can be judged more scientifically and intuitively.

Description

Organization named entity normalization method and system based on LEAM model
Technical Field
The invention relates to the technical field of organization named entity normalization, and in particular to an organization named entity normalization method and system based on the LEAM model.
Background
The main purpose of organization named entity normalization in academic big data is to identify various organization aliases and map them to the real-world organization entities they belong to. Organization named entity normalization matters for evaluating the capability of academic institutions, institution cooperation networks, scholar name disambiguation, scholar trajectory tracking, talent mobility, academic paper management, academic rankings and the like. With the ever-growing number of academic papers, organization named entity normalization is also an essential step in building an academic network knowledge graph.
With the progress of modern science and technology, the number of scientific research papers has grown dramatically; in recent years, the average growth rate of the number of papers and patents has stayed around 15%. At the same time, counting papers per institution is complicated by translation conventions, spelling errors, institutional restructuring, writing styles and similar problems. Providing a simple and effective large-scale academic institution named entity normalization system is therefore critical.
Methods for the organization normalization problem generally fall into three categories: rule-based methods, knowledge-based methods, and hybrid methods that combine the two. Rule-based methods exploit the naming conventions of organization named entities, match organization aliases with regular expressions, and extract usable information from the aliases to identify the organization. A representative system is NEMO, proposed by De Bruin and Moed, which extracts information contained in an organization named entity, such as geographical location, website, mailbox and organization name, through layer-by-layer rules, and performs the mapping against existing local information. Knowledge-based methods take pre-labeled data and learn its features with a machine learning algorithm to build a classification or clustering model.
Although rule-based normalization performs well in some cases, it places requirements on author naming conventions, so it cannot be applied widely and its accuracy is low; most normalization algorithms therefore use knowledge-based methods. Unlike traditional machine-learning approaches such as naive Bayes and SVM, the invention provides a method based on deep learning.
Patent document CN111783465A (application No. CN202010630635.5) discloses a named entity normalization method comprising: acquiring a user question; performing word segmentation and named entity recognition on the question to obtain a set of universal named entities; generating a set of syntax trees for that set with the CYK algorithm; traversing the syntax tree set to obtain the maximum tree combination; and traversing the maximum tree combination and converting it into a fixed expression according to a preset grammar order. That method can effectively improve the parsing and matching of complex questions and thus the human-computer interaction capability of intelligent devices.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an organization named entity normalization method and system based on the LEAM model.
The organization named entity normalization method based on the LEAM model provided by the invention comprises the following steps:
step S1: screening all academic institution information data with a preset statistical rule and removing data that do not meet a preset condition;
step S2: removing noise from the screened data with regular expressions;
step S3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
step S4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
step S5: feeding the test set into the trained model, testing the model's performance and fine-tuning it.
Preferably, step S1 comprises:
step S101: exporting the named entities of all academic institutions from the database;
step S102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
step S103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data.
Preferably, step S2 comprises:
step S201: converting Latin characters appearing in organization names into English letters;
step S202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names.
Preferably, step S3 comprises:
step S301: counting the number of organization entity aliases corresponding to each normalized organization category;
step S302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
Preferably, step S4 comprises:
step S401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
step S402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
step S403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
step S404: training the LEAM-based text classification model on the matrix representation of the training data;
step S405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results.
Preferably, the LEAM-based text classification model is regarded as a cascade of three subsystems f_0, f_1 and f_2, where f_0 represents each character as a vector, f_1 operates on these vectors to obtain the representation of the text, and f_2 uses the text representation for classification;
in f_0, a matrix representation of the normalized organizations is learned so as to influence the character vector representations, while in f_1 the similarity between the normalized organizations and the characters of the original organization name is used to aggregate the text representation;
let c_i denote the vector representation of the i-th normalized organization and C the matrix representation of all normalized organizations; assuming K categories in total, the similarity between each category and each character is calculated with cosine similarity:
G = (C V) ⊘ Ĝ
where ⊘ denotes element-wise division, C is the K x 100 category embedding matrix, and each element of Ĝ is the product of the corresponding norms, Ĝ_kl = ||c_k|| · ||v_l||, so that G_kl is the cosine similarity between category k and character l;
V denotes the text representation matrix; v_l denotes the l-th column of V;
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l, and the similarity vector is expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K;
W_1 denotes the linear weight parameters of the classification model; b_1 denotes the bias parameters of the classification model;
max pooling is used to obtain the coefficient of maximum correlation, from which the attention coefficients, named β, are obtained; the final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β;
the model is then converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters are trained with the corresponding training set data.
Preferably, step S5 comprises:
step S501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
step S502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
step S503: modifying hyper-parameters such as the batch size and the learning rate, and repeating steps S1-S5 to obtain the optimal hyper-parameter design and the corresponding model.
The organization named entity normalization system based on the LEAM model provided by the invention comprises:
module M1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions;
module M2: removing noise from the screened data with regular expressions;
module M3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
module M4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
module M5: feeding the test set into the trained model, testing the model's performance and fine-tuning it.
Preferably, module M1 comprises:
module M101: exporting the named entities of all academic institutions from the database;
module M102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
module M103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data;
module M2 comprises:
module M201: converting Latin characters appearing in organization names into English letters;
module M202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names;
module M3 comprises:
module M301: counting the number of organization entity aliases corresponding to each normalized organization category;
module M302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
Preferably, module M4 comprises:
module M401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
module M402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
module M403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
module M404: training the LEAM-based text classification model on the matrix representation of the training data;
module M405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results;
module M5 comprises:
module M501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
module M502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
module M503: modifying hyper-parameters such as the batch size and the learning rate, and repeatedly executing modules M1-M5 to obtain the optimal hyper-parameter design and the corresponding model.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention adopts a recent deep learning method to build a simple and effective organization normalization system, solving the low accuracy and high operational complexity of older methods;
2. the method provides reliable academic institution information for researchers, which can be used for academic ability evaluation or for building academic data knowledge graphs, and is of great significance;
3. the method helps distinguish the academic ability of academic institutions and supports the construction of other academic systems, such as academic knowledge graphs.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the algorithm of the present invention;
FIG. 2 is a basic framework diagram of a LEAM-based classification model;
fig. 3 is a schematic diagram of the LEAM model.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example (b):
the invention relates to a system for mechanism named entity normalization by designing and realizing, which relates to the collection and arrangement of mechanism named entity data, the screening of mechanism named entity data, the denoising, the construction based on an LEAM classification model, the training and adjustment of the model by using data and the like; specifically, as shown in fig. 1, the method of the present invention comprises the following steps:
step S1: and screening all academic institution information data by using a statistical rule to remove obviously wrong data.
Step S2: in the screened data, the noise existing in the data is removed by using the regular rule or some other rule.
Step S3: and dividing the processed data into a training set, a verification set and a test set according to the categories and corresponding proportions.
Step S4: inputting the training set and the verification set into a LEAM model, and training out a model for mechanism named entity normalization, wherein a schematic diagram of the LEAM model is shown in figure 3.
Step S5: and inputting the test set into the trained model, testing the effect of the model and carrying out fine adjustment.
Step S1 comprises: obtaining the organization named entities and the corresponding normalized academic institutions from the Acemap database, comprising 153 million organization named entities, about 31.79 million distinct organization named entities after deduplication, and 25,000 normalized organizations. All processed data are stored in csv format for later use.
Step S101: the named entities of all academic institutions are taken from the database.
Step S102: part of the organization named entities obtained from the database are mapped to a wrong normalized organization name. The invention removes such error data by counting frequencies: the whole data set is traversed, every entity that appears is counted, the normalized organizations corresponding to the same named entity are tallied, and the result is stored in dictionary format, with the organization named entity as key and the list of all normalized organization IDs corresponding to that named entity (including repetitions) as value.
Step S103: the value lists obtained in step S102 are counted; the normalized organization ID that appears most often is taken as the normalized organization ID of that organization named entity, the other organization IDs are treated as error data and removed from the data set, and the remaining data are deduplicated and retained as a CSV file.
Step S2 comprises: rules for converting Latin characters into English, conversion of Latin characters that cannot otherwise be identified, and further denoising of the text with regular expressions and similar rules. The specific steps are as follows:
Step S201: Latin characters appearing in the organization entity are converted into English letters using a dictionary.
Step S202: then, according to regular expressions, stop words and punctuation marks in the converted organization name are removed, and the abbreviations in the organization name are expanded. Specific examples are shown in table one.
Table one: processing stop words, punctuation, acronyms
(The table is reproduced only as an image in the original publication; it shows example organization names before and after stop-word removal, punctuation stripping and abbreviation expansion.)
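A sketch of the S201/S202 cleanup in Python follows; the Latin-character map, stop-word list and abbreviation table below are small illustrative assumptions, as the patent does not publish its actual dictionaries:

import re

LATIN_MAP = {"é": "e", "ü": "u", "ç": "c", "ã": "a", "ö": "o"}  # S201 (assumed sample)
STOP_WORDS = {"of", "the", "and", "for", "de", "la"}            # S202 (assumed sample)
ABBREVIATIONS = {"univ": "university", "inst": "institute",
                 "dept": "department", "natl": "national"}      # assumed sample

def normalize_name(name: str) -> str:
    # S201: map accented Latin characters to plain English letters.
    name = "".join(LATIN_MAP.get(ch, ch) for ch in name.lower())
    # S202: strip punctuation with a regular expression.
    name = re.sub(r"[^\w\s]", " ", name)
    tokens = []
    for tok in name.split():
        if tok in STOP_WORDS:
            continue                                # drop stop words
        tokens.append(ABBREVIATIONS.get(tok, tok))  # expand abbreviations
    return " ".join(tokens)

print(normalize_name("Dept. of Comp. Sci., Univ. de São Paulo"))
# -> "department comp sci university sao paulo"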
Step S3 comprises: traversing the data to obtain the number of organization entity names owned by each normalized organization, then dividing the data into three different data sets in a fixed proportion and storing them as csv files for model training. The specific steps are as follows:
Step S301: the csv files stored after removing the error data are traversed, and the number of organization entity aliases corresponding to each normalized organization category is counted, as shown in table two.
Table two: normalized organization alias statistics
Normalized organization name Number of aliases
Centre national de la recherche scientifique 295420
Chinese Academy of Sciences 173719
Harvard University 157932
National Institutes of Health 120212
University of Tokyo 111260
Stanford University 90774
…… ……
Step S302: according to the following steps of 6: 2: 2 randomly dividing the entity aliases of each category into a training set, a verification set and a test set, and storing the training sets, the verification set and the test set in a csv format, as shown in table three.
Table three: data set partitioning storage format
Training set train.csv
Validation set dev.csv
Test set test.csv
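A sketch of this per-category 6:2:2 split in Python; the function signature and the fixed seed are assumptions, and the returned lists would then be written to train.csv, dev.csv and test.csv as in table three:

import random
from collections import defaultdict

def split_by_category(pairs, ratios=(0.6, 0.2, 0.2), seed=42):
    # Group the aliases by their normalized-organization category so every
    # category is represented in all three sets (step S302).
    by_cat = defaultdict(list)
    for alias, cat in pairs:
        by_cat[cat].append(alias)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for cat, aliases in by_cat.items():
        rng.shuffle(aliases)
        a = int(ratios[0] * len(aliases))
        b = int((ratios[0] + ratios[1]) * len(aliases))
        train += [(x, cat) for x in aliases[:a]]
        dev += [(x, cat) for x in aliases[a:b]]
        test += [(x, cat) for x in aliases[b:]]
    return train, dev, test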
Step S4 comprises: converting the organization named entity from its string form into a representation matrix through embedding, feeding it into the constructed classification system according to the corresponding training rules, and training a reliable organization normalization classification model, as shown in fig. 2. The specific steps are as follows:
Step S401: the 100-dimensional character vectors trained by Stanford University are loaded as the initialization of the characters for training. The character vectors are trained following the basic principle of GloVe, in three steps: first, a co-occurrence matrix is built from the corpus, each element of which counts how often a word and a context word co-occur within a context window of a given size, with the counts weighted by the distance between the two words; then, an approximate relation between the word vectors and the co-occurrence matrix is established; finally, a loss function for training the word vectors is built from this approximate relation, and the final vector representations are obtained by training.
Step S402: and sequencing the original names of the mechanisms in the training set according to the length of the character string, and taking the training data with the size of the batch according to the length every time to form a trained batch. The same is true for the validation set's batch. The size of the batch is the size of the data volume required when the parameters are updated every time, the training speed is faster when the batch is larger, but more training rounds may be required to obtain the optimal solution, and each training occupies a large amount of training resources. The size of the batch is typically set to a power of 2.
Step S403: each original name in each batch is represented as a matrix of L rows and 100 columns, where L is the length of the longest string in that batch. Each row of the matrix is a 100-dimensional vector representation of each character.
Step S404: a LEAM-based text classification model is trained by a matrix representation of training data, and the principles of this classification model will be explained in detail below.
Consider the text classification model as f 0 ,f 1 And f 2 Cascade of three systems, wherein f 0 The representation represents the character as a vector, f 1 Is to operate these vectors to obtain the representation of the text, f 2 The text representation is used for classification. At f 0 Learning a matrix representation of the normalization mechanism to influence the representation of the character vector while at f 1 Wherein the aggregation of the text representations is performed using the similarity of the characters of the normalization mechanism and the original mechanism. In particular, with c i Represents the vector representation of the ith normalization mechanism, and C is the matrix representation of all normalization mechanisms. Assuming that there are K categories in total, the similarity between each category and the character can be calculated by using cosine similarity:
Figure BDA0002738276900000081
wherein,
Figure BDA0002738276900000082
the product of the elements is represented by,
Figure BDA0002738276900000083
is a K x 100 normalized matrix of the matrix,
Figure BDA0002738276900000084
each of the elements of (1) is
Figure BDA0002738276900000085
Figure BDA0002738276900000086
In order to better capture the spatial information between characters, nonlinearity is introduced into the similarity; it can be obtained with a convolution and an activation function.
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l; the similarity vector is then expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K. Max pooling then yields the coefficient of maximum correlation, from which the attention coefficients, named β, can be derived. The final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β. The model can thus be converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters can be trained with the corresponding training set data.
Step S405: and after each training round is finished, verifying the model by using a verification set, judging whether the model plays a role or not, and adjusting the hyper-parameters of the model according to the result of the verification set. In the training process, attention should be paid to the variation trend of the loss of the training set and the verification set function at any time, and the retained final model is the parameter with the highest accuracy of the verification set. Meanwhile, if the accuracy of the training set is increased and the accuracy of the verification set is reduced, whether the model is over-fitted or not should be considered, and if the over-fitting is confirmed, the training should be stopped and the model with the highest accuracy of the verification set is taken as the final retained model. The training results are shown in table four:
table four: training results
Data set Accuracy
Training set 98.2%
Validation set 97.1%
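The early-stopping logic of step S405 can be sketched as follows; the callables train_one_epoch and evaluate are caller-supplied assumptions, and the patience value is an assumption as well (the patent only says to watch the losses and keep the best validation model):

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=50, patience=3):
    # train_one_epoch(model): runs one epoch of training (assumed helper).
    # evaluate(model): returns validation accuracy (assumed helper).
    best_acc, best_state, bad_epochs = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        acc = evaluate(model)
        if acc > best_acc:
            # Keep the checkpoint with the highest validation accuracy.
            best_acc, best_state, bad_epochs = acc, model.state_dict(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation accuracy keeps falling:
                break                    # suspected overfitting, stop training
    model.load_state_dict(best_state)
    return best_acc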
Step S5 comprises: testing the model with the test set, adjusting the hyper-parameters, analyzing the error data and further cleaning or manually adjusting it, so as to obtain the model with the highest possible accuracy, which is stored as the final model. The specific steps are as follows:
s501: and testing the test set data by using the trained model, counting the accuracy of the result, and storing the accuracy and the corresponding model. The results are shown in Table five.
Table five: test results
Data set Accuracy
Test set 1 97.4%
Test set 2 97.7%
Test set 3 97.5%
S502: and analyzing the classified error data, removing data with obvious label errors, and manually marking the data which is difficult to classify. An example of the data with the classification error is shown in table six.
Table six: classification of error data examples
(The table is reproduced only as images in the original publication; its example rows are not recoverable from the text.)
S503: and modifying the size of the batch, learning rate and other hyper-parameters, and repeating all the steps to obtain the optimal hyper-parameter design and the corresponding model.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the apparatus, and the modules thereof provided by the present invention may be considered as a hardware component, and the modules included in the system, the apparatus, and the modules for implementing various programs may also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (7)

1. An organization named entity normalization method based on the LEAM model, characterized by comprising the following steps:
step S1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions;
step S2: removing noise from the screened data with regular expressions;
step S3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
step S4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
step S5: feeding the test set into the trained model, testing the model's performance and fine-tuning it;
step S4 comprises:
step S401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
step S402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
step S403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
step S404: training the LEAM-based text classification model on the matrix representation of the training data;
step S405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results;
the LEAM-based text classification model is regarded as a cascade of three subsystems f_0, f_1 and f_2, where f_0 represents each character as a vector, f_1 operates on these vectors to obtain the representation of the text, and f_2 uses the text representation for classification;
in f_0, a matrix representation of the normalized organizations is learned so as to influence the character vector representations, while in f_1 the similarity between the normalized organizations and the characters of the original organization name is used to aggregate the text representation;
let c_i denote the vector representation of the i-th normalized organization and C the matrix representation of all normalized organizations; assuming K categories in total, the similarity between each category and each character is calculated with cosine similarity:
G = (C V) ⊘ Ĝ
where ⊘ denotes element-wise division, C is the K x 100 category embedding matrix, and each element of Ĝ is the product of the corresponding norms, Ĝ_kl = ||c_k|| · ||v_l||, so that G_kl is the cosine similarity between category k and character l;
V denotes the text representation matrix; v_l denotes the l-th column of V;
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l, and the similarity vector is expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K;
W_1 denotes the linear weight parameters of the classification model; b_1 denotes the bias parameters of the classification model;
max pooling is used to obtain the coefficient of maximum correlation, from which the attention coefficients, named β, are obtained; the final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β;
the model is then converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters are trained with the corresponding training set data.
2. The organization named entity normalization method according to claim 1, wherein step S1 comprises:
step S101: exporting the named entities of all academic institutions from the database;
step S102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
step S103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data.
3. The organization named entity normalization method according to claim 1, wherein step S2 comprises:
step S201: converting Latin characters appearing in organization names into English letters;
step S202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names.
4. The organization named entity normalization method according to claim 1, wherein step S3 comprises:
step S301: counting the number of organization entity aliases corresponding to each normalized organization category;
step S302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
5. The organization named entity normalization method according to claim 1, wherein step S5 comprises:
step S501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
step S502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
step S503: modifying hyper-parameters such as the batch size and the learning rate, and repeating steps S1-S5 to obtain the optimal hyper-parameter design and the corresponding model.
6. An organization named entity normalization system based on the LEAM model, characterized by comprising:
module M1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions;
module M2: removing noise from the screened data with regular expressions;
module M3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
module M4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
module M5: feeding the test set into the trained model, testing the model's performance and fine-tuning it;
module M4 comprises:
module M401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
module M402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
module M403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
module M404: training the LEAM-based text classification model on the matrix representation of the training data;
module M405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results;
module M5 comprises:
module M501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
module M502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
module M503: modifying hyper-parameters such as the batch size and the learning rate, and repeatedly executing modules M1-M5 to obtain the optimal hyper-parameter design and the corresponding model;
the LEAM-based text classification model is regarded as a cascade of three subsystems f_0, f_1 and f_2, where f_0 represents each character as a vector, f_1 operates on these vectors to obtain the representation of the text, and f_2 uses the text representation for classification;
in f_0, a matrix representation of the normalized organizations is learned so as to influence the character vector representations, while in f_1 the similarity between the normalized organizations and the characters of the original organization name is used to aggregate the text representation;
let c_i denote the vector representation of the i-th normalized organization and C the matrix representation of all normalized organizations; assuming K categories in total, the similarity between each category and each character is calculated with cosine similarity:
G = (C V) ⊘ Ĝ
where ⊘ denotes element-wise division, C is the K x 100 category embedding matrix, and each element of Ĝ is the product of the corresponding norms, Ĝ_kl = ||c_k|| · ||v_l||, so that G_kl is the cosine similarity between category k and character l;
V denotes the text representation matrix; v_l denotes the l-th column of V;
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l, and the similarity vector is expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K;
W_1 denotes the linear weight parameters of the classification model; b_1 denotes the bias parameters of the classification model;
max pooling is used to obtain the coefficient of maximum correlation, from which the attention coefficients, named β, are obtained; the final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β;
the model is then converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters are trained with the corresponding training set data.
7. The LEAM model-based organization named entity normalization system of claim 6, wherein module M1 comprises:
module M101: exporting the named entities of all academic institutions from the database;
module M102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
module M103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data;
module M2 comprises:
module M201: converting Latin characters appearing in organization names into English letters;
module M202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names;
module M3 comprises:
module M301: counting the number of organization entity aliases corresponding to each normalized organization category;
module M302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
CN202011141040.XA 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model Active CN112215007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011141040.XA CN112215007B (en) 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011141040.XA CN112215007B (en) 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model

Publications (2)

Publication Number Publication Date
CN112215007A CN112215007A (en) 2021-01-12
CN112215007B true CN112215007B (en) 2022-09-23

Family

ID=74054853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011141040.XA Active CN112215007B (en) 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model

Country Status (1)

Country Link
CN (1) CN112215007B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733543A (en) * 2021-01-26 2021-04-30 上海交通大学 Organization named entity normalization method and system based on text editing generation model
CN116910275B (en) * 2023-09-12 2023-12-15 无锡容智技术有限公司 Form generation method and system based on large language model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090985A (en) * 2019-11-28 2020-05-01 华中师范大学 Chinese text difficulty assessment method based on siamese network and multi-core LEAM framework
CN111274405A (en) * 2020-02-26 2020-06-12 北京工业大学 Text classification method based on GCN
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111738003A (en) * 2020-06-15 2020-10-02 中国科学院计算技术研究所 Named entity recognition model training method, named entity recognition method, and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444721B (en) * 2020-05-27 2022-09-23 南京大学 Chinese text key information extraction method based on pre-training language model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090985A (en) * 2019-11-28 2020-05-01 华中师范大学 Chinese text difficulty assessment method based on siamese network and multi-core LEAM framework
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal
CN111274405A (en) * 2020-02-26 2020-06-12 北京工业大学 Text classification method based on GCN
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111738003A (en) * 2020-06-15 2020-10-02 中国科学院计算技术研究所 Named entity recognition model training method, named entity recognition method, and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Named Entity Recognition Method for the Government Affairs Domain Based on BERT-BLSTM-CRF; Yang Chunming et al.; Journal of Southwest University of Science and Technology; 2020-09-30 (No. 03); full text *
Sentiment Analysis Technology for Intelligent Customer Service Systems; Song Shuangyong et al.; Journal of Chinese Information Processing; 2020-02-15 (No. 02); chapters 1-8 *

Also Published As

Publication number Publication date
CN112215007A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN111274806B (en) Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
Campos et al. Biomedical named entity recognition: a survey of machine-learning tools
CN103154936B (en) For the method and system of robotization text correction
Wood et al. The sequence memoizer
CN109902159A (en) A kind of intelligent O&M statement similarity matching process based on natural language processing
Djumalieva et al. An open and data-driven taxonomy of skills extracted from online job adverts
CN112541056B (en) Medical term standardization method, device, electronic equipment and storage medium
CN110879831A (en) Chinese medicine sentence word segmentation method based on entity recognition technology
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
Noaman et al. Naive Bayes classifier based Arabic document categorization
US20120197888A1 (en) Method and apparatus for selecting clusterings to classify a predetermined data set
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN109036577A (en) Diabetic complication analysis method and device
CN112560478A (en) Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
CN112215007B (en) Organization named entity normalization method and system based on LEAM model
CN110222192A (en) Corpus method for building up and device
US20230325424A1 (en) Systems and methods for generating codes and code books based using cosine proximity
CN114528919A (en) Natural language processing method and device and computer equipment
CN111785387A (en) Method and system for disease standardized mapping classification by using Bert
CN111177375A (en) Electronic document classification method and device
CN115309910B (en) Language-text element and element relation joint extraction method and knowledge graph construction method
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN117422074A (en) Method, device, equipment and medium for standardizing clinical information text
CN112215006B (en) Organization named entity normalization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant