CN112215007B - Organization named entity normalization method and system based on LEAM model - Google Patents

Organization named entity normalization method and system based on LEAM model

Info

Publication number
CN112215007B
CN112215007B
Authority
CN
China
Prior art keywords
model
data
module
training
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011141040.XA
Other languages
Chinese (zh)
Other versions
CN112215007A (en)
Inventor
亓杰星 (Qi Jiexing)
彭金波 (Peng Jinbo)
傅洛伊 (Fu Luoyi)
王新兵 (Wang Xinbing)
陈贵海 (Chen Guihai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202011141040.XA priority Critical patent/CN112215007B/en
Publication of CN112215007A publication Critical patent/CN112215007A/en
Application granted granted Critical
Publication of CN112215007B publication Critical patent/CN112215007B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an organization named entity normalization method and system based on the LEAM model, comprising the following steps: step S1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions; step S2: removing noise from the screened data with regular expressions; step S3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion; step S4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization; step S5: feeding the test set into the trained model, testing the model's performance and fine-tuning it. The invention can be used to count the number of papers published by each academic institution, so that the academic ability of a given institution can be judged more scientifically and intuitively.

Description

Organization named entity normalization method and system based on LEAM model
Technical Field
The invention relates to the technical field of organization named entity normalization, and in particular to an organization named entity normalization method and system based on the LEAM model.
Background
The main purpose of organization named entity normalization in academic big data is to identify various organization aliases and map them to the real-world organization entities they belong to. Organization named entity normalization matters for evaluating the capability of academic institutions, institution cooperation networks, scholar name disambiguation, scholar trajectory tracking, talent mobility, academic paper management, academic rankings and the like. With the ever-growing number of academic papers, organization named entity normalization is also an essential step in building an academic network knowledge graph.
With the progress of modern science and technology, the number of scientific research papers has grown dramatically; in recent years, the average growth rate of the number of papers and patents has stayed around 15%. At the same time, counting papers per institution is complicated by translation conventions, spelling errors, institutional restructuring, writing styles and similar problems. Providing a simple and effective large-scale academic institution named entity normalization system is therefore critical.
Methods for the organization normalization problem generally fall into three categories: rule-based methods, knowledge-based methods, and hybrid methods that combine the two. Rule-based methods exploit the naming conventions of organization named entities, match organization aliases with regular expressions, and extract usable information from the aliases to identify the organization. A representative system is NEMO, proposed by De Bruin and Moed, which extracts information contained in an organization named entity, such as geographical location, website, mailbox and organization name, through layer-by-layer rules, and performs the mapping against existing local information. Knowledge-based methods take pre-labeled data and learn its features with a machine learning algorithm to build a classification or clustering model.
Although rule-based normalization performs well in some cases, it places requirements on author naming conventions, so it cannot be applied widely and its accuracy is low; most normalization algorithms therefore use knowledge-based methods. Unlike traditional machine-learning approaches such as naive Bayes and SVM, the invention provides a method based on deep learning.
Patent document CN111783465A (application No. CN202010630635.5) discloses a named entity normalization method comprising: acquiring a user question; performing word segmentation and named entity recognition on the question to obtain a set of universal named entities; generating a set of syntax trees for that set with the CYK algorithm; traversing the syntax tree set to obtain the maximum tree combination; and traversing the maximum tree combination and converting it into a fixed expression according to a preset grammar order. That method can effectively improve the parsing and matching of complex questions and thus the human-computer interaction capability of intelligent devices.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an organization named entity normalization method and system based on the LEAM model.
The organization named entity normalization method based on the LEAM model provided by the invention comprises the following steps:
step S1: screening all academic institution information data with a preset statistical rule and removing data that do not meet a preset condition;
step S2: removing noise from the screened data with regular expressions;
step S3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
step S4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
step S5: feeding the test set into the trained model, testing the model's performance and fine-tuning it.
Preferably, step S1 comprises:
step S101: exporting the named entities of all academic institutions from the database;
step S102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
step S103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data.
Preferably, step S2 comprises:
step S201: converting Latin characters appearing in organization names into English letters;
step S202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names.
Preferably, step S3 comprises:
step S301: counting the number of organization entity aliases corresponding to each normalized organization category;
step S302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
Preferably, step S4 comprises:
step S401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
step S402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
step S403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
step S404: training the LEAM-based text classification model on the matrix representation of the training data;
step S405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results.
Preferably, the LEAM-based text classification model is regarded as a cascade of three subsystems f_0, f_1 and f_2, where f_0 represents each character as a vector, f_1 operates on these vectors to obtain the representation of the text, and f_2 uses the text representation for classification;
in f_0, a matrix representation of the normalized organizations is learned so as to influence the character vector representations, while in f_1 the similarity between the normalized organizations and the characters of the original organization name is used to aggregate the text representation;
let c_i denote the vector representation of the i-th normalized organization and C the matrix representation of all normalized organizations; assuming K categories in total, the similarity between each category and each character is calculated with cosine similarity:
G = (C V) ⊘ Ĝ
where ⊘ denotes element-wise division, C is the K x 100 category embedding matrix, and each element of Ĝ is the product of the corresponding norms, Ĝ_kl = ||c_k|| · ||v_l||, so that G_kl is the cosine similarity between category k and character l;
V denotes the text representation matrix; v_l denotes the l-th column of V;
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l, and the similarity vector is expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K;
W_1 denotes the linear weight parameters of the classification model; b_1 denotes the bias parameters of the classification model;
max pooling is used to obtain the coefficient of maximum correlation, from which the attention coefficients, named β, are obtained; the final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β;
the model is then converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters are trained with the corresponding training set data.
Preferably, step S5 comprises:
step S501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
step S502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
step S503: modifying hyper-parameters such as the batch size and the learning rate, and repeating steps S1-S5 to obtain the optimal hyper-parameter design and the corresponding model.
The organization named entity normalization system based on the LEAM model provided by the invention comprises:
module M1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions;
module M2: removing noise from the screened data with regular expressions;
module M3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
module M4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
module M5: feeding the test set into the trained model, testing the model's performance and fine-tuning it.
Preferably, module M1 comprises:
module M101: exporting the named entities of all academic institutions from the database;
module M102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
module M103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data;
module M2 comprises:
module M201: converting Latin characters appearing in organization names into English letters;
module M202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names;
module M3 comprises:
module M301: counting the number of organization entity aliases corresponding to each normalized organization category;
module M302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
Preferably, module M4 comprises:
module M401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
module M402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
module M403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
module M404: training the LEAM-based text classification model on the matrix representation of the training data;
module M405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results;
module M5 comprises:
module M501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
module M502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
module M503: modifying hyper-parameters such as the batch size and the learning rate, and repeatedly executing modules M1-M5 to obtain the optimal hyper-parameter design and the corresponding model.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention adopts a recent deep learning method to build a simple and effective organization normalization system, solving the low accuracy and high operational complexity of older methods;
2. the method provides reliable academic institution information for researchers, which can be used for academic ability evaluation or for building academic data knowledge graphs, and is of great significance;
3. the method helps distinguish the academic ability of academic institutions and supports the construction of other academic systems, such as academic knowledge graphs.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the algorithm of the present invention;
FIG. 2 is a basic framework diagram of a LEAM-based classification model;
fig. 3 is a schematic diagram of the LEAM model.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example (b):
the invention relates to a system for mechanism named entity normalization by designing and realizing, which relates to the collection and arrangement of mechanism named entity data, the screening of mechanism named entity data, the denoising, the construction based on an LEAM classification model, the training and adjustment of the model by using data and the like; specifically, as shown in fig. 1, the method of the present invention comprises the following steps:
step S1: and screening all academic institution information data by using a statistical rule to remove obviously wrong data.
Step S2: in the screened data, the noise existing in the data is removed by using the regular rule or some other rule.
Step S3: and dividing the processed data into a training set, a verification set and a test set according to the categories and corresponding proportions.
Step S4: inputting the training set and the verification set into a LEAM model, and training out a model for mechanism named entity normalization, wherein a schematic diagram of the LEAM model is shown in figure 3.
Step S5: and inputting the test set into the trained model, testing the effect of the model and carrying out fine adjustment.
Step S1 comprises: obtaining the organization named entities and the corresponding normalized academic institutions from the Acemap database, comprising 153 million organization named entities, about 31.79 million distinct organization named entities after deduplication, and 25,000 normalized organizations. All processed data are stored in csv format for later use.
Step S101: the named entities of all academic institutions are taken from the database.
Step S102: part of the organization named entities obtained from the database are mapped to a wrong normalized organization name. The invention removes such error data by counting frequencies: the whole data set is traversed, every entity that appears is counted, the normalized organizations corresponding to the same named entity are tallied, and the result is stored in dictionary format, with the organization named entity as key and the list of all normalized organization IDs corresponding to that named entity (including repetitions) as value.
Step S103: the value lists obtained in step S102 are counted; the normalized organization ID that appears most often is taken as the normalized organization ID of that organization named entity, the other organization IDs are treated as error data and removed from the data set, and the remaining data are deduplicated and retained as a CSV file.
Step S2 comprises: rules for converting Latin characters into English, conversion of Latin characters that cannot otherwise be identified, and further denoising of the text with regular expressions and similar rules. The specific steps are as follows:
Step S201: Latin characters appearing in the organization entity are converted into English letters using a dictionary.
Step S202: then, according to regular expressions, stop words and punctuation marks in the converted organization name are removed, and the abbreviations in the organization name are expanded. Specific examples are shown in table one.
Table one: processing stop words, punctuation, acronyms
(The table is reproduced only as an image in the original publication; it shows example organization names before and after stop-word removal, punctuation stripping and abbreviation expansion.)
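A sketch of the S201/S202 cleanup in Python follows; the Latin-character map, stop-word list and abbreviation table below are small illustrative assumptions, as the patent does not publish its actual dictionaries:

import re

LATIN_MAP = {"é": "e", "ü": "u", "ç": "c", "ã": "a", "ö": "o"}  # S201 (assumed sample)
STOP_WORDS = {"of", "the", "and", "for", "de", "la"}            # S202 (assumed sample)
ABBREVIATIONS = {"univ": "university", "inst": "institute",
                 "dept": "department", "natl": "national"}      # assumed sample

def normalize_name(name: str) -> str:
    # S201: map accented Latin characters to plain English letters.
    name = "".join(LATIN_MAP.get(ch, ch) for ch in name.lower())
    # S202: strip punctuation with a regular expression.
    name = re.sub(r"[^\w\s]", " ", name)
    tokens = []
    for tok in name.split():
        if tok in STOP_WORDS:
            continue                                # drop stop words
        tokens.append(ABBREVIATIONS.get(tok, tok))  # expand abbreviations
    return " ".join(tokens)

print(normalize_name("Dept. of Comp. Sci., Univ. de São Paulo"))
# -> "department comp sci university sao paulo"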
Step S3 comprises: traversing the data to obtain the number of organization entity names owned by each normalized organization, then dividing the data into three different data sets in a fixed proportion and storing them as csv files for model training. The specific steps are as follows:
Step S301: the csv files stored after removing the error data are traversed, and the number of organization entity aliases corresponding to each normalized organization category is counted, as shown in table two.
Table two: normalized organization alias statistics
Normalized organization name Number of aliases
Centre national de la recherche scientifique 295420
Chinese Academy of Sciences 173719
Harvard University 157932
National Institutes of Health 120212
University of Tokyo 111260
Stanford University 90774
…… ……
Step S302: according to the following steps of 6: 2: 2 randomly dividing the entity aliases of each category into a training set, a verification set and a test set, and storing the training sets, the verification set and the test set in a csv format, as shown in table three.
Table three: data set partitioning storage format
Training set train.csv
Validation set dev.csv
Test set test.csv
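A sketch of this per-category 6:2:2 split in Python; the function signature and the fixed seed are assumptions, and the returned lists would then be written to train.csv, dev.csv and test.csv as in table three:

import random
from collections import defaultdict

def split_by_category(pairs, ratios=(0.6, 0.2, 0.2), seed=42):
    # Group the aliases by their normalized-organization category so every
    # category is represented in all three sets (step S302).
    by_cat = defaultdict(list)
    for alias, cat in pairs:
        by_cat[cat].append(alias)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for cat, aliases in by_cat.items():
        rng.shuffle(aliases)
        a = int(ratios[0] * len(aliases))
        b = int((ratios[0] + ratios[1]) * len(aliases))
        train += [(x, cat) for x in aliases[:a]]
        dev += [(x, cat) for x in aliases[a:b]]
        test += [(x, cat) for x in aliases[b:]]
    return train, dev, test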
Step S4 comprises: converting the organization named entity from its string form into a representation matrix through embedding, feeding it into the constructed classification system according to the corresponding training rules, and training a reliable organization normalization classification model, as shown in fig. 2. The specific steps are as follows:
Step S401: the 100-dimensional character vectors trained by Stanford University are loaded as the initialization of the characters for training. The character vectors are trained following the basic principle of GloVe, in three steps: first, a co-occurrence matrix is built from the corpus, each element of which counts how often a word and a context word co-occur within a context window of a given size, with the counts weighted by the distance between the two words; then, an approximate relation between the word vectors and the co-occurrence matrix is established; finally, a loss function for training the word vectors is built from this approximate relation, and the final vector representations are obtained by training.
Step S402: and sequencing the original names of the mechanisms in the training set according to the length of the character string, and taking the training data with the size of the batch according to the length every time to form a trained batch. The same is true for the validation set's batch. The size of the batch is the size of the data volume required when the parameters are updated every time, the training speed is faster when the batch is larger, but more training rounds may be required to obtain the optimal solution, and each training occupies a large amount of training resources. The size of the batch is typically set to a power of 2.
Step S403: each original name in each batch is represented as a matrix of L rows and 100 columns, where L is the length of the longest string in that batch. Each row of the matrix is a 100-dimensional vector representation of each character.
Step S404: a LEAM-based text classification model is trained by a matrix representation of training data, and the principles of this classification model will be explained in detail below.
Consider the text classification model as f 0 ,f 1 And f 2 Cascade of three systems, wherein f 0 The representation represents the character as a vector, f 1 Is to operate these vectors to obtain the representation of the text, f 2 The text representation is used for classification. At f 0 Learning a matrix representation of the normalization mechanism to influence the representation of the character vector while at f 1 Wherein the aggregation of the text representations is performed using the similarity of the characters of the normalization mechanism and the original mechanism. In particular, with c i Represents the vector representation of the ith normalization mechanism, and C is the matrix representation of all normalization mechanisms. Assuming that there are K categories in total, the similarity between each category and the character can be calculated by using cosine similarity:
Figure BDA0002738276900000081
wherein,
Figure BDA0002738276900000082
the product of the elements is represented by,
Figure BDA0002738276900000083
is a K x 100 normalized matrix of the matrix,
Figure BDA0002738276900000084
each of the elements of (1) is
Figure BDA0002738276900000085
Figure BDA0002738276900000086
In order to better capture the spatial information between characters, nonlinearity is introduced into the similarity; it can be obtained with a convolution and an activation function.
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l; the similarity vector is then expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K. Max pooling then yields the coefficient of maximum correlation, from which the attention coefficients, named β, can be derived. The final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β. The model can thus be converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters can be trained with the corresponding training set data.
Step S405: and after each training round is finished, verifying the model by using a verification set, judging whether the model plays a role or not, and adjusting the hyper-parameters of the model according to the result of the verification set. In the training process, attention should be paid to the variation trend of the loss of the training set and the verification set function at any time, and the retained final model is the parameter with the highest accuracy of the verification set. Meanwhile, if the accuracy of the training set is increased and the accuracy of the verification set is reduced, whether the model is over-fitted or not should be considered, and if the over-fitting is confirmed, the training should be stopped and the model with the highest accuracy of the verification set is taken as the final retained model. The training results are shown in table four:
table four: training results
Data set Accuracy
Training set 98.2%
Validation set 97.1%
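The early-stopping logic of step S405 can be sketched as follows; the callables train_one_epoch and evaluate are caller-supplied assumptions, and the patience value is an assumption as well (the patent only says to watch the losses and keep the best validation model):

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=50, patience=3):
    # train_one_epoch(model): runs one epoch of training (assumed helper).
    # evaluate(model): returns validation accuracy (assumed helper).
    best_acc, best_state, bad_epochs = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        acc = evaluate(model)
        if acc > best_acc:
            # Keep the checkpoint with the highest validation accuracy.
            best_acc, best_state, bad_epochs = acc, model.state_dict(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation accuracy keeps falling:
                break                    # suspected overfitting, stop training
    model.load_state_dict(best_state)
    return best_acc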
Step S5 comprises: testing the model with the test set, adjusting the hyper-parameters, analyzing the error data and further cleaning or manually adjusting it, so as to obtain the model with the highest possible accuracy, which is stored as the final model. The specific steps are as follows:
s501: and testing the test set data by using the trained model, counting the accuracy of the result, and storing the accuracy and the corresponding model. The results are shown in Table five.
Table five: test results
Data set Accuracy
Test set 1 97.4%
Test set 2 97.7%
Test set 3 97.5%
S502: and analyzing the classified error data, removing data with obvious label errors, and manually marking the data which is difficult to classify. An example of the data with the classification error is shown in table six.
Table six: classification of error data examples
(The table is reproduced only as images in the original publication; its example rows are not recoverable from the text.)
S503: and modifying the size of the batch, learning rate and other hyper-parameters, and repeating all the steps to obtain the optimal hyper-parameter design and the corresponding model.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the apparatus, and the modules thereof provided by the present invention may be considered as a hardware component, and the modules included in the system, the apparatus, and the modules for implementing various programs may also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (7)

1. An organization named entity normalization method based on the LEAM model, characterized by comprising the following steps:
step S1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions;
step S2: removing noise from the screened data with regular expressions;
step S3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
step S4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
step S5: feeding the test set into the trained model, testing the model's performance and fine-tuning it;
step S4 comprises:
step S401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
step S402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
step S403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
step S404: training the LEAM-based text classification model on the matrix representation of the training data;
step S405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results;
the LEAM-based text classification model is regarded as a cascade of three subsystems f_0, f_1 and f_2, where f_0 represents each character as a vector, f_1 operates on these vectors to obtain the representation of the text, and f_2 uses the text representation for classification;
in f_0, a matrix representation of the normalized organizations is learned so as to influence the character vector representations, while in f_1 the similarity between the normalized organizations and the characters of the original organization name is used to aggregate the text representation;
let c_i denote the vector representation of the i-th normalized organization and C the matrix representation of all normalized organizations; assuming K categories in total, the similarity between each category and each character is calculated with cosine similarity:
G = (C V) ⊘ Ĝ
where ⊘ denotes element-wise division, C is the K x 100 category embedding matrix, and each element of Ĝ is the product of the corresponding norms, Ĝ_kl = ||c_k|| · ||v_l||, so that G_kl is the cosine similarity between category k and character l;
V denotes the text representation matrix; v_l denotes the l-th column of V;
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l, and the similarity vector is expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K;
W_1 denotes the linear weight parameters of the classification model; b_1 denotes the bias parameters of the classification model;
max pooling is used to obtain the coefficient of maximum correlation, from which the attention coefficients, named β, are obtained; the final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β;
the model is then converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters are trained with the corresponding training set data.
2. The organization named entity normalization method according to claim 1, wherein step S1 comprises:
step S101: exporting the named entities of all academic institutions from the database;
step S102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
step S103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data.
3. The organization named entity normalization method according to claim 1, wherein step S2 comprises:
step S201: converting Latin characters appearing in organization names into English letters;
step S202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names.
4. The organization named entity normalization method according to claim 1, wherein step S3 comprises:
step S301: counting the number of organization entity aliases corresponding to each normalized organization category;
step S302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
5. The organization named entity normalization method according to claim 1, wherein step S5 comprises:
step S501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
step S502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
step S503: modifying hyper-parameters such as the batch size and the learning rate, and repeating steps S1-S5 to obtain the optimal hyper-parameter design and the corresponding model.
6. An organization named entity normalization system based on the LEAM model, characterized by comprising:
module M1: screening all academic institution information data with a preset statistical rule and removing data that do not meet preset conditions;
module M2: removing noise from the screened data with regular expressions;
module M3: dividing the denoised data into a training set, a validation set and a test set by category and in a preset proportion;
module M4: feeding the training set and the validation set into the LEAM model and training a model for organization named entity normalization;
module M5: feeding the test set into the trained model, testing the model's performance and fine-tuning it;
module M4 comprises:
module M401: loading pre-trained 100-dimensional character vectors as the initialization of the characters for training;
module M402: sorting the original organization names in the training set by string length, taking batch-size pieces of training data in length order each time to form a training batch, and doing the same for the validation set;
module M403: representing each original name in each batch as a matrix of L rows and 100 columns, where L is the length of the longest string in the batch and each row of the matrix is the 100-dimensional vector of one character;
module M404: training the LEAM-based text classification model on the matrix representation of the training data;
module M405: after each training epoch, validating the text classification model on the validation set, judging whether it reaches the preset performance, and adjusting its hyper-parameters according to the validation results;
module M5 comprises:
module M501: testing the test set data with the trained model, counting the accuracy of the results, and storing the accuracy and the corresponding model;
module M502: analyzing the misclassified data according to preset rules, removing data with wrong labels, and manually labeling data that are hard to classify;
module M503: modifying hyper-parameters such as the batch size and the learning rate, and repeatedly executing modules M1-M5 to obtain the optimal hyper-parameter design and the corresponding model;
the LEAM-based text classification model is regarded as a cascade of three subsystems f_0, f_1 and f_2, where f_0 represents each character as a vector, f_1 operates on these vectors to obtain the representation of the text, and f_2 uses the text representation for classification;
in f_0, a matrix representation of the normalized organizations is learned so as to influence the character vector representations, while in f_1 the similarity between the normalized organizations and the characters of the original organization name is used to aggregate the text representation;
let c_i denote the vector representation of the i-th normalized organization and C the matrix representation of all normalized organizations; assuming K categories in total, the similarity between each category and each character is calculated with cosine similarity:
G = (C V) ⊘ Ĝ
where ⊘ denotes element-wise division, C is the K x 100 category embedding matrix, and each element of Ĝ is the product of the corresponding norms, Ĝ_kl = ||c_k|| · ||v_l||, so that G_kl is the cosine similarity between category k and character l;
V denotes the text representation matrix; v_l denotes the l-th column of V;
G_{l-r:l+r} measures the correlation of the window of width 2r+1 centered at position l, and the similarity vector is expressed as:
u_l = ReLU(G_{l-r:l+r} W_1 + b_1)
where u_l ∈ R^K;
W_1 denotes the linear weight parameters of the classification model; b_1 denotes the bias parameters of the classification model;
max pooling is used to obtain the coefficient of maximum correlation, from which the attention coefficients, named β, are obtained; the final text representation is:
z = Σ_{l=1}^{L} β_l v_l
where β_l denotes the l-th element of β;
the model is then converted into an optimization problem through the corresponding cross-entropy loss function, and the corresponding parameters are trained with the corresponding training set data.
7. The LEAM model-based organization named entity normalization system of claim 6, wherein module M1 comprises:
module M101: exporting the named entities of all academic institutions from the database;
module M102: counting the names and frequencies of all normalized organizations corresponding to each named entity;
module M103: retaining, for each named entity, the normalized organization with the maximum frequency and deleting the other data;
module M2 comprises:
module M201: converting Latin characters appearing in organization names into English letters;
module M202: removing stop words and punctuation marks from the converted organization names according to regular expressions, and expanding the abbreviations in the organization names;
module M3 comprises:
module M301: counting the number of organization entity aliases corresponding to each normalized organization category;
module M302: randomly dividing the entity aliases of each category into a training set, a validation set and a test set in a 6:2:2 ratio.
CN202011141040.XA 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model Active CN112215007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011141040.XA CN112215007B (en) 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011141040.XA CN112215007B (en) 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model

Publications (2)

Publication Number Publication Date
CN112215007A CN112215007A (en) 2021-01-12
CN112215007B true CN112215007B (en) 2022-09-23

Family

ID=74054853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011141040.XA Active CN112215007B (en) 2020-10-22 2020-10-22 Organization named entity normalization method and system based on LEAM model

Country Status (1)

Country Link
CN (1) CN112215007B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733543A (en) * 2021-01-26 2021-04-30 上海交通大学 Organization named entity normalization method and system based on text editing generation model
CN116910275B (en) * 2023-09-12 2023-12-15 无锡容智技术有限公司 Form generation method and system based on large language model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090985A (en) * 2019-11-28 2020-05-01 华中师范大学 Chinese text difficulty assessment method based on siamese network and multi-core LEAM framework
CN111274405A (en) * 2020-02-26 2020-06-12 北京工业大学 Text classification method based on GCN
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111738003A (en) * 2020-06-15 2020-10-02 中国科学院计算技术研究所 Named entity recognition model training method, named entity recognition method, and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444721B (en) * 2020-05-27 2022-09-23 南京大学 Chinese text key information extraction method based on pre-training language model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090985A (en) * 2019-11-28 2020-05-01 华中师范大学 Chinese text difficulty assessment method based on siamese network and multi-core LEAM framework
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal
CN111274405A (en) * 2020-02-26 2020-06-12 北京工业大学 Text classification method based on GCN
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN111738003A (en) * 2020-06-15 2020-10-02 中国科学院计算技术研究所 Named entity recognition model training method, named entity recognition method, and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Named Entity Recognition Method for the Government Affairs Domain Based on BERT-BLSTM-CRF; Yang Chunming et al.; Journal of Southwest University of Science and Technology; 2020-09-30 (No. 03); full text *
Sentiment Analysis Technology for Intelligent Customer Service Systems; Song Shuangyong et al.; Journal of Chinese Information Processing; 2020-02-15 (No. 02); chapters 1-8 *

Also Published As

Publication number Publication date
CN112215007A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN111274806B (en) Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
Campos et al. Biomedical named entity recognition: a survey of machine-learning tools
CN103154936B (en) For the method and system of robotization text correction
Wood et al. The sequence memoizer
CN109902159A (en) A kind of intelligent O&M statement similarity matching process based on natural language processing
Djumalieva et al. An open and data-driven taxonomy of skills extracted from online job adverts
CN112541056B (en) Medical term standardization method, device, electronic equipment and storage medium
CN110879831A (en) Chinese medicine sentence word segmentation method based on entity recognition technology
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
Noaman et al. Naive Bayes classifier based Arabic document categorization
US20120197888A1 (en) Method and apparatus for selecting clusterings to classify a predetermined data set
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN109036577A (en) Diabetic complication analysis method and device
CN112560478A (en) Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
CN112215007B (en) Organization named entity normalization method and system based on LEAM model
CN110222192A (en) Corpus method for building up and device
US20230325424A1 (en) Systems and methods for generating codes and code books based using cosine proximity
CN114528919A (en) Natural language processing method and device and computer equipment
CN111785387A (en) Method and system for disease standardized mapping classification by using Bert
CN111177375A (en) Electronic document classification method and device
CN115309910B (en) Language-text element and element relation joint extraction method and knowledge graph construction method
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN117422074A (en) Method, device, equipment and medium for standardizing clinical information text
CN112215006B (en) Organization named entity normalization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant