CN111984776B

CN111984776B - Mechanism name standardization method based on word vector model

Info

Publication number: CN111984776B
Application number: CN202010844347.XA
Authority: CN
Inventors: 侯颖; 崔运鹏; 李欢; 王婷; 马浩
Original assignee: Agricultural Information Institute of CAAS
Current assignee: Agricultural Information Institute of CAAS
Priority date: 2020-08-20
Filing date: 2020-08-20
Publication date: 2023-08-11
Anticipated expiration: 2040-08-20
Also published as: CN111984776A

Abstract

The invention discloses an institution name standardization method based on a word vector model, comprising the following steps: analyzing the characteristics of the institution name fields in scientific and technical literature data and selecting the institution-related fields; extracting the text of the relevant literature fields, then cleaning and transforming it; building a word vector model from the extracted text with the word2vec method and clustering the institution names; searching for highly similar words by combining the word vector model with the cluster file, and identifying and extracting institution names from them; and computing Jaro similarity against a set threshold to match similar institution names. The invention can effectively improve data reliability in science and technology evaluation based on massive data, and it standardizes the storage and management of institution names in scientific and technical literature databases, thereby making the construction of such databases more standardized.

Description

Institution name standardization method based on a word vector model

Technical Field

The invention relates to the fields of library and information science and information extraction, and in particular to an institution name standardization method based on a word vector model.

Background

A scientific and technical literature database contains literature resources from many sources. Extracting information from these massive resources relies on entities for retrieval. The entities mainly include person names, institutions, place names, dates, titles, and keywords, of which person names and institution names are the two most important types.

With historical change and institutional reform, institution names, especially those of organizations, often change as basic functions and organizational structures change. Many institution names are also expressed inconsistently and irregularly because of Chinese and English full names, short names, abbreviations, and the like. In scientific and technical literature, irregular and inconsistent institution names easily cause errors when retrieving and counting an institution's academic output, and they hinder statistics and mining analysis of the related data.

At present, name standardization for scientific and technical literature is still at the research stage and relies mainly on manual labeling and correction. These methods are labor-intensive, which makes it difficult to associate existing names with institutions. An institution name standardization method based on a word vector model is therefore needed to solve the above problems.

Disclosure of Invention

To solve the above technical problems, the invention provides an institution name standardization method based on a word vector model. The aim of the invention is achieved by the following technical scheme.

The invention comprises the following steps:

S10, analyzing the characteristics of the institution name fields in scientific and technical literature data, and selecting the institution-related fields;

S20, extracting the text of the relevant literature fields according to the selected institution-related fields;

S30, building a word2vec word vector model from the extracted texts, and clustering the institution names;

S40, searching for and extracting a set of similar institution names based on the word vector model and the clustering result;

S50, computing the similarity between institutions with the Jaro similarity method to obtain the final institution standardization result.

Further, the institution-related fields in scientific and technical literature metadata include the institution short name, the institution full name, the enhanced organization name provided by Web of Science, the secondary institution name, the institution's internal Id, and the institution address. The invention standardizes only primary institution names and does not deal with secondary institution names, so the institution-related fields finally selected are the institution short name, the institution full name, the system to which the institution belongs, the institution's internal Id, and the institution address.

Further, the extraction of literature-related text comprises extracting the relevant literature fields and cleaning and transforming the extracted field information, yielding the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the set list_inst-id of correspondences between cleaned and transformed institution names and internal Ids.

Further, the method for obtaining the word vector training text set text_wv comprises:

1) For each document R_i in the scientific and technical literature data set R, extracting the document title according to the tag information;

2) For each contributor C_j in the contributor set C of each document, determining whether its type is author and, if so, extracting the author name and the number of the author's institution according to the tag information;

3) Extracting, according to the institution number, the author's institution (including the institution short name, the institution full name, and the system to which the institution belongs) and the address number of that institution;

4) Extracting the address information of the institution according to the address number;

5) Extracting the document topics according to the tag information;

6) Cleaning, transforming, and storing the extracted document information.

Further, data cleaning comprises deleting special characters and punctuation and converting all text to lowercase. Data transformation targets author names and institution names: most consist of several words, and word vector training operates on individual words, so this step treats the author and author-institution fields specially. An author name consisting of several words is joined with the symbol '-', and an institution name consisting of several words is joined with the symbol '_', so that each author name and each institution name becomes a single token. The cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.
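The cleaning and transformation rules above can be illustrated with a minimal Python sketch. The function names are hypothetical, and two details are assumptions of this sketch rather than statements of the method: punctuation is replaced by a space (so that adjacent words do not fuse) instead of being deleted outright, and institution names are joined with '_' as in the embodiment's examples such as aarhus_univ.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase the text and strip punctuation and special characters."""
    lowered = text.lower()
    # Assumption: every character that is not a letter, digit or whitespace
    # (plus the underscore) becomes a space, so neighbouring words stay apart.
    stripped = re.sub(r"[^\w\s]|_", " ", lowered)
    return " ".join(stripped.split())


def join_author(name: str) -> str:
    """Turn a multi-word author name into one token joined by '-'."""
    return "-".join(clean_text(name).split())


def join_institution(name: str) -> str:
    """Turn a multi-word institution name into one token joined by '_'."""
    return "_".join(clean_text(name).split())


print(join_author("Zhang San"))
print(join_institution("Aarhus Univ."))
```

Joining the words of a name into one token means the word vector model learns a single vector for the whole name rather than separate vectors for its parts.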

Further, the set list_inst-id of correspondences between cleaned and transformed institution names and internal Ids contains the transformed institution name, the original institution name, and the institution's internal Id. The method for obtaining list_inst-id comprises:

1) For each document R_i in the scientific and technical literature data set R, extracting the institution-related information;

2) For each institution I_j in the institution set I of each document, extracting the institution name and the institution Id;

3) Cleaning and transforming the extracted institution name;

4) Forming a triple from the cleaned and transformed institution name, the original institution name, and the institution's internal Id;

6) Adding all triples to the set list_inst-id and storing it.
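As an illustration of the triple construction above, the following sketch builds list_inst-id from an in-memory list of documents. The document layout (a dict with an "institutions" list whose items have "name" and "id" keys), the Id values, and the function names are hypothetical; the real metadata schema is not specified here.

```python
import re
from typing import Dict, Iterable, List, Tuple

# (cleaned name, original name, internal Id)
Triple = Tuple[str, str, str]


def normalize_institution(name: str) -> str:
    """Lowercase, drop punctuation, and join the words with '_'."""
    words = re.sub(r"[^\w\s]|_", " ", name.lower()).split()
    return "_".join(words)


def build_list_inst_id(documents: Iterable[Dict]) -> List[Triple]:
    """Collect one triple per distinct (cleaned, original, Id) combination."""
    triples = set()
    for doc in documents:
        for inst in doc.get("institutions", []):
            original = inst["name"]
            triples.add((normalize_institution(original), original, inst["id"]))
    return sorted(triples)


# Hypothetical input records, for illustration only.
documents = [
    {"institutions": [{"name": "Aarhus Univ", "id": "I-001"}]},
    {"institutions": [{"name": "Aarhus Univ", "id": "I-001"},
                      {"name": "Aarhus University", "id": "I-001"}]},
]
list_inst_id = build_list_inst_id(documents)
for triple in list_inst_id:
    print(triple)
```

Keeping the original name in each triple is what later allows a cleaned name to be mapped back to the form that appears in the literature.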

Further, the word2vec word vector model is trained with the skip-gram model, which comprises an input layer, a hidden layer, and an output layer. The model uses the center word to predict its context with prediction probability p(w_i | w_t), where t - c ≤ i ≤ t + c, i ≠ t, and c is the context window constant; that is, the input is the word vector of a given word and the output is the word vectors of that word's context. Training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set.
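To make the window condition t - c ≤ i ≤ t + c, i ≠ t concrete, the sketch below enumerates the (center word, context word) training pairs that a skip-gram model would see for one tokenized line. It only illustrates the pairing; it does not train a model, and the function name and sample tokens are assumptions of this sketch.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], c: int) -> List[Tuple[str, str]]:
    """Return every (center, context) pair with t - c <= i <= t + c and i != t."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - c)                 # clip the window at the line start
        hi = min(len(tokens) - 1, t + c)   # clip the window at the line end
        for i in range(lo, hi + 1):
            if i != t:
                pairs.append((center, tokens[i]))
    return pairs


# Illustrative line of training text: author token, institution token, address words.
line = ["birkedal-henrik", "aarhus_univ", "aarhus", "denmark"]
for pair in skipgram_pairs(line, c=1):
    print(pair)
```

Because author and institution names are single tokens, an institution token such as aarhus_univ ends up paired with the authors and address words that co-occur with it, which is what lets name variants used in the same contexts receive similar vectors.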

Further, the institution name clustering uses the K-means clustering method to obtain the vectors of all cluster centers, computes the similarity between each word vector output by the word vector model and every cluster center to obtain the cluster category of each word, and finally produces the cluster file class_wv of the sorted words and their categories.
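The assignment step can be sketched as follows, assuming the cluster centers have already been fitted by K-means. The use of cosine similarity is an assumption of this sketch (the text says only "similarity", and standard K-means assigns by Euclidean distance), and the toy two-dimensional vectors and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_clusters(word_vectors: Dict[str, Sequence[float]],
                    centers: List[Sequence[float]]) -> List[Tuple[str, int]]:
    """Give each word the index of its most similar center, sorted by cluster."""
    assignment = {}
    for word, vec in word_vectors.items():
        sims = [cosine(vec, center) for center in centers]
        assignment[word] = max(range(len(centers)), key=sims.__getitem__)
    # Sorting by (cluster, word) places words of one cluster next to each other.
    return sorted(assignment.items(), key=lambda item: (item[1], item[0]))


# Toy 2-dimensional vectors and centers, for illustration only.
word_vectors = {
    "aarhus_univ": [1.0, 0.1],
    "univ_aarhus": [0.9, 0.2],
    "shanxi_univ": [0.1, 1.0],
}
centers = [[1.0, 0.0], [0.0, 1.0]]
class_wv = assign_clusters(word_vectors, centers)
print(class_wv)
```

The sorted list plays the role of the cluster file class_wv: words that fall into the same cluster appear on adjacent rows.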

Further, searching for and extracting similar institution names specifically comprises:

1) Intersecting the cleaned and transformed institution names inst_clean in the set list_inst-id with the cluster file class_wv to obtain the set class_inst of all clustered institution names;

2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; if the similarity Si_ij between a word w_j in W and institution i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst;

3) For each similar institution s_i in the set s_inst, finding the institution names close to it in the set class_inst and adding them to the set inst_i;

4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
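Steps 1) and 3) above can be sketched as follows. The sketch assumes that class_wv is a list of (word, cluster) rows sorted so that related words are adjacent, and that "close" names are taken as the n rows before and after a name in class_inst; the embodiment mentions a window of 10. The function names and sample rows are hypothetical.

```python
from typing import List, Set, Tuple


def build_class_inst(class_wv: List[Tuple[str, int]],
                     inst_clean: Set[str]) -> List[Tuple[str, int]]:
    """Step 1: keep only the clustered words that are cleaned institution names."""
    return [(word, cluster) for word, cluster in class_wv if word in inst_clean]


def neighbours(class_inst: List[Tuple[str, int]], name: str, n: int = 10) -> List[str]:
    """Step 3: return up to n names before and n names after `name` in class_inst."""
    names = [word for word, _ in class_inst]
    if name not in names:
        return []
    pos = names.index(name)
    lo = max(0, pos - n)
    return [word for word in names[lo:pos + n + 1] if word != name]


# Illustrative cluster file rows and cleaned institution names.
class_wv = [
    ("aalborg_university", 3), ("aarhus", 3), ("aarhus_univ", 3),
    ("aarhus_university", 3), ("denmark", 3),
]
inst_clean = {"aalborg_university", "aarhus_univ", "aarhus_university"}

class_inst = build_class_inst(class_wv, inst_clean)
print(neighbours(class_inst, "aarhus_univ", n=1))
```

The candidates gathered this way form inst_i, which is then narrowed down by the string-level Jaro comparison.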

Further, the method for computing similar institution names with the Jaro similarity method comprises:

1) For each s_i in the set inst_jaro and its corresponding inst_i, computing the similarity between institutions with Jaro similarity;

2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i;

3) Merging the set s_inst with the set sinst_i and removing duplicates to obtain the final set of similar institutions final_i;

4) For each institution f_i in final_i, looking up its original institution name in the correspondence set list_inst-id and adding it to the set O_i;

5) Adding the set O_i obtained for each institution to the set O to obtain the final institution standardization result.
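For reference, Jaro similarity between two strings s1 and s2 with m matching characters and t transpositions is (m/|s1| + m/|s2| + (m - t)/m) / 3, where characters match only if they are equal and no farther apart than max(|s1|, |s2|) // 2 - 1 positions. The sketch below is a plain implementation of that standard definition followed by the threshold filter of step 2); the function names and sample strings are assumptions of this sketch.

```python
from typing import Iterable, Set


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings over matched characters only and count order mismatches.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def filter_by_jaro(s_i: str, inst_i: Iterable[str], threshold: float) -> Set[str]:
    """Keep the candidates whose Jaro similarity to s_i exceeds the threshold."""
    return {name for name in inst_i if jaro(s_i, name) > threshold}


print(round(jaro("MARTHA", "MARHTA"), 4))
print(filter_by_jaro("aarhus_univ", ["aarhus_unv", "shanxi_univ"], 0.95))
```

Jaro works on the characters of the names, so it complements the word vector step: the vectors propose candidates that occur in similar contexts, and the string comparison keeps those that are also spelled almost identically.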

One or more embodiments of the present invention may have the following advantages over the prior art:

The institution name standardization method based on a word vector model uses the word2vec word vector method, which ties text computation to context so that the word vectors carry semantic features, and combines it with Jaro similarity to map multiple institution names to a single institution entity. The invention can effectively improve data reliability in science and technology evaluation based on massive data, and it standardizes the storage and management of institution names in scientific and technical literature databases, thereby making the construction of such databases more standardized.

Drawings

FIG. 1 is a flow chart of the institution name standardization method based on a word vector model;

FIG. 2 is a flow chart of obtaining the word vector training text in the method;

FIG. 3 is a flow chart of obtaining the correspondence between institution names and internal Ids in the method;

FIG. 4 is a diagram of the skip-gram model used in the method;

FIG. 5 is a flow chart of extracting similar institution names in the method;

FIG. 6 is a flow chart of computing similar institution names with Jaro similarity in the method.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following examples and the accompanying drawings.

As shown in FIG. 1, an institution name standardization method based on a word vector model is provided. In this embodiment the basic parameters are set as follows: the context window constant in the skip-gram model is c = 5, the similarity threshold in the word vector model is T_wv = 0.8, and the Jaro similarity threshold is T_jaro = 0.95. The implementation is divided into the following steps:

Step S10, analyzing the characteristics of the institution-related fields in scientific and technical literature data, and selecting the institution-related fields;

The institution-related fields in scientific and technical literature metadata include the institution short name (such as Shanxi Univ), the institution full name (such as Shanxi University), the enhanced organization name provided by Web of Science, the secondary institution name, the institution's internal Id, and the institution address. The invention standardizes only primary institution names and does not deal with secondary institution names, so the institution-related fields finally selected are the institution short name, the institution full name, the system to which the institution belongs, the institution's internal Id, and the institution address.

Step S20, extracting a document related field information text;

This step extracts the relevant literature fields and cleans and transforms the extracted field information, yielding the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the set list_inst-id of correspondences between cleaned and transformed institution names and internal Ids.

The method for obtaining the word vector training text set text_wv comprises, as shown in FIG. 2:

5) Extracting the document topics according to the tag information;

6) Cleaning, transforming, and storing the extracted document information.

Data cleaning comprises deleting special characters and punctuation and converting all text to lowercase. Data transformation targets author names and institution names: most consist of several words, and word vector training operates on individual words, so this step treats the author and author-institution fields specially. An author name consisting of several words is joined with the symbol '-' (for example, Zhang San becomes Zhang-San), and an institution name consisting of several words is joined with the symbol '_', so that each author name and each institution name becomes a single token.

The cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.

The set list_inst-id of correspondences between cleaned and transformed institution names and internal Ids contains the transformed institution name, the original institution name, and the institution's internal Id. The method for obtaining list_inst-id comprises, as shown in FIG. 3:

3) Cleaning and transforming the extracted institution name;

6) Adding all triples to the set list_inst-id and storing it.

Step S30, building a word2vec word vector model and clustering the institution names;

This step trains the word2vec word vector model and performs the clustering. Word2vec training uses the skip-gram model. As shown in FIG. 4, the model comprises an input layer, a hidden layer, and an output layer. The model uses the center word to predict its context with prediction probability p(w_i | w_t), where t - c ≤ i ≤ t + c, i ≠ t, and c is the context window constant, set to 5; that is, the input is the word vector of a given word and the output is the word vectors of that word's context. Training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set. Each word vector has 200 dimensions, and each dimension is a real number.

Institution name clustering clusters the output vector of every word in the word vector model. The number of categories can be chosen by the user and is set to 100 here. The result is the cluster file class_wv of the sorted words and their categories.

Step S40, searching and extracting similar mechanism names based on the word vector model and the clustering result;

This step extracts similar institution names based on the word vector model and the clustering result. The method for extracting similar institution names comprises the following steps:

2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; if the similarity Si_ij between a word w_j in W and institution i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst. For example, for institution aarhus_university, the 8 most similar words and their similarities are looked up in the word vector model; the results are shown in Table 1.

Table 1. Similar words found in the word vector model

Word	Similarity
aarhus	0.9022127389907837
aarhus_univ	0.889610767364502
univ_aarhus	0.8739291429519653
aarhus_univ_hosp	0.8125547766685486
pedersen-steen-b	0.7520588636398315
birkedal-henrik	0.7493382692337036
ostergaard-jane-nautrup	0.745323657989502
kronvang-brian	0.7398045063018799

For this institution, the words whose similarity is greater than the set threshold of 0.8 are added to the set, giving s_inst = {'aarhus_univ','univ_aarhus','aarhus_univ_hosp'};

3) For each similar institution s_i in the set s_inst, finding the institution names close to it in the set class_inst and adding them to the set inst_i. For example, for institution aarhus_univ, the 10 institution names before and after it in class_inst are added to the set inst_i.

4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
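The filtering in step 2) can be reproduced with the values of Table 1. In this sketch the similar-word list is hard-coded from the table rather than queried from a trained model, and the contents of inst_clean are an assumption chosen for illustration: 'aarhus' is taken not to be a cleaned institution name, which would explain why it is absent from s_inst even though its similarity exceeds 0.8.

```python
from typing import Iterable, Set, Tuple

T_WV = 0.8  # similarity threshold of the embodiment

# (word, similarity) rows copied from Table 1.
table_1 = [
    ("aarhus", 0.9022127389907837),
    ("aarhus_univ", 0.889610767364502),
    ("univ_aarhus", 0.8739291429519653),
    ("aarhus_univ_hosp", 0.8125547766685486),
    ("pedersen-steen-b", 0.7520588636398315),
    ("birkedal-henrik", 0.7493382692337036),
    ("ostergaard-jane-nautrup", 0.745323657989502),
    ("kronvang-brian", 0.7398045063018799),
]

# Assumed excerpt of the cleaned institution names; not taken from the source.
inst_clean = {"aarhus_university", "aarhus_univ", "univ_aarhus", "aarhus_univ_hosp"}


def select_similar(similar_words: Iterable[Tuple[str, float]],
                   inst_clean: Set[str],
                   threshold: float) -> Set[str]:
    """Keep words above the threshold that are also cleaned institution names."""
    return {word for word, sim in similar_words
            if sim > threshold and word in inst_clean}


s_inst = select_similar(table_1, inst_clean, T_WV)
print(sorted(s_inst))
```

The membership test against inst_clean is what removes place names and author tokens, which can sit close to an institution in the vector space without being institutions themselves.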

Step S50, calculating the similarity between the mechanisms by adopting a Jaro similarity method;

This step computes the Jaro similarity between each s_i in the set inst_jaro and the institutions contained in its corresponding inst_i. The method for computing similar institution names comprises, as shown in FIG. 6:

1) For each s_i in the set inst_jaro and its corresponding inst_i, computing the similarity between institutions with Jaro similarity;

2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i. For example, for institution aarhus_univ the computed set sinst_i is {'aalborg_university','aarhus_university'}.

3) Merging the set s_inst with the set sinst_i and removing duplicates to obtain the final set of similar institutions final_i. For example, for institution aarhus_univ, merging the sets s_inst and sinst_i gives final_i = {'aarhus_university','aarhus_univ','univ_aarhus','aarhus_univ_hosp','aalborg_university'}.

4) For each institution f_i in final_i, looking up its original institution name in the correspondence set list_inst-id and adding it to the set O_i. For example, for institution aarhus_univ this step gives O_i = {'Aarhus University','Aarhus Univ','Aarhus Univ Hosp','Aalborg University'}.
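Steps 3) and 4) amount to a set union followed by a lookup, which the sketch below illustrates using the example names from this embodiment. The triples in list_inst_id, in particular the Id values, are invented for illustration, and the function names are hypothetical; a cleaned name without a triple is simply skipped.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (cleaned name, original name, internal Id)


def merge_candidates(s_inst: Set[str], sinst_sets: Iterable[Set[str]]) -> Set[str]:
    """Step 3: union of s_inst and every sinst_i; a set drops duplicates."""
    final = set(s_inst)
    for names in sinst_sets:
        final |= set(names)
    return final


def original_names(final_i: Set[str], list_inst_id: List[Triple]) -> Set[str]:
    """Step 4: map cleaned names back to the original names in list_inst-id."""
    return {orig for clean, orig, _inst_id in list_inst_id if clean in final_i}


s_inst = {"aarhus_univ", "univ_aarhus", "aarhus_univ_hosp"}
sinst_i = {"aalborg_university", "aarhus_university"}
final_i = merge_candidates(s_inst, [sinst_i])

# Hypothetical correspondence triples; the Ids are placeholders.
list_inst_id = [
    ("aarhus_university", "Aarhus University", "ID-1"),
    ("aarhus_univ", "Aarhus Univ", "ID-1"),
    ("aarhus_univ_hosp", "Aarhus Univ Hosp", "ID-2"),
    ("aalborg_university", "Aalborg University", "ID-3"),
    ("shanxi_univ", "Shanxi Univ", "ID-4"),
]
O_i = original_names(final_i, list_inst_id)
print(sorted(O_i))
```

Because the lookup goes through the stored triples, the result is expressed in the original spellings found in the literature rather than in the lowercase, underscore-joined tokens used for training.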

Testing the method described above on scientific and technical literature covering 370 institutions gave a measured accuracy of 91%.

The invention provides a word-vector-based institution name mapping method that combines word2vec word vectors with Jaro similarity to map multiple institution names to a single institution entity. The word2vec-based word vector computation overcomes the failure of earlier text computation to consider context: it ties text computation to the context of the corpus, so that the word vectors carry semantic features. When the standardized library is built, the similarity and identity of institution names are therefore judged, to a certain extent, with the help of context, which achieves the aim of institution name standardization.

Although the embodiments of the present invention are described above, the embodiments are only used for facilitating understanding of the present invention, and are not intended to limit the present invention. Any person skilled in the art can make any modification and variation in form and detail without departing from the spirit and scope of the present disclosure, but the scope of the present disclosure is still subject to the scope of the appended claims.

Claims

1. An institution name standardization method based on a word vector model, characterized in that:

the method comprises the following steps:

2. The method of claim 1, wherein the institution-related fields include the institution short name, the institution full name, the enhanced organization name provided by Web of Science, the secondary institution name, the institution's internal Id, and the institution address.

3. The method according to claim 1, wherein the extraction of literature-related text comprises extracting the relevant literature fields and cleaning and transforming the extracted field information, yielding the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the set list_inst-id of correspondences between cleaned and transformed institution names and internal Ids.

4. The method of claim 3, wherein the method for obtaining the word vector training text set text_wv comprises:

3) Extracting, according to the institution number, the short name, the full name, the system, and the address number of the author's institution;

5) Extracting the document topics according to the tag information;

6) Cleaning, transforming, and storing the extracted document information.

5. The method of claim 3, wherein data cleaning comprises deleting special characters and punctuation and converting all text to lowercase, and the cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.

6. The method of claim 3, wherein the set list_inst-id of correspondences between cleaned and transformed institution names and internal Ids contains the transformed institution name, the original institution name, and the institution's internal Id; the method for obtaining list_inst-id comprises:

3) Cleaning and transforming the extracted institution name;

6) Adding all triples to the set list_inst-id and storing it.

7. The institution name standardization method based on a word vector model according to claim 3, wherein the word2vec word vector model is trained with the skip-gram model; the skip-gram model comprises an input layer, a hidden layer, and an output layer; the model uses the center word to predict its context with prediction probability p(w_i | w_t), where t - c ≤ i ≤ t + c, i ≠ t, and c is the context window constant, that is, the input is the word vector of a given word and the output is the word vectors of that word's context; training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set.

8. The institution name standardization method based on a word vector model according to claim 3, wherein the institution name clustering uses the K-means clustering method to obtain the vectors of all cluster centers, computes the similarity between each word vector output by the word vector model and every cluster center to obtain the cluster category of each word, and finally produces the cluster file class_wv of the sorted words and their categories.

9. The institution name standardization method based on a word vector model according to claim 8, wherein searching for and extracting similar institution names specifically comprises:

1) Intersecting the cleaned and transformed institution names inst_clean in the set list_inst-id with the cluster file class_wv to obtain the set class_inst of all clustered institution names;

2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; if the similarity Si_ij between a word w_j in W and institution i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst;

4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.

10. The method of claim 9, wherein the method for computing similar institution names with the Jaro similarity method comprises:

2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i;

3) Merging the set s_inst with the set sinst_i and removing duplicates to obtain the final set of similar institutions final_i;

4) For each institution f_i in final_i, looking up its original institution name in the correspondence set list_inst-id and adding it to the set O_i;