CN111984776B - Mechanism name standardization method based on word vector model - Google Patents

Publication number
CN111984776B
CN111984776B CN202010844347.XA CN202010844347A CN111984776B CN 111984776 B CN111984776 B CN 111984776B CN 202010844347 A CN202010844347 A CN 202010844347A CN 111984776 B CN111984776 B CN 111984776B
Authority
CN
China
Prior art keywords
name
inst
organization
word
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010844347.XA
Other languages
Chinese (zh)
Other versions
CN111984776A (en)
Inventor
侯颖
崔运鹏
李欢
王婷
马浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Information Institute of CAAS
Original Assignee
Agricultural Information Institute of CAAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Information Institute of CAAS
Priority to CN202010844347.XA
Publication of CN111984776A
Application granted
Publication of CN111984776B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/35: Clustering; Classification
    • G06F16/38: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an institution name standardization method based on a word vector model, comprising the following steps: analyzing the characteristics of the institution name fields in scientific and technical literature data and selecting the institution-related fields; extracting the text of the relevant fields from the literature, then cleaning and transforming it; building a word vector model from the extracted text with the word2vec method and clustering the institution names; searching for highly similar words by combining the word vector model with the cluster file, and identifying and extracting institution names from them; and computing the Jaro similarity between candidate names and matching similar institutions against a set threshold. The invention improves the reliability of data used in science and technology evaluation based on massive data and standardizes the storage and management of institution names in scientific literature databases, thereby making the construction of such databases more standardized.

Description

Mechanism name standardization method based on word vector model
Technical Field
The invention relates to the field of library and information science and information extraction, and in particular to an institution name standardization method based on a word vector model.
Background
A scientific literature database contains literature resources from multiple sources. Extracting information from these massive resources and searching them relies on entities, which mainly include person names, institutions, place names, dates, titles and keywords; person names and institution names are the two most important types.
With historical change and institutional reform, institution names, and especially the names of groups, often change as their basic functions and organizational structures change. Many institution names are also expressed inconsistently and non-standardly because of Chinese and English full names, short names, abbreviations and the like. In scientific literature, irregular and inconsistent institution names easily cause errors when retrieving and counting an institution's academic output, and they hinder statistics and mining analysis of the related data.
At present, name standardization for scientific and technical literature is still at the research stage. It relies mainly on manual labeling and correction, which carries a high labor cost and makes it difficult to associate existing names with institutions. There is therefore a need for an institution name standardization method based on a word vector model to solve these problems.
Disclosure of Invention
To solve the above technical problems, the invention aims to provide an institution name standardization method based on a word vector model. This aim is achieved by the following technical scheme:
The invention comprises the following steps:
S10, analyzing the characteristics of the institution name fields in scientific and technical literature data, and selecting the institution-related fields;
S20, extracting the text of the literature fields related to the selected institution fields;
S30, building a word2vec word vector model from the extracted texts, and clustering the institution names;
S40, searching for and extracting sets of similar institution names based on the word vector model and the clustering result;
S50, computing the similarity between institutions with the Jaro similarity method to obtain the final institution standardization result.
Further, the institution fields in the metadata of the analyzed scientific literature include the institution abbreviation, the institution full name, the enhanced organization name provided by Web of Science, the secondary institution name, the institution internal Id and the institution address. The invention standardizes only primary institution names and does not address secondary institution names, so the institution-related fields finally selected are the institution abbreviation, the institution full name, the system to which the institution belongs, the institution internal Id and the institution address.
Further, extracting the literature-related text comprises extracting the relevant literature fields and cleaning and transforming the extracted field information, yielding the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the list list_inst-id of correspondences between cleaned and transformed institution names and internal Ids.
Further, the method of obtaining the word vector training text set text_wv comprises:
1) For each document r_i in the scientific literature data set R, extracting the document title according to the tag information;
2) For each contributor c_j in the contributor set C of each document, judging whether its type is author and, if so, extracting the author name and the author's institution number according to the tag information;
3) Extracting the author's institution (including the institution abbreviation, the institution full name and the system to which the institution belongs) and the institution's address number according to the institution number;
4) Extracting the address information of the institution according to the address number;
5) Extracting the document subject according to the tag information;
6) Cleaning, transforming and storing the extracted document information.
Further, data cleaning comprises deleting special characters and punctuation and converting all text to lowercase. Data transformation targets author names and institution names: most of them consist of several words, and word vector training treats each word as a unit, so this step handles the author and author-institution fields specially. The words of a multi-word author name are joined with the symbol '-' and the words of a multi-word institution name are joined with the symbol '_', so that each author name and each institution name becomes a single token. The cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.
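The cleaning and transformation rules can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `clean`, `join_author` and `join_institution` are hypothetical, and the exact set of characters treated as "special" is an assumption (here, anything that is not a letter, digit or whitespace).

```python
import re


def clean(text: str) -> str:
    """Lowercase the text and replace punctuation/special characters with spaces."""
    # Assumption: "special characters" means anything other than letters,
    # digits and whitespace; underscores are removed too, because '_' is
    # reserved for joining institution words.
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    return " ".join(text.split())


def join_author(name: str) -> str:
    """Join the words of an author name with '-' so it forms one token."""
    return "-".join(clean(name).split())


def join_institution(name: str) -> str:
    """Join the words of an institution name with '_' so it forms one token."""
    return "_".join(clean(name).split())


print(join_author("Zhang San"))
print(join_institution("Aarhus Univ."))
```

Joining with two different symbols keeps author tokens and institution tokens distinguishable in the trained vocabulary.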
Further, the list list_inst-id of correspondences between cleaned and transformed institution names and internal Ids contains the transformed institution name, the original institution name and the institution internal Id. The method of obtaining list_inst-id comprises:
1) For each document r_i in the scientific literature data set R, extracting the institution-related information;
2) For each institution i_j in the institution set I of each document, extracting the institution name and the institution Id;
3) Cleaning and transforming the extracted institution name;
4) Forming a triple from the cleaned and transformed institution name, the original institution name and the institution internal Id;
5) Adding all triples to the set list_inst-id and storing it.
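The triple-building steps above can be sketched as follows. The record layout (a dict with an `institutions` list holding `name` and `id` keys), the helper `to_token` and the Id values are assumptions made for illustration; the source does not specify a storage format.

```python
import re
from typing import Iterable, List, NamedTuple


class InstRecord(NamedTuple):
    clean_name: str     # cleaned and transformed institution name
    original_name: str  # name as it appears in the metadata
    inst_id: str        # institution internal Id


def to_token(name: str) -> str:
    """Lowercase, strip punctuation and join the words with '_'."""
    words = re.sub(r"[^\w\s]|_", " ", name.lower()).split()
    return "_".join(words)


def build_list_inst_id(documents: Iterable[dict]) -> List[InstRecord]:
    """Collect one (clean name, original name, Id) triple per institution mention."""
    triples = []
    for doc in documents:                          # each document r_i in R
        for inst in doc.get("institutions", []):   # each institution i_j in I
            triples.append(InstRecord(to_token(inst["name"]), inst["name"], inst["id"]))
    return triples


# Hypothetical input: the Ids are invented for the example.
docs = [{"institutions": [{"name": "Aarhus Univ", "id": "I001"},
                          {"name": "Aarhus University", "id": "I001"}]}]
list_inst_id = build_list_inst_id(docs)
print(list_inst_id)
```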
Further, the word2vec word vector model is trained with the skip-gram model, which contains an input layer, a hidden layer and an output layer. The model uses the center word to predict its context, with prediction probability $p(w_i \mid w_t)$, where $t - c \le i \le t + c$, $i \ne t$, and c is the context window constant; that is, the input is the word vector of a given word and the output is the word vectors of its context words. Training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set.
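To make the window condition concrete, the sketch below enumerates the (center word, context word) training pairs that a skip-gram model sees for one tokenized line. It only illustrates the index range t − c ≤ i ≤ t + c, i ≠ t; it does not train a model, and the function name and example tokens are hypothetical.

```python
def skipgram_pairs(tokens, c):
    """Return (center, context) pairs for every position t and every
    position i with t - c <= i <= t + c and i != t, clipped to the line."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - c)
        hi = min(len(tokens) - 1, t + c)
        for i in range(lo, hi + 1):
            if i != t:
                pairs.append((center, tokens[i]))
    return pairs


# Hypothetical training line after cleaning and transformation.
line = ["zhang-san", "aarhus_univ", "denmark", "soil", "nitrogen"]
pairs = skipgram_pairs(line, c=2)
print(len(pairs), pairs[:2])
```

Because author and institution names are single tokens, an institution token regularly shares windows with the same author and address tokens, which is what lets variant names end up close together in the vector space.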
Further, the institution name clustering uses the K-means clustering method to obtain the vector of each cluster center; the similarity between each word vector output by the word vector model and every cluster center is computed to obtain the cluster category of each word, finally producing the cluster file class_wv of sorted words and their categories.
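The assignment step of this clustering can be sketched as below, assuming the cluster centers have already been computed by K-means. Euclidean distance is used as the closeness measure, which is the usual K-means choice but is not stated in the source; the sort order of the cluster file (by category, then alphabetically) and the toy 2-D vectors are likewise assumptions.

```python
import math


def nearest_center(vec, centers):
    """Return the index of the cluster center closest to vec (Euclidean distance)."""
    dists = [math.dist(vec, center) for center in centers]
    return dists.index(min(dists))


def build_class_wv(word_vectors, centers):
    """Assign each word to a cluster and return (word, category) rows,
    sorted by category and then alphabetically."""
    rows = [(word, nearest_center(vec, centers)) for word, vec in word_vectors.items()]
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Toy 2-D vectors and centers, invented for the example.
word_vectors = {
    "univ_aarhus": (0.8, 0.2),
    "kronvang-brian": (0.1, 0.9),
    "aarhus_univ": (0.9, 0.1),
}
centers = [(1.0, 0.0), (0.0, 1.0)]
class_wv = build_class_wv(word_vectors, centers)
print(class_wv)
```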
Further, searching for and extracting similar institution names specifically comprises:
1) Intersecting the cleaned and transformed institution names inst_clean in the set list_inst-id with the cluster file class_wv to obtain the set class_inst of all clustered institution names;
2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; for each word w_j in W, if the similarity Si_ij between w_j and institution i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst;
3) For each similar institution s_i in the set s_inst, finding the institution names similar to it in the set class_inst and adding them to the set inst_i;
4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
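Step 2 of this search can be sketched as follows. Cosine similarity is assumed as the similarity measure, since it is the conventional choice for word2vec vectors, although the source does not name it. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_institutions(name, vectors, inst_clean, t_wv):
    """Words whose similarity to `name` exceeds t_wv and that are
    known institution names (members of inst_clean)."""
    query = vectors[name]
    return {
        word
        for word, vec in vectors.items()
        if word != name and word in inst_clean and cosine(query, vec) > t_wv
    }


# Toy 2-D vectors, invented for the example.
vectors = {
    "aarhus_university": (1.0, 0.1),
    "aarhus_univ": (0.9, 0.2),
    "aarhus": (0.95, 0.1),          # close, but not an institution name
    "kronvang-brian": (0.1, 1.0),   # an author token
}
inst_clean = {"aarhus_university", "aarhus_univ"}
s_inst = similar_institutions("aarhus_university", vectors, inst_clean, t_wv=0.8)
print(s_inst)
```

The membership test against inst_clean is what keeps place names and author tokens out of s_inst even when their vectors are close to the query.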
Further, the method of computing similar institution names with the Jaro similarity method comprises:
1) For each s_i in the set inst_jaro and its corresponding inst_i, computing the similarity between institutions with the Jaro similarity;
2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i;
3) Merging the set s_inst with the set sinst_i and removing duplicates to obtain the final set of similar institutions final_i;
4) For each institution f_i in final_i, looking up the original institution name in the name-to-internal-Id correspondence list list_inst-id and adding it to the set O_i;
5) Adding the set O_i obtained for each institution to the set O to obtain the final institution standardization result.
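For readers unfamiliar with the Jaro similarity, here is a plain-Python sketch of the standard definition together with the threshold filter of step 2. The source does not give its own implementation, so this is the textbook algorithm; the helper `filter_by_jaro` and the misspelled example name are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def filter_by_jaro(s_i, inst_i, t_jaro):
    """Step 2: keep the candidates whose Jaro similarity to s_i exceeds t_jaro."""
    return {name for name in inst_i if jaro(s_i, name) > t_jaro}


# Hypothetical candidates: one spelling variant and one unrelated name.
candidates = {"aarhus_universty", "shanxi_univ"}
sinst_i = filter_by_jaro("aarhus_university", candidates, t_jaro=0.95)
print(sinst_i)
```

Jaro works on characters, so it complements the word vector step: the vectors capture names that share context, while Jaro captures names that share spelling.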
One or more embodiments of the present invention may have the following advantages over the prior art:
The institution name standardization method based on the word vector model uses the word2vec word vector computation, which ties text computation to context so that the word vectors carry semantic features; combined with the Jaro similarity, it maps multiple institution names to a single institution entity. The invention improves the reliability of data used in science and technology evaluation based on massive data and standardizes the storage and management of institution names in scientific literature databases, thereby making the construction of such databases more standardized.
Drawings
FIG. 1 is a flow chart of the institution name standardization method based on a word vector model;
FIG. 2 is a flow chart of obtaining the word vector training text in the method;
FIG. 3 is a flow chart of obtaining the correspondence between institution names and internal Ids in the method;
FIG. 4 is a diagram of the skip-gram model used in the method;
FIG. 5 is a flow chart of extracting similar institution names in the method;
FIG. 6 is a flow chart of computing similar institution names with the Jaro similarity in the method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following examples and the accompanying drawings.
As shown in FIG. 1, an institution name standardization method based on a word vector model is provided. In this embodiment the basic parameters are set as follows: context window constant c = 5 in the skip-gram model, word vector similarity threshold T_wv = 0.8, and Jaro similarity threshold T_jaro = 0.95. The method is implemented in the following steps:
Step S10, analyzing the characteristics of the institution-related fields in scientific and technical literature data, and selecting the institution-related fields;
The institution-related fields in the metadata of scientific literature include the institution abbreviation (such as Shanxi Univ), the institution full name (such as Shanxi University), the enhanced organization name provided by Web of Science, the secondary institution name, the institution internal Id and the institution address. The invention standardizes only primary institution names and does not address secondary institution names, so the institution-related fields finally selected are the institution abbreviation, the institution full name, the system to which the institution belongs, the institution internal Id and the institution address.
Step S20, extracting the text of the literature-related fields;
This step extracts the relevant literature fields and cleans and transforms the extracted field information, yielding the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the list list_inst-id of correspondences between cleaned and transformed institution names and internal Ids.
The method of obtaining the word vector training text set text_wv, as shown in FIG. 2, comprises:
1) For each document r_i in the scientific literature data set R, extracting the document title according to the tag information;
2) For each contributor c_j in the contributor set C of each document, judging whether its type is author and, if so, extracting the author name and the author's institution number according to the tag information;
3) Extracting the author's institution (including the institution abbreviation, the institution full name and the system to which the institution belongs) and the institution's address number according to the institution number;
4) Extracting the address information of the institution according to the address number;
5) Extracting the document subject according to the tag information;
6) Cleaning, transforming and storing the extracted document information.
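The extraction steps above amount to following two references per author: contributor to institution number, and institution to address number. The sketch below shows that traversal on a single record. The record schema (`contributors`, `inst_no`, `addr_no` and so on), the field values other than the names quoted in this description, and the function name are all hypothetical; cleaning and transformation (step 6) are left out to keep the traversal visible.

```python
def extract_fields(doc: dict) -> list:
    """Collect the raw field values of one document in training order."""
    parts = [doc["title"]]                                  # step 1: title
    for contributor in doc["contributors"]:                 # step 2: authors only
        if contributor["type"] != "author":
            continue
        parts.append(contributor["name"])
        inst = doc["institutions"][contributor["inst_no"]]  # step 3: institution
        parts.extend([inst["abbr"], inst["full_name"], inst["system"]])
        parts.append(doc["addresses"][inst["addr_no"]])     # step 4: address
    parts.append(doc["subject"])                            # step 5: subject
    return parts


# Hypothetical record; only the institution and author names come from the text.
doc = {
    "title": "Soil nitrogen dynamics",
    "subject": "Agriculture",
    "contributors": [
        {"type": "author", "name": "Zhang San", "inst_no": "1"},
        {"type": "editor", "name": "Li Si", "inst_no": "1"},
    ],
    "institutions": {
        "1": {"abbr": "Shanxi Univ", "full_name": "Shanxi University",
              "system": "Shanxi University", "addr_no": "a1"},
    },
    "addresses": {"a1": "Taiyuan, China"},
}
fields = extract_fields(doc)
print(fields)
```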
Data cleaning comprises deleting special characters and punctuation and converting all text to lowercase. Data transformation targets author names and institution names: most of them consist of several words, and word vector training treats each word as a unit, so this step handles the author and author-institution fields specially. The words of a multi-word author name are joined with the symbol '-' (for example, Zhang San becomes Zhang-San) and the words of a multi-word institution name are joined with the symbol '_', so that each author name and each institution name becomes a single token.
The cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.
The list list_inst-id of correspondences between cleaned and transformed institution names and internal Ids contains the transformed institution name, the original institution name and the institution internal Id. The method of obtaining list_inst-id, as shown in FIG. 3, comprises:
1) For each document r_i in the scientific literature data set R, extracting the institution-related information;
2) For each institution i_j in the institution set I of each document, extracting the institution name and the institution Id;
3) Cleaning and transforming the extracted institution name;
4) Forming a triple from the cleaned and transformed institution name, the original institution name and the institution internal Id;
5) Adding all triples to the set list_inst-id and storing it.
Step S30, building the word2vec word vector model and clustering the institution names;
This step trains the word2vec word vector model and performs the clustering. Word2vec word vector training uses the skip-gram model. As shown in FIG. 4, the model contains an input layer, a hidden layer and an output layer. The model uses the center word to predict its context, with prediction probability $p(w_i \mid w_t)$, where $t - c \le i \le t + c$, $i \ne t$, and c is the context window constant, set to 5; that is, the input is the word vector of a given word and the output is the word vectors of its context words. Training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set. Each word vector has 200 dimensions, each a real number.
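For reference, the prediction probability above is the building block of the standard skip-gram training objective from the word2vec literature. The patent text does not write out the objective or the softmax form, so the formulas below are the textbook formulation rather than something taken from the source; N (number of tokens in the training text), V (vocabulary size) and the input and output vectors v and v' are symbols introduced here for illustration.

```latex
\max \; \frac{1}{N} \sum_{t=1}^{N} \;
  \sum_{\substack{t - c \,\le\, i \,\le\, t + c \\ i \ne t}}
  \log p(w_i \mid w_t),
\qquad
p(w_i \mid w_t) =
  \frac{\exp\!\left( {v'_{w_i}}^{\top} v_{w_t} \right)}
       {\sum_{w=1}^{V} \exp\!\left( {v'_{w}}^{\top} v_{w_t} \right)}
```

With c = 5 and 200-dimensional vectors as in this embodiment, each center word contributes up to ten context terms to the inner sum.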
Institution name clustering clusters the output vector of each word in the word vector model; the number of categories can be set by the user and is set to 100 here. The result is the cluster file class_wv of sorted words and their categories.
Step S40, searching for and extracting similar institution names based on the word vector model and the clustering result;
This step extracts similar institution names based on the word vector model and the clustering result. The method of extracting similar institution names comprises the following steps:
1) Intersecting the cleaned and transformed institution names inst_clean in the set list_inst-id with the cluster file class_wv to obtain the set class_inst of all clustered institution names;
2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; for each word w_j in W, if the similarity Si_ij between w_j and institution i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst. For example, for the institution aarhus_university, the 8 most similar words found in the word vector model and their similarities are shown in Table 1.
Table 1. Set of similar words found in the word vector model
Word | Similarity
aarhus | 0.9022127389907837
aarhus_univ | 0.889610767364502
univ_aarhus | 0.8739291429519653
aarhus_univ_hosp | 0.8125547766685486
pedersen-steen-b | 0.7520588636398315
birkedal-henrik | 0.7493382692337036
ostergaard-jane-nautrup | 0.745323657989502
kronvang-brian | 0.7398045063018799
For this institution, the words whose similarity exceeds the set threshold 0.8 and that are contained in inst_clean are added to the set: s_inst = {'aarhus_univ', 'univ_aarhus', 'aarhus_univ_hosp'};
3) For each similar institution s_i in the set s_inst, finding the institution names similar to it in the set class_inst and adding them to the set inst_i. For example, for the institution aarhus_univ, the 10 institution names before and after it in class_inst are added to the set inst_i;
4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
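The worked example in steps 2 and 3 can be reproduced with a short sketch. The similarity values are those of Table 1. Two things are assumptions: the contents of inst_clean (chosen so that 'aarhus' is not an institution name, which is consistent with the s_inst shown above) and the ordering of class_inst, which is a hypothetical five-name list used only to show the "names before and after" lookup with a window of 2 instead of 10.

```python
T_WV = 0.8

# Similar words and similarities from Table 1.
table_1 = [
    ("aarhus", 0.9022127389907837),
    ("aarhus_univ", 0.889610767364502),
    ("univ_aarhus", 0.8739291429519653),
    ("aarhus_univ_hosp", 0.8125547766685486),
    ("pedersen-steen-b", 0.7520588636398315),
    ("birkedal-henrik", 0.7493382692337036),
    ("ostergaard-jane-nautrup", 0.745323657989502),
    ("kronvang-brian", 0.7398045063018799),
]

# Assumption: 'aarhus' alone is not a cleaned institution name.
inst_clean = {"aarhus_university", "aarhus_univ", "univ_aarhus", "aarhus_univ_hosp"}

# Step 2: keep words above the threshold that are institution names.
s_inst = {word for word, sim in table_1 if sim > T_WV and word in inst_clean}


def neighbours(sorted_names, name, k):
    """Step 3: up to k names before and k names after `name` in the sorted list."""
    pos = sorted_names.index(name)
    return sorted_names[max(0, pos - k):pos] + sorted_names[pos + 1:pos + 1 + k]


# Hypothetical ordering of class_inst.
class_inst = ["aalborg_university", "aarhus_univ", "aarhus_univ_hosp",
              "aarhus_university", "univ_aarhus"]
inst_i = neighbours(class_inst, "aarhus_univ", k=2)
print(sorted(s_inst), inst_i)
```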
Step S50, computing the similarity between institutions with the Jaro similarity method;
This step computes the Jaro similarity between each s_i in the set inst_jaro and the institutions contained in its corresponding inst_i. The method of computing similar institution names, as shown in FIG. 6, comprises:
1) For each s_i in the set inst_jaro and its corresponding inst_i, computing the similarity between institutions with the Jaro similarity;
2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i. For example, for the institution aarhus_univ the resulting set is sinst_i = {'aalborg_university', 'aarhus_university'};
3) Merging the set s_inst with the set sinst_i and removing duplicates to obtain the final set of similar institutions final_i. For example, for the institution aarhus_univ, merging the set s_inst with the set sinst_i gives final_i = {'aarhus_university', 'aarhus_univ', 'univ_aarhus', 'aarhus_univ_hosp', 'aalborg_university'};
4) For each institution f_i in final_i, looking up the original institution name in the name-to-internal-Id correspondence list list_inst-id and adding it to the set O_i. For example, for the institution aarhus_univ this step gives O_i = {'Aarhus University', 'Aarhus Univ', 'Aarhus Univ Hosp', 'Aalborg University'};
5) Adding the set O_i obtained for each institution to the set O to obtain the final institution standardization result.
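Steps 3 and 4 are a set union followed by a lookup, which the following sketch reproduces with the sets from the example. The triples in `list_inst_id` are hypothetical: the original names are those quoted in the example, but the Ids are invented and the list contains only the entries needed here.

```python
# Sets from the worked example.
s_inst = {"aarhus_univ", "univ_aarhus", "aarhus_univ_hosp"}
sinst_i = {"aalborg_university", "aarhus_university"}

# Step 3: merge and deduplicate (a set union does both).
final_i = s_inst | sinst_i

# Hypothetical excerpt of list_inst-id: (clean name, original name, internal Id).
list_inst_id = [
    ("aarhus_university", "Aarhus University", "I001"),
    ("aarhus_univ", "Aarhus Univ", "I001"),
    ("aarhus_univ_hosp", "Aarhus Univ Hosp", "I002"),
    ("aalborg_university", "Aalborg University", "I003"),
]

# Index the original names by cleaned name for the lookup in step 4.
originals_by_clean = {}
for clean_name, original_name, _inst_id in list_inst_id:
    originals_by_clean.setdefault(clean_name, set()).add(original_name)

# Step 4: collect the original names of every institution in final_i.
# A cleaned name with no entry in the excerpt contributes nothing.
O_i = set()
for f_i in final_i:
    O_i |= originals_by_clean.get(f_i, set())

print(sorted(final_i))
print(sorted(O_i))
```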
Testing the method described in the Disclosure of Invention on scientific literature containing 370 institutions gave a measured accuracy of 91%.
The invention provides a word-vector-based institution name mapping method that combines word2vec word vectors with the Jaro similarity to map multiple institution names to a single institution entity. The word2vec-based word vector computation overcomes the failure of earlier text computation to consider context: it ties text computation to the corpus context, so that the word vectors carry semantic features. In building the standard name library, the similarity and identity of institution names are therefore judged, to a certain extent, in combination with context, achieving the goal of institution name standardization.
Although the embodiments of the present invention are described above, the embodiments are only used for facilitating understanding of the present invention, and are not intended to limit the present invention. Any person skilled in the art can make any modification and variation in form and detail without departing from the spirit and scope of the present disclosure, but the scope of the present disclosure is still subject to the scope of the appended claims.

Claims (10)

1. An institution name standardization method based on a word vector model, characterized in that
the method comprises the following steps:
S10, analyzing the characteristics of the institution name fields in scientific and technical literature data, and selecting the institution-related fields;
S20, extracting the text of the literature fields related to the selected institution fields;
S30, building a word2vec word vector model from the extracted texts, and clustering the institution names;
S40, searching for and extracting sets of similar institution names based on the word vector model and the clustering result;
S50, computing the similarity between institutions with the Jaro similarity method to obtain the final institution standardization result.
2. The method of claim 1, wherein the institution-related fields include the institution abbreviation, the institution full name, the enhanced organization name provided by Web of Science, the secondary institution name, the institution internal Id and the institution address.
3. The method according to claim 1, wherein extracting the literature-related text comprises extracting the relevant literature fields and cleaning and transforming the extracted field information, yielding the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the list list_inst-id of correspondences between cleaned and transformed institution names and internal Ids.
4. The method of claim 3, wherein the method of obtaining the word vector training text set text_wv comprises:
1) For each document r_i in the scientific literature data set R, extracting the document title according to the tag information;
2) For each contributor c_j in the contributor set C of each document, judging whether its type is author and, if so, extracting the author name and the author's institution number according to the tag information;
3) Extracting, according to the institution number, the abbreviation, the full name, the system and the address number of the author's institution;
4) Extracting the address information of the institution according to the address number;
5) Extracting the document subject according to the tag information;
6) Cleaning, transforming and storing the extracted document information.
5. The method of claim 3, wherein data cleaning comprises deleting special characters and punctuation and converting all text to lowercase, and the cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.
6. The method of claim 3, wherein the list list_inst-id of correspondences between cleaned and transformed institution names and internal Ids contains the transformed institution name, the original institution name and the institution internal Id; the method of obtaining list_inst-id comprises:
1) For each document r_i in the scientific literature data set R, extracting the institution-related information;
2) For each institution i_j in the institution set I of each document, extracting the institution name and the institution Id;
3) Cleaning and transforming the extracted institution name;
4) Forming a triple from the cleaned and transformed institution name, the original institution name and the institution internal Id;
5) Adding all triples to the set list_inst-id and storing it.
7. The institution name standardization method based on a word vector model according to claim 3, wherein the word2vec word vector model is trained with the skip-gram model; the skip-gram model comprises an input layer, a hidden layer and an output layer; the model uses the center word to predict its context, with prediction probability $p(w_i \mid w_t)$, where $t - c \le i \le t + c$, $i \ne t$, and c is the context window constant, that is, the input is the word vector of a given word and the output is the word vectors of its context words; training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set.
8. The institution name standardization method based on a word vector model according to claim 3, wherein the institution name clustering uses the K-means clustering method to obtain the vector of each cluster center, computes the similarity between each word vector output by the word vector model and every cluster center to obtain the cluster category of each word, and finally produces the cluster file class_wv of sorted words and their categories.
9. The institution name standardization method based on a word vector model according to claim 8, wherein searching for and extracting similar institution names specifically comprises:
1) Taking the intersection of the cleaned and converted institution names inst_clean in the set list_inst-id and the cluster file class_wv to obtain the set class_inst of all clustered institution names;
2) For each institution i_i in the institution name set list_inst, searching the word vector model for its similar word set W; if the similarity Si_ij between a word w_j in W and the institution i_i is greater than the set threshold T_wv and the word w_j is contained in inst_clean, adding the word w_j to the set s_inst;
3) For each similar institution s_i in the set s_inst, finding in the set class_inst the institution names similar to it and adding them to the set inst_i;
4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
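The four steps of claim 9 can be sketched as follows. Two points are assumptions rather than statements of the claim: the word vector model is stood in for by a plain dictionary `similar_words` mapping a name to (word, similarity) pairs, and step 3 is read as "names in the same cluster as s_i", which is one possible interpretation of the translated text. All names, thresholds and data are hypothetical.

```python
def find_candidates(names, similar_words, inst_clean, class_wv, t_wv):
    """Return inst_jaro as a dict mapping each similar institution s_i
    to its candidate list inst_i."""
    # Step 1: intersect cleaned institution names with the cluster file.
    class_inst = {w: c for w, c in class_wv.items() if w in inst_clean}

    inst_jaro = {}
    for name in names:
        # Step 2: keep similar words above T_wv that are institution names.
        s_inst = [
            w for w, sim in similar_words.get(name, [])
            if sim > t_wv and w in inst_clean
        ]
        for s in s_inst:
            # Step 3 (assumed reading): other names in the same cluster as s.
            cluster = class_inst.get(s)
            inst_i = sorted(
                w for w, c in class_inst.items() if c == cluster and w != s
            )
            # Step 4: record s together with its candidates.
            inst_jaro[s] = inst_i
    return inst_jaro


# Toy data; similarities and the threshold 0.8 are illustrative only.
inst_clean = {"caas", "chinese academy of agricultural sciences", "cas"}
class_wv = {
    "caas": 0,
    "chinese academy of agricultural sciences": 0,
    "cas": 1,
    "wheat": 2,
}
similar_words = {
    "caas": [
        ("chinese academy of agricultural sciences", 0.9),
        ("wheat", 0.95),
        ("cas", 0.4),
    ]
}
inst_jaro = find_candidates(["caas"], similar_words, inst_clean, class_wv, t_wv=0.8)
```

The check against inst_clean is what discards ordinary vocabulary such as "wheat" even when its vector similarity is high.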
10. The method according to claim 9, wherein computing the similar institution names by the Jaro similarity method comprises:
1) For each s_i in the set inst_jaro and its corresponding inst_i, using Jaro similarity to calculate the similarity between institutions;
2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i;
3) Merging the set s_inst and the set sinst_i and removing duplicates to obtain the final similar institutions final_i;
4) For each institution f_i in final_i, looking up its original institution name in the correspondence list list_inst-id between institution names and internal Ids, and adding it to the set O_i;
5) Adding the set O_i obtained for each institution to the set O to obtain the final institution standardization result.
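The Jaro filtering and the mapping back to original names can be sketched as below. The `jaro` function follows the standard definition, (m/|s1| + m/|s2| + (m - t)/m) / 3, with m the number of matching characters and t half the number of out-of-order matches; the helper names, the threshold and the sample strings are hypothetical, and the sketch is not claimed to be the authors' code.

```python
def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def filter_by_jaro(s_i, inst_i, t_jaro):
    """Step 2: keep the candidates whose Jaro similarity to s_i exceeds T_jaro."""
    return [i_k for i_k in inst_i if jaro(s_i, i_k) > t_jaro]


def original_names(final_i, list_inst_id):
    """Step 4: map cleaned names back to original names via the triplets
    (cleaned name, original name, Id)."""
    wanted = set(final_i)
    return sorted({orig for clean, orig, _id in list_inst_id if clean in wanted})


# Illustrative use with an assumed threshold of 0.9.
sinst = filter_by_jaro("MARTHA", ["MARHTA", "DIXON"], t_jaro=0.9)
```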
CN202010844347.XA 2020-08-20 2020-08-20 Mechanism name standardization method based on word vector model Active CN111984776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010844347.XA CN111984776B (en) 2020-08-20 2020-08-20 Mechanism name standardization method based on word vector model

Publications (2)

Publication Number Publication Date
CN111984776A CN111984776A (en) 2020-11-24
CN111984776B CN111984776B (en) 2023-08-11

Family

ID=73444173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010844347.XA Active CN111984776B (en) 2020-08-20 2020-08-20 Mechanism name standardization method based on word vector model

Country Status (1)

Country Link
CN (1) CN111984776B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101576898A (en) * 2008-11-26 2009-11-11 北京中加国道科技有限公司 Metadata proposal suitable for permanently filing and using network academic resources
CN102197406A (en) * 2008-10-23 2011-09-21 起元技术有限责任公司 Fuzzy data operations
CN104281570A (en) * 2013-07-01 2015-01-14 富士通株式会社 Information processing method and device and method and device for standardizing organization names
CN104881398A (en) * 2014-08-29 2015-09-02 北京大学 Method for extracting author affiliation information of English literature published by Chinese authors
CN105653590A (en) * 2015-12-21 2016-06-08 青岛智能产业技术研究院 Name duplication disambiguation method of Chinese literature authors
CN106250524A (en) * 2016-08-04 2016-12-21 浪潮软件集团有限公司 Organization name extraction method and device based on semantic information
JP2019061550A (en) * 2017-09-27 2019-04-18 株式会社ミラボ Standard item name setting device, standard item name setting method, and standard item name setting program
CN110956043A (en) * 2019-12-17 2020-04-03 人和未来生物科技(长沙)有限公司 Domain professional vocabulary word embedding vector training method, system and medium based on alias standardization
CN111160011A (en) * 2019-12-17 2020-05-15 浙江大华技术股份有限公司 Organization unit standardization method, device, equipment and storage medium

Patent Citations (10)

Publication number Priority date Publication date Assignee Title
CN102197406A (en) * 2008-10-23 2011-09-21 起元技术有限责任公司 Fuzzy data operations
CN107273977A (en) * 2008-10-23 2017-10-20 起元技术有限责任公司 Method, system and machine readable hardware storage apparatus for identifying matching
CN101576898A (en) * 2008-11-26 2009-11-11 北京中加国道科技有限公司 Metadata proposal suitable for permanently filing and using network academic resources
CN104281570A (en) * 2013-07-01 2015-01-14 富士通株式会社 Information processing method and device and method and device for standardizing organization names
CN104881398A (en) * 2014-08-29 2015-09-02 北京大学 Method for extracting author affiliation information of English literature published by Chinese authors
CN105653590A (en) * 2015-12-21 2016-06-08 青岛智能产业技术研究院 Name duplication disambiguation method of Chinese literature authors
CN106250524A (en) * 2016-08-04 2016-12-21 浪潮软件集团有限公司 Organization name extraction method and device based on semantic information
JP2019061550A (en) * 2017-09-27 2019-04-18 株式会社ミラボ Standard item name setting device, standard item name setting method, and standard item name setting program
CN110956043A (en) * 2019-12-17 2020-04-03 人和未来生物科技(长沙)有限公司 Domain professional vocabulary word embedding vector training method, system and medium based on alias standardization
CN111160011A (en) * 2019-12-17 2020-05-15 浙江大华技术股份有限公司 Organization unit standardization method, device, equipment and storage medium

Non-Patent Citations (1)

Title
科研机构名称归一化实现;贾君枝等;《图书情报工作》;第62卷(第13期);103-110 *

Also Published As

Publication number Publication date
CN111984776A (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
Cohen et al. Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods
WO2019091026A1 (en) Knowledge base document rapid search method, application server, and computer readable storage medium
US10558754B2 (en) Method and system for automating training of named entity recognition in natural language processing
CN111832289B (en) Service discovery method based on clustering and Gaussian LDA
CN111104794A (en) Text similarity matching method based on subject words
Toda et al. A probabilistic approach for automatically filling form-based web interfaces
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN106708929B (en) Video program searching method and device
Fu et al. Automatic record linkage of individuals and households in historical census data
Balasubramanian et al. A multimodal approach for extracting content descriptive metadata from lecture videos
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN111460091A (en) Medical short text data negative sample sampling method and medical diagnosis standard term mapping model training method
CN115983233A (en) Electronic medical record duplication rate estimation method based on data stream matching
CN106570196B (en) Video program searching method and device
CN115422372A (en) Knowledge graph construction method and system based on software test
CN108345694B (en) Document retrieval method and system based on theme database
CN110019474B (en) Automatic synonymy data association method and device in heterogeneous database and electronic equipment
Quemy et al. ECHR-OD: On building an integrated open repository of legal documents for machine learning applications
CN103034657B (en) Documentation summary generates method and apparatus
CN111984776B (en) Mechanism name standardization method based on word vector model
CN112989011B (en) Data query method, data query device and electronic equipment
US20230186351A1 (en) Transformer Based Search Engine with Controlled Recall for Romanized Multilingual Corpus
CN112215006B (en) Organization named entity normalization method and system
JP4567025B2 (en) Text classification device, text classification method, text classification program, and recording medium recording the program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant