CN111984776B - Mechanism name standardization method based on word vector model - Google Patents
Mechanism name standardization method based on word vector model
- Publication number
- CN111984776B
- Authority
- CN
- China
- Prior art keywords
- name
- inst
- organization
- word
- word vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an institution name standardization method based on a word vector model, comprising the following steps: analyzing the characteristics of the institution name fields in scientific and technical literature data and selecting the institution-related fields; extracting the text of the relevant literature fields, then cleaning and transforming it; building a word vector model from the extracted text with the word2vec method and clustering the institution names; searching for highly similar words by combining the word vector model with the cluster file, and identifying and extracting the institution names among them; and matching similar institution names with the Jaro similarity method under a set threshold. The invention can effectively improve data reliability in science and technology evaluation based on massive data, and it standardizes the storage and management of institution names in a scientific literature database, thereby making the construction of such databases more standardized.
Description
Technical Field
The invention relates to the field of library and information science and information extraction, and in particular to an institution name standardization method based on a word vector model.
Background
A scientific literature database contains literature resources from multiple sources. Extracting information from such massive resources requires retrieval by entity; entities mainly include personal names, institutions, geographic names, dates, titles and keywords, of which personal names and institution names are the two most important types.
With historical change and institutional reform, institution names, and group names in particular, are often changed as basic functions and organizational structures change. Many institution names are also expressed in non-uniform, non-standard ways because of Chinese and English full names, short names, abbreviations and the like. In scientific literature, irregular and inconsistent institution names easily cause errors in retrieving and counting an institution's academic output, and they hinder statistics and mining analysis of the related data.
At present, institution name standardization for scientific and technical literature is still at the research stage and relies mainly on manual labeling and correction. These methods have high labor costs, which makes it difficult to associate existing names with institutions. A method for standardizing institution names based on a word vector model is therefore needed to solve these problems.
Disclosure of Invention
To solve the above technical problems, the invention aims to provide an institution name standardization method based on a word vector model. This aim is achieved by the following technical scheme.
The invention comprises the following steps:
S10, obtaining the characteristics of the institution name fields in scientific and technical literature data, and selecting the institution-related fields;
S20, extracting the text of the literature fields identified by the selected institution-related fields;
S30, constructing a word2vec word vector model from the extracted texts, and clustering the institution names;
S40, searching for and extracting a set of similar institution names based on the word vector model and the clustering result;
S50, calculating the similarity between institutions with the Jaro similarity method to obtain the final institution standardization result.
Further, the institution-related fields in the metadata of the scientific and technical literature include the institution abbreviation, the institution full name, the enhanced organization name provided by Web of Science, the secondary institution name, the institution internal Id and the institution address. The invention standardizes only primary institution names and does not cover secondary institution names, so the institution-related fields finally selected are the institution abbreviation, the institution full name, the system to which the institution belongs, the institution internal Id and the institution address.
Further, extracting the literature-related text comprises extracting the relevant literature fields and cleaning and transforming the extracted field information, to obtain the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the cleaned and transformed list of correspondences between institution names and internal Ids, list_inst-id.
Further, the method of obtaining the word vector training text set text_wv comprises:
1) For each document r_i in the scientific literature data set R, extracting the document title according to the tag information;
2) For each contributor c_j in the contributor set C of each document, judging whether its type is author and, if so, extracting the author name and the author's corresponding institution number according to the tag information;
3) Extracting, according to the institution number, the institution corresponding to the author (including the institution abbreviation, the institution full name and the system to which the institution belongs) and the address number corresponding to the institution;
4) Extracting the address information corresponding to the institution according to the address number;
5) Extracting the document topics according to the tag information;
6) Cleaning, transforming and storing the extracted document information.
Further, data cleaning comprises deleting special characters and punctuation and converting all text uniformly to lowercase. Data transformation targets author names and institution names: most of them consist of several words, while word vector training operates on individual words. This step therefore handles the author and author-institution fields specially: an author name consisting of several words is joined with the symbol '-', and an institution name consisting of several words is joined with the symbol '_', so that each author name and each institution name becomes a single token. The cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.
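The cleaning and transformation rules can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, replacing punctuation with a space (rather than deleting it outright) is an assumption, and the '-' / '_' joiners follow the tokens shown in Table 1 (for example pedersen-steen-b and aarhus_univ).

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and strip punctuation and special characters."""
    text = text.lower()
    # Assumption: punctuation becomes a space so adjacent words do not merge.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def join_author(name: str) -> str:
    """Join the words of an author name with '-' to form one token."""
    return "-".join(clean_text(name).split())

def join_institution(name: str) -> str:
    """Join the words of an institution name with '_' to form one token."""
    return "_".join(clean_text(name).split())

print(join_author("Zhang San"))
print(join_institution("Aarhus Univ."))
```

Keeping each name as one token means the word vector model learns a single vector for the whole name instead of separate vectors for its words.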
Further, the cleaned and transformed list of correspondences between institution names and internal Ids, list_inst-id, contains the transformed institution name, the original institution name and the institution internal Id. The method of obtaining list_inst-id comprises:
1) For each document r_i in the scientific literature data set R, extracting the institution-related information;
2) For each institution i_j in the institution set I of each document, extracting the institution name and the institution internal Id;
3) Cleaning and transforming the extracted institution name;
4) Forming a triple from the cleaned and transformed institution name, the original institution name and the institution internal Id;
5) Adding all triples to the list list_inst-id and storing it.
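The triple-building steps above might look like the following sketch. The record layout (an "institutions" list with "name" and "id" keys), the Id values and the function names are all hypothetical, and dropping repeated triples through a set is an assumption the text does not state.

```python
import re
from typing import Dict, List, Tuple

def clean_institution(name: str) -> str:
    """Lowercase, drop punctuation, and join the words with '_'."""
    words = re.sub(r"[^\w\s]|_", " ", name.lower()).split()
    return "_".join(words)

def build_list_inst_id(documents: List[Dict]) -> List[Tuple[str, str, str]]:
    """Collect (cleaned name, original name, internal Id) triples."""
    triples = set()                       # assumption: repeated triples are dropped
    for doc in documents:                 # each document r_i in R
        for inst in doc["institutions"]:  # each institution i_j in I
            original = inst["name"]
            triples.add((clean_institution(original), original, inst["id"]))
    return sorted(triples)

# Toy records with made-up internal Ids.
docs = [
    {"institutions": [{"name": "Aarhus Univ", "id": "I1"}]},
    {"institutions": [{"name": "Aarhus Univ", "id": "I1"},
                      {"name": "Shanxi University", "id": "I2"}]},
]
list_inst_id = build_list_inst_id(docs)
```

Storing the original name next to the cleaned one is what later allows a cleaned token to be mapped back to the name as it appeared in the literature.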
Further, the word2vec word vector model is trained with the skip-gram model, which contains an input layer, a hidden layer and an output layer. The model uses the center word to predict its context, with prediction probability p(w_i | w_t), where t - c ≤ i ≤ t + c, i ≠ t, and c is the context window constant; that is, the input is the word vector of a particular word and the output is the word vectors of that word's context. Training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set.
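To make the index condition t - c ≤ i ≤ t + c, i ≠ t concrete, the sketch below lists the (center word, context word) pairs that a skip-gram model would be trained on for one tokenized line. The function name and the sample tokens are illustrative only; no actual training is performed here.

```python
from typing import List, Tuple

def skipgram_pairs(tokens: List[str], c: int = 5) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every position t and window c."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - c)                 # clip the window at the line start
        hi = min(len(tokens) - 1, t + c)   # clip the window at the line end
        for i in range(lo, hi + 1):
            if i != t:                     # the center word is not its own context
                pairs.append((center, tokens[i]))
    return pairs

line = ["zhang-san", "shanxi_univ", "taiyuan", "china"]
pairs = skipgram_pairs(line, c=1)
```

With c=1 each word is paired only with its immediate neighbours; a larger window pairs an institution token with the author, address and title words that surround it on the same line.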
Further, the institution name clustering uses the K-means clustering method to obtain the vectors of all cluster centers; the similarity between each word vector output by the word vector model and every cluster center is calculated to obtain the cluster category of each word, and finally the cluster file class_wv of the sorted words and their categories is obtained.
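The assignment step can be illustrated as follows, assuming the cluster centers have already been produced by K-means. The text does not name the similarity measure or the sort order, so Euclidean distance and ordering by (category, word) are assumptions; the two-dimensional vectors are toy values.

```python
import math
from typing import Dict, List, Sequence, Tuple

def assign_clusters(
    word_vectors: Dict[str, Sequence[float]],
    centers: List[Sequence[float]],
) -> List[Tuple[str, int]]:
    """Assign each word to its nearest cluster center and sort the result."""
    rows = []
    for word, vec in word_vectors.items():
        distances = [math.dist(vec, center) for center in centers]
        rows.append((word, distances.index(min(distances))))
    # Assumption: class_wv is ordered by category, then alphabetically.
    return sorted(rows, key=lambda row: (row[1], row[0]))

vectors = {
    "univ_aarhus": (0.8, 0.2),
    "shanxi_univ": (0.1, 0.9),
    "aarhus_univ": (0.9, 0.1),
}
centers = [(1.0, 0.0), (0.0, 1.0)]
class_wv = assign_clusters(vectors, centers)
```

Sorting by category places words from the same cluster next to each other, which is what a later lookup of neighbouring names in the cluster file relies on.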
Further, searching for and extracting similar institution names specifically comprises:
1) Intersecting the cleaned and transformed institution names inst_clean in the list list_inst-id with the cluster file class_wv to obtain the set class_inst of all clustered institution names;
2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; for each word w_j in W, if the similarity Si_ij between w_j and i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst;
3) For each similar institution s_i in the set s_inst, finding the institution names similar to it in the set class_inst and adding them to the set inst_i;
4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
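Steps 1) and 2) above can be sketched as below. Cosine similarity is assumed as the word-vector similarity (the usual choice for word2vec models), topn=8 mirrors the 8 neighbours inspected in the embodiment, and the vectors, names and function names are toy, hypothetical values.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clustered_institutions(
    class_wv: List[Tuple[str, int]], inst_clean: Set[str]
) -> List[Tuple[str, int]]:
    """Step 1: keep only the cluster-file rows that are institution names."""
    return [(word, cat) for word, cat in class_wv if word in inst_clean]

def similar_institutions(
    query: str,
    vectors: Dict[str, Sequence[float]],
    inst_clean: Set[str],
    t_wv: float = 0.8,
    topn: int = 8,
) -> List[str]:
    """Step 2: top-n neighbours above T_wv that are known institution names."""
    scored = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, sim in scored[:topn] if sim > t_wv and w in inst_clean]

vectors = {
    "aarhus_university": (1.0, 0.0),
    "aarhus": (0.95, 0.05),
    "aarhus_univ": (0.9, 0.1),
    "kronvang-brian": (0.5, 0.5),
    "shanxi_univ": (0.0, 1.0),
}
inst_clean = {"aarhus_university", "aarhus_univ", "shanxi_univ"}
s_inst = similar_institutions("aarhus_university", vectors, inst_clean)
```

The membership test against inst_clean is what removes close neighbours that are not institution names, such as place names or author tokens.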
Further, the method of calculating similar institution names with the Jaro similarity method comprises:
1) For each s_i in the set inst_jaro and its corresponding inst_i, calculating the similarity between institutions with the Jaro similarity;
2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i;
3) Merging the sets s_inst and sinst_i and removing duplicates to obtain the final set of similar institutions final_i;
4) For each institution f_i in final_i, looking up its original institution name in the correspondence list list_inst-id and adding it to the set O_i;
5) Adding the set O_i obtained for each institution to the set O to obtain the final institution standardization result.
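For reference, a plain-Python version of the standard Jaro similarity and of the threshold filter in step 2) is sketched below. The function names are hypothetical and this is the textbook definition of Jaro, not code taken from the patent.

```python
from typing import Iterable, Set

def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)   # matching distance
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def filter_by_jaro(s_i: str, inst_i: Iterable[str], t_jaro: float = 0.95) -> Set[str]:
    """Keep the candidates whose Jaro similarity to s_i exceeds T_jaro."""
    return {name for name in inst_i if jaro(s_i, name) > t_jaro}
```

Jaro compares character sequences only, so it complements the word-vector step: the vectors capture names used in similar contexts, while Jaro captures names that are spelled almost identically.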
Compared with the prior art, one or more embodiments of the invention may have the following advantages:
The institution name standardization method based on a word vector model uses the word2vec word vector calculation method, which ties text computation to context so that the word vectors carry semantic features; combined with the Jaro similarity, it maps multiple institution names to a single institution entity. The invention can effectively improve data reliability in science and technology evaluation based on massive data, and it standardizes the storage and management of institution names in a scientific literature database, thereby making the construction of such databases more standardized.
Drawings
FIG. 1 is a flow chart of the institution name standardization method based on a word vector model;
FIG. 2 is a flow chart of obtaining the word vector training text in the method;
FIG. 3 is a flow chart of obtaining the correspondence between institution names and internal Ids in the method;
FIG. 4 is a diagram of the skip-gram model used in the method;
FIG. 5 is a flow chart of extracting similar institution names in the method;
FIG. 6 is a flow chart of calculating similar institution names with the Jaro similarity in the method.
Detailed Description
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the examples and the accompanying drawings.
As shown in FIG. 1, an institution name standardization method based on a word vector model is provided. In this embodiment the basic parameters are set as follows: the context window constant of the skip-gram model c = 5, the similarity threshold of the word vector model T_wv = 0.8, and the Jaro similarity threshold T_jaro = 0.95. The implementation can be divided into the following steps:
Step S10, analyzing the characteristics of the institution-related fields in scientific and technical literature data, and selecting the institution-related fields;
The institution-related fields in the metadata of the scientific and technical literature include the institution abbreviation (e.g. Shanxi Univ), the institution full name (e.g. Shanxi University), the enhanced organization name provided by Web of Science, the secondary institution name, the institution internal Id and the institution address. The invention standardizes only primary institution names and does not cover secondary institution names, so the institution-related fields finally selected are the institution abbreviation, the institution full name, the system to which the institution belongs, the institution internal Id and the institution address.
Step S20, extracting the text of the literature-related fields;
This step extracts the relevant literature fields and cleans and transforms the extracted field information to obtain the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the cleaned and transformed list of correspondences between institution names and internal Ids, list_inst-id.
As shown in FIG. 2, the method of obtaining the word vector training text set text_wv comprises:
1) For each document r_i in the scientific literature data set R, extracting the document title according to the tag information;
2) For each contributor c_j in the contributor set C of each document, judging whether its type is author and, if so, extracting the author name and the author's corresponding institution number according to the tag information;
3) Extracting, according to the institution number, the institution corresponding to the author (including the institution abbreviation, the institution full name and the system to which the institution belongs) and the address number corresponding to the institution;
4) Extracting the address information corresponding to the institution according to the address number;
5) Extracting the document topics according to the tag information;
6) Cleaning, transforming and storing the extracted document information.
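One way to picture the six extraction steps is a function that turns a single document record into one line of training text. The record layout and every key name below (contributors, inst_no, addr_no, enhanced_name and so on) are hypothetical, since the source describes the fields only through their tag information.

```python
import re
from typing import Dict, List

def clean(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^\w\s]|_", " ", text.lower()).split()

def training_line(doc: Dict) -> str:
    """Build one line of word vector training text from a document record."""
    tokens = clean(doc["title"])                                # step 1
    for contributor in doc["contributors"]:                     # step 2
        if contributor["type"] != "author":
            continue
        tokens.append("-".join(clean(contributor["name"])))
        inst = doc["institutions"][contributor["inst_no"]]      # step 3
        for key in ("abbr", "full_name", "enhanced_name"):
            if inst.get(key):
                tokens.append("_".join(clean(inst[key])))
        tokens.extend(clean(doc["addresses"][inst["addr_no"]]))  # step 4
    tokens.extend(clean(doc.get("topics", "")))                 # step 5
    return " ".join(tokens)                                     # step 6

doc = {
    "title": "Word Vectors for Names",
    "contributors": [
        {"type": "author", "name": "Zhang San", "inst_no": "1"},
        {"type": "editor", "name": "Li Si", "inst_no": "1"},
    ],
    "institutions": {
        "1": {"abbr": "Shanxi Univ", "full_name": "Shanxi University", "addr_no": "a1"},
    },
    "addresses": {"a1": "Taiyuan, China"},
    "topics": "Information Retrieval",
}
line = training_line(doc)
```

Placing an author, the author's institution variants and the address on the same line is what lets the model learn that different spellings of one institution occur in similar contexts.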
Data cleaning comprises deleting special characters and punctuation and converting all text uniformly to lowercase. Data transformation targets author names and institution names: most of them consist of several words, while word vector training operates on individual words. This step therefore handles the author and author-institution fields specially: an author name consisting of several words is joined with the symbol '-' (for example, Zhang San becomes Zhang-San), and an institution name consisting of several words is joined with the symbol '_', so that each author name and each institution name becomes a single token.
The cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.
The cleaned and transformed list of correspondences between institution names and internal Ids, list_inst-id, contains the transformed institution name, the original institution name and the institution internal Id. As shown in FIG. 3, the method of obtaining list_inst-id comprises:
1) For each document r_i in the scientific literature data set R, extracting the institution-related information;
2) For each institution i_j in the institution set I of each document, extracting the institution name and the institution internal Id;
3) Cleaning and transforming the extracted institution name;
4) Forming a triple from the cleaned and transformed institution name, the original institution name and the institution internal Id;
5) Adding all triples to the list list_inst-id and storing it.
Step S30, constructing a word2vec word vector model and clustering institution names;
This step completes word2vec word vector model training and clustering. Word2vec word vector training uses the skip-gram model. As shown in FIG. 4, the model contains an input layer, a hidden layer and an output layer. The model uses the center word to predict its context, with prediction probability p(w_i | w_t), where t - c ≤ i ≤ t + c, i ≠ t, and c is the context window constant, set to 5 here; that is, the input is the word vector of a particular word and the output is the word vectors of that word's context. Training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set. Each word vector has 200 dimensions, each of which is a real number.
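For readers who want the training objective behind p(w_i | w_t), the standard skip-gram formulation is reproduced below. It is the usual textbook form rather than something stated in this document: T (number of training words), V (vocabulary size), and the input and output vectors v and v' are notation introduced here, and practical implementations normally replace the full softmax with hierarchical softmax or negative sampling.

```latex
\max \; \frac{1}{T} \sum_{t=1}^{T} \;
\sum_{\substack{t-c \,\le\, i \,\le\, t+c \\ i \ne t}} \log p(w_i \mid w_t),
\qquad
p(w_i \mid w_t) =
\frac{\exp\!\left( {v'_{w_i}}^{\top} v_{w_t} \right)}
     {\sum_{w=1}^{V} \exp\!\left( {v'_{w}}^{\top} v_{w_t} \right)}
```

With c = 5, each center word contributes up to ten context terms to the sum, fewer at the start and end of the text where the window is truncated.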
Institution name clustering clusters the output vector of each word in the word vector model; the number of categories can be set by the user and is set to 100 here. The result is the cluster file class_wv of the sorted words and their categories.
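The parameter values given for this embodiment can be collected in one place, as in the sketch below. The class and field names are hypothetical. The keyword names returned by word2vec_kwargs follow the gensim 4.x Word2Vec constructor (vector_size, window, sg); using gensim is an assumption, since the document does not name a library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbodimentParams:
    window: int = 5         # context window constant c
    vector_size: int = 200  # dimensions of each word vector
    n_clusters: int = 100   # number of K-means categories
    t_wv: float = 0.8       # word vector similarity threshold T_wv
    t_jaro: float = 0.95    # Jaro similarity threshold T_jaro

def word2vec_kwargs(params: EmbodimentParams) -> dict:
    """Keyword arguments in gensim 4.x naming; sg=1 selects skip-gram."""
    return {"vector_size": params.vector_size, "window": params.window, "sg": 1}

PARAMS = EmbodimentParams()
```

Freezing the dataclass keeps the two thresholds and the model settings from being changed by accident between the training and matching stages.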
Step S40, searching for and extracting similar institution names based on the word vector model and the clustering result;
This step extracts similar institution names based on the word vector model and the clustering result. The method of extracting similar institution names comprises the following steps:
1) Intersecting the cleaned and transformed institution names inst_clean in the list list_inst-id with the cluster file class_wv to obtain the set class_inst of all clustered institution names;
2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; for each word w_j in W, if the similarity Si_ij between w_j and i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst. For example, for the institution aarhus_university, the 8 most similar words and their similarities are looked up in the word vector model; the results are shown in Table 1.
Table 1. Set of similar words found in the word vector model
Word | Similarity |
---|---|
aarhus | 0.9022127389907837 |
aarhus_univ | 0.889610767364502 |
univ_aarhus | 0.8739291429519653 |
aarhus_univ_hosp | 0.8125547766685486 |
pedersen-steen-b | 0.7520588636398315 |
birkedal-henrik | 0.7493382692337036 |
ostergaard-jane-nautrup | 0.745323657989502 |
kronvang-brian | 0.7398045063018799 |
Words whose similarity is greater than the set threshold of 0.8 and which are contained in inst_clean are added to the set for this institution: s_inst = {'aarhus_univ', 'univ_aarhus', 'aarhus_univ_hosp'};
3) For each similar institution s_i in the set s_inst, finding the institution names similar to it in the set class_inst and adding them to the set inst_i. For example, for the institution aarhus_univ, the 10 similar institution names before and after it in class_inst are found and added to the set inst_i.
4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
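The worked example can be replayed on the values of Table 1, as sketched below. Two points are assumptions: 'aarhus' is taken to be absent from inst_clean (which would explain why it is excluded despite its 0.90 similarity), and the 10 names "before and after" are read as a positional window in the sorted listing class_inst. The contents of inst_clean and class_inst here are toy values.

```python
from typing import List, Set, Tuple

TABLE_1: List[Tuple[str, float]] = [
    ("aarhus", 0.9022127389907837),
    ("aarhus_univ", 0.889610767364502),
    ("univ_aarhus", 0.8739291429519653),
    ("aarhus_univ_hosp", 0.8125547766685486),
    ("pedersen-steen-b", 0.7520588636398315),
    ("birkedal-henrik", 0.7493382692337036),
    ("ostergaard-jane-nautrup", 0.745323657989502),
    ("kronvang-brian", 0.7398045063018799),
]

def filter_similar(neighbours, inst_clean: Set[str], t_wv: float = 0.8) -> List[str]:
    """Keep neighbours above T_wv that are known institution names."""
    return [word for word, sim in neighbours if sim > t_wv and word in inst_clean]

def cluster_window(name: str, class_inst: List[str], k: int = 10) -> List[str]:
    """Up to k names before and k names after `name` in the sorted listing."""
    idx = class_inst.index(name)
    return class_inst[max(0, idx - k):idx] + class_inst[idx + 1:idx + 1 + k]

# Hypothetical contents: 'aarhus' is assumed not to be an institution name.
inst_clean = {"aarhus_univ", "univ_aarhus", "aarhus_univ_hosp", "aarhus_university"}
s_inst = filter_similar(TABLE_1, inst_clean)

class_inst = ["aalborg_university", "aarhus_univ", "aarhus_university", "shanxi_univ"]
inst_i = cluster_window("aarhus_univ", class_inst, k=1)
```

Under these assumptions the filter returns the same three names as the example set s_inst, and the four author tokens fall below the 0.8 threshold.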
Step S50, calculating the similarity between institutions with the Jaro similarity method;
This step computes the Jaro similarity between each s_i in the set inst_jaro and the institutions contained in its corresponding inst_i. As shown in FIG. 6, the method of calculating similar institution names comprises:
1) For each s_i in the set inst_jaro and its corresponding inst_i, calculating the similarity between institutions with the Jaro similarity;
2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i. For example, the set sinst_i calculated for the institution aarhus_univ is {'aalborg_university', 'aarhus_university'}.
3) Merging the sets s_inst and sinst_i and removing duplicates to obtain the final set of similar institutions final_i. For example, for the institution aarhus_univ, merging the sets s_inst and sinst_i gives final_i = {'aarhus_university', 'aarhus_univ', 'univ_aarhus', 'aarhus_univ_hosp', 'aalborg_university'}.
4) For each institution f_i in final_i, looking up its original institution name in the correspondence list list_inst-id and adding it to the set O_i. For example, for the institution aarhus_univ this step gives O_i = {'Aarhus University', 'Aarhus Univ', 'Aarhus Univ Hosp', 'Aalborg University'}.
5) Adding the set O_i obtained for each institution to the set O to obtain the final institution standardization result.
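Steps 3) and 4) amount to a set union followed by a lookup, as the sketch below shows using the names of the final set in the example. The triple list is a toy subset with made-up internal Ids and holds only the four original names listed in the example; the function name is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

def finalize(
    s_inst: Iterable[str],
    sinst_i: Iterable[str],
    list_inst_id: List[Tuple[str, str, str]],
) -> Tuple[Set[str], Set[str]]:
    """Merge both candidate sets, then map cleaned names to original names."""
    final_i = set(s_inst) | set(sinst_i)   # union; a set removes duplicates
    o_i = {original for cleaned, original, _id in list_inst_id if cleaned in final_i}
    return final_i, o_i

s_inst = {"aarhus_univ", "univ_aarhus", "aarhus_univ_hosp"}
sinst_i = {"aalborg_university", "aarhus_university"}

# Toy triples (cleaned name, original name, internal Id); the Ids are invented.
list_inst_id = [
    ("aarhus_university", "Aarhus University", "I1"),
    ("aarhus_univ", "Aarhus Univ", "I1"),
    ("aarhus_univ_hosp", "Aarhus Univ Hosp", "I2"),
    ("aalborg_university", "Aalborg University", "I3"),
    ("shanxi_univ", "Shanxi Univ", "I4"),
]
final_i, o_i = finalize(s_inst, sinst_i, list_inst_id)
```

The lookup goes through the cleaned name, so a cleaned token that has no triple in list_inst-id simply contributes nothing to O_i.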
Testing the method described above on scientific literature data containing 370 institutions gave a measured accuracy of 91%.
The invention provides a word vector based institution name mapping method that combines Word2vec word vectors with the Jaro similarity to map multiple institution names to a single institution entity. The Word2vec-based word vector calculation overcomes the failure of earlier text computation to consider context: it ties text computation to the context of the corpus, so that the word vectors carry semantic features. When the standard name library is built, the similarity and identity of institution names are thus judged, to a certain extent, through context, which achieves the goal of institution name standardization.
Although embodiments of the invention are described above, they are provided only to aid understanding of the invention and are not intended to limit it. Any person skilled in the art may make modifications and variations in form and detail without departing from the spirit and scope of the disclosure, but the scope of protection remains subject to the appended claims.
Claims (10)
1. An institution name standardization method based on a word vector model, characterized in that:
the method comprises the following steps:
S10, obtaining the characteristics of the institution name fields in scientific and technical literature data, and selecting the institution-related fields;
S20, extracting the text of the literature fields identified by the selected institution-related fields;
S30, constructing a word2vec word vector model from the extracted texts, and clustering the institution names;
S40, searching for and extracting a set of similar institution names based on the word vector model and the clustering result;
S50, calculating the similarity between institutions with the Jaro similarity method to obtain the final institution standardization result.
2. The method of claim 1, wherein the institution-related fields include the institution abbreviation, the institution full name, the enhanced organization name provided by Web of Science, the secondary institution name, the institution internal Id and the institution address.
3. The method according to claim 1, wherein extracting the literature-related text comprises extracting the relevant literature fields and cleaning and transforming the extracted field information, to obtain the word vector training text set text_wv, the cleaned and transformed institution name set list_inst, and the cleaned and transformed list of correspondences between institution names and internal Ids, list_inst-id.
4. The method of claim 3, wherein the method of obtaining the word vector training text set text_wv comprises:
1) For each document r_i in the scientific literature data set R, extracting the document title according to the tag information;
2) For each contributor c_j in the contributor set C of each document, judging whether its type is author and, if so, extracting the author name and the author's corresponding institution number according to the tag information;
3) Extracting, according to the institution number, the institution abbreviation, the institution full name, the system to which the institution belongs and the address number of the institution corresponding to the author;
4) Extracting the address information corresponding to the institution according to the address number;
5) Extracting the document topics according to the tag information;
6) Cleaning, transforming and storing the extracted document information.
5. The method of claim 3, wherein the data cleaning comprises deleting special characters and punctuation and converting all text uniformly to lowercase characters, and the cleaned and transformed institution name set list_inst contains only the cleaned and transformed institution names.
6. The method of claim 3, wherein the cleaned and transformed list of correspondences between institution names and internal Ids, list_inst-id, contains the transformed institution name, the original institution name and the institution internal Id; the method of obtaining list_inst-id comprises:
1) For each document r_i in the scientific literature data set R, extracting the institution-related information;
2) For each institution i_j in the institution set I of each document, extracting the institution name and the institution internal Id;
3) Cleaning and transforming the extracted institution name;
4) Forming a triple from the cleaned and transformed institution name, the original institution name and the institution internal Id;
5) Adding all triples to the list list_inst-id and storing it.
7. The institution name standardization method based on a word vector model according to claim 3, wherein the word2vec word vector model is trained with the skip-gram model; the skip-gram model comprises an input layer, a hidden layer and an output layer; the model uses the center word to predict its context, with prediction probability p(w_i | w_t), where t - c ≤ i ≤ t + c, i ≠ t, and c is the context window constant, that is, the input is the word vector of a particular word and the output is the word vectors of that word's context; training on the text set text_wv reduces the processing of text content to vector operations in a K-dimensional vector space and yields a word vector for each word in the text set.
8. The institution name standardization method based on a word vector model according to claim 3, wherein the institution name clustering uses the K-means clustering method to obtain the vectors of all cluster centers, the similarity between each word vector output by the word vector model and every cluster center is calculated to obtain the cluster category of each word, and finally the cluster file class_wv of the sorted words and their categories is obtained.
9. The institution name standardization method based on a word vector model according to claim 8, wherein searching for and extracting similar institution names specifically comprises:
1) Intersecting the cleaned and transformed institution names inst_clean in the list list_inst-id with the cluster file class_wv to obtain the set class_inst of all clustered institution names;
2) For each institution i_i in the institution name set list_inst, searching the word vector model for its set W of similar words; for each word w_j in W, if the similarity Si_ij between w_j and i_i is greater than the set threshold T_wv and w_j is contained in inst_clean, adding w_j to the set s_inst;
3) For each similar institution s_i in the set s_inst, finding the institution names similar to it in the set class_inst and adding them to the set inst_i;
4) Adding each institution s_i and its corresponding inst_i to the set inst_jaro.
10. The method of claim 9, wherein computing similar institution names using the Jaro similarity method comprises:
1) For each s_i in the set inst_jaro and its corresponding inst_i, calculating the similarity between institutions using Jaro similarity;
2) For each institution i_k in the set inst_i, if the similarity Si_ik between s_i and i_k is greater than the set threshold T_jaro, adding i_k to the set sinst_i;
3) Merging the set s_inst and the set sinst_i and removing duplicates to obtain the final set of similar institutions final_i;
4) For each institution f_i in final_i, finding its original institution name in the list list_inst-id, which records the correspondence between institution names and internal Ids, and adding it to the set O_i;
5) Adding the set O_i obtained for each institution to the set O to obtain the final institution standardization result.
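The Jaro filtering of claim 10 can be sketched as below. The `jaro` function follows the standard definition of Jaro similarity. Everything else is an assumption made for illustration: `originals_by_clean` is a hypothetical stand-in for the list_inst-id lookup, the sketch merges only s_i itself (rather than the whole set s_inst) with sinst_i in step 3, and the names and threshold are invented.

```python
def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def standardize(inst_jaro, originals_by_clean, t_jaro):
    """Return O: {s_i: original names of all institutions judged similar}."""
    result = {}
    for s, inst_i in inst_jaro.items():
        # Steps 1 and 2: keep candidates whose Jaro similarity exceeds T_jaro.
        sinst_i = [name for name in inst_i if jaro(s, name) > t_jaro]
        # Step 3: merge and deduplicate (ASSUMPTION: merge s_i with sinst_i).
        final_i = sorted({s, *sinst_i})
        # Step 4: map cleaned names back to the original names.
        o_i = [orig for f in final_i for orig in originals_by_clean.get(f, [])]
        # Step 5: collect the per-institution result.
        result[s] = o_i
    return result

# Hypothetical toy data.
inst_jaro = {"peking univ": ["peking university", "pku"]}
originals_by_clean = {
    "peking univ": ["Peking Univ."],
    "peking university": ["Peking University", "PEKING UNIVERSITY"],
    "pku": ["PKU"],
}
O = standardize(inst_jaro, originals_by_clean, t_jaro=0.8)
print(O)
```

With these toy inputs the abbreviation "pku" falls below the threshold, which shows why the word vector search and the string similarity are used together rather than either one alone.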
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010844347.XA CN111984776B (en) | 2020-08-20 | 2020-08-20 | Mechanism name standardization method based on word vector model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111984776A (en) | 2020-11-24 |
CN111984776B (en) | 2023-08-11 |
Family
ID=73444173
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010844347.XA Active CN111984776B (en) | 2020-08-20 | 2020-08-20 | Mechanism name standardization method based on word vector model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111984776B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102197406A (en) * | 2008-10-23 | 2011-09-21 | 起元技术有限责任公司 | Fuzzy data operations |
CN107273977A (en) * | 2008-10-23 | 2017-10-20 | 起元技术有限责任公司 | Method, system and machine readable hardware storage apparatus for identifying matching |
CN101576898A (en) * | 2008-11-26 | 2009-11-11 | 北京中加国道科技有限公司 | Metadata proposal suitable for permanently filing and using network academic resources |
CN104281570A (en) * | 2013-07-01 | 2015-01-14 | 富士通株式会社 | Information processing method and device and method and device for standardizing organization names |
CN104881398A (en) * | 2014-08-29 | 2015-09-02 | 北京大学 | Method for extracting author affiliation information of English literature published by Chinese authors |
CN105653590A (en) * | 2015-12-21 | 2016-06-08 | 青岛智能产业技术研究院 | Name duplication disambiguation method of Chinese literature authors |
CN106250524A (en) * | 2016-08-04 | 2016-12-21 | 浪潮软件集团有限公司 | Organization name extraction method and device based on semantic information |
JP2019061550A (en) * | 2017-09-27 | 2019-04-18 | 株式会社ミラボ | Standard item name setting device, standard item name setting method, and standard item name setting program |
CN110956043A (en) * | 2019-12-17 | 2020-04-03 | 人和未来生物科技(长沙)有限公司 | Domain professional vocabulary word embedding vector training method, system and medium based on alias standardization |
CN111160011A (en) * | 2019-12-17 | 2020-05-15 | 浙江大华技术股份有限公司 | Organization unit standardization method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
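Implementation of research institution name normalization (科研机构名称归一化实现); Jia Junzhi (贾君枝) et al.; Library and Information Service (《图书情报工作》); Vol. 62, No. 13; pp. 103-110 * |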
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||