WO2021142968A1 - Multilingual-oriented semantic similarity calculation method for general place names, and application thereof - Google Patents
Multilingual-oriented semantic similarity calculation method for general place names, and application thereof
- Publication number
- WO2021142968A1 (application PCT/CN2020/085814)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- place
- names
- similarity
- name
- category
- Prior art date
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F18/00—Pattern recognition
        - G06F18/20—Analysing
          - G06F18/22—Matching criteria, e.g. proximity measures
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
          - G06F16/28—Databases characterised by their database models, e.g. relational or object models
            - G06F16/284—Relational databases
              - G06F16/285—Clustering or classification
          - G06F16/29—Geographical information databases
        - G06F16/90—Details of database functions independent of the retrieved data types
          - G06F16/903—Querying
            - G06F16/90335—Query processing
              - G06F16/90344—Query processing by using string matching techniques
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
  - Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    - Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
      - Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
        - Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Remote Sensing (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Claims (8)
- A multilingual-oriented method for calculating the semantic similarity of general place names, characterized in that it comprises the following steps:
  - determining the language of each place name according to the language coding interval, and normalizing the place name into a romanized place name according to document information;
  - obtaining the category attribute information of the two place names from a place-name information database, and calculating the place-name category similarity according to a place-name classification system and a place-name category similarity model;
  - calculating the string similarity of the romanized place names according to a place-name string similarity model;
  - obtaining the longitude and latitude of the two place names from the place-name information database, and then calculating their spatial proximity according to a place-name spatial proximity model;
  - determining the semantic similarity of the place names from their category similarity, string similarity and spatial proximity.
- The method for calculating the semantic similarity of place names according to claim 1, wherein calculating the place-name category similarity according to the place-name classification system and the place-name category similarity model comprises:
  - if the categories of the two place names lie under the same subclass of the place-name classification system, calculating the sum of the distances from the common parent category to the root node and the distance from the nearest common parent category to the two place-name categories, and then calculating the category similarity with the same-category similarity model;
  - if the categories of the two place names lie under different subclasses, calculating the correlation between the subclasses that contain the two place-name categories, and then calculating the category similarity with the non-same-category similarity model.
- The method for calculating the semantic similarity of place names according to claim 2, wherein the category similarity model for categories under the same subclass is expressed by a formula (rendered as an image in the original publication and not reproduced in this text) in which $S_C(i,j)$ denotes the category similarity of place names $i$ and $j$; $l$ denotes the distance from the nearest common parent of the categories of $i$ and $j$ to the root node; $d_i$ denotes the distance from that nearest common parent to the category of $i$; $d_j$ denotes the distance from that nearest common parent to the category of $j$; and $\alpha(i,j)$ denotes the sum of the distances from the nearest common parent to the categories of $i$ and $j$.
- The method for calculating the semantic similarity of place names according to claim 2, wherein the category similarity model for categories under different subclasses is expressed by a formula (rendered as an image in the original publication and not reproduced in this text) in which $S_C(i,j)$ denotes the category similarity of place names $i$ and $j$; $\beta'$ denotes the correlation between the subclasses that contain the categories of $i$ and $j$; $d'_i$ denotes the distance from the nearest common parent of the categories of $i$ and $j$ to the category of $i$; $d'_j$ denotes the distance from that nearest common parent to the category of $j$; and $\alpha'(i,j)$ denotes the sum of the distances from the nearest common parent to the categories of $i$ and $j$.
- The method for calculating the semantic similarity of place names according to claim 1, wherein the place-name string similarity model is expressed by a formula (rendered as an image in the original publication and not reproduced in this text) in which $A(i,j)$ denotes the string similarity of place names $i$ and $j$; $d[i,j]$ denotes the edit distance between place names $i$ and $j$; $ML$ denotes the larger of the string lengths of $i$ and $j$; $Len$ denotes the minimum matching length; $L(i)$ and $L(j)$ denote the string lengths of place names $i$ and $j$; and $a$ and $b$ are weights.
- The method for calculating the semantic similarity of place names according to claim 1, wherein the place-name spatial proximity model is expressed by a formula (rendered as an image in the original publication and not reproduced in this text) in which $S_E(i,j)$ denotes the spatial proximity of place names $i$ and $j$, and $lon_i$, $lon_j$, $lat_i$ and $lat_j$ are the longitudes and latitudes of place names $i$ and $j$.
- The method for calculating the semantic similarity of place names according to claim 1, wherein the calculation model of the place-name semantic similarity is $F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$, where $S_C(i,j)$ denotes the category similarity of place names $i$ and $j$, $A(i,j)$ denotes their string similarity, $S_E(i,j)$ denotes their spatial proximity, and $F(i,j)$ denotes their semantic similarity.
- An application of the method for calculating the semantic similarity of place names to multilingual place-name data query, characterized in that it comprises the following steps:
  - extracting the string, category, and longitude and latitude attributes of all place names from the place-name information database, determining the language of each place name according to the language coding interval, normalizing the place names, and building indexes by a phonographic or an ideographic indexing method according to the characteristics of the place-name language, wherein: for phonographic scripts, the index is built on letter similarity, combined with the language features of total number of letters, number of letter radicals, total number of words and word-initial letter codes, and is organized as multi-dimensional feature statistical vectors; for ideographic scripts, the index is built on local character similarity, combined with the language features of shared characters, number of characters and character positions in the place names, and is organized as a single-character place-name index;
  - determining the string, category, and longitude and latitude attributes of the place name to be queried, and normalizing them;
  - filtering all entries in the index in turn by the string, category, and longitude and latitude attributes determined for the place name to be queried, wherein: for the determined place-name string, the place-name string similarity model is applied, and an entry meets the filter condition when the result is higher than a set threshold and is filtered out otherwise, and if the string is empty the filter condition is met directly; for the determined place-name category, the category similarity model is applied, and an entry meets the filter condition when the result is higher than a set threshold and is filtered out otherwise, and if the category is empty the filter condition is met directly; for the determined longitude and latitude, the place-name spatial proximity model is applied, and an entry meets the filter condition when the result is higher than a set threshold and is filtered out otherwise, and if the longitude and latitude are empty the filter condition is met directly;
  - calculating, in turn, the similarity between the place name to be queried and every candidate place name with the multilingual-oriented method for calculating the semantic similarity of general place names according to any one of claims 1-7;
  - sorting the results in descending order, so that the higher a place name ranks, the more similar it is to the place name to be queried.
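The query flow in the last claim (filter by each attribute against a threshold, score the survivors with the product $F = A \cdot S_E \cdot S_C$, sort in descending order) can be sketched roughly as follows. This is an illustrative sketch only, not the patented models: the formulas for $A$, $S_C$ and $S_E$ are not reproduced in this text, so the code substitutes a normalized edit distance, a Wu-Palmer-style path measure, and an exponential decay over haversine distance. The `Place` class, the function names, the threshold values, the 50 km scale, and the choice to treat a missing attribute on either side as a neutral score of 1.0 are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Place:
    name: Optional[str] = None                    # romanized place name
    category: Optional[Tuple[str, ...]] = None    # path from taxonomy root (hypothetical encoding)
    lonlat: Optional[Tuple[float, float]] = None  # (longitude, latitude) in degrees


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Stand-in for A(i,j): 1 - d/ML, with ML the longer string length."""
    ml = max(len(a), len(b))
    return 1.0 if ml == 0 else 1.0 - edit_distance(a, b) / ml


def category_similarity(ci: Tuple[str, ...], cj: Tuple[str, ...]) -> float:
    """Stand-in for S_C(i,j): Wu-Palmer-style 2l / (2l + d_i + d_j)."""
    l = 0  # depth of the nearest common parent (length of the shared prefix)
    for x, y in zip(ci, cj):
        if x != y:
            break
        l += 1
    d_i, d_j = len(ci) - l, len(cj) - l
    denom = 2 * l + d_i + d_j
    return 1.0 if denom == 0 else 2 * l / denom


def spatial_proximity(pi, pj, scale_km: float = 50.0) -> float:
    """Stand-in for S_E(i,j): exp(-haversine distance / scale_km)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*pi, *pj))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))
    return math.exp(-dist_km / scale_km)


def _score(fn, q_attr, c_attr) -> float:
    # Assumption: an empty attribute passes the filter and contributes 1.0.
    if not q_attr or not c_attr:
        return 1.0
    return fn(q_attr, c_attr)


def rank(query: Place, candidates: List[Place],
         thresholds=(0.5, 0.5, 0.1)) -> List[Tuple[float, Place]]:
    """Filter candidates by the three thresholds, then sort by F descending."""
    results = []
    for cand in candidates:
        a = _score(string_similarity, query.name, cand.name)
        c = _score(category_similarity, query.category, cand.category)
        e = _score(spatial_proximity, query.lonlat, cand.lonlat)
        if a <= thresholds[0] or c <= thresholds[1] or e <= thresholds[2]:
            continue  # a score must be higher than its threshold to pass
        results.append((a * e * c, cand))
    results.sort(key=lambda t: t[0], reverse=True)
    return results
```

Because the three scores are multiplied, a candidate that is weak on any one dimension is pushed down the ranking even if it matches well on the other two; that is the main behavioural consequence of the product form stated in the claims, independent of which concrete sub-models are plugged in.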
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010058317.6A CN111325235B (en) | 2020-01-19 | 2020-01-19 | Multilingual-oriented universal place name semantic similarity calculation method and application thereof |
CN202010058317.6 | 2020-01-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021142968A1 (en) | 2021-07-22 |
Family ID: 71170946
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/085814 WO2021142968A1 (en) | 2020-01-19 | 2020-04-21 | Multilingual-oriented semantic similarity calculation method for general place names, and application thereof |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN111325235B (en) |
AU (1) | AU2020101024A4 (en) |
WO (1) | WO2021142968A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076734B (en) * | 2021-04-15 | 2023-01-20 | 云南电网有限责任公司电力科学研究院 | Similarity detection method and device for project texts |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140229501A1 (en) * | 2011-10-20 | 2014-08-14 | Deutsche Post Ag | Comparing positional information |
CN107239442A (en) * | 2017-05-09 | 2017-10-10 | 北京京东金融科技控股有限公司 | Method and apparatus for calculating address similarity |
CN107861947A (en) * | 2017-11-07 | 2018-03-30 | 昆明理工大学 | Method for Khmer named entity recognition based on cross-language resources |
CN108171529A (en) * | 2017-12-04 | 2018-06-15 | 昆明理工大学 | Address similarity evaluation method |
CN108572960A (en) * | 2017-03-08 | 2018-09-25 | 富士通株式会社 | Place name disambiguation method and place name disambiguation device |
CN108804398A (en) * | 2017-05-03 | 2018-11-13 | 阿里巴巴集团控股有限公司 | The similarity calculating method and device of address text |
CN110276021A (en) * | 2019-04-29 | 2019-09-24 | 小轮(上海)网络科技有限公司 | Place name matching process and device based on semantic similarity |
CN110598791A (en) * | 2019-09-12 | 2019-12-20 | 深圳前海微众银行股份有限公司 | Address similarity evaluation method, device, equipment and medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2158540A4 (en) * | 2007-06-18 | 2010-10-20 | Geographic Services Inc | Geographic feature name search system |
CN103605752A (en) * | 2013-11-21 | 2014-02-26 | 武大吉奥信息技术有限公司 | Address matching method based on semantic recognition |
2020
- 2020-01-19: CN application CN202010058317.6A, patent CN111325235B (en), status: Active
- 2020-04-21: WO application PCT/CN2020/085814, publication WO2021142968A1 (en), status: Application Filing
- 2020-04-21: AU application AU2020101024A, patent AU2020101024A4 (en), status: Not active (Ceased)
Also Published As
Publication number | Publication date |
---|---|
AU2020101024A4 (en) | 2020-07-23 |
CN111325235B (en) | 2023-04-25 |
CN111325235A (en) | 2020-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111104794B (en) | Text similarity matching method based on subject term | |
US10515090B2 (en) | Data extraction and transformation method and system | |
US7827125B1 (en) | Learning based on feedback for contextual personalized information retrieval | |
CN111291161A (en) | Legal case knowledge graph query method, device, equipment and storage medium | |
CN111680173A (en) | CMR model for uniformly retrieving cross-media information | |
CN111522910B (en) | Intelligent semantic retrieval method based on cultural relic knowledge graph | |
CN109190117A (en) | Short-text semantic similarity calculation method based on word vectors | |
CN106844331A (en) | Sentence similarity calculation method and system | |
CN110377747B (en) | Knowledge base fusion method for encyclopedic website | |
CN110147421B (en) | Target entity linking method, device, equipment and storage medium | |
CN110413787B (en) | Text clustering method, device, terminal and storage medium | |
JP5057474B2 (en) | Method and system for calculating competition index between objects | |
WO2021114825A1 (en) | Method and device for institution standardization, electronic device, and storage medium | |
CN114254653A (en) | Scientific and technological project text semantic extraction and representation analysis method | |
Fu et al. | Automatic record linkage of individuals and households in historical census data | |
CN103646112A (en) | Dependency parsing field self-adaption method based on web search | |
CN106886565B (en) | Automatic polymerization method for foundation house type | |
Hossny et al. | Feature selection methods for event detection in Twitter: a text mining approach | |
CN116848490A (en) | Document analysis using model intersection | |
Sebti et al. | A new word sense similarity measure in WordNet | |
CN111553160A (en) | Method and system for obtaining answers to question sentences in legal field | |
CN114997288A (en) | Design resource association method | |
CN112486919A (en) | Document management method, system and storage medium | |
Pang et al. | A text similarity measurement based on semantic fingerprint of characteristic phrases | |
WO2021142968A1 (en) | Multilingual-oriented semantic similarity calculation method for general place names, and application thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20913178; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 20913178; Country of ref document: EP; Kind code of ref document: A1 |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 20913178; Country of ref document: EP; Kind code of ref document: A1 |
 | 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.01.2023) |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 20913178; Country of ref document: EP; Kind code of ref document: A1 |