WO2021142968A1 - Multilingual-oriented semantic similarity calculation method for general place names, and application thereof

Info

Publication number
WO2021142968A1
Authority
WO
WIPO (PCT)
Prior art keywords
place
names
similarity
name
category
Prior art date
Application number
PCT/CN2020/085814
Other languages
French (fr)
Chinese (zh)
Inventor
张雪英
薛理
叶鹏
赵文强
吴恪涵
Original Assignee
南京师范大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京师范大学
Publication of WO2021142968A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A multilingual-oriented semantic similarity calculation method for general place names, and the application thereof. The method comprises: determining the language of a place name according to language coding intervals, and normalizing the place name to a romanized place name according to document information; acquiring, from a place name information library, category attribute information of two place names, and calculating a place name category similarity according to a place name classification system and a place name category similarity model; calculating, according to a place name character string similarity model, a character string similarity of the romanized place name; acquiring, from the place name information library, the longitude and latitude for each of the two place names, and then calculating spatial proximity according to a place name spatial proximity model; and determining a place name semantic similarity according to the place name category similarity, the character string similarity, and the spatial proximity. Compared with place name similarity calculation methods which only take place name character strings or spatial geometric features into consideration, the method can significantly improve the accuracy of place name similarity calculation, and can better satisfy application requirements, such as querying, matching and sharing services for multilingual place names, in a big data environment.

Description

Multilingual-oriented semantic similarity calculation method for general place names, and application thereof
Technical field
The invention belongs to the field of geographic information science and relates to a multilingual-oriented method for calculating the semantic similarity of general place names, and to its application in place name queries over multilingual databases.
Background art
A place name is a linguistic symbol that people have agreed upon for a geographic object or geographic phenomenon that has a specific location, extent, and morphological characteristics in the geographic environment. Semantics is the meaning of the concepts that data (symbols) represent, together with the relationships among those meanings. With the development of computer technology and the spread of the mobile Internet, countries, institutions, and enterprises have built many kinds of place name information libraries, most of which include information such as place name category and longitude and latitude. These libraries nevertheless differ considerably in coverage, data format, language, and content. How to compute the similarity of place names across different place name information libraries quickly and accurately has therefore become an important topic in place name research.
Current methods for calculating place name similarity fall into three main categories. ① The first category is based on place name strings: similarity is computed by comparing the character strings of the place names. For example, Smart et al. combined a rule model with a hidden Markov model, which effectively resolves inconsistencies in place name spelling, format, and character set. Zhan Binbin et al. used a generic-term dictionary and a structural rule base built from place names to determine the place name type, then obtained the best match through string similarity matching, with good validation results in the Dezhou City experimental area. Ye Peng et al., taking the multi-level features of Chinese characters into account, built a single-character place name index from a Chinese place name dictionary and used character filtering and similarity ranking to match Chinese place names efficiently. ② The second category is based on geographic features: similarity is computed from geometric information such as the spatial location, area, and shape of the named features. For example, Egenhofer and Clementini proposed criteria for measuring inconsistencies in spatial geometric data structure and in topological relations across multiple representations, which allow the consistency of spatial geometric data to be judged well. Van et al. used k-medoids clustering and naive Bayes classification to reconcile the place names of geotagged photos. ③ The third category is based on place name semantics. For example, Chen Jiali notes that multiply represented spatial data may be inconsistent in spatial relations, semantics, and geometry, so these inconsistencies must be evaluated and corrected; that work introduces ontology into geographic information modeling and, combined with semantic consistency, matches data by an object-matching approach.
These scholars have achieved good results in place name similarity calculation, but some problems remain. ① Algorithms such as edit distance compute similarity from a single feature of a place name, such as its character string or its geometry, and ignore its other features. As a result, accuracy is poor in certain special cases, particularly duplicate place names and place names that are spatially close. ② Some algorithms were designed for a specific language and do not apply to others. How to compute place name similarity when the data come from many sources, have complex structures, and differ widely in semantics is therefore a problem that those skilled in the art need to study and solve.
Summary of the invention
Purpose of the invention: In view of the above, the present invention provides a multilingual-oriented method for calculating the semantic similarity of general place names, which aims to solve the low accuracy and weak generality of existing place name similarity calculation methods.
Technical solution: To achieve the above purpose, the present invention adopts the following technical solution:
A multilingual-oriented semantic similarity calculation method for general place names, comprising the following steps:
Determine the language of each place name according to language coding intervals, and normalize the place name to a romanized place name according to document information;
Obtain the category attribute information of the two place names from the place name information library, and calculate the place name category similarity according to the place name classification system and the place name category similarity model;
Calculate the string similarity of the romanized place names according to the place name string similarity model;
Obtain the longitude and latitude of the two place names from the place name information library, and calculate their spatial proximity according to the place name spatial proximity model;
Determine the place name semantic similarity according to the place name category similarity, the string similarity, and the spatial proximity.
Preferably, calculating the place name category similarity according to the place name classification system and the place name category similarity model comprises:
If the two place name categories fall under the same subclass of the classification system, compute the distance from their nearest common parent category to the root node and the distances from that parent to each of the two categories, then compute the category similarity with the same-subclass similarity model; if the two categories fall under different subclasses, determine the relatedness of the two subclasses and then compute the category similarity with the different-subclass similarity model.
Preferably, the category similarity model for categories under the same subclass is expressed as:
Figure PCTCN2020085814-appb-000001
where $S_C(i,j)$ denotes the place name category similarity of place names i and j; $l$ denotes the distance from the nearest common parent of the categories of i and j to the root node; $d_i$ denotes the distance from that nearest common parent to the category of i; $d_j$ denotes the distance from that nearest common parent to the category of j; and $\alpha(i,j)$ denotes the sum of the distances from the nearest common parent to the categories of i and j.
Preferably, the category similarity model for categories under different subclasses is expressed as:
Figure PCTCN2020085814-appb-000002
where $S_C(i,j)$ denotes the place name category similarity of place names i and j; $\beta'$ denotes the relatedness of the subclasses that contain the categories of i and j; $d'_i$ denotes the distance from the nearest common parent of the categories of i and j to the category of i; $d'_j$ denotes the distance from that nearest common parent to the category of j; and $\alpha'(i,j)$ denotes the sum of the distances from the nearest common parent to the categories of i and j.
Preferably, the place name string similarity model is expressed as:
Figure PCTCN2020085814-appb-000003
where $A(i,j)$ denotes the string similarity of place names i and j; $d[i,j]$ is the edit distance between place names i and j; $ML$ is the greater of the two string lengths; $Len$ is the minimum matching length; $L(i)$ is the length of the string of place name i; $L(j)$ is the length of the string of place name j; and $a$ and $b$ are weights.
Preferably, the spatial proximity is calculated with the place name spatial proximity model, which is expressed as:
Figure PCTCN2020085814-appb-000004
Figure PCTCN2020085814-appb-000005
where $S_E(i,j)$ denotes the spatial proximity of place names i and j, and $lon_i$, $lon_j$, $lat_i$ and $lat_j$ are the longitudes and latitudes of place names i and j.
Preferably, the place name semantic similarity is computed as:
$F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$
where $F(i,j)$ denotes the place name semantic similarity of place names i and j.
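The product form above can be illustrated with a few lines of Python. This is a minimal sketch: the function name `semantic_similarity` and the range check are assumptions added for illustration, on the further assumption that each component score lies in [0, 1].

```python
def semantic_similarity(a: float, s_e: float, s_c: float) -> float:
    """Combine the three component scores as F = A * S_E * S_C.

    a   : string similarity A(i, j)
    s_e : spatial proximity S_E(i, j)
    s_c : category similarity S_C(i, j)
    Assumption: every component is a score in [0, 1].
    """
    for label, value in (("A", a), ("S_E", s_e), ("S_C", s_c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must lie in [0, 1], got {value}")
    return a * s_e * s_c


if __name__ == "__main__":
    # Similar strings, identical location, loosely related categories.
    print(semantic_similarity(0.9, 1.0, 0.5))
```

Because the three scores are multiplied, a value of 0 in any one component drives the overall similarity to 0, so a pair must be plausible on all three features to score well.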
The application of the place name semantic similarity calculation method to multilingual place name data queries mainly comprises the following steps:
Extract the character string, category, and longitude and latitude attributes of all place names from the place name information library; determine the language of each place name according to the language coding intervals and normalize it; and build the index by one of two methods, phonographic or ideographic, depending on the characteristics of the language. For phonographic scripts, the index takes letter similarity as its basis, combines language features (total number of letters, number of letter radicals, total number of words, and word-initial letter codes), and is organized as multi-dimensional statistical feature vectors. For ideographic scripts, the index takes local character similarity as its basis, combines language features (shared characters, number of characters, and character positions), and is organized by single character;
Determine the character string, category, and longitude and latitude attributes of the place name to be queried, and normalize them;
Filter all place names in the index by the character string, category, and longitude and latitude of the query place name in turn. For the character string, compute the place name string similarity model; a candidate passes if the result is above the set threshold and is filtered out otherwise, and if the query string is empty every candidate passes. For the category, compute the category similarity model; a candidate passes if the result is above the set threshold and is filtered out otherwise, and if the query category is empty every candidate passes. For the longitude and latitude, compute the place name spatial proximity model; a candidate passes if the result is above the set threshold and is filtered out otherwise, and if the query longitude and latitude are empty every candidate passes;
Compute the similarity between the query place name and each candidate place name in turn, using the multilingual-oriented general place name semantic similarity calculation method;
Sort the results in descending order; the higher a place name ranks, the more similar it is to the query place name.
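The filter-then-rank query steps above can be sketched as follows. This is an illustrative outline only: the record layout (dicts with `name`, `category`, and `coord` keys), the `search` function, and the way the three models are passed in as callables are all assumptions. The sketch also assumes that an empty query attribute is simply left out of the final product, which the text does not specify.

```python
from typing import Any, Callable, Dict, List, Tuple

ATTRS = ("name", "category", "coord")  # hypothetical attribute keys

Model = Callable[[Any, Any], float]


def search(query: Dict[str, Any],
           candidates: List[Dict[str, Any]],
           models: Dict[str, Model],
           thresholds: Dict[str, float]) -> List[Tuple[float, Dict[str, Any]]]:
    """Filter candidates attribute by attribute, then rank the survivors."""
    ranked = []
    for cand in candidates:
        score = 1.0
        keep = True
        for attr in ATTRS:
            q = query.get(attr)
            if q is None or q == "":
                continue  # empty query attribute: the candidate passes this filter
            s = models[attr](q, cand[attr])
            if s <= thresholds[attr]:
                keep = False  # not above the threshold: filter the candidate out
                break
            score *= s  # surviving scores are multiplied into the final similarity
        if keep:
            ranked.append((score, cand))
    # Descending order: the most similar place names come first.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

In practice the three callables would be the string similarity, category similarity, and spatial proximity models, and the index described above would narrow the candidate list before this loop runs.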
Beneficial effects: Based on the word-formation characteristics, categories, and locations of place names, the present invention constructs a place name category similarity model, a place name string similarity model, and a place name spatial proximity model, and from these three models derives a general method for calculating place name semantic similarity. The invention improves the edit distance algorithm so that the influence of both the generic term and the specific term of a place name is taken into account. It introduces place name category features and builds the category similarity model on the place name classification system. It also considers the spatial characteristics of place names and builds the spatial proximity model. Finally, it combines the string, location, and category features of place names into a general method for calculating place name semantic similarity. Compared with similarity calculation methods that rely on a single feature, it therefore achieves higher accuracy and broader applicability.
Description of the drawings
Fig. 1 is a flowchart of the method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the place name category structure in an embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to specific embodiments.
As shown in Fig. 1, the multilingual-oriented general place name semantic similarity calculation method disclosed in this embodiment of the present invention mainly comprises the following steps:
Step 1: Identify the languages of place names i and j according to place name coding intervals, and normalize place names i and j to romanized place names according to document information.
Because of differences in data acquisition methods and human factors, data in different languages vary widely in format and encoding. Place names therefore need to be preprocessed so that the corresponding information, such as the place name category, can be found in the place name information library.
In this step, a place name coding interval is the coding interval that corresponds to a given language: the Unicode hexadecimal code interval of each language is unique, so the language of a place name can be determined from the coding interval of its characters.
Romanized place names are the romanized forms of place names given in each country's most recent official gazetteers, place name dictionaries, local chronicles, and similar sources.
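The coding-interval lookup in Step 1 can be sketched in Python as below. The range table `SCRIPT_RANGES` and the function `detect_script` are hypothetical and cover only a handful of Unicode blocks chosen for illustration. Note that Unicode ranges identify writing systems rather than languages (several languages share the Latin or Cyrillic script), so a real system would likely need additional information to separate languages within one script.

```python
# Hypothetical table of Unicode code point ranges, for illustration only.
SCRIPT_RANGES = {
    # Basic Latin letters plus (approximately) Latin-1 Supplement and Extended-A/B
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Arabic": [(0x0600, 0x06FF)],
    "CJK": [(0x4E00, 0x9FFF)],      # CJK Unified Ideographs
    "Hangul": [(0xAC00, 0xD7AF)],   # Hangul Syllables
}


def detect_script(name: str) -> str:
    """Return the script whose ranges cover the most characters of `name`."""
    counts = {}
    for ch in name:
        cp = ord(ch)
        for script, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[script] = counts.get(script, 0) + 1
                break
    if not counts:
        return "Unknown"  # digits, punctuation, or scripts missing from the table
    return max(counts, key=counts.get)


if __name__ == "__main__":
    for place in ("南京", "Москва", "Nanjing"):
        print(place, detect_script(place))
```

Counting characters per script, rather than checking only the first character, keeps the result stable when a name mixes in digits, spaces, or punctuation.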
Step 2: Obtain the categories of place names i and j from the place name information library, and compute their category similarity according to the place name category similarity model.
In this step, place name category similarity is the degree of relatedness between the categories of two place name records within the same classification system. Place name data categories classify the data by thematic feature, and the classification system can use a hierarchical tree structure to describe the logical relationships between classes. Place name categories follow the place name classification system; the category comparison table is shown in Table 1.
Table 1 Comparison of GeoNames and GNS feature categories
Figure PCTCN2020085814-appb-000006
The GNIS data source provides the full name of each category directly. With reference to the classification standard above, the place name feature categories contained in each major class are summarized and a mapping table between GNIS categories and the standard classification is designed, as shown in Table 2. Using the mapping in that table, a feature category code attribute is added to the GNIS data. Table 3 is a partial place name classification code table.
Table 2 Mapping between GNIS categories and the standard classification
Figure PCTCN2020085814-appb-000007
Figure PCTCN2020085814-appb-000008
Table 3 Partial place name classification codes
Figure PCTCN2020085814-appb-000009
Analysis shows that the category similarity among place name attributes reflects how closely related the categories of two records are within the same classification system. Computing the relatedness between classes therefore requires handling the different kinds of relationships in the classification tree, such as parent-child nodes and sibling nodes. For ease of understanding, Fig. 2 shows a tree diagram of some of the categories in major class P. The place name category similarity function is denoted $S_C(i,j)$. When the categories of place names i and j fall under the same subclass, $S_C(i,j)$ is computed as follows (for example, in Fig. 2, if place names i and j belong to categories PPA1 and PPA3 respectively, both categories belong to the same subclass PPA):
Figure PCTCN2020085814-appb-000010
where $l$ is the distance (number of edges) from the nearest common parent of the categories of i and j to the root node; $d_i$ is the distance (number of edges) from that nearest common parent to the category of i; $d_j$ is the distance (number of edges) from that nearest common parent to the category of j; and $\alpha(i,j)$ is the sum of the distances from the nearest common parent to the categories of i and j.
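The tree quantities $l$, $d_i$, $d_j$ and $\alpha(i,j)$ can be computed from a child-to-parent table, as in the sketch below. The tree fragment in `PARENT` is hypothetical (only PPA1, PPA3, PPA and P are named in the text; the rest, including the depth of PPA, is assumed). Since the similarity formula itself is given only as a figure, the function `wu_palmer` substitutes the well-known Wu-Palmer-style measure $2l / (2l + d_i + d_j)$ purely as a stand-in; it should not be read as the model defined above.

```python
# Hypothetical fragment of a classification tree, stored as child -> parent.
PARENT = {
    "P": "ROOT",
    "PPA": "P",
    "PPB": "P",
    "PPA1": "PPA",
    "PPA3": "PPA",
    "PPB1": "PPB",
}


def path_to_root(category):
    """List the category and all of its ancestors, ending at the root."""
    path = [category]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distances(cat_i, cat_j):
    """Return (l, d_i, d_j), each counted in edges."""
    path_i = path_to_root(cat_i)
    steps_j = {c: k for k, c in enumerate(path_to_root(cat_j))}
    for d_i, ancestor in enumerate(path_i):
        if ancestor in steps_j:  # first shared ancestor = nearest common parent
            l = len(path_to_root(ancestor)) - 1
            return l, d_i, steps_j[ancestor]
    raise ValueError("categories do not share a root")


def wu_palmer(cat_i, cat_j):
    """Stand-in similarity (Wu-Palmer style), NOT the formula in the figure."""
    l, d_i, d_j = tree_distances(cat_i, cat_j)
    alpha = d_i + d_j  # alpha(i, j): sum of the two distances
    if l == 0 and alpha == 0:
        return 1.0  # both categories are the root itself
    return 2 * l / (2 * l + alpha)


if __name__ == "__main__":
    print(tree_distances("PPA1", "PPA3"))
    print(wu_palmer("PPA1", "PPA3"))
```

With this hypothetical tree, PPA1 and PPA3 share the parent PPA, so each lies one edge from it, while PPA1 and PPB1 only meet at P and therefore score lower under the stand-in measure.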
When the categories of i and j fall under different subclasses, $S_C(i,j)$ is computed as follows:
Figure PCTCN2020085814-appb-000011
where $\beta'$ is the relatedness of the subclasses that contain the categories of i and j, takes a value in [0,1], and can be set by domain experts according to the application; $d'_i$ is the distance (number of edges) from the nearest common parent of the categories of i and j to the category of i; $d'_j$ is the distance (number of edges) from that nearest common parent to the category of j; and $\alpha'(i,j)$ is the sum of the distances from the nearest common parent to the categories of i and j.
Step 3: Compute the name similarity of the romanized place names i and j according to the place name string similarity model.
Edit distance, also known as Levenshtein distance, is a distance metric for measuring the similarity of two sequences. In natural language processing, it is the minimum number of insertion, deletion, and substitution operations needed to convert a source string into a target string. Let $S_i = s_1 s_2 \ldots s_i$ and $T_j = t_1 t_2 \ldots t_j$ be two strings; the distance $d[i,j]$ is the minimum number of operations needed to edit $S_i$ into $T_j$. Applied to place names i and j, $d[i,j]$ effectively reflects how similar their characters are. The formula is as follows:
Figure PCTCN2020085814-appb-000012
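A minimal Python sketch of the edit distance $d[i,j]$ follows, using the standard dynamic-programming recurrence with unit costs for insertion, deletion, and substitution. The helper `baseline_similarity`, which normalizes the distance by the longer string length, is an assumption about how a plain edit-distance similarity might be derived; it is not the improved model of this invention. Under that assumption it gives about 0.636 for Gwaun Creek and Gunye Creek.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def baseline_similarity(s: str, t: str) -> float:
    """Assumed baseline: 1 - d / max(len(s), len(t)). Not the improved model."""
    ml = max(len(s), len(t))
    return 1.0 if ml == 0 else 1.0 - levenshtein(s, t) / ml


if __name__ == "__main__":
    print(levenshtein("Gwaun Creek", "Gunye Creek"))
    print(round(baseline_similarity("Gwaun Creek", "Gunye Creek"), 3))
```

Only two rows of the distance table are kept at a time, which is enough to obtain the final distance and keeps memory use proportional to the length of the second string.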
Edit distance is often used to compute the similarity of place name strings, but it cannot effectively reduce the influence of generic terms. The algorithm is therefore improved; the improved model is as follows:
Figure PCTCN2020085814-appb-000013
where $d[i,j]$ is the edit distance between place names i and j; $ML$ is the greater of the two string lengths; $Len$ is the minimum matching length ($Len \ge 1$); $L(i)$ is the length of string i; $L(j)$ is the length of string j; and $a$ and $b$ are weights, set to 0.6 and 0.4 respectively. Table 4 compares the name similarity results of the improved model with those of existing models.
Table 4 Comparison of place name string similarity results
Figure PCTCN2020085814-appb-000014
As the table shows, Gwaun Creek and Gunye Creek are different place names, yet the edit distance algorithm gives them a similarity as high as 0.636. Wilipini and Willipinee are the same place name, yet the greedy string matching algorithm gives a similarity of 0.555, while the different place names Gbonga and Gbondoi nevertheless score 0.615. The similarity computed by the improved algorithm of the present invention agrees more closely with the actual situation.
Step 4: Obtain the longitude and latitude of place names i and j from the place name information database, and calculate the spatial proximity of the place names using the place name spatial proximity model.
As a basic geographic feature, a place name may be a point feature (such as the name of a small village), a line feature (such as the name of a road), or a polygon feature (such as the name of an administrative district). The geometric similarity of place name data therefore covers measures of positional similarity for point features, similarity for line features, and geometric similarity for polygon features. The global place name data studied in the present invention are all point-feature place names.
The position of point-feature place names is usually measured by computing a distance. The basic idea is to extract a feature vector from each of the two point-feature place names and compute the distance between the two vectors in a given distance space. The smaller the distance, the more similar the two place names; conversely, the larger the distance, the greater the difference between them. The Euclidean distance is commonly used to represent the distance between two points.
The Euclidean distance is the ordinary straight-line distance between two points in Euclidean space and measures the absolute distance between points in a multidimensional space. The larger the Euclidean distance between two place names, the lower their similarity. Let i and j denote two place names whose longitudes and latitudes are lon_i, lon_j, lat_i and lat_j. The Euclidean spatial distance between the two place names is denoted dis_{i-j}.
Figure PCTCN2020085814-appb-000015
Let the place name spatial proximity function be S_E(i,j). The spatial distance similarity model designed by the present invention for the spatial characteristics of place name data is as follows.
Figure PCTCN2020085814-appb-000016
Here S_E(i,j) denotes the degree of similarity between the spatial extents of the two place names. If the two coincide, its value is 1; the farther apart the two are, the closer it approaches 0.
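The distance and proximity described in this step can be sketched as follows. The Euclidean distance follows directly from the definition above. The proximity function is only a stand-in: the form 1 / (1 + dis / scale) and the `scale` parameter are assumptions chosen to have the stated properties (value 1 at zero distance, approaching 0 as distance grows), and are not the model shown in the figure.

```python
import math


def euclidean_distance(lon_i, lat_i, lon_j, lat_j):
    """Straight-line distance dis_{i-j} in longitude/latitude coordinate space."""
    return math.hypot(lon_i - lon_j, lat_i - lat_j)


def spatial_proximity(lon_i, lat_i, lon_j, lat_j, scale=1.0):
    """Hypothetical S_E(i,j): 1 when the points coincide, tending to 0 as the
    distance grows. Both the formula and `scale` are illustrative assumptions."""
    dis = euclidean_distance(lon_i, lat_i, lon_j, lat_j)
    return 1.0 / (1.0 + dis / scale)


print(euclidean_distance(0.0, 0.0, 3.0, 4.0))     # 5.0
print(spatial_proximity(10.0, 20.0, 10.0, 20.0))  # 1.0
```

Any monotonically decreasing function of the distance with value 1 at zero would satisfy the description equally well; the choice of `scale` decides how quickly two points stop counting as the same place.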
Step 5: Calculate the place name semantic similarity using the place name semantic similarity model.
The place name semantic similarity model is as follows:
F(i,j) = A(i,j) S_E(i,j) S_C(i,j)
where F(i,j) is the place name semantic similarity, and A(i,j), S_E(i,j) and S_C(i,j) are, respectively, the place name string similarity, the place name spatial proximity and the place name category similarity, each normalized to the range [0,1].
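As a minimal sketch of the product model above, the three normalized scores can be combined as shown below. The function name and the range check are illustrative additions, not part of the source.

```python
def semantic_similarity(a, s_e, s_c):
    """F(i,j) = A(i,j) * S_E(i,j) * S_C(i,j), with every factor in [0, 1]."""
    for label, value in (("A", a), ("S_E", s_e), ("S_C", s_c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be normalized to [0, 1], got {value}")
    return a * s_e * s_c


print(semantic_similarity(1.0, 1.0, 1.0))  # 1.0
print(semantic_similarity(0.9, 0.0, 1.0))  # 0.0
```

Because the model is a product, a zero in any one factor drives the overall similarity to zero, so two names with identical strings but no spatial proximity are not treated as the same place.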
The experimental data comprise about 167,000 place name records in total, drawn from the various data sources for five countries: Honduras, Mauritius, Liberia, Mongolia and Zimbabwe. Of these, about 47,700 records are consistent and can be matched. Experiments were carried out with the multilingual-oriented general place name semantic similarity calculation method proposed by the present invention, and the results are shown in Table 5.
Table 5 Evaluation metrics for the experimental results
Figure PCTCN2020085814-appb-000017
The experimental results show that matching place names with the multilingual-oriented general place name semantic similarity calculation method keeps accuracy above 98% and also matches more than 97% of the actual place name data.
The application of the place name semantic similarity calculation method disclosed in the embodiment of the present invention to multilingual place name data query mainly comprises the following steps:
Step 1: Extract the attributes of all place names, such as string, category, and longitude and latitude, from the place name information database; determine the language of each place name from the language's encoding range and normalize the place name; and build the index by one of two methods, phonographic or ideographic, according to the characteristics of the place name language. For phonographic scripts, the index takes letter similarity as its basis, combines linguistic features such as the total number of letters, the number of letter radicals, the total number of words and the encoding of word-initial letters, and is organized as multidimensional feature statistics vectors. For ideographic scripts, the index takes local character similarity as its basis, combines linguistic features of the place name such as shared characters, number of characters and character positions, and is organized by single character.
Step 2: Determine all or some of the attributes of the place name to be queried, such as string, category, and longitude and latitude, and normalize them.
Step 3: Filter all entries in the index in turn by the string, category, and longitude and latitude attributes determined for the place name to be queried. For the determined string, compute the place name string similarity model; an entry meets the filter condition when the result is above the set threshold and is filtered out otherwise; if the string is empty, the entry meets the filter condition directly. For the determined category, compute the category similarity model; an entry meets the filter condition when the result is above the set threshold and is filtered out otherwise; if the category is empty, the entry meets the filter condition directly. For the determined longitude and latitude, compute the place name spatial proximity model; an entry meets the filter condition when the result is above the set threshold and is filtered out otherwise; if the longitude and latitude are empty, the entry meets the filter condition directly.
Step 4: Compute, in turn, the similarity between the place name to be queried and every candidate place name using the multilingual-oriented general place name semantic similarity calculation method.
Step 5: Sort the results in descending order; the higher a place name ranks, the more similar it is to the place name to be queried.
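Steps 3 to 5 above can be sketched as a filter followed by a ranking. Everything named here is hypothetical: the `PlaceName` record, the `query` function, the default thresholds of 0.5, and the three similarity callables passed in by the caller. The sketch also assumes that index entries carry every attribute, and that a query attribute left empty contributes a neutral factor of 1.0 to the final score; the source does not state how empty attributes are scored.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PlaceName:
    name: Optional[str] = None
    category: Optional[str] = None
    lonlat: Optional[Tuple[float, float]] = None


def _score(query_value, cand_value, sim):
    """Similarity for one attribute; an empty query attribute is neutral (assumption)."""
    return 1.0 if query_value is None else sim(query_value, cand_value)


def _passes(query_value, cand_value, sim, threshold):
    """Step 3: an empty query attribute passes directly; otherwise the
    similarity must be strictly above the threshold."""
    return query_value is None or sim(query_value, cand_value) > threshold


def query(q: PlaceName, index: List[PlaceName],
          string_sim: Callable, category_sim: Callable, spatial_sim: Callable,
          thresholds=(0.5, 0.5, 0.5)):
    t_str, t_cat, t_geo = thresholds
    # Step 3: filter by string, then category, then longitude/latitude.
    candidates = [c for c in index
                  if _passes(q.name, c.name, string_sim, t_str)
                  and _passes(q.category, c.category, category_sim, t_cat)
                  and _passes(q.lonlat, c.lonlat, spatial_sim, t_geo)]
    # Step 4: semantic similarity as the product of the three scores.
    scored = [(_score(q.name, c.name, string_sim)
               * _score(q.lonlat, c.lonlat, spatial_sim)
               * _score(q.category, c.category, category_sim), c)
              for c in candidates]
    # Step 5: descending order, most similar first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A caller would supply its own implementations of the string, category and spatial models; filtering before scoring keeps the more expensive combined computation to the candidates that survive all three thresholds.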

Claims (8)

  1. A multilingual-oriented method for calculating the semantic similarity of general place names, characterized by comprising the following steps:
    determining the language of a place name from the language's encoding range, and normalizing the place name into a romanized place name on the basis of documentary information;
    obtaining the category attribute information of two place names from a place name information database, and calculating the place name category similarity according to a place name classification system and a place name category similarity model;
    calculating the string similarity of the romanized place names according to a place name string similarity model;
    obtaining the longitude and latitude of the two place names from the place name information database, and then calculating the spatial proximity according to a place name spatial proximity model;
    determining the place name semantic similarity from the category similarity, the string similarity and the spatial proximity of the place names.
  2. The method for calculating the semantic similarity of place names according to claim 1, characterized in that calculating the place name category similarity according to the place name classification system and the place name category similarity model comprises:
    if the categories of the two place names fall under the same subclass of the place name classification system, calculating the sum of the distances from the common parent class to the root node and the distances from the nearest common parent place name category to the two place name categories, and then calculating the category similarity using the same-class similarity model; if the categories of the two place names fall under different subclasses, calculating the relatedness of the subclasses containing the two place name categories and then calculating the category similarity using the different-class similarity model.
  3. The method for calculating the semantic similarity of place names according to claim 2, characterized in that the category similarity model for categories under the same subclass is expressed as:
    Figure PCTCN2020085814-appb-100001
    where S_C(i,j) is the place name category similarity of place names i and j, l is the distance from the nearest common parent class of the categories of place names i and j to the root node, d_i is the distance from that nearest common parent class to the category of i, d_j is the distance from that nearest common parent class to the category of j, and α(i,j) is the sum of the distances from the nearest common parent class to the categories of i and j.
  4. The method for calculating the semantic similarity of place names according to claim 2, characterized in that the category similarity model for categories under different subclasses is expressed as:
    Figure PCTCN2020085814-appb-100002
    where S_C(i,j) is the place name category similarity of place names i and j, β' is the relatedness of the subclasses containing the categories of i and j, d'_i is the distance from the nearest common parent class of the categories of i and j to the category of i, d'_j is the distance from that nearest common parent class to the category of j, and α'(i,j) is the sum of the distances from the nearest common parent class to the categories of i and j.
  5. The method for calculating the semantic similarity of place names according to claim 1, characterized in that the place name string similarity model is expressed as:
    Figure PCTCN2020085814-appb-100003
    where A(i,j) is the place name string similarity of place names i and j, d[i,j] is the edit distance between place names i and j, ML is the larger of the string lengths of place names i and j, Len is the minimum matching length, L(i) is the string length of place name i, L(j) is the string length of place name j, and a and b are weights.
  6. The method for calculating the semantic similarity of place names according to claim 1, characterized in that the place name spatial proximity model is expressed as:
    Figure PCTCN2020085814-appb-100004
    Figure PCTCN2020085814-appb-100005
    where S_E(i,j) is the place name spatial proximity of place names i and j, and lon_i, lon_j, lat_i and lat_j are the longitudes and latitudes of place names i and j.
  7. The method for calculating the semantic similarity of place names according to claim 1, characterized in that the calculation model of the place name semantic similarity is:
    F(i,j) = A(i,j) S_E(i,j) S_C(i,j)
    where S_C(i,j) is the place name category similarity of place names i and j, A(i,j) is the place name string similarity of place names i and j, S_E(i,j) is the place name spatial proximity of place names i and j, and F(i,j) is the place name semantic similarity of place names i and j.
  8. An application of the method for calculating the semantic similarity of place names to multilingual place name data query, characterized by comprising the following steps:
    extracting the string, category, and longitude and latitude attributes of all place names from a place name information database; determining the language of each place name from the language's encoding range and normalizing the place name; and building the index by one of two methods, phonographic or ideographic, according to the characteristics of the place name language, wherein for phonographic scripts the index takes letter similarity as its basis, combines the linguistic features of total number of letters, number of letter radicals, total number of words and encoding of word-initial letters, and is organized as multidimensional feature statistics vectors; and for ideographic scripts the index takes local character similarity as its basis, combines the linguistic features of shared characters, number of characters and character positions of the place name, and is organized by single character;
    determining the string, category, and longitude and latitude attributes of the place name to be queried, and normalizing them;
    filtering all entries in the index in turn by the string, category, and longitude and latitude attributes determined for the place name to be queried, wherein for the determined string the place name string similarity model is computed, an entry meets the filter condition when the result is above the set threshold and is filtered out otherwise, and if the string is empty the entry meets the filter condition directly; for the determined category the category similarity model is computed, an entry meets the filter condition when the result is above the set threshold and is filtered out otherwise, and if the category is empty the entry meets the filter condition directly; for the determined longitude and latitude the place name spatial proximity model is computed, an entry meets the filter condition when the result is above the set threshold and is filtered out otherwise, and if the longitude and latitude are empty the entry meets the filter condition directly;
    computing, in turn, the similarity between the place name to be queried and every candidate place name using the multilingual-oriented method for calculating the semantic similarity of general place names according to any one of claims 1-7;
    sorting the results in descending order, wherein the higher a place name ranks, the more similar it is to the place name to be queried.
PCT/CN2020/085814 2020-01-19 2020-04-21 Multilingual-oriented semantic similarity calculation method for general place names, and application thereof WO2021142968A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010058317.6A CN111325235B (en) 2020-01-19 2020-01-19 Multilingual-oriented universal place name semantic similarity calculation method and application thereof
CN202010058317.6 2020-01-19

Publications (1)

Publication Number Publication Date
WO2021142968A1 true WO2021142968A1 (en) 2021-07-22

Family

ID=71170946

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085814 WO2021142968A1 (en) 2020-01-19 2020-04-21 Multilingual-oriented semantic similarity calculation method for general place names, and application thereof

Country Status (3)

Country Link
CN (1) CN111325235B (en)
AU (1) AU2020101024A4 (en)
WO (1) WO2021142968A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076734B (en) * 2021-04-15 2023-01-20 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140229501A1 (en) * 2011-10-20 2014-08-14 Deutsche Post Ag Comparing positional information
CN107239442A (en) * 2017-05-09 2017-10-10 北京京东金融科技控股有限公司 A kind of method and apparatus of calculating address similarity
CN107861947A (en) * 2017-11-07 2018-03-30 昆明理工大学 Method for Khmer named entity recognition based on cross-language resources
CN108171529A (en) * 2017-12-04 2018-06-15 昆明理工大学 A kind of address similarity estimating method
CN108572960A (en) * 2017-03-08 2018-09-25 富士通株式会社 Place name disambiguation method and place name disambiguation device
CN108804398A (en) * 2017-05-03 2018-11-13 阿里巴巴集团控股有限公司 The similarity calculating method and device of address text
CN110276021A (en) * 2019-04-29 2019-09-24 小轮(上海)网络科技有限公司 Place name matching process and device based on semantic similarity
CN110598791A (en) * 2019-09-12 2019-12-20 深圳前海微众银行股份有限公司 Address similarity evaluation method, device, equipment and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2158540A4 (en) * 2007-06-18 2010-10-20 Geographic Services Inc Geographic feature name search system
CN103605752A (en) * 2013-11-21 2014-02-26 武大吉奥信息技术有限公司 Address matching method based on semantic recognition

Also Published As

Publication number Publication date
AU2020101024A4 (en) 2020-07-23
CN111325235B (en) 2023-04-25
CN111325235A (en) 2020-06-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20913178

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20913178

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.01.2023)
