WO2021142968A1

WO2021142968A1 - Multilingual-oriented semantic similarity calculation method for general place names, and application thereof

Info

Publication number: WO2021142968A1
Application number: PCT/CN2020/085814
Authority: WO
Inventors: 张雪英; 薛理; 叶鹏; 赵文强; 吴恪涵
Original assignee: 南京师范大学
Priority date: 2020-01-19
Filing date: 2020-04-21
Publication date: 2021-07-22
Also published as: AU2020101024A4; CN111325235B; CN111325235A

Abstract

A multilingual-oriented semantic similarity calculation method for general place names, and the application thereof. The method comprises: determining the language of a place name according to language coding intervals, and normalizing the place name to a romanized place name according to document information; acquiring, from a place name information library, category attribute information of two place names, and calculating a place name category similarity according to a place name classification system and a place name category similarity model; calculating, according to a place name character string similarity model, a character string similarity of the romanized place name; acquiring, from the place name information library, the longitude and latitude for each of the two place names, and then calculating spatial proximity according to a place name spatial proximity model; and determining a place name semantic similarity according to the place name category similarity, the character string similarity, and the spatial proximity. Compared with place name similarity calculation methods which only take place name character strings or spatial geometric features into consideration, the method can significantly improve the accuracy of place name similarity calculation, and can better satisfy application requirements, such as querying, matching and sharing services for multilingual place names, in a big data environment.

Description

Multilingual-oriented Semantic Similarity Calculation Method of General Geographical Names and Its Application

Technical field

The invention belongs to the field of geographic information science, and relates to a multilingual-oriented method for calculating the semantic similarity of general place names and its application in multilingual database place name queries.

Background art

Geographical names are linguistic signs agreed upon by humans for geographic objects and geographic phenomena that have specific locations, ranges, and morphological characteristics of the geographic environment. Semantics is the meaning of the concepts represented by data (symbols) and the relationship between these meanings. With the development of computer technology and the popularization of mobile Internet, different countries, institutions or enterprises have established various types of place-name information databases, and most of the place-name information databases contain information such as place-name categories, latitude and longitude. However, these geographical name information databases are quite different in terms of coverage, data format, language type, and data content. Therefore, how to quickly and accurately calculate the similarity of place names in different place-name information databases has become an important topic in place-name research.

At present, methods for calculating place name similarity fall into three main categories. ① The first category is based on place name strings, that is, place name similarity is calculated by comparing the strings of place names. For example, Smart et al. combined a rule model with a hidden Markov model, which can effectively solve the problem of inconsistent place name spelling, format, and character set; Zhan Binbin et al. used a generic name dictionary and a structural rule base for geographical names to determine the type of a geographical name, and then obtained the best matching result through string similarity matching, which was well verified in the experimental area of Dezhou City; Ye Peng et al. built a geographical name index based on a Chinese geographical names dictionary, taking into account the multi-level features of Chinese characters, and used mechanisms such as character filtering and similarity sorting to achieve efficient matching of Chinese geographical names. ② The second category is based on geographic elements, that is, place name similarity is calculated using geometric information such as the spatial location, area, and shape of place names. For example, Egenhofer and Clementini proposed a standard for measuring the inconsistency of spatial geometric data structures and of topological relations across multiple representations, which can judge the consistency of spatial geometric data well; Van et al. used K-medoids clustering and naive Bayesian classification to process geotagged photos so that their place names are consistent. ③ The third category is based on the semantics of geographical names. For example, Chen Jiali pointed out that multiple representations of spatial data may be inconsistent in spatial relations, semantics, and geometry, so these inconsistencies must be evaluated and corrected; ontology was introduced into geographic information modeling and combined with semantic consistency, and data matching was achieved with an object-matching-based method.

The above-mentioned scholars have achieved good results in calculating the similarity of place names. However, some problems remain: ① Algorithms such as the edit distance algorithm calculate place name similarity by analyzing a single feature of place names, such as the place name string or the geometric characteristics of the place name, and do not consider other features. As a result, the accuracy of the similarity is not ideal in some special cases, especially for duplicate place names and for place names that are spatially close to one another. ② Some algorithms are designed for specific languages and are not applicable to other languages. Therefore, how to calculate the similarity of geographical names when the data sources are wide-ranging, the data structures are complex, and the semantic differences are large is a difficult problem for those skilled in the art.

Summary of the invention

Purpose of the invention: In view of this, the present invention provides a multilingual-oriented method for calculating the semantic similarity of general place names, which aims to solve the low accuracy and weak versatility of existing place name similarity calculation methods.

Technical solution: In order to achieve the above-mentioned purpose of the invention, the present invention adopts the following technical solutions:

The method for calculating the semantic similarity of universal place names for multiple languages includes the following steps:

Determine the language of place names according to the language coding interval, and normalize place names into romanized place names based on document information;

Obtain the category attribute information of two place names from the place-name information database, and calculate the similarity of place-name categories according to the place-name classification system and place-name category similarity model;

Calculate the string similarity of place names after romanization according to the place name string similarity model;

Obtain the latitude and longitude of two place names from the place-name information database, and calculate the spatial proximity of place names according to the place-name spatial proximity model;

Determine the similarity of place names according to the similarity of place name categories, string similarity and spatial proximity;

Preferably, calculating the similarity of place-name categories according to the place-name classification system and place-name category similarity model includes:

If the two place name categories are under the same subcategory of the classification system, calculate the distance from their nearest common parent category to the root node and the sum of the distances from the nearest common parent category to the two place name categories, and then use the same-subcategory similarity model to calculate the category similarity; if the two place name categories are under different subcategories, calculate the correlation between the subcategories of the two place name categories and then use the different-subcategory similarity model to calculate the category similarity.

Preferably, the category similarity model under the same subcategory is expressed as:

Among them, $S_C(i,j)$ represents the category similarity of place names i and j, $l$ represents the distance from the nearest common parent of the categories of place names i and j to the root node, $d_i$ represents the distance from the nearest common parent of the categories of place names i and j to the category of i, $d_j$ represents the distance from the nearest common parent of the categories of place names i and j to the category of j, and $\alpha(i,j)$ represents the sum of the distances from the nearest common parent to the categories of i and j.

Preferably, the category similarity model under different subcategories is expressed as:

Among them, $S_C(i,j)$ represents the category similarity of place names i and j, $\beta'$ represents the degree of correlation between the subcategories to which the categories of i and j belong, $d'_i$ represents the distance from the nearest common parent of the categories of i and j to the category of i, $d'_j$ represents the distance from the nearest common parent of the categories of i and j to the category of j, and $\alpha'(i,j)$ represents the sum of the distances from the nearest common parent to the categories of i and j.

Preferably, the similarity model of place name strings is expressed as:

Among them, $A(i,j)$ represents the string similarity of place names i and j, $d[i,j]$ represents the edit distance of place names i and j, $ML$ represents the maximum string length of place names i and j, $Len$ represents the minimum matching length, $L(i)$ represents the string length of place name i, $L(j)$ represents the string length of place name j, and $a$ and $b$ are weights.

Preferably, the spatial proximity model of place names is used to calculate the spatial proximity. The spatial proximity model of place names is expressed as:

Among them, $S_E(i,j)$ represents the spatial proximity of place names i and j, and $\mathrm{lon}_i$, $\mathrm{lon}_j$, $\mathrm{lat}_i$ and $\mathrm{lat}_j$ are the longitudes and latitudes of place names i and j, respectively.

Preferably, the calculation model for the semantic similarity of geographical names is:

$F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$

Among them, F(i,j) represents the semantic similarity of place names i and j.

The application of the geographical name semantic similarity calculation method in multilingual geographical name data query mainly includes the following steps:

Extract the character string, category, and latitude and longitude attributes of all place names from the place name information database, determine the language of each place name according to the language coding interval, and normalize the place names. Build the index in one of two ways, phonographic or ideographic, according to the characteristics of the place name language. For phonographic scripts, letter similarity is the benchmark: combining language features such as the total number of letters, the number of letter radicals, the total number of words, and the code of the first letter of each word, a phonographic place name index is constructed using an index organization based on multi-dimensional feature statistical vectors. For ideographic scripts, local character similarity is the benchmark: combining language features such as shared characters, number of characters, and character positions in place names, an ideographic place name index is constructed using a single-character place name index organization;

Determine the character string, category, and latitude and longitude attributes of the place name to be queried, and normalize it;

Filter all entries in the index in turn according to the character string, category, and latitude and longitude attributes determined for the place name to be queried. For the character string, apply the place name string similarity model: a candidate meets the filter condition when the result is higher than the set threshold and is otherwise filtered out; if the string is empty, the filter condition is met directly. For the category, apply the category similarity model: a candidate meets the filter condition when the result is higher than the set threshold and is otherwise filtered out; if the category is empty, the filter condition is met directly. For the latitude and longitude, apply the place name spatial proximity model: a candidate meets the filter condition when the result is higher than the set threshold and is otherwise filtered out; if the latitude and longitude are empty, the filter condition is met directly;

Calculate, in turn, the semantic similarity between the place name to be queried and each candidate place name using the multilingual-oriented general place name semantic similarity calculation method;

Sort the calculation results in descending order; the higher a candidate ranks, the more similar it is to the place name to be queried.

Beneficial effects: The present invention constructs a place name category similarity model, a place name string similarity model, and a place name spatial proximity model according to the word formation characteristics, categories, and location characteristics of place names, and proposes a general method for calculating the semantic similarity of geographical names based on these three models. First, the edit distance algorithm is improved so that the influence of the generic name and the proper name can be taken into account at the same time. Second, place name category features are introduced, and a place name category similarity model is constructed based on the place name classification system. Third, the spatial characteristics of geographical names are considered to construct a spatial proximity model. Finally, a method for calculating the semantic similarity of geographical names is proposed by jointly considering the string, location, and category features of geographical names. The method therefore has higher accuracy and broader applicability than place name similarity calculation methods based on a single feature.

Description of the drawings

Fig. 1 is a flowchart of a method according to an embodiment of the present invention.

Figure 2 is a schematic diagram of the structure of place name categories in an embodiment of the present invention.

Detailed description

The present invention will be described in detail below in conjunction with specific embodiments.

As shown in FIG. 1, the multilingual-oriented method for calculating the semantic similarity of general place names disclosed in the embodiment of the present invention mainly includes the following steps:

Step 1: Identify the languages of place names i and j according to the coding interval of place names, and normalize place names i and j into romanized place names based on document information.

Due to the influence of data acquisition methods and human factors, data in different languages differ greatly in data format and coding, so place names need to be preprocessed to find the corresponding place name category and other information in the place name information database.

In this step, the geographical name coding interval refers to the different coding intervals corresponding to each language, that is, the Unicode hexadecimal coding interval of each language is unique, so the geographical name language can be determined according to the geographical name coding interval.

Romanized place names are the romanized forms of place names as given in the latest official gazetteers, place-name dictionaries, and local chronicles of each country.
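The coding-interval check in Step 1 can be sketched as follows. This is only an illustration: the range table, the labels, and the majority-vote rule are assumptions rather than part of the disclosed method, and a Unicode range strictly identifies a script, so mapping a script to one language would need further rules.

```python
from collections import Counter

# Hypothetical table of inclusive Unicode code point ranges and script labels.
# The Latin range runs from Basic Latin letters through Latin Extended-B and
# therefore also spans a few punctuation code points.
SCRIPT_RANGES = [
    (0x0041, 0x024F, "latin"),
    (0x0400, 0x04FF, "cyrillic"),
    (0x0600, 0x06FF, "arabic"),
    (0x3040, 0x30FF, "kana"),
    (0x4E00, 0x9FFF, "cjk"),
    (0xAC00, 0xD7AF, "hangul"),
]


def script_of(ch):
    """Return the script label whose range contains ch, or None."""
    cp = ord(ch)
    for lo, hi, label in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return label
    return None


def detect_script(name):
    """Pick the most frequent script among the characters of a place name."""
    counts = Counter(s for s in map(script_of, name) if s is not None)
    return counts.most_common(1)[0][0] if counts else "unknown"


if __name__ == "__main__":
    for sample in ("Tegucigalpa", "Улаанбаатар", "北京"):
        print(sample, detect_script(sample))
```

A name classified this way would then be looked up in the gazetteer sources mentioned above to obtain its romanized form; that lookup is not shown here.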

Step 2: Obtain the categories of place names i and j from the place name information database, and calculate the category similarity of place names i and j according to the place name category similarity model.

In this step, place name category similarity refers to the degree of correlation between the categories of two place name data in the same classification system. The category of place-name data refers to the classification of data according to thematic elements. The classification system can use a hierarchical tree structure to describe the logical relationship between classes. Place name categories are based on the place name classification system, and the classification comparison table is shown in Table 1.

Table 1 Comparison table of GeoNames and GNS element categories

The GNIS data source directly provides the full name of the category. You can refer to the above classification standards to summarize the geographical name element categories included in each category, and design the GNIS category and standard classification mapping table, as shown in Table 2. Through the mapping relationship in the table, add the GNIS feature category code attribute. Table 3 is a part of the geographical name category code table.

Table 2 GNIS category and standard classification mapping table

Table 3 Part of the geographical name classification code table

Analysis shows that the category similarity among place name attributes reflects the degree of correlation between the categories of two data items in the same classification system. Calculating the correlation between classes therefore has to handle different types of relationships in the classification tree, such as parent-child nodes and sibling nodes. To aid understanding, part of the categories under the major category P are drawn as a tree diagram, as shown in Figure 2. The place name category similarity function is denoted $S_C(i,j)$. When the categories of place names i and j are under the same subcategory, $S_C(i,j)$ is calculated as follows (for example, as shown in Figure 2, place names i and j belong to the categories PPA1 and PPA3 respectively, so PPA1 and PPA3 belong to the same subcategory PPA):

Where $l$ represents the distance (number of edges) from the nearest common parent of the categories of i and j to the root node; $d_i$ represents the distance (number of edges) from the nearest common parent of the categories of i and j to the category of i; $d_j$ represents the distance (number of edges) from the nearest common parent of the categories of i and j to the category of j; and $\alpha(i,j)$ represents the sum of the distances from the nearest common parent to the categories of i and j.
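The tree quantities defined above ($l$, $d_i$, $d_j$ and $\alpha(i,j)$) can be obtained by walking a parent map, as in the following sketch. The tree contents are hypothetical: only P, PPA, PPA1 and PPA3 are named in the text, while `ROOT`, `PPB` and `PPB1` are invented for illustration. The sketch stops at the distances and does not evaluate the similarity model itself.

```python
# Hypothetical fragment of a classification tree, stored as child -> parent.
PARENT = {
    "P": "ROOT",
    "PPA": "P",
    "PPB": "P",
    "PPA1": "PPA",
    "PPA3": "PPA",
    "PPB1": "PPB",
}


def path_to_root(category):
    """List the nodes from a category up to the root, inclusive."""
    path = [category]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distances(cat_i, cat_j):
    """Return the nearest common parent and the edge counts l, d_i, d_j, alpha."""
    path_i = path_to_root(cat_i)
    path_j = path_to_root(cat_j)
    steps_in_j = {node: k for k, node in enumerate(path_j)}
    for d_i, node in enumerate(path_i):
        if node in steps_in_j:
            d_j = steps_in_j[node]
            l = len(path_i) - 1 - d_i  # edges from the common parent to the root
            return {"parent": node, "l": l, "d_i": d_i, "d_j": d_j,
                    "alpha": d_i + d_j}
    raise ValueError("categories are not in the same tree")


if __name__ == "__main__":
    print(tree_distances("PPA1", "PPA3"))
    print(tree_distances("PPA1", "PPB1"))
```

In this toy tree PPA1 and PPA3 share the parent PPA, so each lies one edge from it, whereas PPA1 and PPB1 only meet at P, two edges away from each.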

When the categories of i and j are under different subcategories, $S_C(i,j)$ is calculated as follows:

Where $\beta'$ represents the degree of correlation between the subcategories to which the categories of i and j belong; its value lies in [0,1] and can be given by domain experts according to the practical application. $d'_i$ represents the distance (number of edges) from the nearest common parent of the categories of i and j to the category of i; $d'_j$ represents the distance (number of edges) from the nearest common parent of the categories of i and j to the category of j; and $\alpha'(i,j)$ represents the sum of the distances from the nearest common parent to the categories of i and j.

Step 3: Calculate the name similarity of the romanized place names i and j according to the place-name string similarity model.

Edit distance, also known as Levenshtein distance, is a distance function used to measure the similarity of two sequences. In natural language processing, the edit distance is the minimum number of insertion, deletion, and replacement operations required to convert the original string into the target string. Suppose $S_i = s_1 s_2 \ldots s_i$ and $T_j = t_1 t_2 \ldots t_j$ represent two character strings; the distance $d[i,j]$ is the minimum number of operations needed to edit the string $S_i$ into the string $T_j$. Here $d[i,j]$ denotes the edit distance of place names i and j, which effectively reflects the character-level similarity between place names. The formula is as follows:

The edit distance is commonly used to calculate the similarity of place name strings. However, the algorithm cannot effectively reduce the impact of generic names, so it has been improved. The improved model is as follows:
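For reference, the standard Levenshtein recurrence can be written as a short dynamic program. This is the textbook algorithm, not the improved model of the invention. The helper `baseline_similarity` normalizes the distance as 1 - d/ML, which is an assumption; it does, however, reproduce the 0.636 that the later comparison quotes for the edit distance algorithm on Gwaun Creek and Gunye Creek.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions and replacements."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, sc in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # delete sc
                            curr[j - 1] + 1,      # insert tc
                            prev[j - 1] + cost))  # replace, or keep on a match
        prev = curr
    return prev[-1]


def baseline_similarity(s, t):
    """Assumed normalization: 1 - d / ML, with ML the longer string length."""
    ml = max(len(s), len(t))
    return 1.0 if ml == 0 else 1.0 - edit_distance(s, t) / ml


if __name__ == "__main__":
    print(edit_distance("Gwaun Creek", "Gunye Creek"))
    print(round(baseline_similarity("Gwaun Creek", "Gunye Creek"), 3))
```

Because the shared generic name "Creek" accounts for most of both strings, the baseline score stays high even though the proper names differ, which is the weakness the improved model targets.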

Where $d[i,j]$ represents the edit distance of place names i and j, $ML$ represents the maximum string length of place names i and j, $Len$ represents the minimum matching length ($Len \geq 1$), $L(i)$ represents the string length of i, $L(j)$ represents the string length of j, and $a$ and $b$ are weights, set to 0.6 and 0.4, respectively. Table 4 compares the name similarity results of the improved model with those of existing models.

Table 4 Comparison of calculation results of place name string similarity

As can be seen from the above table, Gwaun Creek and Gunye Creek are different place names, yet the edit distance algorithm gives a similarity as high as 0.636. Wilipini and Willipee are the same place name, yet the greedy string matching algorithm gives only 0.555, while for Gbonga and Gbondoi, which are different place names, its result is 0.615. The similarity calculated by the improved algorithm of the present invention is clearly more consistent with the actual situation.

Step 4: Obtain the latitude and longitude of place names i and j from the place name information database, and calculate the spatial proximity of place names according to the place name spatial proximity model.

As a basic geographic element, a place name can be a point element (such as the name of a small village), a line element (such as the name of a highway), or a polygon element (such as the name of an administrative district). The geometric similarity of place name data therefore covers the positional similarity of point elements, the similarity of line elements, and the geometric similarity of area elements. The global place name data studied in the present invention are all place names of point elements.

The measurement of the place names of point elements usually adopts the method of calculating the distance. The basic idea is to extract a set of feature vectors from the place names of two point elements, and calculate the distance of these two sets of vectors in a certain distance space. The smaller the distance, the more similar the two place names; conversely, the larger the distance, the greater the difference between the two place names. Euclidean distance is often used to represent the distance between two points.

The Euclidean distance is the ordinary straight-line distance between two points in Euclidean space; it measures the absolute distance between points in a multidimensional space. The larger the Euclidean distance between two place names, the lower their similarity. Let i and j denote two place names whose longitudes and latitudes are $\mathrm{lon}_i$, $\mathrm{lon}_j$, $\mathrm{lat}_i$ and $\mathrm{lat}_j$. The Euclidean spatial distance between the two place names is recorded as $dis_{ij}$.
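A minimal sketch of the distance described above, treating longitude and latitude as plain planar coordinates in degrees. The function name and argument order are illustrative assumptions, and the sketch covers only the distance, not the proximity model built on it.

```python
import math


def euclidean_distance(lon_i, lat_i, lon_j, lat_j):
    """Straight-line distance between two (lon, lat) points, in degrees."""
    return math.hypot(lon_i - lon_j, lat_i - lat_j)


if __name__ == "__main__":
    print(euclidean_distance(0.0, 0.0, 3.0, 4.0))
```

Working directly in degrees ignores the curvature of the Earth, so the value is best read as a relative measure for comparing candidates rather than as a ground distance.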

Assuming that the spatial proximity function of place names is S _E (i, j), the spatial distance similarity model designed by the present invention for the spatial characteristics of place name data is as follows.

Among them, $S_E(i,j)$ represents the degree of similarity of the two place names in spatial range. If the two locations are the same, the value is 1; the farther apart the two are, the closer the value is to 0.

Step 5: Calculate the semantic similarity of geographical names according to the semantic similarity model of geographical names.

The semantic similarity model of geographical names is as follows:

$F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$

Among them, $F(i,j)$ represents the semantic similarity of geographical names, and the three variables $A(i,j)$, $S_E(i,j)$ and $S_C(i,j)$ represent, respectively, the place name string similarity, the place name spatial proximity, and the place name category similarity, each normalized to the range [0,1].
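The product form of the model can be illustrated with a few lines of code. The function name and the range check are assumptions added for the sketch; the three inputs are taken as already computed by their respective models.

```python
def semantic_similarity(a, s_e, s_c):
    """F(i,j) = A(i,j) * S_E(i,j) * S_C(i,j), each factor normalized to [0, 1]."""
    for label, value in (("A", a), ("S_E", s_e), ("S_C", s_c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must lie in [0, 1], got {value}")
    return a * s_e * s_c


if __name__ == "__main__":
    print(semantic_similarity(0.9, 0.8, 0.5))
```

Because the factors are multiplied, a low value in any one of them pulls the overall score down: two identically spelled place names in unrelated categories, or far apart, still receive a low semantic similarity.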

Five countries (Honduras, Mauritius, Liberia, Mongolia, and Zimbabwe) were taken as examples, with a total of about 167,000 geographical names from the various data sources used as experimental data, of which about 47,700 can be consistently matched. Experiments were carried out with the multilingual-oriented method for calculating the semantic similarity of general place names proposed by the present invention, and the results are shown in Table 5.

Table 5 Statistics of evaluation indicators of experimental results

The experimental results show that the multilingual-oriented method for calculating the semantic similarity of general place names maintains an accuracy rate of 98% or more while matching more than 97% of the place name data that actually correspond.

The application of the method for calculating the semantic similarity of place names disclosed in the embodiment of the present invention in multilingual place name data query mainly includes the following steps:

Step 1: Extract the character string, category, and latitude and longitude attributes of all place names from the place name information database, determine the language of each place name according to the language coding interval, and normalize the place names. Build the index in one of two ways, phonographic or ideographic, according to the characteristics of the place name language. For phonographic scripts, letter similarity is the benchmark: combining language features such as the total number of letters, the number of letter radicals, the total number of words, and the code of the first letter of each word, the phonographic place name index is constructed using an index organization based on multi-dimensional feature statistical vectors. For ideographic scripts, local character similarity is the benchmark: combining language features such as shared characters, number of characters, and character positions in place names, the ideographic place name index is constructed using a single-character place name index organization.

Step 2: Determine all or part of the attributes such as the character string, category, latitude and longitude of the place name to be queried, and perform normalization processing.

Step 3: Filter all entries in the index in turn according to the attributes determined for the place name to be queried, such as the character string, category, and latitude and longitude. For the character string, apply the place name string similarity model: a candidate meets the filter condition when the result is higher than the set threshold and is otherwise filtered out; if the string is empty, the filter condition is met directly. For the category, apply the category similarity model: a candidate meets the filter condition when the result is higher than the set threshold and is otherwise filtered out; if the category is empty, the filter condition is met directly. For the latitude and longitude, apply the place name spatial proximity model: a candidate meets the filter condition when the result is higher than the set threshold and is otherwise filtered out; if the latitude and longitude are empty, the filter condition is met directly.

Step 4: Calculate, in turn, the semantic similarity between the place name to be queried and each candidate place name using the multilingual-oriented general place name semantic similarity calculation method.

Step 5: Sort the calculation results in descending order. The higher a candidate ranks, the more similar it is to the place name to be queried.
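Steps 3 to 5 can be sketched as a filter-then-rank loop. Everything named here is hypothetical: the `Place` record, the `query` function, the default thresholds, and the three similarity callables, which stand in for the string, category, and spatial models. The sketch also assumes that an empty query attribute simply contributes no factor to the final score, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Place:
    name: Optional[str] = None
    category: Optional[str] = None
    lonlat: Optional[Tuple[float, float]] = None


def query(q, index, a_sim, c_sim, e_sim, thresholds=(0.5, 0.5, 0.5)):
    """Filter candidates attribute by attribute, then rank by the product score."""
    models = (("name", a_sim), ("category", c_sim), ("lonlat", e_sim))
    ranked = []
    for cand in index:
        score, keep = 1.0, True
        for (attr, sim), threshold in zip(models, thresholds):
            q_value = getattr(q, attr)
            if q_value is None:
                continue  # empty query attribute meets the filter directly
            s = sim(q_value, getattr(cand, attr))
            if s <= threshold:  # must be higher than the threshold to pass
                keep = False
                break
            score *= s
        if keep:
            ranked.append((score, cand))
    ranked.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return ranked


if __name__ == "__main__":
    # Toy similarity functions, for demonstration only.
    a = lambda x, y: 1.0 if x == y else (0.6 if x[0] == y[0] else 0.0)
    c = lambda x, y: 1.0 if x == y else 0.2
    e = lambda x, y: 1.0 if x == y else 0.7
    idx = [Place("Alpha", "PPA1", (0, 0)), Place("Alps", "PPA1", (1, 1)),
           Place("Beta", "PPA1", (0, 0))]
    for score, place in query(Place(name="Alpha", lonlat=(0, 0)), idx, a, c, e):
        print(round(score, 2), place.name)
```

Filtering before scoring keeps the full similarity calculation off candidates that already fail on a single attribute, which matters when the index holds many entries.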

Claims

A method for calculating the semantic similarity of general place names for multilingualism, which is characterized in that it comprises the following steps:

Determine the language of place names according to the language coding interval, and normalize place names into romanized place names based on document information;

Obtain the category attribute information of two place names from the place-name information database, and calculate the similarity of place-name categories according to the place-name classification system and place-name category similarity model;

Calculate the string similarity of place names after romanization according to the place name string similarity model;

Obtain the latitude and longitude of two place names from the place-name information database, and then calculate the spatial proximity according to the geographic-name spatial proximity model;

The semantic similarity of geographical names is determined according to the category similarity, string similarity and spatial proximity of geographical names.
The method for calculating the semantic similarity of place names according to claim 1, wherein calculating the similarity of place names according to the place name classification system and the place name category similarity model comprises:

If the categories of the two place names are under the same subcategory of the place name classification system, calculate the distance from their nearest common parent category to the root node and the sum of the distances from the nearest common parent category to the two place name categories, and then use the same-subcategory similarity model to calculate the category similarity; if the categories of the two place names are under different subcategories, calculate the correlation between the subcategories of the two place name categories and then use the different-subcategory similarity model to calculate the category similarity.
The method for calculating the semantic similarity of geographical names according to claim 2, wherein the category similarity model under the same subcategory is expressed as:

Among them, $S_C(i,j)$ represents the category similarity of place names i and j, $l$ represents the distance from the nearest common parent of the categories of place names i and j to the root node, $d_i$ represents the distance from the nearest common parent of the categories of place names i and j to the category of i, $d_j$ represents the distance from the nearest common parent of the categories of place names i and j to the category of j, and $\alpha(i,j)$ represents the sum of the distances from the nearest common parent to the categories of i and j.
The method for calculating the semantic similarity of geographical names according to claim 2, wherein the category similarity models under different subcategories are expressed as:

Among them, $S_C(i,j)$ represents the category similarity of place names i and j, $\beta'$ represents the degree of correlation between the subcategories to which the categories of i and j belong, $d'_i$ represents the distance from the nearest common parent of the categories of i and j to the category of i, $d'_j$ represents the distance from the nearest common parent of the categories of i and j to the category of j, and $\alpha'(i,j)$ represents the sum of the distances from the nearest common parent to the categories of i and j.
The method for calculating the semantic similarity of geographical names according to claim 1, wherein the string similarity model of geographical names is expressed as:

Wherein A(i,j) represents the character string similarity of place names i and j, d[i,j] represents the edit distance between the strings of place names i and j, ML represents the maximum string length of place names i and j, Len represents the minimum matching length, L(i) represents the length of the character string of place name i, L(j) represents the length of the character string of place name j, and a and b represent weights.
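As an illustration of the inputs to this model, the sketch below computes d[i,j], L(i), L(j) and ML for two romanized strings. It assumes that "edit distance" means the unit-cost Levenshtein distance, which the claim does not state explicitly; the function names are hypothetical. Len, the weights a and b, and the combining formula are not reproduced in this text and are therefore left out.

```python
def edit_distance(s, t):
    """Unit-cost Levenshtein distance between strings s and t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for row, ch_s in enumerate(s, start=1):
        curr = [row]
        for col, ch_t in enumerate(t, start=1):
            cost = 0 if ch_s == ch_t else 1
            curr.append(min(prev[col] + 1,          # delete ch_s
                            curr[col - 1] + 1,      # insert ch_t
                            prev[col - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def string_features(name_i, name_j):
    """Collect the string quantities named in the claim (Len is omitted)."""
    return {
        "d": edit_distance(name_i, name_j),
        "L_i": len(name_i),
        "L_j": len(name_j),
        "ML": max(len(name_i), len(name_j)),
    }
```

For the pair "Nanjing" and "Nanking", which differ by one substituted letter, `string_features` gives d = 1 and ML = 7.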
The method for calculating the semantic similarity of geographical names according to claim 1, wherein the spatial proximity model of geographical names is expressed as:

Wherein S_E(i,j) represents the spatial proximity of place names i and j, and lon_i, lon_j, lat_i and lat_j are the longitudes and latitudes of place names i and j.
The method for calculating the semantic similarity of geographical names according to claim 1, wherein the calculation model of the semantic similarity of geographical names is:

F(i,j) = A(i,j) · S_E(i,j) · S_C(i,j)

Wherein S_C(i,j) represents the category similarity of place names i and j, A(i,j) represents the character string similarity of place names i and j, S_E(i,j) represents the spatial proximity of place names i and j, and F(i,j) represents the semantic similarity of place names i and j.
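The combination rule is a plain product of the three components, which can be sketched in Python as follows. The function name is hypothetical, and the assumption that each component lies in the range 0 to 1 is an illustration rather than something stated in the claim.

```python
def semantic_similarity(string_sim, spatial_prox, category_sim):
    """F(i,j) = A(i,j) * S_E(i,j) * S_C(i,j)."""
    return string_sim * spatial_prox * category_sim
```

Because the components are multiplied, a value of zero in any one of them makes the overall semantic similarity zero, regardless of how high the other two are.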
The application of the method for calculating the semantic similarity of geographical names in multilingual geographical name data query is characterized in that it includes the following steps:

Extract the character strings, categories, and latitude and longitude attributes of all place names from the place name information database, determine the language of each place name according to the language coding intervals and normalize the place names, and build indexes in two ways, phonographic and ideographic, according to the characteristics of the place names. For phonographic scripts, letter similarity is taken as the benchmark and combined with linguistic features such as the total number of letters, the number of letter radicals, the total number of words, and the code of the first letter of each word, and a phonographic place name index is constructed using an index organization method based on multi-dimensional feature statistical vectors; for ideographic scripts, local character similarity is taken as the benchmark and combined with linguistic features such as the shared characters, the number of characters, and the character positions in the place names, and an ideographic place name index is constructed using an organization based on single-character place name indexes;

Determine the character string, category, and latitude and longitude attributes of the place name to be queried, and normalize them;

Filter all items in the index in sequence according to the character string, category, and latitude and longitude attributes determined for the place name to be queried. For the character string, calculate with the place name string similarity model: if the result is higher than the set threshold, the place name meets the filter condition; otherwise the place name is filtered out; if the string is empty, the place name directly meets the filter condition. For the category, calculate with the category similarity model: if the result is higher than the set threshold, the place name meets the filter condition; otherwise the place name is filtered out; if the category is empty, the place name directly meets the filter condition. For the latitude and longitude, calculate with the place name spatial proximity model: if the result is higher than the set threshold, the place name meets the filter condition; otherwise the place name is filtered out; if the latitude and longitude are empty, the place name directly meets the filter condition;

Calculate, in turn, the semantic similarity between the place name to be queried and each candidate place name using the multilingual-oriented semantic similarity calculation method for general place names according to any one of claims 1-7;

Sort the calculation results in descending order; the higher a candidate place name ranks, the more similar it is to the place name to be queried.
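The filtering, scoring, and sorting steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `PlaceName`, `component`, and `query_index` are hypothetical names; the three similarity models are passed in as callables because their formulas are not reproduced in this text; an attribute is treated as "empty" when it is missing on either the query or the candidate; and an empty attribute is assumed to contribute a neutral factor of 1.0 to the product, which the claims do not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class PlaceName:
    name: Optional[str] = None                    # normalized (romanized) string
    category: Optional[str] = None
    lonlat: Optional[Tuple[float, float]] = None  # (longitude, latitude)


def is_empty(value) -> bool:
    return value is None or value == ""


def component(model: Callable, query_value, candidate_value) -> Optional[float]:
    """Score one attribute, or return None when either side is empty."""
    if is_empty(query_value) or is_empty(candidate_value):
        return None
    return model(query_value, candidate_value)


def query_index(query: PlaceName,
                candidates: List[PlaceName],
                models: Dict[str, Callable],
                thresholds: Dict[str, float]) -> List[Tuple[float, PlaceName]]:
    """Filter candidates by per-attribute thresholds, then rank by the product."""
    ranked = []
    for cand in candidates:
        scores = {
            key: component(models[key], getattr(query, key), getattr(cand, key))
            for key in ("name", "category", "lonlat")
        }
        # A candidate is dropped when any non-empty attribute does not score
        # strictly above its threshold; empty attributes pass directly.
        if any(s is not None and s <= thresholds[key]
               for key, s in scores.items()):
            continue
        f = 1.0
        for s in scores.values():
            if s is not None:  # assumption: empty attributes are neutral
                f *= s
        ranked.append((f, cand))
    # Descending order: the most similar candidates come first.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

The sketch evaluates all three attributes before deciding whether to drop a candidate, which yields the same surviving set as filtering attribute by attribute in sequence; a real index would likely filter sequentially to avoid unnecessary calculations.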