AU2020101024A4

AU2020101024A4 - Multi-language oriented general method for calculating place name semanteme similarity and use thereof

Info

Publication number: AU2020101024A4
Application number: AU2020101024A
Authority: AU
Inventors: Kehan WU; Li XUE; Peng Ye; Xueying ZHANG; Wenqiang Zhao
Original assignee: Nanjing Normal University; Nanjing Tech University
Current assignee: Nanjing Normal University
Priority date: 2020-01-19
Filing date: 2020-04-21
Publication date: 2020-07-23
Anticipated expiration: 2028-04-21
Also published as: WO2021142968A1; CN111325235A; CN111325235B

Abstract

The present invention discloses a multi-language oriented general method for calculating a place name semanteme similarity and use thereof. By analyzing the semantic features of multi-language place names, such as word formation features, affiliation and spatial position, it is found that the category, character string and spatial position of a place name can be easily acquired and can effectively differentiate place names. Therefore, the present invention constructs a place name category similarity model, a place name character string similarity model and a place name space proximity model according to these three semantic features. Then, by comprehensively considering the place name category similarity, the character string similarity and the space proximity, the present invention provides a multi-language oriented general method for calculating a place name semanteme similarity. Compared with a place name similarity calculation method which only considers a place name character string or a spatial geometric feature, the method provided by the present invention can remarkably improve the calculation accuracy of the place name similarity, and can better satisfy the application requirements for multi-language place name query, matching and sharing services in the big data environment.

Description

Specification

Multi-Language Oriented General Method for Calculating Place Name Semanteme Similarity and Use Thereof

Technical Field

The present invention relates to the field of geographic information science, in particular to a multi-language oriented general method for calculating a place name semanteme similarity and use thereof in place name query in a multi-language database.

Background

A place name is a language symbol commonly agreed by human beings for a geographic object or a geographic phenomenon having features such as a specific location, a range and a morphology in a geographic environment. Semanteme is the meaning of a concept represented by data (symbols) and the relationship between the meanings. With the development of computer technology and the popularization of mobile Internet, different countries, institutions and enterprises have established various types of place name information libraries, most of which comprise information such as place name category, longitude and latitude. However, the place name information libraries vary greatly in coverage area, data form, language type, data content and the like. Therefore, how to quickly and accurately calculate a similarity between place names in different place name information libraries has become an important topic in the study of place names.

At present, place name similarity calculation methods are mainly divided into three categories.

(1) The first is based on place name character strings, that is, the similarity between place names is calculated by comparing the place name character strings. For example, Smart et al. combined a rule model and the hidden Markov model to effectively solve the problems of inconsistent place name spellings, formats, character sets and the like; Zhan Binbin et al. utilized a structure rule library and a general name dictionary established on the basis of place names to determine the category of a place name, then obtained an optimum place name data matching result by means of character string similarity matching, and obtained a good verification result in the Dezhou experimental area; Ye Peng et al., with consideration to the multi-stage feature of Chinese characters, constructed a single word index for place names on the basis of a Chinese place name dictionary, and utilized mechanisms such as character filtration and similarity sequencing to realize the efficient matching of Chinese place names.

(2) The second is based on geographic elements, that is, the similarity between place names is calculated by utilizing the geometric information of the place names such as spatial positions, areas and shapes. For example, Egenhofer and Clementini put forward a standard for measuring the inconsistency of a spatial geometry data structure and the inconsistency of a topological relationship in multiple representations, which can ideally determine the consistency of spatial geometry data; Van et al. utilized the K-center clustering algorithm and the naive Bayes classification method to perform a place name consistency process on photos with geographic labels.

(3) The third is a place name semanteme based similarity calculation method. For example, Chen Jiali put forward that multiple-represented spatial data may have inconsistencies in spatial relationship, semanteme and geometry, and that these inconsistencies must therefore be evaluated and corrected. Chen Jiali introduced reality to geographic information modeling, and realized data matching with an object matching based method in combination with semanteme consistency.

The above scholars have achieved great results in place name similarity calculation. However, the prior art still has certain problems: (1) algorithms such as the edit distance algorithm calculate the similarity between place names by analyzing a single feature of place names, such as the place name character string or a geometric feature, but do not consider other features of place names; therefore, the accuracy of the similarity between place names is not ideal in certain special cases, especially for duplicate place names and for place names with close spatial positions. (2) Certain algorithms are proposed for a specific language, and are not suitable for other languages.

Therefore, how to calculate the similarity between place names when the place name data sources are wide, the data structures are complex and the semantic differences are large is a difficult problem that a person skilled in the art needs to study and solve.
Summary of the Invention

Object of invention: in view of the existing status, the present invention provides a multi-language oriented general method for calculating a place name semanteme similarity, with the purpose of solving the problems that the existing place name similarity calculation methods have a low accuracy and poor generality.

Technical solution: to achieve the above object, the present invention adopts the following technical solution:

A multi-language oriented general method for calculating a place name semanteme similarity, comprising the following steps:

Determining languages of place names according to a language encoding interval, and normalizing the place names to be romanized place names according to literature information;

Acquiring category attribute information of two place names from a place name information library, and calculating a place name category similarity according to a place name classification system and a place name category similarity model;

Calculating a character string similarity between the romanized place names according to a place name character string similarity model;

Acquiring the longitudes and latitudes of the two place names from the place name information library, and calculating a place name space proximity according to a place name space proximity model; and

Determining a place name similarity according to the place name category similarity, the character string similarity and the space proximity.

As preferred, calculating a place name category similarity according to a place name classification system and a place name category similarity model comprises:

If the categories of the two place names belong to the same subcategory of the classification system, then calculating the distance from the closest common parent category to the root node and the distances from the closest common parent category to the categories of the two place names, and utilizing a same-category similarity model to calculate the category similarity; and

If the categories of the two place names belong to different subcategories, then calculating a relevancy between the subcategories to which the categories of the two place names belong, and utilizing a different-category similarity model to calculate the category similarity.

As preferred, the category similarity model under the same subcategory is denoted as:

$$S_C(i,j) = \frac{l}{l + \alpha(i,j)\,d_i + (1-\alpha(i,j))\,d_j}$$

Wherein $S_C(i,j)$ denotes the place name category similarity between the place names i and j; $l$ denotes the distance from the closest common parent category of the categories of the place names i and j to the root node; $d_i$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d_j$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.

As preferred, the category similarity model under different subcategories is denoted as:

$$S_C(i,j) = \frac{\delta'}{\delta' + \alpha'(i,j)\,d'_i + (1-\alpha'(i,j))\,d'_j}$$

Wherein $S_C(i,j)$ denotes the place name category similarity between the place names i and j; $\delta'$ denotes the relevancy between the subcategories to which the categories of the place names i and j belong; $d'_i$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d'_j$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha'(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.

As preferred, the place name character string similarity model is denoted as:

$$A(i,j) = a\left(1 - \frac{d[i,j]}{ML}\right) + b\,\frac{2\,Len}{L(i)+L(j)}$$

Wherein $A(i,j)$ denotes the place name character string similarity between the place names i and j; d[i, j] represents the edit distance between the place names i and j; ML represents the maximum of the character string lengths of the place names i and j; Len represents a minimum match length; L(i) represents the character string length of the place name i; L(j) represents the character string length of the place name j; a and b denote weights.

As preferred, the space proximity is calculated according to the place name space proximity model. The place name space proximity model is denoted as:

$$S_E(i,j) = e^{-\arccos\left(\sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j)\right)}$$

Wherein $S_E(i,j)$ represents the place name space proximity between the place names i and j; $lon_i$, $lon_j$, $lat_i$ and $lat_j$ are respectively the longitudes and latitudes of the place names i and j.

As preferred, a place name semanteme similarity calculation model is:

$$F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$$

Wherein F(i, j) denotes the place name semanteme similarity between the place names i and j.

A use of the method for calculating a place name semanteme similarity in multi-language place name data query, mainly comprising the following steps:

Extracting the attributes of all the place names, such as character strings, categories, and longitudes and latitudes, from a place name information library; determining languages of the place names according to a language encoding interval, and normalizing the place names; dividing the index methods into phonetic and ideographic ones on the basis of the different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with language features such as the total number of letters, the number of letter radicals, the total number of words and acronyms; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with language features such as the same characters of the place names, the number of characters and the character positions;

Determining the attributes of a place name to be queried, such as a character string, a category, and longitude and latitude, and normalizing the place name;

Sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried, such as the character string, the category, and the longitude and latitude; specifically, using a place name character string similarity model to perform calculation on the basis of the determined place name character string, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered off, and if the character string is null, the place name directly satisfies the filter condition; using a category similarity model to perform calculation on the basis of the determined place name category, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered off, and if the category is null, the place name directly satisfies the filter condition; using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered off, and if the longitude and latitude are null, the place name directly satisfies the filter condition;

Sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity;

Sequencing the calculation results in descending order, wherein the higher a place name is ranked, the more similar it is to the place name to be queried.

Beneficial effects: the present invention respectively constructs a place name category similarity model, a place name character string similarity model and a place name space proximity model according to the word formation features of place names, place name categories and position features, and provides a general method for calculating a place name semanteme similarity. The present invention improves the edit distance algorithm, and thus can give consideration to the influence of both a general name and a proper name. The present invention introduces a place name category feature, and constructs a place name category similarity model according to a place name category classification system. Furthermore, the present invention considers a place name space feature, and constructs a place name space proximity model. Finally, the present invention comprehensively considers the place name character string, position and category features, and provides a general method for calculating a place name semanteme similarity. Therefore, compared with the place name similarity calculation method which only considers a single feature, the present invention has a high accuracy and generality.

Brief Description of the Drawings

Fig. 1 is a flow chart of the method according to one embodiment of the present invention; and

Fig. 2 is a structural schematic view of place name categories according to one embodiment of the present invention.

Detailed Description of the Preferred Embodiments

The present invention will be described in detail hereafter in combination with specific embodiments. As shown in Fig. 1, an embodiment of the present invention discloses a multi-language oriented general method for calculating a place name semanteme similarity, mainly comprising the following steps:

Step 1, identifying languages of place names i and j according to a place name encoding interval, and normalizing the place names i and j to be romanized place names according to literature information.

Due to the effect of data acquisition means, human factors and the like, data in different languages differ considerably in data format and coding; therefore, the place names need to be preprocessed, such that information such as the corresponding categories of the place names can be found in a place name information library. In the present step, the place name encoding interval refers to the different encoding intervals corresponding to different languages, that is, the Unicode hexadecimal encoding interval of each language is unique. Therefore, the languages of the place names can be determined according to the place name encoding intervals. The romanized place names refer to the place names corresponding to the place names contained in the latest official gazetteers, place name dictionaries, local chronicles and the like of each country.

Step 2, acquiring the categories of the place names i and j from the place name information library, and calculating a category similarity between the place names i and j according to a place name category similarity model.

In the present step, the place name category similarity refers to the relevancy between the categories of the two place names in the same classification system. The place name category refers to the classification of data according to thematic elements. The classification system can use a hierarchical tree structure to describe the logical relationships between categories. Place names are classified according to a place name classification system, and the classification comparison table is as shown in table 1.

Table 1 GeoNames and GNS element category comparison table

| Category code | Category | Description |
| --- | --- | --- |
| A | Country, region, ... | Administrative division |
| H | River, lake, ... | Hydrology |
| L | Park, ... | Land utilization |
| P | City, countryside, ... | Densely populated district |
| R | Road, railway, ... | Traffic line |
| S | Building, farm, ... | Residential area and auxiliary facilities |
| T | Mountain peak, hill, ... | Land form |
| U | Seabed | Underwater |
| V | Forest, barren land, ... | Vegetation |

A GNIS data source directly provides full names of categories. The categories of place names contained in each major category can be summarized with reference to the above classification standards, so as to design a GNIS category and standard classification mapping table as shown in table 2. The attribute of GNIS element category code is added through the mapping relationships in the table. Table 3 shows a part of the place name classification code table.

Table 2 GNIS category and standard classification mapping table

| Category | Map to primary major category | Category | Map to primary major category |
| --- | --- | --- | --- |
| Unknown place | A | Turret | S |
| Civil area | A | Tunnel | S |
| Military area | A | Event occurrence place | S |
| Island | A | Cross road | S |
| Conservation area | A | Bridge | S |
| Marsh | H | Slum | S |
| Canal | H | Burial ground | S |
| Rivulet | H | Continental slope | S |
| Reservoir | H | Bar | S |
| Dam | H | Church | S |
| Spring | H | Corner | S |
| Waterfall | H | Building | S |
| Rapid stream | H | Ridge | S |
| Lake | H | Airport | S |
| River | H | Dry valley | S |
| Strait | H | Arched door | S |
| Bay | H | Peak top | S |
| Beach | H | Isthmus | S |
| Sea | H | Embankment | S |
| Harbor | H | Breach | T |
| Green land | L | Cliff | T |
| Park | L | Valley | T |
| Population gathering point | P | Lava | T |
| Path | R | Mine | T |
| Column pier | S | Seabed terrace | |
| Crater | S | Livestock farm | V |
| Oil field | S | Wood | V |
| Post office | S | Forest | V |
| Hospital | S | Plain | V |
| School | S | Flat land | V |
| Curve | S | Basin | V |
| Glacier | S | | |
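The mapping step described for table 2 (adding a category code attribute to each GNIS record) can be illustrated with a minimal Python sketch. Only a handful of rows from the table are included, and the record field names `category` and `category_code` as well as the function name are hypothetical, chosen only for illustration.

```python
# A few rows of the GNIS-to-major-category mapping from table 2 (not the full table).
GNIS_TO_MAJOR = {
    "Civil area": "A",
    "Lake": "H",
    "Park": "L",
    "Population gathering point": "P",
    "Path": "R",
    "Church": "S",
    "Valley": "T",
    "Forest": "V",
}


def add_category_code(record):
    """Return a copy of a GNIS record with a 'category_code' attribute added.

    'category' and 'category_code' are hypothetical field names; an unmapped
    category yields None rather than a guessed code.
    """
    code = GNIS_TO_MAJOR.get(record.get("category"))
    return {**record, "category_code": code}


lake = add_category_code({"name": "Example Lake", "category": "Lake"})
```

Returning a copy leaves the source record untouched, which keeps the original GNIS attributes available for later comparison.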

Table 3 A part of the place name classification code table

| Major category | Subcategory |
| --- | --- |
| A | ADM1, ADM1H, ADM2, ADM2H, ADM3, ADM3H, ADM4, ADM4H, ADM5, ... |
| H | AIRS, ANCH, BAY, BAYS, BGHT, BNK, BNKR, BNKX, BOG, CAPG, CHN, ... |
| L | AGRC, AMUS, AREA, BSND, BSNP, BTL, CLG, CMN, CNS, COLF, CONT, ... |
| P | PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLCH, PPLF, PPLG, PPLH, ... |
| R | CSWY, OILP, PRMN, PTGE, RD, RDA, RDB, RDCUT, RDJCT, RJCT, RR, ... |
| S | ADMF, AGRF, AIRB, AIRF, AIRH, AIRP, AIRQ, AMTH, ANS, AQC, ARCH, ... |
| T | ASPH, ATOL, BAR, BCH, BCHS, BDLD, BLDR, BLHL, BLOW, BNCH, ... |
| U | APNU, ARCU, ARRU, BDLU, BKSU, BNKU, BSNU, CDAU, CNSU, CNYU, ... |
| V | BUSH, CULT, FRST, FRSTF, GRSLD, GRVC, GRVO, GRVP, GRVPN, HTH, ... |

It is found through analysis that the category similarity in the attributes of place names can reflect the relevancy between the categories of two pieces of data in the same classification system. Therefore, the calculation of the relevancy between categories needs to process different types of relationships in a classification tree, such as the relationship between parent-child nodes and the relationship between sibling nodes. To facilitate understanding, a part of the categories under the major category P is taken as an example to establish a tree diagram, as shown in Fig. 2.

The place name category similarity function is denoted by $S_C(i,j)$. When the categories of the place names i and j are under the same subcategory, $S_C(i,j)$ is calculated as follows (for example, as shown in Fig. 2, if the categories of the place names i and j are respectively PPA1 and PPA3, then PPA1 and PPA3 both belong to the same subcategory PPA):

$$S_C(i,j) = \frac{l}{l + \alpha(i,j)\,d_i + (1-\alpha(i,j))\,d_j}$$

Wherein $l$ denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the root node; $d_i$ denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d_j$ denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.

When the categories of the place names i and j are under different subcategories, $S_C(i,j)$ is calculated as follows:

$$S_C(i,j) = \frac{\delta'}{\delta' + \alpha'(i,j)\,d'_i + (1-\alpha'(i,j))\,d'_j}$$

Wherein $\delta'$ denotes the relevancy between the subcategories to which the categories of the place names i and j belong; its value is in the range [0, 1] and can be given by an expert in the art according to practical use; $d'_i$ denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d'_j$ denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha'(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.

Step 3, calculating a name similarity between the romanized place names i and j according to a place name character string similarity model.
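Before the string model of Step 3 is detailed, the two category similarity models of Step 2 can be illustrated with a minimal Python sketch. It assumes that both models share the form w / (w + α·d_i + (1 − α)·d_j), with w equal to the depth l of the closest common parent for the same subcategory and equal to the expert-given relevancy otherwise. The tree fragment, the node name `ROOT`, the function names and the fixed weight `alpha = 0.5` are hypothetical choices for illustration and are not values taken from the embodiment.

```python
# Hypothetical fragment of a classification tree, stored as child -> parent.
PARENT = {"P": "ROOT", "PPA": "P", "PPL": "P", "PPA1": "PPA", "PPA3": "PPA"}


def path_to_root(node, parent):
    """Return [node, parent(node), ..., root] by following parent links."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def category_similarity(cat_i, cat_j, parent, relevancy=None, alpha=0.5):
    """Assumed model: w / (w + alpha * d_i + (1 - alpha) * d_j).

    w is the depth l of the closest common parent when relevancy is None
    (same subcategory), otherwise the expert-given relevancy in [0, 1].
    """
    path_i = path_to_root(cat_i, parent)
    path_j = path_to_root(cat_j, parent)
    on_path_j = set(path_j)
    # Closest common parent: first node on i's path that is also on j's path.
    common = next(node for node in path_i if node in on_path_j)
    d_i = path_i.index(common)  # edges from the common parent to category i
    d_j = path_j.index(common)  # edges from the common parent to category j
    if relevancy is None:
        w = len(path_to_root(common, parent)) - 1  # edges up to the root
    else:
        w = relevancy
    denominator = w + alpha * d_i + (1 - alpha) * d_j
    return 1.0 if denominator == 0 else w / denominator


same = category_similarity("PPA1", "PPA3", PARENT)  # siblings under PPA
diff = category_similarity("PPA1", "PPL", PARENT, relevancy=0.5)
```

The weight is passed in as a plain parameter because a value computed as a sum of distances could exceed 1 and push the ratio outside [0, 1]; any production use would need the weighting rule confirmed against the original specification.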

Edit distance, also known as Levenshtein distance, is a distance measurement function for measuring the similarity between two sequences. In natural language processing, edit distance is the minimum number of insertion, deletion and replacement operations required to convert an original character string into a target character string. Let $S_i = s_1 s_2 \dots s_i$ and $T_j = t_1 t_2 \dots t_j$ represent two character strings. The distance d[i, j] is the minimum number of operations for editing the character string $S_i$ into the character string $T_j$; d[i, j] denotes the edit distance between the place names i and j, and can effectively reflect the character similarity between place names. The formula is as follows:

$$d[i,j] = \begin{cases} 0, & i = 0,\ j = 0 \\ i, & i > 0,\ j = 0 \\ j, & i = 0,\ j > 0 \\ \min\left(d[i-1,j-1] + c_{ij},\ d[i-1,j] + 1,\ d[i,j-1] + 1\right), & i > 0,\ j > 0 \end{cases} \qquad c_{ij} = \begin{cases} 0, & s_i = t_j \\ 1, & s_i \neq t_j \end{cases}$$
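The edit distance defined above can be computed with a dynamic-programming table. The following Python sketch is a standard Levenshtein implementation given for illustration, not code from the embodiment; the function name is an assumption.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (unit cost per operation)."""
    m, n = len(s), len(t)
    # d[i][j] is the distance between the first i characters of s
    # and the first j characters of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + cost,  # keep or replace
                d[i - 1][j] + 1,         # delete from s
                d[i][j - 1] + 1,         # insert into s
            )
    return d[m][n]


print(edit_distance("Gwenema", "Gwenima"))  # 1: a single replacement
```

The comparison is case-sensitive, so "Reputa" and "Wreputa" differ by two operations rather than one.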

Edit distance is often used to calculate the place name character string similarity. However, the algorithm cannot effectively reduce the influence of a general name. Therefore, the algorithm is improved, and the improved model is as follows:

$$A(i,j) = a\left(1 - \frac{d[i,j]}{ML}\right) + b\,\frac{2\,Len}{L(i)+L(j)}$$

Wherein d[i, j] represents the edit distance between the place names i and j; ML represents the maximum of the character string lengths of the place names i and j; Len represents a minimum match length (Len > 1); L(i) represents the character string length of the place name i; L(j) represents the character string length of the place name j; a and b denote weights, and are respectively 0.6 and 0.4. The comparison between the name similarity calculation results of the improved model and the existing models is as shown in table 4.

Table 4 Comparison between place name character string similarity calculation results

| Place name 1 | Place name 2 | Edit distance algorithm | Greedy character string matching algorithm | Improved model | Whether the same place name |
| --- | --- | --- | --- | --- | --- |
| Gwenema | Gwenima | 0.857 | 0.571 | 0.742 | Yes |
| Merendon | Merendón | 0.875 | 0.750 | 1.000 | Yes |
| Reputa | Wreputa | 0.714 | 0.769 | 0.883 | Yes |
| Stephenta | Stephen Ta | 0.800 | 0.736 | 1.000 | Yes |
| Wilipini | Willipinee | 0.700 | 0.555 | 0.642 | Yes |
| Gwaun Creek | Gunye Creek | 0.636 | 0.545 | 0.560 | No |
| Gbonga | Gbondoi | 0.571 | 0.615 | 0.589 | No |

It can be seen from the above table that Gwaun Creek and Gunye Creek are different place names, but the similarity calculated with the edit distance algorithm is as high as 0.636; Wilipini and Willipinee are the same place name, but the similarity given by the greedy character string matching algorithm is only 0.555; Gbonga and Gbondoi are different place names, but the greedy character string matching result is 0.615. It is evident that the similarity calculated with the improved algorithm of the present invention is more consistent with the actual situation.
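The improved model, a(1 − d[i, j]/ML) + b·2Len/(L(i) + L(j)) with a = 0.6 and b = 0.4, can be sketched in Python as follows. The sketch assumes that Len is the length of the longest common substring of the two names, which is an interpretation rather than a definition stated in the text, and that a match shorter than two characters counts as no match (from Len > 1). All function names are illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance, computed with a rolling row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (cs != ct), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def longest_common_substring(s, t):
    """Length of the longest run of characters shared by s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if cs == ct else 0)
            best = max(best, cur[-1])
        prev = cur
    return best


def string_similarity(name_i, name_j, a=0.6, b=0.4):
    """Weighted mix of normalized edit similarity and a match-length ratio."""
    if not name_i or not name_j:
        return 0.0  # assumption: an empty name matches nothing
    d = edit_distance(name_i, name_j)
    ml = max(len(name_i), len(name_j))
    match_len = longest_common_substring(name_i, name_j)
    if match_len < 2:  # assumption drawn from the condition Len > 1
        match_len = 0
    return a * (1 - d / ml) + b * 2 * match_len / (len(name_i) + len(name_j))


for pair in [("Gwenema", "Gwenima"), ("Wilipini", "Willipinee"), ("Gbonga", "Gbondoi")]:
    print(pair, round(string_similarity(*pair), 4))
```

Under these assumptions the three printed values agree with the corresponding rows of table 4 to within 0.001. Rows such as Merendon and Stephenta, which table 4 lists as 1.000, presumably reflect the normalization of Step 1 and are not reproduced by this sketch.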

Step 4, acquiring the longitudes and latitudes of the place names i and j from the place name information library, and calculating a place name space proximity according to a place name space proximity model.

A place name, as a basic geographical element, can be a point element (for example, the place name of a small village), a line element (for example, the place name of a highway), or a plane element (for example, the place name of an administrative district). Therefore, the geometric similarity between place name data comprises the measurement of a point element position similarity, the measurement of a line element similarity, and the measurement of a plane element geometric similarity. The global place name data studied in the present invention are all point element place names.

The position of a point element place name is generally measured by means of distance calculation. The basic idea is: a set of feature vectors is extracted from each of two point element place names, and then the distance between the two sets of vectors is calculated in a certain distance space. The smaller the distance is, the more similar the two place names are; on the contrary, the greater the distance is, the more different the two place names are. The distance between two points is often replaced with the Euclidean distance.

Euclidean distance is the ordinary straight line distance between two points in Euclidean space, and can measure the absolute distance between points in a multi-dimensional space. The greater the distance between place names is, the lower the similarity between the described place names is. Let i and j denote two place names whose longitudes and latitudes are respectively $lon_i$, $lon_j$, $lat_i$ and $lat_j$. The distance between the two place names is denoted as $dis_{i-j}$:

$$dis_{i-j} = \arccos\left(\sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j)\right)$$

Let the place name space proximity function be $S_E(i,j)$; the present invention designs a spatial distance similarity model as follows according to the spatial feature of place name data.

$$S_E(i,j) = e^{-dis_{i-j}}$$

Wherein $S_E(i,j)$ denotes the spatial range similarity between two place names; if the two are consistent, then the value will be 1; and the farther the spatial distance between the two is, the closer to 0 the spatial range consistency becomes.

Step 5, calculating a place name semanteme similarity according to a place name semanteme similarity model. The place name semanteme similarity model is as follows:

$$F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$$

Wherein F(i, j) denotes the place name semanteme similarity; the three variables $A(i,j)$, $S_E(i,j)$ and $S_C(i,j)$ respectively denote the place name character string similarity, the place name space proximity and the place name category similarity, each normalized to the value range [0, 1].

In total, about 167 thousand pieces of place name data are acquired as experimental data from the place name data sources of five countries, namely Honduras, Mauritius, Liberia, Mongolia and Zimbabwe, of which about 47.7 thousand pieces can be matched for consistency. An experiment is performed with the multi-language oriented general method for calculating a place name semanteme similarity provided by the present invention, and the results are as shown in table 5.

Table 5 Experiment result evaluation indicator statistics

| Test set | Number of place names which can actually be matched (no.) | Number of matched place names (no.) | Number of accurately matched place names (no.) | Accuracy rate (%) | Coverage rate (%) |
| --- | --- | --- | --- | --- | --- |
| Honduras | 17835 | 17535 | 17300 | 98.65 | 97.00 |
| Mauritius | 1130 | 1126 | 1119 | 99.37 | 99.02 |
| Liberia | 7984 | 7899 | 7870 | 99.63 | 98.57 |
| Mongolia | 12594 | 12571 | 12557 | 99.88 | 99.70 |
| Zimbabwe | 8174 | 8039 | 7997 | 99.48 | 97.83 |
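Table 5 does not state how the two rates are defined. The short Python sketch below assumes that the accuracy rate is the number of accurately matched place names divided by the number of matched place names, and that the coverage rate is the number of accurately matched place names divided by the number that can actually be matched. Under that assumption the computed percentages agree with the listed ones to within about 0.01. The function name and the dictionary layout are illustrative.

```python
# (matchable, matched, accurately matched) counts copied from table 5.
COUNTS = {
    "Honduras": (17835, 17535, 17300),
    "Mauritius": (1130, 1126, 1119),
    "Liberia": (7984, 7899, 7870),
    "Mongolia": (12594, 12571, 12557),
    "Zimbabwe": (8174, 8039, 7997),
}


def rates(matchable, matched, accurate):
    """Return (accuracy %, coverage %) under the assumed definitions."""
    accuracy = 100.0 * accurate / matched
    coverage = 100.0 * accurate / matchable
    return accuracy, coverage


for test_set, counts in COUNTS.items():
    accuracy, coverage = rates(*counts)
    print(f"{test_set}: accuracy {accuracy:.2f}%, coverage {coverage:.2f}%")
```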

The experimental results show that the multi-language oriented general method for calculating a place name semanteme similarity not only can retain the place name matching accuracy rate more than 98%, but also can achieve more than 97% of actual place name data matching. An embodiment of the present invention discloses a use of the method for calculating a place name semanteme similarity in multi-language place name data query, mainly comprising the following steps: Step I, extracting the attributes of all the place names such as character strings, categories, longitude and latitude and the like from a place name information library; determining languages of the place names according to a language encoding interval, and normalizing the place names; dividing into phonetic and ideographic index methods on the basis of different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with the language features such as the total number of letters, the number of letter radicals, the total number of words, acronyms and the like; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with the language features such as the same character of the place names, the number of characters, character position and the like. Steps II. determining the whole or a part of the attributes of a place name to be queried such as a character string, a category, longitude and latitude and the like, and normalizing the place name. 
Step III, sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried, such as the character string, the category, and the longitude and latitude; specifically, using a place name character string similarity model to perform calculation on the basis of the determined place name character string; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the character string is null, then the place name directly satisfies the filter condition; using a category similarity model to perform calculation on the basis of the determined place name category; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the category is null, then the place name directly satisfies the filter condition; using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the longitude and latitude are null, then the place name directly satisfies the filter condition.

Step IV, sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity.

Step V, sorting the calculation results in descending order, wherein the higher a place name is ranked, the more similar it is to the place name to be queried.
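The filtering and ranking of Steps III to V can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Place` record, the `passes` and `query` helpers, the default thresholds, and the pluggable similarity callables are all hypothetical names introduced here for illustration, and candidate records are assumed to carry all three attributes.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

class Place(NamedTuple):
    """Hypothetical record; a None field on the query means 'not specified'."""
    name: Optional[str]
    category: Optional[str]
    lonlat: Optional[Tuple[float, float]]

def passes(query_attr, cand_attr, sim: Callable, threshold: float) -> bool:
    # Step III: a null query attribute lets every candidate through;
    # otherwise the model score must exceed the preset threshold.
    if query_attr is None:
        return True
    return sim(query_attr, cand_attr) > threshold

def query(q: Place, index: List[Place],
          string_sim: Callable, category_sim: Callable, proximity: Callable,
          overall: Callable, thresholds=(0.5, 0.5, 0.5)):
    t_str, t_cat, t_geo = thresholds  # assumed values, not from the source
    # Step III: apply the three filters in sequence.
    candidates = [
        p for p in index
        if passes(q.name, p.name, string_sim, t_str)
        and passes(q.category, p.category, category_sim, t_cat)
        and passes(q.lonlat, p.lonlat, proximity, t_geo)
    ]
    # Step IV: score every surviving candidate with the overall similarity.
    scored = [(overall(q, p), p) for p in candidates]
    # Step V: descending order, most similar place name first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Here `overall` stands in for the semanteme similarity of the method; any callable returning a number for a query and a candidate can be plugged in for experimentation.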

Claims

1. A multi-language oriented general method for calculating a place name semanteme similarity, comprising the following steps: determining languages of place names according to a language encoding interval, and normalizing the place names to be romanized place names according to literature information; acquiring category attribute information of two place names from a place name information library, and calculating a place name category similarity according to a place name classification system and a place name category similarity model; calculating a character string similarity between the romanized place names according to a place name character string similarity model; acquiring the longitudes and latitudes of the two place names from the place name information library, then calculating a space proximity according to a place name space proximity model; and determining a place name semanteme similarity according to the place name category similarity, the character string similarity and the space proximity.
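As an illustration of the first step of claim 1, the sketch below reads "language encoding interval" as a Unicode code point range, which is an assumption. The range table, the labels, and the function name `detect_script` are hypothetical; a code point range identifies a writing system rather than a language, so a real system would need further rules. Romanization is not shown.

```python
from collections import Counter

# Unicode block ranges (inclusive); the selection here is illustrative only.
SCRIPT_RANGES = [
    ("latin", 0x0041, 0x024F),      # Basic Latin letters through Latin Extended-B
    ("cyrillic", 0x0400, 0x04FF),
    ("arabic", 0x0600, 0x06FF),
    ("kana", 0x3040, 0x30FF),       # Hiragana and Katakana
    ("cjk", 0x4E00, 0x9FFF),        # CJK Unified Ideographs
    ("hangul", 0xAC00, 0xD7AF),     # Hangul Syllables
]

def detect_script(name: str) -> str:
    """Return the label of the range that contains most letters of the name."""
    counts = Counter()
    for ch in name:
        if not ch.isalpha():        # skip digits, spaces and punctuation
            continue
        cp = ord(ch)
        for label, lo, hi in SCRIPT_RANGES:
            if lo <= cp <= hi:
                counts[label] += 1
                break
    return counts.most_common(1)[0][0] if counts else "unknown"
```

Under this reading, "kana", "cjk" and "hangul" names would be routed to the ideographic handling and the other labels to the phonetic handling; that mapping is likewise an assumption.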

2. The method for calculating a place name semanteme similarity according to claim 1, wherein calculating a place name category similarity according to a place name classification system and a place name category similarity model comprises: if the categories of the two place names belong to the same subcategory of the place name classification system, then calculating the sum of distances from common parent categories to a root node, and distances from the closest common parent category to the categories of the two place names, and utilizing a same-category similarity model to calculate the category similarity; and if the categories of the two place names belong to different subcategories, then calculating a relevancy between the subcategories to which the categories of the two place names belong, and utilizing a different-category similarity model to calculate the category similarity.

3. The method for calculating a place name semanteme similarity according to claim 2, wherein the category similarity model under the same subcategory is denoted as:

$$S_C(i, j) = \frac{l}{l + \alpha(i, j)\, d_i + (1 - \alpha(i, j))\, d_j}$$

wherein $S_C(i, j)$ denotes the place name category similarity between the place names i and j; $l$ denotes the distance from the closest common parent category of the categories of the place names i and j to the root node; $d_i$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d_j$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha(i, j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.

4. The method for calculating a place name semanteme similarity according to claim 2, wherein the category similarity model under different subcategories is denoted as:

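$$S_C(i, j) = \frac{\beta'}{\beta' + \alpha'(i, j)\, d_i + (1 - \alpha'(i, j))\, d_j}$$

wherein $S_C(i, j)$ denotes the place name category similarity between the place names i and j; $\beta'$ denotes the relevancy between the subcategories to which the categories of the place names i and j belong; $d_i$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d_j$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha'(i, j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.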

5. The method for calculating a place name semanteme similarity according to claim 1, wherein the place name character string similarity model is denoted as:

$$A(i, j) = a\left(1 - \frac{d[i, j]}{ML}\right) + b\,\frac{2\,Len}{L(i) + L(j)}$$

wherein A(i, j) denotes the place name character string similarity between the place names i and j; d[i, j] represents an edit distance between the place names i and j; ML represents a maximum value for the character string lengths of the place names i and j; Len represents a minimum match length; L(i) represents a character string length of the place name i; L(j) represents a character string length of the place name j; and a and b denote weights.
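A minimal Python sketch of a character string similarity of this kind follows, reading the model as A(i, j) = a(1 - d[i, j]/ML) + b * 2 * Len/(L(i) + L(j)). Several points are assumptions rather than statements of the claimed method: the edit distance is taken to be the Levenshtein distance, Len is interpreted as the length of the longest common substring, the weights default to a = b = 0.5, and all function names are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def longest_common_substring(s: str, t: str) -> int:
    """Length of the longest contiguous run shared by s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if cs == ct else 0)
            best = max(best, cur[j])
        prev = cur
    return best

def string_similarity(s: str, t: str, a: float = 0.5, b: float = 0.5) -> float:
    """Weighted sum of an edit-distance term and a match-length term."""
    ml = max(len(s), len(t))
    if ml == 0:
        raise ValueError("at least one name must be non-empty")
    edit_term = 1 - edit_distance(s, t) / ml
    match_term = 2 * longest_common_substring(s, t) / (len(s) + len(t))
    return a * edit_term + b * match_term
```

With these assumptions, identical romanized names score a + b, and names sharing no characters score 0.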

6. The method for calculating a place name semanteme similarity according to claim 1, wherein the place name space proximity model is denoted as:

$$d = \cos^{-1}\left(\sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j)\right)$$

SE(ij)=

wherein $S_E(i, j)$ represents the place name space proximity between the place names i and j; $lon_i$, $lon_j$, $lat_i$ and $lat_j$ are respectively the longitudes and latitudes of the place names i and j.
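The quantity d in this model has the form of the spherical law of cosines for the central angle between two points. A minimal Python sketch of that term alone is given below; the mapping from d to the proximity itself is not legible in the text above and is therefore not sketched. The function name `central_angle`, the use of degrees for input, and the clamping step are assumptions made for illustration.

```python
import math

def central_angle(lon_i: float, lat_i: float,
                  lon_j: float, lat_j: float) -> float:
    """Central angle in radians between two points given in degrees."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    dlon = math.radians(lon_i - lon_j)
    c = (math.sin(phi_i) * math.sin(phi_j)
         + math.cos(phi_i) * math.cos(phi_j) * math.cos(dlon))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, c)))
```

Multiplying the returned angle by a mean Earth radius of about 6371 km gives an approximate great-circle distance, should a distance rather than an angle be needed.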

7. The method for calculating a place name semanteme similarity according to claim 1, wherein a place name semanteme similarity calculation model is:

$$F(i, j) = A(i, j)\, S_E(i, j)\, S_C(i, j)$$

wherein $S_C(i, j)$ denotes the place name category similarity between the place names i and j; A(i, j) denotes the place name character string similarity between the place names i and j; $S_E(i, j)$ denotes the place name space proximity between the place names i and j; and F(i, j) denotes the place name semanteme similarity between the place names i and j.

8. A use of the method for calculating a place name semanteme similarity in multi-language place name data query, comprising the following steps:

extracting the attributes of all the place names, such as character strings, categories, and longitudes and latitudes, from a place name information library; determining languages of the place names according to a language encoding interval, and normalizing the place names; dividing the indexing into phonetic and ideographic index methods on the basis of different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with language features such as the total number of letters, the number of letter radicals, the total number of words, and acronyms; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with language features such as the characters shared by the place names, the number of characters, and character position;

determining the attributes of a place name to be queried, such as a character string, a category, and longitude and latitude, and normalizing the place name;

sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried, such as the character string, the category, and the longitude and latitude; specifically, using a place name character string similarity model to perform calculation on the basis of the determined place name character string; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the character string is null, then the place name directly satisfies the filter condition; using a category similarity model to perform calculation on the basis of the determined place name category; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the category is null, then the place name directly satisfies the filter condition; using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the longitude and latitude are null, then the place name directly satisfies the filter condition;

sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity as claimed in any one of claims 1-7; and

sorting the calculation results in descending order, wherein the higher a place name is ranked, the more similar it is to the place name to be queried.