AU2020101024A4 - Multi-language oriented general method for calculating place name semanteme similarity and use thereof - Google Patents

Info

Publication number
AU2020101024A4
Authority
AU
Australia
Prior art keywords
place name
place
similarity
category
names
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020101024A
Inventor
Kehan WU
Li XUE
Peng Ye
Xueying ZHANG
Wenqiang Zhao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Normal University
Original Assignee
Nanjing Normal University
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Normal University, Nanjing Tech University
Application granted
Publication of AU2020101024A4
Ceased
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a multi-language oriented general method for calculating a place name semanteme similarity and use thereof. By analyzing the semantic features of multi-language place names such as word formation feature, affiliation, spatial position and the like, it is found that the features of a place name such as a category, a character string and a spatial position can be easily acquired and can effectively differentiate place names. Therefore, the present invention respectively constructs a place name category similarity model, a place name character string similarity model and a place name space proximity model according to these three semantic features of place names. Then, by comprehensively considering the place name category similarity, the character string similarity and the space proximity, the present invention provides a multi-language oriented general method for calculating a place name semanteme similarity. Compared with a place name similarity calculation method which only considers a place name character string or a spatial geometric feature, the method provided by the present invention can remarkably improve the calculation accuracy of the place name similarity, and can better satisfy the application requirements for multi-language place name query, matching and sharing services in the big data environment.

Description

Specification
Multi-Language Oriented General Method for Calculating Place Name Semanteme Similarity and Use Thereof
Technical Field
The present invention relates to the field of geographic information science, in particular to a multi-language oriented general method for calculating a place name semanteme similarity and use thereof in place name query in a multi-language database.
Background
A place name is a language symbol commonly agreed upon by human beings for a geographic object or a geographic phenomenon having features such as a specific location, range and morphology in a geographic environment. Semanteme is the meaning of a concept represented by data (symbols) and the relationship between such meanings. With the development of computer technology and the popularization of the mobile Internet, different countries, institutions and enterprises have established various types of place name information libraries, most of which comprise information such as place name category, longitude and latitude. However, the place name information libraries vary greatly in coverage area, data form, language type, data content and the like. Therefore, how to quickly and accurately calculate the similarity between place names in different place name information libraries has become an important topic in the study of place names.
At present, place name similarity calculation methods are mainly divided into three categories.
(1) The first is based on place name character strings, that is, the similarity between place names is calculated by comparing the place name character strings. For example, Smart et al. combined a rule model with the hidden Markov model, which can effectively solve the problems of inconsistent place name spellings, formats, character sets and the like; Zhan Binbin et al. utilized a structure rule library and a general name dictionary established on the basis of place names to determine the category of a place name, then obtained an optimum place name data matching result by means of character string similarity matching, and obtained a good verification result in the Dezhou experimental area; Ye Peng et al., with consideration to the multi-stage feature of Chinese characters, constructed a single word index for place names on the basis of a Chinese place name dictionary, and utilized mechanisms such as character filtration and similarity sequencing to realize the efficient matching of Chinese place names.
(2) The second is based on geographic elements, that is, the similarity between place names is calculated by utilizing the geometric information of the place names such as spatial positions, areas, shapes and the like. For example, Egenhofer and Clementini put forward a standard for measuring the inconsistency of a spatial geometry data structure and the inconsistency of a topological relationship in multiple representations, which can ideally determine the consistency of spatial geometry data; Van et al. utilized the K-center clustering algorithm and the naive Bayes classification method to perform a place name consistency process on photos with geographic labels.
(3) The third is a place name semanteme based similarity calculation method. For example, Chen Jiali put forward that multiple-represented spatial data may have inconsistencies in the aspects of spatial relationship, semanteme and geometry, and that these inconsistencies must therefore be evaluated and corrected. Chen Jiali introduced reality to geographic information modeling, and realized data matching with an object matching based method in combination with semanteme consistency.
The above scholars have achieved great results in the aspect of place name similarity calculation. However, the prior art still has certain problems:
(1) Algorithms such as the edit distance algorithm calculate the similarity between place names by analyzing a single feature of place names, such as the place name character string or a geometric feature, but do not consider other features of place names; therefore, the accuracy of the similarity between place names is poor in certain special cases, especially duplicate place names, place names with close spatial positions and the like.
(2) Certain algorithms are proposed for a specific language, and are not suitable for other languages.
Therefore, how to calculate the similarity between place names when the place name data sources are wide, the data structure is complex and the semantic differences are large is a difficult problem that a person skilled in the art needs to study and solve.
Summary of the Invention
Object of invention: in view of the existing status, the present invention provides a multi-language oriented general method for calculating a place name semanteme similarity, with the purpose of solving the problems that the existing place name similarity calculation methods have a low accuracy and poor generality.
Technical solution: to achieve the above object, the present invention adopts the following technical solution:
A multi-language oriented general method for calculating a place name semanteme similarity, comprising the following steps:
Determining languages of place names according to a language encoding interval, and normalizing the place names to be romanized place names according to literature information;
Acquiring category attribute information of two place names from a place name information library, and calculating a place name category similarity according to a place name classification system and a place name category similarity model;
Calculating a character string similarity between the romanized place names according to a place name character string similarity model;
Acquiring the longitudes and latitudes of the two place names from the place name information library, and calculating a place name space proximity according to a place name space proximity model; and
Determining a place name similarity according to the place name category similarity, the character string similarity and the space proximity.
As preferred, calculating a place name category similarity according to a place name classification system and a place name category similarity model comprises:
If the categories of the two place names belong to the same subcategory of the classification system, then calculating the sum of distances from common parent categories to a root node, and distances from the closest common parent category to the categories of the two place names, and utilizing a same-category similarity model to calculate an attribute similarity; and
If the categories of the two place names belong to different subcategories, then calculating a relevancy between the subcategories to which the categories of the two place names belong, and utilizing a different-category similarity model to calculate the category similarity.
As preferred, the category similarity model under the same subcategory is denoted as:
$$S_C(i,j) = \frac{l}{l + \alpha(i,j)\,d_i + (1-\alpha(i,j))\,d_j}$$
Wherein S_C(i, j) denotes the place name category similarity between the place names i and j; l denotes the distance from the closest common parent category of the categories of the place names i and j to the root node; d_i denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; d_j denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and α(i, j) denotes the sum of the distances from the closest common parent category to the categories of the place names i and j. As preferred, the category similarity model under different subcategories is denoted as:
$$S_C(i,j) = \frac{\beta'}{\beta' + \alpha'(i,j)\,d'_i + (1-\alpha'(i,j))\,d'_j}$$
Wherein S_C(i, j) denotes the place name category similarity between the place names i and j; β' denotes the
relevancy between the subcategories to which the categories of the place names i and j belong; d'_i denotes the
distance from the closest common parent category of the categories of the place names i and j to the category of
the place name i; d'_j denotes the distance from the closest common parent category of the categories of the place
names i and j to the category of the place name j; and α'(i, j) denotes the sum of the distances from the closest
common parent category to the categories of the place names i and j. As preferred, the place name character string similarity model is denoted as:
$$A(i,j) = a\left(1 - \frac{d[i,j]}{ML}\right) + b\,\frac{2\,Len}{L(i)+L(j)}$$
Wherein A(i, j) denotes the place name character string similarity between the place names i and j; d[i, j] represents an edit distance between the place names i and j; ML represents a maximum value for the character string lengths of the place names i and j; Len represents a minimum match length; L(i) represents a character string length of the place name i; L(j) represents a character string length of the place name j; a and b denote weights. As preferred, the space proximity is calculated according to the place name space proximity model. The place name space proximity model is denoted as:
$$S_E(i,j) = e^{-\arccos\left(\sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j)\right)}$$
Wherein S_E(i, j) represents the place name space proximity between the place names i and j; lon_i, lon_j, lat_i and lat_j are respectively the longitudes and latitudes of the place names i and j. As preferred, a place name semanteme similarity calculation model is:
$$F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$$
Wherein F(i, j) denotes the place name semanteme similarity between the place names i and j.
A use of the method for calculating a place name semanteme similarity in multi-language place name data query, mainly comprising the following steps:
Extracting the attributes of all the place names such as character strings, categories, and longitudes and latitudes from a place name information library; determining languages of the place names according to a language encoding interval, and normalizing the place names; dividing into phonetic and ideographic index methods on the basis of different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with language features such as the total number of letters, the number of letter radicals, the total number of words, acronyms and the like; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with the same character of the place names, the number of characters, and character position language feature;
Determining the attributes of a place name to be queried such as a character string, a category, and longitude and latitude, and normalizing the place name;
Sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried such as the character string, the category, and the longitude and latitude; specifically, using a place name character string similarity model to perform calculation on the basis of the determined place name character string, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered off, and if the character string is null, the place name directly satisfies the filter condition; using a category similarity model to perform calculation on the basis of the determined place name category, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered off, and if the category is null, the place name directly satisfies the filter condition; using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered off, and if the longitude and latitude are null, the place name directly satisfies the filter condition;
Sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity;
Sequencing the calculation results in a descending order, wherein the higher a place name is ranked, the more similar it is to the place name to be queried.
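The filtering and ranking steps of this use can be illustrated with a minimal Python sketch. It assumes that "null" refers to an attribute missing from the place name to be queried, that records are plain dictionaries with the hypothetical keys `name`, `category` and `location`, and that the three similarity models and the combined score are passed in as functions; none of these names come from the specification.

```python
def query(target, candidates, models, thresholds, combine):
    """Filter candidates attribute by attribute, then rank the survivors.

    models and thresholds map an attribute name to a similarity function and
    to its preset cut-off; combine(target, candidate) returns the final score.
    All names here are illustrative assumptions.
    """
    survivors = candidates
    for attribute in ("name", "category", "location"):
        wanted = target.get(attribute)
        if wanted is None:
            # Null query attribute: every candidate satisfies this filter.
            continue
        survivors = [c for c in survivors
                     if models[attribute](wanted, c[attribute]) > thresholds[attribute]]
    scored = [(combine(target, c), c) for c in survivors]
    # Descending order: the most similar place name comes first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Passing the models in as functions keeps the sketch independent of how each similarity is actually computed.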
Beneficial effects: the present invention respectively constructs a place name category
similarity model, a place name character string similarity model and a place name space proximity
model according to the word formation features of place names, place name categories and position
features, and provides a general method for calculating a place name semanteme similarity. The
present invention improves the edit distance algorithm, and thus can give consideration to the
influence of both a general name and a proper name. The present invention introduces a place name
category feature, and constructs a place name category similarity model according to a place name
category classification system. Furthermore, the present invention considers a place name space
feature, and constructs a place name space proximity model. Finally, the present invention
comprehensively considers the place name character string, position and category features, and
provides a general method for calculating a place name semanteme similarity. Therefore, compared
with the place name similarity calculation method which only considers a single feature, the present
invention has a high accuracy and generality.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method according to one embodiment of the present invention; and
Fig. 2 is a structural schematic view of place name categories according to one embodiment of the present invention.
Detailed Description of the Preferred Embodiments
The present invention will be described in detail hereafter in combination with specific embodiments. As shown in Fig. 1, an embodiment of the present invention discloses a multi-language oriented general method for calculating a place name semanteme similarity, mainly comprising the following steps:
Step 1, identifying the languages of place names i and j according to a place name encoding interval, and normalizing the place names i and j to romanized place names according to literature information.
Due to the effect of data acquisition means, human factors and the like, the data in different languages are quite different in data format and coding; therefore, the place names need to be preprocessed, such that information such as the corresponding categories of the place names can be found in a place name information library. In the present step, the place name encoding interval refers to the different encoding intervals corresponding to different languages, that is, the Unicode hexadecimal encoding interval of each language is unique. Therefore, the languages of the place names can be determined according to the place name encoding intervals. The romanized place names refer to the romanized forms given in the latest official gazetteers, place name dictionaries, local chronicles and the like of each country.
Step 2, acquiring the categories of the place names i and j from the place name information library, and calculating a category similarity between the place names i and j according to a place name category similarity model.
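The encoding-interval check of Step 1 can be sketched as follows. This is only an illustration: the interval table is a small, incomplete assumption, the function names are hypothetical, and a Unicode interval strictly identifies a script (for example Latin or Cyrillic), so telling apart languages that share a script would need further information.

```python
from collections import Counter

# Illustrative, deliberately incomplete table of Unicode code-point intervals.
SCRIPT_INTERVALS = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Arabic": [(0x0600, 0x06FF)],
    "CJK": [(0x4E00, 0x9FFF)],
}

def script_of(character):
    """Return the script whose interval contains the character, or None."""
    code_point = ord(character)
    for script, intervals in SCRIPT_INTERVALS.items():
        if any(low <= code_point <= high for low, high in intervals):
            return script
    return None

def detect_script(place_name):
    """Return the most frequent script among the characters of a place name."""
    counts = Counter(s for s in map(script_of, place_name) if s is not None)
    return counts.most_common(1)[0][0] if counts else "Unknown"
```

Taking the majority script makes the check tolerant of spaces, digits and punctuation inside a name.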
In the present step, the place name category similarity refers to the relevancy between the
categories of the two place names in the same classification system. The place name category refers
to the classification of data according to thematic elements. The classification system can use a
hierarchical tree structure to describe a logical relationship between categories. Place names are
classified according to a place name classification system, and the classification comparison table is
as shown in table 1.
Table 1 GeoNames and GNS element category comparison table
Category code | Category | Description
A | Country, region,... | Administrative division
H | River, lake,... | Hydrology
L | Park,... | Land utilization
P | City, countryside,... | Densely populated district
R | Road, railway,... | Traffic line
S | Building, farm,... | Residential area and auxiliary facilities
T | Mountain peak, hill,... | Land form
U | Seabed | Underwater
V | Forest, barren land,... | Vegetation
A GNIS data source directly provides full names of categories. The categories of place names contained in each major category can be summarized with reference to the above classification standards, so as to design a GNIS category and standard classification mapping table as shown in table 2. The attribute of GNIS element category code is added through the mapping relationships in the table. Table 3 shows a part of the place name classification codes table.
Table 2 GNIS category and standard classification mapping table
Category | Map to primary major category | Category | Map to primary major category
Unknown place | A | Turret | S
Civil area | A | Tunnel | S
Military area | A | Event occurrence place | S
Island | A | Cross road | S
Conservation area | A | Bridge | S
Marsh | H | Slum | S
Canal | H | Burial ground | S
Rivulet | H | Continental slope | S
Reservoir | H | Bar | S
Dam | H | Church | S
Spring | H | Corner | S
Water fall | H | Building | S
Rapid stream | H | Ridge | S
Lake | H | Airport | S
River | H | Dry valley | S
Strait | H | Arched door | S
Bay | H | Peak top | S
Beach | H | Isthmus | S
Sea | H | Embankment | S
Harbor | H | Breach | T
Green land | L | Cliff | T
Park | L | Valley | T
Population gathering point | P | Lava | T
Path | R | Mine | T
Column pier | S | Seabed terrace |
Crater | S | Livestock farm | V
Oil field | S | Wood | V
Post office | S | Forest | V
Hospital | S | Plain | V
School | S | Flat land | V
Curve | S | Basin | V
Glacier | S | |
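Adding the category code attribute through the mapping of Table 2 can be illustrated with a small lookup. The sketch below copies only a handful of rows from Table 2; the dictionary name, the record layout and the field names `category` and `category_code` are assumptions made for illustration.

```python
# A few rows copied from Table 2; a full mapping would list every GNIS category.
GNIS_TO_MAJOR_CATEGORY = {
    "Lake": "H",
    "River": "H",
    "Park": "L",
    "Bridge": "S",
    "Valley": "T",
    "Forest": "V",
}

def add_category_code(record):
    """Return a copy of a GNIS record with a 'category_code' attribute added.

    Categories that are missing from the mapping get None instead of a guess.
    """
    enriched = dict(record)
    enriched["category_code"] = GNIS_TO_MAJOR_CATEGORY.get(record.get("category"))
    return enriched
```

Returning a copy leaves the source record untouched, which is convenient when the same GNIS data feed several processing steps.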
Table 3 A part of the place name classification codes table
Major category | Subcategory
A | ADM1, ADM1H, ADM2, ADM2H, ADM3, ADM3H, ADM4, ADM4H, ADM5...
H | AIRS, ANCH, BAY, BAYS, BGHT, BNK, BNKR, BNKX, BOG, CAPG, CHN...
L | AGRC, AMUS, AREA, BSND, BSNP, BTL, CLG, CMN, CNS, COLF, CONT...
P | PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLCH, PPLF, PPLG, PPLH...
R | CSWY, OILP, PRMN, PTGE, RD, RDA, RDB, RDCUT, RDJCT, RJCT, RR...
S | ADMF, AGRF, AIRB, AIRF, AIRH, AIRP, AIRQ, AMTH, ANS, AQC, ARCH...
T | ASPH, ATOL, BAR, BCH, BCHS, BDLD, BLDR, BLHL, BLOW, BNCH...
U | APNU, ARCU, ARRU, BDLU, BKSU, BNKU, BSNU, CDAU, CNSU, CNYU...
V | BUSH, CULT, FRST, FRSTF, GRSLD, GRVC, GRVO, GRVP, GRVPN, HTH...
It is found through analysis that the category similarity in the attributes of place names can
reflect the relevancy between the categories of two pieces of data in the same classification system.
Therefore, the calculation of the relevancy between categories needs to process different types of
relationships in a classification tree, such as the relationship between parent-child nodes and the relationship between sibling nodes. To facilitate understanding, a part of the categories under the major category P are taken as an example to establish a tree diagram, as shown in Fig. 2. The place name category similarity function is denoted by S_C(i, j); when the categories of the place names i and j are under the same subcategory, S_C(i, j) is calculated as follows (for example, as shown in Fig. 2, if the categories of the place names i and j are respectively PPA1 and PPA3, then PPA1 and PPA3 both belong to the same subcategory PPA):
$$S_C(i,j) = \frac{l}{l + \alpha(i,j)\,d_i + (1-\alpha(i,j))\,d_j}$$
Wherein l denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the root node; d_i denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name i; d_j denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name j; and α(i, j) denotes the sum of the distances from the closest common parent category to the categories of the place names i and j. When the categories of the place names i and j are under different subcategories, S_C(i, j) is calculated as follows:
$$S_C(i,j) = \frac{\beta'}{\beta' + \alpha'(i,j)\,d'_i + (1-\alpha'(i,j))\,d'_j}$$
Wherein β' denotes the relevancy between the subcategories to which the categories of the place names i and j belong; its value is in the range of [0, 1] and can be given by an expert in the art according to practical use; d'_i denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name i; d'_j denotes the distance (the number of edges) from the closest common parent category of the categories of the place names i and j to the category of the place name j; and α'(i, j) denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.
Step 3, calculating a name similarity between the romanized place names i and j according to a place name character string similarity model.
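As an illustration of the tree distances used by the category similarity model of Step 2, the following Python sketch computes a same-subcategory similarity on a small classification tree. Several things are assumptions rather than part of the specification: the tree fragment (Fig. 2 is not reproduced here, so the parent links are hypothetical), the fraction form l / (l + α·d_i + (1 − α)·d_j), and the treatment of α as a caller-supplied weight with a default of 0.5.

```python
# Hypothetical fragment of a classification tree, stored as child -> parent.
PARENT = {
    "P": "ROOT",
    "PPL": "P",
    "PPLA": "PPL",
    "PPLA2": "PPLA",
    "PPLA3": "PPLA",
    "PPLC": "PPL",
}

def path_to_root(category):
    """Return the list [category, parent, ..., 'ROOT']."""
    path = [category]
    while path[-1] != "ROOT":
        path.append(PARENT[path[-1]])
    return path

def same_subcategory_similarity(category_i, category_j, alpha=0.5):
    """Assumed form: l / (l + alpha * d_i + (1 - alpha) * d_j)."""
    path_i, path_j = path_to_root(category_i), path_to_root(category_j)
    on_path_j = set(path_j)
    # Closest common parent: first node on i's path that also lies on j's path.
    common = next(node for node in path_i if node in on_path_j)
    d_i = path_i.index(common)            # edges from common parent to i
    d_j = path_j.index(common)            # edges from common parent to j
    l = len(path_to_root(common)) - 1     # edges from common parent to root
    denominator = l + alpha * d_i + (1 - alpha) * d_j
    return l / denominator if denominator else 0.0
```

With this form, two sibling categories deep in the tree score higher than two categories whose only common parent sits near the root.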
Edit distance, also known as Levenshtein distance, is a distance measurement function for measuring the similarity between two sequences. In natural language processing, edit distance is the minimum number of insertion, deletion and replacement operations required to convert an original character string into a target character string. Let S_i = s_1 s_2 ... s_i and T_j = t_1 t_2 ... t_j represent two character strings. The distance d[i, j] is the minimum number of operations for editing the character string S_i into the character string T_j; d[i, j] denotes the edit distance between the place names i and j, and can effectively reflect the character similarity between place names. The formula is as follows:
$$d[i,j] = \begin{cases} 0, & i = 0,\ j = 0 \\ \min\left\{ d[i-1,j-1] + c(s_i,t_j),\ d[i-1,j] + 1,\ d[i,j-1] + 1 \right\}, & i > 0,\ j > 0 \end{cases} \qquad c(s_i,t_j) = \begin{cases} 0, & s_i = t_j \\ 1, & s_i \neq t_j \end{cases}$$
with the boundary values d[i, 0] = i and d[0, j] = j.
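The recurrence can be written out as a short dynamic-programming routine. The following Python sketch is a standard Levenshtein implementation given for illustration; the function name is an assumption, and it keeps only two rows of the table d[i, j] at a time.

```python
def edit_distance(source, target):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(target) + 1))          # row d[0][j] = j
    for i, source_char in enumerate(source, start=1):
        current = [i]                                # column d[i][0] = i
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(previous[j - 1] + cost,   # match or replacement
                               previous[j] + 1,          # deletion
                               current[j - 1] + 1))      # insertion
        previous = current
    return previous[-1]
```

For example, `edit_distance("Gwenema", "Gwenima")` counts the single replacement of "e" by "i".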
Edit distance is a distance measurement function for measuring the similarity between two sequences, and is often used to calculate the place name character string similarity. However, the algorithm cannot effectively reduce the influence of a general name. Therefore, the algorithm is improved, and the improved model is as follows:
$$A(i,j) = a\left(1 - \frac{d[i,j]}{ML}\right) + b\,\frac{2\,Len}{L(i)+L(j)}$$
Wherein d[i, j] represents an edit distance between the place names i and j; ML represents a maximum value for the character string lengths of the place names i and j; Len represents a minimum match length (Len>1); L(i) represents a character string length of the place name i; L(j) represents a character string length of the place name j; a and b denote weights, and are respectively 0.6 and 0.4. The comparison between the name similarity calculation results of the improved model and the existing model is as shown in table 4.
Table 4 Comparison between place name character string similarity calculation results
Place name 1 | Place name 2 | Edit distance algorithm | Greedy character string matching algorithm | Improved model | Whether the same place name
Gwenema | Gwenima | 0.857 | 0.571 | 0.742 | Yes
Merendon | Merendón | 0.875 | 0.750 | 1.000 | Yes
Reputa | Wreputa | 0.714 | 0.769 | 0.883 | Yes
Stephenta | Stephen Ta | 0.800 | 0.736 | 1.000 | Yes
Wilipini | Willipinee | 0.700 | 0.555 | 0.642 | Yes
Gwaun Creek | Gunye Creek | 0.636 | 0.545 | 0.560 | No
Gbonga | Gbondoi | 0.571 | 0.615 | 0.589 | No
It can be seen from the above table that Gwaun Creek and Gunye Creek are different place
names, but the similarity calculated with the edit distance algorithm is as high as 0.636; Wilipini
and Willipinee are the same place name, but the result of the greedy character string matching
algorithm is only 0.555; Gbonga and Gbondoi are different place names, but the greedy character
string matching result is 0.615. The similarity calculated with the improved algorithm of the
present invention is therefore more consistent with the actual situation.
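The improved character string model of Step 3 can be sketched in Python as follows. The weights a = 0.6 and b = 0.4 are those stated in the specification; interpreting Len as the length of the longest common substring is an assumption made for this sketch, and the function names are illustrative.

```python
def edit_distance(source, target):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(previous[j - 1] + cost,
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]

def longest_common_substring_length(first, second):
    """Length of the longest run of characters shared by both strings."""
    best = 0
    previous = [0] * (len(second) + 1)
    for first_char in first:
        current = [0]
        for j, second_char in enumerate(second, start=1):
            run = previous[j - 1] + 1 if first_char == second_char else 0
            current.append(run)
            best = max(best, run)
        previous = current
    return best

def string_similarity(name_i, name_j, a=0.6, b=0.4):
    """A(i, j) = a * (1 - d / ML) + b * 2 * Len / (L(i) + L(j))."""
    if not name_i and not name_j:
        return 1.0  # two empty names are treated as identical (assumption)
    max_length = max(len(name_i), len(name_j))
    edit_term = 1 - edit_distance(name_i, name_j) / max_length
    # Assumption: Len is taken as the longest common substring length.
    match_length = longest_common_substring_length(name_i, name_j)
    match_term = 2 * match_length / (len(name_i) + len(name_j))
    return a * edit_term + b * match_term
```

Under this assumption the sketch gives about 0.743 for Gwenema and Gwenima, close to the 0.742 listed for the improved model in Table 4.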
Step 4, acquiring the longitudes and latitudes of the place names i and j from the place name
information library, and calculating place name space proximity according to a place name space
proximity model.
A place name, as a basic geographical element, can be a point element (for example, the place
name of a small village), a line element (for example, the place name of a highway), and can also be
a plane element (for example, the place name of an administrative district). Therefore, the
geometric similarity between place name data comprises the measurement of a point element
position similarity, the measurement of a line element similarity, and the measurement of a plane
element geometric similarity. The global place name data studied in the present invention are all
point element place names.
The position of a point element place name is generally measured by means of distance
calculation. The basic thought is: a set of feature vectors are extracted from two point element place
names respectively, and then the distance between the two sets of vectors is calculated in a certain
distance space. The smaller the distance is, the more similar the two place names are; on the
contrary, the greater the distance is, the more different the two place names would be. The distance
between two points is often replaced with the Euclidean distance.
Euclidean distance is the ordinary straight-line distance between two points in Euclidean space,
and can measure the absolute distance between points in a multi-dimensional space. The greater the
Euclidean distance between place names is, the lower the similarity between them is. Let i and j denote two place names whose longitudes and latitudes are $lon_i$, $lon_j$, $lat_i$ and $lat_j$ respectively. The Euclidean distance between the two place names is denoted as $dis_{i-j}$:
$$dis_{i-j} = \cos^{-1}\left(\sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j)\right)$$
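The distance expression above has the form of the spherical law of cosines, so its value is the central angle between the two points. A minimal sketch of that calculation follows. It assumes coordinates in decimal degrees; the function names, the clamping step, and the conversion to kilometres with a mean Earth radius are additions made here and are not taken from the specification.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption, not from the text


def central_angle(lat_i: float, lon_i: float, lat_j: float, lon_j: float) -> float:
    """Central angle in radians between two points given in decimal degrees."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    d_lon = math.radians(lon_i - lon_j)
    c = (math.sin(phi_i) * math.sin(phi_j)
         + math.cos(phi_i) * math.cos(phi_j) * math.cos(d_lon))
    # Rounding can push c slightly outside [-1, 1], where acos is undefined.
    return math.acos(max(-1.0, min(1.0, c)))


def distance_km(lat_i: float, lon_i: float, lat_j: float, lon_j: float) -> float:
    """Great-circle distance in kilometres (hypothetical helper)."""
    return EARTH_RADIUS_KM * central_angle(lat_i, lon_i, lat_j, lon_j)
```

The clamp matters for identical or nearly identical coordinates, where floating-point rounding can make the cosine argument marginally greater than 1.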
Let the place name space proximity function be $S_E(i, j)$; according to the spatial features of place name data, the present invention designs the following spatial distance similarity model:
SE(ij)=
Wherein $S_E(i, j)$ denotes the spatial range similarity between two place names; if the two are consistent, the value is 1; the farther apart the two are, the closer the value is to 0.
Step 5, calculating a place name semanteme similarity according to a place name semanteme similarity model. The place name semanteme similarity model is as follows:
$$F(i, j) = A(i, j)\,S_E(i, j)\,S_C(i, j)$$
Wherein $F(i, j)$ denotes the place name semanteme similarity; the three variables $A(i, j)$, $S_E(i, j)$ and $S_C(i, j)$ respectively denote the place name character string similarity, the place name space proximity and the place name category similarity, each normalized to the value range [0, 1].
In total, about 167 thousand pieces of place name data were acquired from the place name data sources of five countries (Honduras, Mauritius, Liberia, Mongolia and Zimbabwe) as experimental data, of which about 47.7 thousand pieces can be consistency-matched. An experiment was performed with the multi-language oriented general method for calculating a place name semanteme similarity provided by the present invention, and the results are shown in Table 5.
Table 5 Experiment result evaluation indicator statistics
| Test set | Number of place names which can be actually matched (no.) | Number of matched place names (no.) | Number of accurately matched place names (no.) | Accuracy rate (%) | Coverage rate (%) |
|---|---|---|---|---|---|
| Honduras | 17835 | 17535 | 17300 | 98.65 | 97.00 |
| Mauritius | 1130 | 1126 | 1119 | 99.37 | 99.02 |
| Liberia | 7984 | 7899 | 7870 | 99.63 | 98.57 |
| Mongolia | 12594 | 12571 | 12557 | 99.88 | 99.70 |
| Zimbabwe | 8174 | 8039 | 7997 | 99.48 | 97.83 |
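The two rates in Table 5 appear to follow from the three counts. Reading the accuracy rate as accurately matched names divided by matched names, and the coverage rate as accurately matched names divided by names that can actually be matched, reproduces every listed value to within 0.01 percentage points. This reading is inferred from the numbers and is not stated in this passage; the names `TABLE_5` and `rates` are illustrative.

```python
# Counts and rates copied from Table 5:
# (matchable, matched, accurately matched, accuracy %, coverage %)
TABLE_5 = {
    "Honduras": (17835, 17535, 17300, 98.65, 97.00),
    "Mauritius": (1130, 1126, 1119, 99.37, 99.02),
    "Liberia": (7984, 7899, 7870, 99.63, 98.57),
    "Mongolia": (12594, 12571, 12557, 99.88, 99.70),
    "Zimbabwe": (8174, 8039, 7997, 99.48, 97.83),
}


def rates(matchable: int, matched: int, accurate: int) -> tuple:
    """Return (accuracy %, coverage %) under the inferred definitions."""
    accuracy = 100.0 * accurate / matched
    coverage = 100.0 * accurate / matchable
    return accuracy, coverage


# Total number of names that can be matched across the five test sets.
total_matchable = sum(row[0] for row in TABLE_5.values())
```

The total of the first count column is 47717, consistent with the "about 47.7 thousand" matchable pieces mentioned in the text.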
The experimental results show that the multi-language oriented general method for calculating a place name semanteme similarity not only keeps the place name matching accuracy rate above 98%, but also matches more than 97% of the place name data that can actually be matched.
An embodiment of the present invention discloses a use of the method for calculating a place name semanteme similarity in multi-language place name data query, mainly comprising the following steps:
Step I, extracting the attributes of all the place names, such as character strings, categories, and longitudes and latitudes, from a place name information library; determining the languages of the place names according to a language encoding interval, and normalizing the place names; dividing the index methods into phonetic and ideographic ones on the basis of the different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with language features such as the total number of letters, the number of letter radicals, the total number of words and acronyms; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with language features such as the same characters of the place names, the number of characters and character position.
Step II, determining the whole or a part of the attributes of a place name to be queried, such as a character string, a category, and longitude and latitude, and normalizing the place name.
Step III, sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried, such as the character string, the category, and the longitude and latitude. Specifically:
using a place name character string similarity model to perform calculation on the basis of the determined place name character string; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the character string is null, the place name directly satisfies the filter condition;
using a category similarity model to perform calculation on the basis of the determined place name category; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the category is null, the place name directly satisfies the filter condition;
using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude; a place name with a calculation result higher than a preset threshold value satisfies the filter condition, otherwise the place name is filtered out; if the longitude and latitude are null, the place name directly satisfies the filter condition.
Step IV, sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity.
Step V, sorting the calculation results in descending order, wherein the higher a place
name is ranked, the more similar it is to the place name to be queried.
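Steps III to V can be sketched as a filter-then-rank routine. The sketch below is illustrative only: the `Place` record, the `passes` and `query` functions, the default thresholds of 0.5, and the choice to treat a null query attribute as a factor of 1 in the final score are all assumptions made here. The three similarity models are passed in as plain functions returning values in [0, 1], and every index entry is assumed to carry all three attributes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Place:
    name: Optional[str] = None
    category: Optional[str] = None
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude)


Sim = Callable[[object, object], float]  # similarity model returning a value in [0, 1]


def passes(query_value, candidate_value, sim: Sim, threshold: float) -> bool:
    """Step III rule: a null query attribute lets every candidate through."""
    if query_value is None:
        return True
    return sim(query_value, candidate_value) > threshold


def query(q: Place, index: List[Place], string_sim: Sim, category_sim: Sim,
          space_sim: Sim, thresholds=(0.5, 0.5, 0.5)) -> List[Tuple[float, Place]]:
    t_str, t_cat, t_loc = thresholds
    # Step III: keep only candidates that satisfy all three filter conditions.
    candidates = [p for p in index
                  if passes(q.name, p.name, string_sim, t_str)
                  and passes(q.category, p.category, category_sim, t_cat)
                  and passes(q.location, p.location, space_sim, t_loc)]

    def score(p: Place) -> float:
        # Step IV: F = A * S_E * S_C; a null query attribute contributes 1 (assumption).
        f = 1.0
        for qv, pv, sim in ((q.name, p.name, string_sim),
                            (q.location, p.location, space_sim),
                            (q.category, p.category, category_sim)):
            if qv is not None:
                f *= sim(qv, pv)
        return f

    # Step V: rank candidates by semanteme similarity, highest first.
    return sorted(((score(p), p) for p in candidates),
                  key=lambda pair: pair[0], reverse=True)
```

Passing the models as callables keeps the routine independent of how each similarity is computed, so any implementation of the three models can be plugged in.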

Claims (8)

1. A multi-language oriented general method for calculating a place name semanteme similarity, comprising the following steps:
determining languages of place names according to a language encoding interval, and normalizing the place names to be romanized place names according to literature information;
acquiring category attribute information of two place names from a place name information library, and calculating a place name category similarity according to a place name classification system and a place name category similarity model;
calculating a character string similarity between the romanized place names according to a place name character string similarity model;
acquiring the longitudes and latitudes of the two place names from the place name information library, then calculating a space proximity according to a place name space proximity model; and
determining a place name semanteme similarity according to the place name category similarity, the character string similarity and the space proximity.
2. The method for calculating a place name semanteme similarity according to claim 1, wherein calculating a place name category similarity according to a place name classification system and a place name category similarity model comprises:
if the categories of the two place names belong to the same subcategory of the place name classification system, then calculating the sum of distances from common parent categories to a root node, and distances from the closest common parent category to the categories of the two place names, and utilizing a same-category similarity model to calculate the category similarity; and
if the categories of the two place names belong to different subcategories, then calculating a relevancy between the subcategories to which the categories of the two place names belong, and utilizing a different-category similarity model to calculate the category similarity.
3. The method for calculating a place name semanteme similarity according to claim 2, wherein the category similarity model under the same subcategory is denoted as:
$$S_C(i, j) = \frac{l}{l + \alpha(i, j)\,d_i + (1 - \alpha(i, j))\,d_j}$$
wherein $S_C(i, j)$ denotes the place name category similarity between the place names i and j; $l$ denotes the distance from the closest common parent category of the categories of the place names i and j to the root node; $d_i$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d_j$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha(i, j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.
4. The method for calculating a place name semanteme similarity according to claim 2, wherein the category similarity model under different subcategories is denoted as:
$$S_C(i, j) = \frac{\beta'}{\beta' + \alpha'(i, j)\,d_i + (1 - \alpha'(i, j))\,d_j}$$
wherein $S_C(i, j)$ denotes the place name category similarity between the place names i and j; $\beta'$ denotes the
relevancy between the subcategories to which the categories of the place names i and j belong; $d_i$ denotes the
distance from the closest common parent category of the categories of the place names i and j to the category of
the place name i; $d_j$ denotes the distance from the closest common parent category of the categories of the place
names i and j to the category of the place name j; and $\alpha'(i, j)$ denotes the sum of the distances from the closest
common parent category to the categories of the place names i and j.
5. The method for calculating a place name semanteme similarity according to claim 1, wherein the place name character string similarity model is denoted as:
$$A(i, j) = a\left(1 - \frac{d[i, j]}{ML}\right) + b\,\frac{2\,Len}{L(i) + L(j)}$$
wherein $A(i, j)$ denotes the place name character string similarity between the place names i and j; $d[i, j]$ represents an edit distance between the place names i and j; $ML$ represents a maximum value for the character string lengths of the place names i and j; $Len$ represents a minimum match length; $L(i)$ represents a character string length of the place name i; $L(j)$ represents a character string length of the place name j; and $a$ and $b$ denote weights.
6. The method for calculating a place name semanteme similarity according to claim 1, wherein the place name space proximity model is denoted as:
$$d = \cos^{-1}\left(\sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j)\right)$$
SE(ij)=
wherein $S_E(i, j)$ represents the place name space proximity between the place names i and j; $lon_i$, $lon_j$, $lat_i$ and $lat_j$ are respectively the longitudes and latitudes of the place names i and j.
7. The method for calculating a place name semanteme similarity according to claim 1, wherein a place name semanteme similarity calculation model is:
$$F(i, j) = A(i, j)\,S_E(i, j)\,S_C(i, j)$$
wherein $S_C(i, j)$ denotes the place name category similarity between the place names i and j; $A(i, j)$ denotes the place name character string similarity between the place names i and j; $S_E(i, j)$ denotes the place name space proximity between the place names i and j; and $F(i, j)$ denotes the place name semanteme similarity between the place names i and j.
8. A use of the method for calculating a place name semanteme similarity in multi-language place name data query, comprising the following steps:
extracting the attributes of all the place names such as character strings, categories, and longitudes and latitudes from a place name information library; determining languages of the place names according to a language encoding interval, and normalizing the place names; dividing into phonetic and ideographic index methods on the basis of different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with the language features such as the total number of letters, the number of letter radicals, the total number of words, acronyms and the like; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with the language features such as the same character of the place names, the number of characters, character position and the like;
determining the attributes of a place name to be queried such as a character string, a category, and longitude and latitude, and normalizing the place name;
sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried such as the character string, the category, and the longitude and latitude; specifically, using a place name character string similarity model to perform calculation on the basis of the determined place name character string; the place name with a calculation result higher than a preset threshold value satisfies a filter condition, otherwise the place name would be filtered off; if the character string is null, then the place name would directly satisfy the filter condition; using a category similarity model to perform calculation on the basis of the determined place name category; the place name with a calculation result higher than a preset threshold value satisfies a filter condition, otherwise the place name would be filtered off; if the category is null, then the place name would directly satisfy the filter condition; using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude; the place name with a calculation result higher than a preset threshold value satisfies a filter condition, otherwise the place name would be filtered off; if the longitude and latitude are null, then the place name would directly satisfy the filter condition;
sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity as claimed in any one of claims 1-7; and
sequencing the calculation results in a descending order, wherein the higher a place name is ranked, the more similar to the place name to be queried.
AU2020101024A 2020-01-19 2020-04-21 Multi-language oriented general method for calculating place name semanteme similarity and use thereof Ceased AU2020101024A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010058317.6A CN111325235B (en) 2020-01-19 2020-01-19 Multilingual-oriented universal place name semantic similarity calculation method and application thereof
CN202010058317.6 2020-01-19

Publications (1)

Publication Number Publication Date
AU2020101024A4 true AU2020101024A4 (en) 2020-07-23

Family

ID=71170946

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020101024A Ceased AU2020101024A4 (en) 2020-01-19 2020-04-21 Multi-language oriented general method for calculating place name semanteme similarity and use thereof

Country Status (3)

Country Link
CN (1) CN111325235B (en)
AU (1) AU2020101024A4 (en)
WO (1) WO2021142968A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880996B (en) * 2022-03-01 2024-08-09 中国人民解放军92728部队 Mechanism name normalization method based on segmentation weighted similarity matching algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076734A (en) * 2021-04-15 2021-07-06 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts
CN113076734B (en) * 2021-04-15 2023-01-20 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts

Also Published As

Publication number Publication date
WO2021142968A1 (en) 2021-07-22
CN111325235A (en) 2020-06-23
CN111325235B (en) 2023-04-25

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry