AU2020101024A4 - Multi-language oriented general method for calculating place name semanteme similarity and use thereof
- Publication number
- AU2020101024A4
- Authority
- AU
- Australia
- Prior art keywords
- place name
- place
- similarity
- category
- names
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Remote Sensing (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention discloses a multi-language oriented general method for calculating a
place name semanteme similarity and use thereof. By analyzing the semantic features of
multi-language place names, such as word formation, affiliation and spatial position,
it is found that three features of a place name (its category, its character string and its
spatial position) can be easily acquired and can effectively differentiate place names. Therefore,
the present invention respectively constructs a place name category similarity model, a place name
character string similarity model and a place name space proximity model according to these three
semantic features. Then, by comprehensively considering the place name category similarity, the
character string similarity and the space proximity, the present invention provides a multi-language
oriented general method for calculating a place name semanteme similarity. Compared with a place
name similarity calculation method that considers only the place name character string or a
spatial geometric feature, the method provided by the present invention can remarkably improve the
calculation accuracy of the place name similarity, and can better satisfy the application
requirements of multi-language place name query, matching and sharing services in a big data
environment.
Description
Specification
Multi-Language Oriented General Method for Calculating Place Name Semanteme Similarity and Use Thereof
Technical Field
The present invention relates to the field of geographic information science, in particular to a multi-language oriented general method for calculating a place name semanteme similarity and use thereof in place name query in a multi-language database.
Background
A place name is a language symbol commonly agreed upon by human beings for a geographic object or a geographic phenomenon that has features such as a specific location, a range and a morphology in a geographic environment. Semanteme is the meaning of a concept represented by data (symbols) and the relationships between such meanings. With the development of computer technology and the popularization of the mobile Internet, different countries, institutions and enterprises have established various types of place name information libraries, and most of these libraries contain information such as the place name category and the longitude and latitude. However, the place name information libraries vary greatly in coverage area, data form, language type, data content and the like. Therefore, how to quickly and accurately calculate the similarity between place names in different place name information libraries has become an important topic in the study of place names.
At present, place name similarity calculation methods are mainly divided into three categories.
(1) The first category is based on place name character strings, that is, the similarity between place names is calculated by comparing the place name character strings. For example, Smart et al. combined a rule model with the hidden Markov model to effectively handle inconsistencies in place name spellings, formats, character sets and the like; Zhan Binbin et al. utilized a structure rule library and a general name dictionary established on the basis of place names to determine the category of a place name, then obtained an optimum place name data matching result by means of character string similarity matching, and obtained a good verification result in the Dezhou experimental area; Ye Peng et al., with consideration to the multi-stage feature of Chinese characters, constructed a single word index for place names on the basis of a Chinese place name dictionary, and utilized mechanisms such as character filtration and similarity sequencing to realize efficient matching of Chinese place names.
(2) The second category is based on geographic elements, that is, the similarity between place names is calculated by utilizing the geometric information of the place names such as spatial positions, areas and shapes. For example, Egenhofer and Clementini put forward a standard for measuring the inconsistency of a spatial geometry data structure and the inconsistency of a topological relationship in multiple representations, which can determine the consistency of spatial geometry data; Van et al. utilized the K-center clustering algorithm and the naive Bayes classification method to perform a place name consistency process on photos with geographic labels.
(3) The third category is place name semanteme based similarity calculation. For example, Chen Jiali put forward that multiple-represented spatial data may have inconsistencies in spatial relationship, semanteme and geometry, and that these inconsistencies must therefore be evaluated and corrected. Chen Jiali introduced reality into geographic information modeling, and realized data matching with an object matching based method in combination with semanteme consistency.
The above scholars have achieved great results in place name similarity calculation. However, the prior art still has certain problems: (1) algorithms such as the edit distance algorithm calculate the similarity between place names by analyzing a single feature, such as the place name character string or a geometric feature, and do not consider other features of place names; therefore, the accuracy of the calculated similarity is unsatisfactory in certain special cases, especially duplicate place names and place names with close spatial positions. (2) Certain algorithms are proposed for a specific language and are not suitable for other languages. Therefore, how to calculate the similarity between place names when the place name data sources are wide, the data structures are complex and the semantic differences are large is a difficult problem that a person skilled in the art needs to study and solve.
Summary of the Invention
Object of invention: in view of the existing status, the present invention provides a multi-language oriented general method for calculating a place name semanteme similarity, with the purpose of solving the problems that the existing place name similarity calculation methods have low accuracy and poor generality.
Technical solution: to achieve the above object, the present invention adopts the following technical solution:
A multi-language oriented general method for calculating a place name semanteme similarity, comprising the following steps:
Determining the languages of place names according to a language encoding interval, and normalizing the place names to romanized place names according to literature information;
Acquiring category attribute information of two place names from a place name information library, and calculating a place name category similarity according to a place name classification system and a place name category similarity model;
Calculating a character string similarity between the romanized place names according to a place name character string similarity model;
Acquiring the longitudes and latitudes of the two place names from the place name information library, and calculating a place name space proximity according to a place name space proximity model; and
Determining a place name similarity according to the place name category similarity, the character string similarity and the space proximity.
As preferred, calculating a place name category similarity according to a place name classification system and a place name category similarity model comprises:
If the categories of the two place names belong to the same subcategory of the classification system, then calculating the distance from the closest common parent category to the root node and the distances from the closest common parent category to the categories of the two place names, and utilizing a same-category similarity model to calculate the category similarity; and
If the categories of the two place names belong to different subcategories, then calculating a relevancy between the subcategories to which the categories of the two place names belong, and utilizing a different-category similarity model to calculate the category similarity.
As preferred, the category similarity model under the same subcategory is denoted as:
$$S_C(i,j) = \frac{l}{l + \alpha(i,j)\,d_i + (1-\alpha(i,j))\,d_j}$$
Wherein $S_C(i,j)$ denotes the place name category similarity between the place names i and j; $l$ denotes the distance from the closest common parent category of the categories of the place names i and j to the root node; $d_i$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d_j$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.
As preferred, the category similarity model under different subcategories is denoted as:
$$S_C(i,j) = \frac{\beta'}{\beta' + \alpha'(i,j)\,d'_i + (1-\alpha'(i,j))\,d'_j}$$
Wherein $S_C(i,j)$ denotes the place name category similarity between the place names i and j; $\beta'$ denotes the relevancy between the subcategories to which the categories of the place names i and j belong; $d'_i$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d'_j$ denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha'(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.
As preferred, the place name character string similarity model is denoted as:
$$A(i,j) = a\left(1 - \frac{d[i,j]}{ML}\right) + \frac{2b\,Len}{L(i)+L(j)}$$
Wherein $A(i,j)$ denotes the place name character string similarity between the place names i and j; $d[i,j]$ represents the edit distance between the place names i and j; $ML$ represents the maximum of the character string lengths of the place names i and j; $Len$ represents a minimum match length; $L(i)$ represents the character string length of the place name i; $L(j)$ represents the character string length of the place name j; and $a$ and $b$ denote weights.
As preferred, the space proximity is calculated according to the place name space proximity model. The place name space proximity model is denoted as:
$$S_E(i,j) = e^{-\arccos\left(\sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j)\right)}$$
Wherein $S_E(i,j)$ represents the place name space proximity between the place names i and j; $lon_i$, $lon_j$, $lat_i$ and $lat_j$ are respectively the longitudes and latitudes of the place names i and j.
As preferred, a place name semanteme similarity calculation model is:
$$F(i,j) = A(i,j)\,S_E(i,j)\,S_C(i,j)$$
Wherein $F(i,j)$ denotes the place name semanteme similarity between the place names i and j.
A use of the method for calculating a place name semanteme similarity in multi-language place name data query, mainly comprising the following steps:
Extracting the attributes of all the place names, such as character strings, categories, and longitudes and latitudes, from a place name information library; determining the languages of the place names according to a language encoding interval, and normalizing the place names; dividing the index methods into phonetic and ideographic index methods on the basis of the different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with language features such as the total number of letters, the number of letter radicals, the total number of words and acronyms; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with language features such as the characters shared by the place names, the number of characters and the character positions;
Determining the attributes of a place name to be queried, such as a character string, a category, and longitude and latitude, and normalizing the place name;
Sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried. Specifically: using the place name character string similarity model to perform calculation on the basis of the determined place name character string, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered off, and if the character string is null, the place name directly satisfies the filter condition; using the category similarity model to perform calculation on the basis of the determined place name category, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered off, and if the category is null, the place name directly satisfies the filter condition; using the place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered off, and if the longitude and latitude are null, the place name directly satisfies the filter condition;
Sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity;
Sequencing the calculation results in descending order, wherein the higher a place name is ranked, the more similar it is to the place name to be queried.
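The filter-then-rank query procedure described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `query_place_names`, the record keys `name`, `category` and `location`, and the threshold value 0.5 are hypothetical, the three similarity models are passed in as plain functions, and the index lookup is replaced by a simple list. Treating a null query attribute as a neutral factor of 1.0 in the final product is also an assumption, since the text only states that such a place name passes the filter.

```python
def query_place_names(candidates, query, string_sim, category_sim, space_prox,
                      thresholds=(0.5, 0.5, 0.5)):
    """Filter candidates attribute by attribute, then rank by F = A * SE * SC."""
    # Each model is paired with the (hypothetical) record key it operates on.
    models = (("name", string_sim),
              ("category", category_sim),
              ("location", space_prox))
    ranked = []
    for cand in candidates:
        scores = []
        for (attr, model), threshold in zip(models, thresholds):
            if query.get(attr) is None:
                # Null query attribute: the candidate passes this filter.
                # Using 1.0 as a neutral factor is an assumption.
                scores.append(1.0)
                continue
            score = model(query[attr], cand[attr])
            if score <= threshold:
                break  # not higher than the threshold: filtered off
            scores.append(score)
        else:
            # All three filters passed: combine the scores as a product.
            ranked.append((scores[0] * scores[1] * scores[2], cand))
    # Descending order: the higher the rank, the more similar the place name.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

Passing the models as arguments keeps the sketch independent of how each similarity is computed, so any of the three models can be replaced without touching the filtering logic.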
Beneficial effects: the present invention respectively constructs a place name category
similarity model, a place name character string similarity model and a place name space proximity
model according to the word formation features of place names, place name categories and position
features, and provides a general method for calculating a place name semanteme similarity. The
present invention improves the edit distance algorithm so that the influence of both the general
name and the proper name is taken into account. The present invention introduces a place name
category feature, and constructs a place name category similarity model according to a place name
category classification system. Furthermore, the present invention considers a place name space
feature, and constructs a place name space proximity model. Finally, the present invention
comprehensively considers the place name character string, position and category features, and
provides a general method for calculating a place name semanteme similarity. Therefore, compared
with a place name similarity calculation method which only considers a single feature, the present
invention achieves higher accuracy and generality.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method according to one embodiment of the present invention; and
Fig. 2 is a structural schematic view of place name categories according to one embodiment of the present invention.
Detailed Description of the Preferred Embodiments
The present invention will be described in detail hereafter in combination with specific
embodiments. As shown in Fig. 1, an embodiment of the present invention discloses a multi-language oriented general method for calculating a place name semanteme similarity, mainly comprising the following steps:
Step 1, identifying the languages of place names i and j according to a place name encoding interval, and normalizing the place names i and j to romanized place names according to literature information.
Due to the influence of data acquisition means, human factors and the like, data in different languages differ considerably in data format and coding; therefore, the place names need to be preprocessed, such that information such as the corresponding categories of the place names can be found in a place name information library. In the present step, the place name encoding interval refers to the different encoding intervals corresponding to different languages, that is, the Unicode hexadecimal encoding interval of each language is unique. Therefore, the languages of the place names can be determined according to the place name encoding intervals. The romanized place names refer to the names corresponding to the place names contained in the latest official gazetteers, place name dictionaries, local chronicles and the like of each country.
Step 2, acquiring the categories of the place names i and j from the place name information library, and calculating a category similarity between the place names i and j according to a place name category similarity model.
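The encoding-interval lookup of Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `detect_script` and the table `SCRIPT_INTERVALS` are hypothetical, the intervals are standard Unicode block ranges chosen for the example rather than the intervals used by the invention, and a Unicode interval strictly identifies a writing system, so mapping it to a single language is a further assumption that the sketch does not attempt.

```python
from collections import Counter

# Illustrative Unicode intervals (standard block ranges); not an exhaustive table.
SCRIPT_INTERVALS = [
    ("Latin", 0x0041, 0x024F),     # Basic Latin letters through Latin Extended-B
    ("Greek", 0x0370, 0x03FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Arabic", 0x0600, 0x06FF),
    ("Thai", 0x0E00, 0x0E7F),
    ("Kana", 0x3040, 0x30FF),      # Hiragana and Katakana
    ("CJK", 0x4E00, 0x9FFF),       # CJK Unified Ideographs
    ("Hangul", 0xAC00, 0xD7AF),    # Hangul Syllables
]


def detect_script(name):
    """Return the script whose interval contains most letters of `name`."""
    counts = Counter()
    for ch in name:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        code_point = ord(ch)
        for script, low, high in SCRIPT_INTERVALS:
            if low <= code_point <= high:
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else "Unknown"
```

A majority vote over the letters is used so that a name containing a few characters from another interval is still assigned to its dominant script.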
In Step 2, the place name category similarity refers to the relevancy between the
categories of the two place names in the same classification system. The place name category refers
to the classification of data according to thematic elements. The classification system can use a
hierarchical tree structure to describe the logical relationships between categories. Place names are
classified according to a place name classification system, and the classification comparison table is
shown in table 1.
Table 1 GeoNames and GNS element category comparison table
Category code | Category | Description
A | Country, region, ... | Administrative division
H | River, lake, ... | Hydrology
L | Park, ... | Land utilization
P | City, countryside, ... | Densely populated district
R | Road, railway, ... | Traffic line
S | Building, farm, ... | Residential area and auxiliary facilities
T | Mountain peak, hill, ... | Land form
U | Seabed | Underwater
V | Forest, barren land, ... | Vegetation
A GNIS data source directly provides the full names of categories. The categories of place names contained in each major category can be summarized with reference to the above classification standards, so as to design a GNIS category and standard classification mapping table as shown in table 2. The GNIS element category code attribute is added through the mapping relationships in the table. Table 3 shows a part of the place name classification code table.
Table 2 GNIS category and standard classification mapping table
Category | Map to primary major category | Category | Map to primary major category
Unknown place | A | Turret | S
Civil area | A | Tunnel | S
Military area | A | Event occurrence place | S
Island | A | Cross road | S
Conservation area | A | Bridge | S
Marsh | H | Slum | S
Canal | H | Burial ground | S
Rivulet | H | Continental slope | S
Reservoir | H | Bar | S
Dam | H | Church | S
Spring | H | Corner | S
Water fall | H | Building | S
Rapid stream | H | Ridge | S
Lake | H | Airport | S
River | H | Dry valley | S
Strait | H | Arched door | S
Bay | H | Peak top | S
Beach | H | Isthmus | S
Sea | H | Embankment | S
Harbor | H | Breach | T
Green land | L | Cliff | T
Park | L | Valley | T
Population gathering point | P | Lava | T
Path | R | Mine | T
Column pier | S | Seabed terrace |
Crater | S | Livestock farm | V
Oil field | S | Wood | V
Post office | S | Forest | V
Hospital | S | Plain | V
School | S | Flat land | V
Curve | S | Basin | V
Glacier | S | |
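Adding the category code attribute through the mapping of table 2 amounts to a dictionary lookup, as in the following minimal sketch. Only a few entries of table 2 are reproduced, and the names `GNIS_TO_MAJOR`, `add_category_code` and the record keys `category` and `category_code` are hypothetical.

```python
# A few rows of table 2 (GNIS category full name -> major category code).
GNIS_TO_MAJOR = {
    "Island": "A",
    "Lake": "H",
    "Park": "L",
    "Path": "R",
    "Bridge": "S",
    "Cliff": "T",
    "Forest": "V",
}


def add_category_code(record):
    """Return a copy of `record` with the mapped category code attached.

    Categories missing from the mapping get None, so they can be reviewed
    instead of being silently assigned to a major category.
    """
    code = GNIS_TO_MAJOR.get(record["category"])
    return {**record, "category_code": code}
```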
Table 3 Part of the place name classification code table
Major category | Subcategory
A | ADM1, ADM1H, ADM2, ADM2H, ADM3, ADM3H, ADM4, ADM4H, ADM5, ...
H | AIRS, ANCH, BAY, BAYS, BGHT, BNK, BNKR, BNKX, BOG, CAPG, CHN, ...
L | AGRC, AMUS, AREA, BSND, BSNP, BTL, CLG, CMN, CNS, COLF, CONT, ...
P | PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLCH, PPLF, PPLG, PPLH, ...
R | CSWY, OILP, PRMN, PTGE, RD, RDA, RDB, RDCUT, RDJCT, RJCT, RR, ...
S | ADMF, AGRF, AIRB, AIRF, AIRH, AIRP, AIRQ, AMTH, ANS, AQC, ARCH, ...
T | ASPH, ATOL, BAR, BCH, BCHS, BDLD, BLDR, BLHL, BLOW, BNCH, ...
U | APNU, ARCU, ARRU, BDLU, BKSU, BNKU, BSNU, CDAU, CNSU, CNYU, ...
V | BUSH, CULT, FRST, FRSTF, GRSLD, GRVC, GRVO, GRVP, GRVPN, HTH, ...
It is found through analysis that the category similarity in the attributes of place names can
reflect the relevancy between the categories of two pieces of data in the same classification system.
Therefore, the calculation of the relevancy between categories needs to process different types of
relationships in a classification tree, such as the relationship between parent and child nodes and the relationship between sibling nodes. To facilitate understanding, a part of the categories under the major category P is taken as an example to establish a tree diagram, as shown in Fig. 2. The place name category similarity function is denoted by Sc(i, j); when the categories of the place names i and j are under the same subcategory, Sc(i, j) is calculated as follows (for example, as shown in Fig. 2, if the categories of the place names i and j are respectively PPA1 and PPA3, then PPA1 and PPA3 both belong to the same subcategory PPA):
$$S_C(i,j) = \frac{l}{l + \alpha(i,j)\,d_i + (1-\alpha(i,j))\,d_j}$$
Wherein $l$ denotes the distance (the number of sides) from the closest common parent category of the categories of the place names i and j to the root node; $d_i$ denotes the distance (the number of sides) from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d_j$ denotes the distance (the number of sides) from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j. When the categories of the place names i and j are under different subcategories, Sc(i, j) is calculated as follows:
$$S_C(i,j) = \frac{\beta'}{\beta' + \alpha'(i,j)\,d'_i + (1-\alpha'(i,j))\,d'_j}$$
Wherein $\beta'$ denotes the relevancy between the subcategories to which the categories of the place names i and j belong, takes a value in the range [0, 1], and can be given by an expert in the art according to practical use; $d'_i$ denotes the distance (the number of sides) from the closest common parent category of the categories of the place names i and j to the category of the place name i; $d'_j$ denotes the distance (the number of sides) from the closest common parent category of the categories of the place names i and j to the category of the place name j; and $\alpha'(i,j)$ denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.
Step 3, calculating a name similarity between the romanized place names i and j according to a place name character string similarity model.
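Returning to the same-subcategory case of Step 2, the category similarity can be sketched on a small classification tree. The sketch assumes the model has the form l / (l + α·d_i + (1 − α)·d_j), with distances counted as numbers of sides. The tree below (ROOT, P, PPA, PPA1, PPA3) only mirrors the example categories named in the text and is not the actual content of Fig. 2; the function names are hypothetical, and α is passed in as a plain weight with a default of 0.5, which is an assumption made for illustration.

```python
def path_to_root(parent, node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def category_similarity(parent, cat_i, cat_j, alpha=0.5):
    """Same-subcategory similarity: l / (l + alpha*d_i + (1 - alpha)*d_j)."""
    path_i = path_to_root(parent, cat_i)
    path_j = path_to_root(parent, cat_j)
    ancestors_j = set(path_j)
    # Closest common parent: first node on i's upward path that j also reaches.
    common = next(node for node in path_i if node in ancestors_j)
    d_i = path_i.index(common)                     # sides from common parent to i
    d_j = path_j.index(common)                     # sides from common parent to j
    l = len(path_to_root(parent, common)) - 1      # sides from common parent to root
    denominator = l + alpha * d_i + (1 - alpha) * d_j
    if denominator == 0:
        return 1.0  # both categories are the root itself
    return l / denominator


# Hypothetical fragment of the classification tree under major category P.
PARENT = {"ROOT": None, "P": "ROOT", "PPA": "P", "PPA1": "PPA", "PPA3": "PPA"}
```

For PPA1 and PPA3 the closest common parent is PPA, so l = 2 and d_i = d_j = 1; with α = 0.5 the sketch returns 2/3, and identical categories return 1.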
Edit distance, also known as Levenshtein distance, is a distance measurement function for measuring the similarity between two sequences. In natural language processing, the edit distance is the minimum number of insertion, deletion and replacement operations required to convert an original character string into a target character string. Let $S_i = s_1 s_2 \ldots s_i$ and $T_j = t_1 t_2 \ldots t_j$ represent two character strings. The distance $d[i,j]$ is the minimum number of operations for editing the character string $S_i$ into the character string $T_j$; $d[i,j]$ denotes the edit distance between the place names i and j, and can effectively reflect the character similarity between place names. The formula is as follows:
$$d[i,j] = \begin{cases} 0, & i = 0,\ j = 0 \\ i, & i > 0,\ j = 0 \\ j, & i = 0,\ j > 0 \\ \min\{\, d[i-1,j-1] + c(s_i,t_j),\ d[i-1,j] + 1,\ d[i,j-1] + 1 \,\}, & i > 0,\ j > 0 \end{cases}$$
Wherein $c(s_i,t_j)$ is 0 if $s_i = t_j$ and 1 otherwise.
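The recurrence for the edit distance can be evaluated row by row with dynamic programming. The following is a minimal sketch of the standard Levenshtein computation; the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (unit-cost operations)."""
    # prev[j] holds d[i-1, j]; the first row is d[0, j] = j.
    prev = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        curr = [i]  # d[i, 0] = i
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            curr.append(min(prev[j - 1] + cost,   # replacement or match
                            prev[j] + 1,          # deletion
                            curr[j - 1] + 1))     # insertion
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so the memory use grows with the length of the second string rather than with the product of both lengths.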
Edit distance is often used to calculate the place name character string similarity. However, the algorithm cannot effectively reduce the influence of a general name. Therefore, the algorithm is improved, and the improved model is as follows:
$$A(i,j) = a\left(1 - \frac{d[i,j]}{ML}\right) + \frac{2b\,Len}{L(i)+L(j)}$$
Wherein $d[i,j]$ represents the edit distance between the place names i and j; $ML$ represents the maximum of the character string lengths of the place names i and j; $Len$ represents a minimum match length ($Len > 1$); $L(i)$ represents the character string length of the place name i; $L(j)$ represents the character string length of the place name j; and $a$ and $b$ denote weights, which are respectively 0.6 and 0.4. The comparison between the name similarity calculation results of the improved model and the existing models is shown in table 4.
Table 4 Comparison between place name character string similarity calculation results

Place name 1 | Place name 2 | Edit distance algorithm | Greedy character string matching algorithm | Improved model | Whether the same place name |
---|---|---|---|---|---|
Gwenema | Gwenima | 0.857 | 0.571 | 0.742 | Yes |
Merendon | Merendón | 0.875 | 0.750 | 1.000 | Yes |
Reputa | Wreputa | 0.714 | 0.769 | 0.883 | Yes |
Stephenta | Stephen Ta | 0.800 | 0.736 | 1.000 | Yes |
Wilipini | Willipinee | 0.700 | 0.555 | 0.642 | Yes |
Gwaun Creek | Gunye Creek | 0.636 | 0.545 | 0.560 | No |
Gbonga | Gbondoi | 0.571 | 0.615 | 0.589 | No |
It can be seen from the above table that Gwaun Creek and Gunye Creek are different place
names, but the similarity calculated with the edit distance algorithm is as high as 0.636; Wilipini
and Willipinee are the same place name, but the similarity given by the greedy character string
matching algorithm is only 0.555; Gbonga and Gbondoi are different place names, but the result of
the greedy character string matching algorithm is 0.615. The similarity calculated with the improved
algorithm of the present invention is therefore more consistent with the actual situation.
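The improved character string model can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: the helper names are hypothetical, and Len is assumed here to be the length of the longest common substring of the two names, an interpretation that is not stated explicitly in the text. The condition Len > 1 is not enforced in this sketch.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two strings."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j - 1] + cost,
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]


def longest_common_substring(first: str, second: str) -> int:
    """Length of the longest contiguous run shared by both strings
    (assumed meaning of Len in the improved model)."""
    best = 0
    previous = [0] * (len(second) + 1)
    for f_char in first:
        current = [0]
        for j, s_char in enumerate(second, start=1):
            length = previous[j - 1] + 1 if f_char == s_char else 0
            current.append(length)
            best = max(best, length)
        previous = current
    return best


def name_similarity(name_i: str, name_j: str, a: float = 0.6, b: float = 0.4) -> float:
    """A(i, j) = a * (1 - d / ML) + b * 2 * Len / (L(i) + L(j))."""
    if not name_i and not name_j:
        raise ValueError("at least one name must be non-empty")
    max_length = max(len(name_i), len(name_j))  # ML
    edit_term = 1 - edit_distance(name_i, name_j) / max_length
    match_term = 2 * longest_common_substring(name_i, name_j) / (len(name_i) + len(name_j))
    return a * edit_term + b * match_term
```

Under these assumptions the sketch gives about 0.743 for Gwenema and Gwenima and about 0.589 for Gbonga and Gbondoi, in line with the improved-model column of table 4.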
Step 4, acquiring the longitudes and latitudes of the place names i and j from the place name
information library, and calculating place name space proximity according to a place name space
proximity model.
A place name, as a basic geographical element, can be a point element (for example, the place
name of a small village), a line element (for example, the place name of a highway), and can also be
a plane element (for example, the place name of an administrative district). Therefore, the
geometric similarity between place name data comprises the measurement of a point element
position similarity, the measurement of a line element similarity, and the measurement of a plane
element geometric similarity. The global place name data studied in the present invention are all
point element place names.
The position similarity of point element place names is generally measured by means of distance
calculation. The basic idea is that a set of feature vectors is extracted from each of two point
element place names, and then the distance between the two sets of vectors is calculated in a certain
distance space. The smaller the distance is, the more similar the two place names are; conversely,
the greater the distance is, the more different the two place names are. The distance
between two points is often expressed as the Euclidean distance.
Euclidean distance is an ordinary straight line distance between two points in Euclidean space,
and can measure the absolute distance between points in a multi-dimensional space. The greater the
Euclidean distance between place names is, the lower the similarity between the place names is. Let i and j denote two place names whose longitudes and latitudes are respectively lon_i, lon_j, lat_i and lat_j. The Euclidean distance between the two place names is denoted as dis_{i-j}:

$$
dis_{i-j} = \cos^{-1}\left( \sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j) \right)
$$
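The distance formula above can be sketched in Python as follows. The function name `central_angle` is hypothetical, and the sketch assumes that longitudes and latitudes are supplied in decimal degrees and that the result is wanted as an angle in radians; neither assumption is stated in the text. The clamp guards against floating point values slightly outside [-1, 1].

```python
import math


def central_angle(lat_i: float, lon_i: float, lat_j: float, lon_j: float) -> float:
    """Angle in radians between two points given in decimal degrees,
    computed from the inverse cosine formula for dis_{i-j}."""
    phi_i = math.radians(lat_i)
    phi_j = math.radians(lat_j)
    delta_lon = math.radians(lon_i - lon_j)
    cosine = (math.sin(phi_i) * math.sin(phi_j)
              + math.cos(phi_i) * math.cos(phi_j) * math.cos(delta_lon))
    # Rounding can push the value marginally outside the domain of acos.
    return math.acos(max(-1.0, min(1.0, cosine)))
```

Multiplying the returned angle by an Earth radius would give a length; whether the model does so is not recoverable from the text, so the sketch stops at the angle.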
Let the place name space proximity function be SE(i, j); according to the spatial features of place name data, the present invention designs the following spatial distance similarity model:
SE(ij)=
Wherein SE(i, j) denotes the spatial range similarity between two place names; if the two positions coincide, the value is 1; and the farther apart the two are, the closer the value is to 0.
Step 5, calculating a place name semanteme similarity according to a place name semanteme similarity model. The place name semanteme similarity model is as follows:

$$
F(i, j) = A(i, j)\,\mathrm{SE}(i, j)\,\mathrm{SC}(i, j)
$$

Wherein F(i, j) denotes the place name semanteme similarity; the three variables A(i, j), SE(i, j) and SC(i, j) respectively denote the place name character string similarity, the place name space proximity and the place name category similarity, which are normalized to the value range [0, 1].
In total, about 167 thousand pieces of place name data were acquired from the place name data sources of five countries, namely Honduras, Mauritius, Liberia, Mongolia and Zimbabwe, as experimental data, of which about 47.7 thousand pieces can be consistency-matched. An experiment was performed with the multi-language oriented general method for calculating a place name semanteme similarity provided by the present invention, and the results are shown in table 5.
Table 5 Experiment result evaluation indicator statistics

Test set | Number of place names which can be actually matched (no.) | Number of matched place names (no.) | Number of accurately matched place names (no.) | Accuracy rate (%) | Coverage rate (%) |
---|---|---|---|---|---|
Honduras | 17835 | 17535 | 17300 | 98.65 | 97.00 |
Mauritius | 1130 | 1126 | 1119 | 99.37 | 99.02 |
Liberia | 7984 | 7899 | 7870 | 99.63 | 98.57 |
Mongolia | 12594 | 12571 | 12557 | 99.88 | 99.70 |
Zimbabwe | 8174 | 8039 | 7997 | 99.48 | 97.83 |
The experimental results show that the multi-language oriented general method for calculating a place name semanteme similarity not only keeps the place name matching accuracy rate above 98%, but also matches more than 97% of the place name data that can actually be matched.
An embodiment of the present invention discloses a use of the method for calculating a place name semanteme similarity in multi-language place name data query, mainly comprising the following steps:
Step I, extracting the attributes of all the place names, such as character strings, categories, and longitudes and latitudes, from a place name information library; determining languages of the place names according to a language encoding interval, and normalizing the place names; dividing the index methods into phonetic and ideographic ones on the basis of the different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with language features such as the total number of letters, the number of letter radicals, the total number of words, acronyms and the like; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with language features such as the same character of the place names, the number of characters, character position and the like.
Step II, determining the whole or a part of the attributes of a place name to be queried, such as a character string, a category, and longitude and latitude, and normalizing the place name.
Step III, sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried, such as the character string, the category, and the longitude and latitude. Specifically: using a place name character string similarity model to perform calculation on the basis of the determined place name character string, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered out; if the character string is null, the place name directly satisfies the filter condition; using a category similarity model to perform calculation on the basis of the determined place name category, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered out; if the category is null, the place name directly satisfies the filter condition; using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude, wherein a place name with a calculation result higher than a preset threshold value satisfies the filter condition and is otherwise filtered out; if the longitude and latitude are null, the place name directly satisfies the filter condition.
Step IV, sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a place name semanteme similarity.
Step V, sorting the calculation results in descending order, wherein the higher a place
name is ranked, the more similar it is to the place name to be queried.
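Steps III to V can be sketched in Python as a filter-then-rank routine. This is a minimal sketch under stated assumptions: the record layout (dictionaries keyed by attribute name), the function name `query_place_names`, and the idea of passing each similarity model as a callable are all hypothetical. In addition, a null query attribute is assumed to contribute a factor of 1.0 to the product F(i, j), which the text does not specify.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A model maps (query attribute value, candidate attribute value) to a score in [0, 1].
Model = Callable[[object, object], float]


def query_place_names(
    query: Dict[str, Optional[object]],
    candidates: List[Dict[str, object]],
    models: Dict[str, Model],
    thresholds: Dict[str, float],
) -> List[Tuple[Dict[str, object], float]]:
    """Filter candidates attribute by attribute, then rank the survivors by
    the product of their per-attribute similarities (descending)."""
    ranked = []
    for candidate in candidates:
        score = 1.0
        kept = True
        for attribute, model in models.items():
            if query.get(attribute) is None:
                # Null attribute: the filter condition is satisfied directly.
                # Leaving the factor at 1.0 is an assumption of this sketch.
                continue
            value = model(query[attribute], candidate[attribute])
            if value <= thresholds[attribute]:
                kept = False  # not higher than the preset threshold
                break
            score *= value
        if kept:
            ranked.append((candidate, score))
    # Stable sort: candidates with equal scores keep their index order.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

The `models` dictionary would list the character string, category and space proximity models in the order in which the filters are to be applied; its insertion order determines the filtering sequence.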
Claims (8)
1. A multi-language oriented general method for calculating a place name semanteme similarity, comprising the following steps: determining languages of place names according to a language encoding interval, and normalizing the place names to be romanized place names according to literature information; acquiring category attribute information of two place names from a place name information library, and calculating a place name category similarity according to a place name classification system and a place name category similarity model; calculating a character string similarity between the romanized place names according to a place name character string similarity model; acquiring the longitudes and latitudes of the two place names from the place name information library, then calculating a space proximity according to a place name space proximity model; and determining a place name semanteme similarity according to the place name category similarity, the character string similarity and the space proximity.
2. The method for calculating a place name semanteme similarity according to claim 1, wherein calculating a place name category similarity according to a place name classification system and a place name category similarity model comprises: if the categories of the two place names belong to the same subcategory of the place name classification system, then calculating the sum of distances from common parent categories to a root node, and distances from the closest common parent category to the categories of the two place names, and utilizing a same-category similarity model to calculate the category similarity; and if the categories of the two place names belong to different subcategories, then calculating a relevancy between the subcategories to which the categories of the two place names belong, and utilizing a different-category similarity model to calculate the category similarity.
3. The method for calculating a place name semanteme similarity according to claim 2, wherein the category similarity model under the same subcategory is denoted as:
S (i,j>j)= SC (A=l+a(i, j)d, +(1-a(i,j))d
wherein SC(i, j) denotes the place name category similarity between the place names i and j; l denotes the distance from the closest common parent category of the categories of the place names i and j to the root node; d_i denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; d_j denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and a(i, j) denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.
4. The method for calculating a place name semanteme similarity according to claim 2, wherein the category similarity model under different subcategories is denoted as:
S (i, j)= 6 +a'(i,j)d + (1-a'(i, j))d
wherein SC(i, j) denotes the place name category similarity between the place names i and j; the relevancy coefficient denotes the relevancy between the subcategories to which the categories of the place names i and j belong; d_i' denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name i; d_j' denotes the distance from the closest common parent category of the categories of the place names i and j to the category of the place name j; and a'(i, j) denotes the sum of the distances from the closest common parent category to the categories of the place names i and j.
5. The method for calculating a place name semanteme similarity according to claim 1, wherein the place name character string similarity model is denoted as:

$$
A(i, j) = a\left(1 - \frac{d[i, j]}{ML}\right) + b \cdot \frac{2\,Len}{L(i) + L(j)}
$$

wherein A(i, j) denotes the place name character string similarity between the place names i and j; d[i, j] represents an edit distance between the place names i and j; ML represents the maximum of the character string lengths of the place names i and j; Len represents a minimum match length; L(i) represents the character string length of the place name i; L(j) represents the character string length of the place name j; a and b denote weights.
6. The method for calculating a place name semanteme similarity according to claim 1, wherein the place name space proximity model is denoted as:
$$
d_{i-j} = \cos^{-1}\left( \sin lat_i \sin lat_j + \cos lat_i \cos lat_j \cos(lon_i - lon_j) \right)
$$

SE(i, j) =
wherein SE(i, j) represents the place name space proximity between the place names i and j; lon_i, lon_j, lat_i and lat_j are respectively the longitudes and latitudes of the place names i and j.
7. The method for calculating a place name semanteme similarity according to claim 1, wherein a place name semanteme similarity calculation model is:
$$
F(i, j) = A(i, j)\,\mathrm{SE}(i, j)\,\mathrm{SC}(i, j)
$$

wherein SC(i, j) denotes the place name category similarity between the place names i and j; A(i, j) denotes the place name character string similarity between the place names i and j; SE(i, j) denotes the place name space proximity between the place names i and j; and F(i, j) denotes the place name semanteme similarity between the place names i and j.
8. A use of the method for calculating a place name semanteme similarity in multi-language place name data query, comprising the following steps: extracting the attributes of all the place names such as character strings, categories, and longitudes and latitudes from a place name information library; determining languages of the place names according to a language encoding interval, and normalizing the place names; dividing into phonetic and ideographic index methods on the basis of different features of the place name languages, wherein phonetic characters are based on the similarity of letters, and a phonetic place name index is constructed on the basis of an index organization mode of multidimensional feature statistical vectors in combination with the language features such as the total number of letters, the number of letter radicals, the total number of words, acronyms and the like; ideographic characters are based on the local similarity of characters, and an ideographic place name index is constructed on the basis of an index organization mode of single word place names in combination with the language features such as the same character of the place names, the number of characters, character position and the like. 
determining the attributes of a place name to be queried such as a character string, a category, and longitude and latitude, and normalizing the place name; sequentially filtering all the place names in the index according to the determined attributes of the place name to be queried such as the character string, the category, and the longitude and latitude; specifically, using a place name character string similarity model to perform calculation on the basis of the determined place name character string; the place name with a calculation result higher than a preset threshold value satisfies a filter condition, otherwise the place name would be filtered off; if the character string is null, then the place name would directly satisfy the filter condition; using a category similarity model to perform calculation on the basis of the determined place name category; the place name with a calculation result higher than a preset threshold value satisfies a filter condition, otherwise the place name would be filtered off; if the category is null, then the place name would directly satisfy the filter condition; using a place name space proximity model to perform calculation on the basis of the determined place name longitude and latitude; the place name with a calculation result higher than a preset threshold value satisfies a filter condition, otherwise the place name would be filtered off; if the longitude and latitude are null, then the place name would directly satisfy the filter condition; sequentially calculating the semanteme similarities between the place name to be queried and all the candidate place names with the multi-language oriented general method for calculating a
place name semanteme similarity as claimed in any one of claims 1-7; sequencing the calculation results in a descending order, wherein the higher a place name is ranked, the more similar to the place name to be queried.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010058317.6A CN111325235B (en) | 2020-01-19 | 2020-01-19 | Multilingual-oriented universal place name semantic similarity calculation method and application thereof |
CN202010058317.6 | 2020-01-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020101024A4 (en) | 2020-07-23 |
Family
ID=71170946
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020101024A Ceased AU2020101024A4 (en) | 2020-01-19 | 2020-04-21 | Multi-language oriented general method for calculating place name semanteme similarity and use thereof |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN111325235B (en) |
AU (1) | AU2020101024A4 (en) |
WO (1) | WO2021142968A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076734A (en) * | 2021-04-15 | 2021-07-06 | 云南电网有限责任公司电力科学研究院 | Similarity detection method and device for project texts |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114880996B (en) * | 2022-03-01 | 2024-08-09 | 中国人民解放军92728部队 | Mechanism name normalization method based on segmentation weighted similarity matching algorithm |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8015196B2 (en) * | 2007-06-18 | 2011-09-06 | Geographic Services, Inc. | Geographic feature name search system |
EP2584505B1 (en) * | 2011-10-20 | 2017-08-02 | Deutsche Post AG | Comparison of position information |
CN103605752A (en) * | 2013-11-21 | 2014-02-26 | 武大吉奥信息技术有限公司 | Address matching method based on semantic recognition |
CN108572960A (en) * | 2017-03-08 | 2018-09-25 | 富士通株式会社 | Place name disappears qi method and place name disappears qi device |
CN108804398A (en) * | 2017-05-03 | 2018-11-13 | 阿里巴巴集团控股有限公司 | The similarity calculating method and device of address text |
CN107239442A (en) * | 2017-05-09 | 2017-10-10 | 北京京东金融科技控股有限公司 | A kind of method and apparatus of calculating address similarity |
CN107861947B (en) * | 2017-11-07 | 2021-01-05 | 昆明理工大学 | Method for identifying invitation named entities based on cross-language resources |
CN108171529B (en) * | 2017-12-04 | 2021-09-14 | 昆明理工大学 | Address similarity evaluation method |
CN110276021A (en) * | 2019-04-29 | 2019-09-24 | 小轮(上海)网络科技有限公司 | Place name matching process and device based on semantic similarity |
CN110598791A (en) * | 2019-09-12 | 2019-12-20 | 深圳前海微众银行股份有限公司 | Address similarity evaluation method, device, equipment and medium |
-
2020
- 2020-01-19 CN CN202010058317.6A patent/CN111325235B/en active Active
- 2020-04-21 AU AU2020101024A patent/AU2020101024A4/en not_active Ceased
- 2020-04-21 WO PCT/CN2020/085814 patent/WO2021142968A1/en active Application Filing
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076734A (en) * | 2021-04-15 | 2021-07-06 | 云南电网有限责任公司电力科学研究院 | Similarity detection method and device for project texts |
CN113076734B (en) * | 2021-04-15 | 2023-01-20 | 云南电网有限责任公司电力科学研究院 | Similarity detection method and device for project texts |
Also Published As
Publication number | Publication date |
---|---|
WO2021142968A1 (en) | 2021-07-22 |
CN111325235A (en) | 2020-06-23 |
CN111325235B (en) | 2023-04-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |