WO2022095256A1 - Geocoding method and system, terminal and storage medium - Google Patents

Info

Publication number
WO2022095256A1
Authority
WO
WIPO (PCT)
Prior art keywords
path
place name
node
geocoding
address
Prior art date
Application number
PCT/CN2020/139759
Other languages
French (fr)
Chinese (zh)
Inventor
钱静
彭树宏
陈朝亮
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022095256A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G06F16/23 Updating
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Definitions

  • The present application belongs to the technical field of geocoding and, in particular, relates to a geocoding method, system, terminal and storage medium.
  • As the product of combining location services with information platforms, geographic information systems are being applied ever more widely. With the popularization and maturing of geographic information technology, many enterprises, organizations and government departments have established businesses based on geographic information, for example in pharmaceuticals and media, and the demand for management and operation supported by geographic information has become increasingly prominent. However, the way place names, addresses and other geographic information are named in our country is semantically messy and inconsistent in word order; there is no unified standard for normalizing them. In addition, the geographic information that ordinary departments can collect is usually only disorganized textual descriptions of place names and addresses (non-spatial information), and directly usable spatial coordinate information cannot be obtained.
  • The present application provides a geocoding method, system, terminal and storage medium, aiming to solve, at least to a certain extent, one of the above technical problems in the prior art.
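  • A geocoding method that includes:
  • building a place name address model according to place name address data;
  • building a geocoding library according to the place name address model, the geocoding library including an administrative area entity data table, a street and alley entity data table, and a community entity data table;
  • performing word segmentation and standardization on the place name address data on the basis of an address dictionary by using an N-shortest path optimization algorithm, so as to segment the place name address data into at least one phrase;
  • converting the at least one phrase into a character string of a predetermined format according to the level elements in the place name address model, matching the character string with the corresponding geographic coordinates in the geocoding library, and using the matched geographic coordinates as the standard geographic coordinates of the corresponding place name address.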
  • the technical solution adopted in the embodiment of the present application further includes: before the establishment of the place-name-address model according to the place-name-address data, the following further includes:
  • the technical solution adopted in the embodiment of the present application further includes: the establishing of the geographic coding library according to the place name and address model includes:
  • Performing word segmentation and standardization on the place name address data on the basis of the address dictionary by using the N-shortest path optimization algorithm includes:
  • the place name phrases in the place name address data are matched against the address dictionary, and a directed acyclic graph is constructed.
  • Each phrase is a node in the directed acyclic graph and corresponds to an edge of a given length;
  • All possible word edges of the directed acyclic graph are established according to preset rules, so that the words contained in the place name address data correspond one-to-one to the edges of the directed acyclic graph; the N-shortest path set from the start node to the end node of the directed acyclic graph is then solved, and the place name address data is segmented according to that set.
  • The preset rule for establishing all possible word edges in the directed acyclic graph is:
  • a directed edge $\langle V_{i-1}, V_j \rangle$ is established between nodes $V_{i-1}$ and $V_j$; the length of the edge is $L_w$, and the word corresponding to the edge is $w$ ($0 < i < j \le n$).
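  • The edge rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name `build_word_edges`, the toy dictionary, and the extra rule that every single character also forms an edge between adjacent nodes are assumptions made so that a path from the start node to the end node always exists. Every edge has length 1, matching the equal-weight simplification described in the text.

```python
def build_word_edges(text, dictionary):
    """Return the directed edges (i, j, word) of the segmentation graph.

    Nodes V_0 .. V_n sit between characters; the edge V_i -> V_j carries
    the substring text[i:j]. All edges are treated as having length 1.
    """
    n = len(text)
    edges = []
    for i in range(n):
        # Assumed rule: adjacent nodes are always linked by a one-character edge.
        edges.append((i, i + 1, text[i]))
        # Dictionary rule: a multi-character dictionary word adds a longer edge.
        for j in range(i + 2, n + 1):
            if text[i:j] in dictionary:
                edges.append((i, j, text[i:j]))
    return edges


# Hypothetical toy input; real input would be an address string and the address dictionary.
edges = build_word_edges("abcd", {"ab", "bc", "cd"})
```

  • With 0-based Python indices, the edge `(i, j, word)` plays the role of the edge from $V_{i-1}$ to $V_j$ in the 1-based notation of the rule.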
  • The technical solutions adopted in the embodiments of the present application further include: solving the set of N-shortest paths from the start node to the end node in the directed acyclic graph includes:
  • $\mathrm{Path}(i,j)$ is the set of all paths from node $V_i$ to node $V_j$;
  • $\mathrm{Length}(path)$ is the length of a path $path$, equal to the sum of the lengths of all edges in the path;
  • $LS$ is the set of lengths of all paths from $V_0$ to $V_n$ in the directed acyclic graph $G$;
  • $NLS$ is the set of N-shortest path lengths from $V_0$ to $V_n$;
  • $NSP$ is the set of N-shortest paths from $V_0$ to $V_n$;
  • $RS$ is the final result set of the N-shortest path rough segmentation;
  • $|NLS| = \min(|LS|, N)$; $a \in LS - NLS,\ b \in NLS \Rightarrow a > b$
  • $NSP = \{\, path \mid path \in \mathrm{Path}(0,n),\ \mathrm{Length}(path) \in NLS \,\}$
  • $RS = \{\, w_1 w_2 \ldots w_m \mid path \in NSP \,\}$, where $w_i$ is the word corresponding to the $i$-th edge of $path$, $i = 1, 2, \ldots, m$, and n is the number of shortest paths.
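  • To make the definitions of NLS, NSP and RS concrete, the following Python sketch enumerates every segmentation whose length is among the N smallest distinct path lengths. It is an illustration under assumptions, not the patented procedure: the function name `n_shortest_segmentations`, the parameter `n_best` (standing in for N), the toy dictionary, and the rule that single characters always form edges are all hypothetical. Every edge has length 1.

```python
def n_shortest_segmentations(text, dictionary, n_best=2):
    """Return sorted (length, words) pairs for all paths V_0 -> V_n whose
    length is among the n_best smallest distinct path lengths (the set NLS)."""
    n = len(text)
    # incoming[j] lists (i, word) for every edge V_i -> V_j.
    incoming = [[] for _ in range(n + 1)]
    for i in range(n):
        incoming[i + 1].append((i, text[i]))  # assumed single-character edge
        for j in range(i + 2, n + 1):
            if text[i:j] in dictionary:
                incoming[j].append((i, text[i:j]))

    # table[j] holds the partial results (length, words) kept at node V_j.
    table = [[] for _ in range(n + 1)]
    table[0] = [(0, ())]
    for j in range(1, n + 1):
        candidates = [(length + 1, words + (word,))
                      for i, word in incoming[j]
                      for length, words in table[i]]
        # Keep only paths whose length is among the n_best smallest distinct lengths.
        kept_lengths = sorted({length for length, _ in candidates})[:n_best]
        table[j] = [c for c in candidates if c[0] in kept_lengths]
    return sorted(table[n])


result = n_shortest_segmentations("abcd", {"ab", "bc", "cd"}, n_best=2)
```

  • Pruning at each node is safe here because a prefix that is not among the `n_best` shortest at its node cannot be extended into a path that is among the `n_best` shortest at the end node. In the returned list, the lengths form NLS, the paths form NSP, and the word tuples form RS.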
  • Performing word segmentation and standardization on the place name address data on the basis of the address dictionary by using the N-shortest path optimization algorithm further includes:
  • a geocoding system comprising:
  • Place name address model building module: used to build a place name address model based on place name address data;
  • Geocoding library building module: used to establish a geocoding library according to the place name address model, the geocoding library including an administrative area entity data table, a street and alley entity data table and a community entity data table;
  • Word segmentation and standardization processing module: used to perform word segmentation and standardization on the place name address data on the basis of the address dictionary by using the N-shortest path optimization algorithm, and to divide the place name address data into at least one phrase;
  • Coordinate matching module: used to convert the at least one phrase into a character string in a predetermined format according to the level elements in the place name address model, match the character string with the corresponding geographic coordinates in the geocoding library, and use the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place name address.
  • a terminal includes a processor and a memory coupled to the processor, wherein,
  • the memory stores program instructions for implementing the geocoding method
  • the processor is configured to execute the program instructions stored in the memory to control geocoding.
  • a storage medium storing program instructions executable by a processor, where the program instructions are used to execute the geocoding method.
  • The beneficial effects of the embodiments of the present application are as follows: the geocoding method, system, terminal and storage medium of the embodiments perform word segmentation and standardization on place name addresses based on the N-shortest path optimization algorithm. After the place name address is segmented according to the standardization result, the segmented address is converted into a string that can be recognized by the computer according to the level elements in the place name address model; finally, the string is matched with the corresponding geographic coordinates in the geocoding library, and standard geographic coordinates are assigned to the place name address according to the matching result.
  • The present application overcomes the disadvantages of word-by-word traversal, increases practicability, and inherits the advantages of the full segmentation idea: it reduces the number of segmented phrases as much as possible while still including all results that need to be retained, which effectively avoids wasting resources and increases search efficiency.
  • FIG. 1 is a flowchart of the geocoding method of the first embodiment of the present application.
  • FIG. 2 is a schematic diagram of a representation of a place name and address according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a directed acyclic graph according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a directed acyclic graph solution process according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a predecessor record table in the process of solving a directed acyclic graph according to an embodiment of the present application.
  • FIG. 6 is a flowchart of the geocoding method of the second embodiment of the present application.
  • FIG. 7 is a schematic diagram of an N-shortest path improved word segmentation algorithm according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a geocoding system according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • The geocoding method of the embodiment of the present application first performs data cleaning on the initial place name address data, to prevent problems such as excessive typos, spelling mistakes and text repetition in the input text. It then establishes a place name address model that reflects the different representations of place names in a country or region, builds, according to that model, a geocoding library including a place name data table, a building data table and a door (building) plate data table, and uses the N-shortest path optimization algorithm to perform word segmentation and standardization on the place name address. The segmented place name address is converted into a string that can be recognized by the computer according to the level elements in the place name address model, and the string is finally matched with the corresponding geographic coordinates in the geocoding library.
  • FIG. 1 is a flowchart of the geocoding method according to the first embodiment of the present application.
  • the geocoding method of the first embodiment of the present application includes the following steps:
  • The present invention uses Trillum technology, with syntax analysis and a fuzzy matching algorithm, to perform data cleaning on place name addresses.
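  • As a rough illustration of what fuzzy-matching-based cleaning can look like, the sketch below uses Python's standard `difflib` module as a stand-in. It is not the Trillum technology named in the text, and the function name `clean_address`, the `cutoff` value and the sample names are assumptions chosen for the example. It collapses immediately repeated words and then snaps the text to the closest known name when one is similar enough.

```python
import difflib


def clean_address(raw, known_names, cutoff=0.8):
    """Collapse repeated words, then replace the text by the closest
    known name if its similarity ratio reaches the cutoff."""
    tokens = raw.split()
    # Drop a token when it merely repeats the token before it.
    deduped = [t for k, t in enumerate(tokens) if k == 0 or t != tokens[k - 1]]
    text = " ".join(deduped)
    match = difflib.get_close_matches(text, known_names, n=1, cutoff=cutoff)
    return match[0] if match else text


# Hypothetical list of standard names used for correction.
KNOWN = ["Yanda Avenue", "Huizhou City"]
cleaned = clean_address("Yanda Yanda Avenu", KNOWN)
```

  • Text with no sufficiently similar known name is returned unchanged (apart from the removal of repeated words), so the step never invents an address.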
  • FIG. 2 is a schematic diagram of the representation of place names and addresses.
  • A place name address may include administrative region place names (provincial, city, county and township levels), street and lane names, community place names, community names, gate (building) addresses, landmark names or their aliases, and unit names or their abbreviations;
  • the provincial level has priority over the city level
  • the city level has priority over the county level
  • the county level has priority over the township level
  • street and lane names have priority over community place names
  • community place names have priority over community names
  • gate (building) addresses have priority over landmark names or their aliases, which are followed by unit names or their abbreviations.
  • Street names and community names within a city are unique, so a street name or community name can roughly narrow an address down to a certain range; "street name or community name + door (building) number" can locate a position precisely, and "administrative region place name + landmark name" can in most cases also locate a position precisely. That is, when the text to be expressed contains a house number, "street name or community name + door (building) number" is used to fix a location; when the place name address data contains a landmark name, "administrative region place name + landmark name" is used for precise positioning.
  • An example of the structure of place name address data according to the above description rules of granularity range is as follows:
  • If the place name address data is "Huizhou College, No. 46 Yanda Avenue, Huizhou City", it can be simplified to "No. 46 Yanda Avenue" in an application limited to Huizhou City without any ambiguity. If the place name address data is "ICBC, Guangzhou City", however, multiple landmarks may be located and the results are difficult to filter, so the description needs to be extended with a street name or community name before a particular ICBC branch can be located precisely.
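  • The choice between the two locating combinations described above can be sketched as a small Python function. The function name `locating_key` and the field names (`admin_region`, `street`, `community`, `house_number`, `landmark`) are hypothetical labels introduced for this example; the source does not define a data structure for parsed addresses.

```python
def locating_key(elements):
    """Pick the combination of address elements used to fix a location.

    Prefers "street or community + house number"; falls back to
    "administrative region + landmark"; returns None if neither applies.
    """
    street_or_community = elements.get("street") or elements.get("community")
    if elements.get("house_number") and street_or_community:
        return (street_or_community, elements["house_number"])
    if elements.get("landmark") and elements.get("admin_region"):
        return (elements["admin_region"], elements["landmark"])
    return None


key = locating_key({
    "admin_region": "Huizhou City",
    "street": "Yanda Avenue",
    "house_number": "No. 46",
    "landmark": "Huizhou College",
})
```

  • For an input such as `{"admin_region": "Guangzhou City", "landmark": "ICBC"}` the function returns the region and landmark pair, which, as the text notes, may still match several landmarks and would need a street or community name to disambiguate.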
  • S12: According to the place name address model, establish a geocoding library including an administrative area entity data table, a street and alley entity data table, and a community entity data table;
  • The table structures of the administrative area entity data table, the street entity data table and the community entity data table can be defined according to the application scenario, and the establishment of the geocoding library follows these principles:
  • Standardization principle: the coding rules are adapted to the national standard system to allow data sharing.
  • In each table, enter all provinces, districts, counties, streets, communities, landmarks, house numbers and geographic coordinates in turn to construct the geocoding library. Select the field values in each data table as place name address entries, and record them in the address dictionary together with the corresponding address level.
  • If an address alias is used as a place name address entry, the standard name must also be recorded so that the address elements can be normalized during address segmentation.
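  • One possible shape for such an address dictionary is sketched below in Python: each entry maps a surface form to its address level and standard name, so that aliases are normalized at lookup time. The structure, the level labels, the function name `normalize` and the alias "HZC" are assumptions made for illustration only.

```python
# Hypothetical address dictionary: entry -> (address level, standard name).
ADDRESS_DICTIONARY = {
    "Huizhou City": ("city", "Huizhou City"),
    "Yanda Avenue": ("street", "Yanda Avenue"),
    "Huizhou College": ("landmark", "Huizhou College"),
    "HZC": ("landmark", "Huizhou College"),  # invented alias for the example
}


def normalize(phrases):
    """Map each segmented phrase to (level, standard name).

    Phrases missing from the dictionary are returned with level None
    and left unchanged.
    """
    result = []
    for phrase in phrases:
        level, standard = ADDRESS_DICTIONARY.get(phrase, (None, phrase))
        result.append((level, standard))
    return result


normalized = normalize(["Huizhou City", "HZC"])
```

  • Storing the standard name alongside every entry, including entries that are already standard, keeps the lookup uniform and avoids a second table for aliases.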
  • The N-shortest path optimization algorithm is implemented as follows. The address dictionary records all place name addresses (including aliases, abbreviations, etc.) of different countries and regions. First, according to the address dictionary, the place name phrases that may appear in the place name address data are matched in order, and a directed acyclic graph is then constructed. Each phrase is a node in the directed acyclic graph and corresponds to an edge of a given length (i.e., a weight; in the non-statistical rough segmentation model all words are assumed to be equal, so for convenience of calculation the length of every word's edge is set to 1). Among all paths from the start node to the end node in the directed acyclic graph, the path value from each node to the source node is obtained, and the corresponding path set is used as the path result set of each node.
  • Let $\mathrm{Path}(i,j)$ be the set of all paths from node $V_i$ to node $V_j$;
  • $\mathrm{Length}(path)$ is the length of a path $path$, equal to the sum of the lengths of all edges in the path;
  • $LS$ is the set of lengths of all paths from $V_0$ to $V_n$ in the directed acyclic graph $G$; then:
  • $NLS$ is the set of N-shortest path lengths from $V_0$ to $V_n$;
  • $NSP$ is the set of N-shortest paths from $V_0$ to $V_n$;
  • $RS$ is the final result set of the N-shortest path rough segmentation.
  • The definition of $NLS$ is:
  • $|NLS| = \min(|LS|, N)$; $a \in LS - NLS,\ b \in NLS \Rightarrow a > b$
  • $NSP = \{\, path \mid path \in \mathrm{Path}(0,n),\ \mathrm{Length}(path) \in NLS \,\}$
  • $RS = \{\, w_1 w_2 \ldots w_m \mid path \in NSP \,\}$, where $w_i$ is the word corresponding to the $i$-th edge of $path$, $i = 1, 2, \ldots, m$, and n is the number of shortest paths.
  • The solution process for the text data is shown in FIG. 4.
  • A greedy algorithm is used to obtain the locally optimal solution at each node. The shortest path value at each node and the predecessor of the node are recorded. If two or more paths of the same length reach a node, the predecessor of the node on each of those paths is recorded separately.
  • The predecessor record table for the text data is shown in FIG. 5.
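  • The predecessor bookkeeping described above can be sketched as follows for the simplest case of a single shortest length per node. This is an assumed illustration, not a reproduction of FIG. 4 or FIG. 5: the function names `shortest_with_predecessors` and `backtrack`, the unit edge length and the toy edge list are hypothetical.

```python
def shortest_with_predecessors(n, edges):
    """Compute the shortest distance from V_0 to every node and record
    every predecessor that achieves it. Edges are (i, j) pairs with i < j."""
    inf = float("inf")
    dist = [inf] * (n + 1)
    dist[0] = 0
    preds = [[] for _ in range(n + 1)]
    # Sorting by start node visits nodes in topological order because i < j.
    for i, j in sorted(edges):
        if dist[i] == inf:
            continue
        candidate = dist[i] + 1  # every edge has length 1
        if candidate < dist[j]:
            dist[j] = candidate
            preds[j] = [i]
        elif candidate == dist[j]:
            preds[j].append(i)  # equally short path: keep this predecessor too
    return dist, preds


def backtrack(preds, node):
    """List every shortest node sequence from V_0 to the given node."""
    if node == 0:
        return [[0]]
    return [path + [node] for p in preds[node] for path in backtrack(preds, p)]


# Toy graph: single-character edges plus edges for "ab", "bc", "cd" over "abcd".
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
dist, preds = shortest_with_predecessors(4, EDGES)
```

  • Node 3 in this toy graph is reached by two equally short paths, so both predecessors (1 and 2) are kept, which is the situation the text describes for paths of the same length.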
  • The present invention uses the N-shortest path word segmentation algorithm to segment the place name address. This greatly reduces the number of segmentation candidates while retaining, as far as possible, all plausible segmentation results; at the same time, it reduces the search space as much as possible and improves the efficiency of word segmentation.
  • S14: Convert the at least one segmented phrase into a character string in a predetermined (computer-recognizable) format according to the level elements in the place name address model, and then match the converted character string with the corresponding geographic coordinates in the geocoding library;
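  • The conversion and matching step can be sketched in Python as below. The source does not specify the predetermined format, so the level order, the `|`-separated key, the names `LEVEL_ORDER`, `to_key` and `geocode`, and the coordinate values are all assumptions; the coordinates are placeholders, not real positions.

```python
# Assumed level order used to lay out the string; not taken from the source.
LEVEL_ORDER = ["province", "city", "county", "township",
               "street", "community", "house_number", "landmark"]


def to_key(tagged_phrases):
    """Turn (level, phrase) pairs into one fixed-layout string,
    leaving empty slots for levels that are absent."""
    by_level = dict(tagged_phrases)
    return "|".join(by_level.get(level, "") for level in LEVEL_ORDER)


# Hypothetical geocoding library: key string -> (longitude, latitude) placeholder.
GEOCODING_LIBRARY = {
    to_key([("city", "Huizhou City"), ("street", "Yanda Avenue"),
            ("house_number", "No. 46")]): (1.0, 2.0),
}


def geocode(tagged_phrases):
    """Return the coordinates matched by the key string, or None."""
    return GEOCODING_LIBRARY.get(to_key(tagged_phrases))


coords = geocode([("street", "Yanda Avenue"), ("house_number", "No. 46"),
                  ("city", "Huizhou City")])
```

  • Because the key is laid out by level rather than by input order, the same address matches regardless of the order in which its phrases were segmented.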
  • FIG. 6 is a flowchart of the geocoding method according to the second embodiment of the present application.
  • the geocoding method of the second embodiment of the present application includes the following steps:
  • The present invention uses Trillum technology, with syntax analysis and a fuzzy matching algorithm, to perform data cleaning on place name addresses.
  • the place name address can be regarded as a hierarchically scalable place name address model.
  • S22: Establish a geocoding library including a place name data table, a building data table and a door (building) plate data table according to the place name address model;
  • The table structures of the place name data table, the building data table and the door (building) plate data table can be defined according to the application scenario; all provinces, districts, counties, streets, communities, landmarks, house numbers and geographic coordinates are entered in turn according to each table structure to construct the geocoding library.
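  • Since the table structures are left to the application scenario, the following is only one hypothetical layout, sketched with Python's built-in `sqlite3` module. All table names, column names and sample rows are assumptions, and the coordinates are placeholders.

```python
import sqlite3


def build_library():
    """Create an in-memory geocoding library with three example tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE place_name (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            level TEXT NOT NULL,
            parent_id INTEGER REFERENCES place_name(id)
        );
        CREATE TABLE building (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            place_id INTEGER NOT NULL REFERENCES place_name(id)
        );
        CREATE TABLE door_plate (
            id INTEGER PRIMARY KEY,
            number TEXT NOT NULL,
            place_id INTEGER NOT NULL REFERENCES place_name(id),
            building_id INTEGER REFERENCES building(id),
            lon REAL NOT NULL,
            lat REAL NOT NULL
        );
    """)
    # Sample rows; the coordinates are placeholders, not real positions.
    conn.execute("INSERT INTO place_name VALUES (1, 'Huizhou City', 'city', NULL)")
    conn.execute("INSERT INTO place_name VALUES (2, 'Yanda Avenue', 'street', 1)")
    conn.execute("INSERT INTO building VALUES (1, 'Huizhou College', 2)")
    conn.execute("INSERT INTO door_plate VALUES (1, 'No. 46', 2, 1, 1.0, 2.0)")
    return conn


def lookup(conn, street, number):
    """Return (lon, lat) for a street name and door plate number, or None."""
    return conn.execute(
        "SELECT d.lon, d.lat FROM door_plate AS d "
        "JOIN place_name AS p ON p.id = d.place_id "
        "WHERE p.name = ? AND d.number = ?",
        (street, number),
    ).fetchone()


conn = build_library()
coords = lookup(conn, "Yanda Avenue", "No. 46")
```

  • The self-referencing `parent_id` column is one way to express the province, city, county and street hierarchy in a single place name table; a deployment could equally use one table per level.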
  • The N-shortest path improved word segmentation algorithm, which combines the dynamic deletion algorithm with the N-shortest path word segmentation algorithm, is used to segment and standardize the irregular place name address data and to divide the place name address data into at least one phrase;
  • The present invention proposes an N-shortest path improved word segmentation algorithm that combines the dynamic deletion algorithm with the N-shortest path word segmentation algorithm.
  • The basic idea of the dynamic deletion algorithm is: construct a shortest path update queue, used to store the child nodes of the deleted node; delete the node that should be deleted, together with all of its child nodes, from the original shortest path tree; select from the queue the node closest to the root node for updating; and no longer insert updated nodes into the queue, so as to reduce the number of node updates.
  • the N-shortest path improved word segmentation algorithm is shown in Figure 7, and its solution process is as follows:
  • Step 1: First, based on the N-shortest path word segmentation algorithm, construct a directed acyclic graph G with words (or characters) as nodes; the construction process of the directed acyclic graph is the same as in the first embodiment and is not repeated here;
  • Lj is used to store the shortest path, where j is a dynamic variable whose initial value can be set according to the length of the entire string; for example, for "what he said is true" the initial value of j is 8. The nodes between the combinations are deleted and the j value is updated; the j value becomes smaller and smaller until the sentence cannot be divided further.
  • Step 3: Starting from the first node in the current path, delete the first node with an in-degree greater than 1 and record the deleted node as Hm. Determine whether the descendant nodes of Hm are in the set E: if they are, calculate the shortest path from the start node V0 to Hm and record the end node of that shortest path as H'm; if they are not, delete the node Hm and all its descendant nodes from the directed acyclic graph G. Here the set E is the N-shortest path set (i.e., NSP) from V0 to Vn, which is used to determine whether the deleted node lies on a shortest path.
  • Hm and H'm represent the end node in each cycle, and H'm will be the end marker of the next cycle.
  • n is the number of shortest paths after deleting nodes
  • the value of j should be moderate, neither too large nor too small, for the first j optimal paths to be reserved.
  • S24: Convert the at least one segmented phrase into a character string in a predetermined (computer-recognizable) format according to the level elements in the place name address model, and then match the converted character string with the corresponding geographic coordinates in the geocoding library;
  • The geocoding method of the embodiment of the present application uses the N-shortest path optimization algorithm to perform word segmentation and standardization on the place name address. After the place name address is segmented according to the standardization result, the segmented address is converted into a character string that can be recognized by the computer according to the level elements in the place name address model; finally, the character string is matched with the corresponding geographic coordinates in the geocoding library, and the place name address is given standard geographic coordinates according to the matching result.
  • The present application overcomes the disadvantages of word-by-word traversal, increases practicability, and inherits the advantages of the full segmentation idea: it reduces the number of segmented phrases as much as possible while still including all results that need to be retained, which effectively avoids wasting resources and increases search efficiency.
  • FIG. 8 is a schematic structural diagram of a geocoding system according to an embodiment of the present application.
  • The geocoding system 40 of the embodiment of the present application includes:
  • Data cleaning module 41: used to clean the initial place name address data. Since text such as place names and addresses entered at the user terminal may contain typos or repeated words, and in order to prevent problems such as inconsistent character strings or spelling errors in the text data from causing subsequent character strings to be matched incorrectly with geographic coordinates, the embodiment of the present invention uses the Trillum technology, with syntax analysis and a fuzzy matching algorithm, to clean the place name address data.
  • Place name address model building module 42: used to structure the cleaned place name address data and establish a place name address model. Different countries or regions have different granularity and scope rules for representing place names and addresses, and a place name address can be regarded as a hierarchically scalable place name address model.
  • Geocoding library building module 43: used to establish, according to the place name address model, a geocoding library including a place name data table, a building data table and a door (building) plate data table. The table structures of the place name data table, the building data table and the door (building) plate data table can be defined according to the application scenario, and all provinces, districts, counties, streets, communities, landmarks, house numbers and geographic coordinates are entered in turn according to each table structure to construct the geocoding library.
  • Word segmentation and standardization processing module 44: used to perform word segmentation and standardization on the irregular place name address data by using the N-shortest path optimization algorithm based on the address dictionary, and to divide the place name address data into at least one phrase. The N-shortest path optimization algorithm is implemented as follows: the address dictionary records all place name addresses (including aliases, abbreviations, etc.) of different countries and regions. First, according to the address dictionary, the place name phrases that may appear in the place name address data are matched in order, and a directed acyclic graph is then constructed. Each phrase is a node in the directed acyclic graph and corresponds to an edge of a given length (i.e., a weight; in the non-statistical rough segmentation model all words are assumed to be equal, so for convenience of calculation the length of every word's edge is set to 1). The path value from each node to the source node is obtained, and the corresponding path set is used as the path result set of each node.
  • Let $\mathrm{Path}(i,j)$ be the set of all paths from node $V_i$ to node $V_j$;
  • $\mathrm{Length}(path)$ is the length of a path $path$, equal to the sum of the lengths of all edges in the path;
  • $LS$ is the set of lengths of all paths from $V_0$ to $V_n$ in the directed acyclic graph $G$; then:
  • $NLS$ is the set of N-shortest path lengths from $V_0$ to $V_n$; $NSP$ is the set of N-shortest paths from $V_0$ to $V_n$; $RS$ is the final result set of the N-shortest path rough segmentation.
  • The definition of $NLS$ is:
  • $|NLS| = \min(|LS|, N)$; $a \in LS - NLS,\ b \in NLS \Rightarrow a > b$
  • $NSP = \{\, path \mid path \in \mathrm{Path}(0,n),\ \mathrm{Length}(path) \in NLS \,\}$
  • $RS = \{\, w_1 w_2 \ldots w_m \mid path \in NSP \,\}$, where $w_i$ is the word corresponding to the $i$-th edge of $path$, $i = 1, 2, \ldots, m$.
  • The present invention uses the N-shortest path word segmentation algorithm to segment the place name address. This greatly reduces the number of segmentation candidates while retaining, as far as possible, all plausible segmentation results; at the same time, it reduces the search space as much as possible and improves the efficiency of word segmentation.
  • The word segmentation and standardization processing module 44 adopts the N-shortest path improved word segmentation algorithm, which combines the dynamic deletion algorithm with the N-shortest path word segmentation algorithm, to perform word segmentation and standardization, specifically:
  • Step 1: First, based on the N-shortest path word segmentation algorithm, construct a directed acyclic graph G with words as nodes;
  • Step 3: Starting from the first node in the current path, delete the first node with an in-degree greater than 1 and record the deleted node as Hm. Determine whether the descendant nodes of Hm are in the set E: if they are, calculate the shortest path from the start node V0 to Hm and record the end node of that shortest path as H'm; if they are not, delete the node Hm and all its descendant nodes from the directed acyclic graph G;
  • Coordinate matching module 45: used to convert the at least one phrase into a character string in a predetermined (computer-recognizable) format according to the level elements in the place name address model, match the converted character string with the corresponding geographic coordinates in the geocoding library, and take the geographic coordinates matched by the string as the standard geographic coordinates of the corresponding place name address.
  • FIG. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • the terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • the memory 52 stores program instructions for implementing the above-described geocoding method.
  • the processor 51 is adapted to execute program instructions stored in the memory 52 to control the geocoding.
  • The processor 51 may also be referred to as a CPU (central processing unit).
  • The processor 51 may be an integrated circuit chip with signal processing capability.
  • The processor 51 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • FIG. 10 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • The storage medium of this embodiment of the present application stores a program file 61 capable of implementing all the above methods. The program file 61 may be stored in the storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of the present invention.
  • The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disk, as well as terminal devices such as computers, servers, mobile phones and tablets.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A geocoding method and system, a terminal and a storage medium. The method comprises: building a geographical name address model according to geographical name address data; building a geocoding library according to the geographical name address model, the geocoding library including an administrative area entity data table, a street entity data table and a community entity data table; performing word segmentation and standardization on the geographical name address data on the basis of an address dictionary by using an N-shortest path optimization algorithm, to segment the geographical name address data into at least one phrase; converting the at least one phrase into a character string of a predetermined format according to level elements in the geographical name address model, matching the character string with corresponding geographical coordinates in the geocoding library, and using the geographical coordinates matched with the character string as standard geographical coordinates of corresponding geographical name addresses. The number of segmented phrases can be reduced as much as possible, and all the results that need to be retained can be included, effectively avoiding resource waste and increasing search efficiency.

Description

Geocoding method, system, terminal and storage medium
Technical Field
The present application belongs to the technical field of geocoding and relates in particular to a geocoding method, system, terminal and storage medium.
Background
As a product of combining location services with information platforms, geographic information systems (GIS) are being applied ever more widely. As geographic information technology has spread and matured, many enterprises, organizations and government departments, for example in pharmaceuticals and media, have built services based on geographic information, and the need to manage operations with its help has become increasingly prominent. However, place names and addresses in our country are semantically messy and inconsistently ordered; there is no unified rule that standardizes them. In addition, the geographic information an ordinary organization can collect usually consists only of disorganized textual descriptions of place names and addresses (non-spatial information), not spatial coordinates that can be used directly. If this non-spatial information cannot be converted into spatial coordinates, an enterprise cannot combine its thematic data with geographic information, which in turn limits applications such as visualization and functional analysis in GIS software. Therefore, only by converting location-related non-spatial information into geographic coordinates that a computer-based GIS can recognize, and thereby matching that information to the geographic coordinates of physical entities, can a geographic information system be used to full effect.
Summary of the Invention
The present application provides a geocoding method, system, terminal and storage medium, aiming to solve, at least to some extent, one of the above technical problems in the prior art.
To solve the above problems, the present application provides the following technical solutions:
A geocoding method, comprising:
establishing a place name address model from place name address data;
establishing a geocoding library according to the place name address model, the geocoding library comprising an administrative area entity data table, a street entity data table and a community entity data table;
performing word segmentation and standardization on the place name address data based on an address dictionary, using an N-shortest path optimization algorithm, so as to split the place name address data into at least one phrase;
converting the at least one phrase into a character string of a predetermined format according to the level elements in the place name address model, matching the character string against the corresponding geographic coordinates in the geocoding library, and taking the matched geographic coordinates as the standard geographic coordinates of the corresponding place name address.
The technical solution adopted in the embodiments of the present application further includes: before establishing the place name address model from the place name address data, the method further comprises:
performing data cleaning on the place name address data.
The technical solution adopted in the embodiments of the present application further includes: establishing the geocoding library according to the place name address model comprises:
defining the table structures of the administrative area entity data table, the street entity data table and the community entity data table, and entering provinces, districts and counties, streets, communities, landmarks, house numbers and geographic coordinates in turn according to those table structures to build the geocoding library.
The technical solution adopted in the embodiments of the present application further includes: performing word segmentation and standardization on the place name address data based on the address dictionary, using the N-shortest path optimization algorithm, comprises:
matching, in order and according to the address dictionary, the place name phrases in the place name address data, and constructing a directed acyclic graph in which each phrase is a node and corresponds to an edge that is assigned a length;
establishing all possible word edges of the directed acyclic graph according to preset rules, so that every word contained in the place name address data corresponds one-to-one with an edge of the directed acyclic graph; solving for the set of N-shortest paths from the start node to the end node of the graph; and segmenting the place name address data according to that set.
The technical solution adopted in the embodiments of the present application further includes: let the place name address data be the string S = c_1 c_2 … c_n, where each c_i (i = 1, 2, …, n) is a single character and n ≥ 1 is the length of the string. The directed acyclic graph G that is built has n+1 nodes, numbered V_0, V_1, V_2, …, V_n. The preset rules for establishing all possible word edges of the graph are:
a directed edge <V_{k-1}, V_k> is established between adjacent nodes V_{k-1} and V_k, with length L_k; the word corresponding to this edge defaults to c_k (k = 1, 2, …, n);
if w = c_i c_{i+1} … c_j is a word, a directed edge <V_{i-1}, V_j> is established between nodes V_{i-1} and V_j, with length L_w; the word corresponding to this edge is w (0 < i < j ≤ n).
The technical solution adopted in the embodiments of the present application further includes: solving for the set of N-shortest paths from the start node to the end node of the directed acyclic graph comprises:
Let Path(i, j) be the set of all paths from node V_i to node V_j; let Length(path) be the length of a path, equal to the sum of the lengths of all its edges; and let LS be the set of lengths of all paths from V_0 to V_n in the directed acyclic graph G. Then:
LS = {len | len = Length(path), path ∈ Path(0, n)}
Let NLS be the set of N-shortest path lengths from V_0 to V_n, NSP the set of N-shortest paths from V_0 to V_n, and RS the resulting N-shortest-path rough segmentation set. Then |NLS| = min(|LS|, N), and a > b for all a ∈ LS − NLS and b ∈ NLS; NSP = {path | path ∈ Path(0, n), Length(path) ∈ NLS}; RS = {w_1 w_2 … w_m | w_i is the word corresponding to the i-th edge of path, i = 1, 2, …, m, path ∈ NSP}; here N is the number of shortest paths.
The technical solution adopted in the embodiments of the present application further includes: performing word segmentation and standardization on the place name address data based on the address dictionary, using the N-shortest path optimization algorithm, further comprises:
computing the shortest path from the start node to the end node, denoted L_j with j = 1; if j is less than the number of shortest paths and other candidate paths exist, updating the current path L to L_j; otherwise, ending;
starting from the first node on the current path, deleting the first node whose in-degree is greater than 1 and recording the deleted node as H_m; determining whether the descendant nodes of H_m are in the set E; if they are, computing the shortest path from the start node to H_m and recording the end node of that shortest path as H'_m; if they are not, deleting H_m and all of its descendant nodes from the directed acyclic graph G; where E is the set of N-shortest paths from V_0 to V_n, H_m and H'_m each represent an end node within a given iteration, and H'_m serves as the end marker for the next iteration;
repeating the node deletion process until m is no longer less than n (m ≮ n), updating the current path, and obtaining the shortest paths from the start node V_0 to all nodes H'_m, with j = j + 1; here n is the number of shortest paths after node deletion, m is the shortest path constructed after the j-th iteration, and in each iteration m = j + 1.
Another technical solution adopted in the embodiments of the present application is a geocoding system, comprising:
a place name address model building module, configured to establish a place name address model from place name address data;
a geocoding library building module, configured to establish a geocoding library according to the place name address model, the geocoding library comprising an administrative area entity data table, a street entity data table and a community entity data table;
a word segmentation and standardization module, configured to perform word segmentation and standardization on the place name address data based on an address dictionary, using an N-shortest path optimization algorithm, so as to split the place name address data into at least one phrase;
a coordinate matching module, configured to convert the at least one phrase into a character string of a predetermined format according to the level elements in the place name address model, match the character string against the corresponding geographic coordinates in the geocoding library, and take the matched geographic coordinates as the standard geographic coordinates of the corresponding place name address.
A further technical solution adopted in the embodiments of the present application is a terminal comprising a processor and a memory coupled to the processor, wherein
the memory stores program instructions for implementing the geocoding method;
the processor is configured to execute the program instructions stored in the memory to control geocoding.
A further technical solution adopted in the embodiments of the present application is a storage medium storing program instructions executable by a processor, the program instructions being used to execute the geocoding method.
Compared with the prior art, the embodiments of the present application have the following beneficial effects. The geocoding method, system, terminal and storage medium perform word segmentation and standardization on place name addresses based on the N-shortest path optimization algorithm. After a place name address is split according to the standardization result, the split address is converted, according to the level elements in the place name address model, into a character string that a computer can recognize. The string is then matched against the corresponding geographic coordinates in the geocoding library, and the place name address is assigned standard geographic coordinates according to the match. By adding auxiliary grammatical and semantic rules to the algorithm, the present application overcomes the drawbacks of word-by-word traversal and improves practicality. It retains the advantage of the full-segmentation approach: it keeps the number of segmented phrases as small as possible while still including every result that needs to be retained, which avoids wasted resources and improves search efficiency.
Brief Description of the Drawings
FIG. 1 is a flowchart of the geocoding method of the first embodiment of the present application;
FIG. 2 is a schematic diagram of how place name addresses are expressed in an embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of the directed acyclic graph in an embodiment of the present application;
FIG. 4 is a schematic diagram of the process of solving the directed acyclic graph in an embodiment of the present application;
FIG. 5 is a schematic diagram of the predecessor record table used while solving the directed acyclic graph in an embodiment of the present application;
FIG. 6 is a flowchart of the geocoding method of the second embodiment of the present application;
FIG. 7 is a schematic diagram of the improved N-shortest path word segmentation algorithm in an embodiment of the present application;
FIG. 8 is a schematic diagram of the structure of the geocoding system in an embodiment of the present application;
FIG. 9 is a schematic diagram of the structure of the terminal in an embodiment of the present application;
FIG. 10 is a schematic diagram of the structure of the storage medium in an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and do not limit it.
To address the shortcomings of the prior art, the geocoding method of the embodiments of the present application first cleans the initial place name address data to guard against problems in the input text such as excessive wrong characters, spelling mistakes and repeated text. It then establishes a place name address model that can reflect the different ways a country or region expresses geographic names and, according to that model, builds a geocoding library containing a place name data table, a building data table and a door (building) plate data table. The N-shortest path optimization algorithm is used to segment and standardize the place name addresses. After a place name address is split according to the standardization result, the split address is converted, according to the level elements in the place name address model, into a character string that a computer can recognize, and the string is finally matched against the corresponding geographic coordinates in the geocoding library. The embodiments of the present application retain the advantage of the full-segmentation approach: they keep the number of segmented phrases as small as possible while still including every result that needs to be retained, which avoids wasted resources and improves search efficiency.
Specifically, refer to FIG. 1, which is a flowchart of the geocoding method of the first embodiment of the present application. The geocoding method of the first embodiment includes the following steps:
S10: perform data cleaning on the initial place name address data;
In this step, text data such as place names and addresses entered on the client side may contain wrong or repeated characters. So that inconsistent strings, spelling mistakes and similar problems in the text do not cause later mismatches between character strings and geographic coordinates, the embodiment of the present invention uses Trillum technology, applying syntax analysis and a fuzzy matching algorithm, to clean the place name addresses.
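The cleaning step can be illustrated with a minimal Python sketch. It is not the Trillum technology named above; it only assumes that cleaning means removing whitespace, collapsing immediately repeated text, and fuzzy-matching a misspelled name against a list of known names. The function names, the 0.7 cutoff and the sample names are all hypothetical.

```python
import difflib
import re


def collapse_repeats(text):
    """Strip whitespace and collapse immediately repeated runs of text.

    Digits are excluded so that house numbers such as "1111" are untouched.
    """
    text = re.sub(r"\s+", "", text)
    return re.sub(r"([^\d\s]{2,}?)\1+", r"\1", text)


def correct_name(name, known_names, cutoff=0.7):
    """Return the closest known name, or the input if nothing is close enough."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name


# Hypothetical list of known street names.
streets = ["体育西路", "体育东路"]
cleaned = collapse_repeats("广州市 广州市天河区")
fixed = correct_name("体育西璐", streets)
```

A real cleaning stage would need rules that are aware of address grammar; the regular expression here would, for instance, also collapse a legitimate name that happens to repeat itself.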
S11: after structuring the cleaned place name address data, establish a place name address model;
In this step, different countries or regions describe place name addresses with rules of different granularity, and the embodiment of the present application builds a scalable place name address model according to these rules. FIG. 2 is a schematic diagram of how place name addresses are expressed. In this scheme, the elements covered are the provincial, city, county and township levels of the administrative region place name, street names, residential community names, neighborhood names, door (building) addresses, landmark names or aliases, and organization names or abbreviations. The provincial level takes precedence over the city level, the city level over the county level, and the county level over the township level; street names take precedence over residential community names, which take precedence over neighborhood names; door (building) addresses take precedence over landmark names or their aliases, followed by organization names or abbreviations. Normally street names and residential community names are unique within a city, so a street or community name alone roughly fixes an address range, "street name or community name + door (building) number" pinpoints a single place, and "administrative region place name + landmark name" can in most cases also pinpoint a location. That is, when the text includes a house number, "street name or community name + door (building) number" is used to fix the location; when the place name address data contains a landmark name, "administrative region place name + landmark name" is used for precise positioning. Examples of structuring place name address data according to these granularity rules are as follows:
(1) No. 111, Tiyu West Road, Tianhe District, Guangzhou City, Guangdong Province is structured as: administrative region name + street name + door (building) number;
(2) Jianhe Center, Tiyu West Road, Tianhe District, Guangzhou City, Guangdong Province is structured as: administrative region name + street name + landmark name.
When an address could match several landmarks, the description can be extended at the granularity of the current administrative region until a unique location is determined. For example, for the place name address data "Huizhou College, No. 46 Yanda Avenue, Huizhou City", an application confined to Huizhou City can shorten it to "No. 46 Yanda Avenue" without any ambiguity. For "ICBC, Guangzhou City", however, many landmarks may be located and the results are hard to filter, so the description must be extended to a street name, community name or similar element before a particular ICBC branch can be located precisely.
S12: according to the place name address model, establish a geocoding library containing an administrative area entity data table, a street entity data table and a community entity data table;
In this step, the structures of the administrative area entity data table, the street entity data table and the community entity data table can be defined according to the application scenario, and the geocoding library is built on the following principles:
Uniqueness: every geographic entity has exactly one identifier;
Transparency: the subordination between structures can be read from the code;
Flexibility: the coding should accommodate the development and change of the objects it describes;
Standardization: the coding rules conform to the national standard system so that data can be shared.
The structures of the administrative area entity data table, the street entity data table and the community entity data table are shown in Tables 1, 2 and 3 below:
Table 1 Administrative area entity data table
Figure PCTCN2020139759-appb-000001
Table 2 Street entity data table
Figure PCTCN2020139759-appb-000002
Table 3 Community entity data table
Figure PCTCN2020139759-appb-000003
All provinces, districts and counties, streets, communities, landmarks, house numbers and geographic coordinates are entered in turn according to each table structure to build the geocoding library. The field values in each data table are selected as place name address entries and recorded in the address dictionary together with the corresponding address level. When an entry is an address alias, its standard name must also be recorded so that address elements can be normalized during address segmentation.
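The address dictionary described in this paragraph can be pictured as a mapping from every entry, alias or not, to its address level and standard name. The Python sketch below assumes that shape; the `Entry` type, the level labels and the sample rows are hypothetical and are not taken from Tables 1 to 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    level: str          # address level, e.g. "city" or "street"
    standard_name: str  # canonical name used after normalization


def build_address_dictionary(rows):
    """Build the dictionary from (name, level, aliases) rows."""
    dictionary = {}
    for name, level, aliases in rows:
        dictionary[name] = Entry(level, name)
        # An alias keeps the level but points back to the standard name.
        for alias in aliases:
            dictionary[alias] = Entry(level, name)
    return dictionary


# Illustrative rows only; "广州" is assumed here to be an alias of "广州市".
rows = [
    ("广州市", "city", ["广州"]),
    ("体育西路", "street", []),
]
address_dictionary = build_address_dictionary(rows)
```

Looking up `address_dictionary["广州"].standard_name` then yields the standard form, which is the normalization the paragraph asks for during segmentation.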
S13: based on the address dictionary, use the N-shortest path optimization algorithm to segment and standardize the irregular place name address data, splitting it into at least one phrase;
In this step, the N-shortest path optimization algorithm is implemented as follows. The address dictionary records all place name addresses (including aliases and abbreviations) of the different regions of different countries. First, the place name phrases that may occur in the place name address data are matched in order against the address dictionary. A directed acyclic graph is then constructed in which each phrase is a node and corresponds to an edge that is assigned a length (that is, a weight; in the non-statistical rough segmentation model all words are assumed to be equivalent, so for ease of computation the edge length of every word is set to 1). Over all paths from the start point to the end point of the graph, the path values from the source node to each node are computed, and the corresponding set of paths is kept as the path result set of that node.
For example, let the string to be segmented be S = c_1 c_2 … c_n, where each c_i (i = 1, 2, …, n) is a single character and n ≥ 1 is the length of the string. Build a directed acyclic graph G with n+1 nodes, numbered V_0, V_1, V_2, …, V_n. All possible word edges of G are established by the following two rules:
(1) a directed edge <V_{k-1}, V_k> is established between adjacent nodes V_{k-1} and V_k, with length L_k; the word corresponding to this edge defaults to c_k (k = 1, 2, …, n);
(2) if w = c_i c_{i+1} … c_j is a word, a directed edge <V_{i-1}, V_j> is established between nodes V_{i-1} and V_j, with length L_w; the word corresponding to this edge is w (0 < i < j ≤ n).
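The two edge rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the patented implementation: every edge length is set to 1, as in the non-statistical model described above, and the function name and the two-word dictionary are hypothetical.

```python
def build_word_graph(text, dictionary, edge_length=1):
    """Return {(i, j): (word, length)} for the edges <V_i, V_j> of the graph.

    Nodes are numbered 0..n, and edge (i, j) covers the characters text[i:j].
    """
    n = len(text)
    edges = {}
    # Rule (1): one edge per single character between adjacent nodes.
    for k in range(1, n + 1):
        edges[(k - 1, k)] = (text[k - 1], edge_length)
    # Rule (2): one edge per dictionary word of two or more characters.
    for i in range(n):
        for j in range(i + 2, n + 1):
            word = text[i:j]
            if word in dictionary:
                edges[(i, j)] = (word, edge_length)
    return edges


# Hypothetical dictionary with two address elements.
edges = build_word_graph("天河区体育西路", {"天河区", "体育西路"})
```

For this seven-character string the graph has eight nodes, seven single-character edges and two word edges, (0, 3) and (3, 7).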
Under the two rules above, every word contained in the string S corresponds one-to-one with an edge of the directed acyclic graph G; FIG. 3 is a schematic diagram of the structure of the directed acyclic graph in an embodiment of the present application. The rough word segmentation problem of the N-shortest path optimization algorithm is thus the problem of solving for the set NSP of the graph G. The graph is solved as follows:
Let Path(i, j) be the set of all paths from node V_i to node V_j; let Length(path) be the length of a path, equal to the sum of the lengths of all its edges; and let LS be the set of lengths of all paths from V_0 to V_n in the directed acyclic graph G. Then:
LS = {len | len = Length(path), path ∈ Path(0, n)}    (1)
NLS is the set of N-shortest path lengths from V_0 to V_n, NSP is the set of N-shortest paths from V_0 to V_n, and RS is the resulting N-shortest-path rough segmentation set. NLS is defined by |NLS| = min(|LS|, N), with a > b for all a ∈ LS − NLS and b ∈ NLS; NSP = {path | path ∈ Path(0, n), Length(path) ∈ NLS}; RS = {w_1 w_2 … w_m | w_i is the word corresponding to the i-th edge of path, i = 1, 2, …, m, path ∈ NSP}; here N is the number of shortest paths.
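The definitions of NLS, NSP and RS can be turned into a short Python sketch. It assumes unit edge lengths and a plain set of words as the dictionary; the function name and the four-word dictionary are hypothetical, and the sketch returns every path whose length falls in NLS rather than applying any later selection step.

```python
def n_shortest_segmentations(text, dictionary, n_best=1):
    """Return RS: the word sequences of all paths whose length is in NLS."""
    n = len(text)
    # preds[j] holds (i, word) for every edge <V_i, V_j>.
    preds = [[] for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if j == i + 1 or word in dictionary:
                preds[j].append((i, word))
    # lengths[j]: the n_best smallest distinct path lengths from V_0 to V_j.
    lengths = [[] for _ in range(n + 1)]
    lengths[0] = [0]
    for j in range(1, n + 1):
        candidates = {length + 1 for i, _ in preds[j] for length in lengths[i]}
        lengths[j] = sorted(candidates)[:n_best]

    results = []

    def backtrack(j, remaining, suffix):
        # Walk predecessors that can still reach V_0 in `remaining - 1` edges.
        if j == 0:
            results.append(suffix)
            return
        for i, word in preds[j]:
            if remaining - 1 in lengths[i]:
                backtrack(i, remaining - 1, [word] + suffix)

    for total in lengths[n]:  # lengths[n] is NLS
        backtrack(n, total, [])
    return results


# Assumed dictionary for the example sentence used in this description.
words = {"的确", "确实", "实在", "在理"}
candidates = ["|".join(p) for p in n_shortest_segmentations("他说的确实在理", words)]
```

With N = 1 and this assumed dictionary, three five-word paths tie for the shortest length, one of which is 他|说|的|确实|在理; choosing among such ties is left to the later steps of the method.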
Take solving the directed acyclic graph built for the text "他说的确实在理" ("what he said really does make sense") as an example; the solving process for this text is shown in FIG. 4. First, a greedy algorithm is used to obtain the locally optimal solution at each node: the shortest path value at each node and the node's predecessor are recorded, and if a node has two or more paths of the same length, the node's predecessor on each of those paths is recorded separately. The predecessor record table for this text is shown in FIG. 5. In (a), the predecessors are (2,1) 他 and (3,1) 他, with lengths 3 and 4 and corresponding nodes 012 and 0123. In (b), the predecessors are (4,1) 他 and (4,2) 他说, with lengths 4 and 5 and corresponding nodes 0123 and 01234. In (c), the predecessors are (4,1) 他, (5,1) 他 ((4,2) 他) and (5,2) 他说, with lengths 4, 5 and 6 and corresponding nodes 0123, 01234 and 012345. In (d), the predecessors are (6,1) 他 ((5,1) 他), (6,2) 他说 ((5,2)) and (6,3) 他说的, with lengths 5, 6 and 7 and corresponding nodes 01234, 012345 and 0123456. A backtracking algorithm then searches backwards for a better result, and the optimal segmentation of "他说的确实在理" is finally found to be "他|说|的|确实|在理|".
Based on the above, by segmenting place name addresses with the N-shortest path word segmentation algorithm, the present invention can greatly reduce the number of candidate segmentations while keeping, as far as possible, all plausible segmentation results. It avoids discarding the correct result because of the algorithm itself, and at the same time narrows the search space as much as possible and improves segmentation efficiency.
S14: Convert the at least one segmented phrase into a character string in a predetermined (machine-readable) format according to the level elements in the place-name address model, and then match the converted character string against the corresponding geographic coordinates in the geocoding library;
S15: Take the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place-name address.
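Steps S14 and S15 can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the level order in `LEVELS`, the suffix rules in `SUFFIX_TO_LEVEL`, the "/"-joined key format, the sample address, and the placeholder coordinates. The patent does not specify the predetermined string format.

```python
# Hypothetical level order; a real model's levels depend on the country or region.
LEVELS = ["province", "city", "district", "street", "number"]

# Hypothetical suffix rules that assign a segmented phrase to a level.
SUFFIX_TO_LEVEL = {
    "省": "province",
    "市": "city",
    "区": "district",
    "县": "district",
    "路": "street",
    "街": "street",
    "号": "number",
}

# Toy stand-in for the geocoding library. The address is invented and the
# coordinates are placeholders, not surveyed values.
GEOCODE_LIBRARY = {
    "广东省/深圳市/南山区/科技路/1号": (114.0, 22.5),
}


def to_key(phrases):
    """Order the phrases by level and join them into one lookup string."""
    by_level = {}
    for phrase in phrases:
        level = SUFFIX_TO_LEVEL.get(phrase[-1:])
        if level is not None:
            by_level[level] = phrase
    return "/".join(by_level[level] for level in LEVELS if level in by_level)


def geocode(phrases, library):
    """Return the (longitude, latitude) matched by the key, or None."""
    return library.get(to_key(phrases))


if __name__ == "__main__":
    segmented = ["南山区", "广东省", "深圳市", "科技路", "1号"]
    print(to_key(segmented), geocode(segmented, GEOCODE_LIBRARY))
```

Ordering by level before joining means the lookup does not depend on the order in which the user wrote the address elements.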
Refer to FIG. 6, which is a flowchart of the geocoding method according to the second embodiment of the present application. The geocoding method of the second embodiment comprises the following steps:
S20: Perform data cleaning on the initial place-name address data;
In this step, text data such as place-name addresses entered at the user end may contain wrongly written or repeated characters. To prevent inconsistent strings, spelling errors and similar problems in the text data from causing incorrect matches between the subsequent character strings and the geographic coordinates, the embodiment of the present invention uses Trillium technology, applying syntax analysis and fuzzy matching algorithms to clean the place-name address data.
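The kind of cleaning described in this step can be sketched with the Python standard library. This is not the Trillium product's API; `difflib` is swapped in as a generic fuzzy matcher, and the name list, the cutoff value, and both helper functions are assumptions for illustration.

```python
import difflib
import re

# Hypothetical entries from an address dictionary.
KNOWN_NAMES = ["南山区", "福田区", "罗湖区"]


def collapse_repeats(text):
    """Remove whitespace and collapse immediately repeated Chinese characters.

    Digits are left alone so that house numbers such as "1100号" survive.
    Place names that legitimately double a character would need a whitelist.
    """
    text = re.sub(r"\s+", "", text)
    return re.sub(r"([\u4e00-\u9fff])\1+", r"\1", text)


def correct_name(name, known_names, cutoff=0.6):
    """Return the closest known name, or the input if nothing is close enough."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name


if __name__ == "__main__":
    print(collapse_repeats("南山山区 1100号"))
    print(correct_name("南汕区", KNOWN_NAMES))
```

The cutoff trades recall against false corrections: a lower value repairs more typos but may rewrite a valid name that is simply missing from the dictionary.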
S21: After structuring the cleaned place-name address data, establish a place-name address model;
In this step, different countries or regions describe place-name addresses under rules of different granularity and scope, so the place-name address can be regarded as a hierarchically scalable place-name address model.
S22: According to the place-name address model, establish a geocoding library containing a place-name data table, a building data table and a door (building) plate data table;
In this step, the table structures of the place-name data table, the building data table and the door (building) plate data table can be defined according to the application scenario. The geocoding library is constructed by entering all provinces, districts and counties, streets, residential communities, landmarks, house numbers and geographic coordinates in turn according to each table structure.
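One possible shape for such a library is sketched below with SQLite. The table names, column names, and foreign keys are hypothetical, since the text leaves the table structures to the application scenario, and the sample row uses an invented address with placeholder coordinates.

```python
import sqlite3

# Hypothetical schema: three tables linked from coarse to fine granularity.
SCHEMA = """
CREATE TABLE place_name (
    id INTEGER PRIMARY KEY,
    province TEXT,
    district TEXT,
    street TEXT
);
CREATE TABLE building (
    id INTEGER PRIMARY KEY,
    place_id INTEGER REFERENCES place_name(id),
    community TEXT,
    landmark TEXT
);
CREATE TABLE door_plate (
    id INTEGER PRIMARY KEY,
    building_id INTEGER REFERENCES building(id),
    house_number TEXT,
    longitude REAL,
    latitude REAL
);
"""


def create_library():
    """Create an empty in-memory geocoding library."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_address(conn, province, district, street, community, landmark,
                house_number, longitude, latitude):
    """Enter one address, level by level, into the three tables."""
    cur = conn.execute(
        "INSERT INTO place_name (province, district, street) VALUES (?, ?, ?)",
        (province, district, street))
    place_id = cur.lastrowid
    cur = conn.execute(
        "INSERT INTO building (place_id, community, landmark) VALUES (?, ?, ?)",
        (place_id, community, landmark))
    building_id = cur.lastrowid
    conn.execute(
        "INSERT INTO door_plate (building_id, house_number, longitude, latitude) "
        "VALUES (?, ?, ?, ?)",
        (building_id, house_number, longitude, latitude))
    conn.commit()


def lookup(conn, street, house_number):
    """Return (longitude, latitude) for a street and house number, or None."""
    return conn.execute(
        "SELECT d.longitude, d.latitude FROM door_plate d "
        "JOIN building b ON d.building_id = b.id "
        "JOIN place_name p ON b.place_id = p.id "
        "WHERE p.street = ? AND d.house_number = ?",
        (street, house_number)).fetchone()


if __name__ == "__main__":
    library = create_library()
    # Invented address with placeholder coordinates.
    add_address(library, "广东省", "南山区", "科技路", "示例小区", "示例大厦",
                "1号", 114.0, 22.5)
    print(lookup(library, "科技路", "1号"))
```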
S23: Based on an address dictionary, apply the improved N-shortest-path word segmentation algorithm, which combines a dynamic deletion algorithm with the N-shortest-path word segmentation algorithm, to perform word segmentation and standardization on the irregular place-name address data, dividing the place-name address data into at least one phrase;
In this step, to further improve segmentation efficiency, the present invention proposes an improved N-shortest-path word segmentation algorithm that combines a dynamic deletion algorithm with the N-shortest-path word segmentation algorithm. The basic idea of the dynamic deletion algorithm is as follows: construct a shortest-path update queue that stores the child nodes of a deleted node; delete from the original shortest-path tree the node to be deleted together with all of its child nodes; select the node closest to the root node for updating in the queue; and do not re-insert updated nodes into the queue, which reduces the number of node updates. The improved N-shortest-path word segmentation algorithm is shown in FIG. 7, and its solution process is as follows:
Step 1: Based on the N-shortest-path word segmentation algorithm, construct a directed acyclic graph G with words (or characters) as nodes. The construction of the directed acyclic graph is the same as in the first embodiment and is not repeated here;
Step 2: Compute the shortest path from the start node to the end node and record it as Lj, with j=1. If j is less than the number of shortest paths (that is, j<n) and other candidate paths exist, update the current path L to Lj; otherwise, terminate. Here, candidate paths are the different paths produced by different ways of cutting the text into words.
Lj stores the shortest path, and j is a dynamic variable whose initial value can be set according to the length of the whole string. For example, for "他说的的确是在理" the initial value of j is 8. Once a word combination is completed, the nodes between the combined characters are deleted and j is updated. As segmentation of the sentence proceeds, j becomes smaller and smaller until the sentence can no longer be cut.
Step 3: Starting from the first node of the current path, delete the first node whose in-degree is greater than 1 and record the deleted node as Hm. Determine whether the descendant nodes of Hm are in the set E. If they are, compute the shortest path from the start node V0 to Hm and record the end node of that shortest path as H'm; if they are not, delete node Hm and all of its descendant nodes from the directed acyclic graph G. Here, the set E is the set of N-shortest paths from V0 to Vn (that is, NSP) and is used to determine whether the deleted node lies on a shortest path. In each iteration both Hm and H'm denote end nodes, and H'm serves as the end marker for the next iteration.
Step 4: Repeat the above process until m≮n, update the current path, and obtain the shortest paths from the start node V0 to all nodes H'm, with j=j+1. Here, n is the number of shortest paths after node deletion and m is the shortest path constructed after iteration j. In each iteration m=j+1; on entering the next iteration, Hm is updated along with m, whereas H'm does not change.
In the above solution process, the number j of best paths retained should be moderate, neither too large nor too small, so that search efficiency and accuracy are not impaired.
S24: Convert the at least one segmented phrase into a character string in a predetermined (machine-readable) format according to the level elements in the place-name address model, and then match the converted character string against the corresponding geographic coordinates in the geocoding library;
S25: Take the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place-name address.
To verify the feasibility and effectiveness of the embodiments of the present application, the scheme was tested on the ArcGIS API for JavaScript platform and its accuracy was compared with that of a traditional algorithm. The comparison results are shown in Table 1:
Table 1: Comparison of algorithm accuracy
[Table 1 appears as images in the original publication: Figure PCTCN2020139759-appb-000004 and Figure PCTCN2020139759-appb-000005]
The experimental results show that the geographic coordinate matching accuracy of the embodiment of the present application exceeds that of the traditional algorithm, and that the word segmentation speed is increased by more than a factor of two.
Based on the above, the geocoding method of the embodiment of the present application applies the N-shortest-path optimization algorithm to perform word segmentation and standardization on place-name addresses. After the place-name address is segmented according to the standardization result, the segmented address is converted, according to the level elements in the place-name address model, into a character string that a computer can recognize. Finally, the character string is matched against the corresponding geographic coordinates in the geocoding library, and the place-name address is assigned standard geographic coordinates according to the matching result. By adding auxiliary syntactic and semantic rules to the algorithm, the present application overcomes the drawbacks of word-by-word traversal and improves practicality. It retains the advantages of the full-segmentation approach: it reduces the number of segmented phrases as far as possible while still including every result that needs to be retained, which effectively avoids wasted resources and improves search efficiency.
Refer to FIG. 8, which is a schematic structural diagram of the geocoding system according to an embodiment of the present application. The geocoding system 41 of the embodiment of the present application comprises:
Data cleaning module 41: configured to perform data cleaning on the initial place-name address data. Text data such as place-name addresses entered at the user end may contain wrongly written or repeated characters. To prevent inconsistent strings, spelling errors and similar problems in the text data from causing incorrect matches between the subsequent character strings and the geographic coordinates, the embodiment of the present invention uses Trillium technology, applying syntax analysis and fuzzy matching algorithms to clean the place-name address data.
Place-name address model building module 42: configured to establish a place-name address model after structuring the cleaned place-name address data. Different countries or regions describe place-name addresses under rules of different granularity and scope, so the place-name address can be regarded as a hierarchically scalable place-name address model.
Geocoding library building module 43: configured to establish, according to the place-name address model, a geocoding library containing a place-name data table, a building data table and a door (building) plate data table. The table structures of these three tables can be defined according to the application scenario, and the geocoding library is constructed by entering all provinces, districts and counties, streets, residential communities, landmarks, house numbers and geographic coordinates in turn according to each table structure.
Word segmentation and standardization module 44: configured to perform word segmentation and standardization on the irregular place-name address data based on an address dictionary using the N-shortest-path optimization algorithm, dividing the place-name address data into at least one phrase. The N-shortest-path optimization algorithm is implemented as follows. The address dictionary records all place-name addresses (including aliases and abbreviations) of different countries and regions. First, the place-name phrases that may appear in the place-name address data are matched in order against the address dictionary. A directed acyclic graph is then constructed in which each phrase is a node and corresponds to an edge with an assigned length (that is, a weight; in the non-statistical rough segmentation model all words are assumed to be equal, so for ease of calculation the length of every word's edge is set to 1). Over all paths from the start point to the end point of the directed acyclic graph, the path value from each node to the source node is computed, and the corresponding set of paths is taken as the path result set of that node.
For example, suppose the string to be segmented is S=c1c2…cn, where ci (i=1,2,…,n) is a single character, n is the length of the string, and n≥1. Build a directed acyclic graph G with n+1 nodes, numbered V0, V1, V2, …, Vn in order. All possible word edges of G are established by the following two rules:
(1) Between adjacent nodes Vk-1 and Vk, establish a directed edge <Vk-1, Vk> whose length is Lk and whose corresponding word defaults to ck (k=1,2,…,n);
(2) If w=ci ci+1…cj is a word, establish a directed edge <Vi-1, Vj> between nodes Vi-1 and Vj whose length is Lw and whose corresponding word is w (0<i<j≤n).
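The two edge rules translate directly into code. The sketch below is an illustration under stated assumptions: the function name `build_word_edges`, the tuple layout `(start, end, word)`, and the four-word dictionary are not from the source, and edge lengths are omitted because the non-statistical model sets them all to 1.

```python
def build_word_edges(text, dictionary):
    """Return the edges (start, end, word) of the segmentation graph for `text`.

    Nodes are numbered 0..n. The edge (i-1, j) covers characters c_i..c_j in
    the 1-based indexing used by the two rules.
    """
    n = len(text)
    edges = []
    for start in range(n):
        # Rule (1): adjacent nodes are always joined by a single-character edge.
        edges.append((start, start + 1, text[start]))
        # Rule (2): every multi-character dictionary word adds a longer edge.
        for end in range(start + 2, n + 1):
            if text[start:end] in dictionary:
                edges.append((start, end, text[start:end]))
    return edges


if __name__ == "__main__":
    # Hypothetical dictionary for the example sentence.
    demo_dictionary = {"的确", "确实", "实在", "在理"}
    for edge in build_word_edges("他说的确实在理", demo_dictionary):
        print(edge)
```

Because every edge runs from a lower-numbered node to a higher-numbered one, the graph is acyclic by construction.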
Under these rules, every word contained in the string S corresponds one-to-one with an edge in the directed acyclic graph G, as shown in FIG. 3, a schematic diagram of the directed acyclic graph structure of the embodiment of the present application. The rough word segmentation problem of the N-shortest-path optimization algorithm is thus the problem of solving for the set NSP of the directed acyclic graph G. The solution process for the directed acyclic graph structure is as follows:
Let Path(i,j) be the set of all paths from node Vi to node Vj; let Length(path) be the length of the path path, equal to the sum of the lengths of all edges in path; and let LS be the set of lengths of all paths from V0 to Vn in the directed acyclic graph G. Then:
LS={len|len=Length(path), path∈Path(0,n)}  (1)
NLS is the set of N-shortest path lengths from V0 to Vn; NSP is the set of N-shortest paths from V0 to Vn; and RS is the final rough segmentation result set of the N-shortest paths. They are defined as:
|NLS|=min(|LS|,N), and b<a for every a∈LS−NLS and b∈NLS;
NSP={path|path∈Path(0,n), Length(path)∈NLS};
RS={w1w2…wm|wi is the word corresponding to the i-th edge of path, i=1,2,…,m, where path∈NSP}.
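These set definitions can be evaluated literally on a small input. The sketch below enumerates every path, which is exponential and suitable only for illustration. The function name, the return shape, the unit edge lengths, and the demo dictionary are assumptions, not the patent's implementation.

```python
def n_shortest_segmentations(text, dictionary, n_best):
    """Compute NLS and NSP for `text` directly from their definitions.

    Every edge has length 1, so Length(path) is the number of words on it.
    """
    n = len(text)
    # successors[i] lists (j, word) for every edge <V_i, V_j>.
    successors = {i: [(i + 1, text[i])] for i in range(n)}
    for i in range(n):
        for j in range(i + 2, n + 1):
            if text[i:j] in dictionary:
                successors[i].append((j, text[i:j]))

    paths = []  # Path(0, n), each path stored as its list of words

    def walk(node, words):
        if node == n:
            paths.append(list(words))
            return
        for nxt, word in successors[node]:
            words.append(word)
            walk(nxt, words)
            words.pop()

    walk(0, [])

    ls = {len(p) for p in paths}              # LS: all distinct path lengths
    nls = set(sorted(ls)[:n_best])            # NLS: the N smallest lengths
    nsp = [p for p in paths if len(p) in nls]  # NSP: paths with those lengths
    return nls, nsp


if __name__ == "__main__":
    # Hypothetical dictionary for the example sentence.
    demo_dictionary = {"的确", "确实", "实在", "在理"}
    lengths, candidates = n_shortest_segmentations("他说的确实在理", demo_dictionary, 2)
    print(sorted(lengths))
    for words in candidates:  # RS: the word sequence of each path in NSP
        print("|".join(words))
```

Raising N widens NSP with longer candidate segmentations, which is the trade-off between recall and search space that the text describes.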
Take the construction and solution of the directed acyclic graph for the text "他说的确实在理" as an example; the solution process is shown in FIG. 4. First, a greedy algorithm finds the local optimum at each node: the shortest path value at each node and the node's predecessor are recorded, and if a node is reached by two or more paths of the same length, its predecessor on each of those paths is recorded separately (the predecessor record tables are shown in FIG. 5). A backtracking algorithm then searches back toward the start node for better results, and the optimal segmentation of "他说的确实在理" is finally obtained as "他|说|的|确实|在理|".
Based on the above, the present invention segments place-name addresses with the N-shortest-path word segmentation algorithm. This greatly reduces the number of candidate segmentations while retaining, as far as possible, every plausible segmentation result, so that correct results are not discarded because of the algorithm itself. At the same time it narrows the search space as much as possible and improves segmentation efficiency.
In another embodiment of the present application, to further improve segmentation efficiency, the word segmentation and standardization module 44 performs word segmentation and standardization with the improved N-shortest-path word segmentation algorithm, which combines the dynamic deletion algorithm with the N-shortest-path word segmentation algorithm, as follows:
Step 1: Based on the N-shortest-path word segmentation algorithm, construct a directed acyclic graph G with words as nodes;
Step 2: Compute the shortest path from the start node to the end node and record it as Lj, with j=1. If j is less than the number of shortest paths and other candidate paths exist, update the current path L to Lj; otherwise, terminate;
Step 3: Starting from the first node of the current path, delete the first node whose in-degree is greater than 1 and record the deleted node as Hm. Determine whether the descendant nodes of Hm are in the set E. If they are, compute the shortest path from the start node V0 to Hm and record the end node of that shortest path as H'm; if they are not, delete node Hm and all of its descendant nodes from the directed acyclic graph G;
Step 4: Repeat the above process until m≮n, update the current path, and obtain the shortest paths from the start node V0 to all nodes H'm, with j=j+1.
Coordinate matching module 45: configured to convert the at least one phrase into a character string in a predetermined (machine-readable) format according to the level elements in the place-name address model, match the converted character string against the corresponding geographic coordinates in the geocoding library, and take the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place-name address.
Refer to FIG. 9, which is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 50 comprises a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the geocoding method described above.
The processor 51 is configured to execute the program instructions stored in the memory 52 to control geocoding.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor.
Refer to FIG. 10, which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of the embodiment of the present application stores a program file 61 capable of implementing all of the methods described above. The program file 61 may be stored in the storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, or a terminal device such as a computer, a server, a mobile phone or a tablet.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined in the present application may be implemented in other embodiments without departing from the spirit or scope of the present application. The present application is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A geocoding method, characterized by comprising:
    establishing a place-name address model according to place-name address data;
    establishing a geocoding library according to the place-name address model, the geocoding library comprising an administrative region entity data table, a street and lane entity data table, and a residential community entity data table;
    based on an address dictionary, performing word segmentation and standardization on the place-name address data using an N-shortest-path optimization algorithm, and dividing the place-name address data into at least one phrase;
    converting the at least one phrase into a character string in a predetermined format according to the level elements in the place-name address model, matching the character string against the corresponding geographic coordinates in the geocoding library, and taking the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place-name address.
  2. The geocoding method according to claim 1, characterized in that, before establishing the place-name address model according to the place-name address data, the method further comprises:
    performing data cleaning on the place-name address data.
  3. The geocoding method according to claim 1, characterized in that establishing the geocoding library according to the place-name address model comprises:
    defining the table structures of the administrative region entity data table, the street and lane entity data table, and the residential community entity data table, and constructing the geocoding library by entering provinces, districts and counties, streets, residential communities, landmarks, house numbers and geographic coordinates in turn according to the table structures.
  4. The geocoding method according to claim 3, characterized in that performing word segmentation and standardization on the place-name address data based on the address dictionary using the N-shortest-path optimization algorithm comprises:
    matching, in order against the address dictionary, the place-name phrases in the place-name address data, and constructing a directed acyclic graph in which each phrase is a node and corresponds to an edge with an assigned length;
    establishing all possible word edges of the directed acyclic graph according to preset rules, so that every word contained in the place-name geographic data corresponds one-to-one with an edge in the directed acyclic graph, solving for the set of N-shortest paths from the start node to the end node of the directed acyclic graph, and segmenting the place-name address data according to the set of N-shortest paths.
  5. The geocoding method according to claim 4, characterized in that, assuming the place-name geographic data is S=c1c2…cn, where ci (i=1,2,…,n) is a single character, n is the length of the string, and n≥1, the directed acyclic graph G that is built has n+1 nodes, numbered V0, V1, V2, …, Vn in order, and the preset rules for establishing all possible word edges of the directed acyclic graph are:
    between adjacent nodes Vk-1 and Vk, establishing a directed edge <Vk-1, Vk> whose length is Lk and whose corresponding word defaults to ck (k=1,2,…,n);
    if w=ci ci+1…cj is a word, establishing a directed edge <Vi-1, Vj> between nodes Vi-1 and Vj whose length is Lw and whose corresponding word is w (0<i<j≤n).
  6. The geocoding method according to claim 5, characterized in that solving for the set of N-shortest paths from the start node to the end node of the directed acyclic graph comprises:
    letting Path(i,j) be the set of all paths from node Vi to node Vj, Length(path) be the length of the path path, equal to the sum of the lengths of all edges in path, and LS be the set of lengths of all paths from V0 to Vn in the directed acyclic graph G, so that:
    LS={len|len=Length(path), path∈Path(0,n)}
    letting NLS be the set of N-shortest path lengths from V0 to Vn, NSP be the set of N-shortest paths from V0 to Vn, and RS be the final rough segmentation result set of the N-shortest paths, where |NLS|=min(|LS|,N) and b<a for every a∈LS−NLS and b∈NLS; NSP={path|path∈Path(0,n), Length(path)∈NLS}; RS={w1w2…wm|wi is the word corresponding to the i-th edge of path, i=1,2,…,m, where path∈NSP}; and n is the number of shortest paths.
  7. The geocoding method according to claim 6, characterized in that performing word segmentation and standardization on the place-name address data based on the address dictionary using the N-shortest-path optimization algorithm further comprises:
    computing the shortest path from the start node to the end node and recording it as Lj, with j=1, and, if j is less than the number of shortest paths and other candidate paths exist, updating the current path L to Lj, and otherwise terminating;
    starting from the first node of the current path, deleting the first node whose in-degree is greater than 1 and recording the deleted node as Hm; determining whether the descendant nodes of Hm are in the set E; if they are, computing the shortest path from the start node to Hm and recording the end node of that shortest path as H'm; if they are not, deleting node Hm and all of its descendant nodes from the directed acyclic graph G; wherein the set E is the set of N-shortest paths from V0 to Vn, both Hm and H'm denote end nodes in each iteration, and H'm serves as the end marker for the next iteration;
    repeating the node deletion process until m≮n, updating the current path, and obtaining the shortest paths from the start node V0 to all nodes H'm, with j=j+1; wherein n is the number of shortest paths after node deletion, m is the shortest path constructed after iteration j, and in each iteration m=j+1.
  8. A geocoding system, characterized by comprising:
    a place-name address model building module, configured to establish a place-name address model according to place-name address data;
    a geocoding library building module, configured to establish a geocoding library according to the place-name address model, the geocoding library comprising an administrative region entity data table, a street and lane entity data table, and a residential community entity data table;
    a word segmentation and standardization module, configured to perform, based on an address dictionary, word segmentation and standardization on the place-name address data using an N-shortest-path optimization algorithm, and to divide the place-name address data into at least one phrase;
    a coordinate matching module, configured to convert the at least one phrase into a character string in a predetermined format according to the level elements in the place-name address model, match the character string against the corresponding geographic coordinates in the geocoding library, and take the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place-name address.
  9. A terminal, characterized in that the terminal comprises a processor and a memory coupled to the processor, wherein
    the memory stores program instructions for implementing the geocoding method according to any one of claims 1 to 7;
    the processor is configured to execute the program instructions stored in the memory to control geocoding.
  10. A storage medium, characterized in that it stores program instructions executable by a processor, the program instructions being used to execute the geocoding method according to any one of claims 1 to 7.
PCT/CN2020/139759 2020-11-05 2020-12-26 Geocoding method and system, terminal and storage medium WO2022095256A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011222303.X 2020-11-05
CN202011222303.XA CN112256817A (en) 2020-11-05 2020-11-05 Geocoding method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2022095256A1 (en) 2022-05-12

Family

ID=74268299

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139759 WO2022095256A1 (en) 2020-11-05 2020-12-26 Geocoding method and system, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN112256817A (en)
WO (1) WO2022095256A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949260A (en) * 2021-03-05 2021-06-11 浪潮云信息技术股份公司 Method for accelerating conversion of unstructured enterprise address into longitude and latitude
CN112699640B (en) * 2021-03-23 2021-06-11 城云科技(中国)有限公司 Geocoding method and system based on PostgreSQL
CN113723681A (en) * 2021-08-30 2021-11-30 平安国际智慧城市科技股份有限公司 Path selection method and device, electronic equipment and readable storage medium
CN114970518B (en) * 2022-02-15 2022-12-16 北京青萌数海科技有限公司 Method and device for correcting address data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150127324A1 (en) * 2007-09-28 2015-05-07 Telogis, Inc. Natural language parsers to normalize addresses for geocoding
CN108763215A (en) * 2018-05-30 2018-11-06 中智诚征信有限公司 A kind of address storage method, device and computer equipment based on address participle
CN109145169A (en) * 2018-07-26 2019-01-04 浙江省测绘科学技术研究院 A kind of address matching method based on statistics participle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU XIAO-QIAN, HU XUE-GANG: "Research on Chinese Word Segmentation Based on N-shortest Path", JOURNAL OF ANHUI UNIVERSITY OF SCIENCE AND TECHNOLOGY(NATURAL SCIENCE), vol. 34, no. 1, 1 March 2014 (2014-03-01), pages 72 - 75, XP055931102, ISSN: 1672-1098 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115809315A (en) * 2022-11-24 2023-03-17 中科星图智慧科技安徽有限公司 Geographical name and address standardized matching algorithm
CN116910386A (en) * 2023-09-14 2023-10-20 深圳市智慧城市科技发展集团有限公司 Address completion method, terminal device and computer-readable storage medium
CN116910386B (en) * 2023-09-14 2024-02-02 深圳市智慧城市科技发展集团有限公司 Address completion method, terminal device and computer-readable storage medium

Also Published As

Publication number Publication date
CN112256817A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
WO2022095256A1 (en) Geocoding method and system, terminal and storage medium
US11550826B2 (en) Method and system for generating a geocode trie and facilitating reverse geocode lookups
WO2020228706A1 (en) Fence address-based coordinate data processing method and apparatus, and computer device
CN108959244B (en) Address word segmentation method and device
CN107145577A (en) Address standardization method, device, storage medium and computer
US10387438B2 (en) Method and apparatus for integration of community-provided place data
CN109657074B (en) News knowledge graph construction method based on address tree
WO2015027836A1 (en) Method and system for place name entity recognition
WO2016050088A1 (en) Address search method and device
US8949196B2 (en) Systems and methods for matching similar geographic objects
CN110990520B (en) Address coding method and device, electronic equipment and storage medium
EP3117344B1 (en) Density-based dynamic geohash
WO2022247165A1 (en) Coding method and apparatus for geographic location area, and method and apparatus for establishing coding model
CN112528174A (en) Address finishing and complementing method based on knowledge graph and multiple matching and application
CN111291099B (en) Address fuzzy matching method and system and computer equipment
Moura et al. Reference data enhancement for geographic information retrieval using linked data
CN113609100B (en) Data storage method, data query device and electronic equipment
CN116414824A (en) Administrative division information identification and standardization processing method, device and storage medium
WO2023087702A1 (en) Text recognition method for form certificate image file, and computing device
CN113535803B (en) Block chain efficient retrieval and reliability verification method based on keyword index
CN116414808A (en) Method, device, computer equipment and storage medium for normalizing detailed address
US20220107967A1 (en) Systems and Methods for Generating Multi-Part Place Identifiers
CN115185986A (en) Method and device for matching provincial and urban area address information, computer equipment and storage medium
CN111444299A (en) Chinese address extraction method based on address tree model
CN112949260A (en) Method for accelerating conversion of unstructured enterprise address into longitude and latitude

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20960713

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20960713

Country of ref document: EP

Kind code of ref document: A1
