CN112256817A - Geocoding method, system, terminal and storage medium - Google Patents


Info

Publication number
CN112256817A
Authority
CN
China
Prior art keywords
place name
path
node
name address
shortest path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011222303.XA
Other languages
Chinese (zh)
Inventor
钱静
彭树宏
陈朝亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011222303.XA
Priority to PCT/CN2020/139759 (WO2022095256A1)
Publication of CN112256817A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G06F16/23 Updating
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Abstract

The application relates to a geocoding method, system, terminal and storage medium. The method comprises: establishing a place name address model from place name address data; establishing a geocode library according to the place name address model, the geocode library comprising an administrative region entity data table, a street lane entity data table and a cell entity data table; based on an address dictionary, performing word segmentation and standardization on the place name address data with an N-shortest path optimization algorithm, thereby segmenting the place name address data into at least one phrase; and converting the at least one phrase into a character string in a preset format according to the level elements in the place name address model, matching the character string against the corresponding geographic coordinates in the geocode library, and taking the matched geographic coordinates as the standard geographic coordinates of the corresponding place name address. The method reduces the number of segmented phrases as far as possible while still covering all results that need to be retained, thereby avoiding wasted resources and improving search efficiency.

Description

Geocoding method, system, terminal and storage medium
Technical Field
The present application relates to the field of geocoding technologies, and in particular, to a geocoding method, a system, a terminal, and a storage medium.
Background
Geographic information systems (GIS), which combine location services with information platforms, are increasingly widely used. As geographic information technology has spread and matured, many enterprises, institutions and government departments, for example in the pharmaceutical industry and the media, have established services based on geographic information, and the demand for managing operations by means of geographic information is increasingly prominent. However, geographic information such as national place names and addresses is named with messy semantics and inconsistent word order; that is, there is no unified criterion for standardizing it. In addition, the geographic information that an ordinary department or unit can collect consists only of assorted, disordered textual descriptions of place names and addresses (non-spatial information), from which directly usable spatial coordinates cannot be obtained. If this non-spatial information cannot be converted into spatial coordinates, the enterprises concerned cannot combine their thematic data with geographic information, which in turn hampers GIS applications such as visualization and functional analysis. Therefore, a geographic information system can only be used to full effect if the non-spatial information related to a geographic position can be converted into geographic coordinates that a computer-based GIS can recognize, so that the non-spatial information is matched with the geographic coordinates of the entity.
Disclosure of Invention
The present application provides a geocoding method, system, terminal and storage medium, which aim to solve at least one of the above technical problems in the prior art to some extent.
In order to solve the above problems, the present application provides the following technical solutions:
a geocoding method comprising:
establishing a place name address model according to the place name address data;
establishing a geographic coding library according to the place name address model, wherein the geographic coding library comprises an administrative region entity data table, a street lane entity data table and a cell entity data table;
based on an address dictionary, performing word segmentation and standardization processing on the place name address data by using an N-shortest path optimization algorithm, and dividing the place name address data into at least one word group;
and converting the at least one phrase into a character string in a preset format according to the level elements in the place name address model, matching the character string with the corresponding geographic coordinate in the geographic coding library, and taking the geographic coordinate matched with the character string as the standard geographic coordinate of the corresponding place name address.
The technical scheme adopted by the embodiment of the application further comprises the following steps: before establishing a place name address model according to the place name address data, the method further comprises the following steps:
and performing data cleaning on the place name address data.
The technical scheme adopted by the embodiment of the application further comprises the following steps: the establishing of the geocode library according to the place name address model comprises the following steps:
and defining table structures of the administrative region entity data table, the street lane entity data table and the cell entity data table, and sequentially inputting provinces, counties, streets, cells, markers, house numbers and geographic coordinates according to the table structures to construct the geocode library.
The technical scheme adopted by the embodiment of the application further comprises the following steps: the method for performing word segmentation and standardization processing on the address data of the place name by using an N-shortest path optimization algorithm based on the address dictionary comprises the following steps:
matching place name phrases in the place name address data according to the address dictionary sequence, and constructing a directed acyclic graph, wherein each phrase is a node in the directed acyclic graph and corresponds to an edge with a side length;
and establishing all possible word edges of the directed acyclic graph according to a preset rule, enabling all words contained in the geographical data of the place name to correspond to the edges of the directed acyclic graph one by one, solving an N-shortest path set from a starting node to an ending node in the directed acyclic graph, and segmenting the address data of the place name according to the N-shortest path set.
The technical scheme adopted by the embodiment of the application further comprises the following steps: assume the place name address data is $S = c_1 c_2 \ldots c_n$, where each $c_i$ ($i = 1, 2, \ldots, n$) is a single character, $n$ is the length of the string and $n \geq 1$; the directed acyclic graph $G$ then has $n + 1$ nodes, numbered $V_0, V_1, V_2, \ldots, V_n$ in sequence, and the preset rules for establishing all possible word edges of the directed acyclic graph are as follows:
a directed edge $\langle V_{k-1}, V_k \rangle$ is established between each pair of adjacent nodes $V_{k-1}$ and $V_k$; the length of the edge is $L_k$, and the word corresponding to the edge defaults to $c_k$ ($k = 1, 2, \ldots, n$);
if $w = c_i c_{i+1} \ldots c_j$ is a word, a directed edge $\langle V_{i-1}, V_j \rangle$ is established between nodes $V_{i-1}$ and $V_j$; the length of the edge is $L_w$, and the word corresponding to the edge is $w$ ($0 < i < j \leq n$).
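The two edge rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_word_graph`, the adjacency-dictionary layout and the uniform `edge_length` of 1 are hypothetical choices, and the dictionary is a plain set of words rather than the address dictionary of the application.

```python
from collections import defaultdict


def build_word_graph(text, dictionary, edge_length=1):
    """Return {start_node: [(end_node, word, length), ...]} over nodes V0..Vn."""
    n = len(text)
    edges = defaultdict(list)
    # Rule 1: one edge <V(k-1), Vk> per single character c_k, k = 1..n.
    for k in range(1, n + 1):
        edges[k - 1].append((k, text[k - 1], edge_length))
    # Rule 2: one edge <V(i-1), Vj> per dictionary word c_i ... c_j with i < j.
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            word = text[i - 1:j]
            if word in dictionary:
                edges[i - 1].append((j, word, edge_length))
    return dict(edges)


# Hypothetical two-word dictionary used only for this example.
graph = build_word_graph("广东省广州市", {"广东省", "广州市"})
print(graph[0])  # [(1, '广', 1), (3, '广东省', 1)]
```

Because every edge points from a lower node index to a higher one, the node order V0, V1, ..., Vn is already a topological order of the graph.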
The technical scheme adopted by the embodiment of the application further comprises the following steps: the solving of the set of N-shortest paths from the start node to the end node in the directed acyclic graph comprises:
let $\mathrm{Path}(i, j)$ be the set of all paths from node $V_i$ to node $V_j$; $\mathrm{Length}(path)$ is the length of $path$, equal to the sum of the lengths of all edges in $path$; $LS$ is the set of lengths of all paths from $V_0$ to $V_n$ in the directed acyclic graph $G$. Then:
$$LS = \{\, len \mid len = \mathrm{Length}(path),\ path \in \mathrm{Path}(0, n) \,\}$$
let $NLS$ be the set of N-shortest path lengths from $V_0$ to $V_n$, $NSP$ the set of N-shortest paths from $V_0$ to $V_n$, and $RS$ the finally obtained coarse segmentation result set of the N-shortest paths, where $N$ is the number of shortest paths. Then:
$$NLS \subseteq LS,\quad |NLS| = \min(|LS|, N),\quad \forall a \in LS - NLS,\ \forall b \in NLS:\ a > b$$
$$NSP = \{\, path \mid path \in \mathrm{Path}(0, n),\ \mathrm{Length}(path) \in NLS \,\}$$
$$RS = \{\, w_1 w_2 \ldots w_m \mid w_i \text{ is the word corresponding to the } i\text{-th edge of } path,\ i = 1, 2, \ldots, m,\ path \in NSP \,\}$$
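The set definitions above can be transcribed almost literally into code. The following is a brute-force sketch, assuming a hypothetical toy graph for the string "abc" with the words "ab" and "bc"; it enumerates every path and is meant only to make LS, NLS, NSP and RS concrete, not to be efficient.

```python
def all_paths(edges, start, end):
    """Yield every path from start to end as a list of (word, length) pairs."""
    if start == end:
        yield []
        return
    for nxt, word, length in edges.get(start, []):
        for rest in all_paths(edges, nxt, end):
            yield [(word, length)] + rest


def path_length(path):
    """Length(path): the sum of the lengths of all edges in the path."""
    return sum(length for _, length in path)


def n_shortest_sets(edges, n, N):
    """Return (LS, NLS, NSP, RS) for the paths from V0 to Vn."""
    paths = list(all_paths(edges, 0, n))
    LS = {path_length(p) for p in paths}
    NLS = set(sorted(LS)[:N])  # the min(|LS|, N) smallest lengths
    NSP = [p for p in paths if path_length(p) in NLS]
    RS = [[word for word, _ in p] for p in NSP]
    return LS, NLS, NSP, RS


# Hypothetical toy graph: "abc" with dictionary words "ab" and "bc".
toy_edges = {
    0: [(1, "a", 1), (2, "ab", 1)],
    1: [(2, "b", 1), (3, "bc", 1)],
    2: [(3, "c", 1)],
}
LS, NLS, NSP, RS = n_shortest_sets(toy_edges, 3, N=1)
print(RS)  # [['a', 'bc'], ['ab', 'c']]
```

Note that N counts distinct path lengths, so with N = 1 both two-word segmentations are kept because they share the shortest length 2.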
The technical scheme adopted by the embodiment of the application further comprises the following steps: the performing word segmentation and standardization processing on the address data of the place name by using an N-shortest path optimization algorithm based on the address dictionary further comprises the following steps:
calculating the shortest path from the starting node to the ending node as Lj = 1; if j is less than the shortest path number and other candidate paths exist, updating the current path L to Lj, otherwise ending;
deleting the first node with the in-degree greater than 1 from the first node in the current path, recording the deleted node as Hm, judging whether the descendant node of Hm is in the set E, if so, calculating the shortest path from the starting node to Hm, and recording the ending node of the shortest path as H'm; if the node Hm is not in the set E, deleting the node Hm and all descendant nodes thereof from the directed acyclic graph G; the set E is an N-shortest path set from V0 to Vn, Hm and H'm represent end nodes in each cycle, and H'm is used as an end mark of the next cycle;
repeating the node deletion process until m is not less than n, updating the current path, and obtaining the shortest path j from the starting node V0 to all nodes H'm as j + 1; n is the shortest path number after deleting the node, m is the shortest path after constructing the j loop, and the value of m is j +1 in each loop.
Another technical scheme adopted by the embodiment of the application is as follows: a geocoding system comprising:
a place name address model building module, used for establishing a place name address model according to the place name address data;
a geocode library construction module, used for establishing a geocode library according to the place name address model, the geocode library comprising an administrative region entity data table, a street lane entity data table and a cell entity data table;
a word segmentation and standardization processing module, used for performing word segmentation and standardization processing on the place name address data by using an N-shortest path optimization algorithm based on an address dictionary, and segmenting the place name address data into at least one phrase;
a coordinate matching module, used for converting the at least one phrase into a character string in a preset format according to the level elements in the place name address model, matching the character string with the corresponding geographic coordinates in the geocode library, and taking the geographic coordinates matched with the character string as the standard geographic coordinates of the corresponding place name address.
The embodiment of the application adopts another technical scheme that: a terminal comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the geocoding method;
the processor is configured to execute the program instructions stored by the memory to control geocoding.
The embodiment of the application adopts another technical scheme that: a storage medium storing program instructions executable by a processor for performing the geocoding method.
Compared with the prior art, the embodiment of the application has the following advantages: the geocoding method, system, terminal and storage medium of the embodiment of the application perform word segmentation and standardization on the place name address based on an N-shortest path optimization algorithm; after the place name address has been segmented according to the standardization result, the segmented place name address is converted, according to the level elements in the place name address model, into a character string that a computer can recognize; finally, the character string is matched with the corresponding geographic coordinates in the geocode library, and standard geographic coordinates are assigned to the place name address according to the matching result. Auxiliary grammatical and semantic rules are added to the algorithm, which overcomes the drawback of character-by-character traversal and improves practicability. The advantages of the full segmentation approach are inherited: the number of segmented phrases is reduced as far as possible while all results that need to be retained are still covered, so that wasted resources are avoided and search efficiency is improved.
Drawings
FIG. 1 is a flow chart of a geocoding method of a first embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a location name address representation according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a directed acyclic graph structure according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a solving process of a directed acyclic graph according to an embodiment of the present application;
FIG. 5 is a diagram illustrating predecessor records in a solution process of a directed acyclic graph according to an embodiment of the present application;
FIG. 6 is a flow chart of a geocoding method of a second embodiment of the present application;
FIG. 7 is a diagram illustrating an N-shortest path improved word segmentation algorithm according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a geocoding system according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
To address the shortcomings of the prior art, the geocoding method of the embodiment of the application first cleans the initial place name address data to deal with problems such as mistyped text, misspellings and repeated text. A place name address model is then established to reflect how different countries or regions express geographic names, and a geocode library comprising a place name data table, a building data table and a house (building) number plate data table is established according to the place name address model. The N-shortest path optimization algorithm is used to perform word segmentation and standardization on the place name address; after the place name address has been segmented according to the standardization result, the segmented place name address is converted, according to the level elements in the place name address model, into a character string that a computer can recognize, and the character string is finally matched with the corresponding geographic coordinates in the geocode library. The embodiment of the application inherits the advantages of the full segmentation approach: it reduces the number of segmented phrases as far as possible while still covering all results that need to be retained, thereby avoiding wasted resources and improving search efficiency.
Specifically, please refer to fig. 1, which is a flowchart illustrating a geocoding method according to a first embodiment of the present application. The geocoding method of the first embodiment of the application comprises the following steps:
S10: performing data cleaning on the initial place name address data;
In this step, text data such as place name addresses entered by users may contain wrongly written or repeated characters. To prevent inconsistent or misspelled character strings from causing later mismatches between character strings and geographic coordinates, the embodiment of the present application uses Trillium technology to clean the place name address data by means of syntax analysis and fuzzy matching algorithms.
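As a rough illustration of fuzzy-matching-based cleaning, the sketch below uses Python's standard `difflib` module in place of the Trillium technology named above, whose interface is not described here. The function name `clean_records`, the `cutoff` value of 0.8 and the sample road names are assumptions made for the example only.

```python
import difflib


def clean_records(records, known_names, cutoff=0.8):
    """Normalize whitespace, snap near-misses to known names, drop duplicates."""
    cleaned, seen = [], set()
    for raw in records:
        text = " ".join(raw.split())  # collapse runs of whitespace
        # Fuzzy matching: replace the text by the closest known name, if any.
        match = difflib.get_close_matches(text, known_names, n=1, cutoff=cutoff)
        if match:
            text = match[0]
        if text not in seen:  # repeated entries are kept only once
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Hypothetical input containing extra spaces, a misspelling and a repeat.
known = ["Tiyu West Road", "Yanda Avenue"]
print(clean_records(["Tiyu  West Road", "Tiyu Wset Road", "Yanda Avenue"], known))
```

The cutoff trades recall for precision: a lower value corrects more typing errors but risks merging genuinely different place names.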
S11: structuring the cleaned place name address data, and establishing a place name address model;
In this step, different countries or regions express place name addresses using description rules with different granularity ranges, and the embodiment of the application establishes a scalable place name address model according to these rules. Specifically, FIG. 2 is a schematic diagram of the place name address expression. In this expression, a place name address comprises the administrative region place name (province level, city level and county level), street name, cell name, community name, house number address, marker name or its alias, and unit name or its abbreviation. The province level takes precedence over the city level, and the city level over the county level; the street name takes precedence over the cell name, the cell name over the community name, and the house number address over the marker name or its alias, followed by the unit name or its abbreviation. Generally, street names and cell names are unique within a city, so a street name or cell name roughly fixes an address range, a street name or cell name plus a house (building) number locates a place precisely, and an administrative region place name plus a marker name also locates a position with essentially full accuracy. That is, when the text to be expressed contains a house number, the position is fixed by the street name or cell name plus the house (building) number; when the place name address data contains a marker name, the position is located by the administrative region place name plus the marker name. Examples of structuring address data according to these granularity rules are as follows:
(1) the place name address data "No. 111 Tiyu West Road, Guangzhou City, Guangdong Province" is structured as: administrative region place name + street lane name + house number;
(2) the place name address data "Jianhe Center, Tiyu West Road, Tianhe District, Guangzhou City, Guangdong Province" is structured as: administrative region place name + street lane name + marker name.
When several markers share the same name, the address can be extended according to the granularity of the current administrative region until a unique place can be determined. For example, the place name address data "Huizhou Institute, Yanda 46, Huizhou City" can be simplified to "Yanda 46" when used within Huizhou City without any ambiguity; if the place name address data is "Guangzhou city business bank", several markers may be located and the results are difficult to screen, so the address needs to be extended with a street name or a cell name before a particular bank branch can be accurately located.
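The idea of extending an address with finer levels until a single place remains can be sketched as follows. The level names in `LEVELS`, the function `locate` and the records ("Street A", "Street B", "Bank X") are hypothetical placeholders, not fields or data taken from the application.

```python
# Hypothetical level elements, ordered from coarse to fine.
LEVELS = ("province", "city", "county", "street", "cell", "house_number", "marker")


def locate(records, query):
    """Return the records that agree with every level present in the query."""
    matches = list(records)
    for level in LEVELS:
        if level in query:
            matches = [r for r in matches if r.get(level) == query[level]]
    return matches


# Two hypothetical branches of the same bank in one city.
records = [
    {"city": "Guangzhou", "street": "Street A", "marker": "Bank X"},
    {"city": "Guangzhou", "street": "Street B", "marker": "Bank X"},
]

coarse = locate(records, {"city": "Guangzhou", "marker": "Bank X"})
fine = locate(records, {"city": "Guangzhou", "street": "Street B", "marker": "Bank X"})
print(len(coarse), len(fine))  # 2 1
```

A query is unambiguous exactly when one record remains; otherwise a further level such as the street name or cell name has to be supplied.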
S12: establishing a geocode base containing an administrative region entity data table, a street lane entity data table and a cell entity data table according to a place name address model;
In this step, the table structures of the administrative region entity data table, the street lane entity data table and the cell entity data table can be defined according to the application scenario, and the geocode library is established according to the following principles:
Principle of uniqueness: any geographic entity must be identified uniquely;
Principle of transparency: the dependency relationships between structures can be recognized from the code;
Principle of flexibility: the coding should accommodate the evolution of the subject;
Principle of standardization: the coding rules conform to those of the national standards body so that data can be shared.
The structure of the administrative region entity data table, the street lane entity data table and the cell entity data table are respectively shown in the following tables 1,2 and 3:
TABLE 1 Administrative region entity data table
(table image BDA0002762486870000091 in the original publication)
TABLE 2 Street lane entity data table
(table images BDA0002762486870000092 and BDA0002762486870000101 in the original publication)
TABLE 3 Cell entity data table
(table image BDA0002762486870000102 in the original publication)
All provinces, counties, streets, cells, markers, house numbers and geographic coordinates are entered in sequence according to the table structures to construct the geocode library. The field values in each data table are selected as place name address entries and recorded, together with their address levels, in an address dictionary. When an address alias is used as a place name address entry, its standard name is also recorded so that the address elements can be normalized during address word segmentation.
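A minimal sketch of a three-table geocode library and the address dictionary derived from it is given below, using an in-memory SQLite database. All table names, column names and row values are assumptions made for illustration; the actual table structures are those of Tables 1 to 3, and the coordinates 0.0 are placeholders rather than real positions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE admin_region (region_id INTEGER PRIMARY KEY, province TEXT, city TEXT, county TEXT);
CREATE TABLE street_lane (street_id INTEGER PRIMARY KEY, region_id INTEGER, street TEXT,
                          house_number TEXT, lon REAL, lat REAL);
CREATE TABLE cell (cell_id INTEGER PRIMARY KEY, region_id INTEGER, cell_name TEXT,
                   alias TEXT, marker TEXT, lon REAL, lat REAL);
""")
# Placeholder rows; names and coordinates are not real data.
conn.execute("INSERT INTO admin_region VALUES (1, 'Province P', 'City C', 'County K')")
conn.execute("INSERT INTO street_lane VALUES (1, 1, 'Street S', 'No. 1', 0.0, 0.0)")
conn.execute("INSERT INTO cell VALUES (1, 1, 'Cell Q', 'Q', 'Marker M', 0.0, 0.0)")


def build_address_dictionary(conn):
    """Map every entry (including aliases) to (address level, standard name)."""
    dictionary = {}
    sources = [
        ("admin_region", ("province", "city", "county")),
        ("street_lane", ("street", "house_number")),
        ("cell", ("cell_name", "marker")),
    ]
    for table, columns in sources:
        for row in conn.execute(f"SELECT {', '.join(columns)} FROM {table}"):
            for level, value in zip(columns, row):
                dictionary[value] = (level, value)
    # An alias is stored together with the standard name it stands for.
    query = "SELECT cell_name, alias FROM cell WHERE alias IS NOT NULL"
    for cell_name, alias in conn.execute(query):
        dictionary[alias] = ("cell_name", cell_name)
    return dictionary


address_dictionary = build_address_dictionary(conn)
print(address_dictionary["Q"])  # ('cell_name', 'Cell Q')
```

Storing the standard name next to each alias is what lets the segmentation step normalize an address element as soon as it is recognized.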
S13: based on an address dictionary, performing word segmentation and standardization processing on irregular place name address data by using an N-shortest path optimization algorithm, and segmenting the place name address data into at least one word group;
In this step, the N-shortest path optimization algorithm is implemented as follows: the address dictionary records all place name addresses (including aliases, abbreviations and the like) of different countries and regions. First, the place name phrases that may appear in the place name address data are matched in sequence against the address dictionary, and a directed acyclic graph is then constructed in which each phrase is a node and corresponds to an edge with a given edge length (that is, a weight; in the non-statistical coarse segmentation model all words are assumed to be equally weighted, so for ease of computation the edge length of every word is set to 1). Then, over all paths from the start node to the end node of the directed acyclic graph, the path length from the source node to each node is calculated, and the corresponding set of paths is kept as the path result set of that node.
For example, assume the string to be segmented is $S = c_1 c_2 \ldots c_n$, where each $c_i$ ($i = 1, 2, \ldots, n$) is a single character, $n$ is the length of the string and $n \geq 1$. A directed acyclic graph $G$ with $n + 1$ nodes is established, the nodes being numbered $V_0, V_1, V_2, \ldots, V_n$ in sequence. All possible word edges of $G$ are established by two rules:
(1) a directed edge $\langle V_{k-1}, V_k \rangle$ is established between each pair of adjacent nodes $V_{k-1}$ and $V_k$; the length of the edge is $L_k$, and the word corresponding to the edge defaults to $c_k$ ($k = 1, 2, \ldots, n$);
(2) if $w = c_i c_{i+1} \ldots c_j$ is a word, a directed edge $\langle V_{i-1}, V_j \rangle$ is established between nodes $V_{i-1}$ and $V_j$; the length of the edge is $L_w$, and the word corresponding to the edge is $w$ ($0 < i < j \leq n$).
According to the above rules, all words contained in the string S to be segmented correspond one to one with the edges of the directed acyclic graph G, as shown in FIG. 3, which is a schematic diagram of a directed acyclic graph structure according to an embodiment of the present application. The coarse word segmentation problem of the N-shortest path optimization algorithm therefore amounts to solving for the set NSP of the directed acyclic graph G. The solving process is as follows:
Let $\mathrm{Path}(i, j)$ be the set of all paths from node $V_i$ to node $V_j$; $\mathrm{Length}(path)$ is the length of $path$, equal to the sum of the lengths of all edges in $path$; $LS$ is the set of lengths of all paths from $V_0$ to $V_n$ in the directed acyclic graph $G$. Then:
$$LS = \{\, len \mid len = \mathrm{Length}(path),\ path \in \mathrm{Path}(0, n) \,\} \tag{1}$$
$NLS$ is the set of N-shortest path lengths from $V_0$ to $V_n$, $NSP$ is the set of N-shortest paths from $V_0$ to $V_n$, and $RS$ is the final coarse segmentation result set of the N-shortest paths, where $N$ is the number of shortest paths. They are defined as:
$$NLS \subseteq LS,\quad |NLS| = \min(|LS|, N),\quad \forall a \in LS - NLS,\ \forall b \in NLS:\ a > b$$
$$NSP = \{\, path \mid path \in \mathrm{Path}(0, n),\ \mathrm{Length}(path) \in NLS \,\}$$
$$RS = \{\, w_1 w_2 \ldots w_m \mid w_i \text{ is the word corresponding to the } i\text{-th edge of } path,\ i = 1, 2, \ldots, m,\ path \in NSP \,\}$$
Taking the text data "what he says is really ideal" as an example, the solving process of the directed acyclic graph constructed for it is shown in FIG. 4. First, a greedy algorithm is used to obtain the locally optimal solution at each node: the shortest path value at each node and the predecessor of the node are recorded, and if a node lies on two or more paths of the same length, its predecessor on each of those paths is recorded separately. The predecessor record table for this text data is shown in FIG. 5, in which: in (a), the predecessors (2,1) and (3,1) have lengths 3 and 4, and the corresponding nodes are 012 and 0123, respectively; in (b), the predecessors (4,1) "he" and (4,2) "he says" have lengths 4 and 5, and the corresponding nodes are 0123 and 01234, respectively; in (c), the predecessors (4,1) "he", (5,1) "he" ((4,2) "he") and (5,2) "he says" have lengths 4, 5 and 6, and the corresponding nodes are 0123, 01234 and 012345, respectively; in (d), the predecessors (6,1) "he" ((5,1) "he"), (6,2) "he said" ((5,2)) and (6,3) "he said" have lengths 5, 6 and 7, and the corresponding nodes are 01234, 012345 and 0123456, respectively. A better result is then searched for by a backtracking algorithm, and the optimal word segmentation result of the text data "what he says is really ideal" is finally solved as "he really says | really | ideal |".
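The record-then-backtrack procedure described above can be sketched as a single forward pass over the nodes followed by a recursive walk over the predecessor records. The function name `n_shortest_segmentations`, the table layout and the toy graph for "abc" are assumptions for illustration; the sketch keeps, at every node, the N smallest distinct path lengths and the predecessors that produce them.

```python
def n_shortest_segmentations(edges, n, N):
    """Segment along the N shortest distinct path lengths from V0 to Vn."""
    # table[v] maps a path length to the predecessors (u, length_at_u, word)
    # through which node v is reached with that length.
    table = [dict() for _ in range(n + 1)]
    table[0][0] = []
    for u in range(n):  # every edge goes from a lower to a higher node index
        for length_u in sorted(table[u])[:N]:
            for v, word, edge_length in edges.get(u, []):
                record = (u, length_u, word)
                table[v].setdefault(length_u + edge_length, []).append(record)

    def backtrack(v, length):
        """Rebuild every word sequence that reaches v with the given length."""
        if v == 0:
            return [[]]
        results = []
        for u, length_u, word in table[v][length]:
            for prefix in backtrack(u, length_u):
                results.append(prefix + [word])
        return results

    segmentations = []
    for length in sorted(table[n])[:N]:
        segmentations.extend(backtrack(n, length))
    return segmentations


# Hypothetical toy graph: "abc" with dictionary words "ab" and "bc".
toy_graph = {
    0: [(1, "a", 1), (2, "ab", 1)],
    1: [(2, "b", 1), (3, "bc", 1)],
    2: [(3, "c", 1)],
}
print(n_shortest_segmentations(toy_graph, 3, N=1))  # [['a', 'bc'], ['ab', 'c']]
```

Keeping only the N best lengths per node is safe here because any path whose prefix falls outside the N best lengths at an intermediate node cannot itself be among the N best lengths at a later node.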
Based on the above, the invention segments place name addresses with the N-shortest path word segmentation algorithm, which greatly reduces the number of candidate segmentations while retaining, as far as possible, all plausible segmentation results, so that correct results are not discarded because of the algorithm itself; the search space is reduced as far as possible and word segmentation efficiency is improved.
S14: converting at least one segmented phrase into a character string with a preset format (which can be identified by a computer) according to level elements in a place name address model, and then matching the converted character string with a corresponding geographic coordinate in a geographic coding library;
S15: taking the geographic coordinates matched with the character string as the standard geographic coordinates of the corresponding place name address.
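Steps S14 and S15 can be illustrated with a small sketch. The preset string format is not specified here, so the pipe-separated key built by `to_key`, the `LEVEL_ORDER` tuple, and the sample dictionary and library entries are all hypothetical; the coordinates are placeholders.

```python
# Hypothetical order of level elements in the preset string format.
LEVEL_ORDER = ("province", "city", "county", "street", "house_number", "marker")


def to_key(phrases, dictionary):
    """Convert segmented phrases into one pipe-separated string, level by level."""
    by_level = {}
    for phrase in phrases:
        if phrase in dictionary:
            level, standard_name = dictionary[phrase]
            by_level[level] = standard_name  # aliases become standard names
    return "|".join(by_level.get(level, "") for level in LEVEL_ORDER)


def geocode(phrases, dictionary, library):
    """Return the coordinates stored under the phrase key, or None if absent."""
    return library.get(to_key(phrases, dictionary))


# Placeholder dictionary and library entries, not real data.
dictionary = {
    "Province P": ("province", "Province P"),
    "City C": ("city", "City C"),
    "C": ("city", "City C"),  # alias of "City C"
    "Street S": ("street", "Street S"),
    "No. 1": ("house_number", "No. 1"),
}
library = {"Province P|City C||Street S|No. 1|": (0.0, 0.0)}

print(geocode(["Province P", "C", "Street S", "No. 1"], dictionary, library))  # (0.0, 0.0)
```

Because the key is built from standard names in a fixed level order, the same place is found regardless of the order or aliases used in the input phrases.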
Please refer to fig. 6, which is a flowchart illustrating a geocoding method according to a second embodiment of the present application. The geocoding method of the second embodiment of the present application includes the steps of:
S20: performing data cleaning on the initial place name address data;
In this step, text data such as place name addresses entered by users may contain wrongly written or repeated characters. To prevent inconsistent or misspelled character strings from causing later mismatches between character strings and geographic coordinates, the embodiment of the present application uses Trillium technology to clean the place name address data by means of syntax analysis and fuzzy matching algorithms.
S21: structuring the cleaned place name address data, and establishing a place name address model;
In this step, different countries or regions express place name addresses under rules with different granularity ranges, so the place name address can be regarded as a hierarchically scalable place name address model.
S22: establishing a geocode library containing a place name data table, a building data table and a house (building) number plate data table according to the place name address model;
In this step, the table structures of the place name data table, the building data table and the house (building) number plate data table can be defined according to the application scenario, and all provinces, counties, streets, cells, markers, house numbers and geographic coordinates are entered in sequence according to the table structures to construct the geocode library.
S23: based on an address dictionary, performing word segmentation and standardization on irregular place name address data by using an improved N-shortest path word segmentation algorithm, which combines a dynamic deletion algorithm with the N-shortest path word segmentation algorithm, and segmenting the place name address data into at least one word group;
In this step, to further improve word segmentation efficiency, the invention provides an improved N-shortest path word segmentation algorithm that combines a dynamic deletion algorithm with the N-shortest path word segmentation algorithm. The basic idea of the dynamic deletion algorithm is as follows: construct a shortest path update queue that stores the child nodes of the deleted node; delete the node to be deleted and all of its child nodes from the original shortest path tree; then select from the queue the node closest to the root node for updating, and do not re-insert an updated node into the queue, so as to reduce the number of node updates. The improved N-shortest path word segmentation algorithm is shown in Fig. 7, and the solving process is as follows:
The first step: constructing a directed acyclic graph G with words (or characters) as nodes, based on the N-shortest path word segmentation algorithm; the construction process is the same as in the first embodiment and is not repeated here;
The second step: calculating the shortest path from the start node to the end node, recorded as Lj with j = 1; if j is less than the number of shortest paths (that is, j < n) and other candidate paths exist, updating the current path L to Lj; otherwise, ending. The candidate paths are the different paths produced by different ways of cutting the word groups.
Lj stores the shortest path, where j is a dynamic variable whose initial value can be set according to the length of the whole character string. For example, for "what he says is really ideal" the initial value of j is 8; each time a word combination is completed, the nodes inside the combination are deleted and the value of j is updated. As segmentation proceeds, j becomes smaller and smaller until the sentence can no longer be cut.
The third step: starting from the first node of the current path, deleting the first node whose in-degree is greater than 1 and recording the deleted node as Hm; judging whether the descendant nodes of Hm are in the set E; if so, calculating the shortest path from the start node V0 to Hm and recording the end node of that shortest path as H'm; if the node Hm is not in the set E, deleting the node Hm and all of its descendant nodes from the directed acyclic graph G. Here the set E is the N-shortest path set (that is, NSP) from V0 to Vn and is used to judge whether the deleted node lies on a shortest path. Hm and H'm denote the end nodes in each cycle, and H'm serves as the end mark of the next cycle.
The fourth step: repeating the above process until m is not less than n, updating the current path, obtaining the shortest path from the start node V0 to all nodes H'm, and setting j = j + 1. Here n is the number of shortest paths after node deletion and m is the shortest path constructed after the j-th cycle; in each cycle the value of m is j + 1, and on entering the next cycle Hm is updated with the value of m while H'm remains unchanged.
In the above solving process, the number j of optimal paths to be retained should be moderate: a value that is too large or too small harms search efficiency and accuracy.
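The node-selection criterion of the third step (the first node on the current path whose in-degree exceeds 1) and the removal of a node from the graph can be sketched as follows. This is only a fragment under stated assumptions: the graph is a plain dictionary from node index to successor indices, the function names are hypothetical, and the membership test against the set E and the recomputation of the shortest path are not shown.

```python
from collections import Counter


def first_multi_indegree_node(dag, path):
    """Return the first node on `path` whose in-degree in `dag` exceeds 1."""
    indegree = Counter(dst for successors in dag.values() for dst in successors)
    for node in path:
        if indegree[node] > 1:
            return node
    return None


def remove_node(dag, node):
    """Return a copy of `dag` without `node` and without edges into it."""
    return {
        src: [dst for dst in successors if dst != node]
        for src, successors in dag.items()
        if src != node
    }


# Toy graph: node 2 is reachable from 0 and 1, node 4 from 1 and 3.
dag = {0: [1, 2], 1: [2, 4], 2: [3], 3: [4], 4: []}
current_path = [0, 2, 3, 4]

candidate = first_multi_indegree_node(dag, current_path)
print(candidate)
print(remove_node(dag, candidate))
```

A node with in-degree greater than 1 is a point where alternative word boundaries converge, which is why deleting it forces the search onto a different candidate path.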
S24: converting the at least one segmented word group into a character string in a preset (computer-recognizable) format according to the level elements in the place name address model, and then matching the converted character string with the corresponding geographic coordinates in the geocode library;
S25: taking the geographic coordinates matched with the character string as the standard geographic coordinates of the corresponding place name address.
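Steps S24 and S25 can be pictured with the following minimal sketch. The preset string format (level elements joined by "/"), the suffix-based assignment of phrases to levels, the function names and the placeholder coordinates are all assumptions for illustration; the application does not specify them.

```python
# Hypothetical mapping from a phrase's last character to its level element.
SUFFIX_TO_LEVEL = {
    "省": "province",
    "市": "city",
    "区": "district",
    "路": "street",
    "号": "house_number",
}
LEVEL_ORDER = ("province", "city", "district", "street", "house_number")


def to_standard_string(phrases):
    """Arrange segmented phrases by level and join them in a fixed order."""
    by_level = {}
    for phrase in phrases:
        level = SUFFIX_TO_LEVEL.get(phrase[-1:])
        if level is not None:
            by_level[level] = phrase
    return "/".join(by_level.get(level, "") for level in LEVEL_ORDER)


def match_coordinates(phrases, geocode_library):
    """Look up the standardized string; return (lon, lat) or None."""
    return geocode_library.get(to_standard_string(phrases))


# Placeholder library entry; the coordinates are not real data.
geocode_library = {"广东省/深圳市/南山区/科技路/1号": (1.0, 2.0)}
phrases = ["广东省", "深圳市", "南山区", "科技路", "1号"]
print(to_standard_string(phrases))
print(match_coordinates(phrases, geocode_library))
```

Fixing the order of level elements makes the key independent of the order in which phrases appear in the input, so differently written addresses can resolve to the same library entry.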
To verify the feasibility and effectiveness of the embodiment of the present application, the scheme was tested on the ArcGIS API for JavaScript platform and its accuracy was compared with that of a traditional algorithm; the comparison results are shown in Table 1:
Table 1. Algorithm accuracy comparison (provided as an image, BDA0002762486870000151, in the original publication)
Experimental results show that the accuracy of geographic coordinate matching by adopting the embodiment of the application exceeds that of a traditional algorithm, and the word segmentation speed is increased by more than two times.
Based on the above, the geocoding method of the embodiment of the present application performs word segmentation and standardization on the place name address by using the N-shortest path optimization algorithm, segments the place name address according to the standardization result, converts the segmented place name address into a computer-recognizable character string according to the level elements in the place name address model, matches the character string with the corresponding geographic coordinates in the geocode library, and assigns standard geographic coordinates to the place name address according to the matching result. By adding auxiliary grammatical and semantic rules to the algorithm, the embodiment overcomes the drawback of word-by-word traversal and improves practicability. It also inherits the advantages of the full segmentation approach: the number of segmented word groups is reduced as far as possible while all results that need to be retained are still included, which avoids wasted resources and improves search efficiency.
Please refer to Fig. 8, which is a schematic structural diagram of a geocoding system according to an embodiment of the present application. The geocoding system 40 of the embodiment of the present application includes:
The data cleaning module 41: configured to perform data cleaning on the initial place name address data. Text data such as a place name address entered at the user side may contain wrongly written or repeated characters; to prevent inconsistent or misspelled character strings from causing later mismatches between character strings and geographic coordinates, the embodiment of the present invention uses the Trillium technology and cleans the place name address data with syntax analysis and fuzzy matching algorithms.
The place name address model building module 42: configured to structure the cleaned place name address data and then establish a place name address model. Different countries or regions express place name addresses with different granularity rules, so the place name address is treated as a place name address model with a scalable hierarchy.
The geocode library construction module 43: configured to establish, according to the place name address model, a geocode library containing a place name data table, a building data table and a door (building) plate data table. The table structures of the place name data table, the building data table and the door (building) plate data table can be defined according to the application scenario, and provinces, counties, streets, residential communities, landmarks, house numbers and geographic coordinates are entered in sequence according to these table structures to construct the geocode library.
The word segmentation and normalization processing module 44: configured to perform, based on an address dictionary, word segmentation and standardization on irregular place name address data by using the N-shortest path optimization algorithm, and to segment the place name address data into at least one word group. The N-shortest path optimization algorithm is implemented as follows. The address dictionary records all place name addresses (including aliases, abbreviations and the like) of different countries and regions. First, the place name phrases that may appear in the place name address data are matched in sequence against the address dictionary. A directed acyclic graph is then constructed, in which each phrase is a node in the directed acyclic graph and corresponds to an edge with a given length (that is, a weight; in the non-statistical rough segmentation model all words are assumed to be equally weighted, and for ease of calculation the length of the edge corresponding to every word is set to 1). Finally, for all paths from the start point to the end point of the directed acyclic graph, the path value from each node to the source node is calculated, together with the corresponding set of paths, as the path result set of that node.
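The dictionary matching and graph construction just described can be sketched as follows, assuming unit edge lengths as in the non-statistical model. The function name, the `max_word_len` limit and the tiny sample dictionary are assumptions introduced for illustration.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, str, int]  # (end node index, word on the edge, edge length)


def build_word_dag(text: str, dictionary: Set[str],
                   max_word_len: int = 8) -> Dict[int, List[Edge]]:
    """Build a word graph with len(text) + 1 nodes and unit-length edges."""
    n = len(text)
    dag: Dict[int, List[Edge]] = {i: [] for i in range(n + 1)}
    for start in range(n):
        # Every single character gets an edge between adjacent nodes.
        dag[start].append((start + 1, text[start], 1))
        # Every dictionary word of two or more characters gets its own edge.
        longest_end = min(n, start + max_word_len)
        for end in range(start + 2, longest_end + 1):
            word = text[start:end]
            if word in dictionary:
                dag[start].append((end, word, 1))
    return dag


# Hypothetical address dictionary with a full name and a shortened form.
address_dictionary = {"深圳", "深圳市", "南山", "南山区"}
dag = build_word_dag("深圳市南山区", address_dictionary)
for node, edges in dag.items():
    print(node, edges)
```

Because every edge points from a lower node index to a higher one, the graph is acyclic by construction and the node indices already form a topological order.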
For example, assume a character string to be segmented S = c1c2…cn, where ci (i = 1, 2, …, n) is a single character, n is the length of the string, and n ≥ 1. A directed acyclic graph G with n + 1 nodes is established, with the nodes numbered V0, V1, V2, …, Vn in sequence. All possible word edges of G are established by the following two rules:
(1) a directed edge <Vk-1, Vk> is established between each pair of adjacent nodes Vk-1 and Vk; the length of the edge is Lk, and the word corresponding to the edge defaults to ck (k = 1, 2, …, n);
(2) if w = ci ci+1 … cj (the consecutive characters from ci to cj) is a word, a directed edge <Vi-1, Vj> is established between nodes Vi-1 and Vj; the length of the edge is Lw, and the word corresponding to the edge is w (0 < i < j ≤ n).
According to the above rules, all words contained in the string S to be segmented correspond one-to-one with edges in the directed acyclic graph G, as shown in Fig. 3, which is a schematic diagram of a directed acyclic graph structure according to an embodiment of the present application. The word rough segmentation problem of the N-shortest path optimization algorithm is then to solve for the set NSP of the directed acyclic graph G. The solving process is as follows:
Let Path(i, j) be the set of all paths from node Vi to node Vj; let Length(path) be the length of a path, equal to the sum of the lengths of all edges in the path; and let LS be the set of lengths of all paths from V0 to Vn in the directed acyclic graph G. Then:
LS={len|len=Length(path),path∈Path(0,n)} (1)
NLS is the set of N-shortest path lengths from V0 to Vn; NSP is the set of N-shortest paths from V0 to Vn; RS is the final set of N-shortest path rough segmentation results. They are defined as:
|NLS| = min(|LS|, N), and for all a ∈ LS − NLS and b ∈ NLS, a > b
NSP = {path | path ∈ Path(0,n), Length(path) ∈ NLS}
RS = {w1w2…wm | wi is the word corresponding to the i-th edge of path, i = 1, 2, …, m, where path ∈ NSP}
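The definitions of LS, NLS, NSP and RS can be checked on a small graph with the following sketch. It enumerates every path from V0 to Vn directly, which mirrors the definitions but grows exponentially and is meant only as an illustration; the function names and the toy graph (for a string "abcd" with hypothetical dictionary words "ab" and "bcd") are assumptions.

```python
def enumerate_paths(dag, node, end):
    """Yield every path from `node` to `end` as a list of (word, length) edges."""
    if node == end:
        yield []
        return
    for next_node, word, length in dag[node]:
        for rest in enumerate_paths(dag, next_node, end):
            yield [(word, length)] + rest


def path_length(path):
    """Length(path): the sum of the lengths of all edges on the path."""
    return sum(length for _, length in path)


def n_shortest_results(dag, n_shortest):
    """Return (NLS, RS) for the word graph `dag`."""
    end = max(dag)
    paths = list(enumerate_paths(dag, 0, end))          # Path(0, n)
    ls = {path_length(p) for p in paths}                # LS
    nls = set(sorted(ls)[:n_shortest])                  # N smallest lengths
    nsp = [p for p in paths if path_length(p) in nls]   # NSP
    rs = [[word for word, _ in p] for p in nsp]         # RS
    return nls, rs


# Toy graph: node -> list of (end node, word, edge length).
dag = {
    0: [(1, "a", 1), (2, "ab", 1)],
    1: [(2, "b", 1), (4, "bcd", 1)],
    2: [(3, "c", 1)],
    3: [(4, "d", 1)],
    4: [],
}
print(n_shortest_results(dag, 2))
```

With N = 2 the sketch keeps the segmentations of length 2 and 3 and drops the character-by-character segmentation of length 4, which is the pruning effect the definitions are intended to produce.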
Taking the text data "what he says is really ideal" as an example, the directed acyclic graph is constructed and solved as shown in Fig. 4. First, a greedy algorithm is used to obtain a locally optimal solution at each node: the shortest path value at each node and the predecessor of the node are recorded, and if a node has two or more paths of the same length, the predecessor of the node on each of those paths is recorded separately (the predecessor record table is shown in Fig. 4). A backtracking algorithm then searches backwards for better results, and the optimal word segmentation result for the text data is finally obtained.
Based on the above, the invention segments the place name address with the N-shortest path word segmentation algorithm, which greatly reduces the number of segmentation candidates while retaining, as far as possible, all plausible segmentation results. This avoids discarding the correct result because of limitations of the algorithm itself, keeps the search space as small as possible, and improves word segmentation efficiency.
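The predecessor recording and backtracking described here can be sketched for the simplest case, N = 1 with unit edge lengths. The function name and the sample dictionary are assumptions, and the code is an illustration of the idea rather than the procedure shown in Fig. 4.

```python
def shortest_segmentations(text, dictionary):
    """Return every segmentation of `text` that uses the fewest words."""
    n = len(text)
    infinity = float("inf")
    dist = [infinity] * (n + 1)          # shortest path value at each node
    preds = [[] for _ in range(n + 1)]   # predecessor record table
    dist[0] = 0
    # Nodes are visited in index order, which is a topological order here.
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if end - start == 1 or word in dictionary:
                candidate = dist[start] + 1
                if candidate < dist[end]:
                    dist[end] = candidate
                    preds[end] = [start]
                elif candidate == dist[end]:
                    # Tie: record the predecessor of this path as well.
                    preds[end].append(start)

    def backtrack(end):
        """Rebuild all word sequences by following recorded predecessors."""
        if end == 0:
            return [[]]
        return [
            words + [text[start:end]]
            for start in preds[end]
            for words in backtrack(start)
        ]

    return backtrack(n)


# Hypothetical dictionary; "abc" has two equally short segmentations.
print(shortest_segmentations("abc", {"ab", "bc"}))
print(shortest_segmentations("abcd", {"ab", "bcd"}))
```

Recording every tied predecessor, instead of only the first, is what lets the backtracking step return all equally short segmentations rather than an arbitrary one.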
In another embodiment of the present application, to further improve word segmentation efficiency, the word segmentation and normalization processing module 44 performs word segmentation and normalization by using an improved N-shortest path word segmentation algorithm that combines a dynamic deletion algorithm with the N-shortest path word segmentation algorithm, specifically:
The first step: constructing a directed acyclic graph G with words as nodes, based on the N-shortest path word segmentation algorithm;
The second step: calculating the shortest path from the start node to the end node, recorded as Lj with j = 1; if j is less than the number of shortest paths and other candidate paths exist, updating the current path L to Lj; otherwise, ending;
The third step: starting from the first node of the current path, deleting the first node whose in-degree is greater than 1 and recording the deleted node as Hm; judging whether the descendant nodes of Hm are in the set E; if so, calculating the shortest path from the start node V0 to Hm and recording the end node of that shortest path as H'm; if the node Hm is not in the set E, deleting the node Hm and all of its descendant nodes from the directed acyclic graph G;
The fourth step: repeating the above process until m is not less than n, updating the current path, obtaining the shortest path from the start node V0 to all nodes H'm, and setting j = j + 1.
The coordinate matching module 45: configured to convert the at least one word group into a character string in a preset (computer-recognizable) format according to the level elements in the place name address model, match the converted character string with the corresponding geographic coordinates in the geocode library, and take the geographic coordinates matched with the character string as the standard geographic coordinates of the corresponding place name address.
Please refer to fig. 9, which is a schematic diagram of a terminal structure according to an embodiment of the present application. The terminal 50 comprises a processor 51, a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the geocoding methods described above.
The processor 51 is operative to execute program instructions stored in the memory 52 to control geocoding.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general purpose processor may be a microprocessor or any conventional processor.
Please refer to Fig. 10, which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium stores a program file 61 capable of implementing all of the methods described above. The program file 61 may be stored in the storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or terminal devices such as a computer, a server, a mobile phone or a tablet.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A geocoding method, comprising:
establishing a place name address model according to the place name address data;
establishing a geographic coding library according to the place name address model, wherein the geographic coding library comprises an administrative region entity data table, a street lane entity data table and a cell entity data table;
based on an address dictionary, performing word segmentation and standardization processing on the place name address data by using an N-shortest path optimization algorithm, and dividing the place name address data into at least one word group;
and converting the at least one phrase into a character string in a preset format according to the level elements in the place name address model, matching the character string with the corresponding geographic coordinate in the geographic coding library, and taking the geographic coordinate matched with the character string as the standard geographic coordinate of the corresponding place name address.
2. The geocoding method of claim 1, wherein prior to establishing a place name address model from place name address data, further comprising:
and performing data cleaning on the place name address data.
3. The geocoding method of claim 1, wherein establishing a geographic coding library according to the place name address model comprises:
and defining table structures of the administrative region entity data table, the street lane entity data table and the cell entity data table, and sequentially inputting provinces, counties, streets, cells, markers, house numbers and geographic coordinates according to the table structures to construct the geocode library.
4. The geocoding method of claim 3, wherein the performing word segmentation and standardization processing on the place name address data by using an N-shortest path optimization algorithm based on an address dictionary comprises:
matching place name phrases in the place name address data according to the address dictionary sequence, and constructing a directed acyclic graph, wherein each phrase is a node in the directed acyclic graph and corresponds to an edge with a side length;
and establishing all possible word edges of the directed acyclic graph according to a preset rule, so that all words contained in the place name address data correspond one-to-one with the edges of the directed acyclic graph, solving for the N-shortest path set from a starting node to an ending node in the directed acyclic graph, and segmenting the place name address data according to the N-shortest path set.
5. The geocoding method of claim 4, wherein the place name address data is S = c1c2…cn, where ci (i = 1, 2, …, n) is a single character, n is the length of the character string, and n ≥ 1; the created directed acyclic graph G has n + 1 nodes, numbered V0, V1, V2, …, Vn in sequence; and the preset rule for establishing all possible word edges of the directed acyclic graph is:
a directed edge <Vk-1, Vk> is established between each pair of adjacent nodes Vk-1 and Vk; the length of the edge is Lk, and the word corresponding to the edge defaults to ck (k = 1, 2, …, n);
if w = ci ci+1 … cj (the consecutive characters from ci to cj) is a word, a directed edge <Vi-1, Vj> is established between nodes Vi-1 and Vj; the length of the edge is Lw, and the word corresponding to the edge is w (0 < i < j ≤ n).
6. The geocoding method of claim 5, wherein solving for a set of N-shortest paths from a starting node to an ending node in the directed acyclic graph comprises:
let Path (i, j) be the set of all paths from node Vi to node Vj; length (path) is the length of path, and length (path) has a value equal to the sum of the lengths of all edges in path; LS is the length set of all paths from V0 to Vn in the directed acyclic graph G, and then:
LS={len|len=Length(path),path∈Path(0,n)}
letting NLS be the set of N-shortest path lengths from V0 to Vn, NSP be the set of N-shortest paths from V0 to Vn, and RS be the finally obtained set of N-shortest path rough segmentation results, where |NLS| = min(|LS|, N); for all a ∈ LS − NLS and b ∈ NLS, a > b; NSP = {path | path ∈ Path(0,n), Length(path) ∈ NLS}; RS = {w1w2…wm | wi is the word corresponding to the i-th edge of path, i = 1, 2, …, m, where path ∈ NSP}; and n is the number of shortest paths.
7. The geocoding method of claim 6, wherein the performing word segmentation and standardization processing on the place name address data by using an N-shortest path optimization algorithm based on an address dictionary further comprises:
calculating the shortest path from the starting node to the ending node, recorded as Lj with j = 1; if j is less than the number of shortest paths and other candidate paths exist, updating the current path L to Lj; otherwise, ending;
starting from the first node of the current path, deleting the first node whose in-degree is greater than 1 and recording the deleted node as Hm; judging whether the descendant nodes of Hm are in the set E; if so, calculating the shortest path from the starting node to Hm and recording the ending node of that shortest path as H'm; if the node Hm is not in the set E, deleting the node Hm and all of its descendant nodes from the directed acyclic graph G; wherein the set E is the N-shortest path set from V0 to Vn, Hm and H'm denote the end nodes in each cycle, and H'm serves as the end mark of the next cycle;
repeating the node deletion process until m is not less than n, updating the current path, obtaining the shortest path from the starting node V0 to all nodes H'm, and setting j = j + 1; wherein n is the number of shortest paths after node deletion, m is the shortest path constructed after the j-th cycle, and the value of m is j + 1 in each cycle.
8. A geocoding system, comprising:
a place name address model building module: configured to establish a place name address model according to place name address data;
a geocode library construction module: configured to establish a geographic coding library according to the place name address model, wherein the geographic coding library comprises an administrative region entity data table, a street lane entity data table and a cell entity data table;
a word segmentation and standardization processing module: configured to perform, based on an address dictionary, word segmentation and standardization processing on the place name address data by using an N-shortest path optimization algorithm, and to segment the place name address data into at least one word group;
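a coordinate matching module: configured to convert the at least one phrase into a character string in a preset format according to the level elements in the place name address model, match the character string with the corresponding geographic coordinate in the geographic coding library, and take the geographic coordinate matched with the character string as the standard geographic coordinate of the corresponding place name address.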
9. A terminal, comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the geocoding method of any one of claims 1-7;
the processor is configured to execute the program instructions stored by the memory to control geocoding.
10. A storage medium having stored thereon program instructions executable by a processor to perform the geocoding method of any one of claims 1 to 7.
CN202011222303.XA 2020-11-05 2020-11-05 Geocoding method, system, terminal and storage medium Pending CN112256817A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011222303.XA CN112256817A (en) 2020-11-05 2020-11-05 Geocoding method, system, terminal and storage medium
PCT/CN2020/139759 WO2022095256A1 (en) 2020-11-05 2020-12-26 Geocoding method and system, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN112256817A true CN112256817A (en) 2021-01-22

Family

ID=74268299

Country Status (2)

Country Link
CN (1) CN112256817A (en)
WO (1) WO2022095256A1 (en)

Also Published As

Publication number Publication date
WO2022095256A1 (en) 2022-05-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination