WO2022095256A1 - Geocoding method and system, terminal and storage medium

Info

Publication number: WO2022095256A1
Application number: PCT/CN2020/139759
Authority: WO
Inventors: 钱静; 彭树宏; 陈朝亮
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-11-05
Filing date: 2020-12-26
Publication date: 2022-05-12
Also published as: CN112256817A

Abstract

A geocoding method and system, a terminal and a storage medium. The method comprises: building a geographical name address model according to geographical name address data; building a geocoding library according to the geographical name address model, the geocoding library including an administrative area entity data table, a street entity data table and a community entity data table; performing word segmentation and standardization on the geographical name address data on the basis of an address dictionary by using an N-shortest path optimization algorithm, to segment the geographical name address data into at least one phrase; converting the at least one phrase into a character string of a predetermined format according to level elements in the geographical name address model, matching the character string with corresponding geographical coordinates in the geocoding library, and using the geographical coordinates matched with the character string as standard geographical coordinates of corresponding geographical name addresses. The number of segmented phrases can be reduced as much as possible, and all the results that need to be retained can be included, effectively avoiding resource waste and increasing search efficiency.

Description

A geocoding method, system, terminal and storage medium

Technical field

The present application belongs to the technical field of geocoding, and in particular, relates to a geocoding method, system, terminal and storage medium.

Background art

As the product of combining location services with information platforms, geographic information systems (GIS) are being applied ever more widely. As geographic information technology has spread and matured, many enterprises, institutions and government departments, for example in pharmaceuticals and media, have built business on geographic information, and the demand for managing and operating with its help has become increasingly prominent. However, geographic information such as national place names and addresses is named with messy semantics and inconsistent word order; that is, there is no unified standard for it. In addition, the geographic information that ordinary departments and units can collect is only textual description information (non-spatial information) of various disorganized place names and addresses, from which directly usable spatial coordinate information cannot be obtained. If this non-spatial information cannot be converted into spatial coordinate information, the relevant enterprises cannot combine their thematic data with geographic information, which indirectly affects visualization and functional analysis in GIS software. Therefore, the value of a geographic information system can be fully realized only by converting non-spatial information related to geographic location into computer-recognizable GIS geographic coordinates and matching the non-spatial information with physical geographic coordinates.

SUMMARY OF THE INVENTION

The present application provides a geocoding method, system, terminal and storage medium, aiming to solve one of the above technical problems in the prior art at least to a certain extent.

In order to solve the above problems, the application provides the following technical solutions:

A geocoding method that includes:

Establish a place name address model according to the place name address data;

A geocoding library is established according to the place name and address model, and the geocoding library includes an administrative area entity data table, a street and alley entity data table, and a community entity data table;

Based on the address dictionary, using the N-shortest path optimization algorithm to perform word segmentation and standardization on the place name address data, and divide the place name address data into at least one phrase;

Convert the at least one phrase into a character string in a predetermined format according to the level elements in the place name address model, match the character string with the corresponding geographic coordinates in the geocoding library, and use the geographic coordinates matched with the character string as the standard geographic coordinates of the corresponding place name address.

The technical solution adopted in the embodiment of the present application further includes: before the establishment of the place-name-address model according to the place-name-address data, the following further includes:

Data cleaning is performed on the place name and address data.

The technical solution adopted in the embodiment of the present application further includes: the establishing of the geographic coding library according to the place name and address model includes:

Define the table structures of the administrative area entity data table, the street and lane entity data table, and the community entity data table, and enter provinces, districts, counties, streets, communities, landmarks, house numbers, and geographic coordinates in turn according to the table structures to construct the geocoding library.

The technical solutions adopted in the embodiments of the present application further include: the address dictionary-based, using the N-shortest path optimization algorithm to perform word segmentation and standardization on the place name address data includes:

According to the address dictionary, the place name phrases in the place name address data are matched in order, and a directed acyclic graph is constructed. Each phrase is a node in the directed acyclic graph and corresponds to an edge of given length;

All possible word edges of the directed acyclic graph are established according to preset rules, so that all words contained in the place name address data correspond one-to-one to the edges of the directed acyclic graph; the set of N-shortest paths from the start node to the end node in the directed acyclic graph is then solved, and the place name address data is segmented according to the N-shortest path set.

The technical solutions adopted in the embodiments of the present application further include: assuming that the place name address data is S=c1c2...cn, where ci (i=1,2,...,n) is a single character, n is the length of the string, and n≥1, the number of nodes in the established directed acyclic graph G is n+1, and the nodes are numbered V0, V1, V2, ..., Vn. The preset rules for establishing all possible word edges in the directed acyclic graph are:

A directed edge <Vk-1, Vk> is established between adjacent nodes Vk-1, Vk, the length of the edge is Lk, and the word corresponding to the edge defaults to ck (k=1,2,...n);

If w=ci ci+1...cj is a word, then a directed edge <Vi-1, Vj> is established between nodes Vi-1 and Vj, the length of the edge is Lw, and the word corresponding to the edge is w(0 <i<j≤n).

The technical solutions adopted in the embodiments of the present application further include: the solving of the set of N-shortest paths from the start node to the end node in the directed acyclic graph includes:

Let Path(i,j) be the set of all paths from node Vi to node Vj; let Length(path) be the length of a path, equal to the sum of the lengths of all edges in the path; and let LS be the set of lengths of all paths from V0 to Vn in the directed acyclic graph G. Then:

LS={len|len=Length(path), path∈Path(0,n)}

Let NLS be the set of N-shortest path lengths from V0 to Vn, NSP the set of N-shortest paths from V0 to Vn, and RS the final N-shortest path rough segmentation result set. Then:

|NLS|=min(|LS|,N); a∈LS-NLS, b∈NLS → a>b

NSP={path|path∈Path(0,n), Length(path)∈NLS}

RS={w1w2...wm|wi is the word corresponding to the i-th edge of path, i=1,2,...,m, path∈NSP}

where N is the number of shortest paths.

The technical solutions adopted in the embodiments of the present application further include: the said address dictionary-based, using the N-shortest path optimization algorithm to perform word segmentation and standardization on the place name address data further includes:

Calculate the shortest path from the start node to the end node and record it as Lj, with j=1; if j is less than the number of shortest paths and other candidate paths exist, update the current path L to Lj; otherwise, end;

Starting from the first node in the current path, delete the first node whose in-degree is greater than 1, record the deleted node as Hm, and judge whether the descendant nodes of Hm are in the set E. If they are in the set E, calculate the shortest path from the start node to Hm and record the end node of that shortest path as H'm; if they are not in the set E, delete the node Hm and all of its descendant nodes from the directed acyclic graph G. Here, the set E is the set of N-shortest paths from V0 to Vn, Hm and H'm represent the end nodes in each cycle, and H'm is used as the end marker of the next cycle;

Repeat the node deletion process until m≮n, update the current path, obtain the shortest paths from the start node V0 to all nodes H'm, and set j=j+1. Here, n is the number of shortest paths after deleting nodes, m indexes the shortest path constructed in the j-th cycle, and in each cycle m=j+1.

Another technical solution adopted by the embodiment of the present application is: a geographic coding system, comprising:

Place name address model building module: used to build place name address model based on place name address data;

Geographical coding library building module: used to establish a geographic coding library according to the place name and address model, and the geographic coding library includes an administrative area entity data table, a street and lane entity data table and a community entity data table;

Word segmentation and standardization processing module: used to perform word segmentation and standardization processing on the place name address data based on the address dictionary, using the N-shortest path optimization algorithm, and divide the place name address data into at least one phrase;

Coordinate matching module: used to convert the at least one phrase into a character string in a predetermined format according to the level elements in the place name address model, match the character string with the corresponding geographic coordinates in the geocoding library, and use the geographic coordinates matched with the character string as the standard geographic coordinates of the corresponding place name address.

Another technical solution adopted by the embodiments of the present application is: a terminal, the terminal includes a processor and a memory coupled to the processor, wherein,

the memory stores program instructions for implementing the geocoding method;

The processor is configured to execute the program instructions stored in the memory to control geocoding.

Another technical solution adopted by the embodiments of the present application is: a storage medium storing program instructions executable by a processor, where the program instructions are used to execute the geocoding method.

Compared with the prior art, the beneficial effects of the embodiments of the present application are as follows. The geocoding method, system, terminal and storage medium of the embodiments of the present application perform word segmentation and standardization on place name addresses based on the N-shortest path optimization algorithm. After a place name address has been segmented according to the standardized result, the segmented place name address is converted, according to the level elements in the place name address model, into a character string that can be recognized by the computer; the character string is then matched with the corresponding geographic coordinates in the geocoding library, and standard geographic coordinates are assigned to the place name address according to the matching result. By adding auxiliary grammatical and semantic rules to the algorithm, the present application mitigates the disadvantages of word-by-word traversal and increases practicability, while inheriting the advantages of the full segmentation idea: it reduces the number of segmented phrases as much as possible while still including all the results that need to be retained, which effectively avoids wasting resources and increases search efficiency.

Description of drawings

FIG. 1 is a flowchart of the geocoding method of the first embodiment of the present application;

FIG. 2 is a schematic diagram of the representation of a place name and address according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a directed acyclic graph according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a directed acyclic graph solution process according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a predecessor record table in the process of solving a directed acyclic graph according to an embodiment of the present application;

FIG. 6 is a flowchart of the geocoding method of the second embodiment of the present application;

FIG. 7 is a schematic diagram of an N-shortest path improved word segmentation algorithm according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a geocoding system according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a storage medium according to an embodiment of the present application.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.

In view of the deficiencies of the prior art, the geocoding method of the embodiment of the present application first performs data cleaning on the initial place name address data to guard against problems such as typos, spelling mistakes and repeated text in the input; it then establishes a place name address model that can reflect the different representations of place names in a country or region, and builds, according to the place name address model, a geocoding library including a place name data table, a building data table and a door (building) plate data table. The N-shortest path optimization algorithm is then used to perform word segmentation and standardization on the place name address. After the place name address has been segmented according to the standardized result, the segmented place name address is converted, according to the level elements in the place name address model, into a character string that can be recognized by the computer, and the character string is matched with the corresponding geographic coordinates in the geocoding library. The embodiment of the present application inherits the advantages of the full segmentation idea: it reduces the number of segmented phrases as much as possible while still including all the results that need to be retained, which effectively avoids wasting resources and increases search efficiency.

Specifically, please refer to FIG. 1 , which is a flowchart of the geocoding method according to the first embodiment of the present application. The geocoding method of the first embodiment of the present application includes the following steps:

S10: Perform data cleaning on the initial place name and address data;

In this step, the text data such as place names and addresses input by the user terminal may contain typos or repeated words. To avoid mismatches between the subsequent character strings and the geographic coordinates caused by inconsistent character strings, spelling mistakes and similar problems in the text data, the embodiment of the present invention uses Trillium technology, which applies syntax analysis and a fuzzy matching algorithm, to perform data cleaning on the place name addresses.
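As an illustration of the fuzzy matching part of this cleaning step, the following minimal Python sketch normalizes whitespace and snaps a misspelled street name to the closest known name. It uses the standard-library difflib module as a stand-in and is not the data cleaning tooling named above; the function name clean_place_name, the sample names and the 0.8 similarity cutoff are assumptions made for this example.

```python
import difflib


def clean_place_name(raw, known_names, cutoff=0.8):
    """Return the closest known name for `raw`, or the tidied input if none is close.

    `known_names` is a hypothetical list of standard names; `cutoff` is an
    assumed similarity threshold in [0, 1].
    """
    # Collapse repeated whitespace and trim the ends.
    text = " ".join(raw.split())
    # Fuzzy match against the known names; keep only the best candidate.
    matches = difflib.get_close_matches(text, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else text


# Hypothetical reference names used only for this sketch.
KNOWN_STREETS = ["Tiyu West Road", "Tiyu East Road"]

cleaned = clean_place_name("Tiyu   Wset Road", KNOWN_STREETS)
unchanged = clean_place_name("Yanda Avenue", KNOWN_STREETS)
```

A name with no sufficiently similar candidate is returned unchanged apart from whitespace, so the step never silently replaces an address with an unrelated one.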

S11: After structuring the cleaned place name address data, establish a place name address model;

In this step, different countries or regions use description rules with different granularity ranges to represent place name addresses, and the embodiment of the present application establishes a scalable place name address model according to these rules. Specifically, FIG. 2 is a schematic diagram of the representation of place names and addresses. In this representation, a place name address includes the administrative region place name (provincial, city, county and township levels), the street and lane name, the community place name, the community name, the door (building) address, the landmark name or its alias, and the unit name or its abbreviation. Among these, the provincial level has priority over the city level, the city level over the county level, and the county level over the township level; street and lane names have priority over community place names, community place names over community names, and door (building) addresses over landmark names or their aliases, followed by the unit name or its abbreviation. Under normal circumstances, street names and community names within a city are unique, so a street name or community name can roughly lock a range of addresses; "street name or community name + door (building) number" can accurately locate a position, and "administrative region place name + landmark name" can, in most cases, also accurately locate a position. That is, when the place name address data contains a house number, "street name or community name + door (building) number" is used to lock the location; when the place name address data contains a landmark name, "administrative region place name + landmark name" is used for precise positioning. Examples of structuring place name address data according to the above granularity rules are as follows:

(1) No. 111, Tiyu West Road, Tianhe District, Guangzhou City, Guangdong Province, the place name and address data is structured as: administrative area name + street name + door (building) number;

(2) Jianhe Center, Tiyu West Road, Tianhe District, Guangzhou City, Guangdong Province, the place name and address data are structured as: administrative area name + street name + landmark name.

When multiple landmarks share a name, the granularity of the current administrative region can be extended until a unique location can be identified. For example, the place name address data "Huizhou College, No. 46 Yanda Avenue, Huizhou City" can be simplified to "No. 46 Yanda Avenue" in an application limited to Huizhou City without any ambiguity. If, however, the place name address data is "ICBC, Guangzhou City", multiple landmarks may be located and the results are difficult to filter, so the description needs to be extended to the street name or community name before a particular ICBC branch can be accurately located.

S12: According to the place name and address model, establish a geocoding library including the entity data table of administrative areas, the entity data table of streets and alleys, and the entity data table of community;

In this step, the table structure of the administrative area entity data table, the street entity data table and the community entity data table can be defined according to the application scenario, and the establishment of the geocoding library follows the following principles:

Uniqueness principle: any geographic entity can only be uniquely identified;

Transparency principle: The affiliation between structures can be identified from the coding;

The principle of flexibility: it should adapt to the development and changes of the object;

Standardization principle: The coding rules are adapted to the national standard system for data sharing.

The structures of the administrative area entity data table, the street entity data table and the community entity data table are shown in Tables 1, 2 and 3 below:

Table 1 Administrative area entity data table

Table 2 Street and Alley entity data table

Table 3 Community entity data table

According to the structure of each table, enter all provinces, districts, counties, streets, communities, landmarks, house numbers and geographic coordinates in turn to construct the geocoding database. Select the field value in each data table as the place name address entry, and record it in the address dictionary together with the corresponding address level. When an address alias is used as a place-name address entry, it is also necessary to record the standard name so that the address elements can be normalized during address segmentation.
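The address dictionary described above can be pictured with a small Python sketch. It only illustrates the idea of storing each entry with its address level and, for aliases, the standard name; the DictEntry type, the level labels and the sample entries are assumptions for this example and do not reproduce the table structures of the application.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    level: str          # address level of the entry, e.g. "city" or "street"
    standard_name: str  # standard name used when normalizing the entry


# Hypothetical entries; in practice they would be taken from the field
# values of the geocoding library tables.
ADDRESS_DICTIONARY = {
    "Guangdong Province": DictEntry("province", "Guangdong Province"),
    "Guangzhou City": DictEntry("city", "Guangzhou City"),
    "Guangzhou": DictEntry("city", "Guangzhou City"),  # alias entry
    "Tianhe District": DictEntry("county", "Tianhe District"),
    "Tiyu West Road": DictEntry("street", "Tiyu West Road"),
}


def lookup(entry_text: str) -> Optional[DictEntry]:
    """Return the level and standard name for an entry, or None if unknown."""
    return ADDRESS_DICTIONARY.get(entry_text)


alias_hit = lookup("Guangzhou")
```

Keeping the standard name next to every alias lets the segmentation step normalize an address element as soon as it is recognized, instead of in a separate pass.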

S13: Based on the address dictionary, the N-shortest path optimization algorithm is used to segment and standardize the irregular place name address data, and the place name address data is divided into at least one phrase;

In this step, the N-shortest path optimization algorithm is implemented as follows. The address dictionary records all place name addresses (including aliases, abbreviations, and so on) in different countries and regions. First, the place name phrases that may appear in the place name address data are matched in order according to the address dictionary, and then a directed acyclic graph is constructed. Each phrase is a node in the directed acyclic graph and corresponds to a given edge length (that is, a weight; in the non-statistical rough segmentation model, all words are assumed to be equal, so for convenience of calculation the length of the edge corresponding to every word is set to 1). Among all paths from the start point to the end point of the directed acyclic graph, the path values from each node to the source node are obtained, and the corresponding set of paths is used as the path result set of that node.

For example, assume that the character string to be segmented is S=c1c2...cn, where ci (i=1,2,...,n) is a single character, n is the length of the string, and n≥1. Create a directed acyclic graph G with n+1 nodes, numbered V0, V1, V2, ..., Vn in sequence. All possible word edges of G are established by the following two rules:

(1) A directed edge <Vk-1, Vk> is established between adjacent nodes Vk-1, Vk, the length of the edge is Lk, and the word corresponding to the edge defaults to ck (k=1,2,...n);

(2) If w=ci ci+1...cj is a word, then a directed edge <Vi-1, Vj> is established between nodes Vi-1 and Vj, the length of the edge is Lw, and the word corresponding to the edge is w(0<i<j≤n).
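The two rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name build_word_graph, the adjacency-list layout and the toy string and dictionary are invented for the example, and every edge is given unit length as in the non-statistical model.

```python
def build_word_graph(s, dictionary):
    """Build the word graph of string `s` as {start_node: [(end_node, word, length)]}.

    Nodes are numbered 0..n (V0..Vn); all edge lengths are assumed to be 1.
    """
    n = len(s)
    edges = {k: [] for k in range(n + 1)}
    # Rule (1): an edge <V(k-1), Vk> for every single character ck.
    for k in range(1, n + 1):
        edges[k - 1].append((k, s[k - 1], 1))
    # Rule (2): an edge <V(i-1), Vj> for every dictionary word ci...cj, i < j.
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            word = s[i - 1:j]
            if word in dictionary:
                edges[i - 1].append((j, word, 1))
    return edges


# Toy input: four characters and three hypothetical dictionary words.
graph = build_word_graph("abcd", {"ab", "cd", "abc"})
```

The double loop checks every substring, which is quadratic in the string length; that is acceptable for short address strings, and a trie-based dictionary scan would be a natural refinement.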

According to the above rules, all words contained in the string S to be segmented correspond one-to-one to the edges in the directed acyclic graph G, as shown in the drawings. The rough word segmentation problem of the N-shortest path optimization algorithm is to solve the set NSP of the directed acyclic graph G. The solution process for the directed acyclic graph structure is as follows:

Let Path(i,j) be the set of all paths from node Vi to node Vj; let Length(path) be the length of a path, equal to the sum of the lengths of all edges in the path; and let LS be the set of lengths of all paths from V0 to Vn in the directed acyclic graph G. Then:

LS={len|len=Length(path), path∈Path(0,n)}    (1)

NLS is the set of N-shortest path lengths from V0 to Vn, NSP is the set of N-shortest paths from V0 to Vn, and RS is the final N-shortest path rough segmentation result set. They are defined as:

|NLS|=min(|LS|,N); a∈LS-NLS, b∈NLS → a>b

NSP={path|path∈Path(0,n), Length(path)∈NLS}

RS={w1w2...wm|wi is the word corresponding to the i-th edge of path, i=1,2,...,m, path∈NSP}

where N is the number of shortest paths.
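The definitions of NLS, NSP and RS can be made concrete with a short Python sketch. It is one possible way to compute the result set, assuming non-negative edge lengths and the adjacency-list layout {start_node: [(end_node, word, length)]}; the function name n_shortest_segmentations and the toy graph are assumptions for this example, not the procedure recorded in the drawings.

```python
def n_shortest_segmentations(edges, n, N):
    """Return [(length, words)] for every path V0 -> Vn whose length is in NLS.

    At each node only the paths whose length is among the N smallest distinct
    lengths reaching that node are kept; with non-negative edge lengths a
    longer prefix can never belong to a path in NSP.
    """
    paths = {0: [(0, [])]}  # node -> list of (length, words) reaching it
    for v in range(n + 1):
        candidates = paths.get(v, [])
        keep = set(sorted({length for length, _ in candidates})[:N])
        candidates = [p for p in candidates if p[0] in keep]
        paths[v] = candidates
        # Edges only lead to higher-numbered nodes, so one pass suffices.
        for end, word, length in edges.get(v, []):
            for total, words in candidates:
                paths.setdefault(end, []).append((total + length, words + [word]))
    return sorted(paths[n], key=lambda p: p[0])


# Toy word graph for the string "abcd" with words "ab", "cd" and "abc".
toy_edges = {
    0: [(1, "a", 1), (2, "ab", 1), (3, "abc", 1)],
    1: [(2, "b", 1)],
    2: [(3, "c", 1), (4, "cd", 1)],
    3: [(4, "d", 1)],
}

best = n_shortest_segmentations(toy_edges, 4, N=1)
best_two = n_shortest_segmentations(toy_edges, 4, N=2)
```

With N=1 the toy graph yields the two segmentations of length 2; with N=2 the two segmentations of length 3 are added, which shows that NSP may contain more than N paths when several paths share a length.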

Take the text data "what he said really makes sense" as an example of constructing and solving a directed acyclic graph; the solution process for this text data is shown in FIG. 4. First, a greedy algorithm is used to obtain the locally optimal solution at each node, recording the shortest path value at each node and the predecessor of the node. If a node lies on more than two paths of the same length, the predecessor of the node on each path is recorded separately. The predecessor record table of the text data is shown in FIG. 5, where: in (a), the predecessors (2,1) and (3,1) have lengths of 3 and 4, and the corresponding nodes are 012 and 0123; in (b), the predecessors (4,1) and (4,2) have lengths of 4 and 5, and the corresponding nodes are 0123 and 01234; in (c), the predecessors (4,1), (5,1) ((4,2)) and (5,2) have lengths of 4, 5 and 6, and the corresponding nodes are 0123, 01234 and 012345; in (d), the predecessors (6,1) ((5,1)), (6,2) ((5,2)) and (6,3) have lengths of 5, 6 and 7, and the corresponding nodes are 01234, 012345 and 0123456. Then a backtracking algorithm searches backward for better results, and the optimal word segmentation of the text data "what he said really makes sense" is finally obtained as "he | said | (particle) | really | makes sense".

Based on the above, the present invention uses the N-shortest path word segmentation algorithm to segment the place name address, which not only greatly reduces the number of segmentation candidates but also retains, as far as possible, all plausible segmentation results without loss. At the same time, it reduces the search space as much as possible and improves the efficiency of word segmentation.

S14: Convert the segmented at least one phrase into a character string in a predetermined format (recognizable by the computer) according to the level element in the place name address model, and then match the converted character string with the corresponding geographic coordinates in the geocoding library;

S15: Use the geographic coordinates matched by the string as the standard geographic coordinates of the address corresponding to the place name.
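Steps S14 and S15 can be illustrated with a minimal Python sketch. The predetermined string format is not spelled out in the text, so the "|"-joined key ordered by address level is an assumption, as are the names LEVEL_ORDER, to_key and geocode; the coordinates are placeholder values, not surveyed data.

```python
# Assumed level order, from coarsest to finest address element.
LEVEL_ORDER = ["province", "city", "county", "street", "house_number"]


def to_key(phrases):
    """Join standardized phrases into one string, ordered by address level.

    `phrases` maps a level name to the phrase found for it; missing levels
    are left empty so that every key has the same number of fields.
    """
    return "|".join(phrases.get(level, "") for level in LEVEL_ORDER)


# Hypothetical geocoding library: formatted string -> (longitude, latitude).
GEOCODING_LIBRARY = {
    "Guangdong Province|Guangzhou City|Tianhe District|Tiyu West Road|No. 111":
        (113.0, 23.0),  # placeholder coordinates
}


def geocode(phrases):
    """Return the matched coordinates, or None when the key is not in the library."""
    return GEOCODING_LIBRARY.get(to_key(phrases))


segmented = {
    "street": "Tiyu West Road",
    "house_number": "No. 111",
    "province": "Guangdong Province",
    "city": "Guangzhou City",
    "county": "Tianhe District",
}
coords = geocode(segmented)
```

Ordering the fields by level before joining makes the key independent of the word order in the input text, which is the property the standardization step is meant to provide.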

Please refer to FIG. 6 , which is a flowchart of the geocoding method according to the second embodiment of the present application. The geocoding method of the second embodiment of the present application includes the following steps:

S20: Perform data cleaning on the initial place name and address data;

S21: After structuring the cleaned place name address data, establish a place name address model;

In this step, different countries or regions have different granularity and scope rules for the representation of the place name address, and the place name address can be regarded as a hierarchically scalable place name address model.

S22: Establish a geocoding library including a place name data table, a building data table and a door (building) sign data table according to the place name address model;

In this step, the table structures of the place name data table, the building data table and the door (building) plate data table can be defined according to the application scenario, and all provinces, districts, counties, streets, communities, landmarks, house numbers and geographic coordinates are entered in sequence according to the table structures to construct the geocoding library.

S23: Based on the address dictionary, the N-shortest path improved word segmentation algorithm, which combines the dynamic deletion algorithm with the N-shortest path word segmentation algorithm, is used to segment and standardize the irregular place name address data, and the place name address data is divided into at least one phrase;

In this step, in order to further improve word segmentation efficiency, the present invention proposes an N-shortest path improved word segmentation algorithm that combines the dynamic deletion algorithm with the N-shortest path word segmentation algorithm. The basic idea of the dynamic deletion algorithm is as follows: construct a shortest path update queue to store the child nodes of deleted nodes; delete the node to be deleted and all of its child nodes from the original shortest path tree; select from the queue the node closest to the root node for updating; and do not reinsert updated nodes into the queue, so as to reduce the number of node updates. The N-shortest path improved word segmentation algorithm is shown in FIG. 7, and its solution process is as follows:

Step 1: First, based on the N-shortest path word segmentation algorithm, construct a directed acyclic graph G with words (or characters) as nodes; wherein, the directed acyclic graph construction process is the same as the first embodiment, and this embodiment will not repeat;

Step 2: Calculate the shortest path from the start node to the end node and record it as Lj, with j=1; if j is less than the number of shortest paths (that is, j<n) and other candidate paths exist, update the current path L to Lj; otherwise, end. Here, candidate paths are the different paths produced by different ways of cutting the words.

Lj is used to store the shortest path, where j is a dynamic variable whose initial value can be set according to the length of the entire string; for "what he said really makes sense", for example, the initial value of j is 8. When a word combination is completed, the nodes between the combined parts are deleted and the value of j is updated. As segmentation of the sentence proceeds, the value of j becomes smaller and smaller until the sentence can no longer be divided.

Step 3: Starting from the first node in the current path, delete the first node whose in-degree is greater than 1, record the deleted node as Hm, and determine whether the descendant nodes of Hm are in the set E. If they are in the set E, calculate the shortest path from the start node V0 to Hm and record the end node of that shortest path as H'm; if they are not in the set E, delete the node Hm and all of its descendant nodes from the directed acyclic graph G. Here, the set E is the N-shortest path set (that is, NSP) from V0 to Vn, which is used to determine whether the deleted node lies on a shortest path. Hm and H'm represent the end nodes in each cycle, and H'm serves as the end marker of the next cycle.

Step 4: Repeat the above process until m≮n, update the current path, obtain the shortest paths from the start node V0 to all nodes H'm, and set j=j+1. Here, n is the number of shortest paths after deleting nodes, m indexes the shortest path constructed in the j-th cycle, and in each cycle m=j+1; after entering the next cycle, Hm is updated along with the value of m, while H'm does not change.

In the above solution process, in order to avoid affecting the search efficiency and accuracy, the value of j should be moderate, neither too large nor too small, for the first j optimal paths to be reserved.

S24: Convert the segmented at least one phrase into a character string in a predetermined format (recognizable by the computer) according to the level element in the place name address model, and then match the converted character string with the corresponding geographic coordinates in the geocoding library;

S25: Use the geographic coordinates matched by the character string as the standard geographic coordinates of the address corresponding to the place name.
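Steps S24 and S25 can be illustrated with a minimal sketch. The text does not specify the predetermined string format, so the level names in `LEVELS`, the `/` separator, the helper names `to_key` and `geocode`, and the placeholder coordinates below are all assumptions made for illustration, not the format used by the application.

```python
# Hypothetical level order; the actual level elements come from the
# place name address model and are not spelled out in this passage.
LEVELS = ("province", "city", "district", "street", "community", "number")

def to_key(phrases):
    """Join (level, phrase) pairs into one string in fixed level order."""
    by_level = dict(phrases)
    # Missing levels become empty fields so that every key has the same shape.
    return "/".join(by_level.get(level, "") for level in LEVELS)

def geocode(phrases, library):
    """Return the coordinates stored under the formatted key, or None."""
    return library.get(to_key(phrases))

# Toy input: phrases may arrive in any order after segmentation.
segmented = [("city", "深圳市"), ("province", "广东省"), ("district", "南山区")]
# Toy geocoding library; (100.0, 20.0) is a placeholder, not a real coordinate.
library = {"广东省/深圳市/南山区///": (100.0, 20.0)}
print(geocode(segmented, library))
```

Ordering the phrases by level before joining means that two inputs with the same elements in a different word order map to the same key, which is one way to cope with the disordered word order mentioned in the background.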

In order to verify the feasibility and effectiveness of the embodiments of the present application, experiments were carried out on the scheme using the ArcGIS API for JavaScript platform, and its accuracy was compared with that of the traditional algorithm. The comparison results are shown in Table 1:

Table 1 Algorithm Accuracy Comparison

The experimental results show that the accuracy of geographic coordinate matching in the embodiment of the present application exceeds that of the traditional algorithm, and that the word segmentation speed is increased by more than a factor of two.

Based on the above, the geocoding method of the embodiment of the present application uses the N-shortest path optimization algorithm to perform word segmentation and standardization on the place name address. After the place name address is segmented according to the standardization result, the segmented place name address is converted, according to the level elements in the place name address model, into a character string that can be recognized by the computer. Finally, the character string is matched with the corresponding geographic coordinates in the geocoding library, and the place name address is assigned standard geographic coordinates according to the matching result. By adding auxiliary grammatical and semantic rules to the algorithm, the present application overcomes the disadvantages of word-by-word traversal and improves practicability while inheriting the advantages of the full segmentation idea: it reduces the number of segmented phrases as much as possible while still including all results that need to be retained, which effectively avoids wasting resources and increases search efficiency.

Please refer to FIG. 8, which is a schematic structural diagram of a geocoding system according to an embodiment of the present application. The geocoding system 40 of the embodiment of the present application includes:

Data cleaning module 41: used to perform data cleaning on the initial place name address data. Text data such as place names and addresses input by a user terminal may contain typos or repeated words. To prevent inconsistent character strings, spelling errors and similar problems in the text data from causing subsequent character strings to be incorrectly matched with geographic coordinates, the embodiment of the present application uses the Trillium technology, together with syntax analysis and a fuzzy matching algorithm, to clean the place name address data.
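The kind of cleaning described for module 41 can be sketched as follows. This is not the Trillium technology named in the text; it is a stand-in that uses Python's standard `re` and `difflib` modules. The function name `clean_address`, the 0.8 similarity cutoff and the sample names are assumptions chosen for illustration.

```python
import difflib
import re

def clean_address(raw, known_names):
    """Normalize a raw address string and snap it to a known name if close."""
    # Remove stray whitespace inside the address.
    text = re.sub(r"\s+", "", raw)
    # Collapse a run of two or more characters that is immediately repeated.
    text = re.sub(r"(.{2,}?)\1+", r"\1", text)
    # Fuzzy-match against known names; keep the text unchanged if nothing is close.
    match = difflib.get_close_matches(text, known_names, n=1, cutoff=0.8)
    return match[0] if match else text

# Hypothetical known names, for illustration only.
names = ["深圳市南山区", "深圳市福田区"]
print(clean_address("深圳市深圳市南山区", names))  # repeated words
print(clean_address("深圳市 南山曲", names))        # typo in the last character
```

The repeated-run rule is deliberately crude and would also collapse names that legitimately contain a repeated run, so a production cleaner would need dictionary-aware rules.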

Place name address model building module 42: used to structure the cleaned place name address data to establish a place name address model. Different countries or regions have different granularity and scope rules for representing place names and addresses, and a place name address can be regarded as a hierarchically scalable place name address model.

Geocoding library building module 43: used to establish, according to the place name address model, a geocoding library including a place name data table, a building data table and a door (building) sign data table. The table structures of the place name data table, the building data table and the door (building) sign data table can be defined according to the application scenario, and all provinces, districts, counties, streets, communities, landmarks, house numbers and geographic coordinates are entered in turn according to each table structure to construct the geocoding library.
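Since the table structures are left to the application scenario, the following is only one possible layout, sketched with Python's built-in `sqlite3` module. The table names, column names and the sample record are hypothetical.

```python
import sqlite3

# Hypothetical schema: three linked tables, from coarse to fine granularity.
SCHEMA = """
CREATE TABLE place_name (
    id INTEGER PRIMARY KEY, province TEXT, district TEXT,
    street TEXT, community TEXT);
CREATE TABLE building (
    id INTEGER PRIMARY KEY, place_id INTEGER REFERENCES place_name(id),
    landmark TEXT);
CREATE TABLE door_sign (
    id INTEGER PRIMARY KEY, building_id INTEGER REFERENCES building(id),
    house_number TEXT, lon REAL, lat REAL);
"""

def build_library(records):
    """Enter each record level by level and return the in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for province, district, street, community, landmark, number, lon, lat in records:
        place_id = conn.execute(
            "INSERT INTO place_name (province, district, street, community) "
            "VALUES (?, ?, ?, ?)", (province, district, street, community)).lastrowid
        building_id = conn.execute(
            "INSERT INTO building (place_id, landmark) VALUES (?, ?)",
            (place_id, landmark)).lastrowid
        conn.execute(
            "INSERT INTO door_sign (building_id, house_number, lon, lat) "
            "VALUES (?, ?, ?, ?)", (building_id, number, lon, lat))
    return conn

# Placeholder record; the names and coordinates are not real data.
conn = build_library([("P", "D", "S", "C", "Tower", "12", 100.0, 20.0)])
```

Keeping coordinates only at the finest level avoids storing the same point in several tables; coarser lookups can join upward through `building` and `place_name`.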

Word segmentation and standardization processing module 44: used to perform word segmentation and standardization on the irregular place name address data based on the address dictionary by using the N-shortest path optimization algorithm, and to divide the place name address data into at least one phrase. The N-shortest path optimization algorithm is implemented as follows. The address dictionary records all place name addresses (including aliases, abbreviations and the like) in different countries and regions. First, the place name phrases that may appear in the place name address data are matched in order against the address dictionary, and a directed acyclic graph is constructed. Each phrase is a node in the directed acyclic graph and corresponds to a given edge length (i.e., a weight; in the non-statistical rough segmentation model, all words are assumed to be equivalent, so for convenience of calculation the length of the edge corresponding to every word is set to 1). Among all paths from the start point to the end point in the directed acyclic graph, the path values from the source node to each node are obtained, and the corresponding path set is used as the path result set of each node.
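The graph construction can be sketched as below, following the edge rules stated in the claims: one edge per single character between adjacent nodes, plus one edge per dictionary word. The function name `build_word_graph`, the adjacency-list representation and the toy dictionary are assumptions made for illustration; every edge implicitly has length 1.

```python
def build_word_graph(text, dictionary):
    """Return edges[i] = list of (j, word), meaning an edge from V_i to V_j."""
    n = len(text)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        # Rule 1: adjacent nodes are always linked by the single character.
        edges[i].append((i + 1, text[i]))
        # Rule 2: each dictionary word starting at position i adds a longer edge.
        for j in range(i + 2, n + 1):
            if text[i:j] in dictionary:
                edges[i].append((j, text[i:j]))
    return edges

# Hypothetical address and dictionary, for illustration only.
graph = build_word_graph("深圳市南山区", {"深圳", "深圳市", "南山", "南山区"})
for start, out in graph.items():
    print(start, out)
```

Because every edge points from a lower-numbered node to a higher-numbered one, the graph is acyclic by construction and the nodes are already in topological order.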

$$LS = \{\, len \mid len = \mathrm{Length}(path),\ path \in \mathrm{Path}(0, n) \,\} \tag{1}$$

NLS is the set of N-shortest path lengths from V0 to Vn; NSP is the set of N-shortest paths from V0 to Vn; RS is the final N-shortest path rough segmentation result set. They are defined as:

$$|NLS| = \min(|LS|, N); \quad \forall a \in LS \setminus NLS,\ \forall b \in NLS:\ a > b$$

$$NSP = \{\, path \mid path \in \mathrm{Path}(0, n),\ \mathrm{Length}(path) \in NLS \,\}$$

$$RS = \{\, w_1 w_2 \ldots w_m \mid w_i \text{ is the word corresponding to the } i\text{-th edge of } path,\ i = 1, 2, \ldots, m,\ path \in NSP \,\}$$
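To make the definitions of LS, NLS, NSP and RS concrete, the sketch below enumerates every path by brute force and applies the definitions directly, with all edge lengths set to 1. It is meant only to illustrate the sets, not as an efficient solver; the function names and the toy dictionary are hypothetical.

```python
def all_paths(text, dictionary, start=0):
    """Yield every path from V_start to V_n as a list of words."""
    if start == len(text):
        yield []
        return
    for end in range(start + 1, len(text) + 1):
        word = text[start:end]
        # An edge exists for every single character and every dictionary word.
        if end == start + 1 or word in dictionary:
            for rest in all_paths(text, dictionary, end):
                yield [word] + rest

def n_shortest_sets(text, dictionary, N):
    """Return (LS, NLS, RS) as defined above, with unit edge lengths."""
    paths = list(all_paths(text, dictionary))
    LS = {len(p) for p in paths}            # path length = number of edges
    NLS = set(sorted(LS)[:N])               # the N smallest distinct lengths
    NSP = [p for p in paths if len(p) in NLS]
    RS = ["|".join(p) for p in NSP]
    return LS, NLS, RS

LS, NLS, RS = n_shortest_sets("深圳市南山区", {"深圳", "深圳市", "南山", "南山区"}, 2)
print(sorted(LS), sorted(NLS), RS)
```

Note that N counts distinct path lengths, not paths, so RS may contain more than N segmentations when several paths share a length.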

Taking the directed acyclic graph constructed from the text data "what he said is true" as an example, the solution process is shown in Figure 4. First, a greedy algorithm is used to obtain the locally optimal solution at each node: the shortest path values at each node and the predecessors of the node are recorded. If a node has more than two paths of the same length, the predecessor of the node on each path is recorded separately (the predecessor record table is shown in Figure 4). Then a backtracking algorithm traces back through the recorded predecessors to find the preferred results, and the optimal word segmentation result for the text data "what he said is true" is finally obtained as "he | said | | is true | is true |".
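The two phases described above, recording the shortest path values and predecessors at each node and then backtracking, can be sketched as follows. This is one possible reading of the procedure under unit edge lengths; the table layout and the name `n_shortest_segmentations` are assumptions, and the input is a hypothetical address rather than the example of Figure 4.

```python
def n_shortest_segmentations(text, dictionary, N):
    """Return all segmentations whose length is among the N shortest lengths."""
    n = len(text)
    # incoming[j] lists every node i that has an edge V_i -> V_j.
    incoming = {j: [] for j in range(n + 1)}
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j == i + 1 or text[i:j] in dictionary:
                incoming[j].append(i)

    # table[j] maps each kept path length at V_j to its predecessor records,
    # where a record is (predecessor node, path length at that predecessor).
    table = {0: {0: []}}
    for j in range(1, n + 1):
        candidates = {}
        for i in incoming[j]:
            for length in table[i]:
                candidates.setdefault(length + 1, []).append((i, length))
        table[j] = {length: candidates[length] for length in sorted(candidates)[:N]}

    def backtrack(j, length):
        """Rebuild every word sequence that reaches V_j with the given length."""
        if j == 0:
            return [[]]
        return [prefix + [text[i:j]]
                for i, prev_length in table[j][length]
                for prefix in backtrack(i, prev_length)]

    return [p for length in sorted(table[n]) for p in backtrack(n, length)]

result = n_shortest_segmentations("深圳市南山区", {"深圳", "深圳市", "南山", "南山区"}, 2)
print(result)
```

Keeping only the N smallest lengths at each node is sufficient, because a length that is not among the N smallest at a predecessor cannot produce one of the N smallest lengths at its successor.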

In another embodiment of the present application, to further improve word segmentation efficiency, the word segmentation and standardization processing module 44 performs word segmentation and standardization with an improved N-shortest path word segmentation algorithm that combines a dynamic deletion algorithm with the N-shortest path word segmentation algorithm. Specifically:

Step 1: Based on the N-shortest path word segmentation algorithm, construct a directed acyclic graph G with words as nodes;

Step 2: Calculate the shortest path from the start node to the end node and record it as Lj with j=1; if j is less than the number of shortest paths and there are other candidate paths, update the current path L to Lj, otherwise end;

Step 3: Starting from the first node in the current path, delete the first node whose in-degree is greater than 1 and record the deleted node as Hm. Determine whether the descendant nodes of Hm are in the set E. If they are in the set E, calculate the shortest path from the start node V0 to Hm and record the end node of that shortest path as H'm; if they are not in the set E, delete the node Hm and all of its descendant nodes from the directed acyclic graph G;

Step 4: Repeat the above process until m≮n, update the current path, and obtain the shortest paths from the start node V0 to all nodes H'm, with j=j+1.

Coordinate matching module 45: used to convert the at least one phrase into a character string in a predetermined format (recognizable by the computer) according to the level elements in the place name address model, then match the converted character string with the corresponding geographic coordinates in the geocoding library, and take the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place name address.

Please refer to FIG. 9, which is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51.

The memory 52 stores program instructions for implementing the above-described geocoding method.

The processor 51 is adapted to execute program instructions stored in the memory 52 to control the geocoding.

The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Please refer to FIG. 10, which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of this embodiment stores a program file 61 capable of implementing all of the above methods. The program file 61 may be stored in the storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to execute all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or terminal devices such as computers, servers, mobile phones and tablets.

The above description of the disclosed embodiments enables any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined in this application may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A geocoding method, comprising:

establishing a place name address model according to place name address data;

establishing a geocoding library according to the place name address model, the geocoding library including an administrative area entity data table, a street entity data table and a community entity data table;

performing word segmentation and standardization on the place name address data based on an address dictionary by using an N-shortest path optimization algorithm, so as to divide the place name address data into at least one phrase;

converting the at least one phrase into a character string in a predetermined format according to the level elements in the place name address model, matching the character string with the corresponding geographic coordinates in the geocoding library, and using the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place name address.

2. The geocoding method according to claim 1, wherein before establishing the place name address model according to the place name address data, the method further comprises:

performing data cleaning on the place name address data.

3. The geocoding method according to claim 1, wherein establishing the geocoding library according to the place name address model comprises:

defining the table structures of the administrative area entity data table, the street entity data table and the community entity data table, and entering the provinces, districts, counties, streets, communities, landmarks, house numbers and geographic coordinates in turn according to the table structures, so as to construct the geocoding library.

4. The geocoding method according to claim 3, wherein performing word segmentation and standardization on the place name address data based on the address dictionary by using the N-shortest path optimization algorithm comprises:

matching the place name phrases in the place name address data in order according to the address dictionary and constructing a directed acyclic graph, wherein each phrase is a node in the directed acyclic graph and corresponds to an edge of given length;

establishing all possible word edges of the directed acyclic graph according to preset rules, so that all words contained in the place name address data correspond one-to-one to the edges of the directed acyclic graph, solving the N-shortest path set from the start node to the end node in the directed acyclic graph, and segmenting the place name address data according to the N-shortest path set.

5. The geocoding method according to claim 4, wherein, assuming the place name address data is S=c1c2...cn, where ci (i=1,2,...,n) is a single character, n is the length of the string and n≥1, the number of nodes in the established directed acyclic graph G is n+1, the nodes are numbered V0, V1, V2, ..., Vn, and the preset rules for establishing all possible word edges of the directed acyclic graph are:

a directed edge <Vk-1, Vk> is established between adjacent nodes Vk-1 and Vk, the length of the edge is Lk, and the word corresponding to the edge defaults to ck (k=1,2,...,n);

if w=ci ci+1...cj is a word, a directed edge <Vi-1, Vj> is established between nodes Vi-1 and Vj, the length of the edge is Lw, and the word corresponding to the edge is w (0<i<j≤n).

6. The geocoding method according to claim 5, wherein solving the N-shortest path set from the start node to the end node in the directed acyclic graph comprises:

letting Path(i,j) be the set of all paths from node Vi to node Vj, Length(path) be the length of the path path, whose value equals the sum of the lengths of all edges in the path, and LS be the set of lengths of all paths from V0 to Vn in the directed acyclic graph G, so that:

$$LS = \{\, len \mid len = \mathrm{Length}(path),\ path \in \mathrm{Path}(0, n) \,\}$$

letting NLS be the set of N-shortest path lengths from V0 to Vn, NSP be the set of N-shortest paths from V0 to Vn, and RS be the final N-shortest path rough segmentation result set, where:

$$|NLS| = \min(|LS|, N); \quad \forall a \in LS \setminus NLS,\ \forall b \in NLS:\ a > b$$

$$NSP = \{\, path \mid path \in \mathrm{Path}(0, n),\ \mathrm{Length}(path) \in NLS \,\}$$

$$RS = \{\, w_1 w_2 \ldots w_m \mid w_i \text{ is the word corresponding to the } i\text{-th edge of } path,\ i = 1, 2, \ldots, m,\ path \in NSP \,\}$$

and n is the number of shortest paths.

7. The geocoding method according to claim 6, wherein performing word segmentation and standardization on the place name address data based on the address dictionary by using the N-shortest path optimization algorithm further comprises:

calculating the shortest path from the start node to the end node and recording it as Lj with j=1; if j is less than the number of shortest paths and there are other candidate paths, updating the current path L to Lj, and otherwise ending;

starting from the first node in the current path, deleting the first node whose in-degree is greater than 1, recording the deleted node as Hm, and determining whether the descendant nodes of Hm are in the set E; if they are in the set E, calculating the shortest path from the start node to Hm and recording the end node of that shortest path as H'm; if they are not in the set E, deleting the node Hm and all of its descendant nodes from the directed acyclic graph G; wherein the set E is the set of N-shortest paths from V0 to Vn, Hm and H'm denote the end nodes in each cycle, and H'm serves as the end marker of the next cycle;

repeating the node deletion process until m≮n, updating the current path, and obtaining the shortest paths from the start node V0 to all nodes H'm, with j=j+1; wherein n is the number of shortest paths after deleting nodes, m corresponds to the shortest path constructed after j cycles, and in each cycle m=j+1.

8. A geocoding system, comprising:

a place name address model building module, used to establish a place name address model according to place name address data;

a geocoding library building module, used to establish a geocoding library according to the place name address model, the geocoding library including an administrative area entity data table, a street entity data table and a community entity data table;

a word segmentation and standardization processing module, used to perform word segmentation and standardization on the place name address data based on an address dictionary by using an N-shortest path optimization algorithm, and to divide the place name address data into at least one phrase;

a coordinate matching module, used to convert the at least one phrase into a character string in a predetermined format according to the level elements in the place name address model, match the character string with the corresponding geographic coordinates in the geocoding library, and use the geographic coordinates matched by the character string as the standard geographic coordinates of the corresponding place name address.

9. A terminal, comprising a processor and a memory coupled to the processor, wherein:

the memory stores program instructions for implementing the geocoding method according to any one of claims 1 to 7;

the processor is configured to execute the program instructions stored in the memory to control geocoding.

10. A storage medium storing program instructions executable by a processor, wherein the program instructions are used to execute the geocoding method according to any one of claims 1 to 7.