WO2021189977A1 - Address coding method and apparatus, and computer device and computer-readable storage medium - Google Patents

Address coding method and apparatus, and computer device and computer-readable storage medium

Info

Publication number
WO2021189977A1
Authority
WO
WIPO (PCT)
Prior art keywords
regional
address
encoded
lowest
stored
Prior art date
Application number
PCT/CN2020/136330
Other languages
French (fr)
Chinese (zh)
Inventor
李硕
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021189977A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application also provides a computer device including a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the following steps:
  • FIG. 2 is a schematic flowchart of another address encoding method provided by an embodiment of this application.
  • FIG. 3 is a schematic flowchart of yet another address encoding method provided by an embodiment of this application.
  • FIG. 4 is a schematic block diagram of an address encoding device provided by an embodiment of this application.
  • FIG. 5 is a schematic block diagram of the structure of a computer device related to an embodiment of the application.
  • FIG. 1 is a schematic flowchart of an address encoding method according to an embodiment of the application.
  • the address encoding method includes steps S101 to S105.
  • matching the region word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to that region word specifically comprises:
  • the region word of the lowest-level administrative region is compared with the pre-stored region code dictionary to determine the pre-stored region name in the dictionary that matches it; based on the mapping between pre-stored region names and pre-stored region codes in the dictionary,
  • the pre-stored region code corresponding to the matching pre-stored region name is determined; the determined pre-stored region code is used as the region code corresponding to the region word of the lowest-level administrative region.
  • Step S104 Detect the credibility type of the coding result through the trained credibility detection model.
  • Step S106 Obtain national administrative division data, and construct a regional coding dictionary based on the national administrative division data.
  • a component task is created for the standard POI library to construct a trie tree based on the standard POI library.
  • for the component task created for the standard POI library, corresponding task parameters are configured on the Spark component page; the task parameters include the execution time, the Spark code, and so on.
  • the Spark code defines the processing procedure when constructing the trie tree according to the standard POI library.
  • the processing procedure includes:
  • based on the street (town) codes in the region code dictionary, the standard POI library is split. Specifically, with the street (town) code as the splitting key, two trie trees are constructed for each street (town) in the standard POI library from the information subordinate to it. One trie tree contains all road information under the corresponding street (town) and is defined as the first trie tree; the other contains all POI information (POI name, latitude and longitude) under the corresponding street (town) and is defined as the second trie tree.
  • a word segmentation operation is performed on the address text to be encoded carried in the address encoding request to obtain a region phrase sequence; the region word of the lowest-level administrative region is then extracted from the region phrase sequence and matched against the pre-stored region code dictionary to determine the corresponding region code; the target trie tree corresponding to the address text to be encoded is then determined according to that region code; the POI information corresponding to the address text to be encoded is then determined from the target trie tree and used as the encoding result; finally, the credibility type of the encoding result is detected by the trained credibility detection model.
  • the encoding speed can be significantly improved when the address text is encoded, so that the encoding of a large amount of address text can be completed in a short time to meet address encoding
  • the trained credibility detection model is finally used to evaluate the credibility of the coding result, which can ensure the reliability of the subsequent use of the coding result.
  • the address encoding device 400 includes: a word segmentation module 401, a matching module 402, a first determination module 403, a second determination module 404, and a detection module 405.
  • the word segmentation module 401 is configured to, when an address encoding request is received, perform word segmentation operations on the address text to be encoded carried in the address encoding request to obtain a geographic phrase sequence;
  • the first determining module 403 is configured to determine the target trie tree corresponding to the address text to be encoded according to the determined geographic code
  • the internal memory provides an environment for the running of the computer program in the non-volatile storage medium.
  • the processor can execute any address encoding method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 5 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • the processor is used to run a computer program stored in a memory to implement the following steps:
  • the SVM model is trained according to the coding result and the feature of the training data, and a trained SVM model is obtained as a trained credibility detection model.
  • the embodiments of the present application also provide a computer-readable storage medium, and the computer-readable storage medium may be volatile or non-volatile.
  • a computer program is stored on the computer-readable storage medium, and the computer program includes program instructions. The method implemented when the program instructions are executed can refer to the various embodiments of the address encoding method of the present application.

Abstract

An address coding method and apparatus, and a computer device and a computer-readable storage medium, which belong to the technical field of intelligent decision-making. The method comprises: when an address coding request is received, performing a word segmentation operation on address text to be coded that is carried by the address coding request, so as to obtain a territory word group sequence (S101); extracting a territory word of a lowest-level administrative region from the territory word group sequence, and matching the extracted territory word with a pre-stored territory code dictionary so as to determine a territory code corresponding to the extracted territory word (S102); determining, according to the determined territory code, a target trie tree corresponding to the address text to be coded (S103); determining, from the target trie tree, POI information corresponding to the address text to be coded, and taking same as a coding result of the address text to be coded (S104); and detecting a credibility type of the coding result by means of a trained credibility detection model (S105). By means of the method, a requirement for coding massive pieces of address text can be met, and the reliability of a coding result can be ensured.

Description

Address coding method, apparatus, computer device, and computer-readable storage medium
This application claims priority to Chinese patent application No. CN 202010899558.3, entitled "Address coding method, apparatus, computer device, and computer-readable storage medium", filed with the Chinese Patent Office on August 31, 2020, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the technical field of intelligent decision-making, and in particular to an address encoding method, apparatus, computer device, and computer-readable storage medium.
Background
Address encoding means finding the location on Earth (latitude and longitude) that corresponds to an address text. Address encoding technology is used in many fields, such as logistics and map search. In map search, for example, a map application must encode the address a user searches for in order to display the search result on the map.
Technical problem
The inventor realized that as the number of users grows, the demand for address encoding keeps increasing, reaching tens of millions or even hundreds of millions of addresses. The existing approach, which encodes addresses by calling a server through an interface, is too slow to meet this demand, and its reliability is difficult to guarantee.
Technical solution
This application provides an address encoding method, the method comprising:
when an address encoding request is received, performing a word segmentation operation on the address text to be encoded carried in the address encoding request to obtain a region phrase sequence;
extracting the region word of the lowest-level administrative region from the region phrase sequence, and matching the region word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the region word of the lowest-level administrative region;
determining, according to the determined region code, the target trie tree corresponding to the address text to be encoded;
determining, from the target trie tree, the POI information corresponding to the address text to be encoded, as the encoding result of the address text to be encoded;
detecting the credibility type of the encoding result by means of a trained credibility detection model.
This application also provides an address encoding apparatus, the apparatus comprising:
a word segmentation module, configured to, when an address encoding request is received, perform a word segmentation operation on the address text to be encoded carried in the address encoding request to obtain a region phrase sequence;
a matching module, configured to extract the region word of the lowest-level administrative region from the region phrase sequence, and match the region word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the region word of the lowest-level administrative region;
a first determining module, configured to determine, according to the determined region code, the target trie tree corresponding to the address text to be encoded;
a second determining module, configured to determine, from the target trie tree, the POI information corresponding to the address text to be encoded, as the encoding result of the address text to be encoded;
a detection module, configured to detect the credibility type of the encoding result by means of a trained credibility detection model.
This application also provides a computer device, the computer device comprising a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the following steps:
when an address encoding request is received, performing a word segmentation operation on the address text to be encoded carried in the address encoding request to obtain a region phrase sequence;
extracting the region word of the lowest-level administrative region from the region phrase sequence, and matching the region word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the region word of the lowest-level administrative region;
determining, according to the determined region code, the target trie tree corresponding to the address text to be encoded;
determining, from the target trie tree, the POI information corresponding to the address text to be encoded, as the encoding result of the address text to be encoded;
detecting the credibility type of the encoding result by means of a trained credibility detection model.
This application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
when an address encoding request is received, performing a word segmentation operation on the address text to be encoded carried in the address encoding request to obtain a region phrase sequence;
extracting the region word of the lowest-level administrative region from the region phrase sequence, and matching the region word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the region word of the lowest-level administrative region;
determining, according to the determined region code, the target trie tree corresponding to the address text to be encoded;
determining, from the target trie tree, the POI information corresponding to the address text to be encoded, as the encoding result of the address text to be encoded;
detecting the credibility type of the encoding result by means of a trained credibility detection model.
Brief description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of an address encoding method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of another address encoding method provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of yet another address encoding method provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of an address encoding apparatus provided by an embodiment of this application;
FIG. 5 is a schematic block diagram of the structure of a computer device according to an embodiment of this application.
The following detailed description further explains this application with reference to the above drawings.
Embodiments of the present invention
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
The flowcharts shown in the drawings are only examples. They need not include all of the content and operations/steps, and the steps need not be executed in the order described. For example, some operations/steps can be decomposed, combined, or partially merged, so the actual order of execution may change according to the actual situation.
It should be understood that the terms used in this specification are only for the purpose of describing specific embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
The embodiments of this application provide an address encoding method, apparatus, device, and computer-readable storage medium. The address encoding method is mainly applied to an address encoding device, which is a distributed server cluster composed of multiple servers and is configured with the Spark framework.
Some embodiments of this application are described in detail below with reference to the drawings. Where no conflict arises, the following embodiments and the features in them can be combined with each other.
Please refer to FIG. 1, which is a schematic flowchart of an address encoding method provided by an embodiment of this application.
As shown in FIG. 1, the address encoding method includes steps S101 to S105.
Step S101: when an address encoding request is received, perform a word segmentation operation on the address text to be encoded carried in the address encoding request to obtain a region phrase sequence.
When the address encoding device receives an address encoding request, it extracts the address text to be encoded from the request and then uses NLP (natural language processing) techniques to segment it. That is, the characters in the address text that denote administrative regions, roads, and other elements are split apart; the administrative-region part ends at the characters of the lowest-level administrative region contained in the address text. The result is a phrase sequence containing administrative regions, roads, and/or other elements. For example, segmenting "xx市xx区xx街道xx路xx花园旁边70m" (xx City, xx District, xx Street, xx Road, beside xx Garden, 70 m) yields "xx市xx区xx街道/xx路/xx花园/旁边70m".
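The shape of the segmentation output described above can be illustrated with a minimal sketch. The source only states that NLP techniques are used; the suffix-based regular expressions, the suffix lists, and the function name `segment_address` below are assumptions made for illustration, not the method of the application.

```python
import re

# Hypothetical suffix patterns (assumptions): a real system would use a
# trained NLP segmenter rather than fixed suffix lists.
ADMIN = re.compile(r"(?:.+?(?:市|区|县|街道|镇))+")   # runs up to the lowest-level region
ROAD = re.compile(r".+?(?:路|街|大道|巷)")
PLACE = re.compile(r".+?(?:花园|大厦|小区|广场)")


def segment_address(text):
    """Split an address into [administrative part, road, place, remainder]."""
    parts = []
    rest = text
    for pattern in (ADMIN, ROAD, PLACE):
        m = pattern.match(rest)
        if m:
            parts.append(m.group(0))
            rest = rest[m.end():]
    if rest:
        parts.append(rest)  # e.g. a fuzzy word plus a distance
    return parts


print("/".join(segment_address("xx市xx区xx街道xx路xx花园旁边70m")))
```

The administrative part is kept as one phrase that ends at the lowest-level region found, which mirrors the boundary rule in the paragraph above.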
Step S102: extract the region word of the lowest-level administrative region from the region phrase sequence, and match it against the pre-stored region code dictionary to determine the region code corresponding to the region word of the lowest-level administrative region.
The address encoding device stores a pre-built region code dictionary, which records the region names of the national administrative divisions, the region codes of the national administrative divisions, and the mapping between the two.
After the region phrase sequence is obtained, the region word of the lowest-level administrative region is extracted from it and matched against the pre-built region code dictionary to determine the corresponding region code.
In an embodiment, matching the region word of the lowest-level administrative region against the pre-stored region code dictionary to determine the corresponding region code specifically comprises: comparing the region word of the lowest-level administrative region with the pre-stored region code dictionary to determine the pre-stored region name in the dictionary that matches it; determining, based on the mapping between pre-stored region names and pre-stored region codes in the dictionary, the pre-stored region code corresponding to the matching pre-stored region name; and using the determined pre-stored region code as the region code corresponding to the region word of the lowest-level administrative region.
In other words, the pre-stored region name that matches the region word of the lowest-level administrative region is looked up in the pre-built region code dictionary, and the mapping between pre-stored region names and pre-stored region codes then gives the pre-stored region code for that name. This code is the region code corresponding to the region word of the lowest-level administrative region.
It can be understood that if the lowest-level administrative region in the address text to be encoded is a street (town), the region code corresponding to its region word is a street (town) code; if it is a district, the region code is a district code; and if it is a city, the region code is a city code.
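The dictionary lookup in step S102 amounts to a name-to-code mapping. The sketch below assumes a plain Python dict keyed by the full administrative phrase; the names, the code values, and the helper `lookup_region_code` are hypothetical and only illustrate how the level of the returned code follows from the lowest-level region present.

```python
# Hypothetical region code dictionary: region name -> region code.
# Real entries would come from national administrative division data.
REGION_CODE_DICT = {
    "A市": "4401",
    "A市B区": "440101",
    "A市B区C街道": "440101001",
}

# Suffix -> level of the lowest administrative region (assumed suffix list).
LEVEL_BY_SUFFIX = (("街道", "street"), ("镇", "street"), ("区", "district"), ("市", "city"))


def lookup_region_code(admin_phrase):
    """Return (region_code, level) for a region phrase, or None if unmatched."""
    code = REGION_CODE_DICT.get(admin_phrase)
    if code is None:
        return None
    for suffix, level in LEVEL_BY_SUFFIX:
        if admin_phrase.endswith(suffix):
            return code, level
    return code, "unknown"


print(lookup_region_code("A市B区C街道"))
print(lookup_region_code("A市B区"))
```

Keying by the full phrase rather than by the last word alone is a design choice of this sketch: it avoids ambiguity between equally named streets in different districts.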
Step S103: determine, according to the determined region code, the target trie tree corresponding to the address text to be encoded.
The address encoding device stores two pre-built trie trees for every street (town) in the country, indexed by the region code of the corresponding street (town). One trie tree contains all road information under that street (town), including road names, and is defined as the first trie tree. The other contains all POI (Point of Interest) information under that street (town), including POI name, address, and latitude and longitude, and is defined as the second trie tree.
If the lowest-level administrative region in the address text to be encoded is a street (town), the street (town) code is compared with the index information of the pre-stored trie trees to find the first trie tree and the second trie tree corresponding to that street (town). These two trie trees are defined as the target trie tree.
If the lowest-level administrative region in the address text to be encoded is a district, the codes of all streets (towns) under the district can be found in the pre-stored region code dictionary from the district code. The code of each street (town) under the district is then compared with the index information of the pre-stored trie trees to find the target trie tree for each street (town) under that district.
If the lowest-level administrative region in the address text to be encoded is a city, the codes of all districts under the city can be found in the pre-stored region code dictionary from the city code, followed by the codes of all streets (towns) under each of those districts. The code of each street (town) is then compared with the index information of the pre-stored trie trees to find the target trie tree for each street (town) of each district under that city.
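The three cases of step S103 (street, district, city) can be sketched as one recursive expansion from a region code down to street (town) codes. The parent-to-children table `CHILDREN`, the index `TRIES_BY_STREET`, the code values, and the placeholder strings standing in for trie objects are all hypothetical.

```python
# Hypothetical hierarchy taken from the region code dictionary: code -> child codes.
CHILDREN = {
    "4401": ["440101", "440102"],            # city -> districts
    "440101": ["440101001", "440101002"],    # district -> streets (towns)
    "440102": ["440102001"],
}

# Hypothetical index: street (town) code -> (first trie, second trie).
# Strings stand in for real trie objects here.
TRIES_BY_STREET = {
    "440101001": ("roads_001", "pois_001"),
    "440101002": ("roads_002", "pois_002"),
    "440102001": ("roads_003", "pois_003"),
}


def street_codes_under(code):
    """Expand a city, district, or street code into the street codes it covers."""
    if code in TRIES_BY_STREET:
        return [code]
    result = []
    for child in CHILDREN.get(code, []):
        result.extend(street_codes_under(child))
    return result


def target_tries(code):
    """Return the (first trie, second trie) pairs for every covered street."""
    return [TRIES_BY_STREET[c] for c in street_codes_under(code)]


print(target_tries("440101"))
```

A street code resolves to a single pair, while a district or city code resolves to one pair per subordinate street (town), so coarser addresses require searching more trie trees.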
Step S104: determine, from the target trie tree, the POI information corresponding to the address text to be encoded, as the encoding result of the address text to be encoded.
If the lowest-level administrative region in the address text to be encoded is a street (town), matching is first performed by road: the road name in the address text is matched against the first trie tree of the target trie tree for that street (town). The forward maximum matching algorithm is used to compute the length of text that the road name can match in the first trie tree. When this length reaches a preset first threshold, the two are considered consistent and the road name is confirmed as matched in the first trie tree.
Next, based on the road name matched in the first trie tree, matching is performed by number in the second trie tree of the target trie tree: the number in the address text to be encoded is matched against the second trie tree to find, under the matched road name, the number that matches it. In this way the POI information corresponding to the address text to be encoded is determined from the second trie tree.
In addition, when matching by number, if the second trie tree contains no number identical to the number in the address text to be encoded, a number whose degree of match with it reaches a preset second threshold is selected as the matching number. Both the preset first threshold and the preset second threshold can be set flexibly according to actual needs and are not limited here.
若待编码地址文本中的最低层级行政区域为区或市,则将待编码地址文本中的路名与该区下每个镇对应的第一trie树进行匹配,或将待编码地址文本中的路名与该市下每个镇对应的第一trie树进行匹配。If the lowest-level administrative region in the address text to be encoded is a district or city, then the road name in the address text to be encoded is matched with the first trie tree corresponding to each town in the district, or the address text in the address to be encoded The road name is matched with the first trie tree corresponding to each town in the city.
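The road-matching step can be illustrated with a small character-level trie. The sketch below is only one possible reading: the text does not say whether the first threshold is an absolute length or a ratio, so a ratio (`min_ratio`) is assumed here, and the class and function names are hypothetical.

```python
class Trie:
    """Character-level trie holding the road names of one street (town)."""

    def __init__(self):
        self.children = {}
        self.terminal = False

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def matched_length(self, text: str) -> int:
        """Count the leading characters of text that follow a path in the trie."""
        node, length = self, 0
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                break
            length += 1
        return length


def road_matches(trie: Trie, road_name: str, min_ratio: float = 0.75) -> bool:
    """Treat the road as matched when enough of its name is found in the trie.

    min_ratio stands in for the preset first threshold (an assumption).
    """
    if not road_name:
        return False
    return trie.matched_length(road_name) / len(road_name) >= min_ratio
```

For a district-level or city-level address, the same check would simply be run once per street (town) trie returned by the index lookup.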
In one embodiment, after the POI information corresponding to the address text to be encoded is determined from the target trie tree, the method further includes: determining whether the address text to be encoded contains a fuzzy word and/or a number; if the address text contains a fuzzy word, appending the fuzzy word to the determined POI information as the encoding result; if the address text contains a number, normalizing the number and appending the normalized number to the determined POI information as the encoding result; and if the address text contains both a fuzzy word and a number, normalizing the number and appending the fuzzy word and then the normalized number to the determined POI information as the encoding result.
That is, after the POI information corresponding to the address text to be encoded has been matched, it is further determined whether the address text contains fuzzy words and/or numbers. Examples of fuzzy words are "beside", "opposite", and "to the southeast"; an example of a number is 200m. If the address text contains a fuzzy word, the fuzzy word is appended to the matched POI information as the encoding result. If the address text contains a number, the number is normalized to the range 1-100: for example, 200m becomes 100m, while 70m stays 70m. The normalized number is then appended to the matched POI information as the encoding result. If a number follows the fuzzy word, the number is normalized and the normalized number is appended after the fuzzy word as the encoding result.
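As a rough sketch of this post-processing, the normalization can be read as clamping the distance to the range 1-100, which reproduces both examples in the text (200m becomes 100m, 70m stays 70m). The fuzzy-word list, the regular expression for distances, and the function names below are illustrative assumptions rather than details given in the source.

```python
import re

# Hypothetical fuzzy-word list; the source only gives a few examples.
FUZZY_WORDS = ("beside", "opposite", "southeast of")


def normalize_distance(meters: int) -> int:
    """Clamp a distance in meters to the range 1-100 (assumed reading)."""
    return max(1, min(100, meters))


def append_modifiers(poi: str, address_text: str) -> str:
    """Append the fuzzy word, then the normalized number, to the matched POI."""
    parts = [poi]
    fuzzy = next((w for w in FUZZY_WORDS if w in address_text), None)
    if fuzzy is not None:
        parts.append(fuzzy)
    distance = re.search(r"(\d+)\s*m\b", address_text)
    if distance is not None:
        parts.append(f"{normalize_distance(int(distance.group(1)))}m")
    return " ".join(parts)
```

With these assumptions, `append_modifiers("Tower A", "opposite Tower A 200m")` yields `"Tower A opposite 100m"`, and an address with neither a fuzzy word nor a distance returns the POI unchanged.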
Step S105: detect the credibility type of the encoding result using the trained credibility detection model.
After the encoding result is obtained, its credibility is also evaluated. Specifically, the encoding result is input into the pre-trained credibility detection model, which outputs the credibility type of the encoding result. The credibility types are: completely accurate, relatively accurate, basically accurate, and inaccurate.
In one embodiment, as shown in FIG. 2, steps S106 and S107 are performed before step S101.
Step S106: obtain national administrative division data and construct a regional coding dictionary from the national administrative division data.
That is, the regional coding dictionary must be constructed before step S101. Specifically, national administrative division data is collected, which contains the 8-digit province-city-district-street (town) codes for the whole country. Reading each 8-digit code from left to right, the first two digits identify the province, the first four digits identify the city, the first six digits identify the district, and the last two digits identify the street (town), for example:
[Figure: example 8-digit administrative division codes; image not reproduced in this text]
From the 8-digit code of each province-city-district-street (town) entry, the first two digits are extracted and padded with six zeros to obtain the province code, which is associated with the province name. The first four digits are then extracted and padded with four zeros to obtain the city code, which is associated with the city name. The first six digits are extracted and padded with two zeros to obtain the district code, which is associated with the district name. The full 8-digit code is the street (town) code and is associated with the street (town) name. The regional coding dictionary is obtained from these associations of province code with province name, city code with city name, district code with district name, and street (town) code with street (town) name.
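The derivation of the four code levels from one 8-digit code can be sketched as follows. The input row layout and the function name are assumptions made for illustration, and the sample code in the usage note is an illustrative value, not data taken from the source.

```python
def build_region_dictionary(rows):
    """Build a code -> name dictionary covering all four administrative levels.

    rows: iterable of (code8, province, city, district, street) tuples,
    where code8 is the 8-digit street (town) code (assumed input layout).
    """
    code_to_name = {}
    for code, province, city, district, street in rows:
        code_to_name[code[:2] + "000000"] = province  # first 2 digits + six zeros
        code_to_name[code[:4] + "0000"] = city        # first 4 digits + four zeros
        code_to_name[code[:6] + "00"] = district      # first 6 digits + two zeros
        code_to_name[code] = street                   # full 8-digit code
    return code_to_name
```

For instance, a row `("44030501", "Guangdong", "Shenzhen", "Nanshan", "Yuehai")` produces the keys `44000000`, `44030000`, `44030500`, and `44030501`. The sketch keys the dictionary by code because place names can repeat across regions; a name-to-code view for matching could be derived from it.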
Step S107: obtain national POI data, construct the trie tree corresponding to each street or town from the national POI data, and store the constructed trie trees in a distributed manner.
That is, the trie trees corresponding to each street (town) must also be constructed before step S101. Specifically, national POI data is collected from geographic information providers or city open data platforms; each POI includes information such as category, name, address, and latitude and longitude. The trie tree corresponding to each street or town is then constructed from the national POI data, and the constructed trie trees are stored in a distributed manner. This avoids the memory overflow that can occur when a single server stores Chinese trie trees, and greatly speeds up text matching.
In one embodiment, constructing the trie tree corresponding to each street or town from the national POI data specifically includes: cleaning the national POI data; based on a pre-configured Hadoop framework, using the Hive component of the Hadoop framework to store the cleaned national POI data in a Hive table in a preset format, obtaining a standard POI library; based on a pre-configured Spark framework, creating a component task for the standard POI library; and executing the component task to obtain the trie tree corresponding to each street or town.
Because national POI data does not serve address encoding alone, the collected national POI data may include redundant data. The national POI data is therefore cleaned first to filter out the unneeded redundant data. Then, based on the Hadoop framework of the address encoding device (the address encoding device is configured with the Hadoop framework), the Hive component of Hadoop is used to store the cleaned national POI data in a Hive table in the format "province-city-district-street (town)-road-number-POI name", obtaining the standard POI library.
Further, based on the Spark framework of the address encoding device, a component task is created for the standard POI library in order to construct trie trees from it. Specifically, when the component task is created, the corresponding task parameters are configured on the Spark component page; the task parameters include the execution time, the Spark code, and so on. The Spark code defines the processing performed when trie trees are constructed from the standard POI library, which includes:
a. Initialize the addresses and POIs in the standard POI library;
b. Split the standard POI library based on the street (town) codes in the regional coding dictionary. Specifically, using the street (town) code as the splitting key, two trie trees are constructed for each street (town) in the standard POI library from the information on that street (town) and all of its roads. One trie tree contains all road information under the corresponding street (town) and is defined as the first trie tree; the other contains all POI information (POI name and latitude and longitude) under the corresponding street (town) and is defined as the second trie tree.
Executing the component task applies the above processing to the standard POI library in the Hive table and yields two trie trees for each street (town). The two trie trees of each street (town) are stored in a distributed manner, with the street (town) code used as the index information of the corresponding trie trees.
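The splitting step can be sketched on a single machine with plain Python, leaving out Hive and Spark entirely. The record layout, the nested-dict trie representation, the choice of "road + number" as the key of the second trie tree, and all names below are assumptions made for illustration.

```python
from collections import defaultdict

END = "__end__"  # payload marker; longer than one character, so it cannot clash with a character key


def trie_insert(root: dict, key: str, payload) -> None:
    """Insert key into a nested-dict trie and attach payload at its last node."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node[END] = payload


def trie_get(root: dict, key: str):
    """Return the payload stored for key, or None if key is absent."""
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(END)


def build_street_tries(records):
    """Group records by street (town) code and build two tries per code.

    records: iterable of (street_code, road, number, poi_name, lat, lon).
    Returns {street_code: (road_trie, poi_trie)}.
    """
    tries = defaultdict(lambda: ({}, {}))
    for street_code, road, number, poi_name, lat, lon in records:
        road_trie, poi_trie = tries[street_code]
        trie_insert(road_trie, road, True)                          # first trie: roads
        trie_insert(poi_trie, road + number, (poi_name, lat, lon))  # second trie: POIs
    return dict(tries)
```

Keying the result by street (town) code mirrors the use of that code as index information, so each pair of tries can be placed on whichever server owns that code.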
Thus, by using a distributed server cluster together with the distributed storage of trie trees, the encoding speed can be significantly improved when address text is subsequently encoded.
In one embodiment, as shown in FIG. 3, step S108 is performed before step S101.
Step S108: train a credibility detection model to obtain the trained credibility detection model.
That is, before step S101, the credibility detection model used to detect the credibility of encoding results must also be trained in advance.
In one embodiment, step S108 specifically includes: collecting address text with accurate latitude and longitude as training data; encoding the training data to obtain encoding results, and extracting features of the training data during encoding; and training an SVM model on the encoding results and the features of the training data to obtain a trained SVM model, which serves as the trained credibility detection model.
The credibility detection model may be a support vector machine (SVM) model. Specifically, address text with accurate latitude and longitude is first collected as training data. The training data is encoded, and features of the training data are extracted during encoding, such as whether an administrative region is present, the level of the administrative region, the road match ratio, the house-number similarity, the POI similarity, whether a fuzzy value is present, and the fuzzy-value distance ratio. The SVM model is then trained on the encoding results and the features of the training data.
The SVM model must distinguish four credibility types: completely accurate, relatively accurate, basically accurate, and inaccurate. The task of the SVM model is to learn the regularities in the encoding results of the training data and separate these four cases into four classes, gradually forming its own decision boundary. An encoding result whose spherical distance from the actual latitude and longitude of the training data is less than 20m is classed as completely accurate; one whose distance is in the range 20-100m is classed as relatively accurate; one whose distance is in the range 100-1000m is classed as basically accurate; and one whose distance is greater than 1000m is classed as inaccurate. This gives the model its criteria for judging the credibility type and yields the trained SVM model, which serves as the trained credibility detection model and lays the foundation for the subsequent credibility evaluation of encoding results.
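The labelling of training samples by spherical distance might look like the sketch below. The source does not name a distance formula, so the haversine formula with a mean Earth radius is assumed, and the handling of the exact boundary values (20m, 100m, 1000m) is likewise an assumption. Training the SVM itself is not shown.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def credibility_label(distance_m: float) -> str:
    """Map a distance to one of the four credibility types.

    Boundary handling is an assumption; the text only gives the ranges.
    """
    if distance_m < 20:
        return "completely accurate"
    if distance_m <= 100:
        return "relatively accurate"
    if distance_m <= 1000:
        return "basically accurate"
    return "inaccurate"
```

Each training sample would then pair its feature vector with `credibility_label(haversine_m(...))`, computed between the encoded coordinates and the known true coordinates.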
In the address encoding method provided by the above embodiments, when an address encoding request is received, word segmentation is performed on the address text to be encoded carried in the request to obtain a regional phrase sequence. The regional word of the lowest-level administrative region is then extracted from the regional phrase sequence and matched against the pre-stored regional coding dictionary to determine the corresponding regional code. The target trie tree corresponding to the address text is determined from that regional code, and the POI information corresponding to the address text is determined from the target trie tree as the encoding result. Finally, the credibility type of the encoding result is detected by the trained credibility detection model. Because this approach is built on a distributed server cluster and uses trie trees, it can significantly increase encoding speed, so that a massive volume of address text can be encoded in a short time and address encoding requirements can be met. Evaluating the encoding result with the trained credibility detection model helps ensure that the result can be relied on in subsequent use.
Please refer to FIG. 4, which is a schematic block diagram of an address encoding apparatus provided by an embodiment of the present application.
As shown in FIG. 4, the address encoding apparatus 400 includes a word segmentation module 401, a matching module 402, a first determining module 403, a second determining module 404, and a detection module 405.
The word segmentation module 401 is configured to, when an address encoding request is received, perform word segmentation on the address text to be encoded carried in the address encoding request to obtain a regional phrase sequence;
The matching module 402 is configured to extract the regional word of the lowest-level administrative region from the regional phrase sequence and match it against the pre-stored regional coding dictionary to determine the regional code corresponding to the regional word of the lowest-level administrative region;
The first determining module 403 is configured to determine, according to the determined regional code, the target trie tree corresponding to the address text to be encoded;
The second determining module 404 is configured to determine, from the target trie tree, the POI information corresponding to the address text to be encoded as the encoding result of the address text to be encoded;
The detection module 405 is configured to detect the credibility type of the encoding result using the trained credibility detection model.
It should be noted that, as those skilled in the art will clearly understand, for convenience and brevity of description, the specific working processes of the apparatus, modules, and units described above can be found in the corresponding processes of the foregoing address encoding method embodiments and are not repeated here.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can run on the computer device shown in FIG. 5.
Please refer to FIG. 5, which is a schematic structural block diagram of a computer device provided by an embodiment of the present application. The computer device may be a device with data processing capability, such as a personal computer (PC) or a server.
As shown in FIG. 5, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may be volatile or non-volatile.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to perform any one of the address encoding methods.
The processor provides computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any one of the address encoding methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
When an address encoding request is received, performing word segmentation on the address text to be encoded carried in the address encoding request to obtain a regional phrase sequence; extracting the regional word of the lowest-level administrative region from the regional phrase sequence and matching it against the pre-stored regional coding dictionary to determine the regional code corresponding to the regional word of the lowest-level administrative region; determining, according to the determined regional code, the target trie tree corresponding to the address text to be encoded; determining, from the target trie tree, the POI information corresponding to the address text to be encoded as the encoding result of the address text to be encoded; and detecting the credibility type of the encoding result using the trained credibility detection model.
In some embodiments, when the processor implements the matching of the regional word of the lowest-level administrative region against the pre-stored regional coding dictionary to determine the corresponding regional code, this includes:
Comparing the regional word of the lowest-level administrative region with the pre-stored regional coding dictionary to determine the pre-stored region name in the dictionary that matches the regional word of the lowest-level administrative region;
Determining, based on the mapping between pre-stored region names and pre-stored regional codes in the pre-stored regional coding dictionary, the pre-stored regional code corresponding to the pre-stored region name that matches the regional word of the lowest-level administrative region;
Using the determined pre-stored regional code as the regional code corresponding to the regional word of the lowest-level administrative region.
In some embodiments, after the processor implements the determining of the POI information corresponding to the address text to be encoded from the target trie tree, the steps further include:
Determining whether the address text to be encoded contains a fuzzy word and/or a number;
If the address text to be encoded contains a fuzzy word, appending the fuzzy word to the determined POI information as the encoding result;
If the address text to be encoded contains a number, normalizing the number and appending the normalized number to the determined POI information as the encoding result;
If the address text to be encoded contains both a fuzzy word and a number, normalizing the number and appending the fuzzy word and then the normalized number to the determined POI information as the encoding result.
In some embodiments, before the processor implements the step of performing word segmentation on the address text to be encoded carried in the address encoding request when the request is received, to obtain the regional phrase sequence, the steps include:
Obtaining national administrative division data and constructing a regional coding dictionary from the national administrative division data;
Obtaining national POI data, constructing the trie tree corresponding to each street or town from the national POI data, and storing the constructed trie trees in a distributed manner.
In some embodiments, when the processor implements the constructing of the trie tree corresponding to each street or town from the national POI data, this includes:
Cleaning the national POI data;
Based on a pre-configured Hadoop framework, using the Hive component of the Hadoop framework to store the cleaned national POI data in a Hive table in a preset format, obtaining a standard POI library;
Based on a pre-configured Spark framework, creating a component task for the standard POI library;
Executing the component task to obtain the trie tree corresponding to each street or town.
In some embodiments, before the processor implements the step of performing word segmentation on the address text to be encoded carried in the address encoding request when the request is received, to obtain the regional phrase sequence, the steps include:
Training a credibility detection model to obtain the trained credibility detection model.
In some embodiments, when the processor implements the training of the credibility detection model to obtain the trained credibility detection model, this includes:
Collecting address text with accurate latitude and longitude as training data;
Encoding the training data to obtain encoding results, and extracting features of the training data during encoding;
Training an SVM model on the encoding results and the features of the training data to obtain a trained SVM model, which serves as the trained credibility detection model.
An embodiment of the present application further provides a computer-readable storage medium, which may be volatile or non-volatile. A computer program is stored on the computer-readable storage medium and includes program instructions; for the method implemented when the program instructions are executed, reference may be made to the embodiments of the address encoding method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created through the use of a blockchain node, and the like.
The blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments. The above are only specific implementations of the present application, but the protection scope of the present application is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. An address encoding method, wherein the address encoding method comprises the following steps:
    when an address encoding request is received, performing a word segmentation operation on the to-be-encoded address text carried in the address encoding request to obtain a regional word sequence;
    extracting the regional word of the lowest-level administrative region from the regional word sequence, and matching the regional word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the regional word of the lowest-level administrative region;
    determining, according to the determined region code, a target dictionary tree (trie) corresponding to the to-be-encoded address text;
    determining, from the target trie, information point (POI) information corresponding to the to-be-encoded address text as the encoding result of the to-be-encoded address text;
    detecting a credibility type of the encoding result by means of a trained credibility detection model.
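The first four steps of the method can be sketched as follows. This is a simplified illustration rather than the claimed implementation: the input is assumed to be already segmented, the last word found in the dictionary is treated as the lowest-level region word, and the region code `R001`, the POI record, and the names `Trie`, `longest_match`, and `encode` are hypothetical. The credibility detection step is not shown.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.poi = None  # POI record stored at the node where a POI name ends


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, name, poi):
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.poi = poi

    def longest_match(self, text):
        """Return the POI whose name is the longest prefix of text, or None."""
        node, best = self.root, None
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                break
            if node.poi is not None:
                best = node.poi
        return best


# Toy data (made up for illustration): one street-level region and one POI.
REGION_CODES = {"粤海街道": "R001"}
TRIES = {"R001": Trie()}
TRIES["R001"].insert("科技园", {"poi_id": "P1", "name": "科技园"})


def encode(words, region_codes, tries):
    """Map a segmented address to a POI via region code and per-region trie."""
    # Scan from the end: the last dictionary hit is assumed to be the
    # lowest-level administrative region in the address.
    for i in range(len(words) - 1, -1, -1):
        code = region_codes.get(words[i])
        if code is not None:
            trie = tries.get(code)
            if trie is None:
                return None
            # Everything after the region word is looked up in that trie.
            return trie.longest_match("".join(words[i + 1:]))
    return None
```

Keeping one trie per street or town means each lookup only walks a small tree selected by the region code, instead of searching a single nationwide index.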
  2. The address encoding method according to claim 1, wherein, before the step of performing, when an address encoding request is received, a word segmentation operation on the to-be-encoded address text carried in the address encoding request to obtain a regional word sequence, the method comprises:
    acquiring national administrative division data, and constructing the region code dictionary according to the national administrative division data;
    acquiring national POI data, constructing a trie corresponding to each street or town according to the national POI data, and storing the constructed tries in a distributed manner.
  3. The address encoding method according to claim 2, wherein the constructing a trie corresponding to each street or town according to the national POI data comprises:
    cleaning the national POI data;
    based on a pre-configured Hadoop distributed system framework, using Hive, the data warehouse tool component of the Hadoop framework, to store the cleaned national POI data in a Hive table in a preset format to obtain a standard POI library;
    based on a pre-configured Spark computing engine framework, creating a component task for the standard POI library;
    executing the component task to obtain the trie corresponding to each street or town.
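The clean, group, and build sequence in this claim can be sketched in plain Python. This sketch deliberately swaps the Hive table and Spark task for in-memory lists and a dictionary, so it shows only the logic and not the distributed execution. The record fields `region_code` and `name`, the `END` marker, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

END = "$poi"  # hypothetical marker key meaning "a POI name ends at this node"


def clean(records):
    """Drop incomplete rows, trim whitespace, and remove exact duplicates."""
    seen, out = set(), []
    for rec in records:
        code = (rec.get("region_code") or "").strip()
        name = (rec.get("name") or "").strip()
        if not code or not name or (code, name) in seen:
            continue
        seen.add((code, name))
        out.append({"region_code": code, "name": name})
    return out


def build_tries(records):
    """Group cleaned POI rows by region code and build one nested-dict trie per group."""
    tries = defaultdict(dict)
    for rec in clean(records):
        node = tries[rec["region_code"]]
        for ch in rec["name"]:
            # Child keys are single characters, so they never collide with END.
            node = node.setdefault(ch, {})
        node[END] = rec["name"]
    return dict(tries)
```

In a distributed setting the grouping by region code would be the step that lets each street or town's trie be built independently of the others.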
  4. The address encoding method according to claim 1, wherein the matching the regional word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the regional word of the lowest-level administrative region comprises:
    comparing the regional word of the lowest-level administrative region with the pre-stored region code dictionary to determine the pre-stored region name in the dictionary that matches the regional word of the lowest-level administrative region;
    determining, based on the mapping between pre-stored region names and pre-stored region codes in the pre-stored region code dictionary, the pre-stored region code corresponding to the pre-stored region name that matches the regional word of the lowest-level administrative region;
    using the determined pre-stored region code as the region code corresponding to the regional word of the lowest-level administrative region.
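One way to picture the name-to-code mapping is a dictionary whose entries also carry an administrative level, so that the lowest-level region word in the sequence can be chosen before its code is read off. The rows, the `R`-prefixed codes, the numeric levels, and the function names below are illustrative assumptions, not data or structures taken from the application.

```python
# Hypothetical dictionary rows: (region name, region code, level).
# A larger level number means a lower administrative tier.
ROWS = [
    ("广东省", "R44", 1),
    ("深圳市", "R4403", 2),
    ("南山区", "R440305", 3),
    ("粤海街道", "R440305007", 4),
]


def build_region_dict(rows):
    """Build the pre-stored mapping: region name -> (region code, level)."""
    return {name: (code, level) for name, code, level in rows}


def lowest_level_code(words, region_dict):
    """Return the code of the lowest-level region word in the sequence, or None."""
    best = None
    for word in words:
        hit = region_dict.get(word)  # exact match against pre-stored region names
        if hit is not None and (best is None or hit[1] > best[1]):
            best = hit
    return best[0] if best is not None else None
```

A name-keyed dictionary like this cannot distinguish regions in different cities that share a name; a fuller version would need to disambiguate using the higher-level words in the same sequence.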
  5. The address encoding method according to claim 1, wherein, after the determining, from the target trie, the POI information corresponding to the to-be-encoded address text, the method further comprises:
    determining whether the to-be-encoded address text contains fuzzy words and/or numbers;
    if the to-be-encoded address text contains a fuzzy word, appending the fuzzy word after the determined POI information as the encoding result;
    if the to-be-encoded address text contains a number, normalizing the number, and appending the normalized number after the determined POI information as the encoding result;
    if the to-be-encoded address text contains both a fuzzy word and a number, normalizing the number, and appending the fuzzy word and then the normalized number after the determined POI information as the encoding result.
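The three branches above reduce to one rule: append any fuzzy words first, then any normalized numbers. The sketch below assumes a small hypothetical fuzzy-word list and takes "normalization" to mean converting full-width digits to ASCII digits; the claims do not specify either, so both are assumptions, as is the function name `postprocess`.

```python
import re
import unicodedata

# Hypothetical fuzzy-word list ("nearby", "beside", "opposite").
FUZZY_WORDS = ("附近", "旁边", "对面")


def postprocess(poi_name, address_text):
    """Return [POI name, fuzzy words..., normalized numbers...] as the encoding result."""
    # NFKC folds full-width digits such as "１０" into ASCII "10".
    text = unicodedata.normalize("NFKC", address_text)
    fuzzy = [word for word in FUZZY_WORDS if word in text]
    numbers = re.findall(r"\d+", text)
    # Fuzzy words are appended before numbers, matching the order in the claim.
    return [poi_name] + fuzzy + numbers
```

The result is kept as a list of parts rather than one concatenated string so that adjacent numbers stay distinguishable. Note that this simplified version picks up every number in the address text, including any that appear before the POI name.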
  6. The address encoding method according to claim 1, wherein, before the step of performing, when an address encoding request is received, a word segmentation operation on the to-be-encoded address text carried in the address encoding request to obtain a regional word sequence, the method comprises:
    training a credibility detection model to obtain the trained credibility detection model.
  7. The address encoding method according to claim 6, wherein the training a credibility detection model to obtain the trained credibility detection model comprises:
    collecting address text with accurate latitude and longitude as training data;
    encoding the training data to obtain encoding results, and extracting features of the training data during the encoding process;
    training a support vector machine (SVM) model according to the encoding results and the features of the training data to obtain a trained SVM model as the trained credibility detection model.
  8. An address encoding apparatus, wherein the address encoding apparatus comprises:
    a word segmentation module, configured to, when an address encoding request is received, perform a word segmentation operation on the to-be-encoded address text carried in the address encoding request to obtain a regional word sequence;
    a matching module, configured to extract the regional word of the lowest-level administrative region from the regional word sequence, and match the regional word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the regional word of the lowest-level administrative region;
    a first determining module, configured to determine, according to the determined region code, a target trie corresponding to the to-be-encoded address text;
    a second determining module, configured to determine, from the target trie, POI information corresponding to the to-be-encoded address text as the encoding result of the to-be-encoded address text;
    a detection module, configured to detect a credibility type of the encoding result by means of a trained credibility detection model.
  9. A computer device, wherein the computer device comprises a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the following steps:
    when an address encoding request is received, performing a word segmentation operation on the to-be-encoded address text carried in the address encoding request to obtain a regional word sequence;
    extracting the regional word of the lowest-level administrative region from the regional word sequence, and matching the regional word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the regional word of the lowest-level administrative region;
    determining, according to the determined region code, a target dictionary tree (trie) corresponding to the to-be-encoded address text;
    determining, from the target trie, information point (POI) information corresponding to the to-be-encoded address text as the encoding result of the to-be-encoded address text;
    detecting a credibility type of the encoding result by means of a trained credibility detection model.
  10. The computer device according to claim 9, wherein the computer program, when executed by the processor, further implements the following steps:
    acquiring national administrative division data, and constructing the region code dictionary according to the national administrative division data;
    acquiring national POI data, constructing a trie corresponding to each street or town according to the national POI data, and storing the constructed tries in a distributed manner.
  11. The computer device according to claim 10, wherein the constructing a trie corresponding to each street or town according to the national POI data comprises:
    cleaning the national POI data;
    based on a pre-configured Hadoop distributed system framework, using Hive, the data warehouse tool component of the Hadoop framework, to store the cleaned national POI data in a Hive table in a preset format to obtain a standard POI library;
    based on a pre-configured Spark computing engine framework, creating a component task for the standard POI library;
    executing the component task to obtain the trie corresponding to each street or town.
  12. The computer device according to claim 9, wherein the matching the regional word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the regional word of the lowest-level administrative region comprises:
    comparing the regional word of the lowest-level administrative region with the pre-stored region code dictionary to determine the pre-stored region name in the dictionary that matches the regional word of the lowest-level administrative region;
    determining, based on the mapping between pre-stored region names and pre-stored region codes in the pre-stored region code dictionary, the pre-stored region code corresponding to the pre-stored region name that matches the regional word of the lowest-level administrative region;
    using the determined pre-stored region code as the region code corresponding to the regional word of the lowest-level administrative region.
  13. The computer device according to claim 9, wherein the computer program, when executed by the processor, implements the following steps:
    determining whether the to-be-encoded address text contains fuzzy words and/or numbers;
    if the to-be-encoded address text contains a fuzzy word, appending the fuzzy word after the determined POI information as the encoding result;
    if the to-be-encoded address text contains a number, normalizing the number, and appending the normalized number after the determined POI information as the encoding result;
    if the to-be-encoded address text contains both a fuzzy word and a number, normalizing the number, and appending the fuzzy word and then the normalized number after the determined POI information as the encoding result.
  14. The computer device according to claim 9, wherein the computer program, when executed by the processor, implements the following steps:
    training a credibility detection model to obtain the trained credibility detection model.
  15. The computer device according to claim 14, wherein the training a credibility detection model to obtain the trained credibility detection model comprises:
    collecting address text with accurate latitude and longitude as training data;
    encoding the training data to obtain encoding results, and extracting features of the training data during the encoding process;
    training a support vector machine (SVM) model according to the encoding results and the features of the training data to obtain a trained SVM model as the trained credibility detection model.
  16. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the following steps:
    when an address encoding request is received, performing a word segmentation operation on the to-be-encoded address text carried in the address encoding request to obtain a regional word sequence;
    extracting the regional word of the lowest-level administrative region from the regional word sequence, and matching the regional word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the regional word of the lowest-level administrative region;
    determining, according to the determined region code, a target dictionary tree (trie) corresponding to the to-be-encoded address text;
    determining, from the target trie, information point (POI) information corresponding to the to-be-encoded address text as the encoding result of the to-be-encoded address text;
    detecting a credibility type of the encoding result by means of a trained credibility detection model.
  17. The computer device according to claim 16, wherein the computer program, when executed by the processor, further implements the following steps:
    acquiring national administrative division data, and constructing the region code dictionary according to the national administrative division data;
    acquiring national POI data, constructing a trie corresponding to each street or town according to the national POI data, and storing the constructed tries in a distributed manner.
  18. The computer device according to claim 17, wherein the constructing a trie corresponding to each street or town according to the national POI data comprises:
    cleaning the national POI data;
    based on a pre-configured Hadoop distributed system framework, using Hive, the data warehouse tool component of the Hadoop framework, to store the cleaned national POI data in a Hive table in a preset format to obtain a standard POI library;
    based on a pre-configured Spark computing engine framework, creating a component task for the standard POI library;
    executing the component task to obtain the trie corresponding to each street or town.
  19. The computer device according to claim 16, wherein the matching the regional word of the lowest-level administrative region against a pre-stored region code dictionary to determine the region code corresponding to the regional word of the lowest-level administrative region comprises:
    comparing the regional word of the lowest-level administrative region with the pre-stored region code dictionary to determine the pre-stored region name in the dictionary that matches the regional word of the lowest-level administrative region;
    determining, based on the mapping between pre-stored region names and pre-stored region codes in the pre-stored region code dictionary, the pre-stored region code corresponding to the pre-stored region name that matches the regional word of the lowest-level administrative region;
    using the determined pre-stored region code as the region code corresponding to the regional word of the lowest-level administrative region.
  20. The computer device according to claim 16, wherein the computer program, when executed by the processor, implements the following steps:
    determining whether the to-be-encoded address text contains fuzzy words and/or numbers;
    if the to-be-encoded address text contains a fuzzy word, appending the fuzzy word after the determined POI information as the encoding result;
    if the to-be-encoded address text contains a number, normalizing the number, and appending the normalized number after the determined POI information as the encoding result;
    if the to-be-encoded address text contains both a fuzzy word and a number, normalizing the number, and appending the fuzzy word and then the normalized number after the determined POI information as the encoding result.
PCT/CN2020/136330 2020-08-31 2020-12-15 Address coding method and apparatus, and computer device and computer-readable storage medium WO2021189977A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010899558.3 2020-08-31
CN202010899558.3A CN112069276B (en) 2020-08-31 2020-08-31 Address coding method, address coding device, computer equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021189977A1 true WO2021189977A1 (en) 2021-09-30

Family

ID=73666253

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136330 WO2021189977A1 (en) 2020-08-31 2020-12-15 Address coding method and apparatus, and computer device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN112069276B (en)
WO (1) WO2021189977A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114153851A (en) * 2021-12-06 2022-03-08 智慧足迹数据科技有限公司 GEOHASH indexing method, GEOHASH indexing device, computer equipment and storage medium
CN116246288A (en) * 2023-05-10 2023-06-09 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435360B (en) * 2019-01-15 2023-08-29 菜鸟智能物流控股有限公司 Address type identification method and device and electronic equipment
CN112069276B (en) * 2020-08-31 2024-03-08 平安科技(深圳)有限公司 Address coding method, address coding device, computer equipment and computer readable storage medium
CN112835897B (en) * 2021-01-29 2024-03-15 上海寻梦信息技术有限公司 Geographic area division management method, data conversion method and related equipment
CN114491089B (en) * 2022-01-28 2023-08-29 北京百度网讯科技有限公司 Address acquisition method, address acquisition device, electronic equipment and medium
CN115526147A (en) * 2022-08-30 2022-12-27 江苏新流数字科技有限公司 Code capable of reading physical space and compiling method and application thereof

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699623A (en) * 2013-12-19 2014-04-02 百度在线网络技术(北京)有限公司 Geo-coding realizing method and device
CN105468632A (en) * 2014-09-05 2016-04-06 高德软件有限公司 Geocoding method and apparatus
CN106874287A (en) * 2015-12-11 2017-06-20 北京四维图新科技股份有限公司 A kind of processing method and processing device of point of interest POI geocodings
US20180080794A1 (en) * 2016-04-12 2018-03-22 Beijing Didi Infinity Technology And Development C O., Ltd. Systems and methods for determining a point of interest
CN109344213A (en) * 2018-08-28 2019-02-15 浙江工业大学 A kind of Chinese Geocoding based on dictionary tree
CN109933797A (en) * 2019-03-21 2019-06-25 东南大学 Geocoding and system based on Jieba participle and address dictionary
CN110990520A (en) * 2019-11-28 2020-04-10 中国建设银行股份有限公司 Address coding method and device, electronic equipment and storage medium
CN112069276A (en) * 2020-08-31 2020-12-11 平安科技(深圳)有限公司 Address coding method and device, computer equipment and computer readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7046827B2 (en) * 2002-02-15 2006-05-16 International Business Machines Corporation Adapting point geometry for storing address density
CN101882163A (en) * 2010-06-30 2010-11-10 中国科学院地理科学与资源研究所 Fuzzy Chinese address geographic evaluation method based on matching rule
CN103914544A (en) * 2014-04-03 2014-07-09 浙江大学 Method for quickly matching Chinese addresses in multi-level manner on basis of address feature words
CN107798065B (en) * 2017-09-21 2020-07-07 平安科技(深圳)有限公司 Client number coding method, application server, system and storage medium
CN109145169B (en) * 2018-07-26 2021-03-26 浙江省测绘科学技术研究院 Address matching method based on statistical word segmentation
CN109145073A (en) * 2018-08-28 2019-01-04 成都市映潮科技股份有限公司 A kind of address resolution method and device based on segmentation methods
CN109408781A (en) * 2018-10-09 2019-03-01 北京邮电大学 A kind of consignment address coding method based on administrative division

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114153851A (en) * 2021-12-06 2022-03-08 智慧足迹数据科技有限公司 GEOHASH indexing method, GEOHASH indexing device, computer equipment and storage medium
CN116246288A (en) * 2023-05-10 2023-06-09 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device
CN116246288B (en) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 Text coding method, model training method, model matching method and device

Also Published As

Publication number Publication date
CN112069276A (en) 2020-12-11
CN112069276B (en) 2024-03-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20927732; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20927732; Country of ref document: EP; Kind code of ref document: A1)