WO2019165644A1 - Address error correction method and terminal - Google Patents

Address error correction method and terminal

Info

Publication number
WO2019165644A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
name
dictionary tree
corrected
error correction
Prior art date
Application number
PCT/CN2018/077926
Other languages
French (fr)
Chinese (zh)
Inventor
李林贵
吴卫东
周涛
Original Assignee
福建联迪商用设备有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 福建联迪商用设备有限公司
Priority to CN201880000142.4A (patent CN108369582B)
Priority to PCT/CN2018/077926 (publication WO2019165644A1)
Publication of WO2019165644A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • the present invention relates to the field of data processing, and in particular, to an address error correction method and a terminal.
  • the main post-processing methods applied after address information has been recognized by OCR include the vocabulary-construction method, statistical language models, syntax trees, similar characters, and distance information. The vocabulary-construction method and statistical language models are the most commonly used.
  • a statistical language model uses probability statistics to capture the relationship between adjacent characters or between adjacent words, and derives the most likely result from the probability of each relationship.
  • the Markov model is commonly used. For example, take the address "Hu x Province, Changsha City" (湖x省长沙市), where x is an unrecognized character. According to address statistics, the conditional probability that "Hu" (湖) is followed by "nan" (南) is N1, and that it is followed by "bei" (北) is M1; the conditional probability that "nan" is followed by "Province" (省) is N2, and that "bei" is followed by "Province" is M2. The probability of "Hunan Province" is then N1*N2, and the probability of "Hubei Province" is M1*M2. From the character "Chang" (长) that follows "Province", "Hunan Province" is found to be more probable than "Hubei Province", so the address is "Changsha City, Hunan Province".
  • an address can usually be split into several words, and the relationship between words is stronger than the relationship between individual characters. Therefore, a word-based statistical language model is better suited to address error correction.
  • address error correction with a word-based statistical language model generally proceeds as follows: address data is collected to build an address database on which the language model is trained, yielding the conditional probabilities between different address names, which are saved as parameters; the address is then split into words according to some word segmentation rule; finally, a search algorithm finds the optimal solution of the language model, that is, the address with the highest probability of occurrence.
  • the drawback of a word-based statistical language model is that word probabilities must be computed and a search algorithm must be run to derive the final address.
  • when the model is trained, the parameter space is huge and a very large corpus is needed. If the corpus is too small, conditional probabilities of 0 occur easily and the model performs poorly. Addresses also contain similar place names that statistical probabilities may fail to tell apart, and raising the order of the Markov model makes the parameter space grow sharply.
  • the vocabulary-construction method stores the classified words in some data structure and queries that vocabulary to obtain candidate words with which to correct the erroneous word.
  • the data structure can be linear or tree-shaped. In general, linear structures are less efficient in time and space.
  • tree structures are therefore commonly used, such as the dictionary tree (trie) used in search engines.
  • a dictionary tree is built so that words with the same prefix share nodes starting from the root; for example, add, and, and andy are stored in the tree structure shown in Figure 1. Saving data as a dictionary tree shares nodes and reduces redundancy.
  • however, because there are so many distinct Chinese characters and each node stores one character together with its pointers, the resulting dictionary tree is very large and takes up a lot of space.
  • the drawback of the dictionary tree is thus that a dictionary tree built from address data is too large and occupies too much space.
  • the technical problem to be solved by the present invention is how to reduce the space occupied by the address error correction process.
  • the technical solution adopted by the present invention is:
  • the invention provides an address error correction method, comprising:
  • S2 Identify, according to the first dictionary tree, a province name corresponding to the to-be-corrected address, to obtain a first-level name; the first dictionary tree is configured to store a province name and a city name;
  • S3 Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is configured to store a city name, a county name, and a district name corresponding to the current province name;
  • S5 Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is configured to store a township name, a village name, and a street name corresponding to the second-level name;
  • the present invention also provides an address error correction terminal comprising one or more processors and a memory, the memory storing a program configured to be executed by the one or more processors to perform the following steps:
  • S2 Identify, according to the first dictionary tree, a province name corresponding to the to-be-corrected address, and obtain a first-level name;
  • the first dictionary tree is configured to store a province name and a city name;
  • S3 Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is configured to store a city name, a county name, and a district name corresponding to the current province name;
  • S5 Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is configured to store a township name, a village name, and a street name corresponding to the second-level name;
  • the beneficial effects of the invention are as follows. In the prior art, correcting an address requires loading a complete dictionary tree covering all national addresses, which occupies a large amount of space.
  • by contrast, the present invention stores national addresses hierarchically by province, city and county, and township, village and street. It checks the province information, the city and county information, and the township, village and street information of the address to be corrected in turn, and dynamically retrieves the dictionary tree for the next address level according to each check result. This greatly reduces memory use during address error correction while keeping accuracy high.
  • Figure 1 is a schematic diagram of a dictionary tree
  • FIG. 2 is a flow chart of a specific implementation manner of an address error correction method according to the present invention.
  • FIG. 3 is a structural block diagram of a specific implementation manner of an address error correction terminal according to the present invention.
  • FIG. 4 is a schematic diagram of a first dictionary tree
  • Figure 5 is a schematic diagram of a second dictionary tree
  • FIG. 6 is a schematic diagram of a third dictionary tree
  • FIG. 7 is a schematic diagram of a dictionary tree corresponding to an address to be corrected
  • the core technical idea of the invention is to store national addresses hierarchically by province, city and county, and township, village and street; to check the province information, the city and county information, and the township, village and street information of the address to be corrected in turn; and to dynamically retrieve the dictionary tree for the next address level according to each check result, which reduces memory use during address error correction.
  • the present invention provides an address error correction method, including:
  • S2 Identify, according to the first dictionary tree, a province name corresponding to the to-be-corrected address, to obtain a first-level name; the first dictionary tree is configured to store a province name and a city name;
  • S3 Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is configured to store a city name, a county name, and a district name corresponding to the current province name;
  • S5 Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is configured to store a township name, a village name, and a street name corresponding to the second-level name;
  • S2 is specifically:
  • a node in the first dictionary tree represents a province name or a city name
  • a node in the second dictionary tree represents a city name, a county name or a zone name
  • a node in the third dictionary tree represents one of a township name, a village name, or a street name.
  • S5 is specifically:
  • a trimmed third dictionary tree is constructed from the branches of the third dictionary tree that match the current character; the root node of the third dictionary tree is the second-level name.
  • the characters at the preset positions are the first character and the fourth character after the second-level name.
  • the first character after the second-level name is generally the first character of the town name
  • the fourth character after the second-level name is generally the first character of the village name.
  • the towns and villages after the county name can therefore be screened, which effectively reduces the number of dictionary tree nodes that need to be generated.
  • the method further includes:
  • S75 is specifically:
  • this improves the accuracy of selecting, from the one or more candidate addresses, the address most similar to the address to be corrected as the correct address.
  • S1 is specifically:
  • the address information in the identity card is identified by an optical character recognition technology to obtain the address to be corrected.
  • the present invention also provides an address error correction terminal including one or more processors 1 and a memory 2, the memory 2 storing a program configured to be executed by the one or more processors 1 to perform the following steps:
  • S2 Identify, according to the first dictionary tree, a province name corresponding to the to-be-corrected address, to obtain a first-level name; the first dictionary tree is configured to store a province name and a city name;
  • S3 Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is configured to store a city name, a county name, and a district name corresponding to the current province name;
  • S5 Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is configured to store a township name, a village name, and a street name corresponding to the second-level name;
  • S2 is specifically:
  • a node in the first dictionary tree represents a province name or a city name
  • a node in the second dictionary tree represents a city name, a county name or a zone name
  • a node in the third dictionary tree represents one of a township name, a village name, or a street name.
  • S5 is specifically:
  • a trimmed third dictionary tree is constructed from the branches of the third dictionary tree that match the current character; the root node of the third dictionary tree is the second-level name.
  • the characters at the preset positions are the first character and the fourth character after the second-level name.
  • the method further includes:
  • S75 is specifically:
  • S1 is specifically:
  • the address information in the identity card is identified by an optical character recognition technology to obtain the address to be corrected.
  • Embodiment 1 of the present invention is:
  • This embodiment provides an address error correction method, including:
  • the address information in the identity card is identified by an optical character recognition technology to obtain the to-be-corrected address.
  • the address to be corrected is "Hongshan, Hongshan, Gulou District, Fuchuan City, Fujian province”.
  • S2 Identify, by the first dictionary tree, a province name corresponding to the to-be-corrected address, to obtain a first-level name; the first dictionary tree is configured to store a province name and a city name. Specifically:
  • a node in the first dictionary tree represents a province name or a city name; the province name is located in the first layer, and the city name corresponding to the province name is located in the second layer.
  • the province to which the address to be corrected belongs is Fujian Province, so the first-level name is Fujian Province.
  • the second dictionary tree is configured to store a city name, a county name, and a zone name corresponding to the current province name.
  • the node in the second dictionary tree represents a city name, a county name, or a zone name.
  • the root node of the second dictionary tree is the first-level name.
  • FIG. 5 is a second dictionary tree corresponding to Fujian province.
  • the district to which the address to be corrected belongs is Gulou District
  • the second-level name is Gulou District
  • S5 Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is configured to store a township name, a village name, and a street name corresponding to the second-level name; specifically:
  • a trimmed third dictionary tree is constructed from the branches of the third dictionary tree that match the current character; the root node of the third dictionary tree is the second-level name.
  • each node in the third dictionary tree represents one character of a township name, a village name, or a street name.
  • the characters at the preset positions are the first character and the fourth character after the second-level name.
  • FIG. 6 is a third dictionary tree corresponding to Gulou District.
  • the third dictionary tree is stored with one character per node.
  • the address input at query time is, for example, "Hongshan Bridge, Hongshan Town, Gulou District, Fuzhou City, Fujian Province".
  • the part of the address below the district is "Hongshan Bridge, Hongshan Town", and branches are selected according to its first character, "Hong".
  • other branches, such as "Wufeng Street", do not need to be restored, which reduces memory usage.
  • the first character and the fourth character are generally the first characters of the town name and the village name. This general pattern is used to reduce the number of branches that must be restored; an address that does not follow the pattern allows no reduction, and the original third dictionary tree has to be restored in full.
  • after the trimmed third dictionary tree is constructed, each character after the county name of the address to be corrected is queried in it. If a character is not found, the child nodes of all nodes of the current branch in the third dictionary tree are used as candidate nodes, and the next character is queried among those candidates. For example, the address to be corrected is "Five North Road, Gulou District, Fuzhou City, Fujian Province".
  • the dictionary tree corresponding to each level is dynamically acquired, and a complete dictionary tree corresponding to the address to be corrected is constructed, as shown in FIG. 7 .
  • the address recognized by OCR is called the address to be corrected, and it may contain errors.
  • candidate addresses are obtained by querying the address to be corrected in the dictionary tree. The candidate most similar to the address to be corrected is selected as the best address, where similarity is measured by the number of identical Chinese characters in the same positions. The best address is then compared with the address to be corrected: where the run of consecutive differing characters is within two characters, the best address is taken as correct; where two or more consecutive characters differ, the corresponding part of the address to be corrected is kept as correct. Following this principle, the best address and the address to be corrected are combined into the final correct address.
  • the address to be corrected is “Hongshan, Hongshan, Gulou District, Fuchuan City, Fujian province”.
  • the candidate address is "Hongshan Bridge, Hongshan Town, Gulou District, Fuzhou City, Fujian Province".
  • "Fuzhou City" is taken as the correct name in place of "Fuchuan City" in the address to be corrected.
  • the final address after correction is "Hongshan Bridge, Hongshan Town, Gulou District, Fuzhou City, Fujian Province".
  • error correction likewise compares the recognition result with the query result level by level, and decides whether to correct according to the error correction principle above.
  • national addresses are stored hierarchically by province, city, county, town and village, and the province names are saved as the first dictionary tree, used to look up the province of the address to be corrected.
  • the city-level and county-level addresses of each province are saved as a second dictionary tree, used to look up district and county names.
  • township-, village- and street-level addresses are stored character by character to reduce redundancy, and the dictionary tree is restored from this form.
  • the dictionary tree is restored only as far as the address to be corrected requires, which reduces the number of nodes.
  • at the province, city, district and county levels, the correct address name can be obtained according to similarity.
  • a candidate address closest to the correct address can be obtained.
  • the address to be corrected and the candidate address are compared to obtain the corrected address.
  • the address to be corrected is “Hongshan and Hongshan Overseas Chinese in Gulou District, Fuchuan City, Fujian province”.
  • the present invention does not need to train a parametric model, and does not need to compute word occurrence probabilities repeatedly.
  • nor is a search algorithm needed to find an optimal path: once the dictionary tree has been constructed it can be queried directly, which is faster. Different cities may contain counties, towns or villages with the same name. Under a statistical model, a first-order Markov model may be unable to decide between them, and raising the order to make the decision also increases the amount of computation.
  • the present invention follows different branches of the constructed dictionary tree during a query; address names below the county level are stored as nodes, and candidate addresses are obtained by backtracking from the lowest node.
  • based on the information in the address to be corrected, the present invention constructs only the county dictionary tree for the province, city and county being queried, and then trims that county dictionary tree according to the address to be corrected, which greatly reduces the space required and the query time. Saved as text, the national address data is about 60 MB and the address data of an average province is about 2 MB. Building the address dictionary tree of a whole province at query time takes at least a dozen MB of memory, and one address query takes about 5 s. When village-level addresses are queried by county name, only the addresses under that county need to be restored.
  • the dictionary tree restored after trimming generally holds only a few KB of data, and one address query takes about 0.05 s.
  • saved as text layer by layer, the tree nodes take about 10 MB, which shows that the dictionary tree structure effectively removes the redundancy in village addresses.
  • the nodes of the county dictionary tree use bidirectional pointers. From the last node reached by a query, the tree can be traced back to the first node; joining the nodes along the way gives the address prefix, from which the candidate addresses are obtained.
  • a general dictionary tree used for searching has one-way pointers, so nodes can only be visited from top to bottom; the dictionary tree of the present invention uses bidirectional pointers, so candidate addresses can be obtained by backtracking from a lower-level node to the first node.
  • the query times were measured in Debug mode of Visual Studio on the same laptop.
  • Embodiment 2 of the present invention is:
  • the embodiment provides an address error correction terminal comprising one or more processors 1 and a memory 2, the memory 2 storing a program configured to be executed by the one or more processors 1 to perform the following steps:
  • the address information in the identity card is identified by an optical character recognition technology to obtain the to-be-corrected address.
  • S2 Identify, by the first dictionary tree, a province name corresponding to the to-be-corrected address, to obtain a first-level name; the first dictionary tree is configured to store a province name and a city name. Specifically:
  • the node in the first dictionary tree represents a province name or a city name; the province name is located on the first layer, and the city name corresponding to the province name is located on the second layer.
  • the second dictionary tree is configured to store a city name, a county name, and a zone name corresponding to the current province name.
  • the node in the second dictionary tree represents a city name, a county name, or a zone name.
  • the root node of the second dictionary tree is the first-level name.
  • S5 Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is configured to store a township name, a village name, and a street name corresponding to the second-level name; specifically:
  • a trimmed third dictionary tree is constructed from the branches of the third dictionary tree that match the current character; the root node of the third dictionary tree is the second-level name.
  • each node in the third dictionary tree represents one character of a township name, a village name, or a street name.
  • the characters at the preset positions are the first character and the fourth character after the second-level name.
  • the address error correction method and terminal store national addresses hierarchically by province, city and county, and township, village and street; they check in turn the province information, the city and county information, and the township, village and street information in the address to be corrected, and dynamically retrieve the dictionary tree for the next address level according to each check result, which greatly reduces memory use during address error correction while keeping accuracy high.
  • when the province name in the address to be corrected is badly misrecognized, the province can still be confirmed from the city name, which helps improve the accuracy of error correction.
  • province, city and county names are unlikely to repeat, so each whole name can be saved as one node.
  • below the county level there may be townships, villages, or streets.
  • their names are more likely to repeat, so they share common prefixes in the tree, which effectively reduces redundancy and the space required.
  • the capacity of the third dictionary tree is reduced, that is, less space is needed to check township, village and street addresses.
  • the first character after the second-level name is generally the first character of the town name
  • the fourth character after the second-level name is generally the first character of the village name.
  • the towns and villages after the county name can therefore be screened, which effectively reduces the number of dictionary tree nodes that need to be generated. Further, this improves the accuracy of selecting, from the one or more candidate addresses, the address most similar to the address to be corrected as the correct address.
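
The screening by first and fourth character, and the choice of the most similar candidate, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the nested-dict tree layout, and the three sample names are hypothetical, and the screening is done on the stored names before the tree is built rather than on tree branches.

```python
def build_trie(names):
    """Store names one character per node, as nested dicts (assumed layout)."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count all nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.values())


def select_branches(names, tail, positions=(0, 3)):
    """Keep names that agree with `tail` at the preset positions
    (index 0 and 3, i.e. the first and fourth character)."""
    def fits(name):
        return all(p >= len(name) or p >= len(tail) or name[p] == tail[p]
                   for p in positions)
    kept = [n for n in names if fits(n)]
    # If nothing fits the pattern, fall back to the full set of names.
    return kept or list(names)


def most_similar(target, candidates):
    """Pick the candidate with the most identical characters in identical positions."""
    return max(candidates, key=lambda c: sum(a == b for a, b in zip(target, c)))


# Hypothetical entries under one county, and a hypothetical OCR result.
NAMES = ["洪山镇洪山桥", "洪山镇西洋村", "五凤街道鼓西路"]
OCR_TAIL = "洪山镇洪山侨"

if __name__ == "__main__":
    kept = select_branches(NAMES, OCR_TAIL)
    print(count_nodes(build_trie(NAMES)), count_nodes(build_trie(kept)))
    print(most_similar(OCR_TAIL, kept))
```

Building the tree only from the screened names is what keeps the node count small; when the first or fourth character does not match any stored name, the sketch restores everything, mirroring the fallback described in the text.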

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Document Processing Apparatus (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of data processing, and in particular to an address error correction method and terminal. The method comprises: acquiring an address to be corrected; identifying, based on a first trie, the province name corresponding to the address to be corrected to obtain a first-level name, the first trie storing province names and city names; acquiring a second trie corresponding to the first-level name, the second trie storing the city names, county names and district names of the current province; identifying, based on the second trie, the county name or district name corresponding to the address to be corrected to obtain a second-level name; acquiring a third trie corresponding to the second-level name, the third trie storing the township names, village names and street names under the second-level name; and acquiring, based on the third trie, one or more candidate addresses corresponding to the address to be corrected to obtain a candidate address set. The space occupied during address error correction is thereby reduced.
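
The level-by-level flow summarized in the abstract can be sketched in Python as below. Everything here is an assumption made for illustration: plain dicts and lists stand in for the three tries, `load_tree` stands in for reading a tree from storage, matching uses a simple same-position character count, and the sample data (including the misrecognized input in the usage line) is hypothetical.

```python
# Hypothetical stand-ins for trees kept in storage.
FIRST_TREE = {"福建省": ["福州市", "厦门市"], "湖南省": ["长沙市"]}
SECOND_TREES = {
    "福建省": {"福州市": ["鼓楼区", "台江区"], "厦门市": ["思明区"]},
    "湖南省": {"长沙市": ["岳麓区"]},
}
THIRD_TREES = {"鼓楼区": ["洪山镇洪山桥", "五凤街道"]}

loaded = []  # records which lower-level trees were actually loaded


def load_tree(store, key):
    """Simulate loading a single tree on demand."""
    loaded.append(key)
    return store[key]


def overlap(text, name):
    """Number of identical characters in identical positions."""
    return sum(a == b for a, b in zip(text, name))


def closest(text, names):
    """Name that best matches the start of `text`."""
    return max(names, key=lambda n: overlap(text, n))


def candidate_addresses(address):
    province = closest(address, FIRST_TREE)        # first-level name
    rest = address[len(province):]
    second = load_tree(SECOND_TREES, province)     # only this province's tree
    city = closest(rest, second)
    rest = rest[len(city):]
    county = closest(rest, second[city])           # second-level name
    rest = rest[len(county):]
    third = load_tree(THIRD_TREES, county)         # only this county's tree
    return [province + city + county + tail
            for tail in third if overlap(rest, tail) > 0]


if __name__ == "__main__":
    print(candidate_addresses("福建省福川市鼓楼区洪山镇洪山侨"))
    print(loaded)
```

The point of the sketch is the `loaded` list: only one province-level tree and one county-level tree are touched per query, instead of a single tree covering all national addresses.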

Description

Address error correction method and terminal
Technical field
The present invention relates to the field of data processing, and in particular to an address error correction method and terminal.
Background
The main post-processing methods applied after address information has been recognized by OCR include the vocabulary-construction method, statistical language models, syntax trees, similar characters, and distance information. The vocabulary-construction method and statistical language models are the most commonly used.
A statistical language model uses probability statistics to capture the relationship between adjacent characters or between adjacent words, and derives the most likely result from the probability of each relationship; the Markov model is commonly used. For example, take the address "Hu x Province, Changsha City" (湖x省长沙市), where x is an unrecognized character. According to address statistics, the conditional probability that "Hu" (湖) is followed by "nan" (南) is N1, and that it is followed by "bei" (北) is M1; the conditional probability that "nan" is followed by "Province" (省) is N2, and that "bei" is followed by "Province" is M2. The probability of "Hunan Province" is then N1*N2, and the probability of "Hubei Province" is M1*M2. From the character "Chang" (长) that follows "Province", "Hunan Province" is found to be more probable than "Hubei Province", so the address is "Changsha City, Hunan Province". An address can usually be split into several words, and the relationship between words is stronger than the relationship between individual characters, so a word-based statistical language model is better suited to address error correction. Address error correction with a word-based statistical language model generally proceeds as follows: address data is collected to build an address database on which the language model is trained, yielding the conditional probabilities between different address names, which are saved as parameters; the address is then split into words according to some word segmentation rule; finally, a search algorithm finds the optimal solution of the language model, that is, the address with the highest probability of occurrence.
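
As a rough illustration of the N1*N2 versus M1*M2 arithmetic in this paragraph, the following Python sketch scores the two ways of filling the unknown character. The probability values, the `COND` table, and both function names are assumptions made up for the example; they do not come from the patent.

```python
from math import prod

# Hypothetical conditional probabilities P(next | previous); values are illustrative.
COND = {
    ("湖", "南"): 0.6,  # N1
    ("湖", "北"): 0.4,  # M1
    ("南", "省"): 0.5,  # N2
    ("北", "省"): 0.5,  # M2
}


def chain_probability(text, cond):
    """Multiply P(b | a) over adjacent character pairs; unseen pairs count as 0."""
    return prod(cond.get(pair, 0.0) for pair in zip(text, text[1:]))


def best_fill(template, options, cond):
    """Replace the unknown character 'x' with each option and keep the likeliest."""
    return max((template.replace("x", o) for o in options),
               key=lambda s: chain_probability(s, cond))


if __name__ == "__main__":
    print(chain_probability("湖南省", COND))  # N1 * N2
    print(chain_probability("湖北省", COND))  # M1 * M2
    print(best_fill("湖x省", "南北", COND))
```

Returning 0 for an unseen pair also shows the sparse-corpus problem: a single missing statistic wipes out the score of an otherwise plausible candidate.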
但是,基于词的统计语言模型的缺点是需要计算词语出现的概率,利用搜索算法得出最后的地址。训练统计语言模型时,参数空间庞大,需要规模巨大的语料库,如果语料库数据不足,容易出现条件概率为0的情况,导致模型效果变差。地址中存在近似的地名,根据统计概率可能无法区分,如果增加马尔可夫模型的阶数,参数空间会急剧增大。However, the shortcoming of word-based statistical language models is the need to calculate the probability of occurrence of words, using the search algorithm to derive the final address. When training the statistical language model, the parameter space is huge, and a large corpus is needed. If the corpus data is insufficient, it is easy to have a conditional probability of 0, resulting in poor model effect. There are approximate place names in the address, which may not be distinguished according to statistical probability. If the order of the Markov model is increased, the parameter space will increase sharply.
The vocabulary-construction method uses some data structure to store categorized words and queries that vocabulary to obtain likely words with which to correct the erroneous word. The data structure can be linear or tree-shaped. Linear structures generally have poor time and space efficiency, so tree structures are commonly used, such as the dictionary tree (trie) used in search engines. A dictionary tree is built so that words with the same prefix share nodes starting from the root; for example, add, and, and andy are stored as the tree structure shown in FIG. 1. Storing data as a dictionary tree allows nodes to be shared and reduces redundancy. However, because there are so many distinct Chinese characters, and each node stores one Chinese character together with its pointers, the resulting dictionary tree is very large and takes up a lot of space. A query proceeds downward from the root node into the matching branches, and finally all the visited nodes are concatenated to obtain the address.
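A minimal sketch of the prefix sharing described above, using the three words from FIG. 1. The class and function names (`TrieNode`, `insert`, `contains`, `count_nodes`) are illustrative choices, not taken from the patent.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


def insert(root, word):
    """Add a word, reusing any nodes that already spell out its prefix."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def contains(root, word):
    """Walk down from the root; the word is present only if it ends on a word node."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


def count_nodes(node):
    """Count the nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root = TrieNode()
for w in ["add", "and", "andy"]:
    insert(root, w)

print(count_nodes(root))  # the three words hold 10 characters in total
```

Because "a" and "an" are shared, the tree needs fewer nodes than the total number of characters in the stored words.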
However, the drawback of the dictionary tree is that a dictionary tree built from address data is too large and occupies too much space.
Summary of the invention
The technical problem to be solved by the present invention is how to reduce the space occupied during address error correction.
To solve the above technical problem, the present invention adopts the following technical solution:
The present invention provides an address error correction method, comprising:
S1. Obtain an address to be corrected;
S2. Identify, according to a first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name; the first dictionary tree is used to store province names and city names;
S3. Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is used to store the city names, county names, and district names corresponding to the current province name;
S4. Identify, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name;
S5. Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is used to store the township names, village names, and street names corresponding to the second-level name;
S6. Obtain, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
The present invention also provides an address error correction terminal, comprising one or more processors and a memory, the memory storing a program that is configured to be executed by the one or more processors to perform the following steps:
S1. Obtain an address to be corrected;
S2. Identify, according to a first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name; the first dictionary tree is used to store province names and city names;
S3. Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is used to store the city names, county names, and district names corresponding to the current province name;
S4. Identify, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name;
S5. Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is used to store the township names, village names, and street names corresponding to the second-level name;
S6. Obtain, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
The beneficial effects of the present invention are as follows. In the prior art, correcting an address requires loading a complete dictionary tree covering all addresses nationwide, which occupies a large amount of space. In contrast, the present invention stores nationwide addresses hierarchically by province, by city/county/district, and by township/village/street; checks in turn the province information, the city/county/district information, and the township/village/street information in the address to be corrected; and, according to the result of each check, dynamically loads the dictionary tree corresponding to the next address level. This greatly reduces memory usage during address error correction while maintaining high accuracy.
Brief description of the drawings
FIG. 1 is a schematic diagram of a dictionary tree;
FIG. 2 is a flow chart of a specific embodiment of an address error correction method provided by the present invention;
FIG. 3 is a structural block diagram of a specific embodiment of an address error correction terminal provided by the present invention;
FIG. 4 is a schematic diagram of the first dictionary tree;
FIG. 5 is a schematic diagram of the second dictionary tree;
FIG. 6 is a schematic diagram of the third dictionary tree;
FIG. 7 is a schematic diagram of the dictionary tree corresponding to the address to be corrected;
Reference numerals:
1. processor;    2. memory.
Detailed description
The key technical concept of the present invention is as follows: nationwide addresses are stored hierarchically by province, by city/county/district, and by township/village/street; the province information, city/county/district information, and township/village/street information in the address to be corrected are checked in turn; and the dictionary tree corresponding to the next address level is loaded dynamically according to the result of each check, which reduces memory usage during address error correction.
Please refer to FIG. 2 to FIG. 7.
As shown in FIG. 2, the present invention provides an address error correction method, comprising:
S1. Obtain an address to be corrected;
S2. Identify, according to a first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name; the first dictionary tree is used to store province names and city names;
S3. Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is used to store the city names, county names, and district names corresponding to the current province name;
S4. Identify, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name;
S5. Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is used to store the township names, village names, and street names corresponding to the second-level name;
S6. Obtain, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
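The flow of S1 to S6 can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the three levels are modelled as plain dicts rather than serialized dictionary trees, matching at each level uses a same-position character count, and the sample data and the names `load`, `closest`, and `candidate_addresses` are hypothetical.

```python
# Hypothetical stores. In practice each entry would be a serialized dictionary
# tree that is restored only when the previous level has been resolved.
FIRST_TREE = {"福建省": ["福州市", "厦门市"], "湖南省": ["长沙市"]}
SECOND_TREES = {
    "福建省": {"福州市": ["鼓楼区", "台江区"], "厦门市": ["思明区"]},
    "湖南省": {"长沙市": ["岳麓区"]},
}
THIRD_TREES = {"福建省福州市鼓楼区": ["洪山镇洪山桥", "五凤街"]}

loaded = []  # records which stores were touched, to show the on-demand loading


def load(store, key=None):
    loaded.append(key or "first")
    return store if key is None else store[key]


def closest(text, names):
    """Pick the name sharing the most same-position characters with text."""
    return max(names, key=lambda n: sum(a == b for a, b in zip(n, text)))


def candidate_addresses(address):
    first = load(FIRST_TREE)                       # S2: province level
    province = closest(address, first)
    rest = address[len(province):]
    second = load(SECOND_TREES, province)          # S3: tree for this province only
    city = closest(rest, second)
    rest = rest[len(city):]
    district = closest(rest, second[city])         # S4: county or district level
    prefix = province + city + district
    third = load(THIRD_TREES, prefix)              # S5: tree for this district only
    return [prefix + tail for tail in third]       # S6: candidate address set


result = candidate_addresses("福建省福川市鼓楼区洪山滇洪山侨")
print(result)
print(loaded)
```

Only the stores on the path province, city, district are touched; the data for other provinces and districts is never loaded.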
Further, S2 is specifically:
When no province name matching the address to be corrected exists in the first dictionary tree, obtain the city name matching the address to be corrected, to obtain a current city name; obtain the province name corresponding to the current city name, to obtain the first-level name.
As can be seen from the above description, when the province name in the address to be corrected is severely corrupted, the province corresponding to the address to be corrected can still be determined from the city name, which helps improve the accuracy of error correction.
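A minimal sketch of this fallback, assuming the first dictionary tree is modelled as a dict from province name to city names and that "matching" means an exact substring match (the patent does not fix the matching rule). The data and the function name `first_level_name` are hypothetical.

```python
# Hypothetical first-level data: province name -> city names.
FIRST_TREE = {
    "福建省": ["福州市", "厦门市"],
    "湖南省": ["长沙市", "株洲市"],
}


def first_level_name(address):
    """Return the province for an address, falling back to the city name."""
    # Normal case: the province name is readable at the start of the address.
    for province in FIRST_TREE:
        if address.startswith(province):
            return province
    # Fallback: the province is unreadable, so look for a known city name
    # and return the province that city belongs to.
    for province, cities in FIRST_TREE.items():
        if any(city in address for city in cities):
            return province
    return None


print(first_level_name("xx省福州市鼓楼区"))
```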
Further, the following also applies:
A node in the first dictionary tree represents a province name or a city name;
A node in the second dictionary tree represents a city name, a county name, or a district name;
A node in the third dictionary tree represents one character of a township name, village name, or street name.
As can be seen from the above description, province, city, and county names are generally unlikely to repeat, so each whole name can be stored as a single node. Below the county level, however, the names are those of townships, villages, or streets, which are much more likely to repeat, so sharing common prefixes effectively reduces redundancy and the space required.
Further, S5 is specifically:
Obtain the dictionary tree corresponding to the second-level name, to obtain the third dictionary tree;
Obtain, from the address to be corrected, the characters that follow the second-level name and correspond to a preset order, to obtain current characters;
Prune the third dictionary tree to be constructed according to the branches of the third dictionary tree that match the current characters; the root node of the third dictionary tree is the second-level name.
As can be seen from the above description, by designating characters at specific positions and keeping only the branches that match those characters as candidate addresses, the size of the third dictionary tree is reduced, that is, the space needed to check township, village, and street addresses is reduced.
Further, the following also applies:
The characters corresponding to the preset order are the first character after the second-level name and the fourth character after the second-level name.
As can be seen from the above description, the first character after the second-level name is generally the first character of the township name, and the fourth character after the second-level name is generally the first character of the village name. These two characters can generally be used to filter the townships and villages that follow the county name, which effectively reduces the number of dictionary tree nodes that need to be generated.
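The pruning by the first and fourth characters might look like the sketch below. It assumes the third dictionary tree is stored as a list of address tails and that the two position filters are applied one after another, with a filter skipped when it would remove every branch; that fallback, the sample tails, and the name `prune_tails` are assumptions, since the text does not spell out how the two characters are combined.

```python
def prune_tails(tails, rest, positions=(0, 3)):
    """Keep only the stored tails that agree with `rest` at the preset positions.

    `rest` is the part of the address to be corrected after the second-level name.
    Positions 0 and 3 are the first and fourth characters.
    """
    kept = list(tails)
    for p in positions:
        if p >= len(rest):
            continue
        narrowed = [t for t in kept if p < len(t) and t[p] == rest[p]]
        if narrowed:  # skip a filter that would discard every branch
            kept = narrowed
    return kept


# Hypothetical tails stored under one district.
tails = ["洪山镇洪山桥", "洪山镇西洪路", "五凤街", "五四北路"]

pruned = prune_tails(tails, "洪山滇洪山侨")
print(pruned)
```

Only the tails that survive the filter would then be rebuilt into tree nodes, so the restored tree is much smaller than the stored one.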
Further, after S6, the method also includes:
S71. Obtain a candidate address from the candidate address set, to obtain a current candidate address;
S72. Count the number of positions at which the current candidate address and the address to be corrected have the same character, to obtain a match count;
S73. Repeat S71 to S72 until the candidate address set has been traversed;
S74. Obtain the candidate address with the largest match count in the candidate address set, to obtain a best address;
S75. Update the address to be corrected according to the best address, to obtain a correct address.
Further, S75 is specifically:
If the best address contains two or more consecutive characters that do not match the address to be corrected, then:
Obtain, from the best address, the character string located before the two or more consecutive characters that do not match the address to be corrected;
Update the address to be corrected according to the character string, to obtain the correct address;
Otherwise, set the best address as the correct address.
As can be seen from the above description, this improves the accuracy of selecting, from the one or more candidate addresses, the address most similar to the address to be corrected as the correct address.
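Steps S71 to S74 amount to a position-wise comparison followed by a maximum. A minimal sketch, with hypothetical function names and candidate data:

```python
def match_count(candidate, address):
    """S72: number of positions where both strings carry the same character."""
    return sum(a == b for a, b in zip(candidate, address))


def best_address(candidates, address):
    """S71 to S74: scan every candidate and keep the one with the largest count."""
    best, best_count = None, -1
    for candidate in candidates:                 # S71 / S73: traverse the set
        count = match_count(candidate, address)  # S72
        if count > best_count:                   # S74: remember the maximum
            best, best_count = candidate, count
    return best, best_count


ocr_address = "福建省福川市鼓楼区洪山滇洪山侨"
candidates = ["福建省福州市鼓楼区五凤街", "福建省福州市鼓楼区洪山镇洪山桥"]

best, count = best_address(candidates, ocr_address)
print(best, count)
```

Ties are resolved here in favour of the first candidate seen, which is a choice of this sketch rather than something the method specifies.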
Further, S1 is specifically:
Recognize the address information on an identity card by optical character recognition, to obtain the address to be corrected.
As shown in FIG. 3, the present invention also provides an address error correction terminal, comprising one or more processors 1 and a memory 2, the memory 2 storing a program that is configured to be executed by the one or more processors 1 to perform the following steps:
S1. Obtain an address to be corrected;
S2. Identify, according to a first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name; the first dictionary tree is used to store province names and city names;
S3. Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is used to store the city names, county names, and district names corresponding to the current province name;
S4. Identify, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name;
S5. Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is used to store the township names, village names, and street names corresponding to the second-level name;
S6. Obtain, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
Further, S2 is specifically:
When no province name matching the address to be corrected exists in the first dictionary tree, obtain the city name matching the address to be corrected, to obtain a current city name; obtain the province name corresponding to the current city name, to obtain the first-level name.
Further, the following also applies:
A node in the first dictionary tree represents a province name or a city name;
A node in the second dictionary tree represents a city name, a county name, or a district name;
A node in the third dictionary tree represents one character of a township name, village name, or street name.
Further, S5 is specifically:
Obtain the dictionary tree corresponding to the second-level name, to obtain the third dictionary tree;
Obtain, from the address to be corrected, the characters that follow the second-level name and correspond to a preset order, to obtain current characters;
Prune the third dictionary tree to be constructed according to the branches of the third dictionary tree that match the current characters; the root node of the third dictionary tree is the second-level name.
Further, the following also applies:
The characters corresponding to the preset order are the first character after the second-level name and the fourth character after the second-level name.
Further, after S6, the steps also include:
S71. Obtain a candidate address from the candidate address set, to obtain a current candidate address;
S72. Count the number of positions at which the current candidate address and the address to be corrected have the same character, to obtain a match count;
S73. Repeat S71 to S72 until the candidate address set has been traversed;
S74. Obtain the candidate address with the largest match count in the candidate address set, to obtain a best address;
S75. Update the address to be corrected according to the best address, to obtain a correct address.
Further, S75 is specifically:
If the best address contains two or more consecutive characters that do not match the address to be corrected, then:
Obtain, from the best address, the character string located before the two or more consecutive characters that do not match the address to be corrected;
Update the address to be corrected according to the character string, to obtain the correct address;
Otherwise, set the best address as the correct address.
Further, S1 is specifically:
Recognize the address information on an identity card by optical character recognition, to obtain the address to be corrected.
Embodiment 1 of the present invention is as follows:
This embodiment provides an address error correction method, comprising:
S1. Obtain an address to be corrected.
Optionally, recognize the address information on an identity card by optical character recognition, to obtain the address to be corrected.
For example, the address to be corrected is "福建省福川市鼓楼区洪山滇洪山侨", a misrecognized form of "福建省福州市鼓楼区洪山镇洪山桥" (Hongshan Bridge, Hongshan Town, Gulou District, Fuzhou City, Fujian Province).
S2. Identify, according to a first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name; the first dictionary tree is used to store province names and city names. Specifically:
When no province name matching the address to be corrected exists in the first dictionary tree, obtain the city name matching the address to be corrected, to obtain a current city name; obtain the province name corresponding to the current city name, to obtain the first-level name.
As shown in FIG. 4, a node in the first dictionary tree represents a province name or a city name; province names are located in the first layer, and the city names corresponding to each province name are located in the second layer.
For example, the province to which the address to be corrected belongs is Fujian Province (福建省), so the first-level name is "福建省".
S3. Obtain a second dictionary tree corresponding to the first-level name; the second dictionary tree is used to store the city names, county names, and district names corresponding to the current province name.
A node in the second dictionary tree represents a city name, a county name, or a district name. The root node of the second dictionary tree is the first-level name.
For example, FIG. 5 shows the second dictionary tree corresponding to Fujian Province.
S4. Identify, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name.
For example, the district to which the address to be corrected belongs is Gulou District (鼓楼区), so the second-level name is "鼓楼区".
S5. Obtain a third dictionary tree corresponding to the second-level name; the third dictionary tree is used to store the township names, village names, and street names corresponding to the second-level name. Specifically:
Obtain the dictionary tree corresponding to the second-level name, to obtain the third dictionary tree;
Obtain, from the address to be corrected, the characters that follow the second-level name and correspond to a preset order, to obtain current characters;
Prune the third dictionary tree to be constructed according to the branches of the third dictionary tree that match the current characters; the root node of the third dictionary tree is the second-level name.
A node in the third dictionary tree represents one character of a township name, village name, or street name.
The characters corresponding to the preset order are the first character after the second-level name and the fourth character after the second-level name.
For example, FIG. 6 shows the third dictionary tree corresponding to Gulou District. The third dictionary tree is stored with one node per character. Suppose the address entered for a query is "福建省福州市鼓楼区洪山镇洪山桥"; the part of the address after the district is "洪山镇洪山桥". Based on its first character, "洪", only the branches of the dictionary tree whose first node is "洪" need to be restored; other branches, such as "五凤街", need not be restored, which reduces memory usage.
The first and fourth characters are generally the first characters of the township name and the village name. This rule targets the common case in order to reduce the number of dictionary tree branches that need to be restored; when an address does not fit this pattern, no reduction is possible and the original third dictionary tree must be restored in full. Once the pruned third dictionary tree has been built, each character after the county or district name in the address to be corrected is looked up in turn. If a character cannot be found, the child nodes of all nodes of the current branch in the third dictionary tree are taken as candidate nodes, and the next character is looked up among those candidate nodes. For example, if the address to be corrected is "福建省福州市鼓楼区五x北路", then in FIG. 6 the character "五" can be found after "鼓楼区", but "x" cannot be found after "五"; the child nodes of the nodes "四", "一", and "凤" are therefore taken as candidate nodes, and looking up the next character among them finds "北".
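The lookup that skips an unreadable character, as in the "五x北路" example, can be sketched as follows. The tree holds one character per node; the stored tails are sample data modelled on the nodes named in the example, and the names `Node`, `insert`, and `lookup` are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node


def insert(root, text):
    node = root
    for ch in text:
        node = node.children.setdefault(ch, Node())


def lookup(root, text):
    """Return every stored path that fits text, skipping characters not found."""
    frontier = [(root, "")]  # (node reached so far, characters on the path to it)
    for ch in text:
        exact = [(n.children[ch], path + ch)
                 for n, path in frontier if ch in n.children]
        if exact:
            frontier = exact
        else:
            # Character not found: every child of the current nodes becomes a
            # candidate, and the next character is looked up below them.
            frontier = [(child, path + key)
                        for n, path in frontier
                        for key, child in n.children.items()]
        if not frontier:
            break
    return [path for _, path in frontier]


root = Node()  # stands for the district node, e.g. "鼓楼区"
for tail in ["五四北路", "五一北路", "五凤街"]:
    insert(root, tail)

matches = lookup(root, "五x北路")
print(matches)
```

After "x" fails under "五", the nodes "四", "一", and "凤" all become candidates; only the first two have a child "北", so the "五凤街" branch drops out.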
S6. Obtain, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
As the province, city, county/district, and township/village/street levels are matched step by step, the dictionary tree corresponding to each level is obtained dynamically, and the complete dictionary tree corresponding to the address to be corrected is constructed, as shown in FIG. 7.
When the query ends, a lowest-level node is obtained. Following this node's pointer leads to its unique parent node in the layer above, and that parent node in turn leads to its own unique parent; this process is called backtracking. Backtracking from the lowest-level node reaches the first node, and concatenating the nodes from the first node down to the lowest-level node yields a character string. All addresses prefixed by this string are returned as candidate addresses. For example, in FIG. 7 the lowest-level node obtained is "桥"; its node pointer gives the unique parent node "山" in the layer above, and repeating this process backtracks to the final parent node, that is, the first node after the district name, "洪". Concatenating the nodes from the first to the last yields "洪山镇洪山桥". The string from the root node to the lowest-level node is "福建省福州市鼓楼区洪山镇洪山桥".
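Backtracking through parent pointers and then expanding the recovered prefix into candidate addresses can be sketched like this. The node layout, the helper names, and the second stored tail ("洪山镇洪山路") are assumptions for illustration.

```python
class Node:
    def __init__(self, char, parent=None):
        self.char = char
        self.parent = parent  # upward pointer, used for backtracking
        self.children = {}    # downward pointers, used for lookup
        self.is_end = False   # True if a stored address tail ends here


def insert(root, text):
    node = root
    for ch in text:
        if ch not in node.children:
            node.children[ch] = Node(ch, parent=node)
        node = node.children[ch]
    node.is_end = True


def descend(root, text):
    """Follow text downward while it matches; return the deepest node reached."""
    node = root
    for ch in text:
        if ch not in node.children:
            break
        node = node.children[ch]
    return node


def backtrack(node):
    """Climb parent pointers up to (but excluding) the root and rebuild the string."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


def completions(node, prefix):
    """List every stored tail that starts with prefix, where node ends prefix."""
    found = [prefix] if node.is_end else []
    for ch, child in node.children.items():
        found.extend(completions(child, prefix + ch))
    return found


root = Node("鼓楼区")  # the root of the third dictionary tree is the second-level name
for tail in ["洪山镇洪山桥", "洪山镇洪山路", "五凤街"]:
    insert(root, tail)

deepest = descend(root, "洪山镇洪山侨")  # the last character is misrecognized
prefix = backtrack(deepest)
candidates = [root.char + tail for tail in completions(deepest, prefix)]
print(prefix, candidates)
```

The upward pointer is what lets the query recover the prefix from the deepest node alone, without recording the path on the way down.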
S7. Select a best address from the candidate address set. Specifically:
S71. Obtain a candidate address from the candidate address set, to obtain a current candidate address.
S72. Count the number of positions at which the current candidate address and the address to be corrected have the same character, to obtain a match count.
S73. Repeat S71 to S72 until the candidate address set has been traversed.
S74. Obtain the candidate address with the largest match count in the candidate address set, to obtain a best address.
S75. Update the address to be corrected according to the best address, to obtain a correct address. Specifically:
If the best address contains two or more consecutive characters that do not match the address to be corrected, then:
Obtain, from the best address, the character string located before the two or more consecutive characters that do not match the address to be corrected; update the address to be corrected according to the character string, to obtain the correct address;
Otherwise, set the best address as the correct address.
Here, the address produced by OCR (optical character recognition) is called the address to be corrected, and it may contain errors. Querying the dictionary tree with the address to be corrected yields candidate addresses. The candidate address most similar to the address to be corrected is selected as the best address, where similarity is measured by the number of positions at which the two addresses have the same Chinese character. The best address is then compared with the address to be corrected: where fewer than two consecutive characters differ, that part of the best address is taken as the correct address; where two or more consecutive characters differ, the address to be corrected from that point onward is taken as the correct address. The best address and the address to be corrected are combined according to this error correction principle to form the final correct address.
For example, the address to be corrected is "福建省福川市鼓楼区洪山滇洪山侨" and the candidate address is "福建省福州市鼓楼区洪山镇洪山桥". At the city level, only one consecutive character differs, so "福州市" is taken as the correct address and replaces "福川市" in the address to be corrected. Likewise, below the district/county level, "洪山镇洪山桥" never has more than one consecutive differing character (two characters differ in total, but they are not adjacent), so the final corrected address is "福建省福州市鼓楼区洪山镇洪山桥". Because the query proceeds level by level (province, city, county, and below the county), error correction likewise compares the recognition result with the query result level by level and decides, according to the above principle, whether to apply a correction.
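The error correction principle can be sketched as a single pass over the two strings. This simplified version compares the whole address at once rather than level by level, which is an assumption of the sketch; the name `merge` and the second test address are hypothetical.

```python
def merge(best, address):
    """Combine the best address with the address to be corrected.

    Isolated single-character differences are taken from `best`. At the first
    run of two or more consecutive differences, the rest of the result is taken
    from `address` instead, because `best` is no longer trusted from there on.
    """
    run = 0
    for i, (b, a) in enumerate(zip(best, address)):
        run = run + 1 if b != a else 0
        if run == 2:
            start = i - 1  # index where the run of differences began
            return best[:start] + address[start:]
    return best


best = "福建省福州市鼓楼区洪山镇洪山桥"
corrected = merge(best, "福建省福川市鼓楼区洪山滇洪山侨")
print(corrected)

# A case with consecutive differences after the district name.
other = "福建省福川市鼓楼区东街口八一七路"
print(merge(best, other))
```

In the first call every difference is isolated, so the best address is returned unchanged. In the second call, the isolated error in the city name is corrected, while the part after "鼓楼区" is kept as recognized.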
As can be seen from the above description, the present invention stores nationwide addresses hierarchically by province/city, district/county, and township/village/street. Province names are saved as a first dictionary tree, which is used to find the province to which the address to be corrected belongs. The province, city and county addresses of each province are saved as a second dictionary tree, which is used to find the district or county name. Finally, township/village/street-level addresses are built into a dictionary tree one character per node, which reduces redundancy, and then stored; when a query is needed, the dictionary tree is restored and pruned according to the address to be corrected, which reduces the number of nodes. When querying level by level, as long as the address to be corrected does not contain many wrong characters, the correct name can be obtained by similarity at the province, city and district/county levels, and the candidate address closest to the correct address can be obtained by node backtracking at the village level. Finally, the address to be corrected is compared with the candidate address according to the error correction principle to obtain the corrected address. For example, suppose the address to be corrected is "福建省福川市鼓楼区洪山滇洪山侨". At the province, city and district/county levels, similarity shows that the only next-level address under "福建省" (Fujian Province) of the form "福x市" is "福州市" (Fuzhou City). At the village level, backtracking from the lowest node that can be reached, "山", yields "洪山镇洪山", so the candidate addresses are those prefixed with "福建省福州市洪山镇洪山", such as "福建省福州市洪山镇洪山桥". Under the error correction principle, no more than one consecutive character differs, so the address to be corrected is corrected to "福建省福州市洪山镇洪山桥".
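The similarity lookup at the city level can be sketched as follows. The source does not say how similarity is computed, so the positional-overlap measure, the 0.6 threshold, and the names `positional_similarity` and `closest_name` are assumptions made only for illustration.

```python
from typing import Optional


def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions at which a and b hold the same character (assumed measure)."""
    if not a or not b:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))


def closest_name(fragment: str, names: list[str], threshold: float = 0.6) -> Optional[str]:
    """Return the stored name most similar to the start of fragment, or None."""
    best_name, best_score = None, 0.0
    for name in names:
        # Compare the stored name with the equally long leading part of the fragment.
        score = positional_similarity(fragment[:len(name)], name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Children of the node "福建省" (only a few cities listed for the example).
cities_in_fujian = ["福州市", "厦门市", "泉州市", "漳州市"]
city = closest_name("福川市鼓楼区洪山滇洪山侨", cities_in_fujian)
```

With the misrecognized "福川市", only "福州市" shares two of three positions, so it is the name selected; a fragment that resembles no stored city falls below the threshold and yields no match.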
Compared with a word-based statistical language model, the present invention does not need to train a parametric model, repeatedly compute word occurrence probabilities, or run a search algorithm to find an optimal path; once the dictionary trees have been built, a query is all that is required, which is faster. Different cities may contain counties, townships or villages with the same name. A first-order Markov statistical model may be unable to tell them apart, and raising the order of the model increases the amount of computation. In the present invention, the hierarchical query follows different branches of the constructed dictionary trees; address names below the county level are stored one character per node, and the candidate addresses are obtained by backtracking from the lowest matched node.
Compared with building a single dictionary tree for all nationwide addresses, the present invention uses the information in the address to be corrected to build only the dictionary tree of the county concerned (for the province, city and county being queried), and then prunes that county tree according to the address to be corrected, which greatly reduces the space and query time required. Suppose nationwide address data saved as text occupies about 60 MB and the address data of one province averages about 2 MB. Building the address dictionary tree of a whole province at query time occupies at least ten-odd MB of memory, and one query takes close to 5 s. When the data is partitioned by county name instead, a village-level query only needs to restore the addresses under that county; the dictionary tree restored after pruning generally holds only a few KB of data, and one query takes about 0.05 s. Building nationwide addresses into a dictionary tree and saving the tree nodes layer by layer as text takes about 10 MB, which shows that the dictionary tree structure effectively removes the redundancy in village-level addresses. The nodes in the county dictionary tree use bidirectional pointers: the last node reached by a query can be traced back to the first node, the nodes along that path are joined to give an address prefix, and the candidate corrected addresses are obtained from that prefix. An ordinary dictionary tree is used for searching and has one-way pointers, so nodes can only be visited from top to bottom, whereas the dictionary tree of the present invention has bidirectional pointers and can backtrack from a lower node to the first node to derive the candidate addresses.
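A minimal sketch of a character-level dictionary tree with bidirectional pointers follows. The source states only that the last node reached is traced back to form a prefix; the rule for stepping over a misrecognized character (`skips`), the class and function names, and the place name "洪山镇洪湾村" are assumptions added for illustration.

```python
class TrieNode:
    def __init__(self, char, parent=None):
        self.char = char
        self.parent = parent   # upward pointer, used for backtracking
        self.children = {}     # downward pointers, keyed by character
        self.is_end = False    # True if a stored name ends at this node


def build_trie(root_name, names):
    """Build a county tree whose root is the county name, one character per node."""
    root = TrieNode(root_name)
    for name in names:
        node = root
        for ch in name:
            if ch not in node.children:
                node.children[ch] = TrieNode(ch, node)
            node = node.children[ch]
        node.is_end = True
    return root


def deepest_match(node, text, skips=1, depth=0):
    """Return (depth, node) for the deepest node reachable along text.

    Assumption: a character absent from the tree may be stepped over
    at most `skips` times by trying every child in its place.
    """
    best = (depth, node)
    if not text:
        return best
    child = node.children.get(text[0])
    if child is not None:
        found = deepest_match(child, text[1:], skips, depth + 1)
        best = max(best, found, key=lambda r: r[0])
    elif skips > 0:
        for other in node.children.values():
            found = deepest_match(other, text[1:], skips - 1, depth + 1)
            best = max(best, found, key=lambda r: r[0])
    return best


def prefix_of(node):
    """Follow parent pointers up to the root and join the characters passed."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


def completions(node):
    """All stored names that pass through node (the candidate addresses)."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.is_end:
            found.append(prefix_of(current))
        stack.extend(current.children.values())
    return found


root = build_trie("鼓楼区", ["洪山镇洪山桥", "洪山镇洪湾村", "安泰街道"])
_, last = deepest_match(root, "洪山滇洪山侨")
prefix = prefix_of(last)
candidates = completions(last)
```

For the misrecognized fragment "洪山滇洪山侨", the walk stops at the second "山"; backtracking through the parent pointers gives the prefix "洪山镇洪山", and the only stored name under that node is "洪山镇洪山桥". A tree with one-way pointers would need to carry the path along during the descent instead.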
The query times were measured in the Debug mode of Visual Studio on the same laptop.
Scheme               Raw text data   Access database   SQLite database   Dictionary tree structure
Data storage space   60 MB           100 MB            50 MB             10 MB
Time per query       --              0.5 s to 2 s      0.05 s            0.05 s
Embodiment 2 of the present invention is as follows:
This embodiment provides an address error correction terminal comprising one or more processors 1 and a memory 2. The memory 2 stores a program that is configured to be executed by the one or more processors 1 to perform the following steps:
S1. Obtain the address to be corrected.
Optionally, the address information on an identity card is recognized by optical character recognition to obtain the address to be corrected.
S2. Identify, according to the first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name; the first dictionary tree is used to store province names and city names. Specifically:
When the first dictionary tree contains no province name matching the address to be corrected, obtain the city name matching the address to be corrected as the current city name, then obtain the province name corresponding to the current city name as the first-level name.
Here, each node in the first dictionary tree represents one province name or one city name; province names are on the first layer, and the city names corresponding to a province name are on the second layer.
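The fallback in S2 can be sketched as below. This is an illustration under simplifying assumptions: the two-layer tree is flattened into a dictionary named `FIRST_TREE`, only a few provinces and cities are listed, and "matching" is reduced to exact substring comparison, which the source does not prescribe.

```python
from typing import Optional

# Layer 1: province names; layer 2: the city names under each province.
FIRST_TREE = {
    "福建省": ["福州市", "厦门市"],
    "广东省": ["广州市", "深圳市"],
}


def first_level_name(address: str) -> Optional[str]:
    """Return the province for an address, falling back to its city name."""
    # First try the province layer directly.
    for province in FIRST_TREE:
        if address.startswith(province):
            return province
    # No province matched: look for a known city and return its parent province.
    for province, cities in FIRST_TREE.items():
        for city in cities:
            if city in address:
                return province
    return None
```

For instance, an address whose province was misrecognized as "福逮省" but whose city still reads "福州市" resolves to "福建省" through the second layer.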
S3. Obtain the second dictionary tree corresponding to the first-level name; the second dictionary tree is used to store the city names, county names and district names corresponding to the current province name.
Here, each node in the second dictionary tree represents one city name, one county name or one district name. The root node of the second dictionary tree is the first-level name.
S4. Identify, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name.
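As a rough illustration of S3 and S4, the sketch below stores one whole name per node, with the province as the root. The class `NameNode`, the function `second_level_name`, and the use of exact prefix comparison are assumptions; elsewhere the description obtains names at this level by similarity, which this sketch leaves out for brevity.

```python
from typing import Optional


class NameNode:
    """A node holding a whole province, city, county or district name."""

    def __init__(self, name: str):
        self.name = name
        self.children = []

    def add(self, name: str) -> "NameNode":
        child = NameNode(name)
        self.children.append(child)
        return child


def second_level_name(root: NameNode, address: str) -> Optional[str]:
    """Walk province -> city -> county/district and return the last name."""
    rest = address[len(root.name):] if address.startswith(root.name) else address
    for city in root.children:
        if rest.startswith(city.name):
            after_city = rest[len(city.name):]
            for county in city.children:
                if after_city.startswith(county.name):
                    return county.name
    return None


# Second dictionary tree for the first-level name "福建省" (partial).
fujian = NameNode("福建省")
fuzhou = fujian.add("福州市")
fuzhou.add("鼓楼区")
fuzhou.add("闽侯县")
fujian.add("厦门市").add("思明区")

county = second_level_name(fujian, "福建省福州市鼓楼区洪山镇洪山桥")
```

Because each of these names is stored whole, the tree for one province stays shallow: three layers, with the first-level name at the root.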
S5. Obtain the third dictionary tree corresponding to the second-level name; the third dictionary tree is used to store the township names, village names and street names corresponding to the second-level name. Specifically:
Obtain the dictionary tree corresponding to the second-level name, to obtain the third dictionary tree;
Obtain, from the address to be corrected, the characters that follow the second-level name at the preset positions, to obtain the current characters;
Prune the third dictionary tree to be constructed according to the branches of the third dictionary tree that match the current characters; the root node of the third dictionary tree is the second-level name.
Here, each node in the third dictionary tree represents one character of a township name, village name or street name.
Here, the characters at the preset positions are the first character after the second-level name and the fourth character after the second-level name.
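The pruning in S5 can be sketched as a filter applied to the stored names before the tree is rebuilt. The source does not state how the two characters are combined, so keeping a name when either position matches is an assumption, as are the function names and every place name other than "洪山镇洪山桥".

```python
def current_chars(address: str, second_level: str):
    """Return the first and fourth characters after the second-level name."""
    start = address.find(second_level)
    rest = address[start + len(second_level):] if start >= 0 else ""
    first = rest[0] if len(rest) >= 1 else None
    fourth = rest[3] if len(rest) >= 4 else None
    return first, fourth


def prune(names, first, fourth):
    """Keep only names whose first or fourth character matches (assumed rule)."""
    kept = []
    for name in names:
        first_ok = first is not None and name[:1] == first
        fourth_ok = fourth is not None and name[3:4] == fourth
        if first_ok or fourth_ok:
            kept.append(name)
    return kept


stored = ["洪山镇洪山桥", "洪山镇洪湾村", "安泰街道安泰社区", "鼓东街道鼓东社区"]
first, fourth = current_chars("福建省福州市鼓楼区洪山滇洪山侨", "鼓楼区")
kept = prune(stored, first, fourth)
```

Only the retained names would then be inserted into the third dictionary tree, so the tree built for one query is much smaller than the full county tree.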
S6. Obtain, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
S7. Select a best address from the candidate address set. Specifically:
S71. Take one candidate address from the candidate address set as the current candidate address.
S72. Count the number of positions at which the current candidate address and the address to be corrected have the same character, to obtain a match count.
S73. Repeat S71 to S72 until the candidate address set has been traversed.
S74. Take the candidate address with the largest match count in the candidate address set as the best address.
S75. Update the address to be corrected according to the best address, to obtain the correct address. Specifically:
If the best address contains two or more consecutive characters that do not match the address to be corrected, then:
Obtain, from the best address, the character string located before those two or more consecutive mismatched characters, and update the address to be corrected according to that character string to obtain the correct address;
Otherwise, set the best address as the correct address.
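Steps S71 to S75 can be sketched as follows. The source does not spell out how the address is "updated according to the character string"; replacing the leading part of the address with the trusted prefix of the best address is an assumption, and the function names and the candidate "洪山镇洪湾村" are illustrative only.

```python
def match_count(candidate: str, address: str) -> int:
    """S72: number of positions holding the same character in both strings."""
    return sum(1 for c, a in zip(candidate, address) if c == a)


def best_address(candidates: list[str], address: str) -> str:
    """S71 to S74: scan every candidate and keep the one with the most matches.

    Expects at least one candidate, as S6 produces one or more.
    """
    return max(candidates, key=lambda c: match_count(c, address))


def first_long_mismatch_run(best: str, address: str, min_run: int = 2):
    """Start index of the first run of min_run consecutive mismatches, else None."""
    run_start, run_len = None, 0
    for i, ch in enumerate(best):
        if i >= len(address) or ch != address[i]:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_len = 0
    return None


def apply_correction(best: str, address: str) -> str:
    """S75: accept the best address, or only its prefix before a long mismatch run."""
    start = first_long_mismatch_run(best, address)
    if start is None:
        return best
    # Assumed update rule: trust the prefix, keep the rest of the input unchanged.
    return best[:start] + address[start:]


address = "福建省福川市鼓楼区洪山滇洪山侨"
candidates = ["福建省福州市鼓楼区洪山镇洪湾村", "福建省福州市鼓楼区洪山镇洪山桥"]
best = best_address(candidates, address)
corrected = apply_correction(best, address)
```

Here the second candidate matches 12 of 15 positions against 11 for the first, and its three mismatches are isolated, so it is accepted whole. If the input had ended in "红杉桥" instead, the two adjacent mismatches would limit the correction to the prefix "福建省福州市鼓楼区洪山镇".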
In summary, the address error correction method and terminal provided by the present invention store nationwide addresses hierarchically by province, city/county/district, and township/village/street; check in turn the province information, the city/county/district information and the township/village/street information in the address to be corrected; and, according to each check result, dynamically load the dictionary tree corresponding to the next address level. This greatly reduces memory usage during address error correction while maintaining high accuracy. Further, when the province name in the address to be corrected is badly corrupted, the province corresponding to the address can be determined from the city name, which helps improve the accuracy of error correction. Further, province, city and county names are rarely duplicated, so each whole name can be stored as a single node, whereas names below the county level (townships, villages or streets) repeat much more often, so sharing common prefixes effectively reduces redundancy and the space required. Further, by designating characters at specific positions and keeping only the branches that match those characters as candidate addresses, the size of the third dictionary tree is reduced, that is, less space is needed when checking township/village/street addresses. Further, the first character after the second-level name is generally the first character of the town name, and the fourth character after the second-level name is generally the first character of the village name, so these two characters can usually filter the towns and villages under the county and effectively reduce the number of dictionary tree nodes that must be generated. Further, the accuracy of selecting, from one or more candidate addresses, the address most similar to the address to be corrected as the correct address is improved.

Claims (16)

  1. An address error correction method, characterized by comprising:
    S1. obtaining an address to be corrected;
    S2. identifying, according to a first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name, the first dictionary tree being used to store province names and city names;
    S3. obtaining a second dictionary tree corresponding to the first-level name, the second dictionary tree being used to store the city names, county names and district names corresponding to the current province name;
    S4. identifying, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name;
    S5. obtaining a third dictionary tree corresponding to the second-level name, the third dictionary tree being used to store the township names, village names and street names corresponding to the second-level name;
    S6. obtaining, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
  2. The address error correction method according to claim 1, characterized in that S2 is specifically:
    when the first dictionary tree contains no province name matching the address to be corrected, obtaining the city name matching the address to be corrected as the current city name, and obtaining the province name corresponding to the current city name as the first-level name.
  3. The address error correction method according to claim 1, characterized by further comprising:
    a node in the first dictionary tree represents one province name or one city name;
    a node in the second dictionary tree represents one city name, one county name or one district name;
    a node in the third dictionary tree represents one character of a township name, village name or street name.
  4. The address error correction method according to claim 1, characterized in that S5 is specifically:
    obtaining the dictionary tree corresponding to the second-level name, to obtain the third dictionary tree;
    obtaining, from the address to be corrected, the characters that follow the second-level name at the preset positions, to obtain the current characters;
    pruning the third dictionary tree to be constructed according to the branches of the third dictionary tree that match the current characters, the root node of the third dictionary tree being the second-level name.
  5. The address error correction method according to claim 4, characterized by further comprising:
    the characters at the preset positions are the first character after the second-level name and the fourth character after the second-level name.
  6. The address error correction method according to claim 1, characterized in that, after S6, the method further comprises:
    S71. taking one candidate address from the candidate address set as the current candidate address;
    S72. counting the number of positions at which the current candidate address and the address to be corrected have the same character, to obtain a match count;
    S73. repeating S71 to S72 until the candidate address set has been traversed;
    S74. taking the candidate address with the largest match count in the candidate address set as the best address;
    S75. updating the address to be corrected according to the best address, to obtain the correct address.
  7. The address error correction method according to claim 6, characterized in that S75 is specifically:
    if the best address contains two or more consecutive characters that do not match the address to be corrected, then:
    obtaining, from the best address, the character string located before the two or more consecutive characters that do not match the address to be corrected;
    updating the address to be corrected according to the character string, to obtain the correct address;
    otherwise, setting the best address as the correct address.
  8. The address error correction method according to claim 1, characterized in that S1 is specifically:
    recognizing the address information on an identity card by optical character recognition, to obtain the address to be corrected.
  9. An address error correction terminal, characterized by comprising one or more processors and a memory, the memory storing a program that is configured to be executed by the one or more processors to perform the following steps:
    S1. obtaining an address to be corrected;
    S2. identifying, according to a first dictionary tree, the province name corresponding to the address to be corrected, to obtain a first-level name, the first dictionary tree being used to store province names and city names;
    S3. obtaining a second dictionary tree corresponding to the first-level name, the second dictionary tree being used to store the city names, county names and district names corresponding to the current province name;
    S4. identifying, according to the second dictionary tree, the county name or district name corresponding to the address to be corrected, to obtain a second-level name;
    S5. obtaining a third dictionary tree corresponding to the second-level name, the third dictionary tree being used to store the township names, village names and street names corresponding to the second-level name;
    S6. obtaining, according to the third dictionary tree, one or more candidate addresses corresponding to the address to be corrected, to obtain a candidate address set.
  10. The address error correction terminal according to claim 9, characterized in that S2 is specifically:
    when the first dictionary tree contains no province name matching the address to be corrected, obtaining the city name matching the address to be corrected as the current city name, and obtaining the province name corresponding to the current city name as the first-level name.
  11. The address error correction terminal according to claim 9, characterized by further comprising:
    a node in the first dictionary tree represents one province name or one city name;
    a node in the second dictionary tree represents one city name, one county name or one district name;
    a node in the third dictionary tree represents one character of a township name, village name or street name.
  12. The address error correction terminal according to claim 9, characterized in that S5 is specifically:
    obtaining the dictionary tree corresponding to the second-level name, to obtain the third dictionary tree;
    obtaining, from the address to be corrected, the characters that follow the second-level name at the preset positions, to obtain the current characters;
    pruning the third dictionary tree to be constructed according to the branches of the third dictionary tree that match the current characters, the root node of the third dictionary tree being the second-level name.
  13. The address error correction terminal according to claim 12, characterized by further comprising:
    the characters at the preset positions are the first character after the second-level name and the fourth character after the second-level name.
  14. The address error correction terminal according to claim 9, characterized in that, after S6, the steps further comprise:
    S71. taking one candidate address from the candidate address set as the current candidate address;
    S72. counting the number of positions at which the current candidate address and the address to be corrected have the same character, to obtain a match count;
    S73. repeating S71 to S72 until the candidate address set has been traversed;
    S74. taking the candidate address with the largest match count in the candidate address set as the best address;
    S75. updating the address to be corrected according to the best address, to obtain the correct address.
  15. The address error correction terminal according to claim 14, characterized in that S75 is specifically:
    if the best address contains two or more consecutive characters that do not match the address to be corrected, then:
    obtaining, from the best address, the character string located before the two or more consecutive characters that do not match the address to be corrected;
    updating the address to be corrected according to the character string, to obtain the correct address;
    otherwise, setting the best address as the correct address.
  16. The address error correction terminal according to claim 9, characterized in that S1 is specifically:
    recognizing the address information on an identity card by optical character recognition, to obtain the address to be corrected.
PCT/CN2018/077926 2018-03-02 2018-03-02 Address error correction method and terminal WO2019165644A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880000142.4A CN108369582B (en) 2018-03-02 2018-03-02 Address error correction method and terminal
PCT/CN2018/077926 WO2019165644A1 (en) 2018-03-02 2018-03-02 Address error correction method and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/077926 WO2019165644A1 (en) 2018-03-02 2018-03-02 Address error correction method and terminal

Publications (1)

Publication Number Publication Date
WO2019165644A1 true WO2019165644A1 (en) 2019-09-06

Family

ID=63012592

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077926 WO2019165644A1 (en) 2018-03-02 2018-03-02 Address error correction method and terminal

Country Status (2)

Country Link
CN (1) CN108369582B (en)
WO (1) WO2019165644A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108369582B (en) * 2018-03-02 2021-06-25 福建联迪商用设备有限公司 Address error correction method and terminal
CN109254964A (en) * 2018-08-20 2019-01-22 中国平安人寿保险股份有限公司 Address Standardization method, apparatus, computer equipment and storage medium
CN109784308B (en) * 2019-02-01 2020-09-29 腾讯科技(深圳)有限公司 Address error correction method, device and storage medium
CN110020640B (en) * 2019-04-19 2021-08-24 厦门商集网络科技有限责任公司 Method and terminal for correcting identity card information
CN110737592B (en) * 2019-09-16 2024-01-30 平安科技(深圳)有限公司 Link abnormality identification method, server and computer readable storage medium
CN110851559B (en) * 2019-10-14 2020-10-09 中科曙光南京研究院有限公司 Automatic data element identification method and identification system
CN111008625B (en) * 2019-12-06 2023-07-18 建信金融科技有限责任公司 Address correction method, device, equipment and storage medium
CN112256821B (en) * 2020-09-23 2024-05-17 北京捷通华声科技股份有限公司 Chinese address completion method, device, equipment and storage medium
CN112364113A (en) * 2020-11-13 2021-02-12 北京明略软件系统有限公司 Address error correction method and system
CN114661688B (en) * 2022-03-25 2023-09-19 马上消费金融股份有限公司 Address error correction method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270435A1 (en) * 2004-03-16 2008-10-30 Turbo Data Laboratories Inc. Method for Handling Tree-Type Data Structure, Information Processing Device, and Program
CN101719128A (en) * 2009-12-31 2010-06-02 浙江工业大学 Fuzzy matching-based Chinese geo-code determination method
CN104598887A (en) * 2015-01-29 2015-05-06 华东师范大学 Recognition method for written Chinese address of non-specification format
CN107679187A (en) * 2017-09-30 2018-02-09 浪潮软件股份有限公司 A kind of construction method and device of Chinese address tree
CN108369582A (en) * 2018-03-02 2018-08-03 福建联迪商用设备有限公司 A kind of address error correction method and terminal

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101984432A (en) * 2010-11-10 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for constructing address database
CN105740257A (en) * 2014-12-09 2016-07-06 朗新科技股份有限公司 Method and system for establishing standard geographic name address base

Also Published As

Publication number Publication date
CN108369582B (en) 2021-06-25
CN108369582A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
WO2019165644A1 (en) Address error correction method and terminal
Agichtein et al. Mining reference tables for automatic text segmentation
CN104636466B (en) Entity attribute extraction method and system for open webpage
CN109033086A (en) A kind of address resolution, matched method and device
CN112612863B (en) Address matching method and system based on Chinese word segmentation device
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN101093478A (en) Method and system for identifying Chinese full name based on Chinese shortened form of entity
CN113326267B (en) Address matching method based on inverted index and neural network algorithm
WO2021072874A1 (en) Dual array-based location query method and apparatus, computer device, and storage medium
WO2019227581A1 (en) Interest point recognition method, apparatus, terminal device, and storage medium
CN102955832A (en) Correspondence address identifying and standardizing system
CN112528174A (en) Address finishing and complementing method based on knowledge graph and multiple matching and application
CN109885641B (en) Method and system for searching Chinese full text in database
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN103996021A (en) Fusion method of multiple character identification results
CN115688779B (en) Address recognition method based on self-supervision deep learning
CN112256817A (en) Geocoding method, system, terminal and storage medium
CN114780680A (en) Retrieval and completion method and system based on place name and address database
CN113505190B (en) Address information correction method, device, computer equipment and storage medium
CN115470307A (en) Address matching method and device
CN112948717B (en) Massive space POI searching method and system based on multi-factor constraint
CN114201480A (en) Multi-source POI fusion method and device based on NLP technology and readable storage medium
CN111767476B (en) Method for constructing space-time big data spatialization engine of smart city based on HMM model
WO2021142968A1 (en) Multilingual-oriented semantic similarity calculation method for general place names, and application thereof
Schraagen Aspects of record linkage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18907952

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18907952

Country of ref document: EP

Kind code of ref document: A1