WO2015027835A1 - System and terminal for querying mailing address postal codes - Google Patents

System and terminal for querying mailing address postal codes

Info

Publication number
WO2015027835A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
level
user
query
communication address
Prior art date
Application number
PCT/CN2014/084607
Other languages
French (fr)
Chinese (zh)
Inventor
王国印
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2015027835A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29Geographical information databases

Definitions

  • The present invention relates to the field of postal code lookup, and in particular to a system and terminal for querying a postal code by communication address.
  • The e-commerce and logistics industries both depend on the communication address (also called the mailing address, or simply the address) and the postal code, and both must be supplied by the user.
  • Current e-commerce websites and logistics companies mainly do the following: have the user manually enter the complete address and its postal code; offer drop-down lists of provinces, the prefecture-level cities under each province, and the districts and counties under each city, so that the user selects these relatively fixed parts and types the rest of the address and the postal code by hand; and save entered addresses and postal codes so they can be reused next time.
  • Some other websites solve the first problem (users often do not know the postal code for their address), but they usually build the system on a database.
  • For addresses below the district or county level they typically search with fuzzy string matching (LIKE '%XXX%'), which is very inefficient on large data volumes.
  • Database-based queries also heavily restrict the format and content of the user's input.
  • For example, the user first selects a provincial-level administrative region (province, special administrative region, autonomous region, or municipality), then a prefecture-level administrative region (prefecture-level city, autonomous prefecture, prefecture, or league), then a county-level administrative region (municipal district, county, banner, special district, forestry district, autonomous county, autonomous banner, and so on), and finally types in the township, village, road, and other details.
  • The input process is very mechanical.
  • The database-based query mode also requires every address to fit four levels: provincial, prefecture-level city, district or county, and then the remaining specific address.
  • Not all addresses fit this pattern.
  • For example, there is no prefecture-level city under a municipality, or between a province and a county or county-level city administered directly by the province.
  • Some special prefecture-level cities have no district or county level, such as Zhongshan in Guangdong Province.
  • The present invention addresses one of the above drawbacks. It provides a system and terminal for querying a postal code by communication address. Input prompts make the query format freer; named entity recognition identifies the level of each piece of address metadata the user enters, which enables level-by-level querying of the address; and completing the communication address makes the query result more accurate.
  • The user can also obtain the query result as a two-dimensional code, or follow a link to a map for positioning.
  • An embodiment of the present invention provides a system for querying a postal code by communication address, comprising an address input subsystem and a postal code query subsystem. The address input subsystem prompts the user in real time as text is entered, and the user determines the communication address to be queried from the prompt list. The postal code query subsystem standardizes the communication address to be queried, retrieves the closest standardized communication address, and returns the corresponding postal code.
  • Alternatively, the user may ignore the prompt list, in which case the communication address to be queried is determined from the entered text alone.
  • The real-time prompt changes automatically with each increment of text the user enters. The prompt content is produced as follows: acquire the address text currently entered and preprocess it to delete extra spaces; segment the address to obtain address metadata and label every possible address level; obtain the final place-name entity label sequence through place-name entity recognition and generate a Query statement; and search the address index file to obtain the prompt list.
  • The preprocessing further comprises converting full-width digits and letters to half-width characters. The dictionary is stored in a double-array Trie data structure.
  • The addresses in the prompt list are sorted in descending order of closeness to the standard address.
  • Standardizing the communication address to be queried comprises: acquiring the communication address determined by the user and preprocessing it; segmenting the address to obtain address metadata and labeling every possible address level; obtaining the final place-name entity label sequence and generating a Query statement; parsing the Query statement and searching the index file to find the closest communication address; and performing address completion to generate a standardized communication address and returning its postal code.
  • The postal code is determined by the lowest address level value of the labeled address.
  • After the postal code is returned, the user can select the result to obtain a map location, or send the result to a mobile device as a two-dimensional code.
  • Address segmentation uses a binary model (bigram) segmentation method; named entity recognition identifies the most likely address level of each piece of place-name metadata in the labeling result.
  • Another embodiment of the present invention provides a terminal for querying a postal code by communication address.
  • The terminal includes a user input prompting unit and a postal code query unit. The user input prompting unit prompts the user in real time and receives the communication address to be queried that the user finally determines.
  • The postal code query unit retrieves the standardized communication address closest to the communication address to be queried and receives the corresponding postal code.
  • Input prompts make the query format freer; named entity recognition identifies the level of the address metadata the user enters, enabling level-by-level querying of the address, and address completion makes the result more accurate.
  • The user can also obtain the query result as a two-dimensional code, or follow a link to a map for positioning.
  • FIG. 1 is a schematic flowchart of a system for querying a postal code by communication address according to an embodiment of the present invention.
  • FIG. 2 is a detailed flowchart of the address input subsystem according to an embodiment of the present invention.
  • FIG. 3 is a detailed flowchart of the postal code query subsystem according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an example of address completion in the postal code query subsystem according to an embodiment of the present invention.
  • Detailed description
  • The system and terminal provided by the invention prompt the user during input, making the query format freer; named entity recognition identifies the level of the address metadata entered, enabling level-by-level querying, and the communication address is completed at the same time, making the query result more accurate.
  • The user can also obtain the query result as a two-dimensional code, or follow a link to a map for positioning.
  • Step S110: The address input subsystem prompts the user in real time as text is entered, and the user determines the communication address to be queried from the prompt list.
  • The detailed process of step S110 is shown in FIG. 2. Step S111: Acquire the address text entered by the user and preprocess it; preprocessing mainly consists of converting full-width digits and letters to half-width characters and removing extra spaces.
  • The prompt content changes automatically with each increment of text the user enters. The real-time prompt can also be skipped.
  • In that case the user types the communication address to be queried directly into the address input prompting system. If the real-time prompt is used, the prompt list is sorted in descending order of closeness to the standard address.
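The preprocessing in step S111 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess_address` is hypothetical, and collapsing runs of whitespace to a single space is one assumed reading of "removing extra spaces".

```python
import re


def preprocess_address(text: str) -> str:
    """Convert full-width digits/letters to half-width and drop extra spaces.

    Hypothetical helper; the source only names the two operations.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space becomes an ordinary space.
            out.append(" ")
        elif (0xFF10 <= code <= 0xFF19      # full-width digits 0-9
              or 0xFF21 <= code <= 0xFF3A   # full-width letters A-Z
              or 0xFF41 <= code <= 0xFF5A): # full-width letters a-z
            # Full-width forms sit at a fixed offset from their ASCII forms.
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    # Assumption: "extra" spaces means leading, trailing, and repeated spaces.
    return re.sub(r"\s+", " ", "".join(out)).strip()


cleaned = preprocess_address("  广东省　深圳市  宝安区１２３号Ａ座 ")
```

Only digits and letters are converted, matching the description; full-width punctuation is left untouched.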
  • Step S112: Segment the address text.
  • The full-text index uses a binary model (bigram) segmentation method, so the longest Chinese term in the index has length 2, whereas Chinese place names are generally longer than 2 characters. Each piece of address metadata identified is therefore turned into a PhraseQuery clause.
  • This syntax filters out the two-character term formed by the last character of one piece of address metadata and the first character of the next.
  • For example, the constructed PhraseQuery is: "Guangdong Province" "Shenzhen City", with each piece of place-name metadata enclosed in double quotes. This filters out results matched only by the two-character term that straddles the two names, which greatly improves accuracy.
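To make the cross-boundary problem concrete, the sketch below lists the bigrams of a concatenated address and builds the quoted phrase form described above. The helper names are hypothetical, the Chinese strings are assumed originals of the translated example, and the output is only a query string; the source does not name the search engine that consumes it.

```python
def bigrams(text):
    """Return every overlapping two-character term, as a bigram index stores them."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_phrase_query(place_names):
    """Quote each place name so it is matched as a whole phrase (hypothetical helper)."""
    return " ".join('"' + name.replace('"', "") + '"' for name in place_names)


names = ["广东省", "深圳市"]          # assumed originals of the translated example
query = build_phrase_query(names)

# The bigram that straddles the two names: last character of the first name
# plus first character of the second name.
cross = names[0][-1] + names[1][0]

in_plain_text = cross in bigrams("".join(names))          # indexed if text is unquoted
in_any_phrase = any(cross in bigrams(n) for n in names)   # never inside a single phrase
```

Because the straddling bigram never occurs inside either quoted phrase, a per-name phrase query cannot match on it.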
  • Dictionary-based segmentation usually offers forward (left-to-right) matching and reverse (right-to-left) matching.
  • Reverse matching has half the error rate of forward matching.
  • Cross ambiguity is defined as follows: for three consecutive Chinese characters ABC, both AB and BC can be words; in Chinese, BC is generally the more likely word.
  • Address segmentation uses the reverse maximum matching algorithm over the address metadata dictionary, scanning the user's address text from right to left.
  • The dictionary is stored in a double-array (Double Array) Trie data structure.
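A minimal sketch of reverse maximum matching follows. For brevity it looks words up in a plain Python set rather than the double-array Trie the text describes, and the function name, the sample dictionary, and the single-character fallback for unknown text are all assumptions.

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Segment text by scanning right to left, preferring the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shorter ones.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Assumption: an unknown single character becomes its own token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected right to left
    return tokens


sample_dictionary = {"广东", "深圳", "宝安", "西乡"}  # hypothetical entries
tokens = reverse_max_match("广东深圳宝安西乡", sample_dictionary)
```

A Trie would replace the set lookup with a character-by-character walk, but the matching order stays the same.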
  • Step S113: Label the address.
  • Labeling requires address metadata, which can be obtained from Wikipedia and National Bureau of Statistics data on China's administrative divisions, and extracted from complete communication addresses using address segmentation and recognition.
  • The address metadata covers: provincial-level names (provinces, autonomous regions, municipalities, and special administrative regions); prefecture-level names (prefecture-level cities, autonomous prefectures, prefectures, and leagues); county-level names (municipal districts, county-level cities, counties, autonomous counties, banners, autonomous banners, special districts, and forestry districts); township-level names (townships, towns, subdistricts, sumu, and district offices); and other address data (road names, village names, community names, building names, and square names).
  • The address metadata dictionary should include the aliases of each place name. It consists of multiple lines, each line being a term (Term). Each Term has two items, the place name and its address level (level): the place name is the key and the address level is its value.
  • The two items are separated by a semicolon ";". Some place names have several address levels (for example, an alias of one standard address may also be an alias of other standard addresses), in which case the levels are separated by commas ",".
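The dictionary format above can be read with a few lines of code. This is a sketch under assumptions: the sample lines, the function names, and the in-memory dict are illustrative, and names missing from the dictionary are given level 0, the value the document uses for unrecognized addresses.

```python
def parse_address_dictionary(lines):
    """Parse 'name;level[,level...]' lines into {name: [levels]}."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        name, _, levels = line.partition(";")
        table[name.strip()] = [int(part) for part in levels.split(",") if part.strip()]
    return table


def label_levels(tokens, table):
    """Attach every candidate level to each token; unknown names get level 0."""
    return [(token, table.get(token, [0])) for token in tokens]


# Hypothetical dictionary lines in the described format.
sample_lines = ["广东;1", "深圳;2,4", "宝安;3", "西乡;2,4"]
table = parse_address_dictionary(sample_lines)
labeled = label_levels(["广东", "深圳", "宝安", "西乡", "某村"], table)
```

The labeled output keeps all candidate levels per name; choosing one level per name is left to the recognition step.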
  • Addresses commonly take the following formats. (1) Provincial-level region, prefecture-level region, county-level region, township-level region, other; this format is common on the Internet, for example an address in Fuyang City, Anhui Province that continues through the county, Guanji Town, Chenqiao Village Committee, and Huxiaozhai Village. (2) County-level region, township-level region, other; when the county-level region is a county-level city, county, autonomous county, banner, autonomous banner, special district, or forestry district, the prefecture-level region can be omitted. This format is common on ID cards.
  • Table 1: Five-level hierarchical model of address levels. For convenience of processing, the level value is set to 1, 2, 3, 4, 0 in order of address level: "1" denotes level 1, "2" level 2, "3" level 3, "4" level 4, and "0" level 5.
  • The address level is read from the attribute of each place name in the address metadata dictionary. If a segmented name is not in the dictionary, it is an unrecognized address and is labeled level 0.
  • Step S114: Perform place-name entity recognition.
  • Place-name entity recognition identifies the most likely address level of each place name in the labeling result. For example, the address sequence "Guangdong Shenzhen Baoan Xixiang" stands for "Xixiang Street, Bao'an District, Shenzhen City, Guangdong Province". After segmentation and labeling the result is "Guangdong (1) Shenzhen (2, 4) Baoan (3) Xixiang (2, 4)"; the correct label sequence is "Guangdong (1) Shenzhen (2) Baoan (3) Xixiang (4)".
  • The system uses dynamic programming with backtracking (the Viterbi algorithm) to find the most likely label sequence.
  • In the Viterbi algorithm both the observations and the states are address levels, so the model reduces to a first-order Markov process.
  • Place-name entity recognition has two parts: the Viterbi procedure that produces the optimal address-level label sequence, and a correction step that uses contextual knowledge to fix optimal sequences that violate the rules, making the recognition result more precise.
  • The values in Pi are set from experience or prior knowledge.
  • They follow this principle: the higher the administrative level of an address, the higher its initial probability; for example, the initial probability of the provincial level is greater than that of the prefecture level.
  • A = {0.05, 0.45, 0.25, 0.15, 0.10};
  • In this example the most probable label sequence is the first labeling, so the dynamic programming algorithm also outputs "Guangdong (1) Shenzhen (2) Baoan (3) Xixiang (4)".
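The level selection described above can be sketched as a Viterbi search over the candidate levels of each name. The numbers here are illustrative assumptions: `PI` only respects the stated principle that higher administrative levels get higher initial probability, the first row of `A` reuses the five values listed in the text (read, as an assumption, as transitions out of level 1 in the order 1, 2, 3, 4, 0), and the remaining rows are invented for the example.

```python
import math

# Assumed initial probabilities (higher administrative level, higher value).
PI = {1: 0.40, 2: 0.25, 3: 0.20, 4: 0.10, 0: 0.05}

# Assumed transition probabilities A[prev][cur]; only row 1 comes from the text.
A = {
    1: {1: 0.05, 2: 0.45, 3: 0.25, 4: 0.15, 0: 0.10},
    2: {1: 0.02, 2: 0.05, 3: 0.50, 4: 0.30, 0: 0.13},
    3: {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.60, 0: 0.30},
    4: {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.40, 0: 0.50},
    0: {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.20, 0: 0.70},
}


def best_level_sequence(candidates):
    """Pick one level per name so the whole sequence has maximum probability.

    candidates: list of lists, the allowed levels for each name in order.
    """
    if not candidates:
        return []
    # Log-probability of the best path ending in each level of the first name.
    score = {level: math.log(PI[level]) for level in candidates[0]}
    back_pointers = []
    for allowed in candidates[1:]:
        new_score, pointers = {}, {}
        for cur in allowed:
            prev = max(score, key=lambda p: score[p] + math.log(A[p][cur]))
            new_score[cur] = score[prev] + math.log(A[prev][cur])
            pointers[cur] = prev
        back_pointers.append(pointers)
        score = new_score
    # Backtrack from the best final level.
    path = [max(score, key=score.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


levels = best_level_sequence([[1], [2, 4], [3], [2, 4]])
```

With these assumed probabilities the candidates from the Guangdong example resolve to levels 1, 2, 3, 4, in line with the sequence the text calls correct; different probability tables could give different results.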
  • As another example, the address entered is "Hebei Shijiazhuang Pingshan Guyue", and the labeled sequence is "Hebei (1, 2, 4) Shijiazhuang (2, 4) Pingshan (2, 3, 4) Guyue (4)".
  • The labels read as follows: "Hebei" may be an alias of "Hebei Province", of "Hebei District" in Tianjin, or of "Hebei Township"; "Shijiazhuang" may be an alias of "Shijiazhuang City" or "Shijiazhuang Town"; "Pingshan" may be an alias of "Pingshan County", "Pingshan District", or "Pingshan Town".
  • The optimal label sequence is "Hebei (1) Shijiazhuang (2) Pingshan (3) Guyue (4)".
  • For an address labeled as level 3, the prefecture-level city should be its direct predecessor address; if it is not, the label is corrected.
  • The rules are stored the other way round from the description above: the context is the alias of the prefecture-level city to which the county or county-level city belongs, for example (Taihe Yiyang). When this context is satisfied the label level is modified; otherwise it is left unchanged.
  • A second-level address and a fourth-level address can also share a name, mainly when the alias of a county-level city or county is the same as the alias of a township. Because fourth-level addresses can appear several times in one complete address, the second-level address is sometimes labeled as level 4, and the context is again needed to correct the label sequence. An example follows.
  • The input address is "Heilongjiang Heihe Wudalianchi Xinfa Township Hemin Village".
  • The optimal label sequence is "Heilongjiang (1) Heihe (2) Wudalianchi (4) Xinfa Township (4) Hemin Village (0)".
  • Here "Wudalianchi" is labeled as a fourth-level address, although it is in fact a county-level city.
  • After correction by context the label sequence is "Heilongjiang (1) Heihe (2) Wudalianchi (2) Xinfa Township (4) Hemin Village (0)". The solution is similar to the one for districts and counties that share an alias.
  • For townships and counties, the rule kept by the system uses as context the alias of the prefecture-level city to which the county or county-level city belongs, for example (Wudalianchi | Heihe). When this context is satisfied the label level is modified; otherwise it is left unchanged.
  • A mechanism is thus provided to correct the optimal label sequence according to context.
  • Ambiguity caused by aliases (one alias corresponding to several address levels) is resolved from the address context, which makes the result more accurate.
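The context correction can be sketched as a lookup in a small rule table. Everything concrete here is an assumption made for illustration: the rule table layout, the function name, the Chinese forms of the names, and the choice to treat the immediately preceding name as the context. The corrected level 2 follows the corrected sequence given in the example.

```python
# Hypothetical rule table: (alias, alias of the prefecture-level city it belongs to)
# mapped to the level the alias should carry when that context is present.
CONTEXT_RULES = {("五大连池", "黑河"): 2}


def correct_by_context(tokens, levels, rules=CONTEXT_RULES):
    """Relabel a name when the name before it matches a stored context rule."""
    fixed = list(levels)
    for i in range(1, len(tokens)):
        key = (tokens[i], tokens[i - 1])
        if key in rules:
            fixed[i] = rules[key]   # context satisfied: modify the level
        # context not satisfied: leave the label unchanged
    return fixed


tokens = ["黑龙江", "黑河", "五大连池", "新发乡", "和民村"]
corrected = correct_by_context(tokens, [1, 2, 4, 4, 0])
```

When the prefecture-level context is missing, the label produced by the Viterbi step is kept as is.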
  • Step S120: The postal code query subsystem standardizes the communication address to be queried, retrieves the closest standardized communication address, and returns the corresponding postal code.
  • The index file consists of multiple documents, each containing these fields: the address field, holding a complete standard address; the postal code field, holding the postal code of that address; and the lowest address level field, holding the administrative level of the lowest-level component of the address.
  • The value of the lowest address level field is computed for an address text as follows. First, the address text is preprocessed: extra spaces are deleted and full-width characters are converted to half-width characters. Second, the address is segmented and labeled.
  • The lowest address level in the resulting label sequence is then mapped to the field value: 1 → province; 2 → city; 3 → district; 4 → town; 0 → all.
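The mapping above can be written as a small lookup. The function name is hypothetical, and the sketch assumes the input is the final label sequence with one level per name; because level value 0 stands for the fifth (lowest) level, it takes precedence over 1 to 4.

```python
LOWEST_LEVEL_FIELD = {1: "province", 2: "city", 3: "district", 4: "town", 0: "all"}


def lowest_level_value(levels):
    """Return the field value for the lowest address level in a label sequence."""
    if not levels:
        raise ValueError("empty label sequence")
    # Level value 0 encodes the fifth level, which is lower than levels 1-4.
    lowest = 0 if 0 in levels else max(levels)
    return LOWEST_LEVEL_FIELD[lowest]


field_value = lowest_level_value([1, 2, 3])
```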
  • The detailed process of step S120 is shown in FIG. 3 and is as follows.
  • Step S121: Acquire the communication address to be queried determined by the user and preprocess it.
  • Because the user may have typed the address text directly in the address input subsystem without using the prompt function, the communication address confirmed by the user must be preprocessed; the process and content are the same as in the address input subsystem.
  • Step S122: Segment the address to obtain address metadata, and label every possible address level.
  • Step S123: Obtain the final place-name entity label sequence through place-name entity recognition, and generate a Query statement.
  • Step S124: Parse the Query statement and search the index file to find the closest communication address.
  • Step S125: Perform address completion to generate a standardized communication address, and return the postal code corresponding to it.
  • Steps S121 to S124 of the postal code query subsystem are implemented as in the address input subsystem; the address completion process is described here.
  • When the user submits a query, the system returns the results ranked with the address most similar to the user's text first. Because the reference data is incomplete, and every year brings new buildings, roads, and communities as well as changes to administrative divisions, the top-ranked address may differ from the user's input in the part after the district or county. The system uses address completion to modify the most similar result so that it is closer to what the user needs.
  • Address completion is a technique that improves query results based on the user's input, bringing them closer to the user's needs.
  • It is mainly applied at address levels where it is hard to collect every address.
  • These are the levels where many new addresses appear, chiefly levels 4 and 5.
  • It assumes the levels in the user's input appear in normal order, that is, no level-1 or level-2 address appears after a level-4 or level-5 address.
  • The level-4 address and everything after it in the user's input are identified and appended after the level-3 address of the most similar search result.
  • An example of address completion is shown in FIG. 4.
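The splice described above can be sketched as follows. The function name and all sample names are hypothetical, both addresses are assumed to be already segmented into (name, level) pairs in normal level order, and returning the search result unchanged when the user gave no level-4 or level-5 part is an assumption.

```python
def complete_address(best_match, user_tokens):
    """Keep the best match up to its level-3 part, then append the user's
    level-4 and later parts. Both arguments are lists of (name, level)."""
    low_levels = (4, 0)  # level 4 and level 5 (encoded as 0)

    head = []
    for name, level in best_match:
        if level in low_levels:
            break
        head.append(name)

    start = next((i for i, (_, level) in enumerate(user_tokens)
                  if level in low_levels), None)
    if start is None:
        # Assumption: nothing to splice, so keep the search result as is.
        return "".join(name for name, _ in best_match)

    tail = [name for name, _ in user_tokens[start:]]
    return "".join(head + tail)


# Hypothetical data: the index knows an old road, the user typed a new one.
best = [("广东省", 1), ("深圳市", 2), ("宝安区", 3), ("西乡街道", 4), ("某旧路", 0)]
typed = [("广东", 1), ("深圳", 2), ("宝安", 3), ("西乡", 4), ("新建路1号", 0)]
completed = complete_address(best, typed)
```

The standardized upper levels come from the index, while the lower levels, which the index may not cover, come from the user.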
  • In step S125 the postal code is determined by the lowest address level value of the labeled address, and the postal code of the standardized communication address is returned. The user can then select the result to obtain a map location, or send it to a mobile device as a two-dimensional code.
  • Another embodiment of the present invention provides a terminal for querying a postal code by communication address.
  • The terminal includes a user input prompting unit and a postal code query unit. The user input prompting unit prompts the user during input and receives the communication address to be queried that the user finally determines. The postal code query unit retrieves the standardized communication address closest to the communication address to be queried and receives the corresponding postal code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Document Processing Apparatus (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a system for querying mailing address postal codes, said system comprising a mailing address input subsystem and a postal code query subsystem; by means of said address input subsystem prompting a user in real time to input text, the user determines, according to the prompted list of addresses, the mailing address to be queried; said postal code query subsystem standardizes the mailing address to be queried and retrieves the closest standardized mailing address, while also returning the postal code corresponding to said standardized mailing address. By means of assisting a user to input a prompt, the present invention makes the query format freer; on the basis of named-entity recognition, the invention can identify the level of user input address metadata, thereby achieving progressive address querying and simultaneous completion of mailing addresses, making query results more precise; in addition, the user can also obtain query results in two-dimensional code or find the location by linking to a map. The present invention also provides a terminal for querying mailing address postal codes.

Description

System and terminal for querying a postal code by communication address
Technical field
The present invention relates to the field of postal code lookup, and in particular to a system and terminal for querying a postal code by communication address.
背景技术 Background technique
随着电子商务的突飞猛进和物流行业的信息化, 使得人们在足不出户的情况下 完成购物和邮寄物品, 大大节约了时间和金钱成本。 电子商务和物流行业都离不开 通信地址 (又称为通讯地址, 简称为地址) 和邮编, 这些数据都需要用户提供, 当 前一些电子商务网站和物流行业的主要的做法如下: 让用户手工输入完整的地址和 地址对应的邮编; 通过下拉列表提供省, 省下面地级市和地级市下面的区县, 这些 比较固定的地址让用户选择, 余下的地址和邮编由用户手工输入; 保留用户输入的 地址和邮编,方便下次再次使用, 即如果本次输入的地址和邮编之前已经有了一份, 直接选中, 就避免了让用户重复输入。 上述做法主要存在的问题如下: 很多情况下用户未必知道自己输入的地址对应 的邮编; 由于基于拼音的输入法和汉语本身存在的缺陷 (汉字存在多音字, 多个汉 字拥有相同的读音, 多数基于拼音的输入法都是基于统计的语言模型), 再加上地 址中存在的一些生僻字的原因会导致输入的地址存在错别字; 由于地名存在别名现 象, 即同一个地名有多种叫法, 例如 "广东省" 的别名有 "广东" 和 "粵", 因此 他们识别不了对同一个地名的不同描述; 有些情况下用户无法输入完整的地址, 当 输入的时候一脸茫然和无助; 由于地址存在变更和搜集不完全的问题, 这些网站的 数据往往得不到更新。 当前一些其他的网站能够解决第一个问题, 即帮助用户得到地址对应的邮编。 但是他们往往采用数据库技术来实现的系统, 对于低于区县级别的地址, 往往采用 字串模糊査询 (like %XXX% ) 的方式参与检索, 由于性能的原因此种方式对于大 数据量的査询效率很差。 另外基于数据库实现的査询使得用户的输入格式和内容受 到了很大的限制, 比如: 用户首先选择省级行政区 (包括省、 特别行政区、 自治区和直辖市) 的名字, 其次是选择地级行政区 (包括地级市、 自治州、 地区和盟) 级别的名字, 然后再县 级行政区 (包括市辖区、 县、 旗、 特区、 林区、 自治县和自治旗等) 级别的名字, 最后用户输入乡镇级别及村庄道路等。 査询的输入过程非常机械。 With the rapid development of e-commerce and the informationization of the logistics industry, people can save time and money by completing shopping and mailing items without leaving the house. The e-commerce and logistics industries are inseparable from the communication address (also known as the mailing address, referred to as the address) and the postal code. These data need to be provided by the user. The current main practices of some e-commerce websites and the logistics industry are as follows: The complete address and address corresponding zip code; through the drop-down list to provide provinces, provinces below the prefecture-level city and prefecture-level cities below the districts and counties, these relatively fixed addresses for users to choose, the rest of the address and zip code manually input by the user; The entered address and zip code are convenient for the next time to use again. That is, if the address and the postal code that was entered this time have already been received, directly selected, the user is prevented from repeatedly inputting. 
The main problems of the above methods are as follows: In many cases, the user may not know the zip code corresponding to the address entered by himself; due to the pinyin-based input method and the flaws in Chinese itself (Chinese characters have multiple syllables, and multiple Chinese characters have the same pronunciation, most of them are based on Pinyin input methods are based on statistical language models), plus some uncommon words in the address will cause the input address to have typos; because the place name has an alias phenomenon, that is, the same place name has multiple names, for example "Guangdong Province" has the aliases "Guangdong" and "Yue", so they can't identify different descriptions of the same place name; in some cases users can't enter the full address, when they type, they look blank and helpless; because of the address There are incomplete changes and incomplete collections, and the data on these sites are often not updated. Some other websites currently solve the first problem, which is to help users get the zip code corresponding to the address. However, they often use database technology to implement the system. For addresses below the district level, they often use string fuzzy queries (like %XXX%) to participate in the search. For performance reasons, this method is large. The query of data volume is very inefficient. In addition, the database-based query makes the user's input format and content greatly restricted. 
For example, the user first selects the name of the provincial administrative region (including the province, special administrative region, autonomous region, and municipality), and then selects the prefecture-level administrative region ( Including the name of the prefecture-level city, autonomous prefecture, region, and alliance level, and then the name of the county-level administrative region (including the municipal district, county, flag, SAR, forest area, autonomous county, and autonomous flag, etc.), and the last user enters the township level and Village roads, etc. The input process of the query is very mechanical.
[0005] In addition, the database-based query mode requires every address to fit a four-level format: provincial level, prefecture-level city, district or county level, and then the remaining specific address. Not all addresses fit this pattern. For example, there is no prefecture level under a municipality directly under the central government, nor between a province and its directly administered counties or county-level cities, and some special prefecture-level cities have no district or county level, such as Zhongshan City and Dongguan City in Guangdong Province, Sanya City and Sansha City in Hainan Province, and Jiayuguan City in Gansu Province. Such systems work around the gap by inserting placeholder names such as "直辖区县" (directly administered districts and counties), "市辖区" (municipal district), or "省直辖县" (province-administered county), but the query results then generally include these placeholder entries, which are not real address components. There is therefore a need for a system that prompts users as they type, offers complete reference addresses, and standardizes the address to be queried so that the postal code can be looked up precisely.
Summary of the invention
To this end, the present invention addresses at least one of the above defects. The invention provides a system and terminal for querying postal codes by communication address. Input prompts give the user a freer query format; named entity recognition identifies the level of each address metadata item in the user's input, which enables level-by-level address querying; and the communication address is completed, so that query results are more precise. In addition, the user can obtain the query result as a two-dimensional barcode or follow a link to a map for positioning.
Accordingly, one embodiment of the present invention provides a system for querying postal codes by communication address, comprising a communication address input subsystem and a postal code query subsystem. The address input subsystem prompts the user in real time as text is entered, and the user determines the communication address to be queried from the addresses in the prompt list. The postal code query subsystem standardizes the communication address to be queried, retrieves the closest standardized communication address, and returns the postal code corresponding to that standardized address.
Preferably, determining the communication address to be queried may also include: the user may ignore the addresses in the prompt list, in which case the communication address to be queried is determined solely from the text the user entered.
The real-time prompting includes automatically updating the prompt content each time the user's input text grows. The prompt content is produced as follows: obtain the address text currently entered by the user and preprocess it, removing extra spaces; segment the address to obtain address metadata and label every candidate address level; perform place-name entity recognition to obtain the final place-name label sequence and generate a Query statement; search the address index file to obtain the addresses for the prompt list.
Preferably, the preprocessing further includes converting full-width digits and letters to half-width characters; the dictionary used in this processing is stored in a double-array Trie data structure.
The prompt list addresses are sorted in descending order of closeness to the standard address.
Standardizing the communication address to be queried includes the following steps: obtain the communication address chosen by the user and preprocess it; segment the address to obtain address metadata and label every candidate address level; perform place-name entity recognition to obtain the final place-name label sequence and generate a Query statement; parse the Query statement, search the index file, and compare against it to obtain the closest communication address; complete the address to produce a standardized communication address, and return the postal code corresponding to that standardized address.
Preferably, the corresponding postal code is determined according to the lowest address level value of the labeled address.
Returning the postal code corresponding to the standardized communication address may further include: for a selected postal code query result, the user can obtain a map location, or send the postal code query result to a mobile terminal device by means of a two-dimensional barcode.
Preferably, the address segmentation uses a bigram word segmentation method; the named entity recognition identifies the most likely address level of each place-name metadata item in the place-name labeling result.
Another embodiment of the present invention provides a terminal for querying postal codes by communication address. The terminal comprises a user input prompting unit and a postal code query unit. The user input prompting unit prompts the user in real time during input and receives the communication address to be queried that the user finally settles on. The postal code query unit retrieves the standardized communication address closest to the communication address to be queried and receives the postal code corresponding to that standardized address. The invention uses input prompts to give the user a freer query format; named entity recognition identifies the level of each address metadata item in the user's input, which enables level-by-level address querying; and the communication address is completed, so that query results are more precise. In addition, the user can obtain the query result as a two-dimensional barcode or follow a link to a map for positioning.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a system for querying postal codes by communication address according to an embodiment of the present invention.
FIG. 2 is a detailed flowchart of the address input subsystem according to an embodiment of the present invention.
FIG. 3 is a detailed flowchart of the address input subsystem according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of an example of address completion in the postal code query subsystem according to an embodiment of the present invention.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and do not limit it.
The system and terminal for querying postal codes by communication address provided by the invention use input prompts to give the user a freer query format. Named entity recognition identifies the level of each address metadata item in the user's input, which enables level-by-level address querying, and the communication address is completed so that query results are more precise. The user can also obtain the query result as a two-dimensional barcode or follow a link to a map for positioning.
FIG. 1 is a schematic flowchart of a system for querying postal codes by communication address according to an embodiment of the present invention. The system comprises a communication address input subsystem and a postal code query subsystem, and operates in the following steps:
Step S110: the address input subsystem prompts the user in real time as text is entered, and the user determines the communication address to be queried from the addresses in the prompt list.
The detailed flow of step S110 is shown in FIG. 2 and is as follows. Step S111: obtain the address text entered by the user and preprocess it. Preprocessing mainly consists of converting full-width digits and letters to half-width characters and removing extra spaces.
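By way of non-limiting illustration, the preprocessing of step S111 might be sketched in Python as follows. The function names `to_halfwidth` and `preprocess` are hypothetical, and the sketch assumes that "removing extra spaces" means collapsing each run of whitespace to a single space and trimming both ends; the description does not fix that detail.

```python
import re

# Full-width ASCII variants (U+FF01..U+FF5E) sit 0xFEE0 above their half-width forms.
FULLWIDTH_OFFSET = 0xFEE0


def to_halfwidth(text: str) -> str:
    """Convert full-width digits and letters to half-width; leave everything else alone."""
    out = []
    for ch in text:
        code = ord(ch)
        half = chr(code - FULLWIDTH_OFFSET) if 0xFF01 <= code <= 0xFF5E else ch
        # Only digits and letters are converted, as step S111 describes.
        out.append(half if half.isascii() and half.isalnum() else ch)
    return "".join(out)


def preprocess(text: str) -> str:
    """Normalize character width, then collapse and trim whitespace (an assumption)."""
    text = to_halfwidth(text)
    text = text.replace("\u3000", " ")  # ideographic space becomes an ASCII space
    return re.sub(r"\s+", " ", text).strip()
```

Full-width punctuation is deliberately left untouched here, since the description mentions only digits and letters.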
The input prompt updates automatically each time the user's input text grows. Real-time prompting can also be skipped: the user can type the communication address to be queried directly into the address input prompt system. If real-time prompting is used, the addresses in the prompt list are sorted in descending order of closeness to the standard address.
Step S112: segment the address text.
Because the full-text index is tokenized with a bigram model, the longest Chinese token in the index has length 2, whereas most Chinese place names are longer than 2 characters. For each address metadata item that has been identified, PhraseQuery syntax is therefore generated to filter out the token formed by the last character of one item and the first character of the next. For example, if the user enters "广东省深圳市" (Shenzhen City, Guangdong Province), the PhraseQuery built after place-name recognition is: "广东省" "深圳市", that is, each place-name metadata item is enclosed in half-width double quotes. This filters out the results that would otherwise be matched by the cross-boundary token "省深" and greatly improves precision.
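The effect of the bigram index and of the quoted query can be illustrated with a short Python sketch. The helper names `bigrams` and `phrase_query` are hypothetical; the sketch only shows which tokens a bigram tokenizer produces and how the quoted query string is assembled, and it does not call any search library.

```python
from typing import List


def bigrams(text: str) -> List[str]:
    """Split text into overlapping two-character tokens, as a bigram tokenizer would."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def phrase_query(place_names: List[str]) -> str:
    """Wrap each recognized place name in half-width double quotes."""
    return " ".join(f'"{name}"' for name in place_names)


tokens = bigrams("广东省深圳市")
# The cross-boundary token "省深" is among the bigrams of the raw input.
has_spurious_token = "省深" in tokens

query = phrase_query(["广东省", "深圳市"])
```

Quoting each place name separately means the two names are matched as whole phrases, so a document is not matched merely because it contains the token "省深".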
Dictionary-based word segmentation is usually done by forward (left-to-right) matching or reverse (right-to-left) matching. Reverse matching typically has about half the segmentation error rate of forward matching and handles overlapping ambiguity better. Overlapping ambiguity is defined as follows: given three consecutive Chinese characters ABC, both AB and BC can form words; in Chinese, BC is generally the more likely word. Address segmentation uses the reverse maximum matching algorithm over the address metadata dictionary, scanning the user's address text from right to left. To speed up lookup, the dictionary is stored in a double-array (Double Array) Trie data structure.
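A minimal sketch of reverse maximum matching follows, as an illustration rather than the claimed implementation. For brevity the dictionary is a plain Python set instead of a double-array Trie, and characters not covered by any dictionary word are emitted one at a time; both choices are assumptions of this sketch, as is the function name `reverse_max_match`.

```python
from typing import List, Optional, Set


def reverse_max_match(text: str, dictionary: Set[str],
                      max_len: Optional[int] = None) -> List[str]:
    """Segment text right to left, always taking the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    pieces = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                pieces.append(candidate)
                end -= size
                break
    pieces.reverse()  # pieces were collected from the right end
    return pieces


address_pieces = reverse_max_match("广东省深圳市南山区", {"广东省", "深圳市", "南山区"})
# Overlapping ambiguity: both "AB" and "BC" are words; the right-to-left scan picks "BC".
ambiguous_pieces = reverse_max_match("ABC", {"AB", "BC"})
```

The second call shows the behaviour discussed above: with ABC where AB and BC are both words, the right-to-left scan keeps BC intact.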
Step S113: label the address.
This step uses address metadata, which can be obtained from Wikipedia and from the National Bureau of Statistics data on China's administrative divisions, and can also be extracted from complete communication addresses using address segmentation and recognition. The address metadata mainly comprises: provincial-level administrative region names (provinces, autonomous regions, municipalities directly under the central government, and special administrative regions); prefecture-level administrative region names (prefecture-level cities, autonomous prefectures, prefectures, and leagues); county-level administrative region names (municipal districts, county-level cities, counties, autonomous counties, banners, autonomous banners, special districts, and forestry districts); township-level administrative region names (townships, towns, subdistricts, sumu, and district public offices); and other address data (road names, village names, residential community names, building names, plaza names, and so on).
The address metadata dictionary should contain the various aliases of each place name. Its format is as follows: the dictionary consists of multiple lines, each line being one term (Term). Each Term contains two items, the place name and the address level (level) of that place name, where the place name is the key and the address level is the attribute or value of the key. The two items are separated by a half-width semicolon ";". Some place names have several address levels (for example, an alias of one standard address may also be an alias of other standard addresses); the different levels are separated by half-width commas ",".
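Assuming one Term per line in the "name;level,level" layout just described, the dictionary could be loaded and used for labeling roughly as sketched below. The function names are hypothetical. The sample entries use the level values that appear in the worked example of step S114, and unrecognized items are labeled 0, as the description specifies for addresses missing from the dictionary.

```python
from typing import Dict, Iterable, List, Tuple


def parse_dictionary(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Parse 'name;level[,level...]' terms into a {name: [levels]} mapping."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, levels = line.split(";", 1)
        table[name] = [int(value) for value in levels.split(",")]
    return table


def label(pieces: List[str], table: Dict[str, List[int]]) -> List[Tuple[str, List[int]]]:
    """Attach the candidate levels to each segmented item; unknown items get level 0."""
    return [(piece, table.get(piece, [0])) for piece in pieces]


table = parse_dictionary(["广东;1", "深圳;2,4", "宝安;3", "西乡;2,4"])
labeled = label(["广东", "深圳", "宝安", "西乡"], table)
```

An item such as "深圳" keeps both candidate levels at this stage; choosing between them is left to the place-name entity recognition step.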
People usually write addresses in one of the following formats:
1. Provincial-level region - prefecture-level region - county-level region - township-level region - other (this format is common for addresses on the Internet), for example: 安徽省阜阳市太和县关集镇陈桥村委会胡小寨村 (Huxiaozhai Village, Chenqiao Village Committee, Guanji Town, Taihe County, Fuyang City, Anhui Province);
2. Provincial-level region - county-level region - township-level region - other (when the county-level region is a county-level city, county, autonomous county, banner, autonomous banner, special district, or forestry district, the prefecture-level region can be omitted; this format is common on identity cards), for example: 安徽省太和县关集镇陈桥村委会胡小寨村 (Huxiaozhai Village, Chenqiao Village Committee, Guanji Town, Taihe County, Anhui Province);
3. Provincial-level region - prefecture-level region - township-level region - other (this format is mainly used where a prefecture-level region has no county-level regions, as in Zhongshan City and Dongguan City in Guangdong Province, Sanya City and Sansha City in Hainan Province, and Jiayuguan City in Gansu Province), for example: 广东省东莞市樟木头镇九明村 (Jiuming Village, Zhangmutou Town, Dongguan City, Guangdong Province);
4. Provincial-level region - prefecture-level region - county-level region - other, for example: 广东省深圳市南山区高新南环路29号留学生创业大厦 (Overseas Students Venture Building, 29 Gaoxin South Ring Road, Nanshan District, Shenzhen City, Guangdong Province);
5. Provincial-level region - county-level region - other (this format is mainly used for addresses in municipalities directly under the central government, or where there is no prefecture-level city; in Hainan Province, for instance, everything outside Sanya, Sansha, and Haikou is a province-administered county-level city or county), for example: 上海市浦东新区南京西路1500号 (1500 Nanjing West Road, Pudong New Area, Shanghai).
Based on the five formats above, and for convenience of processing, addresses are divided into five levels, as shown in Table 1 below:
| Address level | Administrative regions | Examples |
|---|---|---|
| Level 1 | Province, autonomous region, municipality directly under the central government, special administrative region | Guangdong Province, Inner Mongolia Autonomous Region, Shanghai Municipality, Hong Kong Special Administrative Region |
| Level 2 | Prefecture-level city, district of a municipality directly under the central government, prefecture, autonomous prefecture, league, county-level city, county, autonomous county, banner, autonomous banner, special district, forestry district | Shenzhen City, Pudong New Area, Daxing'anling Prefecture, Enshi Tujia and Miao Autonomous Prefecture, Xilingol League, Tongcheng City, Taihe County, Changbai Korean Autonomous County, Horqin Left Rear Banner, Oroqen Autonomous Banner, Liuzhi Special District, Shennongjia Forestry District |
| Level 3 | District of a prefecture-level city | Nanshan District |
| Level 4 | Township, ethnic township, town, subdistrict, sumu, road | Zhaoji Township, Xutang Qiang Ethnic Township, Guanji Town, Yuehai Subdistrict, Darihan Ula Sumu, Shennan Avenue |
| Level 5 | Village, residential community, building, plaza, number, unrecognized place name | Liutang Village, Haiyi Oriental Garden, Overseas Students Venture Building, Wanda Plaza, Heavenly Stems, serial numbers, etc. |
Table 1: Five-level address classification model.
For convenience of processing, the value of level is set to 1, 2, 3, 4, and 0 for the five address levels in order: "1" denotes a level-1 address, "2" a level-2 address, "3" a level-3 address, "4" a level-4 address, and "0" a level-5 address. The address level is read from the attribute of each place name in the address metadata dictionary. If a segmented address item is not in the dictionary, it is an unrecognized address and is labeled level 0.
Step S114: perform place-name entity recognition.
Place names have aliases, and people tend to express information as economically as possible, describing a place by its short name (alias). They also write addresses loosely, omitting high-level place names (most commonly the province), or they enter an address at an arbitrary level, or only a short address fragment, and still expect an approximate result or prompt. All of this demands strong address recognition, which is what this step provides.
Place-name entity recognition determines the most likely address level for each place name in the labeling result. Take the address sequence "Guangdong Shenzhen Bao'an Xixiang" (广东深圳宝安西乡), whose full form is "Xixiang Subdistrict, Bao'an District, Shenzhen City, Guangdong Province" (广东省深圳市宝安区西乡街道). After segmentation and labeling, the result is "Guangdong(1) Shenzhen(2,4) Bao'an(3) Xixiang(2,4)"; the correct label sequence is "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)". The system uses dynamic programming with backtracking (the Viterbi algorithm) to find the most probable label sequence. Both the observations and the states in the Viterbi algorithm are address levels, so the algorithm reduces to a first-order Markov process.
[0038] Place-name entity recognition has two parts. The first is the Viterbi pass that produces the optimal sequence of address level labels. The second uses contextual knowledge to correct optimal label sequences that violate known rules, which makes the recognition result more accurate. The Viterbi algorithm is described as follows. It includes an initial state vector Pi = {p_0, p_1, p_2, p_3, p_4}, where p_i is the initial probability of the address level whose level value is i. The values in Pi are set from experience or prior knowledge according to the following principle: the higher the administrative level of an address, the higher its initial probability; for example, the initial probability of the provincial level is greater than that of the prefecture level.
[0039] The following example illustrates the algorithm. The probability model for the Viterbi algorithm is built from prior knowledge; Pi and A (the transition probabilities between the levels of adjacent address items) can take the following initial values:
Pi = {0.05, 0.45, 0.25, 0.15, 0.10};
A = { {0.05, 0.45, 0.25, 0.15, 0.10};
      {0.05, 0.23, 0.45, 0.17, 0.10};
      {0.05, 0.18, 0.25, 0.30, 0.22};
      {0.05, 0.35, 0.05, 0.05, 0.50};
      {0.05, 0.30, 0.15, 0.05, 0.45} }.
Suppose the input address is "Guangdong Shenzhen Bao'an Xixiang" (广东深圳宝安西乡). After the address segmentation and address labeling described above, four candidate label sequences are possible: "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)", "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2)", "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4)", and "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2)". According to the Viterbi algorithm, the weights of the four label sequences are:
1. Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4); P = 0.030375;
2. Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2); P = 0.0030375;
3. Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4); P = 0.001125;
4. Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2); P = 1.125E-4.
The first label sequence has the highest probability, so the dynamic programming algorithm outputs the first label sequence, "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)".
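The weights above can be reproduced with a compact Viterbi sketch. It assumes that Pi and the rows and columns of A are indexed by the level value (0 to 4), an interpretation that matches the four weights listed; for instance, the first sequence gives 0.45 × 0.45 × 0.30 × 0.50 = 0.030375. The function name `viterbi` and the representation of the input as per-item candidate levels are hypothetical, and the sketch carries each partial path along with its score instead of keeping separate backtracking pointers.

```python
from typing import List, Tuple

# Initial and transition probabilities from the example, indexed by level value 0..4.
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def viterbi(candidates: List[List[int]]) -> Tuple[List[int], float]:
    """Return the most probable level sequence and its probability."""
    # best[level] = (probability of the best path ending in `level`, that path)
    best = {level: (PI[level], [level]) for level in candidates[0]}
    for options in candidates[1:]:
        step = {}
        for level in options:
            prob, path = max(
                ((p * A[prev][level], pth) for prev, (p, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[level] = (prob, path + [level])
        best = step
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


# Guangdong(1) Shenzhen(2,4) Bao'an(3) Xixiang(2,4)
best_path, best_prob = viterbi([[1], [2, 4], [3], [2, 4]])
```

Under this indexing, `best_path` is the first label sequence of the example and `best_prob` equals its listed weight up to floating-point rounding.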
This model and algorithm cannot resolve the case where the alias of a district of a prefecture-level city is the same as the alias of a county or county-level city. For example, "Taihe County" (太和县, under Fuyang City, Anhui Province) and "Taihe District" (太和区, under Jinzhou City, Liaoning Province) share the alias "Taihe" (太和) but belong to different address levels. For both "Fuyang (City) Taihe" and "Jinzhou (City) Taihe", the algorithm and probability model label "Taihe" at the third address level, since that has the highest probability. To resolve such cases, whether the address level is "2" or "3" must be decided from the preceding address name; cases like this are treated as special cases and the label sequence is corrected. For example:
The input address is "Hebei Shijiazhuang Pingshan Guyue" (河北石家庄平山古月), and the labeled address sequence is "Hebei(1,2,4) Shijiazhuang(2,4) Pingshan(2,3,4) Guyue(4)". The candidate levels are interpreted as follows: "Hebei" may be an alias of "Hebei Province", of "Hebei District" in Tianjin, or of "Hebei Township"; "Shijiazhuang" may be an alias of "Shijiazhuang City" or "Shijiazhuang Town"; "Pingshan" may be an alias of "Pingshan County", "Pingshan District", or "Pingshan Town".
The optimal label sequence is: "Hebei(1) Shijiazhuang(2) Pingshan(3) Guyue(4)".
After context-based correction the label sequence is: "Hebei(1) Shijiazhuang(2) Pingshan(2) Guyue(4)", because "Pingshan" here is "Pingshan County".
It follows that when the alias of a district of a prefecture-level city coincides with the alias of a county or county-level city, the system must check whether the prefecture-level city to which the address labeled level 3 belongs is its direct predecessor in the address; if it is not, the label is corrected. For convenience, the context rules are stored in the opposite form: for each such alias, the rule records the alias of the prefecture-level city that administers the county or county-level city as its context, for example (Taihe - Fuyang). When this context is matched, the label level is modified; otherwise nothing is changed.
Level-2 and level-4 addresses can also share a name, mainly when the alias of a county-level city or county is the same as the alias of a township or town. Because level-4 addresses can occur several times in a row in a complete address, a level-2 address is sometimes labeled as level 4. Here too the label sequence must be revised according to context. For example:
The input address is "Heilongjiang Heihe Wudalianchi Xinfa Township Hemin Village" (黑龙江黑河五大连池新发乡和民村), and the optimal label sequence is "Heilongjiang(1) Heihe(2) Wudalianchi(4) Xinfa Township(4) Hemin Village(0)". "Wudalianchi" is labeled at the fourth address level here, although it is actually a county-level city.
After context-based correction the label sequence is "Heilongjiang(1) Heihe(2) Wudalianchi(2) Xinfa Township(4) Hemin Village(0)". The solution mirrors the one for districts and counties that share an alias: for a township and a county with the same name, the rule kept by the system records the alias of the prefecture-level city administering the county or county-level city as the context, for example (Wudalianchi - Heihe). When this context is matched, the label level is modified; otherwise nothing is changed.
For such special cases, the system therefore also provides a mechanism that corrects the optimal label sequence according to context: the address context is used to eliminate the ambiguity introduced by aliases (one alias corresponding to several address levels). The results obtained this way are more accurate.
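The context-based correction can be sketched as a lookup over (alias, parent alias) pairs. The rule set below contains the two pairs named in the description, (Taihe - Fuyang) and (Wudalianchi - Heihe), plus (Pingshan - Shijiazhuang) implied by the Hebei example. The names `CONTEXT_RULES` and `correct_by_context` are hypothetical, and the sketch assumes that every item in the sequence has already been reduced to its alias.

```python
from typing import List, Tuple

# (alias of county or county-level city, alias of the prefecture-level city above it)
CONTEXT_RULES = {
    ("太和", "阜阳"),
    ("平山", "石家庄"),
    ("五大连池", "黑河"),
}


def correct_by_context(labeled: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Relabel an item as level 2 when its recorded parent directly precedes it."""
    corrected = list(labeled)
    for i in range(1, len(corrected)):
        name, level = corrected[i]
        previous_name = corrected[i - 1][0]
        if (name, previous_name) in CONTEXT_RULES and level != 2:
            corrected[i] = (name, 2)  # context matched: it is a county or county-level city
    return corrected


hebei = correct_by_context([("河北", 1), ("石家庄", 2), ("平山", 3), ("古月", 4)])
jinzhou = correct_by_context([("锦州", 2), ("太和", 3)])
```

In the Jinzhou case no rule matches, so "太和" keeps level 3, which corresponds to Taihe District; under Fuyang the same alias would be relabeled to level 2.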
Step S120: the postal code query subsystem standardizes the communication address to be queried, retrieves the closest standardized communication address, and returns the postal code corresponding to that standardized address.
在邮政编码査询子系统中需要建立地址査询邮编的索引文件, 该索引文件是由 很多个文档 (Document) 构成, 每一个文档包含的字段有: 地址 (Address) 域, 一条完整的标准地址; 邮编 (ZIPcode) 域, 和完整的标准地址相关联的邮政编码; 地址的最低等级 (Level) 域, 地址中最低级别地址的行政区划级别。 其中地址的最 低等级域 (Level Field) 包含的数据值如下:  In the postal code query subsystem, an index file for the address query zip code needs to be established. The index file is composed of a plurality of documents, each of which contains fields: an address field, a complete standard address. ; ZIP code domain, the zip code associated with the full standard address; the lowest level of the address, the administrative level of the lowest level address in the address. The data field value of the lowest level field (Level Field) is as follows:
provincial administrative level (including provinces, autonomous regions, municipalities directly under the central government, and special administrative regions), represented by province;
prefecture administrative level (including prefecture-level cities, autonomous prefectures, prefectures, leagues, and the districts of municipalities directly under the central government), represented by city;
county administrative level (including municipal districts, counties, banners, special districts, forestry districts, autonomous counties, autonomous banners, and the like), represented by district;
township administrative level (including townships, towns, subdistricts, sumu, and district public offices), represented by town;
below the township administrative level, represented by all.
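The document structure described above can be sketched as a small record type. This is only an illustration: the class name `IndexDocument`, the Python field names, and the sample values (including the placeholder postal code `000000`) are assumptions made here, not part of the patent text, which names only the Address, ZIPcode, and Level fields and the five level values.

```python
from dataclasses import dataclass

# The five values the Level field may take, as listed in the description.
LEVEL_VALUES = ("province", "city", "district", "town", "all")


@dataclass(frozen=True)
class IndexDocument:
    """One document of the address-to-postal-code index (hypothetical layout)."""
    address: str   # Address field: one complete standard address
    zipcode: str   # ZIPcode field: postal code associated with that address
    level: str     # Level field: level of the lowest address element

    def __post_init__(self) -> None:
        # Reject values outside the five levels named in the description.
        if self.level not in LEVEL_VALUES:
            raise ValueError(f"unknown level value: {self.level!r}")


# Illustrative record only; the address and postal code are placeholders.
doc = IndexDocument(address="广东省深圳市南山区", zipcode="000000", level="district")
```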
For a given address text, the value of its lowest-level field is computed as follows. First, the address text is preprocessed; preprocessing includes deleting extra spaces and converting full-width characters to half-width characters. Next come address segmentation and address labeling.
Then place-name entity recognition is performed to obtain the final place-name entity label sequence.
Finally, the lowest address level of the address text is computed according to the following rules. The address levels in the label sequence are defined as follows:
1 > 2 > 3 > 4 > 0, that is, level-one address > level-two address > level-three address > level-four address > level-five address;
when the lowest address level in the label sequence is level five, return 0;
otherwise, when the lowest address level in the label sequence is level four and more than one level-four address is present, return 0;
otherwise, when the label sequence contains more than 2 level-two addresses, or more than 1 level-three address, or the number of level-three addresses plus the number of level-two addresses exceeds 2, return 4;
otherwise, when the lowest address level in the label sequence is exactly two consecutive level-two addresses, return 3;
otherwise, when the lowest address level in the label sequence is level four and exactly one level-four address is present, return 0 if that level-four address is a road, and 4 otherwise;
in all other cases, return the lowest address level.
The lowest address level is then mapped to the value of the lowest-level field: 1→province; 2→city; 3→district; 4→town; 0→all.
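The rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the `Tag` record, its `is_road` flag, and the function name `lowest_level` are hypothetical, and the "exactly two consecutive level-two addresses" rule is read here as "the sequence ends with two adjacent level-two tags and contains nothing lower".

```python
from typing import List, NamedTuple


class Tag(NamedTuple):
    """One labeled address element (hypothetical representation)."""
    text: str
    level: int            # 1, 2, 3, 4, or 0 for a level-five address
    is_road: bool = False  # assumed flag marking a level-four road


# Mapping from the computed level to the Level field value.
LEVEL_NAMES = {1: "province", 2: "city", 3: "district", 4: "town", 0: "all"}
# Depth ranking implementing 1 > 2 > 3 > 4 > 0 (larger rank = lower level).
RANK = {1: 1, 2: 2, 3: 3, 4: 4, 0: 5}


def lowest_level(tags: List[Tag]) -> int:
    """Apply the rules in the order given in the description."""
    if not tags:
        raise ValueError("label sequence must not be empty")
    levels = [t.level for t in tags]
    lowest = max(levels, key=RANK.__getitem__)
    n2, n3, n4 = levels.count(2), levels.count(3), levels.count(4)

    if lowest == 0:                       # a level-five element is present
        return 0
    if lowest == 4 and n4 > 1:            # several level-four elements
        return 0
    if n2 > 2 or n3 > 1 or n2 + n3 > 2:   # too many level-two/three elements
        return 4
    if lowest == 2 and n2 == 2 and levels[-2:] == [2, 2]:
        return 3                          # two consecutive level-two elements
    if lowest == 4 and n4 == 1:           # single level-four element
        road = next(t for t in tags if t.level == 4).is_road
        return 0 if road else 4
    return lowest                         # all other cases
```

For instance, a sequence labeled province, city, district yields 3, which maps to `district` through `LEVEL_NAMES`; appending a single level-four road to the same sequence yields 0, which maps to `all`.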
The detailed flow of step S120 is shown in FIG. 3 and is as follows:
Step S121: Obtain the communication address to be queried, as confirmed by the user, and preprocess it.
In the address input subsystem, the user may choose to type the address text directly instead of using the input prompts the system provides. The communication address confirmed by the user therefore has to be preprocessed. The preprocessing procedure and its content are the same as in the address input subsystem.
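The preprocessing named in the description (deleting extra spaces and converting full-width characters to half-width characters) can be sketched as below. The function name `preprocess` is hypothetical, and two choices are assumptions made here: "extra spaces" is taken to mean leading or trailing whitespace and runs of whitespace, which are collapsed to one space, and the conversion covers the full-width ASCII variants U+FF01 to U+FF5E plus the ideographic space U+3000.

```python
import re


def preprocess(text: str) -> str:
    """Normalize an address string before segmentation (illustrative sketch)."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space becomes an ordinary space.
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width ASCII variants sit exactly 0xFEE0 above their
            # half-width counterparts.
            converted.append(chr(code - 0xFEE0))
        else:
            converted.append(ch)
    # Collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", "".join(converted)).strip()
```

A system that wants no spaces at all inside a Chinese address could replace the last line with a substitution that removes whitespace entirely; the description does not say which behavior is intended.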
Step S122: Segment the address to obtain the address metadata, and label all address levels.
Step S123: Obtain the final place-name entity label sequence through place-name entity recognition, and generate a Query statement.
Step S124: Parse the Query statement and search the index file for matches to obtain the closest communication address.
Step S125: Perform address completion to generate the standardized communication address, and return the postal code corresponding to that standardized communication address.
The steps of the postal code query subsystem closely resemble those of the address input subsystem. The only difference is that the postal code query subsystem must complete the communication address. For the implementation of steps S121 to S124, refer to the corresponding procedure in the address input subsystem; the following mainly describes the address completion process.
After the user submits a query request, the system returns the query results and ranks first the address most similar to the address text entered by the user. Because the collected reference data is not fully complete, because new buildings, roads, and residential communities are added every year, and because administrative divisions change, the part of the top-ranked address that follows the district or county may differ from what the user entered. The system therefore applies address completion to adapt the most similar result so that it comes closer to what the user requires.
Address completion is a technique that refines the query result according to the user's input so that the result better matches the user's needs. It is mainly used where the addresses at a given level are hard to collect exhaustively and new ones are added in large numbers, which is chiefly the case for level-four and level-five addresses. The precondition for address completion is that the address levels in the text entered by the user appear in normal order, that is, no level-one, level-two, or level-three address appears after a level-four or level-five address. The level-four address and everything after it in the user's input are identified and appended after the level-three address of the most similar search result. An example of address completion is shown in FIG. 4.
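The completion step just described can be sketched as follows, assuming both the user's input and the best-matching result are already available as lists of `(text, level)` pairs, with level 0 standing for a level-five address. The function name `complete_address`, this representation, the choice to return the search result unchanged when the precondition fails, and the sample addresses in the usage note are all assumptions for illustration.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, int]]  # (text, level); level 0 means level five


def complete_address(user_tags: Tagged, match_tags: Tagged) -> str:
    """Append the user's level-four-and-later part to the match's prefix."""
    unchanged = "".join(text for text, _ in match_tags)

    # Index of the first level-four or level-five element typed by the user.
    first_low = next(
        (i for i, (_, lvl) in enumerate(user_tags) if lvl in (4, 0)), None
    )
    if first_low is None:
        return unchanged  # nothing to append

    # Precondition: no level-one/two/three element after a level-four/five one.
    if any(lvl in (1, 2, 3) for _, lvl in user_tags[first_low:]):
        return unchanged

    # Position of the level-three element in the most similar result.
    last3 = max(
        (i for i, (_, lvl) in enumerate(match_tags) if lvl == 3), default=None
    )
    if last3 is None:
        return unchanged

    merged = match_tags[: last3 + 1] + user_tags[first_low:]
    return "".join(text for text, _ in merged)
```

With a made-up input of 深圳市 / 南山区 / 科技园路 / 88号 and a made-up best match of 广东省 / 深圳市 / 南山区 / 科技路, the sketch keeps the match up to 南山区 and appends the user's 科技园路88号.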
In step S125, the corresponding postal code is determined according to the lowest address level value of the labeled address. When the postal code corresponding to the standardized communication address is finally returned, the user may also select a confirmed postal code query result in order to obtain a map location, or have the query result sent to a mobile terminal device by means of a two-dimensional code.
Another embodiment of the present invention provides a terminal for querying a postal code by communication address. The terminal comprises a user input prompting unit and a postal code query unit. The user input prompting unit is configured to prompt the user's input in real time and to receive the communication address to be queried as finally confirmed by the user. The postal code query unit is configured to retrieve the standardized communication address closest to the communication address to be queried and to receive the postal code corresponding to that standardized communication address. By prompting the user during input, the invention allows a freer query format. By means of named entity recognition, it identifies the level of each address metadata element entered by the user, which enables level-by-level address querying. It also completes the communication address, which makes the query result more precise. In addition, the user can obtain the query result as a two-dimensional code or follow a link to a map for positioning.
The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention pertains may make a number of simple deductions or substitutions without departing from the concept of the invention.

Claims

1. A system for querying a postal code by communication address, characterized in that the system comprises a communication address input subsystem and a postal code query subsystem;
the address input subsystem prompts the user in real time on the text being entered, and the user determines the communication address to be queried from the addresses in the prompt list;
the postal code query subsystem standardizes the communication address to be queried, retrieves the closest standardized communication address, and returns the postal code corresponding to that standardized communication address.
2. The system according to claim 1, characterized in that determining the communication address to be queried may further comprise: the user may decline to select an address from the prompt list, in which case the communication address to be queried is determined solely from the text entered by the user.
3. The system according to claim 1, characterized in that the real-time prompting comprises:
automatically changing the prompt content each time the text entered by the user grows;
the prompt content being produced by the following steps:
obtaining the address text currently entered by the user and preprocessing it, deleting extra spaces;
segmenting the address to obtain the address metadata, and labeling all address levels;
obtaining the final place-name entity label sequence through place-name entity recognition, and generating a Query statement;
searching the index address file to obtain the addresses of the prompt list.
4. The system according to claim 3, characterized in that the preprocessing further comprises:
converting full-width digit or letter characters to half-width characters; the dictionary used in the preprocessing being stored in a double-array Trie data structure.
5. The system according to claim 1, characterized in that the addresses of the prompt list are arranged in descending order of closeness to the standard address.
6. The system according to claim 1, characterized in that standardizing the communication address to be queried comprises the following steps:
obtaining the communication address to be queried, as confirmed by the user, and preprocessing it;
segmenting the address to obtain the address metadata, and labeling all address levels;
obtaining the final place-name entity label sequence through place-name entity recognition, and generating a Query statement;
parsing the Query statement and searching the index file for matches to obtain the closest communication address;
performing address completion to generate the standardized communication address, and returning the postal code corresponding to that standardized communication address.
7. The system according to claim 1, characterized in that the corresponding postal code is determined according to the lowest address level value of the labeled address.
8. The system according to claim 6, characterized in that returning the postal code corresponding to the standardized communication address may further comprise: upon selection of a confirmed postal code query result, the user may obtain a map location; or the postal code query result is sent to a mobile terminal device by means of a two-dimensional code.
9. The system according to claim 3 or 6, characterized in that the address segmentation uses a bigram-model word segmentation method; and the named entity recognition technique identifies the most probable address level of each place-name metadata element in the place-name entity labeling result.
10. A terminal for querying a postal code by communication address, characterized in that the terminal comprises a user input prompting unit and a postal code query unit; the user input prompting unit is configured to prompt the user's input in real time and to receive the communication address to be queried as finally confirmed by the user; the postal code query unit is configured to retrieve the standardized communication address closest to the communication address to be queried, and to receive the postal code corresponding to that standardized communication address.
PCT/CN2014/084607 2013-08-27 2014-08-18 System and terminal for querying mailing address postal codes WO2015027835A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310377867.4 2013-08-27
CN201310377867.4A CN103440312B (en) 2013-08-27 2013-08-27 A kind of system and terminal of mailing address inquiry postcode

Publications (1)

Publication Number Publication Date
WO2015027835A1 true WO2015027835A1 (en) 2015-03-05

Family

ID=49694005

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/084607 WO2015027835A1 (en) 2013-08-27 2014-08-18 System and terminal for querying mailing address postal codes

Country Status (2)

Country Link
CN (1) CN103440312B (en)
WO (1) WO2015027835A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440312B (en) * 2013-08-27 2019-01-22 深圳市华傲数据技术有限公司 A kind of system and terminal of mailing address inquiry postcode
CN103473289A (en) * 2013-08-30 2013-12-25 深圳市华傲数据技术有限公司 Device and method for completing communication addresses
CN103914569B (en) * 2014-04-24 2018-09-07 百度在线网络技术(北京)有限公司 Input creation method, the device of reminding method, device and dictionary tree-model
CN104156415B (en) * 2014-07-31 2017-04-12 沈阳锐易特软件技术有限公司 Mapping processing system and method for solving problem of standard code control of medical data
CN104200369B (en) * 2014-08-27 2019-12-31 北京京东尚科信息技术有限公司 Method and device for determining commodity distribution range
CN106326233B (en) * 2015-06-18 2019-10-11 菜鸟智能物流控股有限公司 address prompting method and device
CN105069056B (en) * 2015-07-24 2018-02-06 湖北文理学院 Identity certificate address information analytic method and system based on string matching
CN106469372B (en) * 2015-08-14 2020-06-12 菜鸟智能物流控股有限公司 Address mapping method and device
CN105224522A (en) * 2015-09-29 2016-01-06 小米科技有限责任公司 Geographical location information recognition methods and device
CN105653060A (en) * 2015-12-30 2016-06-08 浙江慧脑信息科技有限公司 Multi-functional address input method
CN107025232A (en) * 2016-01-29 2017-08-08 阿里巴巴集团控股有限公司 The processing method and processing device of address information in logistics system
CN105975099B (en) * 2016-04-28 2020-02-04 百度在线网络技术(北京)有限公司 Input method implementation method and device
CN106055650A (en) * 2016-05-31 2016-10-26 深圳市永兴元科技有限公司 Address standardization method and device
CN106777377A (en) * 2017-02-09 2017-05-31 辛国臣 Logistics odd numbers generation method and device
CN108256718B (en) * 2017-05-04 2022-04-29 平安科技(深圳)有限公司 Policy service task allocation method and device, computer equipment and storage equipment
CN109033225A (en) * 2018-06-29 2018-12-18 福州大学 Chinese address identifying system
CN109344254B (en) * 2018-09-20 2020-12-18 鼎富智能科技有限公司 Address information classification method and device
CN110334162B (en) * 2019-05-09 2021-11-09 德邦物流股份有限公司 Address recognition method and device
CN112100161B (en) * 2019-09-17 2021-05-28 上海寻梦信息技术有限公司 Data processing method and system, electronic device and storage medium
CN110688851B (en) * 2019-09-26 2023-07-28 亿企赢网络科技有限公司 Method, device and medium for extracting key information of address text
CN112528174A (en) * 2020-11-27 2021-03-19 暨南大学 Address finishing and complementing method based on knowledge graph and multiple matching and application
CN113569564B (en) * 2021-07-30 2024-03-19 拉扎斯网络科技(上海)有限公司 Address information processing and displaying method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339638A (en) * 2007-07-03 2009-01-07 周磊 Method and system for automatic matching of commercial articles dispensing scope and goods receiving address for ordering platform
CN102737060A (en) * 2011-04-14 2012-10-17 商业对象软件有限公司 Fuzzy search in geocoding application
CN103440312A (en) * 2013-08-27 2013-12-11 深圳市华傲数据技术有限公司 System and terminal for inquiring zip code for mailing address

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955833B (en) * 2011-08-31 2015-11-25 深圳市华傲数据技术有限公司 A kind of address identification, standardized method

Also Published As

Publication number Publication date
CN103440312B (en) 2019-01-22
CN103440312A (en) 2013-12-11

Similar Documents

Publication Publication Date Title
WO2015027835A1 (en) System and terminal for querying mailing address postal codes
WO2015027836A1 (en) Method and system for place name entity recognition
CN102955833B (en) A kind of address identification, standardized method
CN110909170B (en) Interest point knowledge graph construction method and device, electronic equipment and storage medium
CN105069056B (en) Identity certificate address information analytic method and system based on string matching
CN102955832B (en) A kind of address identification, standardized system
US20170116224A1 (en) Address Search Method and Device
CN106874287B (en) Method and device for processing POI address codes
WO2015027837A1 (en) Device and method for mailing address completion
CN104021198B (en) The relational database information search method and device indexed based on Ontology
CN102722525A (en) Methods and systems for establishing language model of address book names and searching voice
CN106777118B (en) A kind of quick abstracting method of geographical vocabulary based on fuzzy dictionary tree
CN106886565B (en) Automatic polymerization method for foundation house type
CN111104802A (en) Method for extracting address information text and related equipment
CN103577548A (en) Method and device for matching characters with close pronunciation
CN115630648A (en) Address element analysis method and system for man-machine conversation and computer readable medium
CN112948717B (en) Massive space POI searching method and system based on multi-factor constraint
CN116414824A (en) Administrative division information identification and standardization processing method, device and storage medium
CN111738008B (en) Entity identification method, device and equipment based on multilayer model and storage medium
CN112417812B (en) Address standardization method and system and electronic equipment
CN114398886A (en) Address extraction and standardization method based on pre-training
CN114792091A (en) Chinese address element analysis method and equipment based on vocabulary enhancement and storage medium
CN111401051B (en) Express information analysis method and system
CN113535883A (en) Business place entity linking method, system, electronic device and storage medium
CN116821271B (en) Address recognition and normalization method and system based on voice-shape code

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14839043

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14839043

Country of ref document: EP

Kind code of ref document: A1