WO2016165538A1 - Address data management method and device - Google Patents

Address data management method and device

Info

Publication number
WO2016165538A1
WO2016165538A1 PCT/CN2016/077297 CN2016077297W WO2016165538A1 WO 2016165538 A1 WO2016165538 A1 WO 2016165538A1 CN 2016077297 W CN2016077297 W CN 2016077297W WO 2016165538 A1 WO2016165538 A1 WO 2016165538A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
address data
data
types
structured
Prior art date
Application number
PCT/CN2016/077297
Other languages
French (fr)
Chinese (zh)
Inventor
吴保华
Original Assignee
阿里巴巴集团控股有限公司
吴保华
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 and 吴保华
Publication of WO2016165538A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases

Definitions

  • the present application relates to the field of communications technologies, and in particular, to a method and an apparatus for managing address data.
  • a large number of text addresses are generated in e-commerce websites and logistics systems, and the input format and address elements of these text addresses vary from user to user.
  • the text address input by the user A only includes the house number information
  • the text address input by the user B includes only POI (Point of Interest) information
  • the text address input by the user C includes the wrong district or house number information.
  • These text addresses lack normalization and standardization, so the similarities and differences between different text addresses cannot be judged, and the attribution of a text address cannot be identified.
  • the address element refers to all levels of elements in the text address, such as provinces, cities, districts, development zones, towns, roads, POIs, and so on.
  • the POI can be a house, a shop, a mail box, a bus stop, and the like.
  • the embodiment of the present application provides a method and an apparatus for managing address data to generate normalized and standardized address data, thereby solving the problem that the text address cannot be normalized.
  • An embodiment of the present application provides a method for managing address data, where the method includes the following steps:
  • the address management device obtains original address data input by the user
  • the address management device determines a structured address format including a plurality of address types
  • the address management device converts the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to a plurality of address types.
  • the address management device converts the original address data into structured address data that conforms to the structured address format, and specifically includes:
  • the address management device preprocesses original address data based on multiple address types
  • the address management apparatus performs segmentation on the preprocessed address data based on a plurality of address types
  • the address management device performs a completion check on the segmented address data based on the plurality of address types
  • the address management device normalizes the completion-checked address data to obtain structured address data conforming to the structured address format.
  • the process of the pre-processing of the original address data by the address management device based on the multiple address types includes:
  • the address management device filters out, from the original address data, address data that does not correspond to the multiple address types, and deletes the filtered-out address data from the original address data; it also converts address data in a non-canonical format that exists in the original address data into address data in a canonical format.
  • the process of the segmentation of the pre-processed address data by the address management device based on the multiple address types includes:
  • the address management device obtains the word segmentation dictionaries corresponding to the plurality of address types, and uses these dictionaries to segment the preprocessed address data into address data corresponding to the plurality of address types.
  • the process of the address management device performing the completion check on the segmented address data based on the multiple address types specifically includes: the address management device checks whether the segmented address data includes address data corresponding to all of the multiple address types; if not, it determines the address types not included in the segmented address data and completes the address data of those address types based on historical data.
  • the process of normalizing the completion-checked address data by the address management apparatus includes: the address management apparatus normalizes the completion-checked address data by using a pinyin similarity algorithm; and/or the address management apparatus normalizes the completion-checked address data by using a POI normalization algorithm based on a probabilistic retrieval model.
  • the embodiment of the present application provides an address management apparatus, where the address management apparatus specifically includes:
  • a determining module for determining a structured address format including a plurality of address types
  • a processing module configured to convert the original address data into structured address data conforming to the structured address format, where the structured address data includes address data corresponding to multiple address types.
  • the processing module includes: a pre-processing sub-module, configured to pre-process the original address data based on the plurality of address types; a segmentation sub-module, configured to segment the pre-processed address data based on the plurality of address types; a completion sub-module, configured to perform a completion check on the segmented address data based on the plurality of address types; and a normalization sub-module, configured to normalize the completion-checked address data to obtain structured address data conforming to the structured address format.
  • the pre-processing sub-module is specifically configured to filter out, from the original address data, address data that does not correspond to the multiple address types, delete the filtered-out address data from the original address data, and convert address data in a non-canonical format that exists in the original address data into address data in a canonical format.
  • the segmentation sub-module is specifically configured to obtain the word segmentation dictionaries corresponding to the plurality of address types, and use these dictionaries to segment the pre-processed address data into address data corresponding to the plurality of address types.
  • the completion sub-module is specifically configured to check whether the segmented address data includes address data corresponding to all of the multiple address types; if not, to determine the address types not included in the segmented address data, and to complete the address data of those address types based on historical data.
  • the normalization sub-module is specifically configured to normalize the completion-checked address data by using a pinyin similarity algorithm; and/or to normalize the completion-checked address data by using a POI normalization algorithm based on a probabilistic retrieval model.
  • the embodiment of the present application has at least the following advantages: by setting a structured address format that includes multiple address types and generating structured address data conforming to that format, normalized and standardized address data is generated. This solves the problem that text addresses cannot be normalized, makes it possible to determine the similarities and differences between different text addresses, and makes it possible to identify the attribution of a text address. Specifically, address data is identified and extracted from the massive set of historical text addresses, and the knowledge and rules relating the address data are learned; the learned knowledge and rules are then used to complete missing address data, check erroneous address data, and normalize non-canonical address data, so that hierarchical structured address data is regenerated.
  • FIG. 1 is a schematic flowchart of a method for managing address data according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic structural diagram of an address management apparatus according to Embodiment 2 of the present application.
  • the first embodiment of the present application provides a method for managing address data.
  • the method for managing the address data may specifically include the following steps:
  • Step 101: the address management device obtains original address data input by the user.
  • an integration module may be configured in the address management apparatus; the integration module integrates the address data sources of all parties, generates a unique key, and loads the data into a text address library.
  • the address data stored under a key in the text address library is the original address data input by the user.
  • Step 102: the address management apparatus determines a structured address format including a plurality of address types.
  • the multiple address types included in the structured address format include, but are not limited to, one or any combination of the following: province, city, district, county, township (street office), development zone, main road, main road number, branch road, branch road number, landmark POI (real estate, etc.), building, unit (floor), room number, etc.
  • Step 103: the address management apparatus converts the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to a plurality of address types.
  • the structured address data generated by the address management device may include address data corresponding to the province, the city, the district, the township (street office), the development zone, the main road, the main road number, the branch road, the branch road number, the landmark POI (real estate, etc.), the building, the unit (floor), the room number, and the like.
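  • The structured address format described above can be pictured as a record with one slot per address type. The following is a minimal sketch under that assumption; the class name `StructuredAddress`, the field names, and the `missing_types` helper are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class StructuredAddress:
    """Hypothetical record: one slot per address type; None means 'absent'."""
    province: Optional[str] = None
    city: Optional[str] = None
    district: Optional[str] = None            # district or county
    township: Optional[str] = None            # township (street office)
    development_zone: Optional[str] = None
    main_road: Optional[str] = None
    main_road_number: Optional[str] = None
    branch_road: Optional[str] = None
    branch_road_number: Optional[str] = None
    landmark_poi: Optional[str] = None        # landmark POI (real estate, etc.)
    building: Optional[str] = None
    unit: Optional[str] = None                # unit (floor)
    room_number: Optional[str] = None

    def missing_types(self) -> list[str]:
        """Names of the address types that have no value yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Only province and city are known; every other address type is still missing.
addr = StructuredAddress(province="Hebei", city="Baoding")
print(addr.missing_types())
```

  • Keeping every address type as an explicit optional field makes the later completion check a simple scan for empty slots.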
  • the process by which the address management apparatus converts the original address data into structured address data conforming to the structured address format includes, but is not limited to, the following: the address management apparatus pre-processes the original address data based on the multiple address types; it then segments the pre-processed address data based on the multiple address types; it then performs a completion check on the segmented address data based on the multiple address types; finally, it normalizes the completion-checked address data to obtain structured address data conforming to the structured address format.
  • the process by which the address management device pre-processes the original address data based on the multiple address types includes: the address management device filters out, from the original address data, address data that does not correspond to the multiple address types, deletes the filtered-out address data from the original address data, and converts address data in a non-canonical format that exists in the original address data into address data in a canonical format.
  • a pre-processing module may be configured in the address management apparatus; the pre-processing module filters out, from the original address data, address data that does not correspond to the multiple address types and deletes it from the original address data. Further, the pre-processing module converts address data in a non-canonical format that exists in the original address data into address data in a canonical format.
  • the original address data may include address data corresponding to the address types, such as Hebei Province and Baoding City, and may also include data that does not correspond to any address type, such as fee recharge information and virtual game card information; data that does not correspond to the address types needs to be cleaned. For this reason, the pre-processing module filters such data out of the original address data and deletes it.
  • address data in a non-canonical format may exist in the original address data.
  • the pre-processing module converts the address data in a non-canonical format that exists in the original address data into address data in a canonical format.
  • the canonical format includes, but is not limited to, the following: full-width English letters and digits are converted to half-width; addresses in mainland China are written in Simplified Chinese; addresses in Hong Kong, Macao and Taiwan are written in Traditional Chinese; road names are written in Chinese; house numbers, room numbers and the like are written as digits.
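  • One of the canonical-format rules above, converting full-width letters and digits to half-width, can be sketched as follows. The function name `to_halfwidth` is hypothetical, and the sketch converts the whole full-width ASCII range (including punctuation), which is an assumption that goes slightly beyond the letters and digits named in the text.

```python
def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) to their half-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                  # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:      # full-width letters, digits, punctuation
            out.append(chr(code - 0xFEE0))  # fixed offset to the ASCII code point
        else:
            out.append(ch)                  # Chinese characters etc. are untouched
    return "".join(out)


# Full-width "A", "1", "2", "3" become half-width; Chinese characters are kept.
print(to_halfwidth("Ａ座１２３室"))
```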
  • the process by which the address management device segments the pre-processed address data based on the multiple address types includes, but is not limited to, the following: the address management device obtains the word segmentation dictionaries corresponding to the multiple address types, and uses these dictionaries to segment the pre-processed address data into address data corresponding to the multiple address types.
  • for example, the address management apparatus may segment the pre-processed address data into address data corresponding to the province, the city, the district, the township (street office), the development zone, the main road, the main road number, the branch road, the branch road number, the landmark POI (real estate, etc.), the building, the unit (floor), the room number, and the like.
  • a segmentation module may be configured in the address management device; the segmentation module obtains the word segmentation dictionaries corresponding to the multiple address types and uses them to segment the pre-processed address data into address data corresponding to the multiple address types.
  • the word segmentation dictionaries include, but are not limited to: a province, city, district and county dictionary; a township dictionary; an industrial zone dictionary; a village dictionary; a street dictionary; a university dictionary; a community standard dictionary; and a community self-learning dictionary.
  • the corresponding segmentation algorithm is specifically a forward finite-state maximum matching algorithm.
  • the segmentation rules are keyword-based, using keywords such as: town, street, road, company, building, middle school, house number, and the detailed address within a community (building, unit, room number).
  • the corresponding segmentation process specifically includes the following. Province, city and district segmentation: a word segmenter initialized from the province, city and district dictionary segments the detailed address; if the province, city and district obtained by segmentation differ from the original province, city and district fields, the original fields are replaced, which reduces subsequent segmentation errors, and the remaining address is retained.
  • Township (industrial zone) segmentation: word segmenters initialized from the township (industrial zone) dictionary (362 in total, one per city) segment the remaining address; if that segmentation fails, the detailed address is segmented instead; if that also fails, the address is segmented by the township rules and marked for subsequent processing.
  • Road segmentation: similar to the township (industrial zone) segmentation process, except that the township dictionary is used to initialize 362 road segmenters.
  • House number segmentation: segmentation is performed using the corresponding segmentation rules.
  • Community (real estate) segmentation: community word segmenters initialized from the community dictionary (362 in total, one per city) segment the remaining address from the previous step; if that segmentation fails, the detailed address is segmented instead; if several community elements are obtained, the one with the greatest string length is taken as the community element; if segmentation still fails, the self-learning dictionary segmenter segments the detailed address; if that still fails, the community rules are used. Community elements obtained from the self-learning dictionary or from the community rules are marked for subsequent processing.
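  • The dictionary-driven segmentation above can be illustrated with plain forward maximum matching. This is a simplified sketch, not the finite-state variant the text names: it uses a set lookup instead of a state machine, and the function name `forward_max_match`, the sample dictionary and the sample address are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Scan left to right, always taking the longest dictionary entry that matches."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest window first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]        # unknown character: emit it as its own token
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical dictionary; "杭州" is present too, but the longer "杭州市" wins.
sample_dictionary = {"浙江省", "杭州市", "杭州", "西湖区"}
print(forward_max_match("浙江省杭州市西湖区", sample_dictionary))
```

  • In the process described above, one such segmenter would be initialized per dictionary (for example per city), and failures would fall back to rule-based segmentation.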
  • the process by which the address management apparatus performs the completion check on the segmented address data based on the multiple address types includes, but is not limited to, the following: the address management apparatus checks whether the segmented address data includes address data corresponding to all of the multiple address types; if not, it determines the address types not included in the segmented address data and completes the address data of those address types based on historical data; if so, no address data needs to be completed.
  • for example, based on the multiple address types, the address management device segments out address data corresponding to the province, the district/county, the development zone, the main road, the main road number, the branch road and the branch road number.
  • the address management apparatus then determines that the segmented address data does not include address data corresponding to all of the multiple address types, and completes, based on historical data, the address data corresponding to the city, the township (street office), the landmark POI (real estate, etc.), the building and the room number.
  • a completion check module may be configured in the address management apparatus; the completion check module checks whether the segmented address data includes address data corresponding to all of the multiple address types; if not, it determines the address types not included in the segmented address data and completes the address data of those address types based on historical data; if so, no address data needs to be completed.
  • during address data processing, the completion check module handles the above situations and completes and corrects the house number or community field of the segmented address data.
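  • The completion check can be sketched as "find the empty address types, then fill each one from history". The text only says that completion is based on historical data, so the voting rule below (take the most common value among historical records that agree with every field already present) is an assumption, as are the function name `complete_address` and the sample records.

```python
from collections import Counter


def complete_address(segmented, required_types, history):
    """Return (completed record, list of address types that were missing)."""
    result = dict(segmented)
    missing = [t for t in required_types if not result.get(t)]
    for address_type in missing:
        # Vote among historical records that agree with every known field.
        votes = Counter(
            record[address_type]
            for record in history
            if record.get(address_type)
            and all(record.get(k) == v for k, v in segmented.items() if v)
        )
        if votes:
            result[address_type] = votes.most_common(1)[0][0]
    return result, missing


# Hypothetical history and input; the city is missing from the segmented address.
history = [
    {"province": "Hebei", "city": "Baoding", "road": "Example Road"},
    {"province": "Hebei", "city": "Baoding", "road": "Example Road"},
    {"province": "Hebei", "city": "Shijiazhuang", "road": "Other Road"},
]
completed, missing = complete_address(
    {"province": "Hebei", "road": "Example Road"},
    ["province", "city", "road"],
    history,
)
print(missing, completed["city"])
```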
  • each address in the structured address standard library can be structured, using the corresponding segmentation algorithm, as: city + district + road + house number + community.
  • the frequency of the full address is computed over all five of these fields.
  • addresses with a frequency greater than 3 are selected. The usage frequency of each community under the same city + district + road + house number is counted, and the city + district/county + road + house number + community combination with the highest frequency is retained and added to the structured address standard library.
  • alternatively, each address in the structured address standard library can be structured, using the corresponding segmentation algorithm, as: city + road + house number + community.
  • the frequency of the full address is computed over all four of these fields. Addresses with a frequency greater than or equal to 1 are selected. The usage frequency of each community under the same city + road + house number is counted, and the city + road + house number + community combination with the highest frequency is retained and added to the structured address standard library.
  • in the structured address standard library, it is assumed that there is only one house number under a given city + district/county + road + community. For each structured address, if the house number is empty, or was obtained by rule-based segmentation or by the self-learning dictionary segmenter, the house number can be looked up in the structured address standard library by city + district/county + road + community, and the house number field is completed or corrected accordingly.
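  • The frequency statistics used to build the standard library (the five-field case above) can be sketched with counters. The function name `build_standard_library`, the tuple layout and the sample data are hypothetical, and "filter" is read here as "keep addresses seen more than 3 times", which is an interpretation of the text.

```python
from collections import Counter, defaultdict


def build_standard_library(addresses, min_frequency=3):
    """addresses: iterable of (city, district, road, house_number, community)."""
    full_counts = Counter(addresses)            # frequency of each full address
    by_prefix = defaultdict(Counter)
    for address, count in full_counts.items():
        if count > min_frequency:               # keep only frequently seen addresses
            *prefix, community = address
            by_prefix[tuple(prefix)][community] = count
    # For each city + district + road + house number, keep the commonest community.
    return {
        prefix: communities.most_common(1)[0][0]
        for prefix, communities in by_prefix.items()
    }


# Hypothetical observations: "Garden A" dominates house number 1.
observed = (
    [("C", "D", "R", "1", "Garden A")] * 5
    + [("C", "D", "R", "1", "Garden B")] * 4
    + [("C", "D", "R", "2", "Garden C")] * 2    # seen too rarely, dropped
)
library = build_standard_library(observed)
print(library)
```

  • The reverse lookup described for house numbers would use the same idea with city + district/county + road + community as the key.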
  • the process by which the address management apparatus normalizes the completion-checked address data includes, but is not limited to, the following: the address management apparatus normalizes the completion-checked address data by using a pinyin similarity algorithm; and/or the address management apparatus normalizes the completion-checked address data by using a POI normalization algorithm based on a probabilistic retrieval model.
  • a normalization module may be configured in the address management apparatus; the normalization module normalizes the completion-checked address data by using the pinyin similarity algorithm, and/or normalizes the completion-checked address data by using the POI normalization algorithm based on a probabilistic retrieval model.
  • the address data filled in by users contains many non-standard forms, such as short names, abbreviations, typos and homophones.
  • for example, the standard address data is West Lake International Technology Building, and the non-standard address data is West Lake International (short name);
  • the standard address data is the First Affiliated Hospital of Zhejiang University, and the non-standard address data is Zhejiang University First Affiliated Hospital (abbreviation);
  • the standard address data is Gudun Road, and the non-standard address data is Guteng Road (homophone);
  • the standard address data is Baoshu Road, and the non-standard address data is Baojiao Road (typo).
  • the normalization module normalizes the non-standard address data by methods including, but not limited to: a pinyin similarity algorithm and a POI normalization algorithm based on a probabilistic retrieval model.
  • with the pinyin similarity algorithm, the normalization module converts the non-standard address data and the standard address data into pinyin, computes a similarity distance (such as the minimum edit distance), and takes the standard address data whose similarity exceeds a threshold and is the highest as the standard form of the non-standard address data.
  • with the POI normalization algorithm, the normalization module splits the POI to be normalized into bigrams, and then, for each bigram that appears in both that POI and a candidate standard POI, accumulates the estimate of that bigram.
  • the sum of the bigram estimates measures the correlation between the candidate standard POI and the POI to be normalized.
  • the correlation scores of the candidate POIs are calculated and sorted in descending order, and the highest-scoring POI whose POI type and district match the address type and district of the address is selected as the standard POI.
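  • The pinyin similarity step can be sketched with a minimum edit distance over pinyin strings. The sketch assumes the pinyin conversion has already been done elsewhere; the pinyin spellings, the `1 - distance / length` similarity, the 0.6 threshold and all function names are illustrative assumptions, not values from the application.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the edit distance into a score between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def normalize_by_pinyin(query_pinyin, standard, threshold=0.6):
    """standard maps pinyin -> standard name; return the best match above threshold."""
    best_name, best_score = None, threshold
    for pinyin, name in standard.items():
        score = similarity(query_pinyin, pinyin)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical pinyin spellings for the road names used as examples in the text.
standard_roads = {"gudunlu": "Gudun Road", "baoshulu": "Baoshu Road"}
print(normalize_by_pinyin("gutenglu", standard_roads))
```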
  • S: the correlation score of the candidate POI
  • N: the number of POIs in one city or district
  • R: the number of related POIs, that is, POIs having two identical bigrams and a Jaccard similarity coefficient greater than 0.4
  • n_i: the number of POIs containing the bigram b_i
  • dl: the number of bigrams in the current candidate standard POI
  • avdl: the average number of bigrams in each candidate standard POI
  • r_i: the number of related POIs among the n_i POIs
  • index_i: the position at which b_i appears in the current POI
  • avgindex_i: the average position at which b_i appears in the POIs that contain it
  • k, b: freely adjustable parameters, set empirically to k = 1.2 and b = 0.75
  • K, I: temporary variables in the formula.
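  • The definition of a related POI (two identical bigrams and a Jaccard coefficient above 0.4) can be sketched on its own. This covers only the bigram split and the relatedness test, not the scoring formula for S, which is not reproduced in this text; the function names and the Chinese strings (an assumed rendering of the West Lake International example) are hypothetical.

```python
def bigrams(text: str) -> set:
    """All overlapping two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity coefficient of the two bigram sets."""
    set_a, set_b = bigrams(a), bigrams(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def is_related(a: str, b: str, min_shared=2, min_jaccard=0.4) -> bool:
    """Related POIs share at least two bigrams and exceed the Jaccard threshold."""
    shared = bigrams(a) & bigrams(b)
    return len(shared) >= min_shared and jaccard(a, b) > min_jaccard


standard_poi = "西湖国际科技大厦"   # assumed Chinese form of the standard name
short_poi = "西湖国际"             # assumed Chinese form of the short name
print(jaccard(standard_poi, short_poi), is_related(standard_poi, short_poi))
```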
  • the embodiment of the present application has at least the following advantages: by setting a structured address format that includes multiple address types and generating structured address data conforming to that format, normalized and standardized address data is generated. This solves the problem that text addresses cannot be normalized, makes it possible to determine the similarities and differences between different text addresses, and makes it possible to identify the attribution of a text address. Specifically, address data is identified and extracted from the massive set of historical text addresses, and the knowledge and rules relating the address data are learned; the learned knowledge and rules are then used to complete missing address data, check erroneous address data, and normalize non-canonical address data, so that hierarchical structured address data is regenerated.
  • an address management apparatus is further provided in the embodiment of the present application. As shown in FIG. 2, the address management apparatus specifically includes:
  • a determining module 12 configured to determine a structured address format including a plurality of address types
  • the processing module 13 is configured to convert the original address data into structured address data conforming to the structured address format, where the structured address data includes address data corresponding to multiple address types.
  • the processing module 13 specifically includes: a pre-processing sub-module 131, configured to pre-process the original address data based on the multiple address types; a segmentation sub-module 132, configured to segment the pre-processed address data based on the multiple address types; a completion sub-module 133, configured to perform a completion check on the segmented address data based on the multiple address types; and a normalization sub-module 134, configured to normalize the completion-checked address data to obtain structured address data conforming to the structured address format.
  • the pre-processing sub-module 131 is specifically configured to filter out, from the original address data, address data that does not correspond to the multiple address types, delete the filtered-out address data from the original address data, and convert address data in a non-canonical format that exists in the original address data into address data in a canonical format.
  • the segmentation sub-module 132 is specifically configured to obtain the word segmentation dictionaries corresponding to the multiple address types, and use these dictionaries to segment the pre-processed address data into address data corresponding to the multiple address types.
  • the completion sub-module 133 is specifically configured to check whether the segmented address data includes address data corresponding to all of the multiple address types; if not, to determine the address types not included in the segmented address data, and to complete the address data of those address types based on historical data.
  • the normalization sub-module 134 is specifically configured to normalize the completion-checked address data by using a pinyin similarity algorithm; and/or to normalize the completion-checked address data by using a POI normalization algorithm based on a probabilistic retrieval model.
  • the modules of the device of the present application may be integrated into one or may be deployed separately.
  • the above modules can be combined into one module, or can be further split into multiple sub-modules.
  • the modules of the apparatus in an embodiment may be distributed in the apparatus as described in the embodiment, or may, with corresponding changes, be located in one or more apparatuses different from that of the embodiment.
  • the modules of the above embodiments may be combined into one module, or may be further split into multiple sub-modules.
  • the serial numbers of the embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
  • The above disclosure covers only a few specific embodiments of the present application; the present application is not limited to them, and any changes that those skilled in the art can conceive of shall fall within the protection scope of the present application.

Abstract

Disclosed are an address data management method and device. The method comprises: an address management device acquires original address data input by a user; the address management device determines a structured address format comprising multiple address types; and the address management device converts the original address data into structured address data satisfying the structured address format, the structured address data comprising address data corresponding to the multiple address types. In embodiments of the present application, by setting a structured address format comprising multiple address types and generating structured address data satisfying the structured address format, normalized and standardized address data is generated, the problem of failing to normalize text addresses is solved, similarities and differences among different text addresses can be determined, and the attribution of the text addresses can be identified.

Description

一种地址数据的管理方法和装置Method and device for managing address data 技术领域Technical field
本申请涉及通信技术领域,尤其涉及一种地址数据的管理方法和装置。The present application relates to the field of communications technologies, and in particular, to a method and an apparatus for managing address data.
背景技术Background technique
在电子商务网站和物流系统中产生了大量文本地址,这些文本地址的输入格式和地址元素因用户而不同。例如,用户A输入的文本地址只包括门牌号信息,用户B输入的文本地址只包括POI(Point of Interest,兴趣点)信息,用户C输入的文本地址包括错误的区县或门牌号信息。这些文本地址缺乏规范化、标准化,无法判断不同文本地址间的异同性,无法识别文本地址的相关归属。其中,地址元素是指文本地址中的各级元素,如省、市、区、开发区、镇、路、POI等。POI可以是一栋房子、一个商铺、一个邮筒、一个公交站等。A large number of text addresses are generated in e-commerce websites and logistics systems, and the input format and address elements of these text addresses vary from user to user. For example, the text address input by the user A only includes the house number information, and the text address input by the user B includes only POI (Point of Interest) information, and the text address input by the user C includes the wrong district or house number information. These text addresses lack standardization and standardization, and it is impossible to judge the similarities and differences between different text addresses, and the related attribution of text addresses cannot be recognized. Among them, the address element refers to all levels of elements in the text address, such as provinces, cities, districts, development zones, towns, roads, POIs, and so on. The POI can be a house, a shop, a mail box, a bus stop, and the like.
Summary of the invention
The embodiments of the present application provide a method and a device for managing address data, so as to generate normalized and standardized address data and thereby solve the problem that text addresses cannot be normalized.
An embodiment of the present application provides a method for managing address data, the method including the following steps:
an address management device obtains original address data input by a user;
the address management device determines a structured address format including a plurality of address types;
the address management device converts the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to the plurality of address types.
The address management device converting the original address data into structured address data conforming to the structured address format specifically includes:
the address management device preprocesses the original address data based on the plurality of address types;
the address management device segments the preprocessed address data based on the plurality of address types;
the address management device performs completion and verification on the segmented address data based on the plurality of address types;
the address management device normalizes the completed and verified address data to obtain structured address data conforming to the structured address format.
The process in which the address management device preprocesses the original address data based on the plurality of address types specifically includes:
the address management device filters out, from the original address data, address data that does not correspond to any of the plurality of address types, deletes the filtered-out address data from the original address data, and converts address data in a non-canonical format in the original address data into address data in a canonical format.
The process in which the address management device segments the preprocessed address data based on the plurality of address types specifically includes:
the address management device obtains word segmenter dictionaries corresponding to the plurality of address types, and uses these dictionaries to segment out the address data corresponding to the plurality of address types.
The process in which the address management device performs completion and verification on the segmented address data based on the plurality of address types specifically includes:
the address management device checks whether the segmented address data already contains address data corresponding to the plurality of address types; if not, the address management device determines the address type that the segmented address data does not contain, and completes the address data of that address type based on historical data.
The process in which the address management device normalizes the completed and verified address data specifically includes: the address management device normalizes the completed and verified address data using a pinyin similarity algorithm; and/or the address management device normalizes the completed and verified address data using a point of interest (POI) normalization algorithm based on a probabilistic retrieval model.
An embodiment of the present application provides an address management device, the address management device specifically including:
an obtaining module, configured to obtain original address data input by a user;
a determining module, configured to determine a structured address format including a plurality of address types;
a processing module, configured to convert the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to the plurality of address types.
The processing module includes: a preprocessing sub-module, configured to preprocess the original address data based on the plurality of address types; a segmentation sub-module, configured to segment the preprocessed address data based on the plurality of address types; a completion sub-module, configured to perform completion and verification on the segmented address data based on the plurality of address types; and a normalization sub-module, configured to normalize the completed and verified address data to obtain structured address data conforming to the structured address format.
The preprocessing sub-module is specifically configured to filter out, from the original address data, address data that does not correspond to any of the plurality of address types, delete the filtered-out address data from the original address data, and convert address data in a non-canonical format in the original address data into address data in a canonical format.
The segmentation sub-module is specifically configured to obtain word segmenter dictionaries corresponding to the plurality of address types, and use these dictionaries to segment out the address data corresponding to the plurality of address types.
The completion sub-module is specifically configured to check whether the segmented address data already contains address data corresponding to the plurality of address types; if not, determine the address type that the segmented address data does not contain, and complete the address data of that address type based on historical data.
The normalization sub-module is specifically configured to normalize the completed and verified address data using a pinyin similarity algorithm; and/or normalize the completed and verified address data using a point of interest (POI) normalization algorithm based on a probabilistic retrieval model.
Compared with the prior art, the embodiments of the present application have at least the following advantages: by setting a structured address format including a plurality of address types and generating structured address data conforming to that format, normalized and standardized address data is generated, the problem that text addresses cannot be normalized is solved, the similarities and differences between different text addresses can be determined, and the attribution of a text address can be identified. Specifically, the address data in a massive set of historical text addresses is identified and extracted, and knowledge and rules relating the address data are learned from it. The learned knowledge and rules are then used to complete omitted address data, verify incorrect address data, and normalize non-canonical address data, so that a hierarchical piece of structured address data is regenerated.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for managing address data according to Embodiment 1 of the present application;
FIG. 2 is a schematic structural diagram of an address management device according to Embodiment 2 of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Embodiment 1
In view of the problems in the prior art, Embodiment 1 of the present application provides a method for managing address data. As shown in FIG. 1, the method may specifically include the following steps:
Step 101: the address management device obtains original address data input by a user.
In this embodiment of the present application, an integration module may be configured in the address management device. The integration module integrates the address data sources of all parties, generates a unique key for each entry, and loads the data into a text address library. The address data stored for one key in the text address library is the original address data input by the user.
Step 102: the address management device determines a structured address format including a plurality of address types.
The plurality of address types included in the structured address format include, but are not limited to, one or any combination of the following: province, city, district/county, township (subdistrict office), development zone, main road, main road house number, branch road, branch road house number, landmark POI (housing estate, etc.), building, unit (floor), and room number.
Step 103: the address management device converts the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to the plurality of address types.
For example, the structured address data generated by the address management device in the structured address format may include address data corresponding to the province, the city, the district/county, the township (subdistrict office), the development zone, the main road, the main road house number, the branch road, the branch road house number, the landmark POI (housing estate, etc.), the building, the unit (floor), and the room number.
In this embodiment of the present application, the process in which the address management device converts the original address data into structured address data conforming to the structured address format includes, but is not limited to, the following: the address management device preprocesses the original address data based on the plurality of address types; it then segments the preprocessed address data based on the plurality of address types; it then performs completion and verification on the segmented address data based on the plurality of address types; finally, it normalizes the completed and verified address data to obtain structured address data conforming to the structured address format.
In this embodiment of the present application, the process in which the address management device preprocesses the original address data based on the plurality of address types specifically includes: the address management device filters out, from the original address data, address data that does not correspond to any of the plurality of address types, deletes the filtered-out address data from the original address data, and converts address data in a non-canonical format in the original address data into address data in a canonical format.
In this embodiment of the present application, a preprocessing module may be configured in the address management device. The preprocessing module filters out, from the original address data, address data that does not correspond to any of the plurality of address types and deletes it from the original address data. The preprocessing module further converts address data in a non-canonical format in the original address data into address data in a canonical format.
Because the original address data is filled in by the user and is therefore arbitrary, it contains address data corresponding to the plurality of address types, such as Hebei Province or Baoding City, but it may also contain data that does not correspond to any address type, such as phone credit top-up information or virtual game card information. Data of the latter kind needs to be cleaned. For this reason, the preprocessing module filters out, from the original address data, the address data that does not correspond to any of the plurality of address types and deletes it from the original address data.
For the same reason, the original address data may contain address data in a non-canonical format. Examples include: English letters or digits written in full-width form; addresses outside Hong Kong, Macao, and Taiwan written in Traditional Chinese; addresses in Hong Kong, Macao, and Taiwan written in Simplified Chinese; house numbers written in Chinese numerals (such as 二十号, "No. twenty"); and road names that are named after a number but written with Arabic digits (such as 文2路 for 文二路, Wen'er Road). The preprocessing module therefore converts address data in a non-canonical format in the original address data into address data in a canonical format. The canonical format includes, but is not limited to: full-width English letters and digits are changed to half-width; mainland addresses are always in Simplified Chinese; addresses in Hong Kong, Macao, and Taiwan are always in Traditional Chinese; road names are always written in Chinese; house numbers, room numbers, and the like are always written in digits.
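The format conversions listed above can be sketched as follows. This is a minimal illustration and not the implementation of the present application: the function names, the regular expressions, and the restriction to numerals below 10000 are assumptions, and the conversion between Simplified and Traditional Chinese is omitted because it requires a character mapping table that the standard library does not provide.

```python
import re
import unicodedata

# Digits and units for simple Chinese numerals (illustrative; values below 10000).
_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
           "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
_UNITS = {"十": 10, "百": 100, "千": 1000}
_DIGIT_TO_CN = "零一二三四五六七八九"

# Hypothetical patterns: a Chinese numeral before 号/室, and a single digit before 路/街.
_HOUSE_NO = re.compile(r"([零一二两三四五六七八九十百千]+)(号|室)")
_ROAD_NO = re.compile(r"(?<!\d)(\d)(?=路|街)")


def chinese_numeral_to_int(text: str) -> int:
    """Convert a simple Chinese numeral such as 二十 or 三百零五 to an int."""
    total, current = 0, 0
    for ch in text:
        if ch in _DIGITS:
            current = _DIGITS[ch]
        elif ch in _UNITS:
            # A bare unit such as 十 stands for 1 * 10.
            total += (current or 1) * _UNITS[ch]
            current = 0
        else:
            raise ValueError(f"unsupported numeral character: {ch!r}")
    return total + current


def normalize_address(text: str) -> str:
    """Apply the canonical-format rules that need no external data."""
    # Full-width letters and digits become half-width (NFKC also folds
    # other compatibility characters, such as full-width brackets).
    text = unicodedata.normalize("NFKC", text)
    # House and room numbers are written in digits.
    text = _HOUSE_NO.sub(
        lambda m: f"{chinese_numeral_to_int(m.group(1))}{m.group(2)}", text)
    # Single-digit road names are written in Chinese.
    text = _ROAD_NO.sub(lambda m: _DIGIT_TO_CN[int(m.group(1))], text)
    return text
```

With these assumptions, `normalize_address("文２路二十号")` yields `文二路20号`: the full-width digit is folded first, the house number becomes digits, and the digit in the road name becomes Chinese.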
In this embodiment of the present application, the process in which the address management device segments the preprocessed address data based on the plurality of address types includes, but is not limited to, the following: the address management device obtains the word segmenter dictionaries corresponding to the plurality of address types, and uses them to segment the preprocessed address data into address data corresponding to the plurality of address types. For example, based on these dictionaries, the address management device may segment the preprocessed address data into address data corresponding to the province, the city, the district/county, the township (subdistrict office), the development zone, the main road, the main road house number, the branch road, the branch road house number, the landmark POI (housing estate, etc.), the building, the unit (floor), and the room number.
In this embodiment of the present application, a segmentation module may be configured in the address management device. The segmentation module obtains the word segmenter dictionaries corresponding to the plurality of address types, and uses them to segment the preprocessed address data into address data corresponding to the plurality of address types.
The word segmenter dictionaries include, but are not limited to: province, city, and district/county dictionaries; a township dictionary; an industrial zone dictionary; a village dictionary; a street dictionary; a university dictionary; a community standard dictionary; and a community self-learning dictionary.
When the segmentation module uses the word segmenter dictionaries to segment the preprocessed address data into address data corresponding to the plurality of address types, the segmentation algorithm is specifically a forward finite-state maximum matching algorithm. Its segmentation rules include keyword-based segmentation, with keywords such as town, street, road, company, building, middle school, house number, and detailed address within a community (building, unit, room number). The segmentation flow specifically includes the following steps.
Province, city, and district segmentation: the detailed address is cut with a word segmenter initialized from the province/city/district dictionary. If the segmented province, city, or district differs from the original province, city, or district field, the original field is replaced, which reduces subsequent segmentation errors; the remaining address is retained.
Township (industrial zone) segmentation: word segmenters initialized from the township (industrial zone) dictionaries (362 in total, one per city) segment the remaining address from the previous step. If the word segmenter fails, the detailed address is segmented instead; if that also fails, township rules are used for segmentation and the result is marked for subsequent processing.
Road segmentation: similar to the township (industrial zone) segmentation flow, except that the township dictionary is used to initialize the 362 road word segmenters.
House number segmentation: the corresponding segmentation rules are applied.
Community (housing estate) segmentation: community word segmenters initialized from the community dictionaries (362 in total, one per city) segment the remaining address from the previous step. If the word segmenter fails, the detailed address is segmented instead. If two community elements are segmented out, the one with the longer string is taken as the community element. If segmentation still fails, the word segmenter of the self-learning dictionary segments the detailed address; if that also fails, community rules are used. Communities segmented by the self-learning dictionary or by community rules are marked for subsequent processing.
Detailed address segmentation within the community (building, unit, room number): the corresponding segmentation rules are applied.
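The dictionary-driven part of this flow can be illustrated with a plain forward maximum matching routine. This is a simplified sketch under stated assumptions: it uses a single in-memory dictionary that maps each word to an address type, it does not model the finite-state aspect, the per-city segmenters, or the rule-based fallbacks, and the function name and the sample dictionary are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (text, address type or None if unmatched)


def forward_max_match(text: str, dictionary: Dict[str, str]) -> List[Token]:
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=0)
    tokens: List[Token] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                tokens.append((piece, dictionary[piece]))
                i += size
                break
        else:
            # No dictionary word starts here: keep the character as
            # unmatched remainder, merged with any preceding remainder.
            if tokens and tokens[-1][1] is None:
                tokens[-1] = (tokens[-1][0] + text[i], None)
            else:
                tokens.append((text[i], None))
            i += 1
    return tokens


# Hypothetical sample dictionary for illustration only.
SAMPLE_DICTIONARY = {
    "浙江省": "province",
    "杭州市": "city",
    "西湖区": "district",
    "文二路": "road",
}
```

For the input `浙江省杭州市西湖区文二路391号`, the sample dictionary yields the province, city, district, and road tokens followed by the unmatched remainder `391号`, which a later rule-based step (house number segmentation) would handle.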
In this embodiment of the present application, the process in which the address management device performs completion and verification on the segmented address data based on the plurality of address types includes, but is not limited to, the following: the address management device checks whether the segmented address data already contains address data corresponding to all of the plurality of address types; if not, the address management device determines the address types that the segmented address data does not contain, and completes the address data of those address types based on historical data; if so, no address data needs to be completed.
For example, suppose the address management device, based on the plurality of address types, segments out address data corresponding to the province, the district/county, the development zone, the main road, the main road house number, the branch road, the branch road house number, and the unit (floor). The address management device then determines that the segmented address data does not contain address data for all of the address types, and completes, based on historical data, the address data corresponding to the city, the township (subdistrict office), the landmark POI (housing estate, etc.), the building, and the room number.
In this embodiment of the present application, a completion and verification module may be configured in the address management device. This module checks whether the segmented address data already contains address data corresponding to all of the plurality of address types; if not, it determines the address types that the segmented address data does not contain and completes the address data of those address types based on historical data; if so, no address data needs to be completed.
Address data contains a large amount of incorrect data. For example, the correct address data is 杭州市文二路391号西湖国际科技大厦B座2楼小邮局 (Small Post Office, 2nd Floor, Block B, West Lake International Technology Building, No. 391 Wen'er Road, Hangzhou), whereas users fill in non-standard or incorrect address data such as: 杭州市文二路391号2楼小邮局 (building name omitted); 杭州市文二路西湖国际科技大厦B座2楼小邮局 (house number omitted); 杭州市文二路380号西湖国际科技大厦B座2楼小邮局 (wrong house number). The completion and verification module handles these cases during address data processing by completing and correcting the house number field or the community field of the segmented address data.
A structured address standard library is used for this purpose. Each piece of address data in the structured address standard library can be structured, using the corresponding segmentation algorithm, as: city + district/county + road + house number + community. The frequency of each address in which all five fields are complete is counted, and addresses with a frequency greater than 3 are selected. The usage frequency of each community under the same city + district/county + road + house number is then counted, and the city + district/county + road + house number + community combination with the highest frequency is retained and added to the structured address standard library. Alternatively, each piece of address data can be structured as: city + road + house number + community. The frequency of each address in which all four fields are complete is counted, and addresses with a frequency greater than or equal to 1 are selected. The usage frequency of each community under the same city + road + house number is then counted, and the city + road + house number + community combination with the highest frequency is retained and added to the structured address standard library.
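The five-field variant of this frequency statistic can be sketched as follows. The record layout (dictionaries with the keys `city`, `district`, `road`, `house_number`, and `community`), the function name, and the tie-breaking behaviour are assumptions made for illustration; only the "frequency greater than 3" filter and the "keep the most frequent community per key" rule come from the description above.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple

FIELDS = ("city", "district", "road", "house_number", "community")
Key = Tuple[str, str, str, str]  # city, district, road, house number


def build_standard_library(records: Iterable[Mapping[str, str]],
                           threshold: int = 3) -> Dict[Key, str]:
    """Map each (city, district, road, house number) to its most used community."""
    # Count only addresses in which all five fields are present.
    counts = Counter(
        tuple(record[field] for field in FIELDS)
        for record in records
        if all(record.get(field) for field in FIELDS)
    )
    best: Dict[Key, Tuple[str, int]] = {}
    for (*key_parts, community), frequency in counts.items():
        if frequency <= threshold:
            continue  # keep only addresses seen more than `threshold` times
        key = tuple(key_parts)
        # On equal frequency the first community encountered is kept (an assumption).
        if key not in best or frequency > best[key][1]:
            best[key] = (community, frequency)
    return {key: community for key, (community, _) in best.items()}
```

The four-field variant would drop `district` from `FIELDS` and call the function with `threshold=0`, which corresponds to a frequency greater than or equal to 1.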
During the completion and correction of address data based on the structured address standard library, it is assumed that there is only one community under a given city + district/county + road + house number. For each piece of structured address data, if the community field is null (empty), or was obtained by rule-based segmentation or by the self-learning dictionary word segmenter, the community can be looked up in the structured address standard library using city + district/county + road + house number as the key, and the community field is completed or corrected accordingly. Similarly, it is assumed that there is only one house number under a given city + district/county + road + community. For each piece of structured address data, if the house number is null, or was obtained by rule-based segmentation or by the self-learning dictionary word segmenter, the house number can be looked up in the structured address standard library using city + district/county + road + community as the key, and the house number field is completed or corrected accordingly.
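The community lookup can be sketched as below. The record layout, the `community_source` field that records how the community was segmented, and its values (`"rule"`, `"self_learning"`, `"standard_library"`) are hypothetical names introduced for this illustration; the house number lookup would follow the same pattern with a library keyed on city + district/county + road + community.

```python
from typing import Dict, Mapping, Optional, Tuple

Key = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]

# Segmentation sources whose result is treated as unreliable (assumed labels).
UNRELIABLE_SOURCES = ("rule", "self_learning")


def complete_community(record: Mapping[str, Optional[str]],
                       library: Dict[Key, str]) -> Dict[str, Optional[str]]:
    """Fill in or correct the community field from the standard library."""
    key = (record.get("city"), record.get("district"),
           record.get("road"), record.get("house_number"))
    needs_fix = (not record.get("community")
                 or record.get("community_source") in UNRELIABLE_SOURCES)
    if needs_fix and key in library:
        # Return a new record; the input is left unchanged.
        return {**record, "community": library[key],
                "community_source": "standard_library"}
    return dict(record)
```

A record for 杭州市 / 西湖区 / 文二路 / 391号 with an empty community would thus receive the community stored under that key, while a record whose community came from the standard dictionary segmenter is returned unchanged.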
In this embodiment of the present application, the process by which the address management apparatus normalizes the address data after completion and verification includes, but is not limited to, the following: the address management apparatus normalizes the completed and verified address data using a pinyin similarity algorithm; and/or the address management apparatus normalizes the completed and verified address data using a point of interest (POI) normalization algorithm based on a probabilistic retrieval model.
In this embodiment of the present application, a normalization module may be configured in the address management apparatus. The normalization module normalizes the completed and verified address data using the pinyin similarity algorithm, and/or using the POI normalization algorithm based on a probabilistic retrieval model.
The address data entered by users contains many non-standard forms, such as short names, abbreviations, typos, and homophones. For example: the standard address data is 西湖国际科技大厦 (West Lake International Technology Building) and the non-standard address data is 西湖国际 (an abbreviation); the standard address data is 浙江大学第一附属医院 (First Affiliated Hospital of Zhejiang University) and the non-standard address data is 浙大一附院 (a short name); the standard address data is 古墩路 (Gudun Road) and the non-standard address data is 古吨路 (a homophone); the standard address data is 保淑路 (Baoshu Road) and the non-standard address data is 保椒路 (a typo). Although these address data can be segmented out during address structuring, having multiple names for the same place causes great difficulty in address coordinate labeling and in subsequent address data analysis. The normalization module therefore needs to normalize the non-standard address data.
Further, the algorithms used by the normalization module to normalize non-standard address data include, but are not limited to, the pinyin similarity algorithm and the POI normalization algorithm based on a probabilistic retrieval model.
For the pinyin similarity algorithm: the normalization module converts the non-standard address data and the standard address data into pinyin, calculates a similarity distance (such as the minimum edit distance), and takes the standard address data whose similarity is above a threshold and is the highest as the standardized address data for the non-standard address data.
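By way of illustration only, the pinyin similarity step may be sketched in Python as follows. The character-to-pinyin table, the function names, the conversion of edit distance into a similarity in the range 0 to 1, and the threshold value 0.8 are all assumptions made for this sketch; the application itself specifies only the conversion to pinyin, a similarity distance such as the minimum edit distance, and a threshold.

```python
# Hypothetical character-to-pinyin table covering only the demo strings;
# a real system would need a complete pinyin dictionary.
PINYIN = {
    "古": "gu", "墩": "dun", "吨": "dun", "路": "lu",
    "保": "bao", "淑": "shu", "椒": "jiao",
}


def to_pinyin(text):
    """Convert text to a pinyin string; unknown characters are kept unchanged."""
    return "".join(PINYIN.get(ch, ch) for ch in text)


def edit_distance(a, b):
    """Minimum edit distance (Levenshtein), computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pinyin_similarity(a, b):
    """Assumed similarity: 1 minus the edit distance scaled by the longer pinyin."""
    pa, pb = to_pinyin(a), to_pinyin(b)
    longest = max(len(pa), len(pb))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pa, pb) / longest


def normalize_by_pinyin(raw, standard_names, threshold=0.8):
    """Return the standard name with the highest similarity above the threshold."""
    best_name, best_score = None, threshold
    for name in standard_names:
        score = pinyin_similarity(raw, name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Under these assumptions the homophone 古吨路 maps to 古墩路, because both convert to the same pinyin string. A typo such as 保椒路 changes the pinyin substantially and falls below the assumed threshold, which suggests why a second, bigram-based algorithm is also provided.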
For the POI normalization algorithm based on a probabilistic retrieval model: the normalization module segments the identified POI-like string into bigrams. Then, for each bigram that appears in both the POI-like string and a candidate standard POI, the weight of that bigram is accumulated; the sum of these bigram weights is the measure of relevance between the candidate standard POI and the POI-like string. Further, the relevance scores of the candidate POIs are calculated and sorted in descending order, and the highest-scoring POI whose POI type and district match the address type and the district of the address is selected as the standard POI.
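The bigram segmentation, weight accumulation, and filtered ranking described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the candidate record fields ("name", "district", "type"), and the weight table passed in as `weight` are hypothetical, and the per-bigram weights are taken as given rather than computed.

```python
def bigrams(text):
    """Split a string into overlapping two-character units (bigrams)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def relevance(poi_like, candidate, weight):
    """Sum the weights of the distinct bigrams that occur in both strings."""
    shared = set(bigrams(poi_like)) & set(bigrams(candidate))
    return sum(weight.get(bg, 0.0) for bg in shared)


def pick_standard_poi(poi_like, candidates, weight, district, poi_type):
    """Rank candidates by score (descending) and return the best one whose
    district and type match those of the address; None if nothing qualifies."""
    scored = [(relevance(poi_like, c["name"], weight), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    for score, cand in scored:
        if score > 0 and cand["district"] == district and cand["type"] == poi_type:
            return cand["name"]
    return None
```

For instance, with illustrative weights for the bigrams 西湖, 湖国, and 国际, the abbreviation 西湖国际 shares all three bigrams with 西湖国际科技大厦 but only one with a name such as 西湖文化广场, so the former ranks first.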
To implement the above process, the following BM25 (binary independence model) formulas may be used:
Figure PCTCN2016077297-appb-000001
Figure PCTCN2016077297-appb-000002
Figure PCTCN2016077297-appb-000003
Figure PCTCN2016077297-appb-000004
The parameters of the above four formulas are described below:
Bigram b_i         Relevant POI    Irrelevant POI             Number of POIs
b_i = 1            r_i             n_i - r_i                  n_i
b_i = 0            R - r_i         (N - R) - (n_i - r_i)      N - n_i
Number of POIs     R               N - R                      N
Further, S: the relevance score of a candidate POI; N: the number of POIs in a city or district; R: the number of relevant POIs, that is, POIs that share two identical bigrams with the POI-like string and whose Jaccard similarity (similarity coefficient) to it is greater than 0.4; n_i: the number of POIs containing bigram b_i; dl: the number of bigrams in the current candidate standard POI; avdl: the average number of bigrams per candidate standard POI; r_i: the number of relevant POIs among the n_i; index_i: the positional order in which b_i appears in the current POI; avgindex_i: the average positional order in which b_i appears in the POIs that contain it; k, b: free tuning parameters, empirically set to k = 1.2 and b = 0.75; K, I: temporary variables in the formulas.
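Because the four formulas are available here only as figure references, the following Python sketch uses the standard textbook forms instead, and they may differ from the formulas of the application. It shows the Robertson/Sparck Jones relevance weight built from the four cells of the table above with the customary 0.5 smoothing, the usual BM25 length normalization term, and a Jaccard similarity over bigram sets. The function names are assumptions, and the position terms index_i and avgindex_i are deliberately omitted because their role cannot be recovered from the text.

```python
import math


def rsj_weight(r_i, n_i, R, N):
    """Textbook Robertson/Sparck Jones weight for one bigram b_i.

    The four cells of the contingency table are r_i, n_i - r_i, R - r_i and
    (N - R) - (n_i - r_i); 0.5 is added to each to avoid division by zero.
    """
    relevant_odds = (r_i + 0.5) / (R - r_i + 0.5)
    irrelevant_odds = (n_i - r_i + 0.5) / ((N - R) - (n_i - r_i) + 0.5)
    return math.log(relevant_odds / irrelevant_odds)


def length_norm(dl, avdl, k=1.2, b=0.75):
    """Textbook BM25 length normalization: K = k * ((1 - b) + b * dl / avdl)."""
    return k * ((1 - b) + b * dl / avdl)


def jaccard(a, b):
    """Jaccard similarity of two bigram collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

With k = 1.2 and b = 0.75 as stated above, a candidate of average length (dl equal to avdl) gives K = 1.2. For a fixed r_i, a bigram contained in fewer POIs receives a larger weight, which is the intended effect of a probabilistic relevance weight.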
Compared with the prior art, the embodiments of the present application have at least the following advantages. By setting a structured address format that includes multiple address types and generating structured address data conforming to that format, the embodiments produce normalized, standardized address data. This solves the problem that text addresses could not be normalized, makes it possible to determine the similarities and differences between different text addresses, and makes it possible to identify the attribution of a text address. Specifically, the address data in a massive set of historical text addresses is identified and extracted, the knowledge and rules relating the address data are learned from it, and the learned knowledge and rules are used to complete missing address data, verify erroneous address data, and normalize non-standard address data, so that hierarchical structured address data is regenerated.
Based on the same inventive concept as the above method, an embodiment of the present application further provides an address management apparatus. As shown in FIG. 2, the address management apparatus specifically includes:
an obtaining module 11, configured to obtain original address data input by a user;
a determining module 12, configured to determine a structured address format including multiple address types;
a processing module 13, configured to convert the original address data into structured address data conforming to the structured address format, where the structured address data includes address data corresponding to the multiple address types.
The processing module 13 specifically includes: a preprocessing sub-module 131, configured to preprocess the original address data based on the multiple address types; a segmentation sub-module 132, configured to segment the preprocessed address data based on the multiple address types; a completion sub-module 133, configured to perform completion and verification on the segmented address data based on the multiple address types; and a normalization sub-module 134, configured to normalize the completed and verified address data to obtain structured address data conforming to the structured address format.
The preprocessing sub-module 131 is specifically configured to filter out, from the original address data, address data that does not correspond to any of the multiple address types, delete the filtered address data from the original address data, and convert address data in a non-standard format in the original address data into address data in a standard format.
The segmentation sub-module 132 is specifically configured to obtain the tokenizer dictionaries corresponding to the multiple address types, and to use these tokenizer dictionaries to segment out the address data corresponding to the multiple address types.
The completion sub-module 133 is specifically configured to check whether the segmented address data already includes address data corresponding to each of the multiple address types; if not, to determine the address types that are missing from the segmented address data and to complete the address data of those address types based on historical data.
The normalization sub-module 134 is specifically configured to normalize the completed and verified address data using a pinyin similarity algorithm; and/or to normalize the completed and verified address data using a point of interest (POI) normalization algorithm based on a probabilistic retrieval model.
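As a rough illustration of the segmentation and completion sub-modules described above, the following Python sketch segments a text address with one dictionary per address type and then fills in missing address types from a mapping derived from historical data. The four address types, the dictionary contents, the `HISTORY` mapping, and the greedy longest-match strategy are all assumptions chosen for this sketch; the application does not prescribe them.

```python
# Hypothetical address types, ordered from coarse to fine.
ADDRESS_TYPES = ["province", "city", "district", "road"]

# Hypothetical tokenizer dictionaries, one per address type.
DICTIONARIES = {
    "province": {"浙江省"},
    "city": {"杭州市"},
    "district": {"西湖区"},
    "road": {"古墩路", "保淑路"},
}

# Hypothetical knowledge learned from historical addresses:
# a known value mapped to the higher-level values it usually occurs with.
HISTORY = {
    "古墩路": {"province": "浙江省", "city": "杭州市", "district": "西湖区"},
}


def segment(text):
    """Greedy longest-match segmentation against the per-type dictionaries."""
    result = {}
    pos = 0
    while pos < len(text):
        match = None
        for addr_type, words in DICTIONARIES.items():
            for word in words:
                longer = match is None or len(word) > len(match[1])
                if text.startswith(word, pos) and longer:
                    match = (addr_type, word)
        if match:
            result.setdefault(match[0], match[1])
            pos += len(match[1])
        else:
            pos += 1  # skip a character that no dictionary recognizes
    return result


def complete(parsed):
    """Fill in the address types that are missing, using the historical mapping."""
    filled = dict(parsed)
    missing = [t for t in ADDRESS_TYPES if t not in filled]
    for value in parsed.values():
        known = HISTORY.get(value, {})
        for addr_type in missing:
            if addr_type in known:
                filled.setdefault(addr_type, known[addr_type])
    return filled
```

In this sketch the input 杭州市古墩路 is segmented into a city and a road, after which the province and district are supplied from the historical mapping for 古墩路.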
The modules of the apparatus of the present application may be integrated into one unit or deployed separately. The above modules may be combined into one module or further split into multiple sub-modules.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application. Those skilled in the art will understand that the drawings are merely schematic diagrams of a preferred embodiment, and that the modules or processes in the drawings are not necessarily required to implement the present application. Those skilled in the art will also understand that the modules of the apparatus in an embodiment may be distributed in the apparatus of the embodiment as described, or may be changed accordingly and located in one or more apparatuses different from that of the embodiment. The modules of the above embodiments may be combined into one module or further split into multiple sub-modules. The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
The above discloses only several specific embodiments of the present application. The present application is not limited thereto, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present application.

Claims (12)

  1. A method for managing address data, characterized in that the method comprises the following steps:
    an address management apparatus obtains original address data input by a user;
    the address management apparatus determines a structured address format including multiple address types;
    the address management apparatus converts the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to the multiple address types.
  2. The method according to claim 1, characterized in that the address management apparatus converting the original address data into structured address data conforming to the structured address format specifically comprises:
    the address management apparatus preprocesses the original address data based on the multiple address types;
    the address management apparatus segments the preprocessed address data based on the multiple address types;
    the address management apparatus performs completion and verification on the segmented address data based on the multiple address types;
    the address management apparatus normalizes the completed and verified address data to obtain structured address data conforming to the structured address format.
  3. The method according to claim 2, characterized in that the process in which the address management apparatus preprocesses the original address data based on the multiple address types specifically comprises:
    the address management apparatus filters out, from the original address data, address data that does not correspond to any of the multiple address types, deletes the filtered address data from the original address data, and converts address data in a non-standard format in the original address data into address data in a standard format.
  4. The method according to claim 2, characterized in that the process in which the address management apparatus segments the preprocessed address data based on the multiple address types specifically comprises:
    the address management apparatus obtains the tokenizer dictionaries corresponding to the multiple address types, and uses the tokenizer dictionaries corresponding to the multiple address types to segment out the address data corresponding to the multiple address types.
  5. The method according to claim 2, characterized in that the process in which the address management apparatus performs completion and verification on the segmented address data based on the multiple address types specifically comprises:
    the address management apparatus checks whether the segmented address data already includes address data corresponding to each of the multiple address types; if not, the address management apparatus determines the address types that are missing from the segmented address data, and completes the address data of those address types based on historical data.
  6. The method according to claim 2, characterized in that the process in which the address management apparatus normalizes the completed and verified address data specifically comprises:
    the address management apparatus normalizes the completed and verified address data using a pinyin similarity algorithm; and/or the address management apparatus normalizes the completed and verified address data using a point of interest (POI) normalization algorithm based on a probabilistic retrieval model.
  7. An address management apparatus, characterized in that the address management apparatus specifically comprises:
    an obtaining module, configured to obtain original address data input by a user;
    a determining module, configured to determine a structured address format including multiple address types;
    a processing module, configured to convert the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to the multiple address types.
  8. The address management apparatus according to claim 7, characterized in that the processing module comprises:
    a preprocessing sub-module, configured to preprocess the original address data based on the multiple address types;
    a segmentation sub-module, configured to segment the preprocessed address data based on the multiple address types;
    a completion sub-module, configured to perform completion and verification on the segmented address data based on the multiple address types;
    a normalization sub-module, configured to normalize the completed and verified address data to obtain structured address data conforming to the structured address format.
  9. The address management apparatus according to claim 8, characterized in that
    the preprocessing sub-module is specifically configured to filter out, from the original address data, address data that does not correspond to any of the multiple address types, delete the filtered address data from the original address data, and convert address data in a non-standard format in the original address data into address data in a standard format.
  10. The address management apparatus according to claim 8, characterized in that
    the segmentation sub-module is specifically configured to obtain the tokenizer dictionaries corresponding to the multiple address types, and to use the tokenizer dictionaries corresponding to the multiple address types to segment out the address data corresponding to the multiple address types.
  11. The address management apparatus according to claim 8, characterized in that
    the completion sub-module is specifically configured to check whether the segmented address data already includes address data corresponding to each of the multiple address types; if not, to determine the address types that are missing from the segmented address data, and to complete the address data of those address types based on historical data.
  12. The address management apparatus according to claim 8, characterized in that
    the normalization sub-module is specifically configured to normalize the completed and verified address data using a pinyin similarity algorithm; and/or to normalize the completed and verified address data using a point of interest (POI) normalization algorithm based on a probabilistic retrieval model.
PCT/CN2016/077297 2015-04-13 2016-03-25 Address data management method and device WO2016165538A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510172985.0 2015-04-13
CN201510172985.0A CN106156145A (en) 2015-04-13 2015-04-13 The management method of a kind of address date and device

Publications (1)

Publication Number Publication Date
WO2016165538A1 true WO2016165538A1 (en) 2016-10-20

Family

ID=57127145

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/077297 WO2016165538A1 (en) 2015-04-13 2016-03-25 Address data management method and device

Country Status (2)

Country Link
CN (1) CN106156145A (en)
WO (1) WO2016165538A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719128A (en) * 2009-12-31 2010-06-02 浙江工业大学 Fuzzy matching-based Chinese geo-code determination method
US20120047179A1 (en) * 2010-08-19 2012-02-23 International Business Machines Corporation Systems and methods for standardization and de-duplication of addresses using taxonomy
CN102955832A (en) * 2011-08-31 2013-03-06 深圳市华傲数据技术有限公司 Correspondence address identifying and standardizing system
CN103473289A (en) * 2013-08-30 2013-12-25 深圳市华傲数据技术有限公司 Device and method for completing communication addresses

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996247B (en) * 2010-11-10 2013-02-20 百度在线网络技术(北京)有限公司 Method and device for constructing address database
CN102955833B (en) * 2011-08-31 2015-11-25 深圳市华傲数据技术有限公司 A kind of address identification, standardized method
CN103440311A (en) * 2013-08-27 2013-12-11 深圳市华傲数据技术有限公司 Method and system for identifying geographical name entities

Also Published As

Publication number Publication date
CN106156145A (en) 2016-11-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16779513

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16779513

Country of ref document: EP

Kind code of ref document: A1