WO2016165538A1

WO2016165538A1 - Address data management method and device

Info

Publication number: WO2016165538A1
Application number: PCT/CN2016/077297
Authority: WO
Inventors: 吴保华
Original assignee: 阿里巴巴集团控股有限公司; 吴保华
Priority date: 2015-04-13
Filing date: 2016-03-25
Publication date: 2016-10-20
Also published as: CN106156145A

Abstract

Disclosed are an address data management method and device. The method comprises: an address management device acquires original address data input by a user; the address management device determines a structured address format comprising multiple address types; and the address management device converts the original address data into structured address data satisfying the structured address format, wherein the structured address data comprises address data corresponding to the multiple address types. In embodiments of the present application, by setting a structured address format comprising multiple address types and generating structured address data satisfying the structured address format, normalized and standardized address data is generated, the problem of failing to normalize text addresses is solved, similarities and differences among different text addresses can be determined, and the attribution of the text addresses can be identified.

Description

Method and device for managing address data

Technical field

The present application relates to the field of communications technologies, and in particular, to a method and an apparatus for managing address data.

Background art

A large number of text addresses are generated in e-commerce websites and logistics systems, and the input format and address elements of these text addresses vary from user to user. For example, the text address input by user A includes only house number information, the text address input by user B includes only POI (Point of Interest) information, and the text address input by user C includes wrong district or house number information. These text addresses lack normalization and standardization, so the similarities and differences between different text addresses cannot be judged, and the attribution of the text addresses cannot be recognized. Here, an address element refers to an element at any level of the text address, such as a province, city, district, development zone, town, road, or POI. A POI can be a house, a shop, a mailbox, a bus stop, and the like.

Summary of the invention

The embodiment of the present application provides a method and an apparatus for managing address data to generate normalized and standardized address data, thereby solving the problem that the text address cannot be normalized.

An embodiment of the present application provides a method for managing address data, where the method includes the following steps:

The address management device obtains original address data input by the user;

The address management device determines a structured address format including a plurality of address types;

The address management device converts the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to a plurality of address types.

The address management device converts the original address data into structured address data that conforms to the structured address format, and specifically includes:

The address management device preprocesses original address data based on multiple address types;

The address management apparatus performs segmentation on the preprocessed address data based on a plurality of address types;

The address management device performs completion verification on the segmented address data based on the plurality of address types;

The address management device normalizes the address data after the completion verification to obtain structured address data conforming to the structured address format.

The process of the pre-processing of the original address data by the address management device based on the multiple address types includes:

The address management device filters out, from the original address data, address data that does not correspond to the multiple address types, deletes the filtered-out address data from the original address data, and converts address data in a non-canonical format present in the original address data into address data in a canonical format.

The process of the segmentation of the pre-processed address data by the address management device based on the multiple address types includes:

The address management device obtains the word breaker dictionary corresponding to the plurality of address types, and uses the word breaker dictionary to segment the preprocessed address data into address data corresponding to the plurality of address types.

The process of the address management device performing the completion verification on the segmented address data based on the multiple address types, specifically includes:

The address management apparatus verifies whether the segmented address data includes address data corresponding to all of the plurality of address types; if not, the address management apparatus determines the address type not included in the segmented address data, and completes the address data of that address type based on historical data.

The process of normalizing the address data after the completion verification by the address management apparatus includes: the address management apparatus normalizes the address data after the completion verification by using the pinyin similarity algorithm; and/or the address management apparatus normalizes the address data after the completion verification by using the POI normalization algorithm based on the probability retrieval model.

The embodiment of the present application provides an address management apparatus, where the address management apparatus specifically includes:

An obtaining module, configured to obtain original address data input by a user;

a determining module for determining a structured address format including a plurality of address types;

a processing module, configured to convert the original address data into structured address data conforming to the structured address format, where the structured address data includes address data corresponding to multiple address types.

The processing module includes: a pre-processing sub-module, configured to pre-process the original address data based on the plurality of address types; a segmentation sub-module, configured to segment the pre-processed address data based on the plurality of address types; a completion sub-module, configured to perform completion verification on the segmented address data based on the multiple address types; and a normalization sub-module, configured to normalize the address data after the completion verification to obtain structured address data conforming to the structured address format.

The pre-processing sub-module is specifically configured to filter out, from the original address data, address data that does not correspond to the multiple address types, delete the filtered-out address data from the original address data, and convert address data in a non-canonical format present in the original address data into address data in a canonical format.

The segmentation sub-module is specifically configured to obtain a word breaker dictionary corresponding to a plurality of address types, and use the word breaker dictionary to segment the pre-processed address data into address data corresponding to the plurality of address types.

The completion sub-module is specifically configured to verify whether the segmented address data includes address data corresponding to all of the multiple address types; if not, determine the address type not included in the segmented address data, and complete the address data of that address type based on historical data.

The normalization sub-module is specifically configured to normalize the address data after the completion verification by using a pinyin similarity algorithm; and/or normalize the address data after the completion verification by using a POI normalization algorithm based on a probability retrieval model.

Compared with the prior art, the embodiments of the present application have at least the following advantages: by setting a structured address format including multiple address types and generating structured address data conforming to the structured address format, normalized and standardized address data is generated, which solves the problem that text addresses cannot be normalized, makes it possible to determine the similarities and differences between different text addresses, and makes it possible to identify the attribution of text addresses. Specifically, address data is identified and extracted from massive historical text addresses, the knowledge and rules relating the address data are learned, and the learned knowledge and rules are used to complete missing address data, verify erroneous address data, and normalize non-canonical address data, so that hierarchical structured address data is regenerated.

Description of the drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described. It is obvious that the drawings in the following description are only some embodiments of the present application. For those skilled in the art, other drawings may be obtained according to the drawings of the embodiments of the present application without any creative work.

FIG. 1 is a schematic flowchart of a method for managing address data according to Embodiment 1 of the present application;

FIG. 2 is a schematic structural diagram of an address management apparatus according to Embodiment 2 of the present application.

Detailed description

The technical solutions in the embodiments of the present application are clearly and completely described in the following with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

Embodiment 1

For the problem in the prior art, the first embodiment of the present application provides a method for managing address data. As shown in FIG. 1 , the method for managing the address data may specifically include the following steps:

In step 101, the address management device obtains original address data input by the user.

In the embodiment of the present application, an integration module may be configured in the address management apparatus. The integration module integrates the address data sources of all parties, generates a unique key for each piece of address data, and loads the data into a text address library. The address data stored under a key in the text address library is the original address data input by the user.

Step 102: The address management apparatus determines a structured address format including a plurality of address types.

The multiple address types included in the structured address format include, but are not limited to, one or any combination of the following: province, city, district/county, township (street office), development zone, main road, main road number, branch road, branch road number, iconic POI (real estate, etc.), building, unit (floor), room number, and so on.

Step 103: The address management apparatus converts the original address data into structured address data conforming to a structured address format, the structured address data including address data corresponding to a plurality of address types.

For example, the structured address data conforming to the structured address format generated by the address management device may include address data corresponding to the province, the city, the district, the township (street office), the development zone, the main road, the main road number, the branch road, the branch road number, the iconic POI (real estate, etc.), the building, the unit (floor), the room number, and the like.
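As a minimal illustrative sketch (not part of the original disclosure), the structured address format can be pictured as a record with one optional slot per address type. The class name `StructuredAddress`, the field names, and the helper `missing_types` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class StructuredAddress:
    """Hypothetical record: one optional slot per address type in the format."""
    province: Optional[str] = None
    city: Optional[str] = None
    district: Optional[str] = None            # district or county
    township: Optional[str] = None            # township (street office)
    development_zone: Optional[str] = None
    main_road: Optional[str] = None
    main_road_number: Optional[str] = None
    branch_road: Optional[str] = None
    branch_road_number: Optional[str] = None
    poi: Optional[str] = None                 # iconic POI (real estate, etc.)
    building: Optional[str] = None
    unit: Optional[str] = None                # unit (floor)
    room_number: Optional[str] = None

    def missing_types(self) -> List[str]:
        """Return the names of the address types that hold no data yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


addr = StructuredAddress(
    province="Zhejiang",
    city="Hangzhou",
    main_road="Wen Er Road",
    main_road_number="391",
)
missing = addr.missing_types()
```

A record of this kind makes it easy to see which address types still need to be filled in by a later completion step.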

In the embodiment of the present application, the process in which the address management apparatus converts the original address data into structured address data conforming to the structured address format includes, but is not limited to, the following: the address management apparatus preprocesses the original address data based on the multiple address types; thereafter, the address management device segments the preprocessed address data based on the plurality of address types; thereafter, the address management device performs completion verification on the segmented address data based on the plurality of address types; and thereafter, the address management device normalizes the address data after the completion verification to obtain structured address data conforming to the structured address format.
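The order of the four stages can be sketched as a simple chain of functions. This is only a toy illustration under stated assumptions: the stage bodies are deliberately trivial stand-ins, and the tables `HISTORY_PROVINCE` and `ALIASES`, the comma-separated input, and all function names are hypothetical.

```python
# Hypothetical lookup tables standing in for learned historical knowledge.
HISTORY_PROVINCE = {"Hangzhou": "Zhejiang"}
ALIASES = {"Wen 2 Road": "Wen Er Road"}


def preprocess(state):
    # Stand-in for cleaning: collapse runs of whitespace.
    return {**state, "text": " ".join(state["raw"].split())}


def segment(state):
    # Stand-in for dictionary segmentation: split "city, road" on the comma.
    city, road = [part.strip() for part in state["text"].split(",", 1)]
    return {**state, "fields": {"province": None, "city": city, "road": road}}


def complete(state):
    # Fill a missing province from the (hypothetical) historical table.
    fields = dict(state["fields"])
    if fields["province"] is None:
        fields["province"] = HISTORY_PROVINCE.get(fields["city"])
    return {**state, "fields": fields}


def normalize(state):
    # Map known non-standard names to their standard form.
    fields = {k: ALIASES.get(v, v) for k, v in state["fields"].items()}
    return {**state, "fields": fields}


def convert(raw):
    """Run the stages in order: preprocess, segment, complete, normalize."""
    state = {"raw": raw}
    for stage in (preprocess, segment, complete, normalize):
        state = stage(state)
    return state["fields"]


result = convert("  Hangzhou ,  Wen 2   Road ")
```

Each stage receives the output of the previous one, which mirrors the "thereafter" ordering of the description.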

In the embodiment of the present application, the process in which the address management device preprocesses the original address data based on the multiple address types includes: the address management device filters out, from the original address data, address data that does not correspond to the multiple address types, deletes the filtered-out address data from the original address data, and converts the address data in a non-canonical format present in the original address data into address data in a canonical format.

In the embodiment of the present application, a pre-processing module may be configured in the address management apparatus. The pre-processing module filters out address data that does not correspond to the multiple address types from the original address data, and deletes the filtered-out address data from the original address data. Further, the pre-processing module converts the address data in a non-canonical format present in the original address data into address data in a canonical format.

Since the original address data is filled in by the user and is therefore arbitrary, it may include address data corresponding to the multiple address types, such as the address data Hebei Province and Baoding City, and may also include data that does not correspond to the multiple address types, such as phone-fee recharge information and virtual game card information. Data that does not correspond to the multiple address types needs to be cleaned. Based on this, the pre-processing module filters out the address data that does not correspond to the multiple address types from the original address data, and deletes the filtered-out address data from the original address data.
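One simple way to picture this cleaning step is a keyword filter over the separated parts of the input. The sketch below is an assumption for illustration only: the keyword list `NON_ADDRESS_KEYWORDS`, the choice of comma and semicolon as separators, and the function name are hypothetical, and the source does not say how non-address data is recognized.

```python
import re

# Hypothetical markers of non-address content (recharge info, game cards).
NON_ADDRESS_KEYWORDS = ("recharge", "game card")


def strip_non_address(raw, keywords=NON_ADDRESS_KEYWORDS):
    """Drop comma/semicolon-separated parts that contain a non-address keyword."""
    parts = [part.strip() for part in re.split(r"[,;]", raw)]
    kept = [
        part for part in parts
        if part and not any(keyword in part.lower() for keyword in keywords)
    ]
    return ", ".join(kept)


cleaned = strip_non_address("Baoding City, Hebei Province; phone recharge 50 yuan")
```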

Since the original address data is filled in by the user and is therefore arbitrary, address data in a non-canonical format exists in the original address data. For example: English letters and digits are written in full-width form; addresses outside Hong Kong, Macao and Taiwan contain traditional Chinese characters; addresses in Hong Kong, Macao and Taiwan contain simplified Chinese characters; house numbers are written in Chinese numerals (such as No. 20 written out in Chinese characters); and road names that are named with numerals are written with digits (such as Wen 2 Road). Based on this, the pre-processing module converts the address data in a non-canonical format present in the original address data into address data in a canonical format. The canonical format includes, but is not limited to: full-width English letters and digits are converted to half-width; the standard format of mainland addresses is simplified Chinese; the standard format of Hong Kong, Macao and Taiwan addresses is traditional Chinese; the standard format of road names is Chinese characters; and the standard format of house numbers, room numbers and the like is digits.
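Some of these conversions can be sketched with the Python standard library. The sketch below is illustrative and rests on assumptions: it uses Unicode NFKC normalization as one way to turn full-width letters and digits into half-width ones (NFKC also folds other compatibility characters), it only handles Chinese numerals below 100, the example strings are made up, and simplified/traditional conversion is left out because it needs a character mapping table.

```python
import re
import unicodedata

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
DIGIT_TO_CN = {str(value): char for char, value in CN_DIGITS.items()}

# A Chinese numeral below 100 directly followed by the house-number marker 号.
HOUSE_NO = re.compile(r"([一二三四五六七八九]?十[一二三四五六七八九]?|[一二三四五六七八九])号")
# A single ASCII digit directly followed by the road marker 路.
ROAD_DIGIT = re.compile(r"([0-9])(?=路)")


def to_half_width(text):
    """Convert full-width letters and digits to half-width via NFKC."""
    return unicodedata.normalize("NFKC", text)


def cn_to_int(numeral):
    """Convert a Chinese numeral below 100 (e.g. 二十, 十五, 三十一) to an int."""
    if "十" in numeral:
        tens, _, ones = numeral.partition("十")
        return (CN_DIGITS[tens] if tens else 1) * 10 + (CN_DIGITS[ones] if ones else 0)
    return CN_DIGITS[numeral]


def canonicalize(text):
    """Half-width characters, digits in house numbers, Chinese numerals in road names."""
    text = to_half_width(text)
    text = HOUSE_NO.sub(lambda m: f"{cn_to_int(m.group(1))}号", text)
    text = ROAD_DIGIT.sub(lambda m: DIGIT_TO_CN[m.group(1)], text)
    return text


canonical = canonicalize("文２路二十号")
```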

In the embodiment of the present application, the process in which the address management device segments the preprocessed address data based on the multiple address types includes, but is not limited to, the following: the address management device obtains a word breaker dictionary corresponding to the multiple address types, and uses the word breaker dictionary to segment the preprocessed address data into address data corresponding to the plurality of address types. For example, based on the word breaker dictionary corresponding to the plurality of address types, the address management apparatus may segment the preprocessed address data into address data corresponding to the province, the city, the district, the township (street office), the development zone, the main road, the main road number, the branch road, the branch road number, the iconic POI (real estate, etc.), the building, the unit (floor), the room number, and the like.

In the embodiment of the present application, a segmentation module may be configured in the address management device. The segmentation module obtains the word breaker dictionary corresponding to the plurality of address types, and uses the word breaker dictionary to segment the preprocessed address data into address data corresponding to the plurality of address types.

The word breaker dictionary includes, but is not limited to: a province, city, and district/county dictionary; a township dictionary; an industrial zone dictionary; a village dictionary; a street dictionary; a university dictionary; a community standard dictionary; and a community self-learning dictionary.

When the segmentation module uses the word breaker dictionary to segment the preprocessed address data into address data corresponding to the multiple address types, the segmentation algorithm is a forward finite-state maximum matching algorithm, and the segmentation rules are keyword-based, with keywords such as: town, street, road, company, building, middle school, house number, and the detailed address within a community (building, unit, room number). The segmentation process specifically includes the following steps.

Province, city, and district segmentation: a word segmenter initialized from the province, city, and district dictionary segments the detailed address. If the province, city, and district obtained by segmentation differ from the original province, city, and district fields, the fields are replaced, which reduces subsequent segmentation errors; the remaining address is retained.

Township (industrial zone) segmentation: word segmenters initialized from the township (industrial zone) dictionary (one per city, 362 in total) segment the remaining address; if that segmentation fails, the detailed address is segmented instead; if segmentation still fails, the township rules are used for segmentation and the result is marked for subsequent processing.

Road segmentation: similar to the township (industrial zone) segmentation process, except that the township dictionary is used to initialize 362 road word segmenters.

House number segmentation: segmentation is performed using the corresponding segmentation rules.

Community (property) segmentation: community word segmenters initialized from the community dictionary (one per city, 362 in total) segment the address remaining from the previous step; if that segmentation fails, the detailed address is segmented instead; among the community elements found, the one with the greatest string length is taken as the community element; if segmentation still fails, the self-learning dictionary word segmenter is used to segment the detailed address; if segmentation still fails, the community rules are used for segmentation. Communities segmented by the self-learning dictionary or by the community rules are marked for subsequent processing.

Detailed address segmentation within the community (building, unit, room number): segmentation is performed using the corresponding segmentation rules.
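The core idea of forward maximum matching can be sketched as follows. This is a plain greedy longest-match scan, offered as an illustration rather than the finite-state implementation referred to above; the dictionary contents, the type labels, and the function name are hypothetical.

```python
def forward_max_match(text, typed_dict):
    """Greedy left-to-right segmentation that always takes the longest dictionary hit.

    typed_dict maps a dictionary word to its address type (must be non-empty).
    Characters not covered by any dictionary word are emitted one by one
    with type None.
    """
    max_len = max(len(word) for word in typed_dict)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in typed_dict:
                tokens.append((piece, typed_dict[piece]))
                break
        else:
            # No dictionary word starts here: emit a single untyped character.
            size = 1
            tokens.append((text[i], None))
        i += size
    return tokens


# Hypothetical mini-dictionary; "杭州" is included to show that the longer
# entry "杭州市" wins when both match.
TYPED_DICT = {"浙江省": "province", "杭州市": "city", "杭州": "city", "文二路": "road"}
tokens = forward_max_match("浙江省杭州市文二路", TYPED_DICT)
```

Preferring the longest match is also consistent with the rule above that the community element with the greatest string length is kept.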

In the embodiment of the present application, the process in which the address management apparatus performs completion verification on the segmented address data based on the multiple address types includes, but is not limited to, the following: the address management apparatus verifies whether the segmented address data includes address data corresponding to all of the multiple address types; if not, the address management apparatus determines the address type not included in the segmented address data, and completes the address data of that address type based on historical data; if so, the address management apparatus does not need to complete any address data.

For example, when the address data segmented by the address management device based on the plurality of address types includes address data corresponding to the province, the district/county, the development zone, the main road, the main road number, the branch road, the branch road number, and the unit (floor), the address management apparatus determines that the segmented address data does not include address data corresponding to all of the plurality of address types, and completes, based on historical data, the address data corresponding to the city, the township (street office), the iconic POI (real estate, etc.), the building, and the room number.

In the embodiment of the present application, a completion verification module may be configured in the address management apparatus. The completion verification module verifies whether the segmented address data includes address data corresponding to all of the multiple address types; if not, it determines the address type not included in the segmented address data, and completes the address data of that address type based on historical data; if so, no address data needs to be completed.

The address data contains a large amount of incorrect address data. For example, the correct address data is: Xiao Post Office, 2nd Floor, Block B, West Lake International Technology Building, No. 391 Wen Er Road, Hangzhou, while users fill in non-standard or incorrect address data such as the following: Hangzhou Xiao Post Office, 2nd Floor, No. 391, Wenji Road, Wenzhou (the community is missing); Xiao Post Office, 2nd Floor, Block B, West Lake International Technology Building, Wen Er Road, Hangzhou (the house number is missing); Xiao Post Office, 2nd Floor, Block B, West Lake International Technology Building, No. 380, Wen Er Road, Hangzhou (the house number is wrong). The completion verification module handles these situations during address data processing by completing and correcting the house number or community field of the segmented address data.

The completion relies on a structured address standard library, which can be built as follows. Each piece of address data is structured by the segmentation algorithm into: city + district + road + house number + community. For addresses in which all 5 fields are present, the frequency of the full address is counted, and addresses with a frequency greater than 3 are selected. The frequency of use of each community under the same city + district + road + house number is then counted, and the city + district/county + road + house number + community combination with the highest frequency is retained and added to the structured address standard library. Alternatively, each piece of address data can be structured by the segmentation algorithm into: city + road + house number + community. For addresses in which all 4 fields are present, the frequency of the full address is counted, and addresses with a frequency greater than or equal to 1 are selected. The frequency of use of each community under the same city + road + house number is then counted, and the city + road + house number + community combination with the highest frequency is retained and added to the structured address standard library.
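The construction of the standard library for the five-field case can be sketched with frequency counters. The sketch reads "frequency greater than 3" as "keep full addresses that occur more than 3 times", which is an interpretation; the function name, the tuple layout, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_standard_library(history, min_frequency=3):
    """Map (city, district, road, house_number) to its most frequent community.

    history: iterable of (city, district, road, house_number, community)
    tuples in which all five fields are present. Only full addresses seen
    more than min_frequency times are taken into account.
    """
    address_counts = Counter(history)
    per_key = defaultdict(Counter)
    for (city, district, road, number, community), count in address_counts.items():
        if count > min_frequency:
            per_key[(city, district, road, number)][community] += count
    return {key: communities.most_common(1)[0][0]
            for key, communities in per_key.items()}


KEY = ("Hangzhou", "Xihu", "Wen Er Road", "391")
HISTORY = (
    [KEY + ("West Lake International Technology Building",)] * 5
    + [KEY + ("West Lake International",)] * 4
    + [("Hangzhou", "Xihu", "Gudun Road", "1", "Rare Community")] * 2
)
library = build_standard_library(HISTORY)
```

In this made-up history, the address seen only twice is dropped, and the key keeps the community that was used most often.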

Based on the structured address standard library, in the process of completing and correcting address data, it is assumed that there is only one community under a given city + district + road + house number. For each piece of structured address data, if the community field is null (empty), or was obtained by rule segmentation or by the self-learning dictionary word segmenter, the community can be queried from the structured address standard library using city + district + road + house number as the key, and the community field is completed or corrected accordingly. Similarly, it is assumed that there is only one house number under a given city + district/county + road + community. For each piece of structured address data, if the house number is null, or was obtained by rule segmentation or by the self-learning dictionary word segmenter, the house number can be queried from the structured address standard library using city + district/county + road + community as the key, and the house number field is completed or corrected accordingly.
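The lookup for the community field can be sketched as below; the house number case is symmetric with a library keyed by city + district/county + road + community. The dictionary layout, the `community_source` labels, and the function name are assumptions made for this example.

```python
# Hypothetical labels for how the community field was obtained; values from
# rule segmentation or the self-learning dictionary are treated as unreliable.
UNRELIABLE_SOURCES = {"rule", "self_learning"}


def complete_community(address, library):
    """Fill or correct the community field from the standard library.

    address: dict with keys city, district, road, house_number, community,
    and community_source. library maps (city, district, road, house_number)
    to the single community assumed to exist under that key.
    """
    key = (address["city"], address["district"], address["road"], address["house_number"])
    needs_lookup = (
        address.get("community") is None
        or address.get("community_source") in UNRELIABLE_SOURCES
    )
    if needs_lookup and key in library:
        return {**address, "community": library[key], "community_source": "standard_library"}
    return address


LIBRARY = {("Hangzhou", "Xihu", "Wen Er Road", "391"): "West Lake International Technology Building"}
incomplete = {"city": "Hangzhou", "district": "Xihu", "road": "Wen Er Road",
              "house_number": "391", "community": None, "community_source": None}
completed = complete_community(incomplete, LIBRARY)
```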

In the embodiment of the present application, the process in which the address management apparatus normalizes the address data after the completion verification includes, but is not limited to, the following: the address management apparatus normalizes the address data after the completion verification by using the pinyin similarity algorithm; and/or the address management apparatus normalizes the address data after the completion verification by using the POI normalization algorithm based on the probability retrieval model.

In the embodiment of the present application, a normalization module may be configured in the address management apparatus. The normalization module normalizes the address data after the completion verification by using the pinyin similarity algorithm; and/or normalizes the address data after the completion verification by using the POI normalization algorithm based on the probability retrieval model.

The address data filled in by users exhibits many non-standard phenomena, such as short names, abbreviations, typos, and homophones. For example: the standard address data is West Lake International Technology Building and the non-normalized address data is West Lake International (short name); the standard address data is the First Affiliated Hospital of Zhejiang University and the non-normalized address data is Zhejiang University First Affiliated Hospital (abbreviation); the standard address data is Gudun Road and the non-normalized address data is Guteng Road (homophone); the standard address data is Baoshu Road and the non-normalized address data is Baojiao Road (typo). Although such address data can be segmented during address structuring, the existence of multiple names for one place causes great difficulty in address coordinate labeling and subsequent address data analysis, so the normalization module needs to normalize the non-normalized address data.

Further, the normalization module performs normalization processing on the non-normalized address data, including but not limited to: a pinyin similarity algorithm and a POI normalization algorithm based on a probability retrieval model.

For the pinyin similarity algorithm: the normalization module converts the non-normalized address data and the normalized address data into pinyin, calculates a similarity distance (such as the minimum edit distance), and takes the normalized address data whose similarity is above a threshold and is the highest as the standard address data for the non-normalized address data.
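A minimal sketch of this idea follows. It assumes a caller-supplied `to_pinyin` function (here a tiny hypothetical lookup table rather than a real pinyin library), turns the minimum edit distance into a similarity ratio between 0 and 1, and uses an arbitrary threshold of 0.8; none of these choices come from the source, and the example road names are made up.

```python
def edit_distance(a, b):
    """Minimum edit (Levenshtein) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length."""
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest


def normalize_by_pinyin(candidate, standards, to_pinyin, threshold=0.8):
    """Return the most pinyin-similar standard name if it clears the threshold."""
    target = to_pinyin(candidate)
    best = max(standards, key=lambda s: similarity(target, to_pinyin(s)), default=None)
    if best is not None and similarity(target, to_pinyin(best)) > threshold:
        return best
    return candidate


# Hypothetical character-to-pinyin table, just large enough for the example.
PINYIN = {"古": "gu", "墩": "dun", "敦": "dun", "路": "lu",
          "文": "wen", "二": "er", "三": "san"}


def to_pinyin(text):
    return "".join(PINYIN[char] for char in text)


STANDARDS = ["古墩路", "文二路"]
fixed = normalize_by_pinyin("古敦路", STANDARDS, to_pinyin)
```

Because the two spellings share the same pinyin, the homophone maps to the standard name, while a name that is merely somewhat similar stays unchanged.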

For the POI normalization algorithm based on the probability retrieval model: the normalization module splits the identified POI into bigrams and then, for each bigram that appears in both the POI and a candidate standard POI, accumulates the estimate of that bigram. The sum of the bigram estimates measures the correlation between the candidate standard POI and the POI to be normalized. Further, the correlation score of each candidate POI is calculated, the candidates are sorted by score from largest to smallest, and the highest-scoring POI whose POI type and district correspond to the address type and the district of the address is selected as the standard POI.
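The bigram splitting and score accumulation can be sketched as follows. The per-bigram estimate is passed in as a function `weight`; the uniform weight used in the example is a placeholder, not the estimate of the described method, and the function names and sample POI strings are hypothetical. Filtering by POI type and district is omitted.

```python
def bigrams(text):
    """Split a string into overlapping two-character pieces."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def score(query_poi, candidate_poi, weight):
    """Sum the estimates of the bigrams shared by the query and the candidate."""
    shared = set(bigrams(query_poi)) & set(bigrams(candidate_poi))
    return sum(weight(bigram) for bigram in shared)


def rank_candidates(query_poi, candidates, weight):
    """Sort candidates by score, largest first."""
    return sorted(candidates, key=lambda c: score(query_poi, c, weight), reverse=True)


def uniform(bigram):
    # Placeholder estimate: every shared bigram counts as 1.0.
    return 1.0


QUERY = "西湖国际"
CANDIDATES = ["西湖文化广场", "西湖国际科技大厦"]
ranked = rank_candidates(QUERY, CANDIDATES, uniform)
```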

To implement the above process, the following BM25 calculation formulas, which are based on the binary independence model, can be used:

The relevant parameters of the above four formulas are as follows:

                 Related POI    Irrelevant POI           Number of POIs
b_i = 1          r_i            n_i - r_i                n_i
b_i = 0          R - r_i        (N - R) - (n_i - r_i)    N - n_i
Number of POIs   R              N - R                    N

Further: S is the correlation score of the candidate POI; N is the number of POIs in one city or district; R is the number of related POIs, that is, POIs that share two identical bigrams and whose Jaccard similarity coefficient is greater than 0.4; n_i is the number of POIs containing the bigram b_i; dl is the number of bigrams in the current candidate standard POI; avdl is the average number of bigrams contained in each candidate standard POI; r_i is the number of related POIs among the n_i POIs; index_i is the ordinal position at which b_i appears in the current POI; avgindex_i is the average ordinal position at which b_i appears in the POIs that contain it; k and b are freely adjustable parameters, set empirically to k = 1.2 and b = 0.75; K and I are temporary variables in the formulas.
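For orientation only, the textbook form of BM25 with relevance information uses the same contingency-table counts N, R, n_i and r_i and the same length normalization with dl, avdl, k and b. The block below shows that textbook form; it is not a reconstruction of the four formulas referred to above. The symbol f_i (the number of times the bigram b_i occurs in the candidate POI) is an assumption introduced here, and the position-based quantities index_i, avgindex_i and I have no counterpart in it.

```latex
% Relevance weight of bigram b_i from the contingency table
w_i = \log \frac{(r_i + 0.5) \,/\, (R - r_i + 0.5)}
                {(n_i - r_i + 0.5) \,/\, (N - n_i - R + r_i + 0.5)}

% Length normalization of the candidate POI
K = k \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)

% Textbook score: sum over the bigrams shared with the candidate POI
S_{\text{textbook}} = \sum_{i} w_i \cdot \frac{(k + 1)\, f_i}{K + f_i}
```

With k = 1.2 and b = 0.75, as given above, longer candidate POIs (larger dl relative to avdl) receive a larger K and therefore a smaller contribution per shared bigram.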

Compared with the prior art, the embodiments of the present application have at least the following advantages: by setting a structured address format including multiple address types and generating structured address data conforming to the structured address format, normalized and standardized address data is generated, which solves the problem that text addresses cannot be normalized, makes it possible to determine the similarities and differences between different text addresses, and makes it possible to identify the attribution of text addresses. Specifically, address data is identified and extracted from massive historical text addresses, the knowledge and rules relating the address data are learned, and the learned knowledge and rules are used to complete missing address data, verify erroneous address data, and normalize non-canonical address data, so that hierarchical structured address data is regenerated.

Based on the same application concept as the above method, an address management apparatus is further provided in the embodiment of the present application. As shown in FIG. 2, the address management apparatus specifically includes:

an obtaining module 11, configured to obtain original address data input by a user;

a determining module 12, configured to determine a structured address format including a plurality of address types;

a processing module 13, configured to convert the original address data into structured address data conforming to the structured address format, where the structured address data includes address data corresponding to the multiple address types.

The processing module 13 specifically includes: a pre-processing sub-module 131, configured to pre-process the original address data based on the multiple address types; a segmentation sub-module 132, configured to segment the pre-processed address data based on the multiple address types; a completion sub-module 133, configured to perform a completion check on the segmented address data based on the multiple address types; and a normalization sub-module 134, configured to normalize the address data after the completion check, to obtain structured address data conforming to the structured address format.

The pre-processing sub-module 131 is specifically configured to filter out, from the original address data, address data that does not correspond to the multiple address types, delete the filtered-out address data from the original address data, and convert address data in a non-canonical format in the original address data into address data in a canonical format.
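The pre-processing described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of the application: the noise patterns in `NOISE_PATTERNS` and the function name `preprocess` are hypothetical, and full-width to half-width folding through Unicode NFKC normalization is only one assumed example of converting a non-canonical format into a canonical one.

```python
import re
import unicodedata

# Hypothetical patterns for content that corresponds to no address type.
NOISE_PATTERNS = [
    re.compile(r"\d{11}"),      # an 11-digit run, assumed here to be a phone number
    re.compile(r"[!?*#~]+"),    # stray symbols
]


def preprocess(raw):
    """Filter non-address content, then convert the rest to a canonical form."""
    text = raw
    # Step 1: filter out and delete data that matches no address type.
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    # Step 2: canonicalize; NFKC folds full-width digits and letters to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Whitespace carries no address information in this sketch, so drop it.
    return re.sub(r"\s+", "", text)
```

Under these assumptions, an input such as "文三路４７８号 13800000000!!" is reduced to "文三路478号".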

The segmentation sub-module 132 is specifically configured to obtain the word breaker dictionaries corresponding to the multiple address types, and to use these word breaker dictionaries to segment the pre-processed address data into address data corresponding to the multiple address types.
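One way such dictionary-driven segmentation could work is sketched below. The application does not specify the matching strategy, so greedy longest-match scanning is an assumption made here for illustration, and the dictionary contents in `DICTIONARIES` and the function name `segment` are hypothetical.

```python
# Hypothetical word breaker dictionaries, one per address type.
DICTIONARIES = {
    "province": {"浙江省"},
    "city": {"杭州市"},
    "district": {"西湖区"},
    "road": {"文三路"},
}


def segment(text):
    """Greedy longest-match segmentation against the per-type dictionaries.

    Returns (fields, leftover): fields maps address type to the matched text,
    and leftover holds the characters that matched no dictionary entry.
    """
    fields = {}
    leftover = []
    pos = 0
    while pos < len(text):
        best_type, best_word = None, ""
        for addr_type, words in DICTIONARIES.items():
            for word in words:
                if text.startswith(word, pos) and len(word) > len(best_word):
                    best_type, best_word = addr_type, word
        if best_type is None:
            leftover.append(text[pos])  # no entry starts here; keep the character
            pos += 1
        else:
            fields.setdefault(best_type, best_word)  # keep the first match per type
            pos += len(best_word)
    return fields, "".join(leftover)
```

With these toy dictionaries, "浙江省杭州市文三路478号" yields province, city and road fields and leaves "478号" unmatched, which also shows that the district is missing and needs to be completed.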

The completion sub-module 133 is specifically configured to check whether the segmented address data already includes address data corresponding to each of the multiple address types; and if not, to determine the address type that is not included in the segmented address data and complete the address data of that address type based on historical data.
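The completion check can be illustrated with the following Python sketch. It assumes, for illustration only, that the historical data is a list of already structured records and that a missing address type is filled with the most frequent value among the historical records that agree with every field already present; the names `ADDRESS_TYPES` and `complete` are hypothetical.

```python
from collections import Counter

# Hypothetical set of address types in the structured address format.
ADDRESS_TYPES = ("province", "city", "district", "road")


def complete(segmented, history):
    """Fill in address types missing from segmented using historical records."""
    completed = dict(segmented)
    for addr_type in ADDRESS_TYPES:
        if addr_type in completed:
            continue  # this address type is already present
        # Vote among historical records consistent with all fields known so far.
        votes = Counter(
            record[addr_type]
            for record in history
            if addr_type in record
            and all(record.get(key) == value for key, value in completed.items())
        )
        if votes:
            completed[addr_type] = votes.most_common(1)[0][0]
    return completed
```

A type for which no consistent historical record exists is simply left missing in this sketch; how such cases are handled in practice is not stated in the application.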

The normalization sub-module 134 is specifically configured to normalize the address data after the completion check by using a pinyin similarity algorithm; and/or to normalize the address data after the completion check by using a POI normalization algorithm based on a probabilistic retrieval model.
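To illustrate normalization with a probabilistic retrieval model, the following Python sketch ranks candidate standard POIs against a user-entered POI using the standard Okapi BM25 formula over character bigrams. This is a substituted, simplified technique and not the scoring formula of the application, which additionally uses related-POI counts and bigram positions. The function names are hypothetical; k = 1.2 and b = 0.75 follow the empirical values given in the description.

```python
import math


def bigram_list(text):
    """Adjacent character pairs of text, in order (assumed bigram definition)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bm25_scores(query, pois, k=1.2, b=0.75):
    """Standard Okapi BM25 score of each POI in pois for the query."""
    if not pois:
        raise ValueError("pois must not be empty")
    docs = [bigram_list(poi) for poi in pois]
    n_docs = len(docs)
    avdl = sum(len(doc) for doc in docs) / n_docs or 1.0  # avoid division by zero
    query_terms = set(bigram_list(query))
    scores = []
    for doc in docs:
        norm = k * ((1 - b) + b * len(doc) / avdl)  # length normalization term
        score = 0.0
        for term in query_terms:
            n_i = sum(1 for other in docs if term in other)  # POIs containing term
            tf = doc.count(term)
            if n_i == 0 or tf == 0:
                continue
            idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5) + 1.0)
            score += idf * tf * (k + 1) / (tf + norm)
        scores.append(score)
    return scores


def normalize_poi(query, pois):
    """Return the candidate standard POI with the highest BM25 score."""
    scores = bm25_scores(query, pois)
    best = max(range(len(pois)), key=scores.__getitem__)
    return pois[best]
```

In this sketch a non-standard entry such as "西湖文化广场A座" would be mapped to the standard POI "西湖文化广场" when the candidates also include "武林广场" and "西湖银泰城", because it shares the most and the rarest bigrams with that candidate.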

The modules of the apparatus of the present application may be integrated into one unit or deployed separately. The above modules may be combined into one module or further split into multiple sub-modules.

Through the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.

Those skilled in the art can understand that the drawings are only schematic diagrams of a preferred embodiment, and that the modules or processes in the drawings are not necessarily required to implement the present application. Those skilled in the art can also understand that the modules of the apparatus in an embodiment may be distributed in the apparatus as described in the embodiment, or may be changed accordingly and located in one or more apparatuses different from that of the embodiment. The modules of the above embodiments may be combined into one module or further split into multiple sub-modules. The serial numbers of the embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

The above disclosure covers only a few specific embodiments of the present application. The present application is not limited thereto, and any change that a person skilled in the art can conceive of shall fall within the protection scope of the present application.

Claims

1. A method for managing address data, characterized in that the method comprises the following steps:

The address management device obtains original address data input by the user;

The address management device determines a structured address format including a plurality of address types;

The address management device converts the original address data into structured address data conforming to the structured address format, the structured address data including address data corresponding to a plurality of address types.
2. The method according to claim 1, wherein the address management device converting the original address data into structured address data conforming to the structured address format specifically includes:

The address management device pre-processes the original address data based on the multiple address types;

The address management device segments the pre-processed address data based on the multiple address types;

The address management device performs a completion check on the segmented address data based on the multiple address types;

The address management device normalizes the address data after the completion check to obtain structured address data conforming to the structured address format.
3. The method according to claim 2, wherein the address management device pre-processing the original address data based on the multiple address types specifically includes:

The address management device filters out, from the original address data, address data that does not correspond to the multiple address types, deletes the filtered-out address data from the original address data, and converts address data in a non-canonical format in the original address data into address data in a canonical format.
4. The method according to claim 2, wherein the address management device segmenting the pre-processed address data based on the multiple address types specifically includes:

The address management device obtains the word breaker dictionaries corresponding to the multiple address types, and uses these word breaker dictionaries to segment the pre-processed address data into address data corresponding to the multiple address types.
5. The method according to claim 2, wherein the address management device performing a completion check on the segmented address data based on the multiple address types specifically includes:

The address management device checks whether the segmented address data already includes address data corresponding to each of the multiple address types; if not, the address management device determines the address type that is not included in the segmented address data, and completes the address data of that address type based on historical data.
6. The method according to claim 2, wherein the address management device normalizing the address data after the completion check specifically includes:

The address management device normalizes the address data after the completion check by using a pinyin similarity algorithm; and/or the address management device normalizes the address data after the completion check by using a POI normalization algorithm based on a probabilistic retrieval model.
7. An address management device, characterized in that the address management device specifically includes:

an obtaining module, configured to obtain original address data input by a user;

a determining module, configured to determine a structured address format including a plurality of address types;

and a processing module, configured to convert the original address data into structured address data conforming to the structured address format, where the structured address data includes address data corresponding to the plurality of address types.
8. The address management device according to claim 7, wherein the processing module specifically includes:

a pre-processing sub-module, configured to pre-process the original address data based on the multiple address types;

a segmentation sub-module, configured to segment the pre-processed address data based on the multiple address types;

a completion sub-module, configured to perform a completion check on the segmented address data based on the multiple address types;

and a normalization sub-module, configured to normalize the address data after the completion check to obtain structured address data conforming to the structured address format.
9. The address management device according to claim 8, wherein

The pre-processing sub-module is specifically configured to filter out, from the original address data, address data that does not correspond to the multiple address types, delete the filtered-out address data from the original address data, and convert address data in a non-canonical format in the original address data into address data in a canonical format.
10. The address management device according to claim 8, wherein

The segmentation sub-module is specifically configured to obtain the word breaker dictionaries corresponding to the multiple address types, and to use these word breaker dictionaries to segment the pre-processed address data into address data corresponding to the multiple address types.
11. The address management device according to claim 8, wherein

The completion sub-module is specifically configured to check whether the segmented address data already includes address data corresponding to each of the multiple address types; and if not, to determine the address type that is not included in the segmented address data and complete the address data of that address type based on historical data.
12. The address management device according to claim 8, wherein

The normalization sub-module is specifically configured to normalize the address data after the completion check by using a pinyin similarity algorithm; and/or to normalize the address data after the completion check by using a POI normalization algorithm based on a probabilistic retrieval model.