CN112487122B - Address normalization processing method and device - Google Patents

Info

Publication number
CN112487122B
Authority
CN
China
Prior art keywords
address
original
network
data source
longitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011397609.9A
Other languages
Chinese (zh)
Other versions
CN112487122A (en)
Inventor
王乐斐
梁相军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tenth Research Institute Of Telecommunications Technology Co ltd
Original Assignee
Tenth Research Institute Of Telecommunications Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tenth Research Institute Of Telecommunications Technology Co ltd filed Critical Tenth Research Institute Of Telecommunications Technology Co ltd
Priority to CN202011397609.9A priority Critical patent/CN112487122B/en
Publication of CN112487122A publication Critical patent/CN112487122A/en
Application granted granted Critical
Publication of CN112487122B publication Critical patent/CN112487122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G06F16/21 Design, administration or maintenance of databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to an address normalization processing method and device. The method comprises: acquiring a plurality of address data sources, each comprising a plurality of original addresses corresponding to a target address; performing preliminary treatment on each original address in each address data source to obtain a preliminarily treated original address; performing network crawling on each preliminarily treated original address to obtain the corresponding network crawling longitude and latitude; and performing deep treatment on each preliminarily treated original address according to the network crawling longitude and latitude to obtain a normalized standard place name address library. By precisely matching longitude and latitude across multiple data sources and unifying address information through address weight normalization, the method and device address the low processing efficiency and low accuracy of prior-art address normalization analysis and improve an application system's ability to use address information.

Description

Address normalization processing method and device
Technical Field
The invention relates to the technical field of data processing, in particular to an address normalization processing method and device.
Background
In practice, an address can usually be identified accurately by a conventional name, yet a single address may also have several designations at once, such as abbreviations, common names, old and new names, inconsistent element order, or a landmark used as the address. As a result, one piece of address information can go by multiple names and even take different forms in different applications, which causes considerable interference when it is analyzed. Unifying address information through address normalization analysis therefore improves an application's ability to analyze addresses.
Currently, the widely used analysis algorithms are mainly rule-based matching methods and statistics-based methods. A rule-based matching method extracts the address elements from the address information, such as province, city and street, and matches them against geographic information under corresponding rules to obtain an accurate, machine-recognizable address. However, as the body of rule knowledge grows, it creates a processing bottleneck and processing efficiency drops. A statistics-based method does not require extensive linguistic knowledge; it uses a model to compute the address that has the highest probability of matching, and the highest degree of coincidence with, the target address, thereby achieving address normalization analysis. However, when new addresses or place names appear, or place names change, the accuracy of this method falls sharply.
It is noted that this section is intended to provide a background or context for the embodiments of the disclosure set forth in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Disclosure of Invention
The embodiments of the invention provide an address normalization processing method and device that at least solve the low processing efficiency and low accuracy of prior-art methods for address normalization analysis.
In a first aspect, an embodiment of the present invention provides an address normalization processing method, including:
acquiring a plurality of address data sources, wherein the address data sources comprise a plurality of original addresses corresponding to target addresses;
performing preliminary treatment on each original address in each address data source to obtain an original address after preliminary treatment;
performing network crawling on each preliminarily treated original address to obtain the network crawling longitude and latitude corresponding to that address;
and performing deep treatment on each preliminarily treated original address according to the network crawling longitude and latitude to obtain a normalized standard place name address library.
As a preferred mode of the first aspect of the present invention, the address data sources include a waybill address data source, a network address data source, and a social resource address data source;
the waybill address data source comprises a plurality of waybill original addresses corresponding to the target address, the network address data source comprises a plurality of network original addresses corresponding to the target address, and the social resource address data source comprises a plurality of social resource original addresses corresponding to the target address.
As a preferred mode of the first aspect of the present invention, if the address data source is a waybill address data source, performing preliminary treatment on each original address in each address data source to obtain a preliminarily treated original address includes:
extracting fields from each waybill original address in the waybill address data source to obtain the waybill province field, waybill city field and waybill detailed address field corresponding to that waybill original address;
verifying, in sequence, the validity and correctness of the waybill province field and the waybill city field, and, after verification passes, processing the special characters contained in the waybill detailed address field to obtain the preliminarily treated waybill original address.
As a preferred mode of the first aspect of the present invention, if the address data source is a network address data source, performing preliminary treatment on each original address in each address data source to obtain a preliminarily treated original address includes:
extracting fields from each network original address in the network address data source to obtain the network longitude field, network latitude field and network Chinese address field corresponding to that network original address;
verifying, in sequence, the validity and correctness of the network province information and the network city information obtained from the network Chinese address field, and, after verification passes, processing the special characters contained in the network Chinese address field to obtain the preliminarily treated network original address.
As a preferred mode of the first aspect of the present invention, if the address data source is a social resource address data source, performing preliminary treatment on each original address in each address data source to obtain a preliminarily treated original address includes:
extracting a field from each social resource original address in the social resource address data source to obtain the social resource Chinese address field corresponding to that social resource original address;
verifying, in sequence, the validity and correctness of the network province information and the network city information obtained from the social resource Chinese address field, and, after verification passes, processing the special characters contained in the social resource Chinese address field to obtain the preliminarily treated social resource original address.
As a preferred mode of the first aspect of the present invention, performing deep treatment on each preliminarily treated original address according to the network crawling longitude and latitude to obtain a normalized standard place name address library includes:
performing deviation correction on the network crawling longitude and latitude to generate the standard coordinate system longitude and latitude corresponding to the network crawling longitude and latitude;
based on the spatial position determined by the standard coordinate system longitude and latitude, counting the number of times the same preliminarily treated original address occurs in each address data source to obtain the data source confidence of each of the different preliminarily treated original addresses in each address data source;
merging and de-duplicating the different preliminarily treated original addresses across the address data sources, and processing the data source confidences of each merged, de-duplicated address to obtain its integrated confidence;
and sorting the merged, de-duplicated preliminarily treated original addresses in descending order of integrated confidence to obtain a normalized standard place name address library.
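The counting, merging and sorting steps above can be sketched as follows. This is a minimal illustration, not the patented computation: the text here does not give the confidence formulas, so the sketch assumes that a data source confidence is an address's share of that source's records at a given location and that the integrated confidence is the sum of those shares across sources. The function name `build_library` and the record layout are hypothetical.

```python
from collections import Counter, defaultdict

def build_library(records):
    """records: iterable of (source, location, address) tuples, where location
    is a hashable key derived from the standard coordinate system lat/lon.

    Returns {location: [(address, integrated_confidence), ...]}, highest first.
    """
    # Count how often each address occurs per (data source, location).
    counts = defaultdict(Counter)
    for source, location, address in records:
        counts[(source, location)][address] += 1

    # Merge across sources; identical addresses collapse into one entry.
    integrated = defaultdict(lambda: defaultdict(float))
    for (source, location), counter in counts.items():
        total = sum(counter.values())
        for address, n in counter.items():
            # Assumed data source confidence: share of this source's records.
            # Assumed integration rule: sum of shares over all sources.
            integrated[location][address] += n / total

    # Sort by integrated confidence, descending (ties broken by address text).
    return {
        loc: sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        for loc, scores in integrated.items()
    }
```

Under these assumptions, an address that dominates several sources at the same location ranks first, and rarer variants remain in the library with lower confidence.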
In a second aspect, an embodiment of the present invention provides an address normalization processing device, including:
an original address acquisition unit, configured to acquire a plurality of address data sources, wherein the address data sources comprise a plurality of original addresses corresponding to target addresses;
an address preliminary treatment unit, configured to perform preliminary treatment on each original address in each address data source to obtain a preliminarily treated original address;
a longitude and latitude network crawling unit, configured to perform network crawling on each preliminarily treated original address to obtain the network crawling longitude and latitude corresponding to that address;
and an address deep treatment unit, configured to perform deep treatment on each preliminarily treated original address according to the network crawling longitude and latitude to obtain a normalized standard place name address library.
As a preferred mode of the second aspect of the present invention, the address deep treatment unit is specifically configured to:
perform deviation correction on the network crawling longitude and latitude to generate the standard coordinate system longitude and latitude corresponding to the network crawling longitude and latitude;
based on the spatial position determined by the standard coordinate system longitude and latitude, count the number of times the same preliminarily treated original address occurs in each address data source to obtain the data source confidence of each of the different preliminarily treated original addresses in each address data source;
merge and de-duplicate the different preliminarily treated original addresses across the address data sources, and process the data source confidences of each merged, de-duplicated address to obtain its integrated confidence;
and sort the merged, de-duplicated preliminarily treated original addresses in descending order of integrated confidence to obtain a normalized standard place name address library.
In a third aspect, an embodiment of the present invention provides a computing device, including a processor and a memory, where the memory stores execution instructions, and the processor reads the execution instructions in the memory to perform the steps described in the address normalization processing method.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium containing computer-executable instructions for performing steps as described in the address normalization processing method described above.
The address normalization processing method and device provided by the embodiments of the invention unify address information by precisely matching longitude and latitude across multiple data sources and applying address weight normalization. This addresses the low processing efficiency and low accuracy of prior-art address normalization analysis and improves an application system's ability to use address information.
The invention resolves the problem of one address having multiple names and even different forms, making Chinese addresses more usable.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of an address normalization processing method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an address normalization processing device according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
Detailed Description
So that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
Referring to fig. 1, an embodiment of the present invention discloses an address normalization processing method, which mainly includes:
101. acquiring a plurality of address data sources, wherein the address data sources comprise a plurality of original addresses corresponding to target addresses;
102. performing preliminary treatment on each original address in each address data source to obtain an original address after preliminary treatment;
103. performing network crawling on each preliminarily treated original address to obtain the network crawling longitude and latitude corresponding to that address;
104. and performing deep treatment on each preliminarily treated original address according to the network crawling longitude and latitude to obtain a normalized standard place name address library.
In step 101, for a target address to be normalized, a plurality of different address data sources are acquired, where each address data source includes a plurality of original addresses corresponding to the target address.
This embodiment does not limit how the address data sources are obtained; a person skilled in the art may obtain several different address data sources according to the actual situation. Each address data source should contain as many original addresses as possible so that the normalization result is more accurate.
In an alternative embodiment provided by the application, the address data sources include a waybill address data source, a network address data source and a social resource address data source; the waybill address data source comprises a plurality of waybill original addresses corresponding to the target address, the network address data source comprises a plurality of network original addresses corresponding to the target address, and the social resource address data source comprises a plurality of social resource original addresses corresponding to the target address.
In this embodiment, three address data sources are preferred: a waybill address data source, a network address data source and a social resource address data source. The address forms of the three data sources differ, so together they cover as many representations of the target address as possible.
The waybill address data source comprises a plurality of waybill original addresses corresponding to the target address; a waybill original address generally comprises three fields: province, city and detailed address. The network address data source comprises a plurality of network original addresses corresponding to the target address; a network original address generally comprises three fields: Chinese address, longitude and latitude. Network original addresses here refer to network LBS addresses and network shopping addresses. The social resource address data source comprises a plurality of social resource original addresses corresponding to the target address; a social resource original address generally comprises only one field, the Chinese address.
In step 102, for the plurality of different address data sources obtained in step 101, each original address included in each address data source is first subjected to preliminary treatment, so as to obtain an original address after the preliminary treatment.
Preferably, if the address data source is a waybill address data source, step 102 specifically includes the following steps:
extracting fields from each waybill original address in the waybill address data source to obtain the waybill province field, waybill city field and waybill detailed address field corresponding to that waybill original address;
verifying, in sequence, the validity and correctness of the waybill province field and the waybill city field, and, after verification passes, processing the special characters contained in the waybill detailed address field to obtain the preliminarily treated waybill original address.
In this method, each waybill original address in the waybill address data source comprises a waybill province field, a waybill city field and a waybill detailed address field, and these three fields are extracted from each waybill original address.
First, the validity and correctness of the waybill province field are verified against a data dictionary of all provinces in China. If verification fails, province information is extracted from the waybill detailed address field and verified in the same way; if that verification passes, the waybill original address data is retained, and if it fails, the data is discarded.
Next, the validity and correctness of the waybill city field are verified against the data dictionary of cities belonging to that province. If verification fails, city information is extracted from the waybill detailed address field and verified in the same way; if that verification passes, the waybill original address data is retained, and if it fails, the data is discarded.
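The two-stage dictionary check with fallback to the detailed address can be sketched as below. The dictionary contents, the substring-based extraction and the names `PROVINCE_CITIES`, `find_in_text` and `validate_waybill` are illustrative assumptions; the text does not specify how province or city information is extracted from the detailed address.

```python
from typing import Optional

# Tiny illustrative data dictionary (assumption); a real one would list
# every province in China and the cities belonging to each.
PROVINCE_CITIES = {
    "陕西省": {"西安市", "咸阳市"},
    "广东省": {"广州市", "深圳市"},
}

def find_in_text(candidates, text: str) -> Optional[str]:
    """Return the first candidate name that occurs in text, else None."""
    for name in sorted(candidates):
        if name in text:
            return name
    return None

def validate_waybill(province: str, city: str, detail: str) -> Optional[tuple]:
    """Return (province, city, detail) if both fields verify, else None (discard)."""
    # Stage 1: province field, falling back to the detailed address.
    if province not in PROVINCE_CITIES:
        province = find_in_text(PROVINCE_CITIES, detail)
        if province is None:
            return None
    # Stage 2: city field, checked against the cities of that province.
    cities = PROVINCE_CITIES[province]
    if city not in cities:
        city = find_in_text(cities, detail)
        if city is None:
            return None
    return province, city, detail
```

Checking the city against the dictionary of its own province, rather than a flat city list, is what also catches a province and city that do not belong together.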
Then, after verification passes, the special characters contained in the waybill detailed address field are processed as follows:
contradictory content in the waybill detailed address is handled, for example one piece of address data containing information for several provinces, cities or counties;
whether the province, city and county in the waybill detailed address have a subordinate relationship is checked to decide whether the address data is retained;
runs of more than 5 consecutive digits are removed from the waybill detailed address, since a postal code has 6 digits, a mobile phone number 11 digits and a landline number 8 digits;
special symbols in the waybill detailed address, such as (), [ ], { } and the like, are processed by deleting the symbols together with the content they enclose;
if the waybill detailed address contains words such as "please", "freighted" or "telephone", those words and everything after them are deleted.
Finally, the verified waybill province field and waybill city field are combined with the processed waybill detailed address field to obtain the preliminarily treated waybill original address. The preliminarily treated waybill original address must also be verified for validity and correctness to ensure that the province, city and district/county attribution is correct.
After preliminary treatment, some of the waybill original addresses in the waybill address data source have been discarded or modified, and the addresses that remain in the waybill address data source are all preliminarily treated waybill original addresses.
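The digit, symbol and trailing-note rules for the detailed address field can be sketched with regular expressions. This is a hedged illustration: the full-width bracket forms, the English marker words in `NOTE_WORDS` (stand-ins for whatever words the original data uses) and the function name `clean_detail` are assumptions, and the contradiction and subordination checks are not covered.

```python
import re

# Paired symbols whose enclosed content is dropped. The full-width forms
# are an assumption about the intended symbol list.
BRACKETED = re.compile(
    r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|（[^（）]*）|【[^【】]*】"
)
# Runs of more than 5 consecutive digits (postal codes, phone numbers).
LONG_DIGITS = re.compile(r"\d{6,}")
# Hypothetical stand-ins for the delivery-note marker words.
NOTE_WORDS = ("please", "freighted", "telephone")

def clean_detail(detail: str) -> str:
    """Apply the digit, symbol and trailing-note rules to a detailed address."""
    detail = LONG_DIGITS.sub("", detail)
    detail = BRACKETED.sub("", detail)
    # Cut at the earliest marker word, dropping it and everything after it.
    hits = [i for i in (detail.find(w) for w in NOTE_WORDS) if i != -1]
    cut = min(hits, default=len(detail))
    return detail[:cut].strip()
```

House numbers survive because they are shorter than six digits, while postal codes and phone numbers are stripped.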
Preferably, if the address data source is a network address data source, step 102 specifically includes the following steps:
extracting fields from each network original address in the network address data source to obtain the network longitude field, network latitude field and network Chinese address field corresponding to that network original address;
verifying, in sequence, the validity and correctness of the network province information and the network city information obtained from the network Chinese address field, and, after verification passes, processing the special characters contained in the network Chinese address field to obtain the preliminarily treated network original address.
In this method, each network original address in the network address data source comprises a network longitude field, a network latitude field and a network Chinese address field, and these three fields are extracted from each network original address.
First, province information is extracted from the network Chinese address field and its validity and correctness are verified against a data dictionary of all provinces in China; if verification passes, the network original address data is retained, and if it fails, the data is discarded.
Next, city information is extracted from the network Chinese address field and its validity and correctness are verified against the data dictionary of cities belonging to that province; if verification passes, the network original address data is retained, and if it fails, the data is discarded.
Then, after verification passes, the special characters contained in the network Chinese address field are processed as follows:
contradictory content in the network Chinese address is handled, for example one piece of address data containing information for several provinces, cities or counties;
whether the province, city and county in the network Chinese address have a subordinate relationship is checked to decide whether the address data is retained;
runs of more than 5 consecutive digits are removed from the network Chinese address, since a postal code has 6 digits, a mobile phone number 11 digits and a landline number 8 digits;
special symbols in the network Chinese address, such as (), [ ], { } and the like, are processed by deleting the symbols together with the content they enclose;
if the network Chinese address contains words such as "please", "freighted" or "telephone", those words and everything after them are deleted.
Finally, the verified and processed network Chinese address is taken as the preliminarily treated network original address. The preliminarily treated network original address must also be verified for validity and correctness to ensure that the province, city and district/county attribution is correct.
After preliminary treatment, some of the network original addresses in the network address data source have been discarded or modified, and the addresses that remain in the network address data source are all preliminarily treated network original addresses.
Preferably, if the address data source is a social resource address data source, step 102 specifically includes the following steps:
extracting a field from each social resource original address in the social resource address data source to obtain the social resource Chinese address field corresponding to that social resource original address;
verifying, in sequence, the validity and correctness of the network province information and the network city information obtained from the social resource Chinese address field, and, after verification passes, processing the special characters contained in the social resource Chinese address field to obtain the preliminarily treated social resource original address.
In this method, each social resource original address in the social resource address data source comprises a social resource Chinese address field, and this field is extracted from each social resource original address.
The verification and processing of a social resource original address are similar to those of a network original address; for the specific implementation, refer to the description above, which is not repeated here.
Finally, the verified and processed social resource original address is taken as the preliminarily treated social resource original address. The preliminarily treated social resource original address must also be verified for validity and correctness to ensure that the province, city and district/county attribution is correct.
After preliminary treatment, some of the social resource original addresses in the social resource address data source have been discarded or modified, and the addresses that remain in the social resource address data source are all preliminarily treated social resource original addresses.
In step 103, network crawling is performed on the preliminarily treated original addresses in the address data sources obtained in step 102, yielding the network crawling longitude and latitude corresponding to each preliminarily treated original address.
Because there are many data sources and the data volume is large, this embodiment performs network crawling in groups, processing several groups of preliminarily treated original address data concurrently.
For each group, a crawling pointer is recorded that marks how far network crawling has progressed within that group. The next record indicated by the pointer is crawled using its detailed address or Chinese address to obtain the corresponding network crawling longitude and latitude. For a preliminarily treated waybill original address, the detailed address field is used for crawling; for a preliminarily treated network original address or social resource original address, the Chinese address field is used.
Note that for a preliminarily treated network original address, the network longitude field and network latitude field were also extracted during field extraction; these two fields are retained alongside the network crawling longitude and latitude obtained by crawling.
Network crawling may use Gaode Map (AMap), Baidu Map or another positioning application or web page. This embodiment does not limit the specific crawling process; those skilled in the art may crawl longitude and latitude according to the actual situation.
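Grouped, concurrent crawling with a per-group pointer can be sketched as follows. The lookup itself is deliberately left abstract: `geocode` is a hypothetical callable supplied by the caller (for example a wrapper around a map service), and the names `crawl_group`, `crawl_all`, `group_size` and `workers` are illustrative assumptions rather than anything named in the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

LatLon = Tuple[float, float]

def crawl_group(addresses: List[str],
                geocode: Callable[[str], Optional[LatLon]]) -> Dict[str, LatLon]:
    """Crawl one group in order, advancing a pointer through its records."""
    results: Dict[str, LatLon] = {}
    pointer = 0  # index of the next record to crawl in this group
    while pointer < len(addresses):
        address = addresses[pointer]
        coords = geocode(address)  # hypothetical lookup supplied by the caller
        if coords is not None:
            results[address] = coords
        pointer += 1
    return results

def crawl_all(addresses: List[str],
              geocode: Callable[[str], Optional[LatLon]],
              group_size: int = 1000,
              workers: int = 4) -> Dict[str, LatLon]:
    """Split the addresses into groups and crawl the groups concurrently."""
    groups = [addresses[i:i + group_size]
              for i in range(0, len(addresses), group_size)]
    merged: Dict[str, LatLon] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda g: crawl_group(g, geocode), groups):
            merged.update(partial)
    return merged
```

Persisting each group's pointer would let an interrupted crawl resume from the last processed record; that persistence is omitted here.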
In step 104, using the network crawling longitude and latitude obtained in step 103 for each preliminarily treated original address in the address data sources, deep treatment is performed on each preliminarily treated original address, finally yielding the normalized standard place name address library.
In an alternative embodiment provided by the present application, step 104 may be implemented as follows:
1041. Perform deviation correction on the crawled longitude and latitude to generate the corresponding standard coordinate system longitude and latitude.
In this step, because the crawled longitudes and latitudes obtained for the preliminarily treated original addresses in the multiple address data sources do not all belong to the same coordinate system, deviation correction is performed first so that every crawled longitude and latitude is uniformly converted to a standard coordinate system longitude and latitude. The standard coordinate system is preferably the WGS-84 coordinate system.
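One way to organize the deviation correction is a registry that maps each source coordinate system to a converter returning WGS-84 coordinates. The sketch below shows only that dispatch structure; the names `to_standard`, `CONVERTERS` and the coordinate-system labels are assumptions, and the actual conversion formulas for any offset coordinate system are not given in the text and must be supplied by the implementer.

```python
from typing import Callable, Dict, Tuple

LonLat = Tuple[float, float]
Converter = Callable[[float, float], LonLat]


def identity(lon: float, lat: float) -> LonLat:
    """Coordinates already in the standard system need no correction."""
    return lon, lat


# Hypothetical registry: only the trivial WGS-84 entry is filled in here.
# Converters for other coordinate systems would be registered alongside it.
CONVERTERS: Dict[str, Converter] = {"WGS-84": identity}


def to_standard(lon: float, lat: float, source_crs: str,
                converters: Dict[str, Converter] = CONVERTERS) -> LonLat:
    """Convert a crawled coordinate to the standard coordinate system."""
    try:
        convert = converters[source_crs]
    except KeyError:
        raise ValueError(f"no converter registered for {source_crs!r}") from None
    return convert(lon, lat)
```

Failing loudly on an unregistered coordinate system avoids silently mixing uncorrected coordinates into the later distance comparisons.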
Note that, in the preliminary treatment described above, a network longitude field and a network latitude field are also extracted from each network original address in the network address data source; that is, each network original address also carries its original longitude and latitude information. Whether this original information needs deviation correction depends on the actual situation.
For each network original address in the network address data source, the original longitude and latitude information is compared with the corresponding crawled longitude and latitude. If the difference, converted to an actual spatial distance, is less than 1 meter, the crawled longitude and latitude is retained and the original longitude and latitude information is discarded; if the distance is greater than 1 meter, the address record is discarded. This further removes useless interfering data.
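The 1-meter comparison can be illustrated with a great-circle distance. The sketch below uses the haversine formula on a spherical Earth, which is an assumption: the text does not say how the coordinate difference is converted to a distance. The function names are hypothetical, and a distance of exactly 1 meter, which the text leaves unspecified, is treated here as a discard.

```python
import math
from typing import Optional, Tuple

LonLat = Tuple[float, float]
EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def reconcile(original: LonLat, crawled: LonLat,
              threshold_m: float = 1.0) -> Optional[LonLat]:
    """Return the crawled coordinate to keep, or None to discard the record."""
    if haversine_m(*original, *crawled) < threshold_m:
        return crawled  # keep the crawled value, drop the original one
    return None  # positions disagree: discard the address record
```

Both coordinates should be in the same coordinate system before they are compared; otherwise the systematic offset between systems would dominate the distance.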
1042. Based on the spatial position determined by the standard coordinate system longitude and latitude, count the number of times the same preliminarily treated original address appears in each address data source, obtaining the data source confidence of each distinct preliminarily treated original address in each address data source.
In this step, for each kind of address data source, the number of occurrences of the same preliminarily treated original address within that source is counted, and the data source confidence of each distinct preliminarily treated original address in that source is calculated accordingly.
To decide whether two preliminarily treated original addresses are the same, it is first judged whether their address information is identical, and then whether their spatial positions coincide according to their standard coordinate system longitudes and latitudes. If the difference between the two spatial positions is within a certain error range, the two preliminarily treated original addresses are considered the same.
In addition, because data quality differs between address data sources, each occurrence of the same preliminarily treated original address raises the data source confidence by a different amount depending on the source. For example, each occurrence of the same preliminarily treated waybill original address in the waybill address data source raises that address's data source confidence by 1, whereas each occurrence of the same preliminarily treated social resource original address in the social resource address data source raises its data source confidence by 5.
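The counting rule can be sketched as below. The increments for the waybill and social resource sources follow the example in the text; the increment for the network source, the record keys (`text`, `lon`, `lat`), and the tolerance expressed in degrees are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Per-occurrence increments. "waybill" and "social" follow the example in the
# text; the "network" value is an assumed placeholder.
INCREMENTS: Dict[str, int] = {"waybill": 1, "social": 5, "network": 3}


def same_address(a: dict, b: dict, tol_deg: float = 1e-5) -> bool:
    """Same address text and standard-coordinate positions within a tolerance."""
    return (a["text"] == b["text"]
            and abs(a["lon"] - b["lon"]) <= tol_deg
            and abs(a["lat"] - b["lat"]) <= tol_deg)


def source_confidence(records: List[dict], source: str) -> List[Tuple[dict, int]]:
    """Return (representative record, data source confidence) pairs for one source."""
    clusters: List[list] = []  # each item is [representative record, confidence]
    for rec in records:
        for cluster in clusters:
            if same_address(cluster[0], rec):
                cluster[1] += INCREMENTS[source]  # one more occurrence
                break
        else:
            # First time this address is seen in the source.
            clusters.append([rec, INCREMENTS[source]])
    return [(rec, conf) for rec, conf in clusters]
```

The pairwise scan is quadratic in the number of distinct addresses; for large sources a spatial index or a grid-based key would be the more practical choice.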
1043. Integrate and deduplicate the distinct preliminarily treated original addresses from all address data sources, and process the data source confidences of each deduplicated and merged preliminarily treated original address to obtain its integrated confidence.
In this step, the distinct preliminarily treated original addresses contained in the different kinds of address data sources are first integrated and then deduplicated; the data source confidences of each deduplicated and merged address are then processed to obtain the integrated confidence of each distinct preliminarily treated original address.
Preferably, in a specific implementation, a person skilled in the art may set a confidence coefficient for each data source confidence, multiply each data source confidence by its corresponding coefficient, and sum the products to obtain the integrated confidence.
For example, suppose the data source confidence of a preliminarily treated waybill original address is 85 with a confidence coefficient of 0.2, and the data source confidence of an identical preliminarily treated network original address is 51 with a confidence coefficient of 0.5. After the two records are deduplicated and merged, each data source confidence is multiplied by its coefficient and the products are summed, giving an integrated confidence of 85 × 0.2 + 51 × 0.5 = 42.5.
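The weighted sum in the worked example can be written as a short function. The function name and the dictionary keys are hypothetical; the numbers are the ones used in the example.

```python
from typing import Dict


def integrated_confidence(source_confidences: Dict[str, float],
                          coefficients: Dict[str, float]) -> float:
    """Sum of each data source confidence multiplied by its confidence coefficient."""
    return sum(conf * coefficients[src] for src, conf in source_confidences.items())


# Values from the worked example: waybill 85 with coefficient 0.2,
# network 51 with coefficient 0.5.
example = integrated_confidence({"waybill": 85, "network": 51},
                                {"waybill": 0.2, "network": 0.5})
```

With these inputs `example` evaluates to approximately 42.5, in line with the worked example; a source that did not contribute to the merged address is simply absent from `source_confidences`.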
1044. Sort the deduplicated and merged preliminarily treated original addresses in descending order of integrated confidence to obtain the normalized standard place name address library.
In this step, a higher integrated confidence indicates more accurate address data. The deduplicated and merged preliminarily treated original addresses are therefore sorted in descending order of integrated confidence to obtain the normalized standard place name address library, in which each address also corresponds to a spatial position determined by its crawled longitude and latitude.
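The final ordering is a plain descending sort. In this sketch the entry layout (`text`, `lon`, `lat`, `integrated_confidence`) and the function name are assumptions.

```python
from typing import List


def build_library(merged: List[dict]) -> List[dict]:
    """Order merged addresses from highest to lowest integrated confidence."""
    # sorted() is stable, so entries with equal confidence keep their input order.
    return sorted(merged, key=lambda entry: entry["integrated_confidence"], reverse=True)


library = build_library([
    {"text": "A", "lon": 116.4, "lat": 39.9, "integrated_confidence": 12.0},
    {"text": "B", "lon": 121.0, "lat": 31.0, "integrated_confidence": 42.5},
])
```

Each entry keeps its coordinates, so the sorted library still ties every address to a spatial position.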
It should be noted that, for simplicity of description, the above-described embodiments of the method are all described as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required for the present invention.
In summary, the address normalization processing method provided by the embodiment of the invention unifies address information through accurate longitude and latitude matching across multiple data sources combined with weighted normalization of addresses. It effectively addresses the low processing efficiency and low accuracy of prior-art methods for address normalization analysis and improves an application system's ability to use address information. The invention also resolves the problem that the same address may have multiple names or even different forms, making Chinese addresses more usable.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server or a network device, etc.) to perform the method according to the above-mentioned embodiments of the present application.
Referring to fig. 2, based on the same inventive concept, an embodiment of the present invention further provides an address normalization processing device, where the device mainly includes:
an original address obtaining unit 21, configured to obtain a plurality of address data sources, where the address data sources include a plurality of original addresses corresponding to target addresses;
an address preliminary treatment unit 22, configured to perform preliminary treatment on each original address in each address data source, so as to obtain a preliminarily treated original address;
a longitude and latitude network crawling unit 23, configured to perform network crawling on each preliminarily treated original address, so as to obtain the crawled longitude and latitude corresponding to that address;
an address deep treatment unit 24, configured to perform deep treatment on each preliminarily treated original address according to the crawled longitude and latitude, so as to obtain a normalized standard place name address library.
It should be noted that the original address obtaining unit 21, the address preliminary treatment unit 22, the longitude and latitude network crawling unit 23 and the address deep treatment unit 24 correspond to steps 101 to 104 in the method embodiment. The four units share the examples and application scenarios of their corresponding steps, but are not limited to what the method embodiment discloses.
In an optional embodiment provided by the application, the address data source comprises a waybill address data source, a network address data source and a social resource address data source;
the waybill address data source comprises a plurality of waybill original addresses corresponding to the target address, the network address data source comprises a plurality of network original addresses corresponding to the target address, and the social resource address data source comprises a plurality of social resource original addresses corresponding to the target address.
In an alternative embodiment provided in the present application, if the address data source is a waybill address data source, the address preliminary treatment unit 22 is specifically configured to:
performing field extraction on each waybill original address in the waybill address data source to obtain the waybill province field, the waybill city field and the waybill detailed address field corresponding to that waybill original address;
and sequentially verifying the validity and correctness of the waybill province field and the waybill city field and, after verification passes, processing the special characters contained in the waybill detailed address field to obtain the preliminarily treated waybill original address.
In an alternative embodiment provided in the present application, if the address data source is a network address data source, the address preliminary treatment unit 22 is specifically configured to:
extracting a field from each network original address in the network address data source to obtain a network longitude field, a network latitude field and a network Chinese address field corresponding to the network original address;
and sequentially verifying the validity and the correctness of the network province information and the network city information acquired from the network Chinese address field, and processing special characters contained in the network Chinese address field after the verification is passed to obtain the network original address after preliminary treatment.
In an alternative embodiment provided in the present application, if the address data source is a social resource address data source, the address preliminary treatment unit 22 is specifically configured to:
performing field extraction on each social resource original address in the social resource address data source to obtain the social resource Chinese address field corresponding to that social resource original address;
and sequentially verifying the validity and correctness of the network province information and the network city information acquired from the social resource Chinese address field and, after verification passes, processing the special characters contained in the social resource Chinese address field to obtain the preliminarily treated social resource original address.
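For the waybill case, the preliminary treatment might look like the sketch below. Several things here are assumptions rather than details from the text: "validity" is read as the province being a known province, "correctness" as the city belonging to that province, and special-character processing as removing every character outside a small allowed set. The gazetteer is a deliberately tiny, hypothetical stand-in for a real administrative division table.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical miniature gazetteer mapping provinces to their cities.
CITIES_BY_PROVINCE: Dict[str, Set[str]] = {
    "陕西省": {"西安市", "咸阳市"},
    "浙江省": {"杭州市", "宁波市"},
}

# Keep CJK ideographs, ASCII letters, digits and hyphens; drop everything else.
_SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\-]")


def treat_waybill(province: str, city: str, detail: str) -> Optional[dict]:
    """Validate province and city, then clean the detailed address field."""
    if province not in CITIES_BY_PROVINCE:  # validity: unknown province
        return None
    if city not in CITIES_BY_PROVINCE[province]:  # correctness: city not in province
        return None
    return {
        "province": province,
        "city": city,
        "detailed_address": _SPECIAL.sub("", detail),
    }
```

The network and social resource cases would differ mainly in first deriving the province and city information from the Chinese address field, which this sketch does not attempt.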
In an alternative embodiment provided by the present application, the address deep treatment unit 24 is specifically configured to:
perform deviation correction on the crawled longitude and latitude to generate the corresponding standard coordinate system longitude and latitude;
based on the spatial position determined by the standard coordinate system longitude and latitude, count the number of times the same preliminarily treated original address appears in each address data source, obtaining the data source confidence of each distinct preliminarily treated original address in each address data source;
integrate and deduplicate the distinct preliminarily treated original addresses from all address data sources, and process the data source confidences of each deduplicated and merged preliminarily treated original address to obtain its integrated confidence;
and sort the deduplicated and merged preliminarily treated original addresses in descending order of integrated confidence to obtain the normalized standard place name address library.
In summary, the address normalization processing device provided by the embodiment of the invention unifies address information through accurate longitude and latitude matching across multiple data sources combined with weighted normalization of addresses. It effectively addresses the low processing efficiency and low accuracy of prior-art methods for address normalization analysis and improves an application system's ability to use address information. The invention also resolves the problem that the same address may have multiple names or even different forms, making Chinese addresses more usable.
It should be noted that, the address normalization processing device provided in the embodiment of the present invention belongs to the same technical concept as the address normalization processing method described in the foregoing embodiment, and the specific implementation process may refer to the description of the method steps in the foregoing embodiment, which is not repeated herein.
It should be understood that the units of the above address normalization processing device are only a logical division according to the functions the device implements; in practical applications, these units may be combined or split. The functions implemented by the device provided in this embodiment correspond to the address normalization processing method provided in the foregoing embodiment; the more detailed processing flow has been described in the method embodiment and is not repeated here.
Referring to fig. 3, based on the same inventive concept, an embodiment of the present invention provides a computing device mainly including a processor 31 and a memory 32, wherein execution instructions are stored in the memory 32. The processor 31 reads the execution instructions in the memory 32 for performing the steps described in any of the embodiments of the address normalization processing method described above. Or the processor 31 reads the execution instructions in the memory 32 for implementing the functions of the units in any of the embodiments of the address normalization processing means described above.
FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present invention, and as shown in FIG. 3, the computing device includes a processor 31, a memory 32, and a transceiver 33; wherein the processor 31, the memory 32 and the transceiver 33 are interconnected by a bus 34.
The memory 32 is for storing a program; in particular, the program may include program code including computer operating instructions. The memory 32 may include volatile memory, such as random-access memory (RAM); the memory 32 may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 32 may also include a combination of the above types of memory.
The memory 32 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof:
operation instructions: including various operational instructions for carrying out various operations.
Operating system: including various system programs for implementing various basic services and handling hardware-based tasks.
Bus 34 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 3, but this does not mean that there is only one bus or one type of bus.
The processor 31 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. It may also be a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Embodiments of the present invention also provide a computer-readable storage medium containing computer-executable instructions for performing the steps described in any of the embodiments of the address normalization processing method described above. Alternatively, the computer-executable instructions are used to perform the functions of the units in the address normalization processing device embodiments described above.
Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Those skilled in the art will appreciate that all or part of the steps of implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs the steps comprising the method embodiments described above, and the storage medium described above includes various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. An address normalization processing method, which is characterized by comprising the following steps:
acquiring a plurality of address data sources, wherein the address data sources comprise a plurality of original addresses corresponding to target addresses;
performing preliminary treatment on each original address in each address data source to obtain an original address after preliminary treatment;
performing network crawling on each preliminarily treated original address to obtain the crawled longitude and latitude corresponding to the preliminarily treated original address;
performing deep treatment on each preliminarily treated original address according to the crawled longitude and latitude to obtain a normalized standard place name address library;
wherein performing preliminary treatment on each original address in each address data source to obtain a preliminarily treated original address comprises:
performing field extraction on each original address in each address data source, verifying the validity and correctness of the extracted fields, and processing special characters contained in the extracted fields that pass verification to obtain the preliminarily treated original address;
and wherein performing deep treatment on each preliminarily treated original address according to the crawled longitude and latitude to obtain a normalized standard place name address library comprises:
performing deviation correction on the crawled longitude and latitude to generate a standard coordinate system longitude and latitude corresponding to the crawled longitude and latitude;
based on the spatial position determined by the standard coordinate system longitude and latitude, counting the number of times the same preliminarily treated original address appears in each address data source to obtain the data source confidence of each of a plurality of different preliminarily treated original addresses in each address data source;
integrating and deduplicating the plurality of different preliminarily treated original addresses in each address data source, and processing the data source confidence of each deduplicated and merged preliminarily treated original address to obtain its integrated confidence;
and sorting the deduplicated and merged preliminarily treated original addresses in descending order of integrated confidence to obtain the normalized standard place name address library.
2. The method of claim 1, wherein the address data sources comprise a waybill address data source, a network address data source, and a social resource address data source;
the waybill address data source comprises a plurality of waybill original addresses corresponding to the target address, the network address data source comprises a plurality of network original addresses corresponding to the target address, and the social resource address data source comprises a plurality of social resource original addresses corresponding to the target address.
3. The method of claim 2, wherein if the address data source is a waybill address data source, performing preliminary treatment on each original address in each address data source to obtain a preliminary treated original address, comprising:
performing field extraction on each waybill original address in the waybill address data source to obtain a waybill province field, a waybill city field and a waybill detailed address field corresponding to the waybill original address;
and sequentially verifying the validity and correctness of the waybill province field and the waybill city field and, after verification passes, processing special characters contained in the waybill detailed address field to obtain the preliminarily treated waybill original address.
4. The method according to claim 2, wherein if the address data source is a network address data source, the performing preliminary treatment on each original address in each address data source to obtain a preliminary treated original address includes:
Extracting a field from each network original address in the network address data source to obtain a network longitude field, a network latitude field and a network Chinese address field corresponding to the network original address;
And sequentially verifying the validity and the correctness of the network province information and the network city information acquired from the network Chinese address field, and processing special characters contained in the network Chinese address field after the verification is passed to obtain the network original address after preliminary treatment.
5. The method according to claim 2, wherein if the address data source is a social resource address data source, the performing preliminary treatment on each original address in each address data source to obtain a preliminary treated original address includes:
extracting a field from each social resource original address in the social resource address data source to obtain a social resource Chinese address field corresponding to the social resource original address;
and sequentially verifying the validity and correctness of the network province information and the network city information acquired from the social resource Chinese address field and, after verification passes, processing special characters contained in the social resource Chinese address field to obtain the preliminarily treated social resource original address.
6. An address normalization processing device, comprising:
an original address acquisition unit, configured to acquire a plurality of address data sources, wherein the address data sources comprise a plurality of original addresses corresponding to target addresses;
an address preliminary treatment unit, configured to perform preliminary treatment on each original address in each address data source to obtain a preliminarily treated original address;
a longitude and latitude network crawling unit, configured to perform network crawling on each preliminarily treated original address to obtain the crawled longitude and latitude corresponding to the preliminarily treated original address;
an address deep treatment unit, configured to perform deep treatment on each preliminarily treated original address according to the crawled longitude and latitude to obtain a normalized standard place name address library;
wherein the address preliminary treatment unit is specifically configured to perform field extraction on each original address in each address data source, verify the validity and correctness of the extracted fields, and process special characters contained in the extracted fields that pass verification to obtain the preliminarily treated original address;
and the address deep treatment unit is specifically configured to perform deviation correction on the crawled longitude and latitude to generate a standard coordinate system longitude and latitude corresponding to the crawled longitude and latitude; based on the spatial position determined by the standard coordinate system longitude and latitude, count the number of times the same preliminarily treated original address appears in each address data source to obtain the data source confidence of each of a plurality of different preliminarily treated original addresses in each address data source; integrate and deduplicate the plurality of different preliminarily treated original addresses in each address data source, and process the data source confidence of each deduplicated and merged preliminarily treated original address to obtain its integrated confidence; and sort the deduplicated and merged preliminarily treated original addresses in descending order of integrated confidence to obtain the normalized standard place name address library.
7. The device of claim 6, wherein the address deep treatment unit is specifically configured to:
perform deviation correction on the crawled longitude and latitude to generate a standard coordinate system longitude and latitude corresponding to the crawled longitude and latitude;
based on the spatial position determined by the standard coordinate system longitude and latitude, count the number of times the same preliminarily treated original address appears in each address data source to obtain the data source confidence of each of a plurality of different preliminarily treated original addresses in each address data source;
integrate and deduplicate the plurality of different preliminarily treated original addresses in each address data source, and process the data source confidence of each deduplicated and merged preliminarily treated original address to obtain its integrated confidence;
and sort the deduplicated and merged preliminarily treated original addresses in descending order of integrated confidence to obtain the normalized standard place name address library.
8. A computing device comprising a processor and a memory, wherein the memory has stored therein execution instructions, the processor reading the execution instructions in the memory for performing the steps in the address normalization method of any one of claims 1 to 5.
9. A computer-readable storage medium containing computer-executable instructions for performing the steps in the address normalization processing method according to any one of claims 1 to 5.
CN202011397609.9A 2020-12-02 2020-12-02 Address normalization processing method and device Active CN112487122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011397609.9A CN112487122B (en) 2020-12-02 2020-12-02 Address normalization processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011397609.9A CN112487122B (en) 2020-12-02 2020-12-02 Address normalization processing method and device

Publications (2)

Publication Number Publication Date
CN112487122A CN112487122A (en) 2021-03-12
CN112487122B true CN112487122B (en) 2024-05-17

Family

ID=74939138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011397609.9A Active CN112487122B (en) 2020-12-02 2020-12-02 Address normalization processing method and device

Country Status (1)

Country Link
CN (1) CN112487122B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515677B (en) * 2021-07-22 2023-10-27 中移(杭州)信息技术有限公司 Address matching method, device and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699818A (en) * 2015-03-25 2015-06-10 武汉大学 Multi-source heterogeneous multi-attribute POI (point of interest) integration method
CN106709065A (en) * 2017-01-19 2017-05-24 国家电网公司 Standardization processing method and standardized processing device for address information
CN108345596A (en) * 2017-01-22 2018-07-31 分众(中国)信息技术有限公司 Building information converged services platform
CN109635063A (en) * 2018-12-06 2019-04-16 拉扎斯网络科技(上海)有限公司 Information processing method, device, electronic equipment and the storage medium of address base
CN111523606A (en) * 2020-04-28 2020-08-11 中交信息技术国家工程实验室有限公司 Road information updating method
CN111723172A (en) * 2020-06-10 2020-09-29 广东世纪高通科技有限公司 Data fusion method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10073870B2 (en) * 2013-07-05 2018-09-11 Here Global B.V. Method and apparatus for providing data correction and management

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The Web as a Data Source for Spatial Database; KARLA A. V. BORGES et al.; Brazilian Symposium on Geoinformatics; 20031231; 1-7 *
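Research on a New Geographic POI Data Fusion Method Based on Word Information Weighting; Wang Xiaoxiang et al.; Software Guide; 20180315; Vol. 17 (No. 3); 41-44+49 *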

Also Published As

Publication number Publication date
CN112487122A (en) 2021-03-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant