CN112988933A - Method and device for managing address information - Google Patents

Method and device for managing address information

Info

Publication number
CN112988933A
Authority
CN
China
Prior art keywords
address information
address
fused
region
pieces
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110264752.9A
Other languages
Chinese (zh)
Inventor
冯军芳
黄泽宇
王洪良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijun Technology Co ltd
Original Assignee
Beijing Huijun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huijun Technology Co ltd
Priority to CN202110264752.9A
Publication of CN112988933A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/23 Updating

Abstract

The invention discloses a method and a device for managing address information, and relates to the technical field of computers. The specific implementation mode of the method comprises the following steps: dividing a target geographical area corresponding to an initial address information set into a plurality of area blocks; respectively taking each region block as a region to be fused, and merging similar address information in the region to be fused to obtain an updated address information set; respectively taking the plurality of region blocks as new regions to be fused, and merging similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set; and gradually expanding the geographic range of the area to be fused to further update the address information set until the geographic range of the area to be fused meets the preset condition. The implementation method can solve the problem of high space complexity and time complexity caused by overlarge data volume, and can reduce data errors caused by incomplete fusion of similar geographic information to a great extent.

Description

Method and device for managing address information
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for managing address information.
Background
Because the writing of the address information is not standard, the same geographical position may correspond to a plurality of address information, and the same address information may also correspond to a plurality of geographical positions. Therefore, it is necessary to perform a fusion process on the address information so that each address position uniquely corresponds to one address information.
Fig. 1 is a schematic diagram of address fusion in the prior art. As shown in Fig. 1, in the prior art, address information is compared and fused by calculating the similarity directly between every pair of addresses. The computational load of this address fusion method is too large.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for address information management, which can solve the problem of high spatial complexity and time complexity caused by an excessively large data amount by using a block-by-block policy for a geographic space, and at the same time, can greatly reduce a data error caused by incomplete fusion of similar geographic information.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided an address information management method, including:
dividing a target geographical area corresponding to an initial address information set into a plurality of area blocks;
respectively taking each region block as a region to be fused, and merging similar address information in the region to be fused to obtain an updated address information set;
respectively taking the plurality of region blocks as new regions to be fused, and merging similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set;
and gradually expanding the geographic range of the area to be fused to further update the address information set until the geographic range of the area to be fused meets a preset condition.
Optionally, merging similar address information in the region to be fused, including:
sequencing all address information in the region to be fused to obtain an address list;
judging whether any address information in the address list is similar to the adjacent address information;
if yes, combining the arbitrary address information and the adjacent address information into one address information; otherwise, traversing the next address information in the address list.
Optionally, the method further comprises:
before judging whether any address information in the address list is similar to the adjacent address information, confirming that the precision level of any address information is less than or equal to a precision level threshold value;
if the precision level of any address information is greater than the precision level threshold, the precision level of any address information is adjusted to be less than or equal to the precision level threshold.
Optionally, the method further comprises: judging whether two pieces of address information are similar according to the text similarity and the spatial distance.
Optionally, judging whether the two pieces of address information are similar according to the text similarity includes:
determining the length of a matching segment between the address texts of the two pieces of address information and the length of the address text of each piece of address information in the two pieces of address information;
determining text similarity between the two pieces of address information according to the length of the matching fragment and the length of the address text of each piece of address information in the two pieces of address information;
judging whether the text similarity is greater than a text similarity threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
Optionally, determining whether the two pieces of address information are similar according to the spatial distance includes:
determining the longitude and latitude distance between the two pieces of address information according to the longitude and latitude data of each piece of address information in the two pieces of address information;
judging whether the longitude and latitude distance is smaller than a longitude and latitude distance threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
Optionally, before each of the region blocks is respectively used as a region to be fused, the method further includes at least one of:
judging whether a plurality of address information with the same longitude and latitude data exists in the initial address information set or not; if so, combining the plurality of address information with the same longitude and latitude data into one address information, and screening one address text from the plurality of address information with the same longitude and latitude data as the address text of the combined address information;
judging whether a plurality of address information with the same address text exist in the initial address information set or not; if so, combining the plurality of address information with the same address text into one address information, and screening one piece of longitude and latitude data from the plurality of address information with the same address text as the longitude and latitude data of the combined address information;
judging whether preset characters exist in the address information or not; and if so, removing the preset characters in the address information.
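The three optional cleaning operations above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (dicts with `text`, `lat`, `lng`), the function name `clean`, the order of the three operations, and the rule of keeping the first record seen are all assumptions, since the text only says that one address text or one coordinate is "screened" from the duplicates.

```python
def clean(addresses, preset_chars="#*"):
    """Clean a list of address records before block-wise fusion.

    addresses: list of dicts with keys 'text', 'lat', 'lng' (assumed layout).
    preset_chars: characters to strip from every address text (assumed set).
    """
    # 1. Remove the preset characters from each address text.
    stripped = [
        dict(a, text="".join(ch for ch in a["text"] if ch not in preset_chars))
        for a in addresses
    ]

    # 2. Records with identical latitude/longitude collapse into one;
    #    the first address text seen is kept (an assumed selection rule).
    by_coord = {}
    for a in stripped:
        by_coord.setdefault((a["lat"], a["lng"]), a)

    # 3. Records with identical address text collapse into one;
    #    the first coordinate seen is kept (an assumed selection rule).
    by_text = {}
    for a in by_coord.values():
        by_text.setdefault(a["text"], a)

    return list(by_text.values())
```

Because the text requires only "at least one of" the operations, any subset could be applied; a production version would also need a deliberate rule for choosing which text or coordinate survives.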
According to still another aspect of an embodiment of the present invention, there is provided an apparatus for geographical location information management, including:
the region dividing module is used for dividing a target geographic region corresponding to the initial address information set into a plurality of region blocks;
the address fusion module is used for merging similar address information in the areas to be fused by taking each area block as the area to be fused respectively to obtain an updated address information set; respectively taking the plurality of region blocks as new regions to be fused, and merging similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set; and gradually expanding the geographic range of the area to be fused to further update the address information set until the geographic range of the area to be fused meets a preset condition.
Optionally, the address fusion module merges similar address information in the region to be fused, and includes:
sequencing all address information in the region to be fused to obtain an address list;
judging whether any address information in the address list is similar to the adjacent address information;
if yes, combining the arbitrary address information and the adjacent address information into one address information; otherwise, traversing the next address information in the address list.
Optionally, the address fusion module is further configured to:
before judging whether any address information in the address list is similar to the adjacent address information, confirming that the precision level of any address information is less than or equal to a precision level threshold value;
if the precision level of any address information is greater than the precision level threshold, the precision level of any address information is adjusted to be less than or equal to the precision level threshold.
Optionally, the address fusion module is further configured to: judge whether two pieces of address information are similar according to the text similarity and the spatial distance.
Optionally, the determining, by the address fusion module, whether the two pieces of address information are similar according to the text similarity includes:
determining the length of a matching segment between the address texts of the two pieces of address information and the length of the address text of each piece of address information in the two pieces of address information;
determining text similarity between the two pieces of address information according to the length of the matching fragment and the length of the address text of each piece of address information in the two pieces of address information;
judging whether the text similarity is greater than a text similarity threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
Optionally, the determining, by the address fusion module, whether the two pieces of address information are similar according to the spatial distance includes:
determining the longitude and latitude distance between the two pieces of address information according to the longitude and latitude data of each piece of address information in the two pieces of address information;
judging whether the longitude and latitude distance is smaller than a longitude and latitude distance threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
Optionally, the apparatus further includes an address cleansing module, configured to perform at least one of the following operations before each of the region blocks is respectively used as the region to be fused:
judging whether a plurality of address information with the same longitude and latitude data exists in the initial address information set or not; if so, combining the plurality of address information with the same longitude and latitude data into one address information, and screening one address text from the plurality of address information with the same longitude and latitude data as the address text of the combined address information;
judging whether a plurality of address information with the same address text exist in the initial address information set or not; if so, combining the plurality of address information with the same address text into one address information, and screening one piece of longitude and latitude data from the plurality of address information with the same address text as the longitude and latitude data of the combined address information;
judging whether preset characters exist in the address information or not; and if so, removing the preset characters in the address information.
According to another aspect of the embodiments of the present invention, there is provided an electronic device for address information management, including: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method provided by the first aspect of the embodiments of the present invention.
According to a further aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.
One embodiment of the above invention has the following advantages or benefits: by adopting a block-by-block strategy of geographic space, the problems of high space complexity and time complexity caused by overlarge data volume can be solved, and meanwhile, the data error caused by incomplete fusion of similar geographic information can be greatly reduced.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of address fusion in the prior art;
FIG. 2 is a diagram illustrating address fusion according to cell address names;
FIG. 3 is a schematic diagram of address fusion by longitude;
FIG. 4 is a schematic diagram of the main flow of a method of address information management according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of partitioned regional partitions in an alternative embodiment of the invention;
FIG. 6 is a flow chart illustrating a method of address information management in an alternative embodiment of the invention;
FIG. 7 is a schematic diagram of a building distribution for a cell;
FIG. 8 is a schematic diagram of the major modules of an apparatus for address information management in some embodiments of the present invention;
FIG. 9 is a schematic diagram of the major modules of an apparatus for address information management in further embodiments of the present invention;
FIG. 10 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 11 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
According to an aspect of an embodiment of the present invention, there is provided a method of address information management.
Fig. 4 is a schematic diagram of the main flow of a method for address information management according to an embodiment of the present invention. As shown in Fig. 4, the method for address information management includes step S401, step S402, step S403, and step S404.
Step S401, dividing the target geographic area corresponding to the initial address information set into a plurality of area blocks.
The initial address information set is an address information set that needs to be fused. Taking the e-commerce field as an example, the address information set may include some or all of the addressee information written by the user when ordering on the e-commerce platform. Taking the field of logistics as an example, the address information set may include address information of a mail or an addressee which some or all users fill in when sending a mail.
The address information includes address text and latitude and longitude data. The address text includes, for example, country, province, city, prefecture, street, last-level address, and the like. Optionally, the address information may further include a precision level that reflects how fine-grained the address information is. For example: address information containing only city information has a precision level of 1; further containing county information, level 2; further containing town and street information, level 3; further containing village and community information, level 4; further containing development zone information, level 5; further containing hot spot areas and business districts, level 6; further containing road information, level 7; further containing road attachment points (intersections, toll stations, entrances and exits, etc.), level 8; further containing a door-number address, level 9; further containing cell and building information, level 10; further containing POI points, level 11; and so on.
The division mode of the region into blocks can be selectively set according to actual conditions. For example, a city is divided into several zones according to administrative regions. In the practical application process, the target address area can be divided into a plurality of area blocks with different shapes and outlines according to the practical requirement. Fig. 5 is a schematic diagram of dividing the region blocks according to an alternative embodiment of the present invention, and as shown in fig. 5, the target address region (the largest rectangular box in fig. 5) is divided into 36 region blocks of 6 rows and 6 columns.
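As a rough illustration of the rectangular case shown in Fig. 5, the sketch below assigns each address to a block of a uniform grid laid over the bounding box of the target area. The uniform latitude/longitude grid, the function names `block_index` and `partition`, and the record layout are assumptions made for illustration; the text also allows blocks of other shapes, such as administrative regions.

```python
from collections import defaultdict


def block_index(lat, lng, bounds, rows=6, cols=6):
    """Return the (row, col) of the grid block that contains a point.

    bounds is (min_lat, min_lng, max_lat, max_lng) of the target area.
    Points on or beyond the edges are clamped into the nearest block.
    """
    min_lat, min_lng, max_lat, max_lng = bounds
    row = int((lat - min_lat) / (max_lat - min_lat) * rows)
    col = int((lng - min_lng) / (max_lng - min_lng) * cols)
    return min(max(row, 0), rows - 1), min(max(col, 0), cols - 1)


def partition(addresses, bounds, rows=6, cols=6):
    """Group address records (dicts with 'lat' and 'lng') by grid block."""
    blocks = defaultdict(list)
    for addr in addresses:
        key = block_index(addr["lat"], addr["lng"], bounds, rows, cols)
        blocks[key].append(addr)
    return blocks
```

With `rows=6` and `cols=6` this yields the 36 blocks of the Fig. 5 example; each non-empty block then becomes one region to be fused in step S402.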
Fig. 2 is a diagram illustrating address fusion according to cell address names. Since different administrative areas may have the same cell address name, performing the fusion process only according to the cell name is prone to cause a large error in data. In the embodiment of the invention, the target geographic area is divided into a plurality of area blocks, so that the possibility that different cells in a smaller range have the same name is low, and the defects can be avoided to a certain extent. The embodiments of the present invention can completely avoid the above-mentioned disadvantages when the geographical range of the region block is small.
Step S402, respectively taking each region block as a region to be fused, and merging similar address information in the region to be fused to obtain an updated address information set.
Merging refers to merging different address information of the same geographical position into one piece of address information. This step performs fusion processing on a plurality of pieces of address information in each of the area blocks, respectively. For each region block, in the fusion process, pairwise calculation may be performed between all address information in the region block to search for similar address information.
Optionally, merging similar address information in the region to be fused includes: sequencing all address information in the region to be fused to obtain an address list; judging whether any address information in the address list is similar to the adjacent address information; if yes, combining the arbitrary address information and the adjacent address information into one address information; otherwise, the next address information in the address list is traversed.
In this step, each region to be fused corresponds to one region block. Taking fig. 5 as an example, each region to be fused corresponds to a square region in the drawing, such as the region corresponding to the square a. Through sorting and similarity judgment between adjacent address information, the calculation amount of fusion processing can be greatly reduced, and the fusion efficiency is improved.
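The sort-then-compare-neighbours procedure can be sketched as a single pass over the sorted list. The function name `merge_adjacent` and its three callable parameters are hypothetical; the text does not fix the sort key, the similarity test, or how two records are combined, so they are passed in by the caller.

```python
def merge_adjacent(addresses, sort_key, is_similar, merge):
    """Sort records, then fold each record into its predecessor when similar.

    sort_key:   callable giving the ordering of the address list.
    is_similar: callable(a, b) -> bool for two neighbouring records.
    merge:      callable(a, b) -> record that replaces both.
    """
    ordered = sorted(addresses, key=sort_key)
    merged = []
    for addr in ordered:
        if merged and is_similar(merged[-1], addr):
            # Similar to the neighbour: combine into one record.
            merged[-1] = merge(merged[-1], addr)
        else:
            # Not similar: keep it and move on to the next record.
            merged.append(addr)
    return merged
```

Sorting costs O(n log n) and the pass makes at most n - 1 comparisons, instead of the n(n - 1)/2 comparisons of the pairwise approach, which is where the reduction in calculation comes from. The trade-off is that similar records which do not end up adjacent after sorting are not compared in this pass.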
Further, the method of the embodiment of the present invention may further include: before judging whether any address information in the address list is similar to the adjacent address information, confirming that the precision level of the any address information is less than or equal to the precision level threshold value. If the precision level of the arbitrary address information is greater than the precision level threshold, the precision level of the arbitrary address information is adjusted to be less than or equal to the precision level threshold.
The more specific the content of the address information, the higher its precision level. The value of the precision level threshold can be set according to the actual situation. Illustratively, when address fusion is used to analyze the consumption preferences of users in each cell, the address information only needs to be accurate to the cell name and does not need to contain the building number, unit number, or house number within the cell. If the address content of a piece of address information is accurate to the building number, unit number, or house number, those components can be removed to lower the precision level of the address information. For example, "Cell A, Building 23, Unit 1, Room 1303" or "Cell A, Building 24" can be simplified to "Cell A". Setting a precision level threshold reduces the amount of calculation in the subsequent fusion processing and improves its efficiency.
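One way to picture the precision cap is to treat an address as a list of components, each tagged with its own level, and to drop every component finer than the threshold. The component representation, the function name `cap_precision`, and the level numbers used in the usage note are hypothetical and do not reproduce the 1 to 11 scale described earlier.

```python
def cap_precision(components, max_level):
    """Drop address components finer than max_level.

    components: list of (level, text) pairs ordered from coarse to fine
                (an assumed representation of a parsed address).
    Returns (capped_text, capped_level).
    """
    kept = [(level, text) for level, text in components if level <= max_level]
    capped_text = ", ".join(text for _, text in kept)
    capped_level = max((level for level, _ in kept), default=0)
    return capped_text, capped_level
```

For instance, with illustrative levels 1 (city), 2 (cell) and 3 (building), capping `[(1, "City X"), (2, "Cell A"), (3, "Building 23")]` at level 2 gives the text "City X, Cell A", so records that differ only below the cell level become identical before comparison.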
Whether two pieces of address information are similar can be judged according to both the text similarity and the spatial distance: the two pieces are judged to be similar only when they are similar according to the text similarity and also similar according to the spatial distance. Illustratively, if the spatial distance between two pieces of address information is less than 400 m and the text similarity is greater than 0.6, the two pieces are considered similar and are merged. Because the address text and the latitude and longitude data are considered together during address fusion, the defect of the approach shown in Fig. 2 can be completely avoided.
Further, if the addresses are sorted only by latitude or longitude, only address data at the same latitude or the same longitude can be compared (as shown in Fig. 3). However, the actual geographic outlines of cells differ, so simplifying the data by latitude and longitude ordering alone amounts to merging only geographic position data that lies along the same straight line on the map, without considering the position or boundary shape of the cell, which leads to incomplete integration of cell information. Therefore, in the embodiment of the present invention, the address information includes both the address text and the latitude and longitude data, and the similarity between two addresses is determined by considering both during address fusion, so that this defect can be avoided.
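The combined test is a conjunction of the two criteria. The sketch below takes the two measurements as already computed and uses the illustrative thresholds quoted above (0.6 and 400 m) as defaults; the function name `is_similar` and its parameter names are hypothetical.

```python
def is_similar(text_similarity, distance_m,
               min_text_similarity=0.6, max_distance_m=400.0):
    """Judge two addresses similar only if BOTH criteria are satisfied.

    text_similarity: similarity score in [0, 1] between the address texts.
    distance_m:      spatial distance between the two addresses, in metres.
    """
    return text_similarity > min_text_similarity and distance_m < max_distance_m
```

Two cells with the same name in different districts score 1.0 on text but lie far apart, so `is_similar(1.0, 5000.0)` is false and they are kept separate, which is the Fig. 2 failure case. Two nearby addresses with unrelated texts, as can happen along one line of latitude in Fig. 3, also fail the test.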
Optionally, judging whether the two pieces of address information are similar according to the text similarity includes: determining the length of a matching segment between address texts of two pieces of address information and the length of the address text of each piece of address information in the two pieces of address information; determining the text similarity between the two pieces of address information according to the length of the matching segment and the length of the address text of each piece of address information in the two pieces of address information; judging whether the text similarity is greater than a text similarity threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two address information are judged to be dissimilar.
The text similarity can be calculated with the SequenceMatcher class in Python's difflib package: compute the sum T of the lengths of all matching fragments, then compute 2 × T / (len(a) + len(b)). The result lies in [0, 1]: it is 1 when the two texts are identical and 0 when they share no fragment. This index takes both the length and the order of the matching fragments into account. For example, the first address text is "No. 3 Building, South District, Shenzhen Garden", with a character length of 9 (counted in the original Chinese text), and the second address text is "No. 17 Building, North District, Shenzhen Garden", with a character length of 10. The matching fragments between the two have a total length of 7. Therefore, the text similarity is 7 × 2 / (9 + 10) = 0.7368. The text similarity threshold is 0.6 (an empirical value). Because the index is greater than the threshold, the two addresses can be merged into one address; otherwise, the two addresses would be processed separately.
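The index 2 × T / (len(a) + len(b)) can be computed explicitly from the matching blocks that `difflib.SequenceMatcher` reports, which makes the role of T visible. The helper name `text_similarity` is hypothetical, and the sample strings in the usage note are placeholders rather than the addresses from the example above.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Return 2*T / (len(a) + len(b)), where T is the total matched length."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() ends with a zero-size sentinel, which adds 0 to T.
    total = sum(block.size for block in matcher.get_matching_blocks())
    return 2.0 * total / (len(a) + len(b))
```

For the placeholder strings "abcd" and "bcde" the only matching fragment is "bcd", so T = 3 and the similarity is 2 × 3 / (4 + 4) = 0.75. This is the same quantity that `SequenceMatcher.ratio()` returns; `autojunk=False` is set here so that frequent characters are never discarded as junk in long texts.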
Optionally, the determining whether the two pieces of address information are similar according to the spatial distance includes: determining a longitude and latitude distance between the two pieces of address information according to the longitude and latitude data of each piece of address information in the two pieces of address information; judging whether the longitude and latitude distance is smaller than a longitude and latitude distance threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two address information are judged to be dissimilar.
Illustratively, the latitude and longitude distance may be determined according to the following formula:
[Formula image BDA0002971935720000101 is not reproduced in this text.]
In the formula, lat1 represents the latitude of the first piece of address information, lng1 represents the longitude of the first piece of address information, lat2 represents the latitude of the second piece of address information, lng2 represents the longitude of the second piece of address information, and Dist represents the latitude and longitude distance between the first and second pieces of address information.
The latitude and longitude distance threshold can be selectively set according to actual conditions, such as 600 meters, 400 meters and the like.
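Since the distance formula itself is not reproduced in this text, the sketch below substitutes the standard haversine great-circle distance, which takes the same four inputs (lat1, lng1, lat2, lng2) and returns a distance in metres. It is an assumed stand-in, not necessarily the formula used in the application; the function name `haversine_m` and the Earth radius constant are likewise assumptions.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))
```

The result can be compared directly with a threshold such as 400 or 600 metres. At the scale of a few hundred metres, a simpler planar approximation would give nearly the same answer.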
Step S403, respectively taking the plurality of region blocks as new regions to be fused, and merging similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set.
In this step, the geographic range of the region to be fused is expanded to include a plurality of region blocks, for example, two or more. Taking Fig. 5 as an example, in step S402, region block A of the 36 region blocks is taken as a region to be fused, and in step S403, block B, which contains region block A and consists of 9 region blocks, is taken as a region to be fused. The amount of each expansion can be set according to actual conditions. For example, the region may be enlarged by 1 or more region blocks at a time, or it may be enlarged multiplicatively, each time to 2 or more times its previous size.
Step S404, gradually enlarging the geographic range of the area to be fused to further update the address information set until the geographic range of the area to be fused meets a preset condition.
In this step, the geographical range of the area to be fused is continued to be expanded to include more area blocks. Taking fig. 5 as an example, in step S402, a region block a of the 36 region blocks is taken as a region to be fused; in step S403, a block B including the region block a is taken as a region to be fused, and the block B includes 9 region blocks; in this step, 9 region blocks including the block B are taken as the region to be fused.
The preset condition can be set according to actual conditions. For example, the preset condition may be that the area to be fused has been enlarged to coincide with the target geographical area. In that case, for example, the address information set is updated after the region to be fused is expanded to include 2 region blocks in step S403; the geographic range of the region to be fused continues to be expanded in step S404 until the region to be fused includes 4 region blocks, and the address information set is updated again; the geographic range then continues to be expanded step by step until the region to be fused contains all 36 region blocks, and the address information set updated at that point is used as the final address information set.
In other embodiments, the preset conditions are: the region to be fused is expanded to include a set number of region blocks. For example, if the preset condition is that the region to be fused includes 2 region blocks, and the address information set is updated after the region to be fused is expanded to include 2 region blocks in step S403, the address information set updated in step S403 is directly used as the final address information set in step S404. For another example, if the preset condition is that the region to be fused includes 16 region blocks, and the address information set is updated after the region to be fused is expanded to include 9 region blocks in step S403, the geographic range of the region to be fused is continuously expanded until the region to be fused includes 16 region blocks in step S404, so that the address information set updated after the geographic range of the region to be fused is expanded to include 16 region blocks is used as the final address information set.
It should be noted that, when the address information set is updated each time, if the number of the region blocks included in the region to be fused is smaller than the number of all the region blocks included in the target geographic region, the target geographic region may include a plurality of regions to be fused. Exemplarily, taking fig. 5 as an example, in step S402, each region to be fused includes 1 region block, which contains 36 regions to be fused in total; in step S403, if each region to be fused includes 2 region blocks, 18 regions to be fused are included in total; in step S404, if each region to be fused includes 9 region blocks, a total of 4 regions to be fused is included.
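The region counts in this example follow from tiling the grid of region blocks. The following is a minimal sketch, assuming that the 36 region blocks of fig. 5 form a 6 x 6 grid and that every region to be fused is a rectangle of whole blocks; the function name `group_blocks` and the rectangular tiling are illustrative assumptions and are not taken from the embodiment.

```python
from collections import defaultdict


def group_blocks(rows, cols, height, width):
    """Tile a rows x cols grid of region blocks into height x width regions.

    Returns a dict mapping a region id (region_row, region_col) to the list
    of (row, col) block coordinates that belong to that region.
    """
    regions = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            regions[(r // height, c // width)].append((r, c))
    return dict(regions)


# Assumed 6 x 6 layout for the 36 region blocks (not stated in the source).
for height, width in [(1, 1), (1, 2), (3, 3), (6, 6)]:
    regions = group_blocks(6, 6, height, width)
    print(f"{height * width} block(s) per region -> {len(regions)} region(s)")
```

Under these assumptions, regions of 1, 2, 9, and 36 blocks yield 36, 18, 4, and 1 regions to be fused, matching the counts given above.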
The invention gradually enlarges the address range of the region to be fused in the process of fusing the address information, so that the blocks corresponding to the region to be fused are gradually enlarged. By adopting the block-by-block strategy of the geographic space, the problems of high space complexity and time complexity caused by overlarge data volume can be solved, and meanwhile, the data error caused by incomplete fusion of similar geographic information can be greatly reduced.
In some optional embodiments, before each region block is respectively used as a region to be fused, it is judged whether a plurality of pieces of address information with the same longitude and latitude data exist in the initial address information set; if so, the pieces of address information with the same longitude and latitude data are combined into one piece of address information, and one address text is selected from them to serve as the address text of the combined address information. When selecting the address text, one address text may be chosen at random, or the address text with the fewest words or characters may be chosen. With this embodiment, each geographic position retains only one address text and its corresponding longitude and latitude data, which greatly reduces the amount of calculation in the subsequent fusion processing and improves the fusion efficiency.
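The coordinate-based cleanup can be sketched as follows. This is a minimal sketch, assuming each piece of address information is a dict with the hypothetical keys `text`, `lat`, and `lon`, and using the "fewest characters" selection rule; the function name `dedupe_by_coordinates` is illustrative.

```python
def dedupe_by_coordinates(records):
    """Keep one record per (lat, lon) pair, preferring the shortest text.

    Each record is a dict with the hypothetical keys "text", "lat", "lon".
    """
    best = {}
    for rec in records:
        key = (rec["lat"], rec["lon"])
        kept = best.get(key)
        # Replace the kept record only when the new text is strictly shorter.
        if kept is None or len(rec["text"]) < len(kept["text"]):
            best[key] = rec
    return list(best.values())
```

Choosing at random instead would only change the comparison inside the loop; the grouping by identical coordinates stays the same.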
In some optional embodiments, before each region block is used as a region to be fused, it is judged whether a plurality of pieces of address information with the same address text exist in the initial address information set; if so, the pieces of address information with the same address text are combined into one piece of address information, and one piece of longitude and latitude data is selected from them to serve as the longitude and latitude data of the combined address information. When selecting the longitude and latitude data, one piece may be chosen at random, or the longitude and latitude data of a fixed position may be chosen, such as that of the upper-left corner, the lower-right corner, or another position. With this embodiment, each address text retains only one piece of corresponding longitude and latitude data, which greatly reduces the amount of calculation in the subsequent fusion processing and improves the fusion efficiency.
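The text-based cleanup can be sketched in the same way. This is a minimal sketch, assuming records are dicts with the hypothetical keys `text`, `lat`, and `lon`, and interpreting "upper-left corner" as the largest latitude with ties broken by the smallest longitude on a north-up map; that interpretation and the name `dedupe_by_text` are assumptions.

```python
def dedupe_by_text(records):
    """Keep one record per address text, choosing the upper-left coordinate.

    "Upper-left" is interpreted here as the largest latitude, with ties
    broken by the smallest longitude (an assumption for a north-up map).
    """
    best = {}
    for rec in records:
        kept = best.get(rec["text"])
        if kept is None or (-rec["lat"], rec["lon"]) < (-kept["lat"], kept["lon"]):
            best[rec["text"]] = rec
    return list(best.values())
```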
Before each region block is taken as a region to be fused, it can also be judged whether preset characters exist in the address information; if so, the preset characters are removed from the address information to facilitate the subsequent fusion processing. For example, Chinese and English punctuation marks and special characters such as "-" and brackets are removed from the address text.
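Removing preset characters can be sketched with a regular expression. This is a minimal sketch; the exact character set is not specified in the embodiment, so the set below (whitespace, ASCII punctuation, and common full-width Chinese punctuation) and the name `strip_preset_chars` are assumptions to be adjusted to the data at hand.

```python
import re

# Hypothetical "preset characters": whitespace, ASCII punctuation and
# brackets, and common full-width Chinese punctuation.
_PRESET_CHARS = re.compile(r"[\s\-_,.;:!?()\[\]{}<>#*/\\，。；：！？（）【】《》、·]")


def strip_preset_chars(text):
    """Remove every preset character from an address text."""
    return _PRESET_CHARS.sub("", text)


print(strip_preset_chars("北京市-朝阳区（望京）10号"))
```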
Fig. 6 is a flowchart illustrating a method for address information management according to an alternative embodiment of the present invention. The flow of the present embodiment is executed in a Python 3.9.1 environment; the detailed flow is as follows.
Address resolution: the address list is passed to the API (Application Program Interface) of a location service WebService (a lightweight, independent communication technology that can receive requests from the Internet or from other systems on an intranet), which returns the corresponding province, city, county, street, last-level address, and address precision level.
Precision check: it is determined whether the resolved precision level is less than or equal to a precision level threshold (for example, 10). When the precision level is greater than 10, address optimization and deletion is performed, and the flow then jumps back to the address resolution step. When the precision level is less than or equal to 10, the flow proceeds to the information output step. In fig. 6, POI denotes address information with a precision level greater than 10.
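The resolve-and-check loop can be sketched as follows. This is a minimal sketch only: the real location service API is not described in this document, so `resolve` and `simplify` are hypothetical callables supplied by the caller, and `fake_resolve` and `drop_last_token` are toy stand-ins used purely to make the example runnable.

```python
def resolve_until_coarse(address, resolve, simplify, threshold=10, max_rounds=5):
    """Re-resolve an address until its precision level is <= threshold.

    `resolve(address)` must return a dict with a "precision_level" key and
    `simplify(address)` must return a coarser address string; both are
    hypothetical callables, not part of any real service API.
    """
    for _ in range(max_rounds):
        parsed = resolve(address)
        if parsed["precision_level"] <= threshold:
            return parsed
        address = simplify(address)
    raise ValueError("address could not be reduced to the required precision")


def fake_resolve(address):
    # Toy rule: every space-separated token adds 4 to the precision level.
    return {"address": address, "precision_level": 4 * len(address.split())}


def drop_last_token(address):
    # Toy "optimization and deletion": remove the most detailed token.
    return " ".join(address.split()[:-1])


print(resolve_until_coarse("Chaoyang Wangjing Tower-3 Unit-2", fake_resolve, drop_last_token))
```

The `max_rounds` bound is a design choice of this sketch so that an address that never reaches the threshold cannot loop forever.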
After the information is output, the address information is fused. The principle is as follows: one longitude and latitude corresponds to only one address, and a plurality of addresses that are highly similar and spatially close are merged into one address. Specifically, the target geographic area is divided into a plurality of small blocks, and for each small block, the Chinese addresses are sorted. For each piece of address information after sorting, the following process is executed:
It is judged whether the text similarity between the address information and its adjacent address information is greater than 0.6. If yes, a further calculation is made according to the longitude and latitude distance; otherwise, the fusion processing flow for the current address information ends.
When calculating according to the longitude and latitude distance, it is judged whether the longitude and latitude distance between the address information and its adjacent address information is less than or equal to 400 m. If yes, the address information is merged with its adjacent address information.
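The two-stage test can be sketched as follows. This is a minimal sketch under stated assumptions: the embodiment's own similarity and distance formulas are described elsewhere and are not reproduced here, so Python's `difflib.SequenceMatcher.ratio()` (twice the matched length divided by the total length of both texts) and the haversine great-circle distance are used as stand-ins. The record keys `text`, `lat`, `lon` and all function names are hypothetical.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def text_similarity(a, b):
    """Ratio in [0, 1]: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b).ratio()


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def is_similar(a, b, text_threshold=0.6, distance_threshold_m=400):
    """Apply the text test first and the distance test only if it passes."""
    if text_similarity(a["text"], b["text"]) <= text_threshold:
        return False
    return haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= distance_threshold_m
```

Running the cheaper text comparison first mirrors the order of the flow above, so the distance is computed only for pairs whose texts already look alike.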
After the calculation for each small block is finished, the calculation is carried out on larger blocks in a similar way: the Chinese addresses are sorted, each pair of adjacent pieces of address information is compared by text similarity and longitude and latitude distance, and the pair is merged if the conditions are met. The calculation is then performed on still larger blocks, and these steps are repeated until the set requirement is met.
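The sort, compare-neighbours, and enlarge cycle can be sketched as follows. This is a minimal sketch, assuming records are dicts with the hypothetical keys `text`, `lat`, and `lon`, that blocks are square grid cells whose side is given in degrees, and that `similar` is any two-record predicate supplied by the caller. Keeping the first record of a merged pair is a choice made for this sketch, and Python's default string sort orders by code point, which may differ from the ordering used in the embodiment.

```python
from collections import defaultdict


def fuse_region(records, similar):
    """Sort one region's records by text and merge similar neighbours."""
    merged = []
    for rec in sorted(records, key=lambda r: r["text"]):
        if merged and similar(merged[-1], rec):
            continue  # rec is absorbed by the record kept before it
        merged.append(rec)
    return merged


def fuse_progressively(records, similar, cell_sizes_deg):
    """Run fuse_region over grid cells of increasing size (in degrees)."""
    current = list(records)
    for size in cell_sizes_deg:
        cells = defaultdict(list)
        for rec in current:
            # Assign each record to the grid cell that contains it.
            cells[(rec["lat"] // size, rec["lon"] // size)].append(rec)
        current = [kept for cell in cells.values() for kept in fuse_region(cell, similar)]
    return current
```

A call such as `fuse_progressively(records, similar, [0.01, 0.02, 0.04])` would fuse within small cells first and then within cells of double and quadruple the side length; the specific sizes are illustrative.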
The calculation of the text similarity and of the longitude and latitude distance is described above and is not repeated here. The longitude and latitude data can be kept to three decimal places, which gives a precision of about 100 meters; if kept to two decimal places, the precision is about 1000 meters.
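The quoted precision figures can be checked with a short calculation. This is a minimal sketch, assuming a spherical Earth with a mean radius of 6,371 km; the function name `metres_per_step` and the sample latitude of 40 degrees are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical model)


def metres_per_step(decimals, latitude_deg=0.0):
    """Ground distance covered by one unit in the last retained decimal.

    Returns (north_south_m, east_west_m); the east-west figure shrinks
    with the cosine of the latitude.
    """
    step_rad = math.radians(10 ** -decimals)
    north_south = EARTH_RADIUS_M * step_rad
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for decimals in (2, 3):
    ns, ew = metres_per_step(decimals, latitude_deg=40.0)
    print(f"{decimals} decimals: about {ns:.0f} m north-south, {ew:.0f} m east-west")
```

One degree of latitude spans roughly 111 km, so the third decimal place corresponds to about 111 m and the second to about 1.1 km, consistent with the approximate values stated above.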
Fig. 7 is a schematic diagram of the building distribution of a residential community. With prior-art address information fusion methods, usually only the address information of buildings 1, 2, and 3, or of buildings 7 and 8, can be merged with one another, while the data of buildings 4, 5, and 6 cannot be merged well with the other data; subsequently, data with high address similarity can be merged within a specified distance parameter.
In this embodiment, on the basis that the original data is irregular after being analyzed by the WebServiceAPI, normalization processing is performed on the data with the same longitude and latitude and different Chinese address names, and the longitude and latitude information of the same geographic location is subjected to normalization processing. By adopting the method of calculating the similarity of the address text and the distance of the geographic longitude and latitude while partitioning based on the geographic position, the highly similar geographic information is merged, and the purpose of reducing the dimensionality of the original address information can be achieved.
When fusing tens of millions of pieces of address information or more, the invention can, through the step-by-step block division strategy of the geographic space, solve the problems of high space complexity and time complexity caused by an excessive data volume, and can greatly reduce data errors caused by incomplete fusion of similar geographic positions.
The method of the embodiment of the invention can be used in various business application fields. Illustratively, owing to factors such as house price, location, and population, residential communities naturally group people with similar purchasing power. By associating users with geographic positions through the method of the embodiment of the invention, basic attributes and purchasing attributes of the users in a residential community can be mined from the community's geographic data, and the repurchase rate can be predicted by crowd and category, which plays an important role in business applications such as precision marketing, recommendation and search, and new-user acquisition.
According to still another aspect of an embodiment of the present invention, there is provided an apparatus for implementing the above method.
Fig. 8 is a schematic diagram of the main modules of an apparatus for address information management in some embodiments of the present invention. As shown in fig. 8, an apparatus 800 for address information management includes:
the region dividing module 801 is configured to divide a target geographic region corresponding to the initial address information set into a plurality of region blocks;
the address fusion module 802 is configured to take each region block as a region to be fused, and merge similar address information in the region to be fused to obtain an updated address information set; respectively take the plurality of region blocks as new regions to be fused, and merge similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set; and gradually expand the geographic range of the region to be fused to further update the address information set until the geographic range of the region to be fused meets a preset condition.
Optionally, the merging, by the address fusion module, of similar address information in the region to be fused includes:
sequencing all address information in the region to be fused to obtain an address list;
judging whether any address information in the address list is similar to the adjacent address information;
if yes, combining the arbitrary address information and the adjacent address information into one address information; otherwise, traversing the next address information in the address list.
Optionally, the address fusion module is further configured to:
before judging whether any address information in the address list is similar to the adjacent address information, confirming that the precision level of any address information is less than or equal to a precision level threshold value;
if the precision level of any address information is greater than the precision level threshold, the precision level of any address information is adjusted to be less than or equal to the precision level threshold.
Optionally, the address fusion module is further configured to: and judging whether the two pieces of address information are similar according to the text similarity and the space distance.
Optionally, the determining, by the address fusion module, whether the two pieces of address information are similar according to the text similarity includes:
determining the length of a matching segment between the address texts of the two pieces of address information and the length of the address text of each piece of address information in the two pieces of address information;
determining text similarity between the two pieces of address information according to the length of the matching fragment and the length of the address text of each piece of address information in the two pieces of address information;
judging whether the text similarity is greater than a text similarity threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
Optionally, the determining, by the address fusion module, whether the two pieces of address information are similar according to the spatial distance includes:
determining the longitude and latitude distance between the two pieces of address information according to the longitude and latitude data of each piece of address information in the two pieces of address information;
judging whether the longitude and latitude distance is smaller than a longitude and latitude distance threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
Fig. 9 is a schematic diagram of main modules of an apparatus for address information management according to another embodiment of the present invention. As shown in fig. 9, the apparatus 800 for address information management further includes an address cleansing module 803, configured to perform at least one of the following operations before each of the region blocks is respectively used as a region to be fused:
judging whether a plurality of address information with the same longitude and latitude data exists in the initial address information set or not; if so, combining the plurality of address information with the same longitude and latitude data into one address information, and screening one address text from the plurality of address information with the same longitude and latitude data as the address text of the combined address information;
judging whether a plurality of address information with the same address text exist in the initial address information set or not; if so, combining the plurality of address information with the same address text into one address information, and screening one piece of longitude and latitude data from the plurality of address information with the same address text as the longitude and latitude data of the combined address information;
judging whether preset characters exist in the address information or not; and if so, removing the preset characters in the address information.
According to another aspect of the embodiments of the present invention, there is provided an electronic device for address information management, including: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method provided by the first aspect of the embodiments of the present invention.
According to a further aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.
Fig. 10 shows an exemplary system architecture 1000 to which the method of address information management or the apparatus of address information management of an embodiment of the present invention may be applied.
As shown in fig. 10, the system architecture 1000 may include terminal devices 1001, 1002, 1003, a network 1004, and a server 1005. The network 1004 is used to provide a medium for communication links between the terminal devices 1001, 1002, 1003 and the server 1005. Network 1004 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 1001, 1002, 1003 to interact with the server 1005 via the network 1004 to receive or transmit messages and the like. Various communication client applications may be installed on the terminal devices 1001, 1002, 1003, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The terminal devices 1001, 1002, 1003 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 1005 may be a server that provides various services, such as a back-end management server (by way of example only) that supports shopping websites browsed by users using the terminal devices 1001, 1002, 1003. The back-end management server may analyze and otherwise process received data such as an estimated delivery time request, and feed the processing result (for example, estimated delivery time information; by way of example only) back to the terminal device.
It should be noted that the method for managing address information provided by the embodiment of the present invention is generally executed by the server 1005, and accordingly, the apparatus for managing address information is generally disposed in the server 1005.
It should be understood that the number of terminal devices, networks, and servers in fig. 10 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 11, shown is a block diagram of a computer system 1100 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU)1101, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the system 1100 are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a signal output unit such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card, a modem, or the like. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as necessary. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as necessary, so that a computer program read out therefrom is installed into the storage section 1108 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 1109 and/or installed from the removable medium 1111. The computer program performs the above-described functions defined in the system of the present invention when executed by the central processing unit (CPU) 1101.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a region division module and an address fusion module. The names of these modules do not in some cases constitute a limitation on the modules themselves; for example, the region division module may also be described as a module that divides a target geographic area corresponding to an initial address information set into a plurality of region blocks.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: dividing a target geographic area corresponding to an initial address information set into a plurality of region blocks; respectively taking each region block as a region to be fused, and merging similar address information in the region to be fused to obtain an updated address information set; respectively taking the plurality of region blocks as new regions to be fused, and merging similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set; and gradually expanding the geographic range of the region to be fused to further update the address information set until the geographic range of the region to be fused meets a preset condition.
According to the technical scheme of the embodiment of the invention, by adopting the step-by-step block division strategy of the geographic space, the problems of higher space complexity and time complexity caused by overlarge data volume can be solved, and the data error caused by incomplete fusion of similar geographic information can be greatly reduced.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for address information management, comprising:
dividing a target geographical area corresponding to an initial address information set into a plurality of area blocks;
respectively taking each region block as a region to be fused, and merging similar address information in the region to be fused to obtain an updated address information set;
respectively taking the plurality of region blocks as new regions to be fused, and merging similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set;
and gradually expanding the geographic range of the area to be fused to further update the address information set until the geographic range of the area to be fused meets a preset condition.
2. The method of claim 1, wherein merging similar address information in the region to be fused comprises:
sequencing all address information in the region to be fused to obtain an address list;
judging whether any address information in the address list is similar to the adjacent address information;
if yes, combining the arbitrary address information and the adjacent address information into one address information; otherwise, traversing the next address information in the address list.
3. The method of claim 1, wherein the method further comprises:
before judging whether any address information in the address list is similar to the adjacent address information, confirming that the precision level of any address information is less than or equal to a precision level threshold value;
if the precision level of any address information is greater than the precision level threshold, the precision level of any address information is adjusted to be less than or equal to the precision level threshold.
4. The method of claim 1, further comprising: and judging whether the two pieces of address information are similar according to the text similarity and the space distance.
5. The method of claim 4, wherein determining whether two address information are similar according to text similarity comprises:
determining the length of a matching segment between the address texts of the two pieces of address information and the length of the address text of each piece of address information in the two pieces of address information;
determining text similarity between the two pieces of address information according to the length of the matching fragment and the length of the address text of each piece of address information in the two pieces of address information;
judging whether the text similarity is greater than a text similarity threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
6. The method of claim 4, wherein determining whether two address information are similar according to the spatial distance comprises:
determining the longitude and latitude distance between the two pieces of address information according to the longitude and latitude data of each piece of address information in the two pieces of address information;
judging whether the longitude and latitude distance is smaller than a longitude and latitude distance threshold value; if yes, judging that the two pieces of address information are similar; otherwise, the two pieces of address information are judged to be dissimilar.
7. The method according to any of claims 1-4, wherein before each of said region blocks is respectively taken as a region to be fused, the method further comprises at least one of:
judging whether a plurality of address information with the same longitude and latitude data exists in the initial address information set or not; if so, combining the plurality of address information with the same longitude and latitude data into one address information, and screening one address text from the plurality of address information with the same longitude and latitude data as the address text of the combined address information;
judging whether a plurality of address information with the same address text exist in the initial address information set or not; if so, combining the plurality of address information with the same address text into one address information, and screening one piece of longitude and latitude data from the plurality of address information with the same address text as the longitude and latitude data of the combined address information;
judging whether preset characters exist in the address information or not; and if so, removing the preset characters in the address information.
8. An apparatus for geolocation information management, comprising:
the region dividing module is used for dividing a target geographic region corresponding to the initial address information set into a plurality of region blocks;
the address fusion module is used for merging similar address information in the areas to be fused by taking each area block as the area to be fused respectively to obtain an updated address information set; respectively taking the plurality of region blocks as new regions to be fused, and merging similar address information in the regions to be fused according to the updated address information set to obtain a re-updated address information set; and gradually expanding the geographic range of the area to be fused to further update the address information set until the geographic range of the area to be fused meets a preset condition.
9. An electronic device for address information management, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium storing a computer program which, when executed by a processor, carries out the method according to any one of claims 1-7.
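
The preprocessing of claim 7 and the block-by-block fusion of claim 8 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation. The record shape `(address_text, longitude, latitude)`, the function names `preprocess`, `merge_similar` and `fuse`, the character set `#*`, the rules for choosing which text or coordinates survive a merge, the use of `difflib.SequenceMatcher` as the similarity measure, the square grid in degrees, and the doubling of the cell size up to a maximum as the "preset condition" are all hypothetical choices that the claims do not specify.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical record shape: (address_text, longitude, latitude).


def preprocess(records, preset_chars="#*"):
    """Claim 7 steps: strip preset characters, then merge exact duplicates."""
    cleaned = [("".join(c for c in text if c not in preset_chars), lon, lat)
               for text, lon, lat in records]
    # Same coordinates: keep one text (assumed rule: the longest one).
    by_coord = {}
    for text, lon, lat in cleaned:
        kept = by_coord.get((lon, lat))
        if kept is None or len(text) > len(kept):
            by_coord[(lon, lat)] = text
    # Same text: keep one coordinate pair (assumed rule: the first one seen).
    by_text = {}
    for (lon, lat), text in by_coord.items():
        by_text.setdefault(text, (lon, lat))
    return [(text, lon, lat) for text, (lon, lat) in by_text.items()]


def merge_similar(records, threshold):
    """Greedy merge: drop a record whose text resembles one already kept."""
    kept = []
    for rec in records:
        if not any(SequenceMatcher(None, rec[0], k[0]).ratio() >= threshold
                   for k in kept):
            kept.append(rec)
    return kept


def fuse(records, cell_deg=0.01, max_cell_deg=0.08, threshold=0.8):
    """Claim 8 flow: merge within blocks, then within ever larger regions."""
    while True:
        regions = defaultdict(list)
        for rec in records:
            key = (math.floor(rec[1] / cell_deg), math.floor(rec[2] / cell_deg))
            regions[key].append(rec)
        records = [r for group in regions.values()
                   for r in merge_similar(group, threshold)]
        if cell_deg >= max_cell_deg:  # assumed form of the preset condition
            return records
        cell_deg *= 2  # each larger cell spans a 2x2 group of adjacent blocks


records = [
    ("No. 1 Main St#", 116.301, 39.901),
    ("No. 1 Main St", 116.301, 39.901),       # same coordinates as above
    ("No. 1 Main St", 116.3012, 39.9011),     # same text, other coordinates
    ("No. 1 Main Street", 116.312, 39.901),   # similar text, neighbouring block
    ("88 Lake Road", 116.305, 39.905),
]
deduped = preprocess(records)
fused = fuse(deduped)
```

In this sketch the two "Main St" variants fall into different blocks at the finest grid level, so they can only be merged once the region to be fused has grown to cover both blocks, which is the point of expanding the geographic range step by step.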
CN202110264752.9A 2021-03-11 2021-03-11 Method and device for managing address information Pending CN112988933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264752.9A CN112988933A (en) 2021-03-11 2021-03-11 Method and device for managing address information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110264752.9A CN112988933A (en) 2021-03-11 2021-03-11 Method and device for managing address information

Publications (1)

Publication Number Publication Date
CN112988933A (en) 2021-06-18

Family

ID=76335003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264752.9A Pending CN112988933A (en) 2021-03-11 2021-03-11 Method and device for managing address information

Country Status (1)

Country Link
CN (1) CN112988933A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568951A (en) * 2021-07-30 2021-10-29 拉扎斯网络科技(上海)有限公司 Data mining and processing method and device, storage medium and electronic equipment
CN114461540A (en) * 2022-04-12 2022-05-10 湖南三湘银行股份有限公司 Processing system for address normalization

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256230A (en) * 2017-05-03 2017-10-17 昆明理工大学 Fusion method based on diversified geographic information points
CN110209748A (en) * 2018-02-13 2019-09-06 北京京东尚科信息技术有限公司 Method and apparatus for indexing geofences
CN110532546A (en) * 2019-07-29 2019-12-03 河北远东通信系统工程有限公司 Automatic alert dispatch method fusing geographic location and text similarity

Similar Documents

Publication Publication Date Title
US11550826B2 (en) Method and system for generating a geocode trie and facilitating reverse geocode lookups
CN108628811B (en) Address text matching method and device
US8996523B1 (en) Forming quality street addresses from multiple providers
US20150187337A1 (en) Resolving label collisions on a digital map
CN113342912B (en) Geographical location area coding method, and method and device for establishing coding model
CN112988933A (en) Method and device for managing address information
CN103514235A (en) Method and device for establishing incremental code library
CN111522838A (en) Address similarity calculation method and related device
CN110858347A (en) Method and device for logistics distribution and order distribution
KR102124657B1 (en) Apparatus and method for processing map data by real time index creation and system thereof
CN110674208B (en) Method and device for determining position information of user
CN113240175B (en) Distribution route generation method, distribution route generation device, storage medium, and program product
CN110657813B (en) Method and device for optimizing planned roads in map
CN112200336A (en) Method and device for planning vehicle driving path
CN112541624B (en) Site selection method, device, medium and electronic equipment for collecting throwing net points
US20140285526A1 (en) Apparatus and method for managing level of detail contents
CN114820960B (en) Method, device, equipment and medium for constructing map
CN114357102A (en) Road network data generation method and device
CN113139258B (en) Road data processing method, device, equipment and storage medium
CN115100231A (en) Method and device for determining region boundary
CN112396081A (en) Data fusion method and device
CN113190676A (en) Method and device for extracting address keywords
CN111861526A (en) Method and device for analyzing object source
CN112328713A (en) Data processing method and device of electronic map, electronic equipment and medium
CN111475742A (en) Address extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination