CN107305577B

CN107305577B - K-means-based appropriate address data processing method and system

Info

Publication number: CN107305577B
Application number: CN201610262654.0A
Authority: CN
Inventors: 胡焕叶
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date: 2016-04-25
Filing date: 2016-04-25
Publication date: 2020-12-22
Anticipated expiration: 2036-04-25
Also published as: CN107305577A

Abstract

The invention discloses a K-means-based appropriate address data processing method and system. The method comprises the following steps: acquiring the geographic coordinates of a plurality of appropriate addresses of a given address to form a coordinate point group; setting the clustering number to 1 according to the K-means clustering algorithm and aggregating the coordinate point group for the first time to obtain a first aggregated coordinate point P₁(x₁,y₁); drawing a circle centered on the first aggregated coordinate point P₁(x₁,y₁) to obtain a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold N₁; and, according to the K-means clustering algorithm, aggregating the coordinate points within the final circle to obtain a final aggregated coordinate point P_f(x_f,y_f), which is taken as the geographic coordinates of the given address. The method uses the K-means clustering algorithm to determine the geographic coordinates of the given address accurately, improves data processing efficiency while preserving the accuracy of the result, and provides rich and accurate data for later refinement of map data.

Description

K-means-based appropriate address data processing method and system

Technical Field

The invention relates to the technical field of processing of geospatial information data, in particular to a K-means-based appropriate address data processing method and system.

Background

With the development and application of spatial information technology and mobile internet, geographic information technology has been widely applied to various fields of internet, especially intelligent location service, which has become one of the key supporting technologies for internet development.

In an e-commerce logistics delivery system, a courier delivers goods to the user according to the waybill address of the order and records the geographic coordinates at the moment of delivery, that is, the geographic coordinates of the appropriate address (the place where the delivery was actually completed). Over time, an e-commerce system typically accumulates many orders delivered to the same address, for example because one user shops frequently and therefore places many orders. Because the courier may differ from order to order, and the exact spot where the user receives the goods may also differ, the recorded geographic coordinates of the appropriate addresses can differ for the same address, so that a single address in the e-commerce system corresponds to the geographic coordinates of a plurality of appropriate addresses. These coordinates are not the final data required by an intelligent location service: they always fluctuate within a certain range, and if the fluctuation range is large it degrades the accuracy of forward and reverse address lookup, that is, of obtaining geographic coordinates from an address or obtaining an address from geographic coordinates.

The delivery system of a large e-commerce company records a large amount of address geographic coordinate data, which is exactly the kind of data source that most current GIS (Geographic Information System) platforms require. Most GIS platforms obtain their data sources by collecting or purchasing geographic coordinates. Collected geographic coordinate data is sparse, often a single sample per address, and its accuracy is affected by factors such as signal quality and the collection personnel, so errors in the collected data are difficult to avoid and difficult to control.

Disclosure of Invention

The technical problem to be solved by the invention is that one address currently corresponds to the geographic coordinates of a plurality of appropriate addresses. To address this, the invention provides a K-means-based appropriate address data processing method and system that determine one accurate geographic coordinate for the address from the geographic coordinates of its appropriate addresses.

In order to solve the above technical problem, according to an aspect of the present invention, the present invention provides a method for processing appropriate address data based on K-means, wherein the method comprises the following steps:

acquiring the geographic coordinates of a plurality of appropriate addresses of a given address, and forming a coordinate point group by taking the geographic coordinates of the plurality of appropriate addresses as source data;

according to the K-means clustering algorithm, with the clustering number set to 1, aggregating the coordinate point group for the first time to obtain a first aggregated coordinate point P₁(x₁,y₁);

drawing a circle centered on the first aggregated coordinate point P₁(x₁,y₁) to obtain a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold N₁;

according to the K-means clustering algorithm, with the clustering number set to 1, aggregating the coordinate points within the final circle to obtain a final aggregated coordinate point P_f(x_f,y_f), and taking the final aggregated coordinate point P_f(x_f,y_f) as the geographic coordinates of the given address.

Preferably, drawing a circle centered on the first aggregated coordinate point P₁(x₁,y₁) comprises the following steps:

drawing a circle centered on the first aggregated coordinate point P₁(x₁,y₁) with a radius R₁ equal to the average distance from all coordinate points to the center P₁;

counting the number of coordinate points within the circle of radius R₁;

if the number of coordinate points within the circle of radius R₁ is greater than or equal to the preset aggregation number threshold N₁, taking the circle of radius R₁ as the final circle;

if the number of coordinate points within the circle of radius R₁ is less than the preset aggregation number threshold N₁, redrawing the circle with the corrected radius obtained from formula 1-1, and repeating until a final circle is obtained;

$$R_{i+1} = R_i + \frac{D_{max} - D_{dev}}{\beta} \qquad \text{(1-1)}$$

where R_i is the radius of the current circle, i = 1, 2, 3, …, k, and k is a natural number; D_max and D_dev are respectively the maximum and the mean of the distances from all coordinate points in the current circle to the center P₁; and β is a weight.
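The radius-correction loop described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `final_circle_radius`, the use of plain Euclidean distance on the coordinate pairs, and the `max_iter` and zero-step guards are assumptions that do not appear in the source.

```python
import math


def final_circle_radius(points, center, n1, beta, max_iter=1000):
    """Return the radius of the first circle around `center` holding at least n1 points.

    Assumptions (not from the source): Euclidean distance, and a guard that
    raises instead of looping forever when the radius stops growing.
    """
    dists = [math.dist(p, center) for p in points]
    radius = sum(dists) / len(dists)  # R_1: mean distance of all points to the center
    for _ in range(max_iter):
        inside = [d for d in dists if d <= radius]
        if len(inside) >= n1:
            return radius  # this circle is the final circle
        if not inside:
            break  # nothing inside the circle, so D_max and D_dev are undefined
        d_max = max(inside)                # D_max within the current circle
        d_dev = sum(inside) / len(inside)  # D_dev: mean distance within the current circle
        step = (d_max - d_dev) / beta      # increment from formula 1-1
        if step <= 0:
            break  # radius can no longer grow (assumed stop condition)
        radius += step
    raise ValueError("no final circle found for the given threshold")


# Hypothetical data: four nearby points and one outlier on a line.
pts = [(1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
print(final_circle_radius(pts, (0, 0), n1=4, beta=2))  # first circle already qualifies
print(final_circle_radius(pts, (0, 0), n1=5, beta=2))  # radius grows until the outlier is inside
```

Because D_max and D_dev are measured only over points already inside the current circle, the increment is zero whenever those points are all equidistant from the center; the guard in the sketch covers that case, which the source does not discuss.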

Preferably, if the address is an index address, β = 2; if the address is a last-level address, β = 4.

Preferably, the aggregation number threshold is N₁ = N × λ, where N is the number of source data points; λ is any value in the interval 95%-97% if the address is an index address, and any value in the interval 70%-80% if the address is a last-level address.
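The parameter choice for the two address types can be sketched as a small helper. The function name `aggregation_params`, the default of taking the lower end of each λ interval, and rounding N₁ up to a whole point count are assumptions made for illustration; the source only gives the intervals and the β values.

```python
def aggregation_params(n_source, is_index, lam_percent=None):
    """Return (N1, beta) for an address.

    lam_percent is lambda expressed as an integer percentage. Defaulting to the
    lower end of the interval and rounding N1 up are assumptions, not taken
    from the source.
    """
    low, high, beta = (95, 97, 2) if is_index else (70, 80, 4)
    if lam_percent is None:
        lam_percent = low
    if not low <= lam_percent <= high:
        raise ValueError(f"lam_percent must lie in [{low}, {high}]")
    n1 = -(-n_source * lam_percent // 100)  # integer ceiling of N * lambda
    return n1, beta


# With 200 source points, as in the embodiment's example data size.
print(aggregation_params(200, is_index=True))
print(aggregation_params(200, is_index=False))
```

Integer arithmetic is used for the ceiling so that the threshold does not depend on floating-point rounding of values such as 0.95.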

Preferably, before the coordinate point groups are aggregated for the first time, it is determined whether the number of the source data is greater than or equal to a predetermined number threshold, and if the number of the source data is greater than or equal to the predetermined number threshold, the coordinate point groups are aggregated for the first time.

Preferably, when the circle is redrawn with the radius obtained from formula 1-1, the method includes a step of determining whether the address is an index address or a last-level address:

reading the identification bit of the given address, and determining from its value whether the address is an index address or a last-level address.

Preferably, the method further comprises the step of classifying the user address:

carrying out segmentation processing on the user address according to the geographic region from large to small to form a plurality of address segments from a first address segment to a tail address segment;

sequentially counting the number of geographic coordinate points corresponding to each address segment according to the address segments from large to small in the geographic region;

comparing the number of geographic coordinate points corresponding to each address segment with a preset threshold; if the number of geographic coordinate points corresponding to the current address segment is smaller than the preset threshold and the number corresponding to the previous address segment is greater than or equal to the preset threshold, taking the segments from the current address segment to the tail address segment as the last-level address, and taking the segments from the first address segment to the previous address segment as the index address;

and setting corresponding identification bits for the index address and the last-stage address respectively.
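The classification rule above can be sketched as follows. The function name `split_address`, the segment labels, and the per-segment counts are hypothetical; the counts are assumed to be supplied by the caller in the same large-to-small order as the segments.

```python
def split_address(segments, counts, threshold):
    """Split ordered address segments into (index_part, last_level_part).

    segments: segment names ordered from the largest region to the smallest.
    counts:   number of geographic coordinate points per segment, same order.
    Returns None when no segment satisfies the rule.
    """
    for i in range(1, len(segments)):
        # Current segment is below the threshold while the previous one is not.
        if counts[i] < threshold and counts[i - 1] >= threshold:
            return segments[:i], segments[i:]
    return None


# Hypothetical segments and counts, for illustration only.
segments = ["province", "city", "district", "street", "street number", "building", "room"]
counts = [90000, 40000, 8000, 900, 300, 40, 6]
index_part, last_level_part = split_address(segments, counts, threshold=50)
print(index_part)       # segments forming the index address
print(last_level_part)  # segments forming the last-level address
```

The two returned parts are what would then receive the identification bits.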

Preferably, before the number of geographic coordinate points corresponding to each address segment is counted, each address segment is matched against a preset index address matching table to determine which address segments belong to the index address and which are non-index address segments;

when the numbers of geographic coordinate points are counted, only the non-index address segments are counted;

and when the numbers are compared with the preset threshold, only the numbers of geographic coordinate points corresponding to the non-index address segments are compared.
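The matching-table shortcut amounts to a partition of the segments before any counting is done. In this sketch the function name `partition_by_match_table` and the table contents are hypothetical; the source does not say how the table is stored or populated.

```python
def partition_by_match_table(segments, index_table):
    """Separate segments found in the index address matching table from the rest.

    Only the second list would go on to counting and threshold comparison.
    """
    matched = [s for s in segments if s in index_table]
    unmatched = [s for s in segments if s not in index_table]
    return matched, unmatched


# Hypothetical table of segments known in advance to be index-level.
index_table = {"province", "city", "district"}
segments = ["province", "city", "district", "street", "building", "room"]
matched, to_count = partition_by_match_table(segments, index_table)
print(matched)
print(to_count)
```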

In order to solve the above technical problem, according to another aspect of the present invention, there is provided a K-means-based appropriate address data processing system, comprising:

a source data acquisition module, configured to acquire, for a provided address, the geographic coordinates of a plurality of appropriate addresses corresponding to that address and to form a coordinate point group in a geographic information system;

a K-means clustering module, configured to aggregate a given coordinate point group according to a set clustering number to obtain that number of aggregated coordinate points; and

a final circle acquisition module, configured to obtain, based on the first aggregated coordinate point P₁(x₁,y₁) produced by the K-means clustering module, a final circle in which the number of coordinate points is greater than or equal to the preset aggregation number threshold N₁.

Preferably, the system further includes an address type determination module, configured to determine that the address is an index address or a last-level address.

Preferably, the final circle acquiring module includes a first circle acquiring unit, a correction circle acquiring unit, a statistical unit and a comparing unit;

the first circle acquisition unit is connected with the K-means clustering module and the source data acquisition module, and is configured to take the first aggregated coordinate point P₁(x₁,y₁) obtained by the K-means clustering module as the center and the average distance R₁ from all coordinate points obtained from the source data acquisition module to the center P₁(x₁,y₁) as the radius, thereby obtaining a first circle;

the correction circle acquisition unit is connected with the first circle acquisition unit, the address category determination module and the comparison unit respectively; on receiving a recalculation notification from the comparison unit, it obtains the corrected radius R_{i+1} from formula 1-1 and, with the coordinate point P₁(x₁,y₁) as the center, obtains a correction circle; on receiving a qualification notification from the comparison unit, it determines the circle of radius R₁ as the final circle;

$$R_{i+1} = R_i + \frac{D_{max} - D_{dev}}{\beta} \qquad \text{(1-1)}$$

where R_i is the radius of the current circle, i = 1, 2, 3, …, k, and k is a natural number; D_max and D_dev are respectively the maximum and the mean of the distances from all coordinate points in the current circle to the point P₁(x₁,y₁); and β is a weight;

the statistical unit is connected with the first circle acquisition unit and the correction circle acquisition unit, and is configured to count the number of coordinate points in the first circle and in the correction circle;

the comparison unit is connected with the statistical unit, and is configured to compare the number of coordinate points in the first circle and in the correction circle with the aggregation number threshold N₁; if the number of coordinate points in the circle is less than the preset aggregation number threshold N₁, it sends a recalculation notification to the correction circle acquisition unit, and if the number is greater than or equal to the preset aggregation number threshold N₁, it sends a qualification notification to the correction circle acquisition unit.

Preferably, the system of the present invention further comprises an address processing module, and the address processing module includes:

the address segmentation unit is used for segmenting the address from large to small according to the geographic area to form a plurality of address segments;

the number counting unit is connected with the address segmentation unit and used for counting the number of the geographic coordinate points corresponding to each address segmentation;

the comparison unit is connected with the quantity counting unit and used for comparing the quantity of the geographic coordinate points corresponding to each address segment with a preset threshold value; and

and the address category marking unit is connected with the comparing unit and marks the category of the address according to the comparison result of the comparing unit.

The address processing module further comprises:

the index address matching unit is connected with the address segmentation unit and the quantity counting unit and is used for comparing the plurality of address segments with a preset index address matching table and determining the address segments belonging to the index address and the address segments of the non-index address;

the number counting unit is used for counting the number of the geographic coordinate points corresponding to the address segments belonging to the non-index addresses.

The invention uses the K-means clustering algorithm to determine accurately the geographic coordinates corresponding to an address, and its concise processing flow improves data processing efficiency while preserving the accuracy of the result. The invention uses the abundant appropriate-address geographic coordinates recorded by couriers and averages a plurality of data points, which reduces data error and improves positioning accuracy compared with a single collected data point per address. Because the appropriate addresses include not only urban addresses but also many rural addresses, the coordinates determined by the invention provide abundant and accurate data for later refinement of map data.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:

FIG. 1 is a schematic flow chart of a K-means-based method for processing appropriate address data according to the present invention;

FIG. 2 is a schematic diagram of the architecture of the K-means-based appropriate address data processing system according to the present invention;

FIG. 3 is a schematic diagram of another architecture of the K-means-based appropriate address data processing system according to the present invention;

FIG. 4 is a schematic diagram of the structure of the address processing module according to the present invention;

FIG. 5 is a schematic diagram of a structural principle of the final circle acquiring module according to the present invention;

FIG. 6 is a schematic flow chart of a K-means-based appropriate address data processing method according to an embodiment of the present invention;

FIG. 7 is a distribution diagram of the raw data obtained with the method shown in FIG. 6;

FIG. 8 is a diagram of the circles and the data distribution obtained after multiple aggregations with the method shown in FIG. 6;

FIG. 9 is a distribution diagram of the final circle and the final geographic coordinates obtained after aggregation with the method shown in FIG. 6; and

FIG. 10 is a schematic diagram of the final geographic coordinates obtained with the method shown in FIG. 6, displayed on a map.

Detailed Description

The present invention will be described below based on examples, but the present invention is not limited to these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, and processes have not been described in detail so as not to obscure the present invention. The figures are not necessarily drawn to scale.

The flowcharts and block diagrams in the figures illustrate possible architectures, functions, and operations of systems, methods, and apparatuses according to embodiments of the present invention. Each block may represent a module, a program segment, or merely a code segment, that is, executable instructions for implementing a specified logical function. It should also be noted that the executable instructions that implement the specified logical functions may be recombined to create new modules and program segments. The blocks of the drawings, and the order of the blocks, are thus provided to better illustrate the processes and steps of the embodiments and should not be taken as limiting the invention itself.

FIG. 1 is a schematic flow chart of the K-means-based appropriate address data processing method according to the present invention, and FIG. 2 is a schematic diagram of the K-means-based appropriate address data processing system according to the present invention. The invention is described below with reference to FIG. 1 and FIG. 2.

The K-means-based appropriate address data processing system comprises a source data acquisition module 1, a K-means clustering module 2 and a final circle acquisition module 3. The source data acquisition module 1 acquires, for a provided address, the geographic coordinates of a plurality of appropriate addresses corresponding to that address to form a coordinate point group. The K-means clustering module 2 aggregates a given coordinate point group with the clustering number set to 1 to obtain an aggregated coordinate point: it first obtains the first aggregated coordinate point P₁(x₁,y₁), and then aggregates the coordinate points lying within the circle obtained by the final circle acquisition module 3 to obtain the final aggregated coordinate point, which is the geographic coordinate corresponding to the address. The final circle acquisition module 3 obtains, based on the first aggregated coordinate point P₁(x₁,y₁) produced by the K-means clustering module 2, a final circle in which the number of coordinate points is greater than or equal to the preset aggregation number threshold N₁.

Based on the system shown in FIG. 2, the method for processing the appropriate address data based on the K-means provided by the invention comprises the following steps:

Step S1, the source data acquisition module 1 acquires the geographic coordinates of a plurality of appropriate addresses of a given address, and inputs them as source data into the processing system to form a coordinate point group;

Step S2, the K-means clustering module 2 sets the clustering number K to 1 according to the K-means clustering algorithm and aggregates the coordinate point group for the first time to obtain the first aggregated coordinate point P₁(x₁,y₁);

Step S3, the final circle acquisition module 3 draws a circle centered on the first aggregated coordinate point P₁(x₁,y₁) and obtains a final circle in which the number of coordinate points is greater than or equal to the preset aggregation number threshold N₁;

Step S4, the K-means clustering module 2 sets the clustering number K to 1 according to the K-means clustering algorithm and aggregates the coordinate points within the final circle to obtain the final aggregated coordinate point P_f(x_f,y_f).
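Steps S2 and S4 can be sketched compactly, because K-means with a single cluster assigns every point to that cluster and its centroid is simply the arithmetic mean of the points. In the sketch below the function names are hypothetical, Euclidean distance is assumed, and the radius of the final circle from step S3 is passed in as a parameter rather than computed.

```python
import math


def aggregate_k1(points):
    """K-means with K = 1: all points share one cluster, so the centroid is their mean."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def locate_address(points, final_radius):
    """Return P_f for a coordinate point group, given the final circle's radius."""
    p1 = aggregate_k1(points)  # step S2: first aggregated point P_1
    # Keep only the points inside the final circle around P_1 (result of step S3).
    kept = [p for p in points if math.dist(p, p1) <= final_radius]
    return aggregate_k1(kept)  # step S4: final aggregated point P_f


# Hypothetical data: four clustered points and one distant deviation point.
points = [(0, 0), (2, 0), (0, 2), (2, 2), (21, 1)]
print(aggregate_k1(points))         # pulled toward the distant point
print(locate_address(points, 6.0))  # distant point excluded before re-aggregation
```

The example shows the intended effect of the second aggregation: the distant point shifts the first centroid but no longer influences the final one.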

The invention utilizes the idea of clustering algorithm: entities in the same cluster are similar, and entities in different clusters are dissimilar; and clustering the geographic coordinates of a plurality of proper addresses of the given address, wherein the distance between any two points in the same cluster is smaller than the distance between any two points in different clusters. On the basis of a clustering algorithm, source data coordinate points are analyzed, data coordinates are classified to remove far-away deviation points, and then the remaining accurate coordinate points are aggregated, so that the accuracy of the geographical coordinates obtained by aggregation at last is improved.

The given address is divided into an index address and a last-level address, and different parameters are used in the calculation according to the address type in order to obtain a more accurate aggregated coordinate. The index address is an address covering a large geographic area, for example an address consisting of province, city and district. In the invention, an address that covers such a large geographic area and whose number of corresponding geographic coordinates is greater than a predetermined threshold is referred to as an index address. A detailed user address, such as a particular room in a particular building of a particular residential community, covers a narrow area; the number of its corresponding geographic coordinates is less than the predetermined threshold, and it is referred to as a last-level address. Thus a larger number of geographic coordinates corresponds to the index address, and a smaller number corresponds to the last-level address. Following the normal distribution in probability theory, among the geographic coordinates of an index address, events with a probability of no more than 3%-5% are treated as small-probability events: they occur very rarely and can be ignored. The geographic coordinates corresponding to a last-level address are relatively concentrated, and the 70%-80% of coordinate points that lie closest together are enough to represent the coordinates of the address. Therefore, for an index address, 95%-97% of the number of source data points is taken as the reference threshold for the aggregation number, and for a last-level address, 70%-80% of the number of source data points is taken as the reference threshold.

Therefore, the system of the present invention further includes an address category determination module 4. If the user address has already been processed before it enters the system, that is, it has been divided into an index address and a last-level address, an identification bit is set in the address, for example 0 for the index address and 1 for the last-level address, and reading the identification bit determines whether the address is an index address or a last-level address.

If the address has not been processed before the method of the present invention is performed, the system further includes an address processing module 5, as shown in FIG. 3, which splits the user address into an index address and a last-level address and adds the corresponding identification bits for the address category determination module 4 to read.

Specifically, as shown in fig. 4, the address processing module 5 includes: an address segmentation unit 51, a number statistics unit 52, a comparison unit 53 and an address class marking unit 54.

The address segmentation unit 51 is configured to segment the address by geographic area from large to small to form a plurality of address segments, such as a province segment, a city segment, a district segment, a street segment, a street number segment, a residential community segment, a building number segment and a room number segment. Segmentation also includes cleaning the address, for example removing symbols such as brackets and spaces.

The number counting unit 52 is connected to the address segmenting unit 51, and is configured to count the number of the geographic coordinate points corresponding to each address segment.

The comparison unit 53 is connected to the number counting unit 52 and is configured to compare the number of geographic coordinate points corresponding to each address segment with a predetermined threshold. This reference threshold can be obtained through repeated tests under actual conditions: a candidate value is set, the operation is run, and the results are compared and analyzed until a suitable value is found. Generally, a smaller threshold is set when the amount of source data is small, and the threshold is raised once enough source data has accumulated. For example, if in a batch of addresses processed by the address processing module 5 most addresses (for example, 70% or more of the total) each have 50 to 100 appropriate-address data points, the threshold may be set to 50.

When the number of geographic coordinate points corresponding to an address segment is greater than the reference threshold, the address is regarded as an index address. If the number of geographic coordinate points corresponding to a certain address segment is less than the reference threshold and the number corresponding to the previous address segment is greater than or equal to the threshold, the segments from the current address segment to the tail address segment form the last-level address, and the segments from the first address segment to the previous address segment form the index address. The comparison result is sent to the address category marking unit 54, which marks the determined last-level address and index address accordingly, for example 0 for the index address and 1 for the last-level address.

Preferably, to speed up processing, an index address matching table covering the obvious address segments is provided in the system. After the address is segmented and before the number of geographic coordinate points corresponding to each address segment is counted, index address matching is performed first to separate the address segments belonging to the index address from the non-index address segments. For the non-index address segments, the number of geographic coordinate points corresponding to each segment is counted and compared with the reference threshold to determine whether any of them can still be assigned to the index address. The address processing module 5 thus divides the user address into an index address and a last-level address and marks them for the address category determination module 4 to read, so that different calculation parameters can be used in subsequent calculation.

Fig. 5 is a schematic diagram illustrating a structural principle of the final circle acquiring module 3 according to the present invention. The final circle acquiring module 3 includes a first circle acquiring unit 31, a correction circle acquiring unit 32, a counting unit 33, and a comparing unit 34.

The first circle acquisition unit 31 is connected to the K-means clustering module 2 and the source data acquisition module 1, and is configured to take the first aggregated coordinate point P₁(x₁,y₁) obtained by the K-means clustering module 2 as the center and the average distance R₁ from all coordinate points obtained from the source data acquisition module 1 to the center P₁(x₁,y₁) as the radius, thereby obtaining a first circle.

The correction circle acquisition unit 32 is connected to the first circle acquisition unit 31, the address category determination module 4 and the comparison unit 34 respectively. On receiving a recalculation notification from the comparison unit 34, it obtains the corrected radius from formula 1-1 and, with the coordinate point P₁(x₁,y₁) as the center, obtains a correction circle; on receiving a qualification notification from the comparison unit 34, it determines the circle of radius R₁ as the final circle.

$$R_{i+1} = R_i + \frac{D_{max} - D_{dev}}{\beta} \qquad \text{(1-1)}$$

where R_i is the radius of the current circle, i = 1, 2, 3, …, k, and k is a natural number; D_max and D_dev are respectively the maximum and the mean of the distances from all coordinate points in the current circle to the center P₁(x₁,y₁); and β is a weight. The circle radius is enlarged by a bisection or a quartering method. An index address generally has a large amount of coordinate data spread over a wide area, so its circle radius is enlarged by bisection; a last-level address is a detailed user address covering a narrow area with less coordinate data, so its circle radius is enlarged by quartering. Based on the result of the address category determination module 4, β = 2 when the address is an index address and β = 4 when the address is a last-level address. The reason is that an index address covers a large area and its data points are widely scattered, so the denominator is set small and the radius grows by a larger step each time it is changed; conversely, a last-level address covers a small area and its data points are densely distributed, so the denominator can be set larger. Other values of β are of course possible; in this embodiment, from an engineering point of view, the bisection or quartering method is normally used.
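As a worked illustration of formula 1-1, take the purely hypothetical values R_i = 100, D_max = 96 and D_dev = 60, all in the same distance unit:

$$
\beta = 2:\quad R_{i+1} = 100 + \frac{96 - 60}{2} = 118, \qquad
\beta = 4:\quad R_{i+1} = 100 + \frac{96 - 60}{4} = 109
$$

With these numbers the index address (β = 2) enlarges the radius by 18 per correction, twice the step of 9 applied to the last-level address (β = 4).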

The counting unit 33 is connected to the first circle acquiring unit 31 and the correction circle acquiring unit 32, and is configured to count the number of coordinate points in the first circle and the correction circle, and send the counted result to the comparing unit 34.

The comparing unit 34 is connected to the counting unit 33 and compares the number of coordinate points in the first circle or the correction circle with the aggregation number threshold N₁. If the number of coordinate points in the circle is less than the preset aggregation number threshold N₁, it sends a recalculation notification to the correction circle acquiring unit 32; if the number of coordinate points in the circle is greater than or equal to N₁, it sends a qualification notification to the correction circle acquiring unit 32.

In addition, before the first aggregation of the coordinate point group, it is necessary to determine whether the number of source data is greater than or equal to a predetermined number threshold, that is, whether there are enough coordinates to aggregate. If the number of source data is greater than or equal to the predetermined number threshold (for example, 40 to 50), the first aggregation of the coordinate point group is performed; if it is less than the threshold, no processing is performed. The number threshold can be set to different specific values as the situation requires.
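The pre-check and the first aggregation can be illustrated as follows. With the number of clusters set to 1, K-means places its single center at the arithmetic mean of all points, so this sketch computes the mean directly instead of calling a clustering library. The function name `first_aggregation`, the default threshold of 50, and the treatment of coordinates as plain (x, y) pairs are assumptions made for illustration.

```python
def first_aggregation(points, min_points=50):
    """Return P1 = (x1, y1), or None when there are too few source points.

    points: list of (x, y) coordinate pairs for one given address.
    min_points: the predetermined number threshold (assumed default: 50).
    K-means with one cluster reduces to the mean of the points.
    """
    n = len(points)
    if n < min_points:
        return None  # not enough coordinates to aggregate: no processing
    x1 = sum(x for x, _ in points) / n
    y1 = sum(y for _, y in points) / n
    return (x1, y1)
```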

The present invention will be described in detail below with reference to specific examples.

For example, take the detailed address: No. 1589 Century Avenue, Pudong New Area, Shanghai (index address) + Room 2009, Changtai International Financial Building (last-level address). The process of obtaining accurate geographic coordinates according to the method of the present invention is shown in fig. 6 and proceeds as follows:

Step S1a, acquiring the geographic coordinates of a plurality of appropriate addresses for "Room 2009, Changtai International Financial Building"; as shown in fig. 7, 200 coordinates are obtained. These geographic coordinates are input to the data processing system as source data to form a coordinate point group.

Step S2a, determining whether the number of coordinate points in the coordinate point group is greater than or equal to a set number threshold (e.g. 50); if not, no processing is performed, and if so, the next step is performed.

Step S3a, calling the K-means clustering algorithm and setting the number of clusters to 1.

Step S4a, performing the first aggregation on the coordinate point group to obtain the first aggregated coordinate point P₁(x₁,y₁).

Step S5a, calculating the distances from all coordinate points to the center P₁(x₁,y₁), and finding the average distance D_dev and the maximum distance D_max.

Step S6a, drawing a circle with the first aggregated coordinate point P₁(x₁,y₁) as the center and the average distance D_dev as the radius, and counting the number of coordinate points in the circle, N_current.

Step S7a, judging whether the address is an index address or a last-level address. If it is an index address, in step S8a, take N₁ = N × λ = N × 95% and β = 2; if it is a last-level address, in step S9a, take N₁ = N × λ = N × 70% and β = 4. In this embodiment the address is a last-level address, so N₁ = N × λ = N × 70% and β = 4; with the N = 200 coordinates acquired in step S1a, N₁ = 140.

Step S10a, judging whether the number of coordinate points in the circle, N_current, is greater than or equal to N₁. If N_current is greater than or equal to N₁, the current circle is qualified, and in step S11a the coordinate points in the circle are aggregated to obtain the final aggregated coordinate point P_f(x_f,y_f), which is output as the geographic coordinates of the address, as shown in fig. 10.

If N_current is less than N₁, the current circle is unqualified. In step S12a, the radius of the circle is recalculated using the formula R_{i+1} = R_i + (D_max - D_dev)/β, and the flow returns to step S6a to draw the circle again, as shown in fig. 8. Whether the number of coordinate points in the current circle meets the requirement, that is, whether N_current is greater than or equal to N₁, is then judged again, and this is repeated until a qualified final circle is obtained, as shown in fig. 9.
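Steps S2a to S12a can be put together in a short Python sketch. It is not the patented implementation, only one way to read the flow under stated assumptions: coordinates are treated as planar (x, y) pairs with Euclidean distance; D_max and D_dev are computed once over all points, as in step S5a; N₁ = N × λ is rounded up to an integer; and an iteration cap is added as a safety guard. The names `centroid` and `locate_address` and all parameter names are hypothetical.

```python
import math


def centroid(points):
    """K-means with a single cluster converges to the mean of the points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def locate_address(points, is_index_address, min_points=50, max_iter=1000):
    """Return the final aggregated point P_f, or None if nothing is computed."""
    n = len(points)
    if n < min_points:                       # S2a: too few source points
        return None
    # S7a-S9a: lambda (in percent) and beta depend on the address category.
    lam_pct, beta = (95, 2) if is_index_address else (70, 4)
    n1 = (n * lam_pct + 99) // 100           # N1 = N * lambda, rounded up (assumption)
    p1 = centroid(points)                    # S3a-S4a: first aggregation
    dists = [math.hypot(x - p1[0], y - p1[1]) for x, y in points]   # S5a
    d_dev = sum(dists) / n
    d_max = max(dists)
    radius = d_dev                           # S6a: first circle
    for _ in range(max_iter):                # guard against endless looping (assumption)
        inside = [p for p, d in zip(points, dists) if d <= radius]
        if len(inside) >= n1:                # S10a: circle is qualified
            return centroid(inside)          # S11a: second aggregation
        radius += (d_max - d_dev) / beta     # S12a: formula 1-1
    return None
```

As a usage example, take 40 points spread symmetrically around (0, 0) plus 10 outliers at (10, 0). The first aggregation is pulled to (2, 0); for a last-level address the first circle already holds the 40 clustered points (N₁ = 35), so the second aggregation returns the cluster center and discards the outliers. For an index address N₁ = 48, so the circle keeps growing until it also takes in the outliers.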

As the above flow and the accompanying drawings show, the invention uses the K-means clustering algorithm to accurately determine the geographic coordinates of a given address and, through a concise processing flow, improves data processing efficiency while ensuring the accuracy of the result. The invention uses the abundant geographic coordinates of properly delivered addresses provided by couriers and takes the mean of many data points; compared with a single collected data point for an address, this reduces position errors caused by factors such as poor signal and human error and improves the accuracy of geographic positioning. In addition, because the appropriate addresses in the invention include not only urban addresses but also many rural addresses, the geographic coordinates determined by the invention provide rich and accurate data for the later refinement of map data.

While embodiments in accordance with the invention have been described above, these embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The scope of the invention should be determined from the following claims.

Claims

1. A K-means based appropriate address data processing method, comprising the following steps:

setting the number of clusters to 1 according to a K-means clustering algorithm, and performing a first aggregation on the coordinate point group to obtain a first aggregated coordinate point P₁(x₁,y₁);

drawing a circle centered on the first aggregated coordinate point P₁(x₁,y₁) to obtain a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold N₁, wherein it is judged whether the address category of the given address is an index address or a last-level address, and a weight is selected according to the address category of the given address to obtain a correction radius with which the circle is redrawn to obtain the final circle;

according to the K-means clustering algorithm, with the number of clusters set to 1, performing a second aggregation on the plurality of coordinate points in the final circle to obtain a final aggregated coordinate point P_f(x_f,y_f), and determining the final aggregated coordinate point P_f(x_f,y_f) as the geographic coordinates of the given address,

wherein the index address has a geographic area greater than the last level address and the index address contains a number of geographic coordinates greater than a predetermined threshold.

2. The K-means based appropriate address data processing method according to claim 1, wherein drawing a circle centered on the first aggregated coordinate point P₁(x₁,y₁) comprises the following steps:

drawing a circle with the first aggregated coordinate point P₁(x₁,y₁) as the center and the average distance R₁ from all coordinate points to the center P₁(x₁,y₁) as the radius;

counting the number of coordinate points within the circle of radius R₁;

if the number of coordinate points within the circle of radius R₁ is less than the preset aggregation number threshold N₁, redrawing the circle with the correction radius obtained by formula 1-1 until the final circle is obtained;

R_{i+1} = R_i + (D_max - D_dev)/β    (1-1)

wherein R_i is the radius of the current circle, i = 1, 2, 3, …, k, k being a natural number; D_max and D_dev are, respectively, the maximum and the mean of the distances from all coordinate points in the current circle to the center P₁(x₁,y₁); and β is a weight.

3. The K-means based appropriate address data processing method according to claim 2, wherein β = 2 if the given address is an index address, and β = 4 if the given address is a last-level address.

4. The K-means based appropriate address data processing method according to claim 1, 2 or 3, wherein the aggregation number threshold N₁ = N × λ, where N is the number of source data, λ is any value from 95% to 97% if the given address is an index address, and λ is any value from 70% to 80% if the given address is a last-level address.

5. The K-means-based appropriate address data processing method according to claim 1, wherein before the first aggregation of the coordinate point groups, it is determined whether the number of source data is greater than or equal to a predetermined number threshold, and if the number of source data is greater than or equal to the predetermined number threshold, the first aggregation of the coordinate point groups is performed.

6. The K-means-based appropriate address data processing method according to claim 2, wherein redrawing the circle with the radius obtained by formula 1-1 comprises a step of judging whether the given address is an index address or a last-level address:

reading the identification bit of the given address, and judging whether the given address is an index address or a last-level address according to the identifier in the identification bit.

7. The K-means based appropriate address data processing method according to claim 6, further comprising a step of performing classification processing on the user address:

setting corresponding identifiers for the index address and the last-level address, respectively.

8. The K-means-based appropriate address data processing method according to claim 7, wherein before counting the number of geographic coordinate points corresponding to each address segment in turn, the address segments are matched with a preset index address matching table to determine address segments belonging to index addresses and address segments belonging to non-index addresses;

9. A K-means based appropriate address data processing system, comprising:

the source data acquisition module is used for acquiring the geographic coordinates of a plurality of appropriate addresses corresponding to a given address according to the given address to form a coordinate point group;

the K-means clustering module is used for carrying out first aggregation and second aggregation on the set coordinate point group according to the set clustering number to obtain an aggregation coordinate point with the set clustering number; and

a final circle acquiring module for obtaining, based on the first aggregated coordinate point P₁(x₁,y₁) obtained by the K-means clustering module, a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold N₁,

wherein the final circle acquiring module judges whether the address category of the given address is an index address or a last-level address, selects a weight according to the address category of the given address to obtain a correction radius, and redraws the circle to obtain the final circle,

the K-means clustering module, during the first aggregation, sets the number of clusters to 1 according to the K-means clustering algorithm and aggregates the coordinate point group to obtain the first aggregated coordinate point P₁(x₁,y₁), and, during the second aggregation, aggregates the plurality of coordinate points in the final circle with the number of clusters set to 1 according to the K-means clustering algorithm to obtain a final aggregated coordinate point P_f(x_f,y_f),

wherein the geographic area of the index address is larger than that of the last-level address, and the index address comprises a number of geographic coordinates larger than a predetermined threshold.

10. The K-means based appropriate address data processing system of claim 9, further comprising:

an address type judging module for judging whether the given address is an index address or a last-level address.

11. The K-means based appropriate address data processing system of claim 10, wherein the final circle acquiring module includes a first circle acquiring unit, a correction circle acquiring unit, a counting unit, and a comparing unit;

the correction circle acquiring unit is connected with the first circle acquiring unit, the address type judging module and the comparing unit, respectively, and, according to a recalculation notification output by the comparing unit, obtains a correction radius by formula 1-1 and obtains a correction circle centered on the coordinate point P₁(x₁,y₁), or, according to a qualification notification output by the comparing unit, determines the circle of radius R₁ as the final circle;

R_{i+1} = R_i + (D_max - D_dev)/β    (1-1)

wherein R_i is the radius of the current circle, i = 1, 2, 3, …, k, k being a natural number; D_max and D_dev are, respectively, the maximum and the mean of the distances from all coordinate points in the current circle to the point P₁(x₁,y₁); and β is a weight;

12. The K-means based appropriate address data processing system of claim 10, further comprising an address processing module, the address processing module comprising:

an address segmentation unit for segmenting the given address from large to small by geographic area to form a plurality of address segments;

a quantity counting unit, connected with the address segmentation unit, for counting the number of geographic coordinate points corresponding to each address segment;

an address category marking unit, connected with the comparison unit, for marking the category of the given address according to the comparison result of the comparison unit.

13. The K-means based appropriate address data processing system of claim 12, wherein the address processing module further comprises:

an index address matching unit, connected with the address segmentation unit and the quantity counting unit, for comparing the plurality of address segments with a preset index address matching table and determining the address segments belonging to index addresses and the address segments belonging to non-index addresses;

14. A K-means based appropriate address data processing apparatus, comprising: a memory; and a processor coupled to the memory, the processor configured to execute the K-means based appropriate address data processing method of any of claims 1 to 8 based on instructions stored in the memory.

15. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the K-means based appropriate address data processing method of any of claims 1 to 8.