CN107305577B - K-means-based appropriate address data processing method and system - Google Patents

Info

Publication number
CN107305577B
Authority
CN
China
Prior art keywords
address
circle
coordinate points
index
aggregation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610262654.0A
Other languages
Chinese (zh)
Other versions
CN107305577A (en
Inventor
胡焕叶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Zhenshi Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201610262654.0A
Publication of CN107305577A
Application granted
Publication of CN107305577B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a K-means-based appropriate address data processing method and system. The method comprises the following steps: acquiring the geographic coordinates of a plurality of appropriate addresses of a given address to form a coordinate point group; setting the cluster number to 1 and, according to the K-means clustering algorithm, aggregating the coordinate point group a first time to obtain a first aggregated coordinate point $P_1(x_1, y_1)$; drawing a circle centered on the first aggregated coordinate point $P_1(x_1, y_1)$ and obtaining a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold $N_1$; and, according to the K-means clustering algorithm, aggregating the coordinate points inside the final circle to obtain a final aggregated coordinate point $P_f(x_f, y_f)$, which is taken as the geographic coordinates of the given address. The method uses the K-means clustering algorithm to determine the geographic coordinates of the given address accurately, improves data processing efficiency while preserving the accuracy of the result, and provides rich and accurate data for later refinement of map data.

Description

K-means-based appropriate address data processing method and system
Technical Field
The invention relates to the technical field of processing of geospatial information data, in particular to a K-means-based appropriate address data processing method and system.
Background
With the development and application of spatial information technology and mobile internet, geographic information technology has been widely applied to various fields of internet, especially intelligent location service, which has become one of the key supporting technologies for internet development.
In an e-commerce logistics delivery system, a courier delivers goods to the user at the address on the waybill and records the geographic coordinates at the moment of delivery, that is, the geographic coordinates of the appropriate delivery address. Over time, an e-commerce system typically accumulates many orders for the same address, for example when one user buys frequently. Because different orders may be handled by different couriers, and the user may receive the goods at slightly different places, the recorded geographic coordinates can differ from one delivery to the next, so a single address in the system ends up with a plurality of appropriate-address geographic coordinates. These coordinates are not the final data that an intelligent location service requires: they vary within some range, and if that range is large, it degrades the accuracy of forward and reverse address lookup, that is, obtaining geographic coordinates from an address or obtaining an address from geographic coordinates.
The delivery system of a large-scale e-commerce company records a large amount of address geographic coordinate data, which is the kind of data source that most current GIS (Geographic Information System) systems require. Most GIS systems obtain their data by collecting geographic coordinates or by purchasing them. Collected geographic coordinate data tends to be sparse, with a single reading per address, and its accuracy is affected by factors such as signal quality and the collection personnel, so errors in the collected data are hard to avoid and hard to control.
Disclosure of Invention
The technical problem to be solved by the invention is that one address currently corresponds to a plurality of appropriate-address geographic coordinates. The invention provides a K-means-based appropriate address data processing method and system that determine one accurate geographic coordinate for the address from the geographic coordinates of its appropriate addresses.
In order to solve the above technical problem, according to an aspect of the present invention, the present invention provides a method for processing appropriate address data based on K-means, wherein the method comprises the following steps:
acquiring the geographic coordinates of a plurality of appropriate addresses of a given address, and forming a coordinate point group with these geographic coordinates as source data;
according to the K-means clustering algorithm, with the cluster number set to 1, aggregating the coordinate point group a first time to obtain a first aggregated coordinate point $P_1(x_1, y_1)$;
drawing a circle centered on the first aggregated coordinate point $P_1(x_1, y_1)$, and obtaining a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold $N_1$;
according to the K-means clustering algorithm, with the cluster number set to 1, aggregating the coordinate points inside the final circle to obtain a final aggregated coordinate point $P_f(x_f, y_f)$, and taking $P_f(x_f, y_f)$ as the geographic coordinates of the given address.
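As a side note on why each aggregation step reduces to a single computation: with the cluster number fixed at 1, the K-means objective has a closed-form minimizer, namely the arithmetic mean of the points. The derivation below is the standard one and is offered only as a sketch; the symbols $n$, $p_i$, $c$ and $J$ are introduced here for illustration and do not appear in the source, and squared Euclidean distance is assumed because the source does not name a metric.

```latex
% K-means objective with a single cluster center c
J(c) = \sum_{i=1}^{n} \lVert p_i - c \rVert^{2}, \qquad p_i = (x^{(i)}, y^{(i)})

% Setting the gradient to zero
\nabla_c J(c) = -2 \sum_{i=1}^{n} (p_i - c) = 0
\;\Longrightarrow\;
c^{*} = \frac{1}{n} \sum_{i=1}^{n} p_i

% Hence the first aggregated coordinate point is the coordinate-wise mean
P_1(x_1, y_1) = \left( \frac{1}{n} \sum_{i=1}^{n} x^{(i)},\; \frac{1}{n} \sum_{i=1}^{n} y^{(i)} \right)
```

Under that assumption, the same expression applied to the points inside the final circle gives $P_f(x_f, y_f)$.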
Preferably, drawing a circle centered on the first aggregated coordinate point $P_1(x_1, y_1)$ comprises the following steps:
drawing a circle centered on the first aggregated coordinate point $P_1(x_1, y_1)$ whose radius $R_1$ is the average distance from all coordinate points to the center $P_1$;
counting the number of coordinate points inside the circle of radius $R_1$;
if the number of coordinate points inside the circle of radius $R_1$ is greater than or equal to the preset aggregation number threshold $N_1$, taking the circle of radius $R_1$ as the final circle;
if the number of coordinate points inside the circle of radius $R_1$ is less than the preset aggregation number threshold $N_1$, redrawing the circle with the corrected radius given by formula 1-1, and repeating until a final circle is obtained;
$R_{i+1} = R_i + (D_{max} - D_{dev})/\beta$   (1-1)
where $R_i$ is the radius of the current circle, $i = 1, 2, 3, \ldots, k$, $k$ is a natural number, $D_{max}$ and $D_{dev}$ are respectively the maximum and the mean of the distances from all coordinate points inside the current circle to the center $P_1$, and $\beta$ is a weight.
Preferably, $\beta = 2$ if the address is an index address, and $\beta = 4$ if the address is a last-level address.
Preferably, the aggregation number threshold is $N_1 = N \times \lambda$, where $N$ is the number of source data; $\lambda$ is any value in the interval 95%-97% if the address is an index address, and any value in the interval 70%-80% if the address is a last-level address.
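The steps above (first aggregation, circle growth by formula 1-1, final aggregation) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `centroid` and `locate_address` are hypothetical, distances are planar Euclidean distances on the raw coordinates (the source does not name a metric), the default values of λ (0.95 and 0.75) are simply picked from the stated intervals, and the `max_iter` limit and zero-step check are safeguards added here that the source does not describe.

```python
import math


def centroid(points):
    """K-means with cluster number 1: the center is the mean of the points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def locate_address(points, is_index, lam=None, max_iter=10_000):
    """Return (final aggregated point, final radius) for one address.

    points   -- list of (x, y) appropriate-address coordinates (source data)
    is_index -- True for an index address, False for a last-level address
    lam      -- lambda in N1 = N * lambda; defaults are assumptions taken
                from the intervals 95%-97% and 70%-80% given in the text
    """
    beta = 2 if is_index else 4
    if lam is None:
        lam = 0.95 if is_index else 0.75
    n1 = len(points) * lam                      # aggregation number threshold N1

    center = centroid(points)                   # first aggregated point P1
    # Assumption: planar Euclidean distance to the center.
    dists = [math.hypot(x - center[0], y - center[1]) for x, y in points]

    # R1 is the mean distance; max(..., min(dists)) only guards against
    # floating-point rounding leaving the first circle empty.
    radius = max(sum(dists) / len(dists), min(dists))

    for _ in range(max_iter):                   # safeguard, not in the source
        inside = [d for d in dists if d <= radius]
        if len(inside) >= n1:                   # final circle reached
            break
        # Formula 1-1: R_{i+1} = R_i + (D_max - D_dev) / beta
        step = (max(inside) - sum(inside) / len(inside)) / beta
        if step <= 0:                           # safeguard, not in the source
            break
        radius += step

    kept = [p for p, d in zip(points, dists) if d <= radius]
    return centroid(kept), radius               # Pf and the final radius


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]   # one far-away outlier
    print(locate_address(pts, is_index=False))
```

In this toy input the first circle already holds four of the five points, which meets the last-level threshold, so the outlier at (10, 10) is dropped and the final point is the mean of the remaining four.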
Preferably, before the coordinate point groups are aggregated for the first time, it is determined whether the number of the source data is greater than or equal to a predetermined number threshold, and if the number of the source data is greater than or equal to the predetermined number threshold, the coordinate point groups are aggregated for the first time.
Preferably, redrawing the circle with the radius obtained from formula 1-1 includes a step of judging whether the address is an index address or a last-level address:
reading the identification bit of the given address, and judging from its value whether the address is an index address or a last-level address.
Preferably, the method further comprises the step of classifying the user address:
carrying out segmentation processing on the user address according to the geographic region from large to small to form a plurality of address segments from a first address segment to a tail address segment;
sequentially counting the number of geographic coordinate points corresponding to each address segment according to the address segments from large to small in the geographic region;
comparing the number of geographic coordinate points corresponding to each address segment with a preset threshold, and if the number corresponding to the current address segment is smaller than the preset threshold while the number corresponding to the address segment immediately before it is greater than or equal to the preset threshold, taking the segments from the current address segment through the tail address segment as the last-level address, and the segments from the first address segment through the address segment immediately before the current one as the index address;
and setting corresponding identification bits for the index address and the last-stage address respectively.
Preferably, before the number of geographic coordinate points corresponding to each address segment is counted in sequence, each address segment is matched against a preset index address matching table to determine which address segments belong to index addresses and which are non-index address segments;
when the number of geographic coordinate points is counted, it is counted only for the non-index address segments;
and when the comparison with the preset threshold is made, only the counts of the non-index address segments are compared with the preset threshold.
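The classification of an address into an index part and a last-level part can be sketched as below. This is an illustrative reading of the steps, not a definitive one: `split_address`, the `count_points` callback, the `"/"`-joined prefix key and the sample counts are all hypothetical, and the sketch assumes that the count for a segment means the number of coordinate points recorded for the address prefix ending at that segment.

```python
def split_address(segments, count_points, threshold, index_table=frozenset()):
    """Split ordered address segments into (index part, last-level part).

    segments     -- address segments ordered from largest to smallest area
    count_points -- hypothetical callback: prefix string -> number of
                    geographic coordinate points recorded for that prefix
    threshold    -- preset threshold on the number of coordinate points
    index_table  -- preset index address matching table (segments that are
                    known index addresses and therefore need no counting)
    """
    boundary = len(segments)            # default: every segment is index
    for i, segment in enumerate(segments):
        if segment in index_table:      # matched: skip counting entirely
            continue
        prefix = "/".join(segments[: i + 1])
        if count_points(prefix) < threshold:
            boundary = i                # first segment below the threshold
            break
    return segments[:boundary], segments[boundary:]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    counts = {
        "Beijing/Chaoyang": 20000,
        "Beijing/Chaoyang/Wangjing St": 900,
        "Beijing/Chaoyang/Wangjing St/Bldg 1": 30,
    }
    segs = ["Beijing", "Chaoyang", "Wangjing St", "Bldg 1", "Room 101"]
    index_part, last_part = split_address(
        segs, lambda p: counts.get(p, 0), threshold=50, index_table={"Beijing"}
    )
    print(index_part, last_part)
```

Scanning from the largest area downward means the first segment whose count falls below the threshold automatically has a predecessor at or above it, which is the boundary condition described in the text.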
In order to solve the above technical problem, according to another aspect of the present invention, there is provided a K-means-based appropriate address data processing system, comprising:
the source data acquisition module is used for acquiring, for a provided address, the geographic coordinates of a plurality of appropriate addresses corresponding to that address, and forming a coordinate point group in a geographic information system;
the K-means clustering module is used for aggregating a given coordinate point group according to the set cluster number to obtain the aggregated coordinate points for that cluster number; and
a final circle obtaining module for obtaining, from the first aggregated coordinate point $P_1(x_1, y_1)$ produced by the K-means clustering module, a final circle in which the number of coordinate points is greater than or equal to the preset aggregation number threshold $N_1$.
Preferably, the system further includes an address type determination module, configured to determine that the address is an index address or a last-level address.
Preferably, the final circle acquiring module includes a first circle acquiring unit, a correction circle acquiring unit, a statistical unit and a comparing unit;
the first-time circle acquisition unit is connected with the K-means clustering module and the source data acquisition module, and is used for taking the first aggregated coordinate point $P_1(x_1, y_1)$ obtained by the K-means clustering module as the center and the average distance $R_1$ from all coordinate points obtained from the source data acquisition module to the center $P_1(x_1, y_1)$ as the radius, to obtain a first circle;
the correction circle acquisition unit is connected with the first-time circle acquisition unit, the address category judgment module and the comparison unit respectively; on a recalculation notification output by the comparison unit, it obtains the corrected radius $R_{i+1}$ from formula 1-1 and obtains a correction circle centered on $P_1(x_1, y_1)$; on a qualification notification output by the comparison unit, it determines the circle of radius $R_1$ as the final circle;
$R_{i+1} = R_i + (D_{max} - D_{dev})/\beta$   (1-1)
where $R_i$ is the radius of the current circle, $i = 1, 2, 3, \ldots, k$, $k$ is a natural number, $D_{max}$ and $D_{dev}$ are respectively the maximum and the mean of the distances from all coordinate points inside the current circle to the point $P_1(x_1, y_1)$, and $\beta$ is a weight;
the statistical unit is connected with the first-time circle acquisition unit and the correction circle acquisition unit, and is used for counting the number of coordinate points inside the first circle and inside each correction circle;
the comparison unit is connected with the statistical unit, and is used for comparing the number of coordinate points inside the first circle or the correction circle with the aggregation number threshold $N_1$; if the number of coordinate points inside the circle is less than $N_1$, it sends a recalculation notification to the correction circle acquisition unit, and if the number is greater than or equal to $N_1$, it sends a qualification notification to the correction circle acquisition unit.
Preferably, the system of the present invention further comprises an address processing module, and the address processing module includes:
the address segmentation unit is used for segmenting the address from large to small according to the geographic area to form a plurality of address segments;
the number counting unit is connected with the address segmentation unit and used for counting the number of the geographic coordinate points corresponding to each address segmentation;
the comparison unit is connected with the quantity counting unit and used for comparing the quantity of the geographic coordinate points corresponding to each address segment with a preset threshold value; and
and the address category marking unit is connected with the comparing unit and marks the category of the address according to the comparison result of the comparing unit.
The address processing module further comprises:
the index address matching unit is connected with the address segmentation unit and the quantity counting unit and is used for comparing the plurality of address segments with a preset index address matching table and determining the address segments belonging to the index address and the address segments of the non-index address;
the number counting unit is used for counting the number of the geographic coordinate points corresponding to the address segments belonging to the non-index addresses.
The invention uses the K-means clustering algorithm to determine accurately the geographic coordinates corresponding to an address, and its concise processing flow improves data processing efficiency while preserving the accuracy of the result. It draws on the abundant appropriate-address geographic coordinates recorded by couriers and averages many readings, which reduces data error and improves positioning accuracy compared with a single collected reading per address. Because the appropriate addresses cover not only urban addresses but also many rural addresses, the coordinates determined by the invention provide rich and accurate data for later refinement of map data.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a K-means-based method for processing appropriate address data according to the present invention;
FIG. 2 is a schematic diagram of the architecture of the K-means-based appropriate address data processing system according to the present invention;
FIG. 3 is a schematic diagram of another architecture of the K-means-based appropriate address data processing system according to the present invention;
FIG. 4 is a schematic diagram of the structure of the address processing module according to the present invention;
FIG. 5 is a schematic diagram of a structural principle of the final circle acquiring module according to the present invention;
FIG. 6 is a schematic flow chart illustrating a K-means-based appropriate address data processing method according to an embodiment of the present invention;
FIG. 7 is a distribution graph of the raw data obtained based on the method shown in FIG. 6;
FIG. 8 is a graph of the circles and the data distribution obtained after multiple aggregations according to the method of FIG. 6;
FIG. 9 is a distribution graph of the final circle and the final geographic coordinates obtained after aggregation based on the method shown in FIG. 6; and
FIG. 10 is a schematic diagram of the final geographic coordinates obtained by the method shown in FIG. 6 displayed on a map.
Detailed Description
The present invention will be described below based on examples, but it is not limited to these examples. Certain specific details are set forth in the following detailed description; it will be apparent to one skilled in the art that the invention may be practiced without them. Well-known methods, procedures, and processes have not been described in detail so as not to obscure the invention. The figures are not necessarily drawn to scale.
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions, and operations of systems, methods, and apparatuses according to embodiments of the present invention. Each block may represent a module, a program segment, or a portion of code comprising executable instructions for implementing a specified logical function. The executable instructions that implement the specified logical functions may also be recombined to create new modules and program segments. The blocks of the drawings, and their order, are thus provided to better illustrate the processes and steps of the embodiments and should not be taken as limiting the invention itself.
FIG. 1 is a schematic flow chart of the K-means-based appropriate address data processing method according to the present invention, and FIG. 2 is a schematic diagram of the K-means-based appropriate address data processing system according to the present invention. The invention is described below with reference to FIG. 1 and FIG. 2:
The K-means-based appropriate address data processing system comprises a source data acquisition module 1, a K-means clustering module 2 and a final circle acquisition module 3. The source data acquisition module 1 acquires, for a provided address, the geographic coordinates of a plurality of appropriate addresses corresponding to it, forming a coordinate point group. The K-means clustering module 2 aggregates a given coordinate point group with the cluster number set to 1: it first aggregates the whole group to obtain the first aggregated coordinate point $P_1(x_1, y_1)$, and then aggregates the coordinate points inside the circle obtained by the final circle acquisition module 3 to obtain the final aggregated coordinate point, which is the geographic coordinate corresponding to the address. The final circle acquisition module 3 obtains, from the first aggregated coordinate point $P_1(x_1, y_1)$ produced by the K-means clustering module 2, a final circle in which the number of coordinate points is greater than or equal to the preset aggregation number threshold $N_1$.
Based on the system shown in FIG. 2, the method for processing the appropriate address data based on the K-means provided by the invention comprises the following steps:
Step S1, the source data acquisition module 1 acquires the geographic coordinates of a plurality of appropriate addresses of a given address, and inputs these geographic coordinates into the processing system as source data to form a coordinate point group;
Step S2, the K-means clustering module 2 sets the cluster number K to 1 according to the K-means clustering algorithm and aggregates the coordinate point group a first time to obtain the first aggregated coordinate point $P_1(x_1, y_1)$;
Step S3, the final circle acquisition module 3 draws a circle centered on the first aggregated coordinate point $P_1(x_1, y_1)$ and obtains a final circle in which the number of coordinate points is greater than or equal to the preset aggregation number threshold $N_1$;
Step S4, the K-means clustering module 2 sets the cluster number K to 1 according to the K-means clustering algorithm and aggregates the coordinate points inside the final circle to obtain the final aggregated coordinate point $P_f(x_f, y_f)$.
The invention utilizes the idea of clustering algorithm: entities in the same cluster are similar, and entities in different clusters are dissimilar; and clustering the geographic coordinates of a plurality of proper addresses of the given address, wherein the distance between any two points in the same cluster is smaller than the distance between any two points in different clusters. On the basis of a clustering algorithm, source data coordinate points are analyzed, data coordinates are classified to remove far-away deviation points, and then the remaining accurate coordinate points are aggregated, so that the accuracy of the geographical coordinates obtained by aggregation at last is improved.
The given address is classified as either an index address or a last-level address, and different parameters are used in the calculation for each type so as to obtain a more accurate aggregated coordinate. An index address is an address covering a large geographic area, for example one consisting of province, city and district. In the invention, an index address specifically means such a large-area address whose number of corresponding geographic coordinates is greater than a predetermined threshold. A detailed user address, such as a particular room in a particular building of a particular residential community, covers a narrow area; the number of corresponding geographic coordinates is less than the predetermined threshold, and such an address is called a last-level address. An index address therefore corresponds to a larger number of geographic coordinates, and a last-level address to a smaller number. For the geographic coordinates of an index address, the normal distribution of probability theory suggests that events with a probability of no more than 3%-5% are low-probability events: they occur very rarely and can generally be ignored. The geographic coordinates of a last-level address are relatively concentrated, and the 70%-80% of coordinate points that lie closest together suffice to represent the coordinates of the address. Accordingly, 95%-97% of the number of source data is taken as the reference threshold for the aggregation number for an index address, and 70%-80% of the number of source data for a last-level address.
Therefore, the system of the present invention further includes an address category determination module 4. If the user address has already been processed before it enters the system, that is, it has been divided into an index address and a last-level address, an identification bit is set in the address (for example, 0 for an index address and 1 for a last-level address), and the module determines the address type by reading this bit.
If the address has not been processed beforehand, the system further includes an address processing module 5, as shown in FIG. 3, which splits the user address into an index address and a last-level address and adds the corresponding identification for the address category determination module 4 to read.
Specifically, as shown in fig. 4, the address processing module 5 includes: an address segmentation unit 51, a number statistics unit 52, a comparison unit 53 and an address class marking unit 54.
The address segmentation unit 51 is configured to segment the address by geographic area from large to small to form a plurality of address segments, such as a province address segment, city address segment, district address segment, street address segment, street number address segment, residential community address segment, building number address segment and room number address segment. Segmentation also includes cleaning the address, for example removing symbols such as brackets and spaces.
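A rough idea of the cleaning and segmentation performed by such a unit is sketched below. Everything in it is an assumption made for illustration: the names `clean_address` and `segment_address`, the set of stripped symbols, and above all the suffix list used to cut segments, which is a naive heuristic (place names that themselves contain one of these characters would be split incorrectly) and is not taken from the source.

```python
import re

# Hypothetical noise set: any whitespace plus ASCII and full-width brackets.
_NOISE = re.compile(r"[\s()\[\]{}（）【】]")

# Hypothetical administrative suffixes, ordered so that the longer
# alternative "号楼" is tried before "号". The trailing "$" keeps any
# remainder that ends without a recognized suffix.
_SEGMENT = re.compile(r".+?(?:省|市|区|县|街道|路|号楼|号|室|$)")


def clean_address(raw: str) -> str:
    """Remove brackets and spaces, keeping the text inside the brackets."""
    return _NOISE.sub("", raw)


def segment_address(raw: str) -> list[str]:
    """Cut a cleaned address into segments ordered from large to small area."""
    return _SEGMENT.findall(clean_address(raw))


if __name__ == "__main__":
    print(segment_address("北京市 朝阳区（望京街道）1号楼 101室"))
```

A production system would more plausibly rely on an administrative-division dictionary than on suffix characters alone; the regular expression here only shows where cleaning and segmentation sit in the flow.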
The number counting unit 52 is connected to the address segmenting unit 51, and is configured to count the number of the geographic coordinate points corresponding to each address segment.
The comparing unit 53 is connected to the number counting unit 52, and is configured to compare the number of geographic coordinate points corresponding to each address segment with a predetermined threshold. This reference threshold can be determined empirically: a candidate value is set, the processing is run, the results are compared and analyzed, and the procedure is repeated until a suitable value is found. In general, a smaller threshold is used when little source data is available, and the threshold is raised once enough source data has accumulated. For example, if in a batch of addresses processed by the address processing module 5 most addresses (for example, 70% or more of the total) each have 50 to 100 appropriate-address data points, the threshold may be set to 50.
When the number of geographic coordinate points corresponding to an address segment is greater than the reference threshold, that address segment is considered part of the index address. If the number corresponding to a certain address segment is less than the reference threshold while the number corresponding to the address segment immediately before it is greater than or equal to the threshold, the segments from the current address segment through the tail form the last-level address, and the segments from the first address segment through the segment immediately before the current one form the index address. The comparison result is sent to the address category marking unit 54, which marks the determined index address and last-level address accordingly, for example 0 for the index address and 1 for the last-level address.
Preferably, to speed up processing, the system holds an index address matching table for address segments that are obviously index addresses. After segmentation, and before the number of geographic coordinate points for each address segment is counted, index address matching is performed first to separate the address segments that belong to the index address from the non-index address segments. The number of geographic coordinate points is then counted only for the non-index address segments and compared with the reference threshold, to determine whether any of them can also be assigned to the index address. The address processing module 5 thus divides the user address into an index address and a last-level address and marks them for the address category determination module 4 to identify, so that different calculation parameters are used in the subsequent calculation.
Fig. 5 is a schematic diagram illustrating a structural principle of the final circle acquiring module 3 according to the present invention. The final circle acquiring module 3 includes a first circle acquiring unit 31, a correction circle acquiring unit 32, a counting unit 33, and a comparing unit 34.
The first circle acquiring unit 31 is connected to the K-means clustering module 2 and the source data acquisition module 1. It takes the first aggregated coordinate point P1(x1, y1) obtained by the K-means clustering module 2 as the center, and the average distance R1 from all coordinate points obtained from the source data acquisition module 1 to the center P1(x1, y1) as the radius, and thereby obtains the first circle;
the correction circle acquiring unit 32 is connected to the first circle acquiring unit 31, the address category judgment module 4 and the comparing unit 34, respectively. On a recalculation notification output by the comparing unit 34, it obtains a corrected radius according to formula 1-1 and obtains a correction circle with the coordinate point P1(x1, y1) as the center; on a qualification notification output by the comparing unit 34, it determines the circle of radius R1 as the final circle;
$$R_{i+1} = R_i + \frac{D_{max} - D_{dev}}{\beta} \qquad (1\text{-}1)$$
wherein Ri is the radius of the current circle, i = 1, 2, 3, …, k, k being a natural number; Dmax and Ddev are respectively the maximum and the average of the distances from all coordinate points in the current circle to the center P1(x1, y1); and β is a weight. The radius of the circle is enlarged by a bisection method or a quartering method. The amount of coordinate data under an index address is generally large and the area covered is wide, so the radius is enlarged by bisection; a last-stage address refers to the detailed address of the user, covers a narrow area and has few coordinate data, so the radius is enlarged by quartering. According to the address category judgment module, β = 2 when the address is an index address and β = 4 when the address is a last-stage address. The reason is that an index address covers a large area and its data points are spread far apart, so the denominator is set small and the radius grows by a larger step at each change; a last-stage address, by contrast, covers a small area with densely distributed data points, so the denominator can be set larger. Of course, β may take other values; in this embodiment, from an engineering point of view, the bisection or quartering method is normally used.
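As a worked illustration of formula 1-1, assume a current circle with Ri = 4 whose interior points have Dmax = 3 and Ddev = 2 (these numbers are hypothetical and chosen only to make the arithmetic easy to follow). The two weights then give:

```latex
\begin{aligned}
\text{index address } (\beta = 2):&\quad R_{i+1} = 4 + \frac{3 - 2}{2} = 4.5 \\
\text{last-stage address } (\beta = 4):&\quad R_{i+1} = 4 + \frac{3 - 2}{4} = 4.25
\end{aligned}
```

The same spread of distances therefore enlarges the circle twice as fast for an index address as for a last-stage address.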
The counting unit 33 is connected to the first circle acquiring unit 31 and the correction circle acquiring unit 32, and is configured to count the number of coordinate points in the first circle and the correction circle, and send the counted result to the comparing unit 34.
The comparing unit 34 is connected to the counting unit 33 and compares the number of coordinate points in the first circle and in the correction circle with the aggregation number threshold N1. If the number of coordinate points in the circle is less than the preset aggregation number threshold N1, it sends a recalculation notification to the correction circle acquiring unit 32; if the number of coordinate points in the circle is greater than or equal to the preset aggregation number threshold N1, it sends a qualification notification to the correction circle acquiring unit 32.
In addition, before the first aggregation of the coordinate point group, it is necessary to determine whether the number of source data is greater than or equal to a predetermined number threshold, that is, whether there are enough coordinates to aggregate. If the number of source data is greater than or equal to the predetermined number threshold, e.g., 40-50, the first aggregation is performed on the coordinate point group; if it is less than the threshold, no processing is performed. The number threshold can be set to different specific values according to the situation.
The present invention will be described in detail below with reference to specific examples.
For example, take the detailed address: No. 1589 Century Avenue, Pudong New Area, Shanghai (index address) + Room 2009, Changtai International Financial Building (last-stage address). The process of obtaining accurate geographic coordinates according to the method of the present invention is shown in fig. 6 and proceeds as follows:
Step S1a, acquiring the geographic coordinates of a plurality of appropriate addresses corresponding to the given address "Room 2009, Changtai International Financial Building"; as shown in fig. 7, 200 of them are obtained. These geographic coordinates are input to the data processing system as source data to form a coordinate point group.
Step S2a, determining whether the number of coordinate points in the coordinate point group is greater than or equal to a set number threshold (e.g. 50); if so, performing the next step, and if not, performing no processing.
And step S3a, calling a K-means clustering algorithm, and setting the number of clusters to be 1.
Step S4a, performing the first aggregation on the coordinate point group to obtain the first aggregated coordinate point P1(x1, y1).
Step S5a, calculating the distances from all coordinate points to the center P1(x1, y1), and finding the average distance Ddev and the maximum distance Dmax.
Step S6a, drawing a circle with the first aggregated coordinate point P1(x1, y1) as the center and the average distance Ddev as the radius, and counting the number Ncur of coordinate points currently in the circle.
Step S7a, judging whether the address is an index address or a last-stage address. If it is an index address, in step S8a take N1 = N × λ = N × 95% and β = 2; if it is a last-stage address, in step S9a take N1 = N × λ = N × 70% and β = 4. In this embodiment the address is a last-stage address, so N1 = N × λ = N × 70% and β = 4.
Step S10a, judging whether the number Ncur of coordinate points in the circle is greater than or equal to N1. If Ncur is greater than or equal to N1, the current circle is qualified, and in step S11a the coordinate points in the circle are aggregated to obtain the final aggregated coordinate point Pf(xf, yf), which is output as the geographic coordinates of the address, as shown in fig. 10.
If Ncur is less than N1, the current circle is not qualified, and in step S12a the radius of the circle is recalculated with the formula Ri+1 = Ri + (Dmax − Ddev)/β, after which the flow returns to step S6a to draw the circle again, as shown in fig. 8. Whether the number of coordinate points in the current circle meets the requirement, that is, whether Ncur is greater than or equal to N1, is then judged again, and this is repeated until a qualified final circle is obtained, as shown in fig. 9.
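Steps S2a to S12a can be sketched end to end as follows. This is a hedged illustration under several assumptions that are not in the patent: coordinates are treated as planar with Euclidean distance (real latitude/longitude data would need a geodesic distance), Dmax and Ddev are recomputed over the points inside the current circle, N1 is rounded up to an integer, and the names `locate_address`, `centroid`, `PARAMS` and the `max_iter` guard are invented for the example. The K-means call is replaced by a plain mean, because K-means with a single cluster converges to the centroid of the points.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]

# (lambda in percent, beta) as used in this embodiment.
PARAMS = {"index": (95, 2), "last_stage": (70, 4)}


def centroid(points: Sequence[Point]) -> Point:
    """K-means with one cluster reduces to the arithmetic mean of the points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def locate_address(
    points: Sequence[Point],
    address_class: str,
    min_points: int = 50,
    max_iter: int = 1000,  # assumption: safety guard, not part of the patent
) -> Optional[Tuple[Point, float]]:
    """Return (final aggregated point, final radius), or None if not processed."""
    if len(points) < min_points:                     # S2a: too little source data
        return None
    lam_pct, beta = PARAMS[address_class]            # S7a-S9a
    n1 = (len(points) * lam_pct + 99) // 100         # N1 = N * lambda, rounded up
    center = centroid(points)                        # S3a-S4a: first aggregation
    dists = [math.dist(center, p) for p in points]   # S5a
    radius = sum(dists) / len(dists)                 # S6a: first radius R1
    for _ in range(max_iter):
        inside = [d for d in dists if d <= radius]
        if len(inside) >= n1:                        # S10a: circle is qualified
            kept = [p for p, d in zip(points, dists) if d <= radius]
            return centroid(kept), radius            # S11a: second aggregation
        # S12a: formula 1-1 over the points inside the current circle.
        step = (max(inside) - sum(inside) / len(inside)) / beta
        if step <= 0:
            return None  # assumption: give up if the circle cannot grow
        radius += step
    return None
```

For instance, with 60 points placed at (±d, 0) for d = 1 … 30, the first radius is 15.5; a last-stage address (N1 = 42, β = 4) reaches a qualified circle at radius 21.5 after three corrections, while an index address (N1 = 57, β = 2) needs a radius of 29.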
It can be seen from the above flow and the accompanying drawings that the invention uses the K-means clustering algorithm to accurately determine the geographic coordinates of a given address, and improves data processing efficiency through a concise process while ensuring the accuracy of the result. The invention uses the abundant geographic coordinates of properly delivered parcels provided by couriers and averages multiple data points; compared with a single collected data point for an address, this reduces position errors caused by factors such as poor signal and human error, and improves the accuracy of geographic positioning. In addition, because the appropriate (properly delivered) addresses in the invention include not only urban addresses but also many rural addresses, the geographic coordinates determined by the invention provide rich and accurate data for later refinement of map data.
While embodiments in accordance with the invention have been described above, these embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The scope of the invention should be determined from the following claims.

Claims (15)

1. A K-means-based appropriate address data processing method, comprising the following steps:
acquiring geographic coordinates of a plurality of appropriate addresses of a given address, and forming a coordinate point group by taking the geographic coordinates of the plurality of appropriate addresses as source data;
setting the number of clusters to 1 according to a K-means clustering algorithm, and performing a first aggregation on the coordinate point group to obtain a first aggregated coordinate point P1(x1, y1);
drawing a circle with the first aggregated coordinate point P1(x1, y1) as the center to obtain a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold N1, wherein whether the address class of the given address is an index address or a last-stage address is judged, and a weight for the final circle is selected according to the address class of the given address to obtain a corrected radius with which the circle is redrawn;
according to the K-means clustering algorithm, with the number of clusters set to 1, performing a second aggregation on the plurality of coordinate points in the final circle to obtain a final aggregated coordinate point Pf(xf, yf), and taking the final aggregated coordinate point Pf(xf, yf) as the geographic coordinates of the given address,
wherein the index address has a geographic area greater than the last level address and the index address contains a number of geographic coordinates greater than a predetermined threshold.
2. The K-means-based appropriate address data processing method according to claim 1, wherein drawing a circle with the first aggregated coordinate point P1(x1, y1) as the center comprises the following steps:
drawing a circle with the first aggregated coordinate point P1(x1, y1) as the center and the average distance R1 from all coordinate points to the center P1(x1, y1) as the radius;
counting the number of coordinate points within the circle of radius R1;
if the number of coordinate points within the circle of radius R1 is greater than or equal to the preset aggregation number threshold N1, the circle of radius R1 is the final circle;
if the number of coordinate points within the circle of radius R1 is less than the preset aggregation number threshold N1, the circle is redrawn with the corrected radius obtained from formula 1-1 until the final circle is obtained;
$$R_{i+1} = R_i + \frac{D_{max} - D_{dev}}{\beta} \qquad (1\text{-}1)$$
wherein Ri is the radius of the current circle, i = 1, 2, 3, …, k, k being a natural number; Dmax and Ddev are respectively the maximum and the average of the distances from all coordinate points in the current circle to the center P1(x1, y1); and β is a weight.
3. The K-means-based appropriate address data processing method of claim 2, wherein β = 2 if the given address is an index address, and β = 4 if the given address is a last-stage address.
4. The K-means-based appropriate address data processing method according to claim 1, 2 or 3, wherein the aggregation number threshold N1 = N × λ, where N is the number of source data, λ is any value in the range 95%-97% if the given address is an index address, and λ is any value in the range 70%-80% if the given address is a last-stage address.
5. The K-means-based appropriate address data processing method according to claim 1, wherein before the first aggregation of the coordinate point groups, it is determined whether the number of source data is greater than or equal to a predetermined number threshold, and if the number of source data is greater than or equal to the predetermined number threshold, the first aggregation of the coordinate point groups is performed.
6. The K-means-based appropriate address data processing method according to claim 2, wherein redrawing the circle with the radius obtained from formula 1-1 comprises a step of judging whether the given address is an index address or a last-stage address:
reading the identification bit of the given address, and judging whether the given address is an index address or a last-stage address according to the identification in the identification bit.
7. The K-means based appropriate address data processing method according to claim 6, further comprising a step of performing classification processing on the user address:
carrying out segmentation processing on the user address according to the geographic region from large to small to form a plurality of address segments from a first address segment to a tail address segment;
sequentially counting the number of geographic coordinate points corresponding to each address segment according to the address segments from large to small in the geographic region;
comparing the number of geographic coordinate points corresponding to each address segment with a preset threshold, and if the number of geographic coordinate points corresponding to the current address segment is smaller than the preset threshold and the number corresponding to the previous address segment is greater than or equal to the preset threshold, taking the address segments from the current address segment to the tail address segment as the last-stage address, and the address segments from the first address segment to the previous address segment as the index address;
and setting corresponding identifications for the index address and the last-stage address respectively.
8. The K-means-based appropriate address data processing method according to claim 7, wherein before the number of geographic coordinate points corresponding to each address segment is counted in turn, the address segments are matched against a preset index address matching table to determine the address segments belonging to the index address and the address segments belonging to non-index addresses;
when the number of geographic coordinate points corresponding to each address segment is counted, the number of geographic coordinate points corresponding to the address segments of the non-index addresses is counted;
and when the number of geographic coordinate points corresponding to each address segment is compared with the preset threshold, the number of geographic coordinate points corresponding to each address segment of the non-index addresses is compared with the preset threshold.
9. A K-means-based appropriate address data processing system, comprising:
a source data acquisition module for acquiring, according to a given address, the geographic coordinates of a plurality of appropriate addresses corresponding to the given address to form a coordinate point group;
a K-means clustering module for performing a first aggregation and a second aggregation on a coordinate point group according to a set number of clusters to obtain the set number of aggregated coordinate points; and
a final circle acquiring module for obtaining, from the first aggregated coordinate point P1(x1, y1) obtained by the K-means clustering module, a final circle in which the number of coordinate points is greater than or equal to a preset aggregation number threshold N1,
wherein the final circle acquiring module judges whether the address class of the given address is an index address or a last-stage address, selects a weight for the final circle according to the address class of the given address to obtain a corrected radius, and redraws the circle to obtain the final circle,
and the K-means clustering module, during the first aggregation, sets the number of clusters to 1 according to a K-means clustering algorithm and aggregates the coordinate point group to obtain the first aggregated coordinate point P1(x1, y1), and, during the second aggregation, aggregates the plurality of coordinate points in the final circle with the number of clusters set to 1 according to the K-means clustering algorithm to obtain a final aggregated coordinate point Pf(xf, yf),
Wherein the geographical area of the index address is larger than the last level address, and the index address comprises a number of geographical coordinates larger than a predetermined threshold.
10. The K-means-based appropriate address data processing system of claim 9, further comprising:
and the address type judging module is used for judging the given address as an index address or a final-stage address.
11. The K-means-based appropriate address data processing system of claim 10, wherein the final circle acquiring module comprises a first circle acquisition unit, a correction circle acquisition unit, a statistical unit and a comparison unit;
the first circle acquisition unit is connected with the K-means clustering module and the source data acquisition module, and is used for obtaining a first circle by taking the first aggregated coordinate point P1(x1, y1) obtained by the K-means clustering module as the center and the average distance R1 from all coordinate points obtained from the source data acquisition module to the center P1(x1, y1) as the radius;
the correction circle acquisition unit is connected with the first circle acquisition unit, the address type judging module and the comparison unit respectively, and, according to a recalculation notification output by the comparison unit, obtains a corrected radius Ri+1 from formula 1-1 and obtains a correction circle with the coordinate point P1(x1, y1) as the center, or, according to a qualification notification output by the comparison unit, determines the circle of radius R1 as the final circle;
$$R_{i+1} = R_i + \frac{D_{max} - D_{dev}}{\beta} \qquad (1\text{-}1)$$
wherein Ri is the radius of the current circle, i = 1, 2, 3, …, k, k being a natural number; Dmax and Ddev are respectively the maximum and the average of the distances from all coordinate points in the current circle to the point P1(x1, y1); and β is a weight;
the statistical unit is connected with the first circle acquisition unit and the correction circle acquisition unit and is used for counting the number of coordinate points in the first circle and the correction circle;
the comparison unit is connected with the statistical unit and is used for comparing the number of coordinate points in the first circle and in the correction circle, respectively, with the aggregation number threshold N1; if the number of coordinate points in the circle is less than the preset aggregation number threshold N1, a recalculation notification is sent to the correction circle acquisition unit, and if the number of coordinate points in the circle is greater than or equal to the preset aggregation number threshold N1, a qualification notification is sent to the correction circle acquisition unit.
12. The K-means-based appropriate address data processing system of claim 10, further comprising an address processing module, the address processing module comprising:
an address segmentation unit for segmenting the given address from large to small according to geographic area to form a plurality of address segments;
a quantity counting unit, connected with the address segmentation unit, for counting the number of geographic coordinate points corresponding to each address segment;
a comparison unit, connected with the quantity counting unit, for comparing the number of geographic coordinate points corresponding to each address segment with a preset threshold; and
an address category marking unit, connected with the comparison unit, for marking the category of the given address according to the comparison result of the comparison unit.
13. The K-means-based appropriate address data processing system of claim 12, wherein the address processing module further comprises:
an index address matching unit, connected with the address segmentation unit and the quantity counting unit, for comparing the plurality of address segments with a preset index address matching table and determining the address segments belonging to the index address and the address segments of non-index addresses;
wherein the quantity counting unit counts the number of geographic coordinate points corresponding to the address segments belonging to the non-index addresses.
14. A K-means-based appropriate address data processing apparatus, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute the K-means-based appropriate address data processing method of any of claims 1 to 8 based on instructions stored in the memory.
15. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the K-means-based appropriate address data processing method of any of claims 1 to 8.
CN201610262654.0A 2016-04-25 2016-04-25 K-means-based appropriate address data processing method and system Active CN107305577B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610262654.0A CN107305577B (en) 2016-04-25 2016-04-25 K-means-based appropriate address data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610262654.0A CN107305577B (en) 2016-04-25 2016-04-25 K-means-based appropriate address data processing method and system

Publications (2)

Publication Number Publication Date
CN107305577A CN107305577A (en) 2017-10-31
CN107305577B true CN107305577B (en) 2020-12-22

Family

ID=60150907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610262654.0A Active CN107305577B (en) 2016-04-25 2016-04-25 K-means-based appropriate address data processing method and system

Country Status (1)

Country Link
CN (1) CN107305577B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008970A (en) * 2018-01-05 2019-07-12 北京京东尚科信息技术有限公司 A kind of address information generation method and device
CN110503352B (en) * 2018-05-16 2021-09-07 北京三快在线科技有限公司 Method and device for determining delivery point and computer readable storage medium
CN110555448B (en) * 2018-05-30 2022-03-29 顺丰科技有限公司 Method and system for subdividing dispatch area
CN109636020A (en) * 2018-11-30 2019-04-16 广州亚美信息科技有限公司 A kind of method and system for predicting destination-address
CN111190988B (en) * 2019-12-31 2023-08-08 京东城市(北京)数字科技有限公司 Address resolution method, device, equipment and computer readable storage medium
CN113743845A (en) * 2020-05-29 2021-12-03 京东城市(北京)数字科技有限公司 Distribution area dividing method and device, electronic device and storage medium
CN111966774A (en) * 2020-08-18 2020-11-20 湖南省长株潭烟草物流有限责任公司 Dynamic positioning method and system for cigarette packet retail customer
CN112468546B (en) * 2020-11-12 2023-11-24 北京锐安科技有限公司 Account position determining method, device, server and storage medium
CN115271354B (en) * 2022-06-24 2023-08-25 湖南湘邮科技股份有限公司 Service electronic fence dynamic planning method and system based on delivery unit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163214A (en) * 2011-03-07 2011-08-24 北京百度网讯科技有限公司 Numerical map generation device and method thereof
CN102711239A (en) * 2012-05-10 2012-10-03 电子科技大学 RSS (received signal strength) fingerprint database based secondary fuzzy clustering indoor-positioning method
CN103812961A (en) * 2013-11-01 2014-05-21 北京奇虎科技有限公司 Method and device for recognizing Internet protocol (IP) addresses of designated class and defending method and system
CN105223546A (en) * 2015-09-30 2016-01-06 西安电子科技大学 Based on the indoor orientation method of received signal strength and reference point locations double focusing class

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8346770B2 (en) * 2003-09-22 2013-01-01 Google Inc. Systems and methods for clustering search results

Also Published As

Publication number Publication date
CN107305577A (en) 2017-10-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20210302

Address after: 101, 1st floor, building 2, yard 20, Suzhou street, Haidian District, Beijing 100080

Patentee after: Beijing Jingbangda Trading Co.,Ltd.

Address before: 100080 Haidian District, Beijing, 65 Xing Shu Kou Road, 11C, west section of the western part of the building, 1-4 stories West 1-4 story.

Patentee before: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY Co.,Ltd.

Patentee before: BEIJING JINGDONG CENTURY TRADING Co.,Ltd.

Effective date of registration: 20210302

Address after: 6 / F, 76 Zhichun Road, Haidian District, Beijing 100086

Patentee after: Beijing Jingdong Zhenshi Information Technology Co.,Ltd.

Address before: 101, 1st floor, building 2, yard 20, Suzhou street, Haidian District, Beijing 100080

Patentee before: Beijing Jingbangda Trading Co.,Ltd.