CN115630047A

CN115630047A - Address matching analysis method and system based on weighted clustering

Info

Publication number: CN115630047A
Application number: CN202210812518.XA
Authority: CN
Inventors: 杨光; 贺珊; 张宇; 张龙涛
Original assignee: Wuhan Zhongzhi Digital Technology Co ltd
Current assignee: Wuhan Zhongzhi Digital Technology Co ltd
Priority date: 2022-07-11
Filing date: 2022-07-11
Publication date: 2023-01-20

Abstract

An address matching analysis method based on weighted clustering comprises the following steps: S100, cleaning address data from different sources; S200, extracting address elements from the cleaned address data according to a first preset rule; S300, completing the extracted address elements according to historical administrative divisions; S400, encoding the completed address element set according to a second preset rule to obtain an address element code set; S500, performing preliminary clustering on the address element code set of S400 using the K-means elbow method to obtain an initial K value; S600, performing weighted clustering on the address element code set of S400 using WKmeans with the K value from S500 to obtain standard addresses and an attribute weight matrix; S700, performing weighted matching on an address to be matched according to the standard addresses and the attribute weight matrix of S600. The invention obtains the unique standard address corresponding to an original address, standardizes multi-source address data, and effectively resolves address ambiguity.

Description

Address matching analysis method and system based on weighted clustering

Technical Field

The invention relates to the field of address matching, in particular to an address matching analysis method and system based on weighted clustering.

Background

In current practice, address ambiguity is a major obstacle to analyzing and applying address data. In principle, each place in geographic space should have one unique standard address. In practice, because business systems collect addresses without a uniform address standard, or because manual records use different reference objects, the same place may be described in different ways. This address ambiguity makes analysis results based on address data inaccurate and hinders the practical business use of address data.

Disclosure of Invention

In view of the above, the present invention has been developed to provide a weighted cluster-based address matching analysis method that overcomes or at least partially solves the above-mentioned problems.

In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:

An address matching analysis method based on weighted clustering comprises the following steps:

S100, cleaning address data from different sources;

S200, extracting address elements from the cleaned address data according to a first preset rule;

S300, completing the extracted address elements according to historical administrative divisions;

S400, encoding the completed address element set according to a second preset rule to obtain an address element code set;

S500, performing preliminary clustering on the address element code set of S400 using the K-means elbow method to obtain an initial K value;

S600, performing weighted clustering on the address element code set of S400 using WKmeans with the K value from S500 to obtain standard addresses and an attribute weight matrix;

S700, performing weighted matching on the address to be matched according to the standard addresses and the attribute weight matrix of S600;

S800, updating the standard address library according to the weighted matching result of the address to be matched.

Further, in S100, the address data sources include at least: relational tables in a relational database, and Excel, CSV, and text-format files.

Further, in S200, the address elements are extracted according to a first preset rule, where the first preset rule is: for the cleaned address text, elements are extracted according to the standard levels of province, city, county or district, town or street, village group or community, road number, point of interest, building, unit, floor, and room number, and a named entity recognition model is used to extract and identify the elements.

Further, in S400, encoding is performed according to a second preset rule to obtain an address element code set, where the second preset rule is: one-hot encoding is performed along the dimensions of province, city, county or district, town or street, village group or community, road number, point of interest, building, unit, floor, and room number, and any address element missing from an address is replaced by a random negative real number with a large absolute value.

Further, in S500, preliminary clustering is performed using the K-means elbow method to obtain an initial K value, specifically as follows:

S501, dividing the address data into K normal categories, and taking the initial center coordinate of each category as its initial centroid;

S502, calculating the Euclidean distance between each sample and every initial centroid, and assigning each sample to the category of its nearest centroid;

S503, after all samples have been assigned, resetting the centroid of each category to the mean of all samples in that category, and recalculating the Euclidean distance between each sample and the new centroids;

S504, as the K value is increased, calculating the sum of squared errors (SSE) of the samples; while the SSE is still decreasing rapidly, continuing to increase the K value; otherwise, outputting the current K value.

Further, the sum of squared errors SSE is calculated as:

$$SSE = \sum_{i=1}^{K} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$

where $C_i$ is the $i$-th cluster, $p$ is a data point in $C_i$, and $m_i$ is the centroid of $C_i$. SSE is the clustering error over all data and serves as the criterion for evaluating the clustering result.

Further, in S600, WKmeans-based weighted clustering adds a weight parameter to the objective function; the final variable weights computed during the iterative solution can identify noise variables, which makes it possible to cluster large-scale, high-dimensional data.

Further, in S800, if the similarity of the candidate standard addresses returned in S700 does not satisfy the threshold, the standard address library does not yet contain the business address. A standard address update mechanism is then triggered: the standard address library records the address as a standard address, and the K value and the weights are recalculated. Otherwise, the ID of the corresponding standard address is returned, and the standard address text for that ID is retrieved through ElasticSearch, which completes the address standardization.

The invention also discloses an address matching analysis system based on weighted clustering, which comprises: an address data cleaning unit, an address element extraction unit, an address element completion unit, an address element encoding unit, a preliminary clustering unit, a weighted clustering unit, a weighted matching unit, and a standard address updating unit; wherein:

the address data cleaning unit is configured to clean address data from different sources;

the address element extraction unit is configured to extract address elements from the cleaned address data according to a first preset rule;

the address element completion unit is configured to complete the extracted address elements according to historical administrative divisions;

the address element encoding unit is configured to encode the completed address element set according to a second preset rule to obtain an address element code set;

the preliminary clustering unit is configured to perform preliminary clustering using the K-means elbow method to obtain an initial K value;

the weighted clustering unit is configured to perform weighted clustering on the address element code set using WKmeans with the K value from the preliminary clustering unit, to obtain standard addresses and an attribute weight matrix;

the weighted matching unit is configured to perform weighted matching on the address to be matched according to the standard addresses and the attribute weight matrix of the weighted clustering unit;

the standard address updating unit is configured to update the standard address library according to the weighted matching result of the address to be matched.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

The invention discloses an address matching analysis method based on weighted clustering. Address texts are cleaned; elements are extracted according to the standard levels of province, city, county (district), town (street), village (community), road number, point of interest, building, unit, floor, and room number; and administrative address elements from province and city down to community are completed using regional administrative division data. For each batch of address data, preliminary k-means clustering with the elbow method yields an optimal k value. Using that k value, weighted clustering yields the address corresponding to the cluster center of each category as a candidate standard address, and the candidates are matched against the addresses in the standard address library using the category weight of each element, so that the standard address library is continuously supplemented.

Based on the standard address library, the method handles multi-source address data through text cleaning, address element extraction and completion, and weighted similarity matching of address elements, obtaining the unique standard address corresponding to each original address. This standardizes multi-source address data and effectively resolves address ambiguity.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

fig. 1 is a flowchart of an address matching analysis method based on weighted clustering in embodiment 1 of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In order to solve the problems in the prior art, embodiments of the present invention provide a method and a system for address matching analysis based on weighted clustering.

Example 1

The embodiment discloses an address matching analysis method based on weighted clustering, as shown in fig. 1, including:

S100, cleaning address data from different sources. Specifically, address data from different sources may be stored in different forms, commonly relational tables in a relational database and Excel, CSV, and text-format files; address text cleaning removes abnormal characters and upper- and lower-case letters from the address data.
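The cleaning in S100 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `clean_address` and the set of characters that are kept (CJK ideographs, digits, `#`, and `-`) are assumptions. Latin letters are dropped because the text says letters are removed; a deployment that needs building letters would keep them instead.

```python
import re
import unicodedata

# Assumption: only CJK ideographs, ASCII digits, "#" and "-" are kept.
# Latin letters are dropped, following the literal wording of S100.
_NOT_ALLOWED = re.compile(r"[^\u4e00-\u9fff0-9#\-]")


def clean_address(raw: str) -> str:
    """Fold full-width forms to ASCII, then strip every disallowed character."""
    folded = unicodedata.normalize("NFKC", raw)  # e.g. full-width digits -> ASCII digits
    return _NOT_ALLOWED.sub("", folded)


print(clean_address("湖北省 武汉市！洪山区１２号abc"))
```

Normalizing before filtering means full-width digits survive as ordinary digits instead of being discarded as abnormal characters.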

S200, extracting address elements from the cleaned address data according to a first preset rule. Specifically, for the cleaned address text, elements are extracted according to the standard levels of province, city, county (district), town (street), village (community), road number, point of interest, building, unit, floor, and room number, and a named entity recognition model is used to extract and identify the elements.
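To make the input and output of S200 concrete, the sketch below uses a suffix-based regular expression as a stand-in for the named entity recognition model, which the text does not specify. It covers only five of the eleven element types, and the names `_PATTERN`, `FIELDS`, and `extract_elements` are hypothetical.

```python
import re

# Stand-in for the NER model: match elements by their typical suffix characters.
_PATTERN = re.compile(
    r"(?P<province>[^省]+省)?"
    r"(?P<city>[^市]+市)?"
    r"(?P<district>[^区县]+[区县])?"
    r"(?P<street>[^街道镇乡]+(?:街道|镇|乡))?"
    r"(?P<road>[^路]+路\d*号?)?"
)
FIELDS = ("province", "city", "district", "street", "road")


def extract_elements(address: str) -> dict:
    """Return one entry per field; elements not found in the text are None."""
    match = _PATTERN.match(address)  # always matches because every group is optional
    return {name: match.group(name) for name in FIELDS}


print(extract_elements("湖北省武汉市洪山区珞喻路129号"))
```

Elements left as `None` here are the ones that the completion step (S300) and the missing-value rule of the encoding step (S400) have to deal with.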

S300, completing the extracted address elements according to historical administrative divisions. Specifically, the identified elements are completed from historical data at the province, city, county (district), town (street), and village group (community) levels, so that the address information is as complete as possible.
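One way to picture the completion in S300 is a child-to-parent lookup over administrative division data. The table `PARENT` below is a hypothetical two-row excerpt, and `LEVELS` and `complete_elements` are illustrative names; the text does not describe how the division data are stored.

```python
# Hypothetical excerpt of an administrative-division table: child -> parent.
PARENT = {
    "洪山区": "武汉市",
    "武汉市": "湖北省",
}
LEVELS = ("province", "city", "district")  # ordered from highest to lowest level


def complete_elements(elements: dict) -> dict:
    """Fill missing higher-level elements by walking up from the lowest known level."""
    completed = dict(elements)
    for i in range(len(LEVELS) - 1, 0, -1):
        child, parent = LEVELS[i], LEVELS[i - 1]
        if completed.get(child) and not completed.get(parent):
            # Unknown children simply leave the parent as None.
            completed[parent] = PARENT.get(completed[child])
    return completed


print(complete_elements({"province": None, "city": None, "district": "洪山区"}))
```

Walking from the lowest level upward lets a single known district fill in both its city and its province in one pass.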

S400, encoding the completed address element set according to a second preset rule to obtain an address element code set. In this embodiment, the second preset rule is: one-hot encoding is performed along the dimensions of province, city, county or district, town or street, village group or community, road number, point of interest, building, unit, floor, and room number, and any address element missing from an address is replaced by a random negative real number with a large absolute value.
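The encoding rule of S400 can be sketched with NumPy as below. Two details are assumptions, since the text does not fix them: a missing element fills its whole one-hot block with a single random negative value, and "large absolute value" is taken to mean a magnitude between `missing_scale` and twice `missing_scale`. The function name `encode` is hypothetical.

```python
import numpy as np


def encode(records, fields, rng, missing_scale=1e3):
    """One-hot encode each field; a missing value fills that field's block
    with one random negative number of large magnitude."""
    vocab = {f: sorted({r[f] for r in records if r.get(f)}) for f in fields}
    rows = []
    for r in records:
        parts = []
        for f in fields:
            block = np.zeros(len(vocab[f]))
            value = r.get(f)
            if value:
                block[vocab[f].index(value)] = 1.0
            else:
                # Magnitude lies in [missing_scale, 2 * missing_scale).
                block[:] = -missing_scale * (1.0 + rng.random())
            parts.append(block)
        rows.append(np.concatenate(parts))
    return np.vstack(rows), vocab


records = [
    {"city": "武汉市", "district": "洪山区"},
    {"city": "武汉市", "district": None},
    {"city": "宜昌市", "district": "西陵区"},
]
X, vocab = encode(records, ("city", "district"), np.random.default_rng(0))
print(X.shape)
```

A large negative value places an address with a missing element far from every address that has a value in that dimension, so missing data cannot be mistaken for a match.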

In S500 of this embodiment, preliminary clustering is performed using the K-means elbow method to obtain an initial K value, specifically as follows:

S501, dividing the address data into K normal categories, and taking the initial center coordinate of each category as its initial centroid;

S502, calculating the Euclidean distance between each sample and every initial centroid, and assigning each sample to the category of its nearest centroid;

S503, after all samples have been assigned, resetting the centroid of each category to the mean of all samples in that category, and recalculating the Euclidean distance between each sample and the new centroids;

S504, as the K value is increased, calculating the sum of squared errors (SSE) of the samples; while the SSE is still decreasing rapidly, continuing to increase the K value; otherwise, outputting the current K value.

Further, the sum of squared errors SSE is calculated as:

$$SSE = \sum_{i=1}^{K} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$

where $C_i$ is the $i$-th cluster, $p$ is a data point in $C_i$, and $m_i$ is the centroid of $C_i$. SSE is the clustering error over all data and serves as the criterion for evaluating the clustering result. As the number of clusters K increases, the samples are divided more finely and each cluster becomes more tightly aggregated, so the SSE naturally decreases. When K is smaller than the true number of clusters, increasing K greatly increases the aggregation of each cluster, so the SSE falls sharply. Once K reaches the true number of clusters, the gain in aggregation from increasing K drops quickly, so the decrease in SSE shrinks abruptly and then levels off as K continues to grow. The plot of SSE against K therefore has the shape of an elbow, and the K value at the elbow is the true number of clusters in the data.

Preliminary clustering is performed using K-means combined with the elbow method, and the most suitable K value is determined for the weighted clustering in the next stage.
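The elbow search can be sketched as below. The stopping rule is an assumption: "rapidly converging" is read as a relative SSE drop of at least `min_gain` (a hypothetical threshold) when K grows by one. The deterministic farthest-first initialization is also an assumption, chosen so that the example is reproducible; the text does not say how the initial centroids are picked.

```python
import numpy as np


def farthest_first(X, k):
    """Assumed init: start at X[0], then repeatedly take the sample farthest
    from all centroids chosen so far."""
    chosen = [0]
    d = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen].astype(float)


def kmeans_sse(X, k, n_iter=100):
    """Run K-means (S501 to S503) and return the final SSE."""
    centroids = farthest_first(X, k)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)  # S502: nearest centroid
        # S503: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dist.min(axis=1).sum())


def elbow_k(X, k_max, min_gain=0.5):
    """S504: keep increasing K while the relative SSE drop is at least min_gain."""
    prev = kmeans_sse(X, 1)
    for k in range(2, k_max + 1):
        cur = kmeans_sse(X, k)
        if prev == 0 or (prev - cur) / prev < min_gain:
            return k - 1
        prev = cur
    return k_max


# Three tight, well-separated groups of four points each.
base = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
X = np.vstack([base, base + [10.0, 0.0], base + [0.0, 10.0]])
print(elbow_k(X, k_max=6))
```

On this toy data the SSE falls sharply up to K = 3 and barely changes at K = 4, so the search stops at three clusters.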

The underlying idea of the algorithm is to cluster the data by the mean-value method. Before clustering starts, a K value and an initial centroid for each category are set, where the K value is the number of normal categories into which the data are divided when the algorithm runs, and the initial centroid is the center coordinate of each initial category. After classification, the optimal clustering result is obtained through iterative optimization of the means. In Euclidean space, the Euclidean distance is generally used to represent the distance between two data items. For two data items $(x_1, y_1)$ and $(x_2, y_2)$ in two-dimensional space, the Euclidean distance is:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

S600, performing weighted clustering on the address element code set of S400 using WKmeans with the K value from S500, to obtain standard addresses and an attribute weight matrix. In S600 of this embodiment, WKmeans-based weighted clustering adds a weight parameter to the objective function; the final variable weights computed during the iterative solution can identify noise variables, which makes it possible to cluster large-scale, high-dimensional data.

Specifically, plain K-means treats every variable equally during clustering. When the data contain many variables with a high degree of dissimilarity, the cluster structure is usually confined to a subset of the variables rather than the whole variable set, so a global optimum cannot be reached and a high-quality clustering result is difficult to produce.

In practice, when clustering the address element set, the feature variables of each dimension (province, city, county (district), town (street), village group (community), road number, point of interest, building, unit, floor, and room number) contribute differently to the address clustering result. The method therefore uses WKmeans-based weighted clustering to cluster the address element set.

Compared with the original K-means algorithm, WKmeans only adds a weight parameter to the objective function, and it can identify noise variables from the final variable weights computed during the iterative solution. This capability is important for variable selection when clustering large-scale, high-dimensional data.

Wherein:

Cluster allocation formula: if sample $u_i$ is at the least distance from class $p$, it is assigned to class $p$.

Cluster center update formula: the new cluster center of class $p$ is the mean of the samples within that cluster.

Weight calculation formula and distance calculation formula: $D_j$ is the sum of the distances of all sample points in dimension $j$.
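The four update rules can be sketched in NumPy as below. Because the formulas themselves are not reproduced in this text, the sketch assumes the commonly published W-k-means formulation: weighted squared Euclidean distance with weights raised to a power `beta` greater than 1, and weights proportional to $D_j^{-1/(\beta-1)}$, normalized to sum to 1. The names `wkmeans`, `beta`, and `eps`, and the farthest-first initialization, are illustrative assumptions.

```python
import numpy as np


def wkmeans(X, k, beta=2.0, n_iter=50, eps=1e-12):
    """Return (labels, centers Z, feature weights w); requires beta > 1."""
    n, m = X.shape
    # Assumed deterministic init: X[0], then farthest-first picks.
    chosen, d = [0], ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        chosen.append(int(d.argmax()))
        d = np.minimum(d, ((X - X[chosen[-1]]) ** 2).sum(axis=1))
    Z = X[chosen].astype(float)
    w = np.full(m, 1.0 / m)  # start with equal weights
    labels = None
    for _ in range(n_iter):
        # Cluster allocation: nearest center under the weighted distance.
        dist = (((X[:, None, :] - Z[None, :, :]) ** 2) * w ** beta).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Cluster center update: intra-cluster mean (empty clusters keep their center).
        Z = np.array([X[labels == p].mean(axis=0) if (labels == p).any() else Z[p]
                      for p in range(k)])
        # D_j: summed squared deviation from the assigned center in dimension j;
        # eps keeps every D_j positive.
        D = ((X - Z[labels]) ** 2).sum(axis=0) + eps
        # Weight update: proportional to D_j^(-1/(beta-1)), normalized to sum to 1.
        inv = D ** (-1.0 / (beta - 1.0))
        w = inv / inv.sum()
    return labels, Z, w


# Dimension 0 separates two groups; dimension 1 is spread out within each group.
X = np.array([[0.0, 0.0], [0.1, 1.0], [0.0, 2.0], [0.1, 3.0],
              [10.0, 0.0], [10.1, 1.0], [10.0, 2.0], [10.1, 3.0]])
labels, Z, w = wkmeans(X, k=2)
print(labels, w)
```

In this toy run, the dimension that separates the two groups ends up with almost all of the weight, while the widely spread dimension is weighted near zero, which is the noise-variable identification described above.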

S700, performing weighted matching on the address to be matched according to the standard addresses and the attribute weight matrix of S600. Specifically, weighted matching is performed between the cluster centers and the standard addresses in the standard address library. For any U, its cluster is calculated from the weight set W obtained in S600 and the representations Z of the standard addresses (cluster centers) in the standard address library.

The Z for which the calculated P is smallest, that is, the standard address (cluster center) Z at the minimum spatial distance from U, gives the category of U, namely the standard address of the address corresponding to U.
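The matching rule can be illustrated as below. The exact form of P is not given in this text, so the sketch assumes a weighted squared Euclidean distance with the weights raised to the power `beta`. The function name `match_address` and the sample vectors are hypothetical.

```python
import numpy as np


def match_address(u, Z, w, beta=2.0):
    """Return (index, distance) of the standard address in Z nearest to u
    under the feature weights w."""
    u, Z, w = np.asarray(u, float), np.asarray(Z, float), np.asarray(w, float)
    dist = ((Z - u) ** 2 * w ** beta).sum(axis=1)  # one weighted distance per center
    best = int(dist.argmin())
    return best, float(dist[best])


Z = [[0.0, 0.0], [10.0, 0.0]]  # encoded standard addresses (cluster centers)
w = [0.9, 0.1]                 # feature weights from the weighted clustering
idx, dist = match_address([9.0, 5.0], Z, w)
print(idx, dist)
```

Because the second dimension has a low weight, the large gap in that dimension barely affects the result, and the address is matched to the center that agrees with it in the heavily weighted dimension.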

S800, updating the standard address library according to the weighted matching result of the address to be matched. Specifically, given a threshold on P, if the matching similarity calculated in S700 between the address of a candidate cluster center and every standard address in the standard address library falls outside the threshold, the address is judged to be absent from the standard address library, and the address corresponding to the cluster center is treated as a standard address and persisted to the library. Otherwise, the address is judged to be covered by the standard address library, and the matching address in the library is returned as the standard address of the samples in that cluster.
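The threshold decision of S800 can be sketched as below. It assumes that "outside the threshold" means the smallest weighted distance exceeds `threshold`, and it represents the standard address library as an in-memory list of encoded vectors. Both choices, and the name `match_or_register`, are assumptions; persistence and the recalculation of K and the weights are left to the caller.

```python
import numpy as np


def match_or_register(u, library, w, threshold, beta=2.0):
    """Return (address_id, was_added) for the encoded address u."""
    u = np.asarray(u, dtype=float)
    if library:
        dist = ((np.asarray(library) - u) ** 2 * np.asarray(w) ** beta).sum(axis=1)
        best = int(dist.argmin())
        if dist[best] <= threshold:
            return best, False       # an existing standard address is accepted
    library.append(u)                # record u as a new standard address
    return len(library) - 1, True    # caller should then recompute K and the weights


library = [np.array([0.0, 0.0]), np.array([10.0, 0.0])]
w = [0.9, 0.1]
print(match_or_register([9.0, 5.0], library, w, threshold=2.0))
print(match_or_register([50.0, 50.0], library, w, threshold=2.0))
```

The first call matches an existing entry and leaves the library unchanged; the second is too far from every entry, so it is appended and flagged as new.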

The embodiment also discloses an address matching analysis system based on weighted clustering, which comprises: an address data cleaning unit, an address element extraction unit, an address element completion unit, an address element encoding unit, a preliminary clustering unit, a weighted clustering unit, a weighted matching unit, and a standard address updating unit; wherein:

the weighted clustering unit is configured to perform weighted clustering on the address element code set using WKmeans with the K value from the preliminary clustering unit, to obtain standard addresses and an attribute weight matrix;

the weighted matching unit is configured to perform weighted matching on the address to be matched according to the standard addresses and the attribute weight matrix of the weighted clustering unit.

The specific working methods of the address data cleaning unit, the address element extraction unit, the address element completion unit, the address element encoding unit, the preliminary clustering unit, the weighted clustering unit, the weighted matching unit, and the standard address updating unit are described in detail in the address matching analysis method based on weighted clustering above and are not repeated here.

The embodiment discloses an address matching analysis method based on weighted clustering. Address texts are cleaned; elements are extracted according to the standard levels of province, city, county (district), town (street), village (community), road number, point of interest, building, unit, floor, and room number; and administrative address elements from province and city down to community are completed using regional administrative division data. For each batch of address data, preliminary k-means clustering with the elbow method yields an optimal k value. Using that k value, weighted clustering yields the address corresponding to the cluster center of each category as a candidate standard address, and the candidates are matched against the addresses in the standard address library using the category weight of each element, so that the standard address library is continuously supplemented. Based on the standard address library, the method handles multi-source address data through text cleaning, address element extraction and completion, and weighted similarity matching of address elements, obtaining the unique standard address corresponding to each original address. This standardizes multi-source address data and effectively resolves address ambiguity.

It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".

Claims

1. An address matching analysis method based on weighted clustering, characterized by comprising the following steps:

S100, cleaning address data from different sources;

S400, encoding the completed address element set according to a second preset rule to obtain an address element code set;

S600, performing weighted clustering on the address element code set of S400 using WKmeans with the K value from S500 to obtain standard addresses and an attribute weight matrix;

2. The address matching analysis method based on weighted clustering as claimed in claim 1, wherein in S100, the address data sources include at least: relational tables in a relational database, and Excel, CSV, and text-format files.

3. The address matching analysis method based on weighted clustering as claimed in claim 1, wherein in S200, the address elements are extracted according to a first preset rule, and the first preset rule is: for the cleaned address text, elements are extracted according to the standard levels of province, city, county or district, town or street, village group or community, road number, point of interest, building, unit, floor, and room number, and a named entity recognition model is used to extract and identify the elements.

4. The address matching analysis method based on weighted clustering as claimed in claim 1, wherein in S400, encoding is performed according to a second preset rule to obtain an address element code set, and the second preset rule is: one-hot encoding is performed along the dimensions of province, city, county or district, town or street, village group or community, road number, point of interest, building, unit, floor, and room number, and any address element missing from an address is replaced by a random negative real number with a large absolute value.

5. The address matching analysis method based on weighted clustering as claimed in claim 1, wherein in S500, preliminary clustering is performed using the K-means elbow method to obtain an initial K value, specifically as follows:

S501, dividing the address data into K normal categories, and taking the center coordinate of each initial category as its initial centroid;

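6. The address matching analysis method based on weighted clustering as claimed in claim 5, wherein the sum of squared errors SSE is calculated as:

$$SSE = \sum_{i=1}^{K} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$

where $C_i$ is the $i$-th cluster, $p$ is a data point in $C_i$, and $m_i$ is the centroid of $C_i$.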

7. The address matching analysis method based on weighted clustering as claimed in claim 1, wherein in S600, WKmeans-based weighted clustering adds a weight parameter to the objective function, and the final variable weights computed during the iterative solution can identify noise variables, which makes it possible to cluster large-scale, high-dimensional data.

8. The address matching analysis method based on weighted clustering as claimed in claim 1, wherein in S800, if the similarity of the candidate standard addresses returned in S700 does not satisfy the threshold, the standard address library does not yet contain the business address; a standard address update mechanism is then triggered, the standard address library records the address as a standard address, and the K value and the weights are recalculated; otherwise, the ID of the corresponding standard address is returned, and the standard address text for that ID is retrieved through ElasticSearch, which completes the address standardization.

9. An address matching analysis system based on weighted clustering, comprising: an address data cleaning unit, an address element extraction unit, an address element completion unit, an address element encoding unit, a preliminary clustering unit, a weighted clustering unit, a weighted matching unit, and a standard address updating unit; wherein: