CN111460069A

CN111460069A - Address correction method based on weighted directed acyclic graph

Info

Publication number: CN111460069A
Application number: CN202010241121.0A
Authority: CN
Inventors: 姚伟烈; 彭艳兵; 邱秀连; 许云
Original assignee: Nanjing Fenghuo Tiandi Communication Technology Co ltd
Current assignee: Nanjing Fenghuo Tiandi Communication Technology Co ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2020-07-28

Abstract

The invention discloses an address correction method based on a weighted directed acyclic graph, which relates to the field of data management and specifically comprises the following steps: performing tail-free processing on the address data subjected to pre-normalization, and performing TF-IDF feature word extraction; calculating an address similarity threshold based on the extracted feature words and the weights; calculating similarity between addresses in the tailless address base and the pre-normalized addresses: performing word segmentation analysis on the address of the preprocessed address, and then combining every two in sequence to form an edge set of the directed graph; and generating a directed acyclic graph based on the calculation of the edge weight data, outputting a weighted maximum path, and completing address normalization. The address completion correction algorithm provided by the invention can effectively solve the problems and can solve the address normalization problem on the basis of no standard address level division data.

Description

Address correction method based on weighted directed acyclic graph

Technical Field

The invention belongs to the field of data management, and particularly relates to an address rectification method based on a weighted directed acyclic graph.

Background

In the field of data management, address normalization analysis is often involved, the most common normalized address data is logistics address data, early logistics address information is input by self, and the problem that address information is not standard due to the fact that people miss part of address information or fill part of address information by mistake is solved, so that an address normalization method is provided.

At present, most address normalization methods are fast normalization methods based on a dictionary tree, wherein each layer of the dictionary tree represents an address level, and the final normalized address is obtained by traversing from a root node to a leaf node. The main drawback of this method is the dependence on standard and complete address-level partitioning data for constructing the dictionary tree, and secondly the space consumption of the algorithm is excessive.

Disclosure of Invention

The invention aims to solve the technical problem of providing an address rectification method based on a weighted directed acyclic graph aiming at the defects of the background technology, wherein when no standard and complete address grade divides data, the weighted directed acyclic graph is established through the redundancy relation among mass data, and the weighted maximum path of the directed acyclic graph is the address after normalization.

The invention adopts the following technical scheme for solving the technical problems:

an address rectification method based on a weighted directed acyclic graph specifically comprises the following steps:

step 1, performing tailless processing on address data subjected to pre-normalization, and performing TF-IDF feature word extraction;

step 2, calculating an address similarity threshold value based on the extracted feature words and the weight;

step 3, calculating the similarity between the address in the non-tail address base and the pre-normalized address:

step 4, performing word segmentation analysis on the address of the preprocessed address, and then combining every two in sequence to form an edge set of the directed graph;

and 5, generating a directed acyclic graph based on the calculated edge weight data, outputting a weighted maximum path, and completing address normalization.

As a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, the step 1 specifically includes:

carrying out tailless processing on the address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF characteristic word extraction { (F)₁,W₁),(F₂,W₂),...,(F_n,W_n) In which F_nRepresenting a characteristic word, W_nThe TF-IDF value representing the feature word.

As a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, the step 2 is specifically as follows: calculating an address similarity threshold θ based on the extracted TF-IDF feature words and a weight FW, wherein FW { (F)₁,W₁),(F₂,W₂),...,(F_n,W_n) And, calculating specifically as follows:

as a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, in step 3, the similarity W between the address (Sadd) in the tailless address base and the pre-normalized address is calculated, which is specifically as follows:

extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S₁,s₂,...,s₁₀}。

As a further preferable scheme of the address rectification method based on the weighted directed acyclic graph, in step 4, word segmentation analysis is carried out on 11 addresses S and preprocessed addresses which meet the requirement that the similarity W is more than or equal to theta, and the two addresses are combined in sequence to form an edge set of the directed graph:

all the above-mentioned edge data are formed into edge weight data E, E { < w { [₁,w₂,c₁＞,＜w₂,w₃,c₂＞,...,＜w_k-1,w_k,c₃> -, where w₁,w₂Respectively representing the start node of an edge, c₁Representing the weight of an edge, i.e. w₁,w₂The number of occurrences.

As a further preferable scheme of the address rectification method based on the weighted directed acyclic graph of the present invention, in step 5, a directed acyclic graph G is generated based on the calculated edge weight data E, a weighted maximum path P is output, and address normalization is completed:

wherein the node information on the P path is the normalized result of the address Add, i.e. w₁w₂...w_p。

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

1. aiming at the defects that address normalization excessively depends on a standard and complete address grade data division, the invention provides a weighted directed acyclic graph established based on the redundancy relation among massive address data, wherein the weighted maximum path of the directed acyclic graph is a normalized labeled standard address;

2. the address normalization method is a method for optimizing address by segments, words forming the address are sequentially combined and divided into address sections, and the optimal address section relation forms the final normalized address, so that the error rate of address normalization can be reduced, and the better the address normalization effect is improved; when no standard and complete address grade division data exists, the weighted directed acyclic graph is established through the redundancy relation among the mass data, and the weighted maximum path of the directed acyclic graph is the normalized address.

Drawings

FIG. 1 is a flow chart of an address completion correction algorithm for a weighted directed acyclic graph of the present invention;

FIG. 2 is a weighted directed acyclic graph constructed based on an address in accordance with the present invention.

Detailed Description

The technical scheme of the invention is further explained in detail by combining the attached drawings:

the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Aiming at the defects that address normalization excessively depends on a standard and complete address grade data division, the invention provides a weighted directed acyclic graph established based on the redundancy relation among massive address data, wherein the weighted maximum path of the directed acyclic graph is a normalized labeled standard address.

FIG. 1 is a flow chart of an address completion correction algorithm for a weighted directed acyclic graph, including a precedence relationship between data processing modules; fig. 2 is a weighted directed acyclic graph constructed based on an address, in which part of the private information has been replaced with a #.

An address rectification method based on a weighted directed acyclic graph is shown in fig. 1, and specifically includes the following steps:

step 1, carrying out tailless processing on address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF feature word extraction { (F)₁,W₁),(F₂,W₂),...,(F_n,W_n) In which F_iRepresenting a characteristic word, W_iA TF-IDF value representing the feature word;

step 2, calculating an address similarity threshold value theta based on the extracted TF-IDF characteristic words and a weight FW, wherein FW { (F)₁,W₁),(F₂,W₂),...,(F_n,W_n) And, calculating specifically as follows:

step 3, calculating the similarity W between the address (Sadd) in the non-tail address base and the pre-normalized address, which is as follows:

extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S₁,s₂,...,s₁₀}；

Step 4, performing word segmentation analysis on 11 addresses including the first 10S and the preprocessing address which meet the similarity W being more than or equal to theta, and combining every two addresses in sequence to form an edge set of the directed graph:

all the above-mentioned edge data are formed into edge weight data E, E { < w { [₁,w₂,c₁＞,＜w₂,w₃,c₂＞,...,＜w_k-1,w_k,c₃> -, where w₁,w₂Respectively representing the start node of an edge, c₁Representing the weight of an edge, i.e. w₁,w₂The number of occurrences;

step 5, generating a directed acyclic graph G based on the calculated edge weight data E, and outputting a weighted maximum path P based on a weighted directed acyclic graph constructed on a certain address, as shown in fig. 2, to complete address normalization:

the normalization result w of whether the nodes on the weighted maximum path are sequentially connected or the address Add is₁w₂...w_p

The details of the technical solution of the invention are illustrated with reference to the accompanying drawings:

fig. 1 shows the overall process of the address completion algorithm, for example, the address to be completed is "oasis home yard in large watt kiln" unit cell "; the result after deleting the tail information is: "Luzhou home garden on the way in a big tile kiln"; the result of the word segmentation TF-IDF without the tail address is: 2.987, 'homeland', 2.178, 'oasis', 2.108, 'middle road', 2.064 }; the similarity threshold is: 5.051, respectively; the top 10 tailless addresses that meet the threshold are shown in table 1.

TABLE 1

Serial number	Tailless address
		1	Middle road of large tile kiln in Toyotai district of Beijing City
2	Middle road of large tile kiln in Toyotai district of Beijing City
		3	Large tile kiln middle road of Fengtai district of Haizui district of Beijing
4	Zhonglu Fengze home garden with large tile kiln in Fengtai district of Beijing City
		5	Great watt kiln green land home garden between five rings and six rings of Fengtai district in Beijing City
6	Luzhou home garden of Lugou bridge tile kiln Zhongyao of Fengtai district of Beijing
		7	Large tile kiln middle road at west road junction of Toyotai district of Beijing City
8	Great tile kiln oasis home garden in Fengtai district of Beijing City
		9	Middle road of large tile kiln in east city of Beijing City
10	Middle road of Rugou bridge large tile kiln in Fengtu platform area of Beijing City

The set of directed edge weights obtained by word segmentation is shown in table 2:

TABLE 2

Directed graph edges	Directed graph edges	Directed graph edges	Directed graph edges
				<Beginning, Beijing, 10>	<Dongchuan area, large brickkiln, 1>	<To, six rings, 1>	<Feng ze, Home 1>
<Start, large brickkiln, 1>	<Fengcai, Lugou bridge, 3>	<Six rings, between, 1>	<Oasis, home, 4>
				<Beijing City, Fengtai, 1>	<Rugou bridge, large brickkiln, 3>	<Chinese, Large Tile kiln 1>	<Zhongbaodao, oasis, 1>
<Beijing City, Tokyo, 1>	<Fengcai, West, 1>	<Fengtai district, Large Tile kiln, 3>	<Home, date, 1>
				<Beijing City, Fengtai district, 7>	<West, crossing, 1>	<Large tile kiln, middle road, 8>	<Period, cell, 1>
<Beijing, Haihe district, 1>	<Level crossing, large brickkiln, 1>	<Large Tile kiln, Medium court, 1>	<Cell, 1>
				<Fengtai, Fengtai district, 1>	<Fengcai region, five rings, 1>	<Zhongluo, oasis, 1>	<Chamber, end, 1>
<Haihai region, large brickkiln, 1>	<Pentacyclic ring to, 1>	<Zhonglu, Fengze, 1>

The formed weighted directed acyclic graph is shown in fig. 2 in the attached drawing, and finally the most weighted path node value is output as the result of address normalization, that is: "the Luzhou home garden of Lugou bridge large tile kiln of Fengti district of Beijing unit.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention. While the embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A method for correcting addresses based on a weighted directed acyclic graph is characterized in that: the method specifically comprises the following steps:

2. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: the step 1 is specifically as follows:

3. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: the step 2 is specifically as follows: calculating an address similarity threshold θ based on the extracted TF-IDF feature words and a weight FW, wherein FW { (F)₁,W₁),(F₂,W₂),...,(F_n,W_n) And, calculating specifically as follows:

4. the address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 3, the similarity W between the address (Sadd) in the tailless address base and the pre-normalized address is calculated as follows:

5. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 4, performing word segmentation analysis on 11 addresses including the first 10 addresses S and the preprocessing address satisfying the similarity W being more than or equal to theta, and combining every two addresses in sequence to form an edge set of the directed graph:

......

6. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 5, generating a directed acyclic graph G based on the calculation edge weight data E, outputting a weighted maximum path P, and completing address normalization: