CN114756654A - Dynamic place name and address matching method and device, computer equipment and storage medium

Info

Publication number: CN114756654A
Application number: CN202210440918.2A
Authority: CN
Inventors: 郭伟鹏; 黄诗颖; 王婷婷; 陈顺丽; 黄涛; 蒙梦; 王芳丽
Original assignee: Guangzhou China Dci Co ltd
Current assignee: Guangzhou China Dci Co ltd
Priority date: 2022-04-25
Filing date: 2022-04-25
Publication date: 2022-07-15

Abstract

The application relates to a dynamic place name address matching method and apparatus, a computer device, and a storage medium. The method comprises: acquiring an address to be matched and a first address word segmentation set corresponding to the address to be matched; acquiring second address word segmentation sets from a pre-constructed address database, and taking each second address word segmentation set that matches the first address word segmentation set as an initial address word segmentation set; acquiring the matching degree between the first address word segmentation set and each initial address word segmentation set, and, if the matching degree meets a preset condition, taking that initial address word segmentation set as a candidate address word segmentation set; and acquiring a target address word segmentation set from the candidate address word segmentation sets according to the similarity between the first address word segmentation set and each candidate address word segmentation set, and taking the target address corresponding to the target address word segmentation set as the address matching result of the address to be matched. The method can improve both the precision and the efficiency of address matching.

Description

Dynamic place name address matching method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of geographic information service technologies, and in particular, to a dynamic place name address matching method, apparatus, computer device, and storage medium.

Background

Address matching is a spatial positioning technology used in smart city construction and has long been a research hotspot in the field of geographic information. However, because Chinese addresses are distinctive and complex, matching them accurately and quickly remains a difficult research problem.

At present, existing Chinese place name address matching methods and devices mainly implement Chinese address matching by defining and splitting standard address elements and combining word segmentation, rules, and other techniques in a hybrid approach. However, these existing methods suffer from low address matching precision.

Disclosure of Invention

In view of the above, it is necessary to provide a dynamic place name address matching method, apparatus, computer device and storage medium capable of improving the address matching accuracy.

In a first aspect, the present application provides a dynamic place name address matching method, including:

acquiring an address to be matched and a first address word segmentation set corresponding to the address to be matched;

acquiring second address participle sets from a pre-constructed address database, and taking each second address participle set that matches the first address participle set as an initial address participle set;

acquiring the matching degree between the first address word segmentation set and the initial address word segmentation set, and if the matching degree meets a preset condition, taking the initial address word segmentation set as a candidate address word segmentation set;

and acquiring a target address word segmentation set from the candidate address word segmentation set according to the similarity between the first address word segmentation set and the candidate address word segmentation set, and taking a target address corresponding to the target address word segmentation set as an address matching result of the address to be matched.

In one embodiment, obtaining a second address participle set matched with the first address participle set as an initial address participle set includes:

acquiring a current second address participle set and second address participles contained in the current second address participle set;

acquiring first address participles contained in a first address participle set;

and if the address participles identical to the first address participles exist in the second address participles, taking the current second address participle set as an initial address participle set.

In one embodiment, obtaining the matching degree between the first address participle set and the initial address participle set, and if the matching degree meets a preset condition, taking the initial address participle set as a candidate address participle set, includes:

acquiring a current initial address participle set and a first intersection of the current initial address participle set and a first address participle set;

acquiring the word segmentation quantity of address word segmentation contained in the first intersection;

and if the number of the participles is larger than a preset number threshold, taking the current initial address participle set as a candidate address participle set.

In one embodiment, obtaining the similarity between the first address participle set and the candidate address participle set includes:

acquiring a current candidate address participle set and a second intersection of the current candidate address participle set and the first address participle set;

acquiring a union of a current candidate address word segmentation set and a first address word segmentation set;

obtaining the similarity between the first address participle set and the current candidate address participle set according to the address participles contained in the second intersection and the address participles contained in the union;

and acquiring a target address word segmentation set from the candidate address word segmentation set includes:

acquiring the similarity between each candidate address word segmentation set and the first address word segmentation set;

and taking the candidate address word segmentation set with the maximum similarity as a target address word segmentation set.

In one embodiment, obtaining the similarity between the first address participle set and the current candidate address participle set according to each address participle included in the second intersection and the address participles included in the union set includes:

acquiring address word segmentation categories corresponding to the address word segmentations contained in the second intersection, and acquiring first word segmentation weights of the address word segmentations contained in the second intersection based on the address word segmentation categories;

acquiring address word segmentation categories corresponding to the address word segmentations contained in the union set, and acquiring second word segmentation weights of the address word segmentations contained in the union set based on the address word segmentation categories;

and obtaining the similarity of the first address participle set and the current candidate address participle set according to the address participles contained in the second intersection set, the address participles contained in the union set, the first participle weight and the second participle weight.

In one embodiment, before obtaining the second address participle set from the pre-constructed address database, the method includes:

obtaining sample address data from a plurality of databases;

carrying out data preprocessing on the sample address data, and carrying out word segmentation on the sample address data after the data preprocessing to form a sample address word segmentation set;

and constructing an address database by using the sample address participle set.
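The database construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in the application: the `preprocess` cleaning rule, the comma-based `segment` helper, and the dictionary layout of `build_address_database` are all hypothetical, since the text names neither the preprocessing operations nor the word segmenter.

```python
from __future__ import annotations


def preprocess(address: str) -> str:
    """Hypothetical cleaning step: collapse runs of whitespace."""
    return " ".join(address.split())


def segment(address: str) -> frozenset[str]:
    """Stand-in segmenter (assumption): address elements are comma-separated."""
    return frozenset(p.strip() for p in address.split(",") if p.strip())


def build_address_database(sources: list[list[str]]) -> dict[frozenset[str], str]:
    """Map each sample address participle set to its cleaned sample address."""
    database: dict[frozenset[str], str] = {}
    for source in sources:              # one list of raw addresses per database
        for raw in source:
            cleaned = preprocess(raw)
            if cleaned:                 # skip records that are empty after cleaning
                database[segment(cleaned)] = cleaned
    return database


sources = [
    ["Province A,  City B, District C, Street D, Community E"],
    ["Province A, City B, District C, Street M, Community N", "   "],
]
address_database = build_address_database(sources)
```

Keying the dictionary by the participle set means that two source records that segment to the same elements collapse into one entry, which is one simple way to deduplicate across databases.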

In one embodiment, the method further comprises:

responding to an updating operation aiming at the address database, and acquiring updated sample address data and a sample address word segmentation set corresponding to the updated sample address data;

updating the address database by using a sample address word segmentation set corresponding to the updated sample address data to obtain an updated address database;

wherein acquiring a second address word segmentation set from a pre-constructed address database includes:

acquiring a second address participle set from the updated address database.

In a second aspect, the present application further provides a dynamic place name address matching apparatus, including:

the address to be matched acquisition module is used for acquiring an address to be matched and a first address word segmentation set corresponding to the address to be matched;

the initial address set module is used for acquiring a second address participle set from a pre-constructed address database, and acquiring a second address participle set matched with the first address participle set from the second address participle set to serve as an initial address participle set;

the candidate address set module is used for acquiring the matching degree of the first address participle set and the initial address participle set, and if the matching degree meets a preset condition, taking the initial address participle set as a candidate address participle set;

and the target address set module is used for acquiring a target address word segmentation set from the candidate address word segmentation set according to the similarity between the first address word segmentation set and the candidate address word segmentation set, and taking a target address corresponding to the target address word segmentation set as an address matching result of the address to be matched.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor, and the processor implements the steps of the method described above when executing the computer program.

In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method described above.

In the dynamic place name address matching method and apparatus, computer device, and storage medium described above, an address to be matched and a first address participle set corresponding to the address to be matched are acquired; second address participle sets are acquired from a pre-constructed address database, and each second address participle set that matches the first address participle set is taken as an initial address participle set; the matching degree between the first address participle set and each initial address participle set is acquired, and, if the matching degree meets a preset condition, that initial address participle set is taken as a candidate address participle set; and a target address participle set is acquired from the candidate address participle sets according to the similarity between the first address participle set and each candidate address participle set, and the target address corresponding to the target address participle set is taken as the address matching result of the address to be matched. Compared with the prior art, the present application first performs preliminary matching between the first address participle set and the second address participle sets, takes the second address participle sets whose matching degree meets the preset condition as candidate address participle sets, and then obtains the address matching result according to the similarity between the first address participle set and the candidate address participle sets. The addresses in the address database thus undergo coarse screening for fast positioning followed by fine screening for exact matching, which improves both the precision and the efficiency of address matching.

Drawings

FIG. 1 is a diagram of an application environment for a dynamic place name address matching method in one embodiment;

FIG. 2 is a flow diagram of a dynamic place name address matching method in one embodiment;

FIG. 3 is a flow diagram of a dynamic place name address matching method in another embodiment;

FIG. 4 is a flow diagram of a dynamic place name address matching method in yet another embodiment;

FIG. 5 is a flow diagram of a dynamic place name address matching method in a further embodiment;

FIG. 6 is a schematic flow diagram of the construction of a dynamic place name address database in one embodiment;

FIG. 7 is a flow diagram of the address matching design in one embodiment;

FIG. 8 is a schematic flow diagram of coarse-screening fast positioning in one embodiment;

FIG. 9 is a schematic flow diagram of fine-screening exact matching in one embodiment;

FIG. 10 is a block diagram of an embodiment of a dynamic place name address matching apparatus;

FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.

The dynamic place name address matching method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or other network server. The data storage system may store a pre-constructed address database. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.

In one embodiment, as shown in fig. 2, a dynamic place name address matching method is provided, which is described by taking the method as an example applied to the server 104 in fig. 1, and includes the following steps:

step S202, an address to be matched and a first address word segmentation set corresponding to the address to be matched are obtained.

The address to be matched may be any address that needs to be matched, and the first address word segmentation set may be the set formed by the word segmentation elements obtained after word segmentation is performed on the address to be matched. For example, the address to be matched may be Province A, City B, District C, Street D, Community E.

Specifically, an address to be matched and a set composed of word segmentation elements of the address to be matched are obtained.
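As a minimal sketch of step S202, the snippet below turns an address string into a first address participle set. The `segment_address` helper and its comma-based splitting are assumptions made for illustration; a real system would use a Chinese word segmenter, which the text does not name.

```python
from __future__ import annotations


def segment_address(address: str) -> frozenset[str]:
    """Split an address into address participles (word segmentation elements).

    Assumption: address elements are comma-separated. This stands in for
    a real word segmenter, which the description does not specify.
    """
    parts = (part.strip() for part in address.split(","))
    return frozenset(part for part in parts if part)


address_to_match = "Province A, City B, District C, Street D, Community E"
first_set = segment_address(address_to_match)
```

A `frozenset` is used so that the order of address elements does not matter in the later intersection and union operations.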

Step S204, a second address participle set is obtained from a pre-constructed address database, and a second address participle set matched with the first address participle set is obtained from the second address participle set to serve as an initial address participle set.

The address database may be constructed from address data drawn from a plurality of databases. A second address participle set may be the set formed by the word segmentation elements of a second address after word segmentation is performed on it, where a second address may be any address in the address database. An initial address participle set is a second address participle set that matches the first address participle set; that is, if a second address participle set matches the first address participle set, that second address participle set is an initial address participle set. For example, if the first address participle set includes Province A, City B, District C, Street D, and Community E, and a second address participle set includes Province A, City B, District C, Street M, and Community N, the two sets may be considered to match, and the second address participle set containing Province A, City B, District C, Street M, and Community N is taken as an initial address participle set. It should be noted that there may be one or more second address participle sets, and likewise one or more initial address participle sets.

Specifically, a plurality of second address word segmentation sets may be obtained from a pre-constructed address database; and acquiring a second address participle set matched with the first address participle set from a plurality of second address participle sets, and taking the second address participle set as an initial address participle set.

Step S206, obtaining the matching degree of the first address participle set and the initial address participle set, and if the matching degree meets a preset condition, taking the initial address participle set as a candidate address participle set.

The matching degree may characterize the matching of the first address participle set and the initial address participle set. The preset condition refers to a condition related to the degree of matching, and for example, the satisfaction of the preset condition may be the satisfaction of a threshold. The preset condition may be set according to actual conditions, and is not limited herein. The candidate address word segmentation set is an initial address word segmentation set corresponding to the matching degree which meets a preset condition; namely, if the matching degree of one of the initial address participle sets and the first address participle set meets the preset condition, the initial address participle set is used as a candidate address participle set. It is worth mentioning that the set of candidate address tokens may be one or more sets of address tokens.

Specifically, the matching degree between the first address participle set and the initial address participle set may be obtained, and if the matching degree meets the preset condition, the initial address participle set corresponding to the matching degree is used as a candidate address participle set.

Step S208, according to the similarity between the first address participle set and the candidate address participle set, a target address participle set is obtained from the candidate address participle set, and a target address corresponding to the target address participle set is used as an address matching result of the address to be matched.

The similarity characterizes how similar the first address participle set and a candidate address participle set are. The similarity between the two sets can be obtained through a similarity calculation method, such as a Euclidean distance, Manhattan distance, Minkowski distance, or Jaccard distance calculation method. The target address participle set may be the candidate address participle set having the highest similarity to the first address participle set. The target address is the complete address obtained by combining all address elements in the target address participle set.

Specifically, according to the similarity between the first address participle set and each candidate address participle set, the candidate address participle set with the highest similarity to the first address participle set can be obtained from the multiple candidate address participle sets and used as the target address participle set. The target address corresponding to the target address participle set is then used as the address matching result of the address to be matched. That is, the address elements in the target address participle set are combined into an address, and this address is the one that matches the address to be matched.

In this embodiment, an address to be matched and a first address participle set corresponding to the address to be matched are acquired; second address participle sets are acquired from a pre-constructed address database, and each second address participle set that matches the first address participle set is taken as an initial address participle set; the matching degree between the first address participle set and each initial address participle set is acquired, and, if the matching degree meets a preset condition, that initial address participle set is taken as a candidate address participle set; and a target address participle set is acquired from the candidate address participle sets according to the similarity between the first address participle set and each candidate address participle set, and the target address corresponding to the target address participle set is taken as the address matching result of the address to be matched. Compared with the prior art, this embodiment first performs preliminary matching between the first address participle set and the second address participle sets, takes the second address participle sets whose matching degree meets the preset condition as candidate address participle sets, and then obtains the address matching result according to the similarity between the first address participle set and the candidate address participle sets. The addresses in the address database thus undergo coarse screening for fast positioning followed by fine screening for exact matching, which improves both the precision and the efficiency of address matching.

In one embodiment, as shown in fig. 3, acquiring a second address participle set matching the first address participle set as an initial address participle set includes:

Step S302, acquiring a current second address participle set and the second address participles contained in the current second address participle set;

Step S304, acquiring the first address participles contained in the first address participle set;

Step S306, if an address participle identical to a first address participle exists among the second address participles, taking the current second address participle set as an initial address participle set.

The current second address participle set may be any one of the second address participle sets, and a second address participle may be an address participle in the current second address participle set, that is, an element of that set. A first address participle may be an element of the first address participle set. For example, the first address participle set may comprise Province A, City B, District C, Street D, and Community E, and a first address participle may be Province A, City B, District C, Street D, or Community E.

Specifically, any one second address participle set and the second address participles it contains may be obtained. The first address participles contained in the first address participle set are then obtained and compared with the second address participles to determine whether any second address participle is identical to a first address participle. If so, that second address participle set is taken as an initial address participle set.

In this embodiment, by obtaining the first address participle and the second address participle in any one second address participle set, determining whether the first address participle and the second address participle share a common word, and using the second address participle set having the common word as the initial address participle set, the amount of data calculation can be reduced, and the address matching efficiency can be improved.
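The coarse screening of steps S302 to S306 can be sketched as a shared-element test between sets. The function name `select_initial_sets` and the sample data are hypothetical; only the rule "keep a second set if it shares at least one participle with the first set" comes from the text.

```python
from __future__ import annotations


def select_initial_sets(
    first_set: set[str], second_sets: list[set[str]]
) -> list[set[str]]:
    """Keep every second address participle set that shares at least one
    address participle with the first address participle set."""
    return [s for s in second_sets if not first_set.isdisjoint(s)]


first_set = {"Province A", "City B", "District C", "Street D", "Community E"}
second_sets = [
    {"Province A", "City B", "District C", "Street M", "Community N"},
    {"Province X", "City Y", "District Z"},
]
initial_sets = select_initial_sets(first_set, second_sets)
```

Here only the first entry of `second_sets` survives, because the second entry has no participle in common with `first_set`. `isdisjoint` can stop at the first shared element, which suits a cheap pre-filter.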

In one embodiment, as shown in fig. 4, obtaining the matching degree between the first address participle set and the initial address participle set, and if the matching degree satisfies a preset condition, taking the initial address participle set as a candidate address participle set, includes:

Step S402, acquiring a current initial address participle set and a first intersection of the current initial address participle set and the first address participle set;

Step S404, acquiring the number of address participles contained in the first intersection;

Step S406, if the number of participles is greater than a preset number threshold, taking the current initial address participle set as a candidate address participle set.

The current initial address participle set can be any one of the initial address participle sets. The first intersection is the intersection of the initial address participle set and the first address participle set, where the intersection of two sets A and B is the set of all elements that belong to both A and B. The number of participles may be the number of elements in the first intersection. The number threshold may be a threshold related to the number of first address participles; it may be set according to the actual situation and is not specifically limited herein.

Specifically, any one initial address participle set and the first intersection of that initial address participle set and the first address participle set may be obtained. The number of address participles contained in the first intersection is then obtained, and if this number is greater than the preset number threshold, the initial address participle set is taken as a candidate address participle set. For example, suppose the first address participle set includes Province A, City B, District C, Street D, and Community E, and an initial address participle set includes Province A, City B, District C, Street M, and Community N. The first intersection then includes Province A, City B, and District C, so the number of participles is 3. If the preset number threshold is 2, the initial address participle set may be used as a candidate address participle set.

In this embodiment, the candidate address participle set is obtained by using the participle number in the intersection of the first address participle set and the initial address participle set as the matching degree and using the number threshold as the judgment condition, so that the calculation amount of similarity calculation can be reduced, and the address matching efficiency can be improved.
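Steps S402 to S406 can be sketched as an intersection-size filter. The function name `select_candidate_sets` and the second sample set are hypothetical; the first sample set and the threshold of 2 follow the example in the text.

```python
from __future__ import annotations


def select_candidate_sets(
    first_set: set[str], initial_sets: list[set[str]], count_threshold: int
) -> list[set[str]]:
    """Keep initial sets whose intersection with the first set contains
    strictly more than count_threshold address participles."""
    candidates = []
    for initial in initial_sets:
        first_intersection = first_set & initial
        if len(first_intersection) > count_threshold:
            candidates.append(initial)
    return candidates


first_set = {"Province A", "City B", "District C", "Street D", "Community E"}
initial_sets = [
    {"Province A", "City B", "District C", "Street M", "Community N"},  # shares 3
    {"Province A", "City P", "District Q"},                             # shares 1
]
candidate_sets = select_candidate_sets(first_set, initial_sets, count_threshold=2)
```

With a threshold of 2, the first initial set (3 shared participles) is kept and the second (1 shared participle) is dropped before any similarity is computed.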

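In one embodiment, obtaining the similarity between the first address participle set and the candidate address participle set includes:

acquiring a current candidate address participle set and a second intersection of the current candidate address participle set and the first address participle set;

acquiring a union of the current candidate address participle set and the first address participle set;

and obtaining the similarity between the first address participle set and the current candidate address participle set according to the address participles contained in the second intersection and the address participles contained in the union.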

The current candidate address participle set can be any one of the candidate address participle sets. The second intersection is the intersection of the candidate address participle set and the first address participle set. The union of two given sets A and B is the set formed by merging all of their elements.

Specifically, any one of the candidate address participle sets and the second intersection of that candidate address participle set and the first address participle set may be obtained, together with the union of the candidate address participle set and the first address participle set. The similarity between the candidate address participle set and the first address participle set can then be calculated from the address participles contained in the second intersection and the address participles contained in the union. For example, if the first address participle set includes Province A, City B, District C, Street D, and Community E, and the candidate address participle set includes Province A, City B, District C, Street M, and Community N, then the second intersection includes Province A, City B, and District C, and the union includes Province A, City B, District C, Street D, Community E, Street M, and Community N. The similarity between the candidate address participle set and the first address participle set is calculated from this second intersection and union.

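Acquiring a target address participle set from the candidate address participle set includes:

acquiring the similarity between each candidate address participle set and the first address participle set;

and taking the candidate address participle set with the maximum similarity as the target address participle set.

The similarity may be a specific numerical value obtained by the similarity calculation.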

Specifically, the similarity between each candidate address participle set and the first address participle set may be obtained, that is, a specific numerical value obtained by calculating the similarity between each candidate address participle set and the first address participle set is obtained. The obtained candidate address word segmentation set with the maximum similarity value can be used as a target address word segmentation set.

In this embodiment, the accuracy of address matching can be improved by obtaining the second intersection and union of the first address participle set and the candidate address participle set, calculating the similarity according to the second intersection and union, and using the candidate address participle set with the maximum similarity as the target address participle set.
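A minimal sketch of this fine screening follows, assuming the plain (unweighted) Jaccard ratio of intersection size to union size as the similarity. The text lists Jaccard among several possible calculations without fixing one for this embodiment, so that choice, the function names, and the second candidate set are illustrative assumptions.

```python
from __future__ import annotations


def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Ratio of the intersection size to the union size of two sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pick_target_set(first_set: set[str], candidate_sets: list[set[str]]) -> set[str]:
    """Return the candidate set with the maximum similarity to first_set.

    Raises ValueError if candidate_sets is empty.
    """
    return max(candidate_sets, key=lambda c: jaccard_similarity(first_set, c))


first_set = {"Province A", "City B", "District C", "Street D", "Community E"}
candidate_sets = [
    {"Province A", "City B", "District C", "Street M", "Community N"},  # 3 of 7
    {"Province A", "City B", "District C", "Street D", "Community N"},  # 4 of 6
]
target_set = pick_target_set(first_set, candidate_sets)
```

For the example in the text, the intersection has 3 elements and the union has 7, giving a similarity of 3/7. The second, hypothetical candidate shares 4 of 6 elements and is therefore selected as the target set.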

In one embodiment, as shown in fig. 5, obtaining the similarity between the first address participle set and the current candidate address participle set according to the address participles included in the second intersection and the address participles included in the union set includes:

Step S502, acquiring the address participle categories corresponding to the address participles contained in the second intersection, and acquiring first participle weights of the address participles contained in the second intersection based on the address participle categories;

Step S504, acquiring the address participle categories corresponding to the address participles contained in the union, and acquiring second participle weights of the address participles contained in the union based on the address participle categories;

Step S506, obtaining the similarity between the first address participle set and the current candidate address participle set according to the address participles contained in the second intersection, the address participles contained in the union, the first participle weights, and the second participle weights.

The address participle category may be the level of an address participle within the address hierarchy. For example, if the first address participle set includes Province A, City B, District C, Street D, and Community E, then Province A may be the first level, City B the second level, District C the third level, Street D the fourth level, and Community E the fifth level. The first participle weight may be the level weight corresponding to each address participle in the second intersection. The second participle weight may be the level weight corresponding to each address participle in the union.

Specifically, the address participle category corresponding to each address participle included in the second intersection may be obtained, and the weight corresponding to that category is taken as the weight of the address participle in the second intersection. Likewise, the address participle category corresponding to each address participle included in the union may be obtained, and the weight corresponding to that category is taken as the weight of the address participle in the union. A specific numerical value of the similarity between the first address participle set and the current candidate address participle set is then calculated according to the address participles included in the second intersection, the address participles included in the union, the first participle weights, and the second participle weights.

In this embodiment, the first participle weight of each address participle included in the second intersection and the second participle weight of each address participle included in the union are obtained, and the similarity between the first address participle set and any one candidate address participle set is calculated from the address participles included in the second intersection, the address participles included in the union, the first participle weights, and the second participle weights, so that the accuracy of address matching can be further improved.
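As an illustrative sketch only, and not the claimed implementation, steps S502 to S506 can be expressed in Python as follows. The category names, the weight values, and the function name weighted_similarity are assumptions introduced for illustration; the similarity is taken here as the total weight of the second intersection divided by the total weight of the union.

```python
from typing import Dict, Set

# Hypothetical level weights per address participle category (assumed values).
LEVEL_WEIGHTS: Dict[str, float] = {
    "province": 1.0,
    "city": 2.0,
    "district": 3.0,
    "street": 4.0,
}


def weighted_similarity(
    first: Set[str],
    candidate: Set[str],
    category_of: Dict[str, str],
    weight_of: Dict[str, float] = LEVEL_WEIGHTS,
) -> float:
    """Weight of the intersection divided by weight of the union."""
    second_intersection = first & candidate
    union = first | candidate
    # First participle weights: one weight per participle in the intersection.
    w_intersection = sum(weight_of[category_of[t]] for t in second_intersection)
    # Second participle weights: one weight per participle in the union.
    w_union = sum(weight_of[category_of[t]] for t in union)
    return w_intersection / w_union if w_union else 0.0


# Usage with placeholder participles.
categories = {
    "province A": "province",
    "city B": "city",
    "district C": "district",
    "street D": "street",
    "street X": "street",
}
first_set = {"province A", "city B", "district C", "street D"}
candidate_set = {"province A", "city B", "district C", "street X"}
score = weighted_similarity(first_set, candidate_set, categories)
```

In this usage the shared participles carry a total weight of 6 and the union a total weight of 14, so a mismatch at the street level lowers the score more than a mismatch at the province level would under the assumed weights.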

In one embodiment, the similarity between the first address participle set and any one candidate address participle set is calculated by using an improved Jaccard distance calculation method, where A is the first address participle set, B is any one candidate address participle set, C is the intersection of the A set and the B set, W_i is the weight set corresponding to the category to which each address participle (address element) in the set C belongs, W_n is the weight set corresponding to the category to which each address participle (address element) in the set A belongs, and W_m is the weight set corresponding to the category to which each address participle (address element) in the set B belongs.

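In one embodiment, before obtaining the second address participle set from the pre-constructed address database, the method includes:

obtaining sample address data from a plurality of databases;

performing data preprocessing on the sample address data, and performing word segmentation on the preprocessed sample address data to form a sample address participle set;

and constructing the address database by using the sample address participle set.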

The databases may include existing business data and POI (Point of Interest) data acquired by a web crawler, and the sample address data may be multi-source address data. The sample address participle set may be a standard address participle set comprising a plurality of address participles.

Specifically, sample address data may be obtained from a plurality of databases, data preprocessing may be performed on the sample address data, and word segmentation may be performed on the preprocessed sample address data to form a sample address participle set. For example, the coordinate systems of the sample address data may be unified, and the sample address data may be normalized, where the normalization includes address element supplementation and the like. An address database is then constructed by using the sample address participle set, and an address index is constructed for the address database.
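The construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input is assumed to be already segmented into address participles, the preprocessing is reduced to trimming whitespace and dropping duplicate records, and the function name build_address_database is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_address_database(
    sample_addresses: Iterable[List[str]],
) -> Tuple[List[Set[str]], Dict[str, Set[int]]]:
    """Return (records, index): participle sets plus a participle-to-record index."""
    records: List[Set[str]] = []
    index: Dict[str, Set[int]] = defaultdict(set)
    seen = set()
    for tokens in sample_addresses:
        # Simplified preprocessing: trim whitespace and drop empty participles.
        token_set = frozenset(t.strip() for t in tokens if t.strip())
        if not token_set or token_set in seen:
            continue  # skip empty or duplicate records from different sources
        seen.add(token_set)
        record_id = len(records)
        records.append(set(token_set))
        # Address index: each participle points to the records containing it.
        for token in token_set:
            index[token].add(record_id)
    return records, dict(index)


records, index = build_address_database([
    ["province A", "city B"],
    ["province A", "city B"],        # duplicate coming from a second source
    ["province A", "city C", " "],   # contains an empty participle
])
```

Keeping the index alongside the records means a later lookup only touches records that share at least one participle with the query.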

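In one embodiment, the method further comprises:

responding to an update operation for the address database, and acquiring updated sample address data and a sample address participle set corresponding to the updated sample address data;

updating the address database by using the sample address participle set corresponding to the updated sample address data to obtain an updated address database;

and acquiring the second address participle set from the updated address database.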

Specifically, update operations for the address database may be responded to from time to time. Sample address data collected from the network or entered manually may be acquired as the updated sample address data, data preprocessing is performed on the updated sample address data, and word segmentation is performed on the preprocessed data to form an updated sample address participle set. The address database is updated by using the sample address participle set corresponding to the updated sample address data to obtain the updated address database. An updated second address participle set may then be obtained from the updated address database.

In a specific embodiment, a dynamic place name address matching method is provided, which includes the following steps:

constructing a dynamic place name address database; constructing a personalized place name address word segmentation algorithm; adaptively optimizing the place name address word segmentation; constructing a matching algorithm based on inverted indexes and address element weighting; and dividing the place name address matching process into two processes, namely coarse screening for fast positioning and fine screening for exact matching.

In one embodiment, as shown in fig. 6, constructing a dynamic place name address database includes:

integrating multi-source heterogeneous place name address data (including existing business data, POI data acquired by a web crawler method and the like) to construct a basic place name address database;

when the data acquired by the network has the condition of inconsistent coordinate systems, carrying out unification treatment on the multi-source data coordinate systems before the database is built;

cleaning original data, namely standardizing a place name address, including word segmentation of the place name address, supplement and perfection of place name address elements and the like to obtain a basic place name address library, and constructing a place name address index based on a full-text retrieval idea;

and updating the basic place name address library irregularly through network acquisition, manual input and other modes.

In one embodiment, the method for constructing the personalized place name address word segmentation algorithm comprises the following steps:

based on the regional characteristics of place name address descriptions, and building on research into the structure of the basic place name address model, the description characteristics of regional place name addresses are studied further and a personalized word segmentation dictionary is built, so that a Chinese place name address model with regional characteristics is constructed and the robustness of word segmentation for irregular place name address expressions is improved.

In one embodiment, the adaptive optimization of the place name address word segmentation comprises the following steps: to handle possible synonyms, near-synonyms, wrongly written characters, unknown words, and the like, pinyin matching support is added to the matching algorithm to deal with ambiguity caused by confusing flat-tongue and retroflex (curled-tongue) initials, front and back nasal finals, and the like; unknown words are automatically identified based on address element word segmentation; and the standard place name address word segmentation library is further updated and improved, so that the word segmentation algorithm is continuously and adaptively optimized and the final matching accuracy is improved.
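The pinyin matching support can be illustrated with the following Python sketch. It assumes that the address text has already been converted to toneless pinyin syllables by some other means, and the confusion pairs and function names are assumptions chosen for illustration: retroflex initials (zh, ch, sh) are folded onto flat initials (z, c, s), and back nasal finals (ang, eng, ing) onto front nasal finals (an, en, in).

```python
from typing import List

# Assumed confusion pairs; a real system may use a different or larger table.
_INITIALS = (("zh", "z"), ("ch", "c"), ("sh", "s"))
_FINALS = (("ang", "an"), ("eng", "en"), ("ing", "in"))


def normalize_syllable(syllable: str) -> str:
    """Fold a toneless pinyin syllable onto a canonical, confusion-free form."""
    s = syllable.lower()
    for retroflex, flat in _INITIALS:
        if s.startswith(retroflex):
            s = flat + s[len(retroflex):]
            break
    for back, front in _FINALS:
        if s.endswith(back):
            s = s[: -len(back)] + front
            break
    return s


def pinyin_equivalent(a: List[str], b: List[str]) -> bool:
    """True if two syllable sequences differ only by the folded confusions."""
    return len(a) == len(b) and all(
        normalize_syllable(x) == normalize_syllable(y) for x, y in zip(a, b)
    )
```

For example, pinyin_equivalent(["zhang", "shan"], ["zan", "san"]) holds, while sequences that differ in any other way are still treated as different.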

In one embodiment, the method for constructing the matching algorithm based on the inverted index and the address element weighting comprises the following steps:

on the basis of carrying out word segmentation and index construction on a basic place name address library, an inverted index method of full-text retrieval is introduced into a place name address matching algorithm for index construction, so that the matching precision and efficiency are improved;

by deep analysis and splitting of the place name address elements, different weights are given to the place name address elements of different levels in a matching algorithm, and the accuracy of place name address matching is improved.

In one embodiment, the address matching design flow is shown in fig. 7. After the place name address to be matched is input, it is segmented and normalized according to a user-defined word segmentation dictionary and a personalized address word segmentation model. The place name addresses are then roughly screened by combining a simple common-word counting method with the standard place name address database, so that a preselected place name address set is quickly obtained. Finally, the place name addresses are precisely screened by combining the hierarchical weights of the address elements with an improved character string matching algorithm: the matching rate between the address to be matched and each preselected address is calculated and compared, and the final place name address matching result is obtained.

In one embodiment, the place name address matching process is divided into two processes, coarse screening for fast positioning and fine screening for exact matching, which include:

as shown in fig. 8, the coarse screening for fast positioning includes:

the rough address screening compares the address participle set to be matched with each address participle set in the address library and judges whether the two sets intersect (that is, whether they share any address participle). If the number of shared participles is larger than a certain threshold (for example, more than 3 common participles), that record of the address library is added to the preselected address set; otherwise the address is treated as not matching by default and is skipped in the subsequent judgment. Through the rough screening process, thousands of address records are selected from the hundreds of thousands to millions of addresses in the address library for the subsequent matching rate calculation.
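The rough screening can be sketched in Python as below. The sketch assumes that an inverted index from address participle to record identifiers is already available; the function name coarse_screen, the parameter min_common, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def coarse_screen(
    query: Set[str],
    inverted_index: Dict[str, Set[int]],
    min_common: int = 3,
) -> List[int]:
    """Return ids of records sharing more than `min_common` participles with the query."""
    shared: Counter = Counter()
    for token in query:
        # Only records that contain this participle are touched at all.
        for record_id in inverted_index.get(token, ()):
            shared[record_id] += 1
    return sorted(rid for rid, count in shared.items() if count > min_common)


# Record 0 holds A, B, C, D, E; record 1 holds A, B, C, X; record 2 holds Y.
inverted_index = {
    "A": {0, 1}, "B": {0, 1}, "C": {0, 1},
    "D": {0}, "E": {0}, "X": {1}, "Y": {2},
}
preselected = coarse_screen({"A", "B", "C", "D", "E"}, inverted_index)
```

Counting through the index avoids intersecting the query with every record in the library, which is what keeps this stage fast on libraries of the size mentioned above.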

As shown in fig. 9, the fine screening of exact matches includes:

the accurate screening of the address refers to screening out the best matching address by calculating and comparing the matching rate of the address elements in the address data. The matching rate between addresses is defined as the degree of matching between the address to be matched and an address in the standard address library, and is measured by the similarity value between the two address character strings. In the accurate screening, the address similarity is calculated between the address element set obtained by segmenting the address to be matched and the address element set of each preselected address in the preselected address set obtained in the rough screening stage, which yields a similarity value (matching rate) for each pair of addresses; the similarity values are then compared, and the address with the maximum similarity value (matching rate) is selected as the final place name address matching result.
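The selection step of the accurate screening can be sketched as follows. For brevity the sketch uses the plain, unweighted Jaccard coefficient as a stand-in similarity; this is an assumption for illustration, and any other similarity function with the same signature can be passed in. The names jaccard and fine_screen are hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Plain Jaccard coefficient, used here only as a placeholder similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fine_screen(
    query: Set[str],
    preselected: List[Set[str]],
    similarity: Callable[[Set[str], Set[str]], float] = jaccard,
) -> Optional[Tuple[int, float]]:
    """Return (position, similarity) of the best preselected address, or None."""
    best: Optional[Tuple[int, float]] = None
    for position, candidate in enumerate(preselected):
        score = similarity(query, candidate)
        if best is None or score > best[1]:
            best = (position, score)
    return best


best = fine_screen(
    {"A", "B", "C", "D"},
    [{"A", "B", "C", "X"}, {"A", "B", "C", "D", "E"}],
)
```

Here the second preselected address wins because it shares four of five distinct participles with the query, against three of five for the first.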

In one embodiment, before place name address matching is performed, address word segmentation preprocessing needs to be performed on the place name address to be matched. Similar to the address word segmentation of the basic place name address library, the address to be matched is segmented with the jieba Chinese word segmentation algorithm in combination with the constructed user-defined dictionary, thereby splitting the input address into address elements. The address word segmentation result is then further processed based on the place name address model by deduplication, supplementation, normalization, and the like; that is, repeated words in the segmentation result are filtered out and diversified address expression forms are normalized, finally yielding a normalized place name address word segmentation result. For example, for the address description "the area of the offshoot is the area of naught road nine", the second "offshoot" is supplemented as "the area of offshoot", one of the two repeated "offshoot areas" in the address is removed, and the word segmentation list is processed as "the area of offshoot", "the area of naught road", "9".
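The deduplication and normalization of a segmentation result can be sketched as follows. The sketch assumes the address has already been segmented, covers only two of the processing steps (removing repeated participles and rewriting Chinese numerals as Arabic digits, digit by digit), and leaves out the supplementation of incomplete address elements. The function name normalize_tokens and the placeholder participles are hypothetical.

```python
from typing import List

# Digit-by-digit mapping only; compound numerals such as "十" are not handled.
_CN_DIGITS = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}


def normalize_tokens(tokens: List[str]) -> List[str]:
    """Remove repeated participles (keeping order) and normalize numerals."""
    result: List[str] = []
    seen = set()
    for token in tokens:
        token = token.strip()
        # Rewrite a participle made up only of Chinese digits as Arabic digits.
        if token and all(ch in _CN_DIGITS for ch in token):
            token = "".join(_CN_DIGITS[ch] for ch in token)
        if token and token not in seen:
            seen.add(token)
            result.append(token)
    return result


normalized = normalize_tokens(["district C", "district C", "road D", "九"])
```

The order of the surviving participles is preserved, so the normalized list still reads from the coarsest address element to the finest.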

In one embodiment, based on place name address matching theory that relies on character similarity, and in combination with the hierarchical weights of the address elements, the matching similarity value between the address to be matched and a preselected address in the exact matching process is calculated by using an improved Jaccard distance calculation method.

Let set $A = (a_1, a_2, \ldots, a_n)$ and set $B = (b_1, b_2, \ldots, b_m)$ be respectively the participle set of the address to be matched and the participle set of a preselected address, and let $C = (c_1, c_2, \ldots, c_i)$ be the intersection of the A set and the B set. The matching similarity value between A and B is calculated from the following weight sets: W_i is the weight set corresponding to the category to which each word (address element) in the set C belongs, W_n is the weight set corresponding to the category to which each word (address element) in the set A belongs, and W_m is the weight set corresponding to the category to which each word (address element) in the set B belongs.
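The following is a sketch, under stated assumptions, of one weighted form that is consistent with the definitions of W_i, W_n and W_m above and with the ordinary Jaccard coefficient |A ∩ B| / |A ∪ B|. The name Sim, the per-element weight function w, and the summation indices are introduced here for illustration and are not taken from the original text.

```latex
\mathrm{Sim}(A, B) =
  \frac{\sum_{c \in C} w(c)}
       {\sum_{a \in A} w(a) + \sum_{b \in B} w(b) - \sum_{c \in C} w(c)},
\qquad w(c) \in W_i,\; w(a) \in W_n,\; w(b) \in W_m
```

Because C is the intersection of A and B, the denominator equals the total weight of the union of A and B, so when every weight equals 1 the expression reduces to the ordinary Jaccard coefficient.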

It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a dynamic place name address matching device for realizing the dynamic place name address matching method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme described in the method, so the specific limitations in one or more embodiments of the dynamic address matching device provided below can refer to the limitations on the dynamic address matching method in the above description, and details are not described here again.

In one embodiment, as shown in fig. 10, there is provided a dynamic place name address matching apparatus including: an address to be matched obtaining module 610, an initial address set module 620, a candidate address set module 630 and a target address set module 640, wherein:

the to-be-matched address obtaining module 610 is configured to obtain an address to be matched and a first address word segmentation set corresponding to the address to be matched.

The initial address set module 620 is configured to obtain a second address participle set from a pre-constructed address database, and obtain a second address participle set matched with the first address participle set from the second address participle set as an initial address participle set.

The candidate address set module 630 is configured to obtain a matching degree between the first address participle set and the initial address participle set, and if the matching degree meets a preset condition, use the initial address participle set as a candidate address participle set.

The target address set module 640 is configured to obtain a target address participle set from the candidate address participle set according to the similarity between the first address participle set and the candidate address participle set, and use a target address corresponding to the target address participle set as an address matching result of the address to be matched.

In one embodiment, the initial address set module comprises a second address unit, a first address word segmentation unit and an address word segmentation judgment unit.

The second address unit is used for acquiring a current second address participle set and second address participles contained in the current second address participle set; the first address word segmentation unit is used for acquiring first address words contained in the first address word segmentation set; the address participle judging unit is used for taking the current second address participle set as an initial address participle set if the address participle same as the first address participle exists in the second address participle.

In one embodiment, the candidate address set module includes a first intersection unit, a number of participles unit, and a threshold comparison unit.

The first intersection unit is used for acquiring a current initial address participle set and a first intersection of the current initial address participle set and the first address participle set; the word segmentation quantity unit is used for acquiring the word segmentation quantity of the address word segmentation contained in the first intersection; and the threshold comparison unit is used for taking the current initial address word segmentation set as a candidate address word segmentation set if the number of the segmented words is greater than a preset number threshold.

In one embodiment, the target address set module includes a second intersection unit, a union unit, a similarity calculation unit, a similarity acquisition unit, and a target address unit.

The second intersection unit is used for acquiring a current candidate address participle set and a second intersection of the current candidate address participle set and the first address participle set; the union unit is used for acquiring a union of the current candidate address participle set and the first address participle set; the similarity calculation unit is used for obtaining the similarity between the first address participle set and the current candidate address participle set according to each address participle contained in the second intersection and the address participles contained in the union; the similarity obtaining unit is used for obtaining the similarity between each candidate address participle set and the first address participle set; the target address unit is used for taking the candidate address participle set with the maximum similarity as a target address participle set.

In one embodiment, the similarity calculation unit includes a first participle weight unit, a second participle weight unit, and a fusion calculation unit.

The first participle weight unit is used for acquiring the address participle categories corresponding to the address participles contained in the second intersection and obtaining first participle weights of the address participles contained in the second intersection based on the address participle categories; the second participle weight unit is used for acquiring the address participle categories corresponding to the address participles contained in the union and obtaining second participle weights of the address participles contained in the union based on the address participle categories; and the fusion calculation unit is used for obtaining the similarity between the first address participle set and the current candidate address participle set according to the address participles contained in the second intersection, the address participles contained in the union, the first participle weights and the second participle weights.

In one embodiment, the initial address set module comprises a sample address unit, a sample address participle unit and a database construction unit.

The sample address unit is used for acquiring sample address data from a plurality of databases; the sample address word segmentation unit is used for carrying out data preprocessing on the sample address data and carrying out word segmentation on the sample address data after the data preprocessing to form a sample address word segmentation set; the database construction unit is used for constructing an address database by utilizing the sample address participle set.

In one embodiment, the apparatus further comprises an update response unit, an update processing unit and a second address updating unit.

The updating response unit is used for responding to the updating operation aiming at the address database, and acquiring updated sample address data and a sample address word segmentation set corresponding to the updated sample address data; the updating processing unit is used for updating the address database by using the sample address word segmentation set corresponding to the updated sample address data to obtain an updated address database; the second address updating unit is used for acquiring a second address participle set from the updated address database.

The modules in the dynamic place name address matching device can be wholly or partially realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, and can also be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing address database data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a dynamic place name address matching method.

Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, databases, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a Resistive Random Access Memory (ReRAM), a Magnetic Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processors referred to in the embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims

1. A dynamic place name address matching method, the method comprising:

2. The method of claim 1, wherein obtaining a second set of address tokens that matches the first set of address tokens as an initial set of address tokens comprises:

acquiring first address participles contained in the first address participle set;

if the address participles which are the same as the first address participles exist in the second address participles, taking the current second address participle set as the initial address participle set.

3. The method according to claim 2, wherein the obtaining a matching degree between the first address participle set and the initial address participle set, and if the matching degree satisfies a preset condition, taking the address participle set as a candidate address participle set comprises:

acquiring a current initial address participle set and a first intersection of the current initial address participle set and the first address participle set;

and if the word segmentation quantity is larger than a preset quantity threshold value, taking the current initial address word segmentation set as the candidate address word segmentation set.

4. The method of claim 1, wherein the similarity between the first set of address tokens and the set of candidate address tokens is determined according to a similarity between the first set of address tokens and the set of candidate address tokens, comprising:

acquiring a union of the current candidate address word segmentation set and the first address word segmentation set;

obtaining the similarity between the first address participle set and the current candidate address participle set according to each address participle contained in the second intersection and the address participles contained in the union set;

the obtaining of the target address word segmentation set from the candidate address word segmentation set includes:

obtaining the similarity between each candidate address word segmentation set and the first address word segmentation set;

and taking the candidate address word segmentation set with the maximum similarity as the target address word segmentation set.

5. The method according to claim 4, wherein the obtaining the similarity between the first address participle set and the current candidate address participle set according to each address participle included in the second intersection and the address participle included in the union set comprises:

acquiring address word segmentation categories corresponding to the address word segmentations contained in the second intersection, and obtaining first word segmentation weights of the address word segmentations contained in the second intersection based on the address word segmentation categories;

acquiring address word segmentation categories corresponding to the address word segmentations contained in the union set, and obtaining second word segmentation weights of the address word segmentations contained in the union set based on the address word segmentation categories;

and obtaining the similarity between the first address participle set and the current candidate address participle set according to each address participle contained in the second intersection, each address participle contained in the union set, the first participle weight and the second participle weight.

6. The method of claim 1, wherein prior to obtaining the second set of address tokens from the pre-constructed address database, comprising:

obtaining sample address data from a plurality of databases;

carrying out data preprocessing on the sample address data, and carrying out word segmentation on the sample address data after data preprocessing to form a sample address word segmentation set;

and constructing the address database by using the sample address participle set.

7. The method of claim 6, further comprising:

responding to the updating operation aiming at the address database, and acquiring updated sample address data and a sample address word segmentation set corresponding to the updated sample address data;

the obtaining of the second address participle set from the pre-constructed address database includes:

8. A dynamic place name address matching apparatus, the apparatus comprising:

an address to be matched acquisition module, used for acquiring an address to be matched and a first address participle set corresponding to the address to be matched;

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.