CN113515677A - Address matching method and device and computer readable storage medium - Google Patents

Address matching method and device and computer readable storage medium

Info

Publication number
CN113515677A
Authority
CN
China
Prior art keywords
address
target
matched
determining
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110834270.2A
Other languages
Chinese (zh)
Other versions
CN113515677B (en)
Inventor
张强
高恩伟
闫岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202110834270.2A
Publication of CN113515677A
Application granted
Publication of CN113515677B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an address matching method, an address matching device and a computer readable storage medium, wherein the address matching method comprises the following steps: acquiring at least two target addresses matched with addresses to be matched in a standard address set, wherein the standard address set comprises addresses of at least two data sources, and each target address is obtained by matching according to different matching models; determining the confidence of each target address, wherein the higher the number of data sources matched with the target address is, the higher the corresponding confidence is; and determining the target address matched with the address to be matched in all the target addresses according to the confidence degrees of all the target addresses. The invention can improve the accuracy of address matching.

Description

Address matching method and device and computer readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an address matching method and apparatus, and a computer-readable storage medium.
Background
In the field of communications, address matching is often required. For example, after address information such as mobile base station cell addresses, residential community addresses, and school or hospital addresses is collected manually, the collected addresses may be inaccurate, so each collected address must be matched against standard addresses to obtain the corresponding correct standard address. For example, the cell to be matched is "basketball garden", the result obtained by calculating similarity according to the minimum edit distance is "basketball garden", and the correct result should be "basketball garden". Address matching that relies solely on minimum-edit-distance similarity therefore has low accuracy. The present invention addresses at least the following technical problem: how to improve the accuracy of address matching.
Disclosure of Invention
The invention mainly aims to provide an address matching method, an address matching device and a computer readable storage medium, and aims to solve the technical problem of low accuracy of address matching.
In order to achieve the above object, the present invention provides an address matching method, including:
acquiring at least two target addresses matched with addresses to be matched in a standard address set, wherein the standard address set comprises addresses of at least two data sources, and each target address is obtained by matching according to different matching models;
determining the confidence of each target address, wherein the higher the number of data sources matched with the target address is, the higher the corresponding confidence is;
and determining the target address matched with the address to be matched in all the target addresses according to the confidence degrees of all the target addresses.
Optionally, the step of obtaining at least two target addresses matched with the address to be matched in the standard address set includes:
determining a first target address according to the address to be matched, the standard address set and a preset probability transition matrix model, wherein the preset probability transition matrix model is obtained by training a probability transition matrix training model according to an address training set and the standard address set;
determining a second target address according to the address to be matched, the standard address set and a preset residual network fusion model, wherein the preset residual network fusion model comprises an embedding layer, a TextRCNN network, a TextCNN network, a residual layer and a preset activation function, the preset residual network fusion model is obtained by training a residual network fusion training model according to the address training set and the standard address set, and the target addresses are the first target address and the second target address respectively.
Optionally, the step of determining the first target address according to the address to be matched, the standard address set, and a preset probability transition matrix model includes:
acquiring candidate feature words with frequency greater than a preset frequency in the standard address set;
constructing a feature word set according to the candidate feature words;
extracting a feature word sequence corresponding to the address to be matched according to the feature word set, wherein the feature word sequence comprises the candidate feature words and ordinary characters in the address to be matched;
combining the feature word elements in the feature word sequence according to a target combination length and a preset combination order to obtain a feature word substring set of the target combination length;
determining a joint probability corresponding to the feature word substring set according to a preset hidden Markov model and the feature word substring set of the target combination length, wherein the preset hidden Markov model is obtained by training a hidden Markov training model according to the standard address set, the joint probability corresponding to the feature word substring set is obtained according to the feature word transition probabilities in the hidden Markov model, and the preset probability transition matrix model is the preset hidden Markov model;
when the target combination length is smaller than a preset combination length, increasing the target combination length, and returning to the step of combining the feature word elements in the feature word sequence according to the target combination length and the preset combination order to obtain a feature word substring set of the target combination length;
when the target combination length is greater than or equal to the preset combination length, acquiring the feature word substring set with the maximum joint probability;
determining an optimal solution according to the feature word substring set with the maximum joint probability;
determining the optimal solution as the first target address.
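The iterative search in the steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `joint_probability` and `best_substring_set`, the toy table `TRANSITIONS`, the `FLOOR` value for unseen transitions, the reading of "preset combination order" as order of appearance in the sequence, and the choice to let candidates of all evaluated lengths compete directly are all assumed for illustration.

```python
from itertools import combinations

# Hypothetical transition probabilities between feature-word elements
# (assumed values; in the method they come from the trained model).
TRANSITIONS = {
    ("province", "city"): 0.9,
    ("city", "district"): 0.8,
    ("district", "cell"): 0.6,
    ("province", "district"): 0.1,
    ("city", "cell"): 0.2,
    ("province", "cell"): 0.05,
}
FLOOR = 1e-6  # assumed probability for transitions absent from the table


def joint_probability(substring):
    """Multiply the transition probabilities of adjacent elements."""
    prob = 1.0
    for prev, nxt in zip(substring, substring[1:]):
        prob *= TRANSITIONS.get((prev, nxt), FLOOR)
    return prob


def best_substring_set(sequence, target_length, preset_length):
    """Score combinations at each length up to preset_length, return the best."""
    candidates = []
    while True:
        # combinations() keeps elements in their original order of appearance.
        for substring in combinations(sequence, target_length):
            candidates.append((joint_probability(substring), substring))
        if target_length >= preset_length:
            break
        target_length += 1  # increase the target combination length and repeat
    prob, substring = max(candidates)
    return substring, prob


sequence = ["province", "city", "district", "cell"]
print(best_substring_set(sequence, 3, 3))
```

Because each extra element multiplies in another probability no greater than 1, shorter combinations tend to score higher when lengths compete directly; a real system would need a rule for comparing lengths, which this sketch does not attempt to supply.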
Optionally, the step of determining, according to the confidence of each target address, the target address matched with the address to be matched from all the target addresses includes:
determining the matching degree of each target address and the address to be matched;
determining the product of the matching degree and the confidence degree corresponding to each target address;
and determining the target address matched with the address to be matched according to the target address corresponding to the maximum product.
Optionally, after the step of obtaining at least two target addresses matched with the address to be matched in the standard address set, the address matching method further includes:
when there are at least two different target addresses, performing the step of determining a confidence level for each of the target addresses;
and when all the target addresses are the same, determining that the target address is the target address matched with the address to be matched.
Optionally, the step of determining the confidence level of each target address includes:
determining the number of the data sources matched with each target address;
determining the confidence level of the target address according to the quantity.
Optionally, before the step of obtaining at least two target addresses matched with the address to be matched in the standard address set, the address matching method further includes:
acquiring an original address sent by a server;
and carrying out illegal character cleaning, redundant address cleaning, wrongly written character replacement and incomplete address completion on the original address to obtain the address to be matched.
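The four cleaning operations named above might look something like the following sketch. Everything concrete in it is an assumption made for illustration: the function name `clean_address`, the lookup tables `TYPO_FIXES` and `REDUNDANT_PARTS`, the `DEFAULT_PREFIX` used for completion, and the set of characters treated as legal.

```python
import re

# Hypothetical lookup tables; real ones would be curated for the data set.
TYPO_FIXES = {"Gardan": "Garden"}
REDUNDANT_PARTS = ["(near the gate)"]
DEFAULT_PREFIX = "Guizhou Province"  # assumed completion for a missing province


def clean_address(raw):
    # 1. Illegal character cleaning: keep word characters, whitespace,
    #    commas, hyphens and parentheses; drop everything else.
    text = re.sub(r"[^\w\s,()\-]", "", raw)
    # 2. Redundant address cleaning: remove known redundant fragments.
    for part in REDUNDANT_PARTS:
        text = text.replace(part, "")
    # 3. Wrongly written character replacement.
    for wrong, right in TYPO_FIXES.items():
        text = text.replace(wrong, right)
    # Collapse runs of whitespace left behind by the removals.
    text = " ".join(text.split())
    # 4. Incomplete address completion: prepend the assumed province.
    if not text.startswith(DEFAULT_PREFIX):
        text = DEFAULT_PREFIX + " " + text
    return text


print(clean_address("Guiyang  Basketball Gardan### (near the gate)"))
```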
In addition, in order to achieve the above object, the present invention further provides an address matching apparatus, which includes an obtaining module and a determining module, wherein:
the acquisition module is used for acquiring at least two target addresses matched with the addresses to be matched in a standard address set, the standard address set comprises addresses of at least two data sources, and each target address is obtained by matching according to different matching models;
the determining module is configured to determine a confidence level of each target address, where the greater the number of data sources matched with the target address, the higher the corresponding confidence level is, and determine, according to the confidence level of each target address, the target address matched with the address to be matched, in all the target addresses.
In addition, to achieve the above object, the present invention further provides an address matching apparatus, which includes a memory, a processor, and an address matching program stored in the memory and operable on the processor, wherein the address matching program, when executed by the processor, implements the steps of any one of the above address matching methods.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium having an address matching program stored thereon, the address matching program implementing the steps of the address matching method according to any one of the above items when executed by a processor.
The address matching method, the address matching device and the computer-readable storage medium provided by the embodiments of the present invention obtain at least two target addresses in a standard address set that match the address to be matched, determine the confidence of each target address, and determine, among all the target addresses, the target address matched with the address to be matched according to those confidences. The standard address set includes addresses of at least two data sources, each target address is obtained by matching according to a different matching model, and the more data sources a target address matches, the higher its confidence. Because the final target address is selected using the confidences of target addresses produced by different matching models, and a target address that matches more data sources has a higher confidence, the method avoids the low accuracy that results when a single address data source is used and the matching address is obtained simply by computing similarity with the minimum edit distance, so the accuracy of address matching can be effectively improved.
Drawings
FIG. 1 is a schematic diagram of an apparatus in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of an address matching method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of an address matching method according to the present invention;
FIG. 4 is a flowchart illustrating a third embodiment of an address matching method according to the present invention;
FIG. 5 is a flowchart illustrating a fourth embodiment of an address matching method according to the present invention;
FIG. 6 is a functional block diagram of the address matching apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The address matching device related to the embodiment of the invention can be a server, a terminal device or other computer devices.
As shown in fig. 1, the apparatus may include: a processor 1001, such as a CPU, a communication bus 1002, and a memory 1003. The communication bus 1002 is used to implement connection communication among these components. The memory 1003 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 1003 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the device shown in fig. 1 is not intended to be limiting of the device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1003, which is a kind of computer storage medium, may include therein an operating system and an address matching program.
In the apparatus shown in fig. 1, the processor 1001 may be configured to call an address matching program stored in the memory 1003, and perform the following operations:
acquiring at least two target addresses matched with addresses to be matched in a standard address set, wherein the standard address set comprises addresses of at least two data sources, and each target address is obtained by matching according to different matching models;
determining the confidence of each target address, wherein the higher the number of data sources matched with the target address is, the higher the corresponding confidence is;
and determining the target address matched with the address to be matched in all the target addresses according to the confidence degrees of all the target addresses.
Further, the processor 1001 may call an address matching program stored in the memory 1003, and also perform the following operations:
determining a first target address according to the address to be matched, the standard address set and a preset probability transition matrix model, wherein the preset probability transition matrix model is obtained by training a probability transition matrix training model according to an address training set and the standard address set;
determining a second target address according to the address to be matched, the standard address set and a preset residual network fusion model, wherein the preset residual network fusion model comprises an embedding layer, a TextRCNN network, a TextCNN network, a residual layer and a preset activation function, the preset residual network fusion model is obtained by training a residual network fusion training model according to the address training set and the standard address set, and the target addresses are the first target address and the second target address respectively.
Further, the processor 1001 may call an address matching program stored in the memory 1003, and also perform the following operations:
acquiring candidate feature words with frequency greater than a preset frequency in the standard address set;
constructing a feature word set according to the candidate feature words;
extracting a feature word sequence corresponding to the address to be matched according to the feature word set, wherein the feature word sequence comprises the candidate feature words and ordinary characters in the address to be matched;
combining the feature word elements in the feature word sequence according to a target combination length and a preset combination order to obtain a feature word substring set of the target combination length;
determining a joint probability corresponding to the feature word substring set according to a preset hidden Markov model and the feature word substring set of the target combination length, wherein the preset hidden Markov model is obtained by training a hidden Markov training model according to the standard address set, the joint probability corresponding to the feature word substring set is obtained according to the feature word transition probabilities in the hidden Markov model, and the preset probability transition matrix model is the preset hidden Markov model;
when the target combination length is smaller than a preset combination length, increasing the target combination length, and returning to the step of combining the feature word elements in the feature word sequence according to the target combination length and the preset combination order to obtain a feature word substring set of the target combination length;
when the target combination length is greater than or equal to the preset combination length, acquiring the feature word substring set with the maximum joint probability;
determining an optimal solution according to the feature word substring set with the maximum joint probability;
determining the optimal solution as the first target address.
Further, the processor 1001 may call an address matching program stored in the memory 1003, and also perform the following operations:
determining the matching degree of each target address and the address to be matched;
determining the product of the matching degree and the confidence degree corresponding to each target address;
and determining the target address matched with the address to be matched according to the target address corresponding to the maximum product.
Further, the processor 1001 may call an address matching program stored in the memory 1003, and also perform the following operations:
when there are at least two different target addresses, performing the step of determining a confidence level for each of the target addresses;
and when all the target addresses are the same, determining that the target address is the target address matched with the address to be matched.
Further, the processor 1001 may call an address matching program stored in the memory 1003, and also perform the following operations:
determining the number of the data sources matched with each target address;
determining the confidence level of the target address according to the quantity.
Further, the processor 1001 may call an address matching program stored in the memory 1003, and also perform the following operations:
acquiring an original address sent by a server;
and carrying out illegal character cleaning, redundant address cleaning, wrongly written character replacement and incomplete address completion on the original address to obtain the address to be matched.
Referring to fig. 2, a first embodiment of the present invention provides an address matching method, where the address matching method includes:
step S10, at least two target addresses matched with the addresses to be matched in a standard address set are obtained, the standard address set comprises the addresses of at least two data sources, and each target address is obtained by matching according to different matching models;
In this embodiment, the execution subject is an address matching device, which may be a server, a terminal device, or another computer device.
The standard address set is a set of preset standard addresses. In this embodiment, addresses are obtained from two or more data sources as the standard addresses, where a data source is a source that provides standard addresses. The data sources may be map service providers, such as the Gaode map, the Baidu map and the Tencent map. When standard addresses are obtained from map service providers, they may be crawled from the Application Programming Interface (API) of the map corresponding to each map service. Each standard address may include a plurality of tags; for example, each standard address corresponds to five levels of tags: province level, city level, district level, street/community level and cell level. Each standard address may further include tags of other levels, such as "road section", "street number", "village group" and "detailed address", and a standard address may include fewer or more tag levels than in the above example.
Taking as an example the crawling of the tag information of all cell levels in Guizhou province through the Baidu map API, the URL of the Baidu map is http://api.map.baidu.com/place/v2/search? The access parameters include longitude and latitude, a developer access key, a retrieval keyword, an output format and the like. The size of the longitude and latitude grid is adjusted until the five-level tag information of all cells is obtained, which yields the standard addresses corresponding to the Baidu map API. The five-level tag information of all cells can be obtained from the Gaode map API and the other map APIs in a similar manner, yielding the standard addresses corresponding to each map API. The standard addresses corresponding to the different map APIs are combined into the standard address set. In addition, standard addresses may be obtained from other data sources that store address information. In every case, the standard address set includes addresses of at least two data sources.
The address to be matched is the address for which a matching standard address is sought. Depending on the matching requirement, it may be obtained in various ways; for example, addresses are obtained from the magic-box life service platform so that the addresses of that platform can be matched to their corresponding target addresses. A target address is an address matched with the address to be matched; specifically, it is a standard address in the standard address set that matches the address to be matched.
In this embodiment, matching the address to be matched yields at least two target addresses, each obtained according to a different matching model. There are two or more matching models, and each matching model produces its own target address for the address to be matched. The matching models may be various address matching models implemented with machine learning techniques, such as an address matching model based on a deep learning model, an address matching model pre-trained on a point-of-interest knowledge graph, an address matching model based on a probability transition matrix, and an address matching model based on a residual network fusion model. Because different matching models use different address matching mechanisms, at least two target addresses can be obtained when matching against the standard addresses of different data sources.
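The adjustment of the longitude and latitude grid during crawling can be pictured as recursive subdivision. The sketch below makes no network calls and does not reproduce any real map API: `fake_search`, `POINTS`, `PAGE_LIMIT`, `crawl`, and the rule "split a grid cell whenever a query returns the maximum number of records" are illustrative assumptions only.

```python
# Hypothetical cells with (latitude, longitude); stands in for a map service.
POINTS = {
    "A": (26.1, 106.1),
    "B": (26.2, 106.8),
    "C": (26.9, 106.2),
    "D": (26.8, 106.9),
    "E": (26.85, 106.95),
}
PAGE_LIMIT = 2  # assumed maximum number of records returned per query


def fake_search(bounds):
    """Return at most PAGE_LIMIT record names inside bounds (stub, no network)."""
    south, west, north, east = bounds
    hits = [name for name, (lat, lng) in sorted(POINTS.items())
            if south <= lat <= north and west <= lng <= east]
    return hits[:PAGE_LIMIT]


def crawl(bounds, search, page_limit, min_size=1e-4):
    """Collect all records in bounds, shrinking the grid where results are capped."""
    south, west, north, east = bounds
    records = search(bounds)
    too_small = (north - south) <= min_size or (east - west) <= min_size
    # Fewer records than the cap means nothing was truncated in this cell.
    if len(records) < page_limit or too_small:
        return set(records)
    mid_lat = (south + north) / 2
    mid_lng = (west + east) / 2
    quadrants = [
        (south, west, mid_lat, mid_lng),
        (south, mid_lng, mid_lat, east),
        (mid_lat, west, north, mid_lng),
        (mid_lat, mid_lng, north, east),
    ]
    found = set()
    for quadrant in quadrants:
        found |= crawl(quadrant, search, page_limit, min_size)
    return found


print(sorted(crawl((26.0, 106.0, 27.0, 107.0), fake_search, PAGE_LIMIT)))
```

Collecting results in a set means a record that lies exactly on a shared grid boundary is kept only once.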
Step S20, determining the confidence of each target address, wherein the more the number of data sources matched with the target address is, the higher the corresponding confidence is;
The confidence indicates the credibility of a target address. In this embodiment, the more data sources a target address matches, the higher its confidence, that is, the more trustworthy the target address. After at least two target addresses have been obtained based on different matching models, a target address may match only one data source or may match two or more data sources; a target address matches a data source when the standard addresses of that data source include the target address. Because the target addresses are obtained from different matching models, when the matching models are accurate the matching results tend to converge, which shows up as the target address appearing among the standard addresses of more data sources. Therefore, in this embodiment, the more data sources a target address matches, the higher the confidence of that target address.
The confidence of each target address may be determined as follows. First, the standard addresses in the standard address set are classified according to how many data sources each standard address belongs to. For example, with three data sources, a standard address that belongs to the intersection of all three data sources is classified into a first type, a standard address that belongs to the intersection of two data sources is classified into a second type, and a standard address that belongs to only one data source is classified into a third type; with fewer or more data sources, the standard addresses can be classified in a similar manner. Because a target address is a standard address matched with the address to be matched, the type of a target address is the type associated with that standard address. A confidence is associated with each type in advance, so the confidence of the target address follows from its type. Alternatively, the confidence may be obtained directly from the number of data sources matched by the target address: a correspondence between the number of matched data sources and the confidence is set in advance, and once the number of data sources matched by the target address is determined, the confidence of the target address is determined from that number and the correspondence.
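The count-based variant described above can be sketched as a simple lookup. The data below is invented for illustration: the address names, the source labels, the mapping `STANDARD_ADDRESSES`, and the confidence values in `CONFIDENCE_BY_COUNT` are all assumptions; the text only requires that more matched data sources give a higher confidence.

```python
# Hypothetical standard address set: each standard address maps to the
# data sources whose address lists contain it.
STANDARD_ADDRESSES = {
    "Basketball Garden": {"gaode", "baidu", "tencent"},
    "Football Garden": {"baidu"},
    "Badminton Garden": {"gaode", "tencent"},
}

# Assumed correspondence between number of matched sources and confidence.
CONFIDENCE_BY_COUNT = {1: 0.5, 2: 0.8, 3: 1.0}


def source_count(target_address):
    """Number of data sources whose standard addresses include the target."""
    return len(STANDARD_ADDRESSES.get(target_address, set()))


def confidence(target_address):
    """Look up the confidence from the number of matched data sources."""
    return CONFIDENCE_BY_COUNT.get(source_count(target_address), 0.0)


for address in STANDARD_ADDRESSES:
    print(address, source_count(address), confidence(address))
```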
When at least two different target addresses exist, the step of determining the confidence of each target address is executed, and when the target addresses are the same, the target address is determined to be the target address matched with the address to be matched, so that the accuracy of address matching can be improved.
Step S30, determining the target address matched with the address to be matched from all the target addresses according to the confidence of each target address.
After the confidence degrees of the target addresses are obtained, the target address with the highest confidence degree in all the target addresses can be directly used as the target address matched with the address to be matched, or the target address matched with the address to be matched can be further obtained by combining the matching degrees of the target address and the address to be matched based on the confidence degrees, so that the accuracy of address matching is improved.
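One way to combine matching degree and confidence is to take their product and keep the largest, with the case where all models agree handled first. The sketch below assumes the candidates arrive as `(address, matching_degree, confidence)` tuples; that layout, the function name `select_target`, and the sample numbers are illustrative assumptions.

```python
def select_target(candidates):
    """Pick the final target address.

    candidates: list of (address, matching_degree, confidence) tuples,
    one per matching model (assumed layout).
    """
    addresses = {address for address, _, _ in candidates}
    # If every model returned the same address, no scoring is needed.
    if len(addresses) == 1:
        return addresses.pop()
    # Otherwise keep the address with the largest degree * confidence product.
    best = max(candidates, key=lambda c: c[1] * c[2])
    return best[0]


# Model 1 matches closely but against a single source; model 2 matches a
# little less closely but its address appears in every source.
print(select_target([("Football Garden", 0.9, 0.5),
                     ("Basketball Garden", 0.7, 1.0)]))
```

In this example the products are 0.45 and 0.7, so the address backed by more data sources is selected even though its matching degree is lower.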
In this embodiment, at least two target addresses matched with the address to be matched are obtained from a standard address set, the confidence of each target address is determined, and the target address matched with the address to be matched is determined among all the target addresses according to those confidences. The standard address set includes addresses of at least two data sources, each target address is obtained by matching according to a different matching model, and the more data sources a target address matches, the higher its confidence. Because the final target address is selected using the confidences of target addresses produced by different matching models, and a target address that matches more data sources has a higher confidence, the method avoids the low accuracy that results when a single address data source is used and the matched address is obtained simply by computing similarity with the minimum edit distance, so the accuracy of address matching can be effectively improved.
Referring to fig. 3, a second embodiment of the present invention provides an address matching method, based on the first embodiment shown in fig. 2, where the step S10 includes:
step S11, determining a first target address according to the address to be matched, the standard address set and a preset probability transition matrix model, wherein the preset probability transition matrix model is obtained by training a probability transition matrix training model according to an address training set and the standard address set;
when the first target address is determined according to the address to be matched, the standard address set and the preset probability transition matrix model, the following method can be adopted:
acquiring candidate characteristic words with frequency greater than preset frequency in the standard address set, and constructing a characteristic word set according to the candidate characteristic words;
The preset frequency is a threshold on the occurrence frequency of characters or words: characters or words whose frequency is greater than the preset frequency are candidate feature words. For example, with a preset frequency of 10000, the feature word set built from the selected candidate feature words is: Q = {province, city, district, street, community, town, township, cell};
The probability transition matrix model is based on the hidden Markov model (HMM). Let $\{i_t, t \in T\}$ be the state sequence, where $T = \{1, 2, \ldots\}$ is a discrete set of times; the state space formed by the possible values of $i_t$ is the discrete feature word set $Q = \{q_1, q_2, \ldots, q_N\}$. Let its transition probability matrix be $A = [p_{ij}]_{N \times N}$. Since the HMM is a probabilistic model of a time sequence in which the next state is related only to the previous state:
$p_{ij} = P(i_{t+1} = q_j \mid i_t = q_i), \quad i, j = 1, 2, \ldots, N$;
The number of transitions between feature words of adjacent levels from time $t$ to time $t+1$ is counted and recorded as:
$N(i_{t+1} = q_j \mid i_t = q_i), \quad i, j = 1, 2, \ldots, N$;
Let the corresponding transition weight be $a(i_{t+1} = q_j \mid i_t = q_i)$. To avoid the case where the number of transitions is 0 and the weight cannot be calculated, the transition weight is processed as follows:
Figure BDA0003176162120000101
where $i, j = 1, 2, \ldots, N$ and $m < \log_2 N(i_{t+1} = q_j \mid i_t = q_i)_{\min}$. The value of $m$ is subsequently set by analyzing and comparing experimental results under different values of $m$, and the probability $p_{ij}$ corresponding to the feature word transition weight is then calculated with a SoftMax function, as follows:
Figure BDA0003176162120000111
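The counting and normalization steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the transition weights are taken as already computed, because the smoothing formula is given in a figure that is not reproduced in the text, and the SoftMax is assumed to be applied over all next states $q_j$ for each previous state $q_i$. The function names and the nested-dictionary layout are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def count_transitions(sequences: Sequence[Sequence[str]]) -> Counter:
    """Count N(i_{t+1} = q_j | i_t = q_i) over adjacent feature words."""
    counts: Counter = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1
    return counts


def softmax_rows(weights: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Turn each row of transition weights into probabilities with SoftMax.

    Assumption: normalization runs over the next states of one previous state.
    """
    probabilities: Dict[str, Dict[str, float]] = {}
    for prev, row in weights.items():
        if not row:
            continue  # nothing to normalize for this state
        peak = max(row.values())  # subtract the maximum for numerical stability
        exps = {nxt: math.exp(w - peak) for nxt, w in row.items()}
        total = sum(exps.values())
        probabilities[prev] = {nxt: e / total for nxt, e in exps.items()}
    return probabilities
```

Each row of the resulting matrix sums to 1, which is the property a transition probability matrix needs.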
The feature word sequence corresponding to the address to be matched is extracted according to the feature word set $Q$: the cell address to be matched is traversed and feature words are extracted to obtain the feature word sequence $O = (o_1, o_2, \ldots, o_i, \ldots, o_L)$, $o_i \in Q$, $L \in N$. The feature word sequence includes the candidate feature words and the common characters in the address to be matched, where the common characters are the characters other than the candidate feature words in the address to be matched. The feature word elements in the feature word sequence are combined according to a target combination length and a preset combination order to obtain the feature word substring set of the target combination length. The preset combination order can be the extraction order: the feature word elements are combined in extraction order with combination length $l$ ($2 \le l \le L-1$), and feature word substrings of the same length are grouped to obtain the feature word substring set of combination length $l$: $\{O_l \mid O_l = (o_1, o_2, \ldots, o_l)\}$, $l = 2, \ldots, L-1$. The joint probability of each substring in the set $O_l$ is calculated as follows:
Figure BDA0003176162120000112
the substring with the maximum joint probability among the feature word substrings of combination length $l$ is calculated as follows:
Figure BDA0003176162120000113
The joint probability corresponding to the feature word substring set is determined according to the preset hidden Markov model and the feature word substring set of target combination length $l$, where the preset hidden Markov model is obtained by training a hidden Markov training model according to the standard address set, the joint probability corresponding to the feature word substring set is obtained from the feature word transition probabilities in the hidden Markov model, and the preset transition probability model is the preset hidden Markov model. When the target combination length is smaller than the preset combination length, which can be set to $L-1$, the target combination length is increased and the step of combining the feature word elements in the feature word sequence according to the target combination length and the preset combination order is executed again, so as to obtain the maximum-joint-probability feature word substrings for all lengths $l$, i.e. $MAX = \{MaxQ_2, MaxQ_3, \ldots, MaxQ_{L-1}\}$. When the target combination length is greater than or equal to the preset combination length, the feature substring set with the maximum joint probability is acquired, an optimal solution is determined according to it, and the optimal solution is determined as the first target address. To determine the optimal solution $OS$, the elements of the set $MAX$ are traversed starting from $i = 2$, comparing $MaxQ_i$ with $MaxQ_{i+1}$ and recording as $n_i$ the number of feature words that appear in the same order in the two substrings of adjacent lengths, where $i = 2, 3, \ldots, L-2$: if $n_i > 1$, then $OS = MaxQ_i$; otherwise $OS = MaxQ_{i+1}$; this continues until $i = L-2$.
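The enumeration of feature word substrings and the search for the most probable substring of each length can be sketched in Python as follows. The joint probability formulas are given in figures that are not reproduced in the text, so this sketch makes several assumptions: substrings are contiguous windows taken in extraction order, the joint probability is the product of first-order transition probabilities along the substring, and an unseen transition receives a small floor value. All names and the `floor` parameter are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Transitions = Dict[str, Dict[str, float]]


def windows(sequence: Sequence[str], length: int) -> List[Tuple[str, ...]]:
    """All contiguous substrings of the given length, in extraction order."""
    return [tuple(sequence[i:i + length]) for i in range(len(sequence) - length + 1)]


def joint_probability(substring: Sequence[str], transitions: Transitions,
                      floor: float = 1e-9) -> float:
    """Assumed joint probability: product of transition probabilities."""
    probability = 1.0
    for prev, nxt in zip(substring, substring[1:]):
        probability *= transitions.get(prev, {}).get(nxt, floor)
    return probability


def best_substrings(sequence: Sequence[str],
                    transitions: Transitions) -> Dict[int, Tuple[str, ...]]:
    """For each length l = 2 .. L-1, keep the substring with the highest joint probability."""
    best: Dict[int, Tuple[str, ...]] = {}
    for length in range(2, len(sequence)):
        candidates = windows(sequence, length)
        best[length] = max(candidates, key=lambda s: joint_probability(s, transitions))
    return best
```

The returned dictionary plays the role of the set MAX, keyed by combination length; the final comparison of adjacent lengths to choose the optimal solution is left out of the sketch.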
Step S12, a second target address is determined according to the address to be matched, the standard address set and a preset residual error network fusion model, the preset residual error network fusion model comprises an embedding layer, a TextRCNN network, a TextCNN network, a residual error layer and a preset activation function, the preset residual error network fusion model is obtained by training a residual error network fusion training model according to the address training set and the standard address set, and the target addresses are the first target address and the second target address respectively.
In this embodiment, in order to improve the accuracy of address matching, the matching models adopted are a preset probability transition matrix model and a preset residual error network fusion model. Each model is matched against the address to be matched, yielding a first target address and a second target address, so that there are two target addresses. The target address matched with the address to be matched is then determined between the first target address and the second target address according to the confidence of the first target address and the confidence of the second target address. The target address with the higher confidence can be used directly as the target address matched with the address to be matched. Alternatively, a first matching degree between the first target address and the address to be matched and a second matching degree between the second target address and the address to be matched are determined, a first product of the first matching degree and the first confidence and a second product of the second matching degree and the second confidence are calculated, the first product is compared with the second product, and the target address corresponding to the larger product is taken as the target address matched with the address to be matched.
In this embodiment, a first target address is determined according to the address to be matched, the standard address set and a preset probability transition matrix model, where the preset probability transition matrix model is obtained by training a probability transition matrix training model according to an address training set and the standard address set. A second target address is determined according to the address to be matched, the standard address set and a preset residual error network fusion model, where the preset residual error network fusion model comprises an embedding layer, a TextRCNN network, a TextCNN network, a residual error layer and a preset activation function, and is obtained by training a residual error network fusion training model according to the address training set and the standard address set. The target addresses are the first target address and the second target address respectively, so that the accuracy of address matching is improved.
Referring to fig. 4, a third embodiment of the present invention provides an address matching method, based on the first embodiment shown in fig. 2, where the step S30 includes:
step S31, determining the matching degree of each target address and the address to be matched;
In order to further improve the accuracy of address matching, in this embodiment the target address matched with the address to be matched is obtained based on both the matching degree between the target address and the address to be matched and the confidence corresponding to the target address. The matching degree can be calculated as follows: determine the number of characters that the address to be matched and the target address have in common and the number of characters in which they differ, calculate the sum of these two numbers, and take the ratio of the number of common characters to that sum as the matching degree. The matching degree can also be calculated in other ways; the more similar the target address is to the address to be matched, the higher the matching degree.
Step S32, determining the product of the matching degree and the confidence degree corresponding to each target address;
step S33, determining the target address matched with the address to be matched according to the target address corresponding to the maximum product.
The target address corresponding to the maximum product is taken as the target address matched with the address to be matched. For example, suppose there are two target addresses: the address to be matched is a, the target addresses are b and c, the matching degree corresponding to target address b is p(b), and the matching degree corresponding to target address c is p(c). If the confidence corresponding to target address b is 0.9 and the confidence corresponding to target address c is 0.8, then the product for target address b is 0.9 × p(b) and the product for target address c is 0.8 × p(c). With p(b) = 0.9 and p(c) = 0.8, the products are 0.81 and 0.64, so the product for target address b is larger and target address b is taken as the target address matched with the address to be matched. The target address obtained in this way is more accurate, which improves the accuracy of address matching.
In this embodiment, the matching degree of each target address and the address to be matched is determined, the product of the matching degree corresponding to each target address and the confidence degree is determined, and the target address matched with the address to be matched is determined according to the target address corresponding to the maximum product, so that the accuracy of address matching can be further improved by combining the matching degree and the confidence degree.
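Steps S31 to S33 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: "same" characters are counted as the multiset intersection of the two strings and "different" characters as everything left over on either side, which is only one possible reading of the description, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def matching_degree(address: str, target: str) -> float:
    """Ratio of shared characters to shared plus differing characters (assumed reading)."""
    a, b = Counter(address), Counter(target)
    same = sum((a & b).values())                             # characters both contain
    different = sum((a - b).values()) + sum((b - a).values())  # leftovers on each side
    total = same + different
    return same / total if total else 0.0


def pick_by_product(candidates: Dict[str, Tuple[float, float]]) -> str:
    """candidates maps target address -> (matching degree, confidence)."""
    return max(candidates, key=lambda t: candidates[t][0] * candidates[t][1])
```

With the figures from the example above, `pick_by_product({"b": (0.9, 0.9), "c": (0.8, 0.8)})` compares 0.81 with 0.64 and selects `"b"`.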
Referring to fig. 5, a fourth embodiment of the present invention provides an address matching method, based on the first embodiment shown in fig. 2, before the step S10, the address matching method further includes:
step S40, acquiring the original address sent by the server;
The original address is an address, sent by the server, that needs to be matched. Since the original address may contain character errors or be incorrectly formatted, in this embodiment the original address is also preprocessed in order to improve the accuracy of address matching. The server may be any server that provides the original address, and the server that provides the original address and the server that performs address matching are different servers.
And step S50, performing illegal character cleaning, redundant address cleaning, wrongly written character replacement and incomplete address filling on the original address to obtain the address to be matched.
Illegal character cleaning refers to deleting illegal characters, such as ", ( ) | ! @ ?", which do not belong to the address information. For example, the original address "Guizhou Province Bijie City Jinsha County Drum Street @Lijing Famous City!" contains the illegal characters "@" and "!", so "@" and "!" are deleted to obtain "Guizhou Province Bijie City Jinsha County Drum Street Lijing Famous City";
Redundant address cleaning refers to deleting redundant address information, that is, unnecessary information such as a road number or a house number. For example, if the original address is "Guizhou Province Bijie City Jinsha County Famous City Building 4 Unit 5 F1", then "Building 4 Unit 5 F1" is deleted to obtain "Guizhou Province Bijie City Jinsha County Famous City";
Wrongly written character replacement refers to replacing recognized wrongly written characters. This requires an error correction model, which can be obtained by training a model on a large-scale data set so that samples containing wrongly written characters are modeled automatically; wrongly written characters are then corrected by the error correction model. For example, if the original address is "Guizhou Province Bijie City Jinsha County Drum Street Famous City" with a wrongly written character in "Famous", the error correction model corrects that character to obtain "Guizhou Province Bijie City Jinsha County Drum Street Famous City";
Incomplete address completion refers to filling in address elements that are missing from the address information. For example, if the original address is "Jinsha County Famous City", then since the province, city and street to which it belongs are fixed information, the address can be completed to obtain "Guizhou Province Bijie City Jinsha County Drum Street Famous City".
After the original address is preprocessed in the mode, the obtained address to be matched is more accurate, and therefore the accuracy of address matching can be improved.
In this embodiment, the address to be matched is obtained by acquiring the original address sent by the server and performing illegal character cleaning, redundant address cleaning, wrongly written character replacement and incomplete address completion on the original address, so that the address to be matched is more accurate and the accuracy of address matching is further improved.
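The four preprocessing operations of step S50 can be sketched in Python as follows. This is an illustration only: the example strings are English renderings of the addresses, the suffix pattern for redundant information is hypothetical, and the two lookup tables stand in for the trained error correction model and for the knowledge of fixed administrative divisions, neither of which is specified in the text.

```python
import re
from typing import Dict

# Characters the description lists as illegal: , ( ) | ! @ ?
ILLEGAL_CHARS = re.compile(r"[,()|!@?]")
# Hypothetical pattern for a trailing building / unit / room suffix.
REDUNDANT_SUFFIX = re.compile(r"\s*Building\s+\d+(\s+Unit\s+\d+)?(\s+\w+)?$")


def clean_illegal(address: str) -> str:
    """Delete illegal characters and collapse the whitespace left behind."""
    stripped = ILLEGAL_CHARS.sub("", address)
    return re.sub(r"\s+", " ", stripped).strip()


def clean_redundant(address: str) -> str:
    """Delete redundant trailing information such as building and unit numbers."""
    return REDUNDANT_SUFFIX.sub("", address)


def replace_typos(address: str, corrections: Dict[str, str]) -> str:
    """Stand-in for the error correction model: a simple lookup table."""
    for wrong, right in corrections.items():
        address = address.replace(wrong, right)
    return address


def complete(address: str, prefixes: Dict[str, str]) -> str:
    """Stand-in for address completion: expand a known leading element."""
    for short, full in prefixes.items():
        if address.startswith(short):
            return full + address[len(short):]
    return address


def preprocess(raw: str, corrections: Dict[str, str], prefixes: Dict[str, str]) -> str:
    """Apply the four operations in the order listed in step S50."""
    address = clean_illegal(raw)
    address = clean_redundant(address)
    address = replace_typos(address, corrections)
    return complete(address, prefixes)
```

The order of the operations follows the order in which step S50 lists them; the text does not state whether a different order would change the result.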
Referring to fig. 6, fig. 6 is a schematic diagram of an address matching apparatus according to an embodiment of the present invention. The apparatus includes an obtaining module 10 and a determining module 20, wherein:
the obtaining module 10 is configured to obtain at least two target addresses matched with an address to be matched in a standard address set, where the standard address set includes addresses of at least two data sources, and each target address is obtained by matching according to different matching models;
the determining module 20 is configured to determine a confidence of each target address, where the greater the number of data sources matched with the target address, the higher the corresponding confidence is, and determine, according to the confidence of each target address, the target address matched with the address to be matched in all the target addresses.
In an embodiment, the obtaining module 10 is further configured to perform the following steps:
determining a first target address according to the address to be matched, the standard address set and a preset probability transition matrix model, wherein the preset probability transition matrix model is obtained by training a probability transition matrix training model according to an address training set and the standard address set;
determining a second target address according to the address to be matched, the standard address set and a preset residual error network fusion model, wherein the preset residual error network fusion model comprises an embedding layer, a TextRCNN network, a TextCNN network, a residual error layer and a preset activation function, the preset residual error network fusion model is obtained by training the residual error network fusion training model according to the address training set and the standard address set, and the target addresses are the first target address and the second target address respectively.
In an embodiment, the obtaining module 10 is further configured to perform the following steps:
acquiring candidate characteristic words with frequency greater than preset frequency in the standard address set;
constructing a feature word set according to the candidate feature words;
extracting a characteristic word sequence corresponding to the address to be matched according to the characteristic word set, wherein the characteristic word sequence comprises the candidate characteristic words and common characters in the address to be matched;
combining the characteristic word elements in the characteristic word sequence according to the target combination length and a preset combination sequence to obtain a characteristic word substring set of the target combination length;
determining a joint probability corresponding to a feature word substring set according to a preset hidden Markov model and the feature word substring set with the target combination length, wherein the preset hidden Markov model is obtained by training a hidden Markov training model according to the standard address set, the joint probability corresponding to the feature word substring set is obtained according to a feature word transition probability in the hidden Markov model, and the preset transition probability model is the preset hidden Markov model;
when the target combination length is smaller than the preset combination length, increasing the target combination length, and returning to execute the step of combining the characteristic word elements in the characteristic word sequence according to the target combination length and the preset combination sequence to obtain a characteristic word substring set with the target combination length;
when the target combination length is greater than or equal to the preset combination length, acquiring the feature substring set with the maximum joint probability;
determining an optimal solution according to the feature substring set with the maximum joint probability;
determining the optimal solution as the first address.
In an embodiment, the determining module 20 is further configured to perform the following steps:
determining the matching degree of each target address and the address to be matched;
determining the product of the matching degree and the confidence degree corresponding to each target address;
and determining the target address matched with the address to be matched according to the target address corresponding to the maximum product.
In an embodiment, the determining module 20 is further configured to perform the following steps:
when there are at least two different target addresses, performing the step of determining a confidence level for each of the target addresses;
and when all the target addresses are the same, determining that the target address is the target address matched with the address to be matched.
In an embodiment, the determining module 20 is further configured to perform the following steps:
determining the number of the data sources matched with each target address;
determining the confidence level of the target address according to the quantity.
In an embodiment, the obtaining module 10 is further configured to perform the following steps:
acquiring an original address sent by a server;
and carrying out illegal character cleaning, redundant address cleaning, wrongly written character replacement and incomplete address completion on the original address to obtain the address to be matched.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing an address matching apparatus (which may be a server or other computer device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An address matching method, characterized in that the address matching method comprises:
acquiring at least two target addresses matched with addresses to be matched in a standard address set, wherein the standard address set comprises addresses of at least two data sources, and each target address is obtained by matching according to different matching models;
determining the confidence of each target address, wherein the higher the number of data sources matched with the target address is, the higher the corresponding confidence is;
and determining the target address matched with the address to be matched in all the target addresses according to the confidence degrees of all the target addresses.
2. The address matching method according to claim 1, wherein the step of obtaining at least two target addresses in the standard address set that match the address to be matched comprises:
determining a first target address according to the address to be matched, the standard address set and a preset probability transition matrix model, wherein the preset probability transition matrix model is obtained by training a probability transition matrix training model according to an address training set and the standard address set;
determining a second target address according to the address to be matched, the standard address set and a preset residual error network fusion model, wherein the preset residual error network fusion model comprises an embedding layer, a TextRCNN network, a TextCNN network, a residual error layer and a preset activation function, the preset residual error network fusion model is obtained by training the residual error network fusion training model according to the address training set and the standard address set, and the target addresses are the first target address and the second target address respectively.
3. The address matching method of claim 2, wherein the step of determining the first address according to the address to be matched, the standard address set and a preset probability transition matrix model comprises:
acquiring candidate characteristic words with frequency greater than preset frequency in the standard address set;
constructing a feature word set according to the candidate feature words;
extracting a characteristic word sequence corresponding to the address to be matched according to the characteristic word set, wherein the characteristic word sequence comprises the candidate characteristic words and common characters in the address to be matched;
combining the characteristic word elements in the characteristic word sequence according to the target combination length and a preset combination sequence to obtain a characteristic word substring set of the target combination length;
determining a joint probability corresponding to a feature word substring set according to a preset hidden Markov model and the feature word substring set with the target combination length, wherein the preset hidden Markov model is obtained by training a hidden Markov training model according to the standard address set, the joint probability corresponding to the feature word substring set is obtained according to a feature word transition probability in the hidden Markov model, and the preset transition probability model is the preset hidden Markov model;
when the target combination length is smaller than the preset combination length, increasing the target combination length, and returning to execute the step of combining the characteristic word elements in the characteristic word sequence according to the target combination length and the preset combination sequence to obtain a characteristic word substring set with the target combination length;
when the target combination length is greater than or equal to the preset combination length, acquiring the feature substring set with the maximum joint probability;
determining an optimal solution according to the feature substring set with the maximum joint probability;
determining the optimal solution as the first address.
4. The address matching method according to claim 1, wherein the step of determining the target address matching the address to be matched among all the target addresses according to the confidence of each of the target addresses comprises:
determining the matching degree of each target address and the address to be matched;
determining the product of the matching degree and the confidence degree corresponding to each target address;
and determining the target address matched with the address to be matched according to the target address corresponding to the maximum product.
5. The address matching method of claim 1, wherein after the step of obtaining at least two target addresses in the standard address set that match the address to be matched, the address matching method further comprises:
when there are at least two different target addresses, performing the step of determining a confidence level for each of the target addresses;
and when all the target addresses are the same, determining that the target address is the target address matched with the address to be matched.
6. The address matching method of claim 1, wherein the step of determining a confidence level for each of the target addresses comprises:
determining the number of the data sources matched with each target address;
determining the confidence level of the target address according to the quantity.
7. The address matching method according to claim 1, wherein, before the step of obtaining at least two target addresses in the standard address set that match the address to be matched, the address matching method further comprises:
acquiring an original address sent by a server;
and carrying out illegal character cleaning, redundant address cleaning, wrongly written character replacement and incomplete address completion on the original address to obtain the address to be matched.
8. An address matching apparatus, comprising an obtaining module and a determining module, wherein:
the obtaining module is used for obtaining at least two target addresses matched with the addresses to be matched in a standard address set, the standard address set comprises addresses of at least two data sources, and each target address is obtained by matching according to different matching models;
the determining module is configured to determine a confidence level of each target address, where the greater the number of data sources matched with the target address, the higher the corresponding confidence level is, and determine, according to the confidence level of each target address, the target address matched with the address to be matched, in all the target addresses.
9. An address matching apparatus, characterized in that the address matching apparatus comprises a memory, a processor and an address matching program stored on the memory and executable on the processor, the address matching program implementing the steps of the address matching method according to any one of claims 1 to 7 when executed by the processor.
10. A computer-readable storage medium, characterized in that an address matching program is stored on the computer-readable storage medium, which when executed by a processor implements the steps of the address matching method according to any one of claims 1 to 7.
CN202110834270.2A 2021-07-22 2021-07-22 Address matching method, device and computer readable storage medium Active CN113515677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110834270.2A CN113515677B (en) 2021-07-22 2021-07-22 Address matching method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110834270.2A CN113515677B (en) 2021-07-22 2021-07-22 Address matching method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113515677A true CN113515677A (en) 2021-10-19
CN113515677B CN113515677B (en) 2023-10-27

Family

ID=78067669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110834270.2A Active CN113515677B (en) 2021-07-22 2021-07-22 Address matching method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113515677B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030120651A1 (en) * 2001-12-20 2003-06-26 Microsoft Corporation Methods and systems for model matching
CN101996247A (en) * 2010-11-10 2011-03-30 百度在线网络技术(北京)有限公司 Method and device for constructing address database
CN102169498A (en) * 2011-04-14 2011-08-31 中国测绘科学研究院 Address model constructing method and address matching method and system
WO2016050088A1 (en) * 2014-09-30 2016-04-07 华为技术有限公司 Address search method and device
US20170308807A1 (en) * 2016-04-21 2017-10-26 Linkedin Corporation Secondary profiles with confidence scores
US20180089227A1 (en) * 2016-09-26 2018-03-29 Uber Technologies, Inc. Geographical location search using multiple data sources
CN110147445A (en) * 2019-04-09 2019-08-20 平安科技(深圳)有限公司 Intention recognition method, device, equipment and storage medium based on text classification
CN111008625A (en) * 2019-12-06 2020-04-14 中国建设银行股份有限公司 Address correction method, device, equipment and storage medium
CN111444298A (en) * 2020-03-19 2020-07-24 浙江大学 Address matching algorithm based on interest point knowledge graph pre-training
WO2020168750A1 (en) * 2019-02-18 2020-08-27 平安科技(深圳)有限公司 Address information standardization method and apparatus, computer device and storage medium
CN111783419A (en) * 2020-06-12 2020-10-16 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN111797182A (en) * 2020-05-29 2020-10-20 深圳市跨越新科技有限公司 Address code analysis method and system
CN111881677A (en) * 2020-07-28 2020-11-03 武汉大学 Address matching algorithm based on deep learning model
CN112256932A (en) * 2020-12-22 2021-01-22 中博信息技术研究院有限公司 Word segmentation method and device for address character string
CN112487122A (en) * 2020-12-02 2021-03-12 电信科学技术第十研究所有限公司 Address normalization processing method and device
CN112528174A (en) * 2020-11-27 2021-03-19 暨南大学 Address finishing and complementing method based on knowledge graph and multiple matching and application
CN112925774A (en) * 2021-02-01 2021-06-08 大箴(杭州)科技有限公司 Method and device for cleaning address data, storage medium and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
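张剑 et al.: "面向智慧城市的高精度地名地址匹配方法" [A high-precision place-name and address matching method for smart cities], 《测绘与空间地理信息》 [Geomatics & Spatial Information Technology], vol. 42, no. 11, pp. 166-169 *
魏金明 et al.: "基于置信度的地址匹配方法初探" [A preliminary study of a confidence-based address matching method], 《测绘科学》 [Science of Surveying and Mapping], vol. 40, no. 01, pp. 122-125 *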

Also Published As

Publication number Publication date
CN113515677B (en) 2023-10-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant