CN112231429A - Address matching method based on machine learning classification algorithm - Google Patents

Publication number
CN112231429A
Authority
CN
China
Prior art keywords
address
matching
machine learning
method based
classification algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011236891.2A
Other languages
Chinese (zh)
Inventor
许再涛
张谦
石兴磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Health Medical Big Data Co ltd
Original Assignee
Shandong Health Medical Big Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Health Medical Big Data Co ltd
Priority to CN202011236891.2A
Publication of CN112231429A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an address matching method based on a machine learning classification algorithm and belongs to the technical field of computers. The method splits address information according to address levels, computes the similarity at each address level using both text matching and pinyin matching, assembles the per-level similarities into a vector, and performs address matching with a trained logistic regression model. The method can calculate the importance of each address level more accurately, improves matching accuracy, and has good value for popularization and application.

Description

Address matching method based on machine learning classification algorithm
Technical Field
The invention relates to the technical field of computers, and particularly provides an address matching method based on a machine learning classification algorithm.
Background
Currently, address matching is generally based on address segmentation: the address text is divided into several parts according to address levels, and each part is matched under a given text matching rule to obtain a similarity vector between two addresses. The components of the similarity vector are summed according to weights to obtain an overall address similarity, and two addresses are judged to be the same address if this similarity is greater than or equal to a specified threshold. Some address matching schemes instead use an existing address information base: they derive the similarity of two addresses from the distance between the corresponding longitudes and latitudes and judge from it whether the addresses match.
Matching addresses by level division with a text matching method has certain defects. First, some addresses are entered by pronunciation and therefore contain characters that are homophones of, but differ from, the characters of the real address. Second, the weight given to the matching similarity of each address level is set from experience and is not necessarily optimal. Third, an address base is not available to everyone, so matching addresses accurately without an address base is important.
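The conventional scheme described in this section (a weighted sum of per-level similarities compared against a threshold) can be sketched as follows. This is only an illustration of the baseline, not code from the patent: the function names, the three-level example, the hand-set weights and the threshold value are all assumptions.

```python
from typing import Sequence


def weighted_similarity(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-level similarities (the hand-weighted baseline)."""
    if len(sims) != len(weights):
        raise ValueError("sims and weights must have the same length")
    return sum(s * w for s, w in zip(sims, weights))


def is_same_address(sims: Sequence[float], weights: Sequence[float],
                    threshold: float = 0.75) -> bool:
    """Two addresses count as the same if the weighted sum reaches the threshold."""
    return weighted_similarity(sims, weights) >= threshold


# Hypothetical weights chosen by experience for three address levels.
WEIGHTS = [0.5, 0.25, 0.25]

print(is_same_address([1.0, 1.0, 0.0], WEIGHTS))  # weighted sum is 0.75
print(is_same_address([1.0, 0.0, 0.0], WEIGHTS))  # weighted sum is 0.5
```

The weights here are fixed by hand, which is the limitation the background points out; the invention replaces them with parameters learned by logistic regression.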
Disclosure of Invention
The technical task of the invention is to provide, in view of the above problems, an address matching method based on a machine learning classification algorithm that can more accurately calculate the importance of each address level and improve matching accuracy.
To achieve this purpose, the invention provides the following technical scheme:
The method splits address information according to address levels, generates the similarity at each address level using text and pinyin matching, forms the per-level similarities into a vector, and performs address matching using a trained logistic regression model.
Preferably, the address is divided into nine parts according to address level, and the nine parts are compared and calculated separately.
Preferably, the nine parts comprise first-level administrative district division names, second-level administrative district division names, third-level administrative village division names, fourth-level administrative village division names, village and community names, house numbers, building numbers, unit numbers and room numbers.
Preferably, address text matching and address pinyin matching are combined to measure the address similarity.
Preferably, the similarity vectors generated by matching the address texts at all levels and the similarity vectors generated by matching the address pinyin are added according to the weight to generate final address similarity vectors at all levels.
The numerical value of the weight can be set according to actual needs, and the flexibility is high.
Preferably, the weight parameters of each level of address are trained using a logistic regression model.
Preferably, the calculated address similarity vector is input to a trained logistic regression model, and address matching is performed by using the trained logistic regression model.
Preferably, the address matching output value of the logistic regression model is compared with an address matching threshold value to judge whether the addresses are matched.
Compared with the prior art, the address matching method based on the machine learning classification algorithm has the following outstanding beneficial effects: the method divides the address into nine parts according to address level and combines text matching with pinyin matching to calculate the similarity between addresses at each level more accurately; it uses a logistic regression model to train the weight parameters of the address levels and so calculates the importance of each level more accurately, which improves matching accuracy. The method works well for addresses with incomplete information and for addresses containing wrongly written or homophonic characters, and it has good value for popularization and application.
Drawings
Fig. 1 is a flowchart of the address matching method based on the machine learning classification algorithm according to the present invention.
Detailed Description
The address matching method based on the machine learning classification algorithm of the present invention will be further described in detail with reference to the accompanying drawings and embodiments.
Examples
As shown in Fig. 1, the address matching method based on the machine learning classification algorithm of the present invention splits address information according to address levels, generates the similarity at each address level using text and pinyin matching, forms the per-level similarities into a vector, and performs address matching using a trained logistic regression model. The address is divided into nine parts according to address level, and the nine parts are compared and calculated separately. The nine parts comprise first-level administrative district division names, second-level administrative district division names, third-level administrative village division names, fourth-level administrative village division names, village and community names, house numbers, building numbers, unit numbers and room numbers. The similarity vectors generated by matching the address texts at all levels and the similarity vectors generated by matching the address pinyin are summed according to weight to generate the final address similarity vectors at all levels. The weight parameters of the addresses at all levels are trained using a logistic regression model. The calculated address similarity vector is input into the trained logistic regression model, which performs the address matching. The address matching output value of the logistic regression model is compared with an address matching threshold to judge whether the addresses match.
The specific embodiment is as follows:
(1) Level-division words of the same level in the address data, such as county, district and banner, are counted; the addresses are then split according to the address levels in Table 1, so that each address is divided into 9 parts.
TABLE 1 Address level partitioning
[Table image: Figure BDA0002767014890000031]
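A minimal sketch of the nine-level split, assuming each level can be recognized by a trailing level-division word. Table 1 is only available as an image, so the suffix words in `LEVEL_PATTERNS`, the level names and the function name `split_address` are hypothetical stand-ins, and a real address corpus would need a more careful parser.

```python
import re
from typing import List, Tuple

# Hypothetical level-ending words; the actual words of Table 1 are not
# recoverable from the text, so every pattern here is an assumption.
LEVEL_PATTERNS: List[Tuple[str, str]] = [
    ("level1_division", r".+?省"),
    ("level2_division", r".+?市"),
    ("level3_division", r".+?[区县旗]"),
    ("level4_division", r".+?(?:街道|镇|乡)"),
    ("village_or_community", r".+?(?:小区|社区|村)"),
    ("house_number", r".+?号(?!楼)"),   # "号" not followed by "楼"
    ("building_number", r".+?(?:号楼|栋|幢)"),
    ("unit_number", r".+?单元"),
    ("room_number", r".+?室"),
]


def split_address(address: str) -> List[str]:
    """Split an address into 9 parts; a missing level becomes an empty string."""
    parts: List[str] = []
    pos = 0
    for _name, pattern in LEVEL_PATTERNS:
        match = re.compile(pattern).match(address, pos)
        if match:
            parts.append(match.group(0))
            pos = match.end()
        else:
            parts.append("")
    return parts


full = "山东省济南市历下区文东街道花园小区88号3号楼2单元501室"
print(split_address(full))
print(split_address("济南市历下区88号"))
```

Levels that cannot be found are left empty, which matches the later rule that a level missing from both addresses contributes a similarity of 0.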
Let $r_i$ and $r_j$ denote the address character vectors obtained after address $i$ and address $j$ are split [formula images: Figure BDA0002767014890000041, Figure BDA0002767014890000042]. The similarity of the $k$-th level, $\mathrm{sim}_k(r_i, r_j)$, is computed with the formula [formula image: Figure BDA0002767014890000043], in which [symbol image: Figure BDA0002767014890000046] is the number of identical elements in the two character strings and [symbol image: Figure BDA0002767014890000047] is the sum of the numbers of elements in the two character strings. When [symbol image: Figure BDA0002767014890000044] and [symbol image: Figure BDA0002767014890000045] are both 0, $\mathrm{sim}_k(r_i, r_j) = 0$. The address similarity vector of $r_i$ and $r_j$ is

$$l^{(i,j)} = \big(\mathrm{sim}_1(r_i, r_j),\ \mathrm{sim}_2(r_i, r_j),\ \dots,\ \mathrm{sim}_9(r_i, r_j)\big).$$

Likewise, let $R_i$ and $R_j$ denote the address pinyin character vectors obtained after address $i$ and address $j$ are split [formula images: Figure BDA0002767014890000048, Figure BDA0002767014890000049]. The pinyin similarity of the $k$-th level, $\mathrm{SIM}_k(R_i, R_j)$, is computed with the formula [formula image: Figure BDA00027670148900000410], in which [symbol image: Figure BDA00027670148900000413] is the number of identical elements in the two address pinyin strings and [symbol image: Figure BDA00027670148900000414] is the sum of the numbers of elements in the two address pinyin strings. When [symbol image: Figure BDA00027670148900000411] and [symbol image: Figure BDA00027670148900000412] are both 0, $\mathrm{SIM}_k(R_i, R_j) = 0$. The address pinyin similarity vector of $R_i$ and $R_j$ is

$$L^{(i,j)} = \big(\mathrm{SIM}_1(R_i, R_j),\ \mathrm{SIM}_2(R_i, R_j),\ \dots,\ \mathrm{SIM}_9(R_i, R_j)\big).$$

The two similarity vectors are added in proportion to obtain the final similarity vector of the two addresses:

$$x^{(i,j)} = m \cdot l^{(i,j)} + n \cdot L^{(i,j)}, \quad m > 0,\ n > 0,\ m + n = 1.$$
Text and pinyin similarity are calculated for the addresses; sample cases are shown in Table 2.
TABLE 2 Similarity of similar address text and pinyin
[Table image: Figure BDA00027670148900000415]
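The per-level text and pinyin similarities and their weighted combination can be sketched as below. The exact similarity formula is only available as an image, so this sketch assumes a Dice-style ratio (twice the number of shared elements over the total number of elements), which makes identical parts score 1; the factor 2 is an assumption. `TOY_PINYIN` is a hypothetical lookup covering only the demo characters, where a real system would use a full pinyin conversion library, and the demo uses two levels instead of nine for brevity.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Toy pinyin lookup for the demo characters only (hypothetical data).
TOY_PINYIN: Dict[str, str] = {
    "历": "li", "厉": "li", "下": "xia", "夏": "xia", "区": "qu",
}


def to_pinyin(text: str) -> List[str]:
    """Map each character to a pinyin syllable; unknown characters stay as-is."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def overlap_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Shared elements over total elements; 0 when both parts are empty."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())  # multiset intersection
    return 2.0 * shared / total  # factor 2 is an assumption (Dice coefficient)


def similarity_vector(parts_i: Sequence[str], parts_j: Sequence[str],
                      m: float = 0.5, n: float = 0.5) -> List[float]:
    """Combine text and pinyin similarity per level: x = m * l + n * L."""
    if m <= 0 or n <= 0 or abs(m + n - 1.0) > 1e-9:
        raise ValueError("require m > 0, n > 0 and m + n = 1")
    vector = []
    for a, b in zip(parts_i, parts_j):
        text_sim = overlap_similarity(a, b)
        pinyin_sim = overlap_similarity(to_pinyin(a), to_pinyin(b))
        vector.append(m * text_sim + n * pinyin_sim)
    return vector


# "厉夏区" is a homophone misspelling of "历下区"; the second level is empty in both.
print(similarity_vector(["历下区", ""], ["厉夏区", ""]))
```

In the demo the text similarity of the misspelled part is low because only one character matches, while the pinyin similarity is 1, so the combined score stays well above the text-only score.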
(2) Each pair of addresses is labeled 0 or 1 according to whether the two addresses are the same address: the label is 1 if they are the same address and 0 if they are not. This gives the data set

$$D = \{(x^{(1,1)}, y^{(1,1)}),\ (x^{(1,2)}, y^{(1,2)}),\ \dots,\ (x^{(N,M)}, y^{(N,M)})\}, \quad x^{(i,j)} \in \mathbb{R}^9,\ y^{(i,j)} \in \{0, 1\},\ i = 1, 2, \dots, N,\ j = 1, 2, \dots, M.$$

The data set is divided into a training set and a test set, and a logistic regression model is trained on the training set:
[Formula images: Figure BDA0002767014890000051, Figure BDA0002767014890000052]
Setting:
$P(Y=1 \mid x) = p(x)$
$P(Y=0 \mid x) = 1 - p(x)$
Log-likelihood function:
[Formula image: Figure BDA0002767014890000053]
Minimizing the loss function $J(w) = -\ln L(w)$ yields the parameter $w$.
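For reference, the standard textbook form of binary logistic regression that is consistent with the setting $P(Y=1 \mid x) = p(x)$ is written out below. The formula images of the original are not recoverable from the text, so this is an assumed reconstruction: the original may differ in details such as a bias term, a normalization factor, or notation. Here $w \cdot x$ is the inner product of the weight vector and the 9-dimensional similarity vector, and the sum runs over the training pairs $(x, y)$.

```latex
% Assumed standard form; not taken from the original formula images.
p(x) = \frac{1}{1 + e^{-w \cdot x}}

\ln L(w) = \sum_{(x,\,y) \in D_{\text{train}}}
    \Big[\, y \ln p(x) + (1 - y) \ln\big(1 - p(x)\big) \Big]

J(w) = -\ln L(w), \qquad
\nabla_w J(w) = \sum_{(x,\,y) \in D_{\text{train}}} \big(p(x) - y\big)\, x
```

Under this form, each component of the learned $w$ acts as the weight of one address level, replacing the hand-set weights of the conventional scheme.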
The parameter w is substituted into the model, and address matching is performed with the logistic regression model. Experiments show that the method works well for matching addresses with incomplete address information and addresses containing wrongly written or homophonic characters.
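Training and thresholded matching can be sketched with plain batch gradient descent on the negative log-likelihood. This is an illustrative sketch rather than the patent's implementation: the bias term `b`, the learning rate, the epoch count, the 0.5 threshold and the four toy training vectors are all assumptions, and a practical system would more likely rely on an established machine learning library.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that the address pair behind x is a match."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)


def train(xs: Sequence[Sequence[float]], ys: Sequence[int],
          lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Minimize the mean negative log-likelihood by batch gradient descent."""
    dim, count = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y  # gradient factor p(x) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wk - lr * gk / count for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b / count
    return w, b


def is_match(w: Sequence[float], b: float, x: Sequence[float],
             threshold: float = 0.5) -> bool:
    """Compare the model output with the address matching threshold."""
    return predict_proba(w, b, x) >= threshold


# Toy 9-dimensional similarity vectors (hypothetical): label 1 = same address.
XS = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
YS = [1, 1, 0, 0]

w, b = train(XS, YS)
print(is_match(w, b, XS[0]), is_match(w, b, XS[2]))
```

In the toy data the first two levels agree for every pair, so they carry no signal and the learned weights concentrate on the lower levels; this is the sense in which the trained parameters express the importance of each address level.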
The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims (8)

1. An address matching method based on a machine learning classification algorithm is characterized in that: the method splits address information according to address levels, generates similarity between each address level according to a text and pinyin matching mode, forms the similarity between each level of addresses into vectors, and performs address matching by using a trained logistic regression model.
2. The address matching method based on the machine learning classification algorithm according to claim 1, characterized in that: the address is divided into nine parts according to address level, and the nine parts are compared and calculated separately.
3. The address matching method based on the machine learning classification algorithm according to claim 2, characterized in that: the nine parts comprise first-level administrative district division names, second-level administrative district division names, third-level administrative village division names, fourth-level administrative village division names, village and community names, house numbers, building numbers, unit numbers and room numbers.
4. The address matching method based on the machine learning classification algorithm according to claim 3, characterized in that: address text matching and address pinyin matching are combined to measure the address similarity.
5. The address matching method based on the machine learning classification algorithm according to claim 4, wherein: the similarity vectors generated by matching the address texts at all levels and the similarity vectors generated by matching the address pinyin are summed according to weight to generate the final address similarity vectors at all levels.
6. The address matching method based on the machine learning classification algorithm according to claim 5, wherein: the weight parameters of the addresses at all levels are trained using a logistic regression model.
7. The address matching method based on the machine learning classification algorithm according to claim 6, characterized in that: the calculated address similarity vector is input into a trained logistic regression model, and address matching is performed using the trained logistic regression model.
8. The address matching method based on the machine learning classification algorithm according to claim 7, wherein: the address matching output value of the logistic regression model is compared with an address matching threshold to judge whether the addresses match.
CN202011236891.2A 2020-11-09 2020-11-09 Address matching method based on machine learning classification algorithm Pending CN112231429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011236891.2A CN112231429A (en) 2020-11-09 2020-11-09 Address matching method based on machine learning classification algorithm

Publications (1)

Publication Number Publication Date
CN112231429A true CN112231429A (en) 2021-01-15

Family

ID=74122650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011236891.2A Pending CN112231429A (en) 2020-11-09 2020-11-09 Address matching method based on machine learning classification algorithm

Country Status (1)

Country Link
CN (1) CN112231429A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618867A (en) * 2022-10-27 2023-01-17 中科星图数字地球合肥有限公司 Address error correction method, device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326233A (en) * 2015-06-18 2017-01-11 阿里巴巴集团控股有限公司 Address prompting method and device
CN109033086A (en) * 2018-08-03 2018-12-18 银联数据服务有限公司 A kind of address resolution, matched method and device
CN109255565A (en) * 2017-07-14 2019-01-22 菜鸟智能物流控股有限公司 Address attribution identification and logistics task distribution method and device
CN109684440A (en) * 2018-12-13 2019-04-26 北京惠盈金科技术有限公司 Address method for measuring similarity based on level mark
CN110019575A (en) * 2017-08-04 2019-07-16 北京京东尚科信息技术有限公司 The method and apparatus that geographical address is standardized
CN110895651A (en) * 2018-08-23 2020-03-20 北京京东金融科技控股有限公司 Address standardization processing method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210115