CN110019575A - Method and apparatus for standardizing geographical addresses - Google Patents

Method and apparatus for standardizing geographical addresses


Publication number
CN110019575A
Authority
CN
China
Prior art keywords
address
geographical
vector
library
standardized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710661302.7A
Other languages
Chinese (zh)
Inventor
梅尚健
罗尚勇
游正朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201710661302.7A
Publication of CN110019575A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0633 Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635 Processing of requisition or of purchase orders

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Remote Sensing (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method and an apparatus for standardizing geographical addresses and relates to the field of computer technology. A specific embodiment of the method includes: receiving a geographical address to be processed; searching a predefined standardized address library for the standard address with the highest similarity to the geographical address to be processed, taking that address as the first standard address and the corresponding similarity as the first similarity; and, when the first similarity is greater than a predetermined first threshold, determining the first standard address to be the standardized address of the geographical address to be processed. The embodiment has high accuracy and processing efficiency and is widely applicable.

Description

Method and apparatus for standardizing geographical addresses
Technical field
The present invention relates to the field of computer technology, and in particular to a method and an apparatus for standardizing geographical addresses.
Background
Counting, analyzing, and mining users' geographical addresses can yield a great deal of data with high commercial value. A user's geographical address is usually a detailed address typed in by the user, and different people may understand the same address differently, so what users enter for one and the same address is often not uniform and its format varies widely. In addition, some people deliberately add noise to the detailed address they fill in so as to avoid interception by risk-control rules. As a result, the same address may appear as many different address strings, which makes address recognition difficult and makes subsequent address-dimension analysis and the design of address-dimension indicators harder.
Therefore, when acquiring and recognizing users' geographical addresses, the addresses entered by users need to be standardized, that is, converted into a defined canonical form. At present, geographical addresses are standardized with rule-based methods. Such a method relies on address hierarchy rules: it divides an address into levels such as province, city, district, street, and unit, and then segments the address to obtain the final standardization result.
In the course of making the present invention, the inventors found at least the following problems in the prior art: existing standardization methods have low applicability, accuracy, and processing efficiency. For abnormal cases that occur in geographical addresses, a prior-art method needs an additional special rule for each abnormal case before it can standardize the address, which usually demands a great deal of manual intervention and rule configuration and wastes considerable resources.
Summary of the invention
In view of this, embodiments of the present invention provide a method and an apparatus for standardizing geographical addresses that have high accuracy and processing efficiency and are widely applicable.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for standardizing geographical addresses is provided, comprising:
receiving a geographical address to be processed;
searching a predefined standardized address library for the standard address with the highest similarity to the geographical address to be processed, taking that address as the first standard address and the corresponding similarity as the first similarity, and, when the first similarity is greater than a predetermined first threshold, determining the first standard address to be the standardized address of the geographical address to be processed.
The method for standardizing geographical addresses provided by the embodiments of the present invention further comprises:
when the first similarity is not greater than the predetermined first threshold, generating an address vector of the geographical address to be processed;
searching the standard address vector library corresponding to the standardized address library for the standard address vector with the highest similarity to the geographical address to be processed, taking the corresponding similarity as the second similarity, and, when the second similarity is greater than a predetermined second threshold, determining the second standard address corresponding to the standard address vector found to be the standardized address of the geographical address.
The method for standardizing geographical addresses provided by the embodiments of the present invention further comprises:
when the second similarity is not greater than the second threshold and the weighted average of the first similarity and the second similarity is greater than a predetermined third threshold, determining the standard address corresponding to the greater of the first similarity and the second similarity to be the standardized address of the geographical address to be processed.
Further, the standardized address library is obtained through the following steps:
receiving a sample geographical address set;
preprocessing the sample geographical address set;
reading, based on an address dictionary, the address keywords of each geographical address in the set and their corresponding levels; removing, based on the levels of the address keywords, the address keywords that do not meet the level requirements; and obtaining, for each geographical address, a hierarchical standardized address separated by the address keywords, thereby obtaining the standardized address library composed of these hierarchical standardized addresses.
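The library-building steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the dictionary contents, the separator, and in particular the "level requirement" (assumed here to mean that level numbers must strictly increase from left to right) are hypothetical choices made for the example.

```python
# Hypothetical address dictionary: address keyword -> level number (1 = coarsest).
ADDRESS_DICT = {
    "Beijing": 1,
    "Hebei": 1,
    "Chaoyang District": 3,
    "Jianguo Road": 4,
}


def to_hierarchical_address(words, address_dict=ADDRESS_DICT, sep="|"):
    """Keep known address keywords whose level increases left to right.

    The level requirement used here (strictly increasing level numbers)
    is an assumption made for this sketch.
    """
    kept = []
    last_level = 0
    for word in words:
        level = address_dict.get(word)
        if level is None:
            continue  # not an address keyword, e.g. noise added by the user
        if level <= last_level:
            continue  # violates the assumed level requirement
        kept.append(word)
        last_level = level
    return sep.join(kept)


def build_library(samples):
    """Return a sorted, de-duplicated list of hierarchical standardized addresses."""
    return sorted({to_hierarchical_address(s) for s in samples} - {""})
```

Each sample is expected to be an already preprocessed and segmented list of address words.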
Further, generating the address vector of the geographical address to be processed comprises:
looking up, in the word vector library corresponding to the standardized address library, the word vectors corresponding to the address words of the geographical address to be processed;
generating the address vector of the geographical address to be processed in a bag-of-words manner, according to the word vectors corresponding to the geographical address to be processed and based on a preset hierarchical address word dictionary.
Further, the word vector library is obtained through the following step:
converting the address words of all geographical addresses in the standardized address library into word vectors, where the value of a word vector is the weight coefficient of the corresponding address word and the weight coefficient is assigned based on the contextual semantics of the address word, thereby obtaining the word vector library corresponding to the standardized address library.
Further, the address vector library is obtained through the following step:
generating the address vectors of the geographical addresses in the standardized address library in a bag-of-words manner, according to the word vectors corresponding to those geographical addresses and based on the preset hierarchical address word dictionary, thereby obtaining the address vector library composed of address vectors.
Further, the preset hierarchical address word dictionary is a multi-dimensional vector layered by address level, in which each layer represents one address level and has multiple dimensions. Each layer contains all the address word types that exist in the standardized address library for the address level corresponding to that layer, and each address word type is represented by one dimension of that layer.
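A layered bag-of-words address vector of this kind might be built as in the sketch below. The layer contents and weight coefficients are invented for illustration, and cosine similarity is an assumption: the text does not say which similarity measure is applied to address vectors.

```python
import math

# Hypothetical hierarchical address word dictionary: one layer per address level,
# one dimension per address word type seen at that level.
LAYERS = [
    ["Beijing", "Hebei"],                        # level 1
    ["Chaoyang District", "Haidian District"],   # level 2
    ["Jianguo Road", "Zhongguancun Street"],     # level 3
]

# Hypothetical word vector library: address word -> weight coefficient.
WORD_WEIGHTS = {
    "Beijing": 0.2, "Hebei": 0.3,
    "Chaoyang District": 0.5, "Haidian District": 0.5,
    "Jianguo Road": 0.9, "Zhongguancun Street": 0.8,
}


def address_vector(words, layers=LAYERS, weights=WORD_WEIGHTS):
    """Concatenate the layers; a dimension holds the word's weight if the word occurs."""
    present = set(words)
    return [
        weights.get(word_type, 0.0) if word_type in present else 0.0
        for layer in layers
        for word_type in layer
    ]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because every address is projected onto the same fixed set of dimensions, two addresses can be compared directly regardless of how many address words each contains.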
Further, in the address vector library each address vector corresponds to n records. The primary keys of these n records correspond respectively to the n levels of address representation of the address to which the address vector belongs, where n is less than or equal to the number of levels of the address. Among the n levels of address representation, the level-1 representation is the address itself, and each subsequent level's representation is the address that remains after removing, from the previous level's representation, the address word of the level with the highest level number.
Searching the standard address vector library corresponding to the standardized address library for the standard address vector with the highest similarity to the geographical address to be processed comprises:
looking up the corresponding primary key in the standard address vector library level by level, using the n levels of address representation of the geographical address to be processed, until a matching primary key is found and the address vector in its record is obtained; then computing the similarity between that address vector and the address vector of the geographical address to be processed.
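The level-by-level primary key lookup can be illustrated with a short sketch. It assumes that address words are ordered from the coarsest to the finest level, so that the word with the highest level number is the last one, and it models the vector library as a plain dictionary from primary key to address vector. Both choices, along with the "|" separator, are hypothetical.

```python
def level_representations(words, n=None, sep="|"):
    """Return the n-level address representations, most specific first.

    Level 1 is the full address; each further level drops the last
    (finest-level) address word of the previous level.
    """
    n = len(words) if n is None else min(n, len(words))
    return [sep.join(words[: len(words) - i]) for i in range(n)]


def lookup(words, vector_library):
    """Try each level's key in turn and return the first (key, vector) hit, else None."""
    for key in level_representations(words):
        vector = vector_library.get(key)
        if vector is not None:
            return key, vector
    return None
```

Falling back to coarser keys lets an address whose finest level is unknown or misspelled still reach a record at the district or city level.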
Further, the first standard address is found by searching the inverted index corresponding to the standardized address library, where the inverted index is built from the address word strings formed by the address words of all geographical addresses in the standardized address library.
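As a rough illustration of such an inverted index, the sketch below maps each address word to the identifiers of the standard addresses that contain it. The identifiers and the "any shared word" candidate rule are assumptions for the example; a real search engine maintains a far richer index.

```python
from collections import defaultdict


def build_inverted_index(library):
    """library: mapping of address id -> list of address words."""
    index = defaultdict(set)
    for addr_id, words in library.items():
        for word in words:
            index[word].add(addr_id)
    return index


def candidates(index, query_words):
    """Return ids of addresses that share at least one address word with the query."""
    found = set()
    for word in query_words:
        found |= index.get(word, set())
    return found
```

Only the candidates returned this way need to be scored for similarity, which is what keeps the search-engine stage fast.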
Further, receiving the geographical address to be processed also comprises preprocessing the geographical address to be processed,
where the preprocessing includes conversion between simplified and traditional Chinese, conversion between full-width and half-width characters, removal of punctuation marks, and removal of special characters,
and where, when the geographical address to be processed is expressed in Chinese, the preprocessing further includes segmenting the geographical address to be processed to obtain its corresponding address words.
To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for standardizing geographical addresses is provided, comprising:
a receiving module, configured to receive a geographical address to be processed;
a search module, configured to search a predefined standardized address library for the standard address with the highest similarity to the geographical address to be processed, take that address as the first standard address and the corresponding similarity as the first similarity, and, when the first similarity is greater than a predetermined first threshold, determine the first standard address to be the standardized address of the geographical address to be processed.
The apparatus for standardizing geographical addresses provided by the embodiments of the present invention further comprises:
a vector matching module, configured to: when the first similarity is not greater than the predetermined first threshold, generate an address vector of the geographical address to be processed; search the standard address vector library corresponding to the standardized address library for the standard address vector with the highest similarity to the geographical address to be processed, taking the corresponding similarity as the second similarity; and, when the second similarity is greater than a predetermined second threshold, determine the second standard address corresponding to the standard address vector found to be the standardized address of the geographical address.
The apparatus for standardizing geographical addresses provided by the embodiments of the present invention further comprises:
an output module, configured to: when the second similarity is not greater than the second threshold and the weighted average of the first similarity and the second similarity is greater than a predetermined third threshold, determine the standard address corresponding to the greater of the first similarity and the second similarity to be the standardized address of the geographical address to be processed.
The apparatus for standardizing geographical addresses provided by the embodiments of the present invention further comprises:
a first configuration module, configured to obtain the standardized address library through the following steps:
receiving a sample geographical address set;
preprocessing the sample geographical address set;
reading, based on an address dictionary, the address keywords of each geographical address in the set and their corresponding levels; removing, based on the levels of the address keywords, the address keywords that do not meet the level requirements; and obtaining, for each geographical address, a hierarchical standardized address separated by the address keywords, thereby obtaining the standardized address library composed of these hierarchical standardized addresses.
Further, the vector matching module is further configured to look up, in the word vector library corresponding to the standardized address library, the word vectors corresponding to the address words of the geographical address to be processed, and then to generate the address vector of the geographical address to be processed in a bag-of-words manner, according to those word vectors and based on a preset hierarchical address word dictionary.
The apparatus for standardizing geographical addresses provided by the embodiments of the present invention further comprises:
a second configuration module, configured to obtain the word vector library through the following step:
converting the address words of all geographical addresses in the standardized address library into word vectors, where the value of a word vector is the weight coefficient of the corresponding address word and the weight coefficient is assigned based on the contextual semantics of the address word, thereby obtaining the word vector library corresponding to the standardized address library.
The apparatus for standardizing geographical addresses provided by the embodiments of the present invention further comprises:
a third configuration module, configured to obtain the address vector library through the following step:
generating the address vectors of the geographical addresses in the standardized address library in a bag-of-words manner, according to the word vectors corresponding to those geographical addresses and based on the preset hierarchical address word dictionary, thereby obtaining the address vector library composed of address vectors.
Further, in the address vector library each address vector corresponds to n records. The primary keys of these n records correspond respectively to the n levels of address representation of the address to which the address vector belongs, where n is less than or equal to the number of levels of the address. Among the n levels of address representation, the level-1 representation is the address itself, and each subsequent level's representation is the address that remains after removing, from the previous level's representation, the address word of the level with the highest level number.
The vector matching module is further configured to look up the corresponding primary key in the standard address vector library level by level, using the n levels of address representation of the geographical address to be processed, until a matching primary key is found and the address vector in its record is obtained, and then to compute the similarity between that address vector and the address vector of the geographical address to be processed.
Further, the search module is further configured to find the first standard address by searching the inverted index corresponding to the standardized address library, where the inverted index is built from the address word strings formed by the address words of all geographical addresses in the standardized address library.
Further, the receiving module is further configured to preprocess the geographical address to be processed, where the preprocessing includes conversion between simplified and traditional Chinese, conversion between full-width and half-width characters, and removal of special characters, and where, when the geographical address to be processed is expressed in Chinese, the preprocessing further includes segmenting the geographical address to be processed to obtain its corresponding address words.
To achieve the above object, according to another aspect of the embodiments of the present invention, an electronic device for standardizing geographical addresses is provided, comprising:
one or more processors;
a storage device for storing one or more programs,
where, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for standardizing geographical addresses provided by the embodiments of the present invention.
To achieve the above object, according to another aspect of the embodiments of the present invention, a computer-readable medium is provided on which a computer program is stored, where the program, when executed by a processor, implements the method for standardizing geographical addresses provided by the embodiments of the present invention.
The method and apparatus for standardizing geographical addresses provided by the embodiments of the present invention match the geographical address to be processed through a search engine to obtain a standardized address and, when the matching result does not meet the requirement, match a standardized address through address vectors. Compared with prior-art standardization methods, the method of the invention can effectively standardize addresses with special conditions such as typos, redundant characters, and synonyms; it has high accuracy and processing efficiency and is widely applicable. Moreover, the method combines two matching approaches: address vector matching compensates for the shortcomings of search engine matching and further improves matching accuracy.
Further effects of the above non-conventional optional implementations are described below in conjunction with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow diagram of the method for standardizing geographical addresses provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of a specific embodiment of the method for standardizing geographical addresses provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of the offline configuration process of the method for standardizing geographical addresses provided by an embodiment of the present invention;
Fig. 4 is an application flow diagram of the method for standardizing geographical addresses provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the apparatus for standardizing geographical addresses provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings. They include various details of the embodiments to aid understanding, and these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
An embodiment of the present invention provides a method for standardizing geographical addresses. As shown in Fig. 1, the method includes step S101 and step S102.
First, in step S101, a geographical address to be processed is received. Then, in step S102, a predefined standardized address library is searched for the standard address with the highest similarity to the geographical address to be processed; that address is taken as the first standard address and the corresponding similarity as the first similarity. When the first similarity is greater than a predetermined first threshold, the first standard address is determined to be the standardized address of the geographical address to be processed.
The standardized address library is created in the offline configuration process provided by the invention, which is described in detail later in this embodiment. The library contains a large number of collected standard geographical addresses, which are obtained by applying standardized hierarchical processing to the original geographical addresses collected.
In the present invention, receiving the geographical address to be processed in step S101 also includes preprocessing it. The preprocessing includes conversion between simplified and traditional Chinese, conversion between full-width and half-width characters, removal of punctuation marks, removal of special characters, and similar steps. When the geographical address to be processed is expressed in Chinese, the preprocessing further includes segmenting it to obtain its corresponding address words.
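The character-level part of this preprocessing might look like the sketch below. It uses Unicode NFKC normalization to fold full-width characters into their half-width forms, and a tiny, hypothetical traditional-to-simplified table that stands in for a complete conversion mapping. Word segmentation is not shown.

```python
import unicodedata

# Tiny hypothetical traditional-to-simplified table; a real system needs a full mapping.
T2S = str.maketrans({"區": "区", "號": "号", "東": "东"})


def preprocess(address):
    """Normalize width, convert traditional characters, and drop punctuation and symbols."""
    text = unicodedata.normalize("NFKC", address)  # full-width -> half-width
    text = text.translate(T2S)                      # traditional -> simplified
    # Drop punctuation (P*), symbols (S*), separators (Z*), and control characters (C*).
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PSZC")
```

Treating every symbol and space as removable is a simplification; a production rule set would likely keep some characters, such as hyphens inside building numbers.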
In the present invention, the segmentation tool may be the ansj segmenter. The segmentation dictionary may be obtained by crawling or parsing an authoritative address dictionary; alternatively, the original addresses obtained by crawling or parsing may be converted into a specific address format according to specific needs (for example, users' habits when filling in addresses) to form the segmentation dictionary. When building the segmentation dictionary, the invention selects an authoritative address dictionary so that segmentation based on it is as accurate as possible, which in turn ensures the accuracy of the subsequent standardization of Chinese geographical addresses.
In step S102, a search engine searches the standardized address library for the geographical address most similar to the received geographical address and returns the most similar first standard address together with its first similarity. The first threshold may be an empirical value obtained through algorithms and data verification and is typically a decimal between 0 and 1 inclusive. When the first similarity is greater than the first threshold, the first standard address is considered to meet the matching requirement and is output as the standardized address of the geographical address to be processed. The search engine weights geographical addresses by term frequency-inverse document frequency (TF-IDF), so that word frequency is properly taken into account when searching for the most similar first standard address; the returned result better matches the actual situation, and the whole process is fast and efficient. The search engine may be implemented with ElasticSearch, a distributed, multi-user full-text search engine.
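To make the TF-IDF weighting concrete, the following self-contained sketch scores a query against a small, non-empty list of standard addresses. It is not ElasticSearch's scoring function: the smoothed IDF formula and the use of cosine similarity are assumptions chosen so that the score falls between 0 and 1, as the thresholds described above require.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of address word lists. Returns (sparse TF-IDF vectors, idf table)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}  # smoothed IDF
    vectors = [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def best_match(query_words, docs):
    """Return (index, similarity) of the most similar address in a non-empty docs list."""
    vectors, idf = tfidf_vectors(docs)
    query = {w: tf * idf[w] for w, tf in Counter(query_words).items() if w in idf}
    scores = [cosine(query, v) for v in vectors]
    best = max(range(len(docs)), key=scores.__getitem__)
    return best, scores[best]
```

Words that occur in many standard addresses, such as a province name, receive a low IDF and therefore contribute less to the score than rarer street-level words.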
As shown in Fig. 2, in a specific embodiment the method for standardizing geographical addresses provided by the embodiment of the present invention further includes step S103. When the first similarity is not greater than the predetermined first threshold, step S103 is executed. First, the address vector of the geographical address to be processed is generated. Then the standard address vector library corresponding to the standardized address library is searched for the standard address vector with the highest similarity to the geographical address to be processed, and the corresponding similarity is taken as the second similarity. When the second similarity is greater than a predetermined second threshold, the second standard address corresponding to the standard address vector found is determined to be the standardized address of the geographical address.
The standard address vector library corresponding to the standardized address library is created in the offline configuration process provided by the invention and contains the address vectors of all standard addresses in the standardized address library. In step S103, the address vector of the geographical address to be processed is matched against the address vectors in the address vector library to obtain the most similar second standard address based on address vector matching. When the second similarity is greater than the second threshold, the second standard address is considered to meet the matching requirement and is output as the standardized address of the geographical address to be processed. The address vectors in the standard address vector library are constructed from the contextual semantics of the address words, so that contextual semantics are properly taken into account when the most similar standard address is matched by address vector, and the returned result is more accurate.
In the present invention, step S102 matches the geographical address to be processed through the search engine, and when the matching result does not meet the requirement, step S103 matches it through address vectors. By combining the two approaches, address vector matching, which takes contextual semantics into account, compensates for possible shortcomings of the word-frequency-based search engine matching and improves matching accuracy.
Because matching through the search engine requires less computation and is faster than matching through address vectors, the invention first performs search-engine-based matching on the received geographical address to be processed and resorts to address vector matching only when the result does not meet the requirement, thereby maximizing matching speed while preserving matching accuracy.
Of course, because search engine matching and address vector matching are two relatively independent processes, in other embodiments of the invention address vector matching may be performed first, with search engine matching performed when its result does not meet the requirement. Alternatively, the two matching approaches may be executed in parallel.
The method for standardizing a geographical address provided by the embodiment of the present invention further includes step S104. Step S104 is executed when the second similarity is not greater than the second threshold and the weighted average of the first similarity and the second similarity is greater than a predetermined third threshold. In step S104, the standard address corresponding to the greater of the first similarity and the second similarity is determined to be the standardized address of the geographical address to be processed. In this step, the weights of the first similarity and the second similarity can be learned by a logistic regression model: the model learns the importance of the first similarity and the second similarity obtained by the two matching methods, and these importances are used as the two weights.
In the present invention, the first threshold, the second threshold, and the third threshold may be empirical values obtained through the respective algorithms and data verification, and are typically decimal values greater than or equal to 0 and less than or equal to 1.
When the results of both step S102 and step S103 fail to meet the requirement, step S104 further evaluates the similarities of the two results and outputs the result that meets the corresponding requirement as the standard address of the geographical address to be processed, so as to widen as far as possible the range of geographical addresses that can be standardized.
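The three-stage decision of steps S102 to S104 can be illustrated with a minimal Python sketch. The function name `standardize`, the two callables `search_match` and `vector_match`, and the default weights are assumptions made for illustration only; the text states that the weights are learned by logistic regression and that the thresholds are empirical values in [0, 1].

```python
from typing import Callable, Optional, Tuple

Match = Tuple[str, float]  # (standard address, similarity in [0, 1])


def standardize(
    query: str,
    search_match: Callable[[str], Match],  # hypothetical stand-in for step S102
    vector_match: Callable[[str], Match],  # hypothetical stand-in for step S103
    t1: float,
    t2: float,
    t3: float,
    w1: float = 0.5,  # placeholder weights; the text learns them by logistic regression
    w2: float = 0.5,
) -> Optional[str]:
    """Return the standardized address, or None if no stage is satisfied."""
    # Step S102: cheaper search-engine match first.
    addr1, sim1 = search_match(query)
    if sim1 > t1:
        return addr1
    # Step S103: address-vector match, only when S102 is not good enough.
    addr2, sim2 = vector_match(query)
    if sim2 > t2:
        return addr2
    # Step S104: weighted average of both similarities against the third threshold.
    weighted = (w1 * sim1 + w2 * sim2) / (w1 + w2)
    if weighted > t3:
        return addr1 if sim1 >= sim2 else addr2
    return None
```

Passing the two matchers as callables keeps the more expensive vector match lazy, which mirrors the ordering argument made above for running the search engine first.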
As shown in Fig. 3, the offline configuration process provided by the present invention includes: creating the standardized address library, creating the word vector library and the address vector library corresponding to the standardized address library, creating the inverted index of the standardized address library, and creating the word segmentation dictionary.
During the offline configuration process, the standardized address library is obtained through the following steps. First, a sample geographical address set is received. The sample set can be obtained from addresses collected over a certain period, for example the delivery addresses entered by users on an e-commerce website over half a year.
Then, the sample geographical address set is preprocessed, where preprocessing includes conversion between traditional and simplified Chinese, conversion between full-width and half-width characters, removal of punctuation marks, and removal of special characters.
For each preprocessed sample geographical address, the address keys of the geographical address and their corresponding levels are read based on an address dictionary. The address dictionary consists of address keys and their corresponding level serial numbers. Taking a Chinese address dictionary as an example, the corresponding address dictionary can be obtained from the Chinese address level keyword table shown in Table 1 below and from specific geographical address samples.
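As an illustration of this preprocessing step, the following Python sketch converts full-width characters to half-width and strips punctuation and special characters. The three-entry traditional-to-simplified table `T2S_SAMPLE` is a toy assumption (a real system would need a complete conversion table), and treating whitespace and control characters as "special characters" is likewise an assumption.

```python
import unicodedata

# Toy traditional-to-simplified table, for illustration only (assumption).
T2S_SAMPLE = {"號": "号", "區": "区", "單": "单"}


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:  # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(text: str) -> str:
    """Simplify, convert to half-width, then drop punctuation and special characters."""
    text = "".join(T2S_SAMPLE.get(ch, ch) for ch in text)
    text = to_half_width(text)
    # Unicode major categories: P = punctuation, S = symbol,
    # Z = separator (spaces), C = control/format characters.
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PSZC")
```

For instance, `preprocess("杭州市，上城區２號！")` yields `"杭州市上城区2号"` under the toy table above.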
Table 1
From the Chinese address level keyword table in Table 1 above and specific geographical address samples, the corresponding address dictionary can be extracted as a sequence of address keys, for example: "province, city, county, district, town, village, shop, office, lane, room, building, tower, road, courtyard, garden, unit, section, street, bridge, mountain, river, station, institute, ...", where each address key has a corresponding level serial number.
The address keys of each geographical address in the sample geographical address set, and their corresponding levels, are read based on the address dictionary. For example, for the sample geographical address "Zhejiang Province, Hangzhou City, Shangcheng District, Wangjiang Street, Phoenix City, No. a, Unit b, xxx", the address keys and corresponding levels read based on the address dictionary are: "province [2] city (市) [234] city (城) [7] district [248] street [6] city (城) [7] No. [8] unit [8]", where province, city, and so on are address keys and the numbers in brackets [] are the corresponding candidate level serial numbers.
Next, address keys that do not satisfy the level requirement are removed based on the levels of the address keys, yielding the level-standardized address of the geographical address, split at its address keys, and thereby a standardized address library composed of level-standardized addresses. In the example above, a well-formed address only runs from lower to higher level serial numbers, that is, from 1 to 9 in increasing order. Level values 2 and 4 for city (市) therefore do not meet the requirement, and level 3 is retained. The level value of the first city (城) key is 7, which does not meet the requirement at that position, so the key is merged into the district name. Level values 2 and 8 for district do not meet the requirement, and level 4 is retained. After the non-conforming address keys are removed, the address keys that satisfy the level requirement are "province [2] city (市) [3] district [4] street [6] city (城) [7] No. [8] unit [8]", and the level-standardized address split at the address keys is: "Zhejiang Province | Hangzhou City | Shangcheng District | Wangjiang Street | Phoenix City | No. 2 | Unit 4 | 201", where each "|" (a space in the original Chinese) marks a split at an address key.
The main purpose of this level standardization during construction of the standardized address library is to parse the levels of the sample addresses and remove noise. If a sample address is already standardized, the output after standardization is unchanged.
When the geographical address to be matched and the corresponding standardized address library are expressed in Chinese, the offline configuration process further includes creating a word segmentation dictionary. The quality of the segmentation result depends on the coverage of the dictionary. Therefore, to make address segmentation as accurate as possible, the present invention crawls or parses authoritative address dictionaries and converts the obtained address content into an address format that matches user habits. For example, for the address "Village A Villagers' Committee", users usually write "Village A" rather than "Village A Villagers' Committee" when filling in a village-level address, so "Village A Villagers' Committee" is converted to "Village A" when the segmentation dictionary is created. This brings the dictionary closer to user habits, so that segmentation based on it is more accurate.
After the segmentation dictionary is created, the addresses in the standardized address library are segmented based on it to obtain the address words of each address. The ansj segmentation tool may be used as the segmentation tool.
After segmentation is complete, a search engine is used to build the inverted index of the standardized address library from the address word strings composed of the address words of all geographical addresses in the library, forming the inverted index library used for online matching. The search engine may be implemented with ElasticSearch, a distributed, multi-user-capable full-text search engine.
The offline configuration process further includes creating the word vector library, which is obtained through the following step:
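To make the idea of an inverted index over address words concrete, here is a small in-memory Python sketch. It is not how ElasticSearch works internally: the function names are hypothetical, and the Jaccard overlap used as the similarity score is an assumption standing in for the search engine's word-frequency-based relevance score.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def build_inverted_index(segmented: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Map each address word to the ids of the standard addresses containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for addr_id, words in segmented.items():
        for word in words:
            index[word].add(addr_id)
    return index


def best_match(
    index: Dict[str, Set[int]],
    segmented: Dict[int, List[str]],
    query_words: List[str],
) -> Tuple[Optional[int], float]:
    """Return (address id, score) of the best candidate, or (None, 0.0)."""
    # Only addresses sharing at least one word with the query are candidates.
    candidates: Set[int] = set()
    for word in query_words:
        candidates |= index.get(word, set())
    query = set(query_words)
    best_id, best_score = None, 0.0
    for addr_id in sorted(candidates):
        words = set(segmented[addr_id])
        score = len(query & words) / len(query | words)  # Jaccard overlap (assumption)
        if score > best_score:
            best_id, best_score = addr_id, score
    return best_id, best_score
```

The returned score plays the role of the first similarity that is compared against the first threshold during online matching.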
Using a tool that converts words into vector form, the address words of all geographical addresses in the standardized address library are converted into word vectors, and the word vectors are saved in a database, thereby obtaining the word vector library corresponding to the standardized address library. The tool may be word2vec: after the address words are input to word2vec for training, a weight coefficient is obtained for each address word. The value of the word vector is the weight coefficient of the address word, which word2vec assigns based on the contextual semantics of the address word.
Continuing the example above, the address words "Zhejiang Province", "Hangzhou City", and "Shangcheng District" yield the following word vectors after word2vec training: Zhejiang Province: 0.98, Hangzhou City: 0.83, Shangcheng District: 0.88, where the value after each colon is the word vector of the corresponding word. The address words and word vectors are saved in the database as key-value pairs.
The offline configuration process further includes creating the address vector library, which is obtained through the following step:
According to the word vectors corresponding to each geographical address in the standardized address library and a preset hierarchical address word dictionary, the address vector of each geographical address in the standardized address library is generated in a bag-of-words manner, thereby obtaining an address vector library composed of address vectors.
Before the address vector library is created, the hierarchical address word dictionary must first be created. This dictionary is a multi-dimensional vector layered by address level, in which each layer represents one address level and has multiple dimensions. Each layer contains all address word types that exist in the standardized address library for the corresponding address level, and each address word type is represented by one dimension of that layer. For example, in a standardized address library containing only Chinese geographical addresses, suppose there are 50 address words at the province level, 200 at the city level, 500 at the district level, 1000 at the town level, and 5000 at the residential community level. Then, in the hierarchical address word dictionary corresponding to this library, the province level is represented by a 50-dimensional vector, the city level by a 200-dimensional vector, the district level by a 500-dimensional vector, the town level by a 1000-dimensional vector, and the residential community level by a 5000-dimensional vector, so the size of the dictionary is 6750 dimensions (50 + 200 + 500 + 1000 + 5000).
Continuing the example, if the address word types at the province level in the standardized address library include "Zhejiang Province", "Hebei Province", "Shandong Province", and so on, then the dimension attribute values of the province level in the corresponding hierarchical address word dictionary are (Zhejiang Province, Hebei Province, Shandong Province, ...), with the address word types arranged by serial number.
Generating the address vector in a bag-of-words manner is specifically as follows: according to the address words and word vectors of a geographical address in the standardized address library, the positions (dimensions) of all address words of the geographical address in the hierarchical address word dictionary are determined, and the address vector of the geographical address is then constructed. The address vector has the same levels and dimensions as the hierarchical address word dictionary; at the dimension corresponding to each of its address words it takes the value of that word's word vector, and all remaining dimensions are set to 0. Continuing the example, if the province-level address word of a geographical address is "Shandong Province" with word vector "0.77", and "Shandong Province" has serial number 3 in the hierarchical address word dictionary, then the province-level part of its address vector is (0, 0, 0.77, ...). The vectors of the other levels are obtained in the same way, giving a sparse vector of 6750 dimensions.
The geographical addresses in the constructed standardized address library and their corresponding address vectors are saved in the database as key-value pairs, forming the address vector library composed of address vectors.
In the present invention, to avoid a full table scan of the database during the vector-based matching provided by the method, multiple hierarchical address prefixes are saved as primary keys in the address vector library during the offline configuration process.
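The bag-of-words construction just described can be sketched in a few lines of Python. The function names and the tiny two-level dictionary are illustrative assumptions; the scalar word weights follow the text's own example values.

```python
from typing import Dict, List, Sequence, Tuple

LevelWord = Tuple[str, str]  # (level name, address word)


def build_dictionary(
    levels: Sequence[Tuple[str, Sequence[str]]]
) -> Tuple[Dict[LevelWord, int], int]:
    """Assign one dimension to every (level, word type); levels are laid out in order."""
    position: Dict[LevelWord, int] = {}
    offset = 0
    for level_name, words in levels:
        for i, word in enumerate(words):
            position[(level_name, word)] = offset + i
        offset += len(words)
    return position, offset  # offset is the total dictionary size


def address_vector(
    address_words: Sequence[LevelWord],
    word_weights: Dict[str, float],
    position: Dict[LevelWord, int],
    size: int,
) -> List[float]:
    """Place each word's weight at its dimension; every other dimension stays 0."""
    vec = [0.0] * size
    for level_name, word in address_words:
        idx = position.get((level_name, word))
        if idx is not None and word in word_weights:
            vec[idx] = word_weights[word]
    return vec
```

With a province level of ("Zhejiang", "Hebei", "Shandong") and a weight of 0.77 for "Shandong", the province part of the resulting vector is (0, 0, 0.77), matching the example in the text. A dense list is used here for readability; at 6750 dimensions a sparse representation would be the more natural choice.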
In the address vector library, each address vector corresponds to n records, whose primary keys correspond respectively to the n level representations of the address, where n is less than or equal to the number of levels of the address. Among the n level representations, the level-1 representation is the address itself, and each subsequent representation is the previous representation with the address word of the highest level serial number removed; the number of address words removed at one step is not limited to one.
For example, consider the geographical address "Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2, Unit 4, No. 2", whose address vector is addressVec. In the present invention, the geographical address and its address vector are stored in an HBase database.
Five records are stored in the HBase database, in the format <primary key (rowkey), addressVec>. The five records are:
<rowkey1, addressVec>;
<rowkey2, addressVec>;
<rowkey3, addressVec>;
<rowkey4, addressVec>;
<rowkey5, addressVec>.
Here, rowkey1 stores hash(Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2, Unit 4, No. 2) + "_" + hash(addressVec), where "_" is a concatenation symbol and hash is a hash function that converts the address and the vector value into a numeric representation. That is, primary key rowkey1 corresponds to the level-1 address representation "Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2, Unit 4, No. 2".
rowkey2 stores hash(Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2, Unit 4) + "_" + hash(addressVec). That is, primary key rowkey2 corresponds to the level-2 address representation "Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2, Unit 4", which removes from the level-1 representation the address word "No. 2" of the highest level serial number. The following primary keys follow the same pattern and are not described again.
rowkey3 stores hash(Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2) + "_" + hash(addressVec);
rowkey4 stores hash(Zhejiang Province, Hangzhou City, District A, Street B, C City) + "_" + hash(addressVec);
rowkey5 stores hash(Zhejiang Province, Hangzhou City, District A, Street B) + "_" + hash(addressVec).
In step S103, searching the standard address vector library corresponding to the standardized address library for the standard address vector with the highest similarity to the geographical address to be processed specifically includes:
Using the n level representations of the geographical address to be processed, the corresponding primary keys are looked up level by level until an address vector corresponding to a primary key is obtained, and the similarity between that address vector and the address vector of the geographical address to be processed is calculated. Here n is less than or equal to the number of levels of the address. Among the n level representations, the level-1 representation is the geographical address to be processed itself, and each subsequent representation is the previous representation with the address word of the highest level serial number removed; the number of address words removed at one step is not limited to one.
Continuing the example above, suppose the address to be processed is "Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2, Unit 5". When address vector matching is performed, the HBase database is queried with this address. The address is first converted to hash(Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2, Unit 5), that is, the hash of the level-1 representation of the address to be processed, and all records whose rowkey has this hash value as a prefix are queried as the records of the most similar addresses; the address vectors are taken from these records for similarity calculation;
If no rowkey has this hash value as a prefix, all records whose rowkey has hash(Zhejiang Province, Hangzhou City, District A, Street B, C City, No. 2) as a prefix are queried, that is, the hash of the level-2 representation of the address to be processed, and so on until the most similar address is found; otherwise, no similar address exists.
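The prefix-keyed storage and the level-by-level lookup can be sketched as follows. This is an in-memory stand-in: a Python dict replaces the HBase table, a linear `startswith` filter replaces an HBase prefix scan, MD5 is an assumed choice of hash function (the text does not name one), and the sketch drops exactly one address word per level although the text allows more.

```python
import hashlib
from typing import Dict, List, Sequence


def h(text: str) -> str:
    """Hash helper; MD5 is an assumption, any stable hash would serve."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def level_representations(words: Sequence[str], min_words: int) -> List[str]:
    """Level 1 is the full address; each further level drops the finest remaining word."""
    return ["".join(words[:k]) for k in range(len(words), min_words - 1, -1)]


def make_rowkeys(words: Sequence[str], vector: Sequence[float], min_words: int) -> List[str]:
    """Build hash(prefix) + "_" + hash(addressVec) for every level representation."""
    vec_hash = h(repr(list(vector)))
    return [h(rep) + "_" + vec_hash for rep in level_representations(words, min_words)]


def lookup(table: Dict[str, List[float]], words: Sequence[str], min_words: int) -> List[List[float]]:
    """Try the query's level representations in order; return vectors at the first hit."""
    for rep in level_representations(words, min_words):
        prefix = h(rep) + "_"
        hits = [vec for key, vec in table.items() if key.startswith(prefix)]
        if hits:
            return hits
    return []  # no similar address
```

Storing the address "浙江省 杭州市 A区 B街道 C城 2号 4单元 2号" with `min_words=4` produces five rowkeys, and a query ending in "2号 5单元" misses at level 1 but hits at level 2, as in the example above.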
As shown in Fig. 4, corresponding to the offline configuration process above, the method for standardizing a geographical address provided by the present invention is applied to an e-commerce scenario in which Chinese addresses entered by users are standardized. The order address data stream entered by users is received in real time, the geographical address to be processed is obtained from it, and the geographical address to be processed is preprocessed by converting traditional characters to simplified characters, converting full-width characters to half-width characters, and removing punctuation marks and special characters.
Then, the geographical address to be processed is segmented by the ansj segmentation tool based on the segmentation dictionary created during offline configuration. Based on the address words of the geographical address to be processed, the search engine ElasticSearch searches the inverted index corresponding to the standardized address library to find the first standard address in the library that is most similar to the geographical address to be processed, together with its first similarity. When the first similarity is greater than the predetermined first threshold, the first standard address is determined to be the standardized address of the geographical address to be processed and is output.
When the first similarity is not greater than the predetermined first threshold, the address vector of the geographical address to be processed is generated. Generating this address vector includes: looking up, in the word vector library that was built for the standardized address library during offline configuration, the word vectors corresponding to the address words of the geographical address to be processed, that is, finding the address words in the word vector library that are identical to the address words of the geographical address to be processed and taking their word vectors as the word vectors of the geographical address to be processed; and then obtaining the address vector of the geographical address to be processed from the word vectors and the hierarchical address word dictionary. This process is the same as the process of constructing the address vectors of the addresses in the standardized address library during offline configuration and is not repeated here.
Then, the address vectors sharing the address prefix are looked up in the standard address vector library corresponding to the standardized address library, and cosine similarity is used to compare them, the similarity in vector space representing the degree of similarity in text semantics. The standard address vector with the highest similarity to the geographical address to be processed is found, and the corresponding similarity is taken as the second similarity. When the second similarity is greater than the predetermined second threshold, the second standard address corresponding to the standard address vector found is determined to be the standardized address of the geographical address and is output.
When the second similarity is not greater than the second threshold and the weighted average of the first similarity and the second similarity is greater than the predetermined third threshold, the standard address corresponding to the greater of the first similarity and the second similarity is determined to be the standardized address of the geographical address to be processed; otherwise, a rule-based standardization process is applied to the geographical address to be processed.
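The cosine similarity comparison used here can be written out as a short Python sketch. The helper `most_similar` and its dict-based library are hypothetical names for illustration; returning 0.0 for an all-zero vector is a convention chosen for the sketch, not something the text specifies.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); defined as 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float], library: Dict[str, Sequence[float]]
) -> Tuple[Optional[str], float]:
    """Return the standard address whose vector is closest to the query, with its similarity."""
    best_addr, best_sim = None, 0.0
    for addr, vec in library.items():
        sim = cosine_similarity(query, vec)
        if best_addr is None or sim > best_sim:
            best_addr, best_sim = addr, sim
    return best_addr, best_sim
```

The similarity returned by `most_similar` corresponds to the second similarity that is compared against the second threshold.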
The method for standardizing a geographical address provided by the embodiment of the present invention matches the geographical address to be processed by search engine to obtain a standardized address and, when the matching result does not meet the requirement, matches it by address vector. Compared with standardization methods in the prior art, the method of the present invention can effectively standardize addresses with special conditions such as wrong characters, redundant characters, and synonyms, offers higher accuracy and processing efficiency, and is widely applicable. Moreover, the method combines the two matching methods, using address vector matching to compensate for the deficiencies of search engine matching, which further improves the matching accuracy.
An embodiment of the present invention also provides a device for standardizing a geographical address. As shown in Fig. 5, the device 500 includes:
A receiving module 501, configured to receive a geographical address to be processed;
A search module 502, configured to search a predefined standardized address library for the standard address with the highest similarity to the geographical address to be processed as the first standard address, with the corresponding similarity as the first similarity, and, when the first similarity is greater than a predetermined first threshold, to determine that the first standard address is the standardized address of the geographical address to be processed.
The device for standardizing a geographical address provided by the embodiment of the present invention further includes a vector matching module. The vector matching module is configured, when the first similarity is not greater than the predetermined first threshold, to generate the address vector of the geographical address to be processed and to search the standard address vector library corresponding to the standardized address library for the standard address vector with the highest similarity to the geographical address to be processed, with the corresponding similarity as the second similarity, and, when the second similarity is greater than a predetermined second threshold, to determine that the second standard address corresponding to the standard address vector found is the standardized address of the geographical address.
The device for standardizing a geographical address provided by the embodiment of the present invention further includes an output module. The output module is configured, when the second similarity is not greater than the second threshold and the weighted average of the first similarity and the second similarity is greater than a predetermined third threshold, to determine that the standard address corresponding to the greater of the first similarity and the second similarity is the standardized address of the geographical address to be processed.
The device for standardizing a geographical address provided by the embodiment of the present invention further includes a first configuration module, which is configured to obtain the standardized address library through the following steps:
Receiving a sample geographical address set;
Preprocessing the sample geographical address set;
Reading, based on an address dictionary, the address keys of the geographical addresses in the set and their corresponding levels, and removing, based on the levels of the address keys, the address keys that do not satisfy the level requirement, to obtain the level-standardized address of each geographical address, split at its address keys, and thereby a standardized address library composed of level-standardized addresses.
In the present invention, the vector matching module is further configured to look up, in the word vector library corresponding to the standardized address library, the word vectors corresponding to the address words of the geographical address to be processed, and then to generate the address vector of the geographical address to be processed in a bag-of-words manner, according to those word vectors and based on a preset hierarchical address word dictionary.
The device for standardizing a geographical address provided by the embodiment of the present invention further includes a second configuration module, which is configured to obtain the word vector library through the following step:
Converting the address words of all geographical addresses in the standardized address library into word vectors, where the value of each word vector is the weight coefficient of the corresponding address word and the weight coefficient is assigned based on the contextual semantics of the address word, thereby obtaining the word vector library corresponding to the standardized address library.
The device for standardizing a geographical address provided by the embodiment of the present invention further includes a third configuration module, which is configured to obtain the address vector library through the following step:
Generating, in a bag-of-words manner, the address vector of each geographical address in the standardized address library according to the word vectors corresponding to the geographical address and based on a preset hierarchical address word dictionary, thereby obtaining an address vector library composed of address vectors.
In the present invention, each address vector in the address vector library corresponds to n records, whose primary keys correspond respectively to the n level representations of the address to which the address vector belongs, where n is less than or equal to the number of levels of the address. Among the n level representations, the level-1 representation is the address itself, and each subsequent representation is the previous representation with the address word of the highest level serial number removed.
The vector matching module is further configured to look up the corresponding primary keys in the standard address vector library level by level, using the n level representations of the geographical address to be processed, until a corresponding primary key is found and the address vector in its record is obtained, and then to calculate the similarity between that address vector and the address vector of the geographical address to be processed.
In the present invention, the search module is further configured to find the first standard address based on the inverted index corresponding to the standardized address library, where the inverted index is built from the address word strings composed of the address words of all geographical addresses in the standardized address library.
In the present invention, the receiving module is further configured to preprocess the geographical address to be processed, where preprocessing includes conversion between traditional and simplified Chinese, conversion between full-width and half-width characters, and removal of special characters. When the geographical address to be processed is expressed in Chinese, preprocessing also includes segmenting the geographical address to be processed to obtain its address words.
The device for standardizing a geographical address provided by the embodiment of the present invention matches the geographical address to be processed by search engine to obtain a standardized address and, when the matching result does not meet the requirement, matches it by address vector. Compared with standardization methods in the prior art, the present invention can effectively standardize addresses with special conditions such as wrong characters, redundant characters, and synonyms, offers higher accuracy and processing efficiency, and is widely applicable. Moreover, the present invention combines the two matching methods, using address vector matching to compensate for the deficiencies of search engine matching, which further improves the matching accuracy.
Referring now to Fig. 6, a schematic structural diagram is shown of a computer system X00 of an electronic device suitable for implementing embodiments of the present invention. The electronic device shown in Fig. 6 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 6, the computer system X00 includes a central processing unit (CPU) X01, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) X02 or a program loaded from a storage section X08 into a random access memory (RAM) X03. The RAM X03 also stores the various programs and data required for the operation of the system X00. The CPU X01, the ROM X02, and the RAM X03 are connected to one another by a bus X04. An input/output (I/O) interface X05 is also connected to the bus X04.
The following components are connected to the I/O interface X05: an input section X06 including a keyboard, a mouse, and the like; an output section X07 including a cathode ray tube (CRT) or liquid crystal display (LCD), a loudspeaker, and the like; a storage section X08 including a hard disk and the like; and a communication section X09 including a network interface card such as a LAN card or a modem. The communication section X09 performs communication processing via a network such as the Internet. A drive X10 is also connected to the I/O interface X05 as needed. A removable medium X11, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive X10 as needed, so that a computer program read from it can be installed into the storage section X08 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the disclosed embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section X09 and/or installed from the removable medium X11. When the computer program is executed by the central processing unit (CPU) X01, the above functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium described in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flow charts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each box in a flow chart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flow chart, and any combination of boxes in a block diagram or flow chart, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, this may be described as: a processor comprising a receiving module, a search module, a vector matching module, and an output module. The names of these modules do not, under certain conditions, constitute a limitation on the modules themselves; for example, the receiving module may also be described as "a module for preprocessing the geographical address to be processed".
In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to:
Receive a geographical address to be processed;
Search a predefined standardized address library for the normal address with the highest similarity to the geographical address to be processed, take it as the first normal address and the corresponding similarity as the first similarity, and, when the first similarity is greater than a predetermined first threshold, determine that the first normal address is the standardized address of the geographical address to be processed.
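The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity measure is a stand-in (`difflib.SequenceMatcher.ratio`), and the names `best_normal_address`, `standardize`, `LIBRARY`, and the threshold value 0.8 are hypothetical and not taken from this document.

```python
from difflib import SequenceMatcher

# Hypothetical standardized address library (assumption for illustration).
LIBRARY = ["北京市朝阳区建国路88号", "上海市浦东新区世纪大道100号"]


def best_normal_address(query, library):
    """Return (address, similarity) for the library entry most similar to query."""
    scored = [(addr, SequenceMatcher(None, query, addr).ratio()) for addr in library]
    return max(scored, key=lambda pair: pair[1])


def standardize(query, library, first_threshold=0.8):
    """Return the first normal address if the first similarity exceeds the threshold."""
    address, first_similarity = best_normal_address(query, library)
    return address if first_similarity > first_threshold else None


print(standardize("北京朝阳区建国路88号", LIBRARY))
```

A result of `None` here corresponds to the case where the first similarity does not exceed the first threshold and further processing would be needed.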
The specific embodiments described above do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (23)

1. A method for standardizing a geographical address, characterized by comprising:
receiving a geographical address to be processed;
searching a predefined standardized address library for the normal address having the highest similarity to the geographical address to be processed, taking it as a first normal address and the corresponding similarity as a first similarity, and, when the first similarity is greater than a predetermined first threshold, determining that the first normal address is the standardized address of the geographical address to be processed.
2. The method according to claim 1, characterized by further comprising:
when the first similarity is not greater than the predetermined first threshold, generating an address vector of the geographical address to be processed;
searching the normal address vector library corresponding to the standardized address library for the normal address vector having the highest similarity to the geographical address to be processed, taking the corresponding similarity as a second similarity, and, when the second similarity is greater than a predetermined second threshold, determining that the second normal address corresponding to the found normal address vector is the standardized address of the geographical address.
3. The method according to claim 2, characterized by further comprising:
when the second similarity is not greater than the second threshold and the weighted average of the first similarity and the second similarity is greater than a predetermined third threshold, determining that the normal address corresponding to the greater of the first similarity and the second similarity is the standardized address of the geographical address to be processed.
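The three-threshold decision described in claims 1 to 3 can be sketched as a small cascade. The sketch below is an assumption-laden illustration: the function name `decide`, the threshold values, and the equal weights are hypothetical, and for brevity both candidate matches are passed in up front, whereas the claimed method only generates the address vector after the first stage fails.

```python
def decide(first, second, t1=0.9, t2=0.85, t3=0.8, w1=0.5, w2=0.5):
    """Pick a standardized address from two (address, similarity) candidates.

    first  -- result of the text-based search (first normal address, first similarity)
    second -- result of the vector-based search (second normal address, second similarity)
    All thresholds and weights are illustrative assumptions.
    """
    addr1, s1 = first
    if s1 > t1:                      # claim 1: first similarity clears the first threshold
        return addr1
    addr2, s2 = second
    if s2 > t2:                      # claim 2: second similarity clears the second threshold
        return addr2
    weighted = (w1 * s1 + w2 * s2) / (w1 + w2)
    if weighted > t3:                # claim 3: fall back to the weighted average
        return addr1 if s1 >= s2 else addr2
    return None                      # no candidate is trusted


print(decide(("A", 0.84), ("B", 0.82)))
```

Returning `None` when no threshold is met is a choice made for this sketch; the claims do not say what happens in that case.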
4. The method according to claim 1, characterized in that the standardized address library is obtained by the following steps:
receiving a sample geographical address set;
preprocessing the sample geographical address set;
reading, based on an address dictionary, the address keys of each geographical address in the geographical address set and their corresponding levels, removing, based on the levels of the address keys, the address keys that do not meet the level requirement, and obtaining for each geographical address a level-standardized address delimited by the address keys, thereby obtaining the standardized address library composed of the level-standardized addresses.
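A rough sketch of the key-based splitting in claim 4 follows. It rests on several assumptions that this section does not confirm: address keys are taken to be single suffix characters, the dictionary `LEVEL_OF_KEY` is hypothetical, and "does not meet the level requirement" is interpreted as "the level does not increase from the previous kept segment".

```python
# Hypothetical address dictionary: address key -> level (assumption for illustration).
LEVEL_OF_KEY = {"省": 1, "市": 2, "区": 3, "路": 4, "号": 5}


def split_by_keys(address):
    """Split an address into (segment, level) pairs, each segment ending with an address key."""
    segments, start = [], 0
    for i, ch in enumerate(address):
        if ch in LEVEL_OF_KEY:
            segments.append((address[start:i + 1], LEVEL_OF_KEY[ch]))
            start = i + 1
    return segments  # any text after the last key is ignored in this sketch


def hierarchy_standardize(address):
    """Keep only segments whose level is strictly deeper than the previous kept one."""
    kept, last_level = [], 0
    for segment, level in split_by_keys(address):
        if level > last_level:
            kept.append(segment)
            last_level = level
    return " ".join(kept)


print(hierarchy_standardize("广东省深圳市南山区科技路1号"))
```

A real dictionary would need multi-character keys and handling for key characters that occur inside place names; this sketch ignores both.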
5. The method according to claim 2, characterized in that generating the address vector of the geographical address to be processed comprises:
looking up, in the term vector library corresponding to the standardized address library, the term vectors corresponding to the address words of the geographical address to be processed;
generating, from the term vectors corresponding to the geographical address to be processed and based on a preset hierarchical address word dictionary, the address vector of the geographical address to be processed by a bag-of-words approach.
6. The method according to claim 5, characterized in that the term vector library is obtained by the following step:
converting the address words of all geographical addresses in the standardized address library into term vectors, the value of each term vector being the weight coefficient of the corresponding address word, the weight coefficient being assigned based on the contextual semantics of the address word, thereby obtaining the term vector library corresponding to the standardized address library.
7. The method according to claim 6, characterized in that the address vector library is obtained by the following step:
generating, from the term vectors corresponding to the geographical addresses in the standardized address library and based on the preset hierarchical address word dictionary, the address vectors of the geographical addresses in the standardized address library by a bag-of-words approach, thereby obtaining the address vector library composed of the address vectors.
8. The method according to claim 5 or 7, characterized in that the preset hierarchical address word dictionary is a multi-dimensional vector layered by address level, in which each layer represents one address level and has multiple dimensions, each layer contains all address word types of the corresponding address level that are present in the standardized address library, and each address word type is represented by one dimension of that layer.
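To make claims 5 to 8 more concrete, here is a minimal sketch of a layered bag-of-words address vector. The layers in `HIERARCHY`, the weights in `WEIGHT`, and the use of cosine similarity are all assumptions for illustration; the claims only state that each address word type occupies one dimension of its layer and that a term vector's value is the word's weight coefficient.

```python
import math

# Hypothetical hierarchical address word dictionary: one list per address level.
HIERARCHY = [
    ["北京市", "上海市"],             # level 1
    ["朝阳区", "海淀区", "浦东新区"],  # level 2
    ["建国路", "世纪大道"],           # level 3
]
# One dimension per address word type, laid out layer by layer.
DIMENSION = {word: i for i, word in enumerate(w for layer in HIERARCHY for w in layer)}
# Hypothetical weight coefficients (the "term vector" values).
WEIGHT = {"北京市": 0.2, "上海市": 0.2, "朝阳区": 0.5, "海淀区": 0.5,
          "浦东新区": 0.5, "建国路": 1.0, "世纪大道": 1.0}


def address_vector(words):
    """Bag-of-words vector: each known address word sets its own dimension to its weight."""
    vector = [0.0] * len(DIMENSION)
    for word in words:
        if word in DIMENSION:
            vector[DIMENSION[word]] = WEIGHT.get(word, 1.0)
    return vector


def cosine(u, v):
    """Cosine similarity, used here as an assumed similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


a = address_vector(["北京市", "朝阳区", "建国路"])
b = address_vector(["北京市", "海淀区"])
print(cosine(a, b))
```

Two addresses that share only a coarse level (here the city) get a small positive similarity, while addresses with no word in common get zero.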
9. The method according to claim 2 or 7, characterized in that, in the address vector library, each address vector corresponds to n records, the primary keys of the n records corresponding respectively to the n level address representations of the address to which the address vector corresponds, n being less than or equal to the number of levels of the address; among the n level address representations, the level-1 address representation is the address itself, and each subsequent level address representation is the address remaining after removing, from the preceding level address representation, the address word of the level with the highest level number,
and searching the normal address vector library corresponding to the standardized address library for the normal address vector having the highest similarity to the geographical address to be processed comprises:
looking up the corresponding primary key in the normal address vector library level by level using the n level address representations of the geographical address to be processed, until a corresponding primary key is found and the address vector in the record corresponding to it is obtained, and then calculating the similarity between that address vector and the address vector of the geographical address to be processed.
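The level-by-level primary-key lookup of claim 9 might look like the following sketch. The helper names, the space-joined key format, and the decision to keep a list of candidates per key (since several addresses share the same truncated key) are assumptions made for illustration.

```python
def level_keys(words, n):
    """n level address representations: the full address, then drop the finest word each time.

    words must be ordered from the coarsest level to the finest level.
    """
    return [" ".join(words[:len(words) - i]) for i in range(min(n, len(words)))]


def build_vector_library(standard_addresses, n):
    """Map every level key of every (words, vector) pair to the records stored under it."""
    library = {}
    for words, vector in standard_addresses:
        for key in level_keys(words, n):
            library.setdefault(key, []).append((words, vector))
    return library


def lookup(query_words, library, n):
    """Try the query's level keys from most to least specific; stop at the first hit."""
    for key in level_keys(query_words, n):
        if key in library:
            return key, library[key]
    return None, []


lib = build_vector_library(
    [(["北京市", "朝阳区", "建国路", "88号"], [1.0]),
     (["北京市", "朝阳区", "建国路", "90号"], [2.0])],
    n=3,
)
print(lookup(["北京市", "朝阳区", "建国路", "89号"], lib, n=3)[0])
```

After a hit, the similarity between each retrieved address vector and the query's address vector would be computed, which this sketch leaves out.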
10. The method according to claim 1, characterized in that the first normal address is obtained by searching an inverted index corresponding to the standardized address library, the inverted index being built from the address word strings formed by the address words of all geographical addresses in the standardized address library.
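An inverted index over address words, as mentioned in claim 10, can be sketched as follows. The function names and the ranking by number of shared address words are assumptions for illustration; the claim itself only states that the index is built from the address word strings of the library.

```python
from collections import defaultdict


def build_inverted_index(library):
    """Map each address word to the ids of the library addresses that contain it."""
    index = defaultdict(set)
    for addr_id, words in enumerate(library):
        for word in words:
            index[word].add(addr_id)
    return index


def candidates(query_words, index):
    """Return address ids ordered by how many query words they share (ties by id)."""
    hits = defaultdict(int)
    for word in query_words:
        for addr_id in index.get(word, ()):
            hits[addr_id] += 1
    return sorted(hits, key=lambda i: (-hits[i], i))


LIB = [
    ["北京市", "朝阳区", "建国路"],
    ["北京市", "海淀区", "中关村大街"],
    ["上海市", "浦东新区", "世纪大道"],
]
index = build_inverted_index(LIB)
print(candidates(["北京市", "朝阳区"], index))
```

The point of the index is that only addresses sharing at least one word with the query need a full similarity computation.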
11. The method according to claim 1, characterized in that receiving the geographical address to be processed further comprises: preprocessing the geographical address to be processed,
wherein the preprocessing includes conversion between simplified and traditional Chinese, full-width/half-width conversion, removal of punctuation marks, and removal of special characters,
wherein, when the geographical address to be processed is expressed in Chinese, the preprocessing further includes segmenting the geographical address to be processed into words to obtain its corresponding address words.
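The character-level preprocessing in claim 11 can be approximated with the Python standard library, as in the sketch below. It makes some simplifying assumptions: NFKC normalization stands in for full-width to half-width conversion, the traditional-to-simplified table `T2S` is a tiny hypothetical sample rather than a complete mapping, and word segmentation is omitted because it needs a dedicated segmenter.

```python
import unicodedata

# Tiny illustrative traditional -> simplified table (assumption; a real table is far larger).
T2S = str.maketrans({"區": "区", "號": "号", "陽": "阳", "國": "国"})


def preprocess(address):
    """Normalize width, convert sample traditional characters, strip punctuation and symbols."""
    text = unicodedata.normalize("NFKC", address)   # full-width forms -> half-width forms
    text = text.translate(T2S)                      # traditional -> simplified (sample only)
    # Drop punctuation (P*), symbols (S*), separators (Z*), and control/format characters (C*).
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PSZC")


print(preprocess("北京市，朝陽區　建國路８８號＃"))
```

Filtering by Unicode general category keeps Chinese characters, Latin letters, and digits while removing both full-width and half-width punctuation in one pass.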
12. A device for standardizing a geographical address, characterized by comprising:
a receiving module, configured to receive a geographical address to be processed;
a search module, configured to search a predefined standardized address library for the normal address having the highest similarity to the geographical address to be processed, take it as a first normal address and the corresponding similarity as a first similarity, and, when the first similarity is greater than a predetermined first threshold, determine that the first normal address is the standardized address of the geographical address to be processed.
13. The device according to claim 12, characterized by further comprising:
a vector matching module, configured to, when the first similarity is not greater than the predetermined first threshold, generate an address vector of the geographical address to be processed, search the normal address vector library corresponding to the standardized address library for the normal address vector having the highest similarity to the geographical address to be processed, take the corresponding similarity as a second similarity, and, when the second similarity is greater than a predetermined second threshold, determine that the second normal address corresponding to the found normal address vector is the standardized address of the geographical address.
14. The device according to claim 13, characterized by further comprising:
an output module, configured to, when the second similarity is not greater than the second threshold and the weighted average of the first similarity and the second similarity is greater than a predetermined third threshold, determine that the normal address corresponding to the greater of the first similarity and the second similarity is the standardized address of the geographical address to be processed.
15. The device according to claim 12, characterized by further comprising:
a first configuration module, configured to obtain the standardized address library by the following steps:
receiving a sample geographical address set;
preprocessing the sample geographical address set;
reading, based on an address dictionary, the address keys of each geographical address in the geographical address set and their corresponding levels, removing, based on the levels of the address keys, the address keys that do not meet the level requirement, and obtaining for each geographical address a level-standardized address delimited by the address keys, thereby obtaining the standardized address library composed of the level-standardized addresses.
16. The device according to claim 13, characterized in that the vector matching module is further configured to look up, in the term vector library corresponding to the standardized address library, the term vectors corresponding to the address words of the geographical address to be processed, and then to generate, from the term vectors corresponding to the geographical address to be processed and based on a preset hierarchical address word dictionary, the address vector of the geographical address to be processed by a bag-of-words approach.
17. The device according to claim 16, characterized by further comprising:
a second configuration module, configured to obtain the term vector library by the following step:
converting the address words of all geographical addresses in the standardized address library into term vectors, the value of each term vector being the weight coefficient of the corresponding address word, the weight coefficient being assigned based on the contextual semantics of the address word, thereby obtaining the term vector library corresponding to the standardized address library.
18. The device according to claim 17, characterized by further comprising:
a third configuration module, configured to obtain the address vector library by the following step:
generating, from the term vectors corresponding to the geographical addresses in the standardized address library and based on the preset hierarchical address word dictionary, the address vectors of the geographical addresses in the standardized address library by a bag-of-words approach, thereby obtaining the address vector library composed of the address vectors.
19. The device according to claim 13 or 18, characterized in that, in the address vector library, each address vector corresponds to n records, the primary keys of the n records corresponding respectively to the n level address representations of the address to which the address vector corresponds, n being less than or equal to the number of levels of the address; among the n level address representations, the level-1 address representation is the address itself, and each subsequent level address representation is the address remaining after removing, from the preceding level address representation, the address word of the level with the highest level number,
the vector matching module being further configured to look up the corresponding primary key in the normal address vector library level by level using the n level address representations of the geographical address to be processed, until a corresponding primary key is found and the address vector in the record corresponding to it is obtained, and then to calculate the similarity between that address vector and the address vector of the geographical address to be processed.
20. The device according to claim 12, characterized in that the search module is further configured to obtain the first normal address by searching an inverted index corresponding to the standardized address library, the inverted index being built from the address word strings formed by the address words of all geographical addresses in the standardized address library.
21. The device according to claim 12, characterized in that the receiving module is further configured to preprocess the geographical address to be processed, wherein the preprocessing includes conversion between simplified and traditional Chinese, full-width/half-width conversion, and removal of special characters, and wherein, when the geographical address to be processed is expressed in Chinese, the preprocessing further includes segmenting the geographical address to be processed into words to obtain its corresponding address words.
22. An electronic device for standardizing a geographical address, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-11.
23. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-11.
CN201710661302.7A 2017-08-04 2017-08-04 The method and apparatus that geographical address is standardized Pending CN110019575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710661302.7A CN110019575A (en) 2017-08-04 2017-08-04 The method and apparatus that geographical address is standardized

Publications (1)

Publication Number Publication Date
CN110019575A true CN110019575A (en) 2019-07-16

Family

ID=67186044

Country Status (1)

Country Link
CN (1) CN110019575A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631911A (en) * 2013-11-27 2014-03-12 中国人民大学 OLAP query processing method based on array storage and vector processing
CN104217030A (en) * 2014-09-28 2014-12-17 北京奇虎科技有限公司 Method and device for classifying users according to search log data of server
US20150261858A1 (en) * 2009-06-29 2015-09-17 Google Inc. System and method of providing information based on street address
CN106096024A (en) * 2016-06-24 2016-11-09 北京京东尚科信息技术有限公司 The appraisal procedure of address similarity and apparatus for evaluating
CN106547770A (en) * 2015-09-21 2017-03-29 阿里巴巴集团控股有限公司 A kind of user's classification based on address of theenduser information, user identification method and device
CN106598953A (en) * 2016-12-28 2017-04-26 上海博辕信息技术服务有限公司 Address resolution method and device

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704696A (en) * 2019-10-10 2020-01-17 北京东软望海科技有限公司 Data standardization method and device, electronic equipment and readable storage medium
CN112749532A (en) * 2019-10-30 2021-05-04 阿里巴巴集团控股有限公司 Address text processing method, device and equipment
CN111008625A (en) * 2019-12-06 2020-04-14 中国建设银行股份有限公司 Address correction method, device, equipment and storage medium
CN111008625B (en) * 2019-12-06 2023-07-18 建信金融科技有限责任公司 Address correction method, device, equipment and storage medium
CN111966766A (en) * 2020-02-18 2020-11-20 上海寻梦信息技术有限公司 Address information detection method, system, electronic device and storage medium
CN111639493A (en) * 2020-05-22 2020-09-08 上海微盟企业发展有限公司 Address information standardization method, device, equipment and readable storage medium
CN111783419A (en) * 2020-06-12 2020-10-16 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN111783419B (en) * 2020-06-12 2024-02-27 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN114064827A (en) * 2020-08-05 2022-02-18 北京四维图新科技股份有限公司 Position searching method, device and equipment
CN112231429A (en) * 2020-11-09 2021-01-15 山东健康医疗大数据有限公司 Address matching method based on machine learning classification algorithm
CN112380858A (en) * 2020-11-12 2021-02-19 中国科学技术大学智慧城市研究院(芜湖) Address completion and correction method based on government affair big data
CN112559658A (en) * 2020-12-08 2021-03-26 中国科学技术大学 Address matching method and device
CN112559658B (en) * 2020-12-08 2022-12-30 中国科学技术大学 Address matching method and device
CN113282702A (en) * 2021-03-16 2021-08-20 广东医通软件有限公司 Intelligent retrieval method and retrieval system
CN113282702B (en) * 2021-03-16 2023-12-19 广东医通软件有限公司 Intelligent retrieval method and retrieval system
CN113326267A (en) * 2021-06-24 2021-08-31 中国科学技术大学智慧城市研究院(芜湖) Address matching method based on inverted index and neural network algorithm
CN113326267B (en) * 2021-06-24 2023-08-08 长三角信息智能创新研究院 Address matching method based on inverted index and neural network algorithm
CN113468881B (en) * 2021-07-23 2024-02-27 浙江大华技术股份有限公司 Address standardization method and device
CN113468881A (en) * 2021-07-23 2021-10-01 浙江大华技术股份有限公司 Address standardization method and device
CN113987114B (en) * 2021-09-17 2023-04-07 上海燃气有限公司 Address matching method and device based on semantic analysis and electronic equipment
CN113987114A (en) * 2021-09-17 2022-01-28 上海燃气有限公司 Address matching method and device based on semantic analysis and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination