CN102955832B - A kind of address identification, standardized system - Google Patents

A kind of address identification, standardized system

Info

Publication number: CN102955832B
Application number: CN201110255616.XA
Other versions: CN102955832A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: address, module, metadata, standardized, dictionary
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Inventors: 王国印, 贾西贝
Current Assignee: Shenzhen Huaao Data Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Shenzhen Huaao Data Technology Co Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an address recognition and standardization system that solves the problem of recognizing and standardizing addresses. The system comprises an address input module, an address metadata dictionary module, an address recognition and standardization module, and an address output module. The address recognition and standardization module is connected to the address input module, the address metadata dictionary module, and the address output module; it receives the user-entered address forwarded by the address input module, recognizes and standardizes that address, and produces a standardized address. The address metadata dictionary module stores address metadata and receives and responds to control commands from the address recognition and standardization module. The invention improves the accuracy of address processing and achieves higher throughput and recall.

Description

An address recognition and standardization system
I. Technical Field
The present invention relates to address processing technology, and in particular to an address recognition and standardization system.
II. Background
Address technology is widely applied and closely tied to daily life. In the postal field, handling mail requires recognizing and processing addresses; a geocoding system must first standardize addresses; and a banking system must store, recognize, and update address data. Address technology is also used in networking, e-commerce, electronic maps, and similar fields.
Non-standard addresses cause many inconveniences. For example, because user addresses are not standardized, a postal system must invest substantial manpower and resources in identifying the correct, standard address; otherwise items are misdelivered or delivered repeatedly. As the volume of postal data grows, this investment grows with it and becomes difficult for the postal system to bear. Banking systems face the same problem. If a banking system does not standardize user addresses (in fact many do not: user addresses are entered and updated manually), then as business data grows and different databases prove incompatible, the system suffers from slow processing, low efficiency, and disordered business data, which can easily lead to financial losses and to the loss of customers.
The problems above can all be handled by address technology. Since the founding of the People's Republic of China, as the country's administrative divisions have been settled, Chinese postal addresses have developed certain rules and characteristics. On the other hand, because China has a long history and a vast territory, Chinese addresses also contain many identically named places. The characteristics of Chinese addresses can be summarized as follows. 1) Address information is hierarchical and follows administrative divisions: for example province, city, district, road ("Shennan Road, Nanshan District, Shenzhen City, Guangdong Province"), or province, county, township, village ("Leng Jiaba village of Shuan Hechang township, Yuezhi County, Sichuan Province"). 2) Address data (place names) at different levels can share the same name, which causes ambiguity. Given the address "Wuhu", it is hard to tell whether "Wuhu City, Anhui Province" or "Wuhu County, Anhui Province" is meant. 3) The same address metadata (place name) can be written in several ways, that is, it has aliases: "Guangxi Zhuang Autonomous Region" may be written as "Gui" (桂), "Guangxi", or "Guangxi Autonomous Region".
The prior art offers some approaches to these problems. Patent application No. 200910156650.4, entitled "A Chinese geocoding determination method based on fuzzy matching", proposes the following: read in descriptive Chinese address information (entered in a set standard format that follows China's administrative division standard); using administrative regions as breakpoints, segment the input address with a forward maximum search method to obtain an array of original address elements; then standardize that array against an address dictionary to obtain the standard address. This scheme depends heavily on the accuracy of the input address, has difficulty handling identically named address data at different levels, uses a comparatively crude procedure, and achieves relatively low accuracy.
In view of the phenomena and problems above and the shortcomings of the prior art, the present invention provides a new approach to address recognition and standardization.
III. Summary of the Invention
The invention provides an address recognition and standardization system in order to overcome the shortcomings of the prior art, to handle address phenomena that make addresses hard to process (such as identically named address data at different levels), and to improve the accuracy of address processing as well as the throughput and recall of the processing system under large data volumes.
To achieve this object, the invention provides an address recognition and standardization system comprising an address input module, an address metadata dictionary module, an address recognition and standardization module, and an address output module. The address input module receives the address entered by the user and forwards it to the address recognition and standardization module. The address recognition and standardization module is connected to the address input module, the address metadata dictionary module, and the address output module; it receives the user-entered address forwarded by the address input module, recognizes and standardizes it, and produces a standardized address. The address metadata dictionary module stores address metadata and receives and responds to control commands from the address recognition and standardization module. The address output module receives control commands from the address recognition and standardization module and outputs the standardized address.
The invention improves the accuracy of address processing. It is equally applicable to processing environments with large data volumes, where it achieves higher throughput and recall. Its processing occupies little memory, and the address metadata dictionary is easy to update and maintain.
It should be understood that both the foregoing general description and the following detailed description are illustrative and exemplary, and are intended to provide further explanation of the claimed invention.
IV. Brief Description of the Drawings
The accompanying drawings provide a further understanding of the invention. They are incorporated in and form part of this specification, illustrate embodiments of the invention, and together with the description serve to explain its principles.
Fig. 1 is a structural diagram of the address recognition and standardization system of the present invention.
Fig. 2 is a structural diagram of the address metadata dictionary module of a preferred embodiment of the present invention.
Fig. 3 is a schematic diagram of the Trie tree principle of a preferred embodiment of the present invention.
Fig. 4 is a structural diagram of the address recognition and standardization module of a preferred embodiment of the present invention.
Fig. 5 is a structural diagram of the address recognition and standardization module of another preferred embodiment of the present invention.
Fig. 6 is a structural diagram of the address metadata dictionary module of another preferred embodiment of the present invention.
V. Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Fig. 1 shows the structure of the address recognition and standardization system of the present invention. The system comprises an address input module 100, an address metadata dictionary module 200, an address recognition and standardization module 300, and an address output module 400. The address input module 100 receives the address entered by the user and forwards it to the address recognition and standardization module 300. The address recognition and standardization module 300 is connected to the address input module 100, the address metadata dictionary module 200, and the address output module 400; it receives the user-entered address forwarded by the address input module 100, recognizes and standardizes it, and produces a standardized address. The address metadata dictionary module 200 stores address metadata and receives and responds to control commands from the address recognition and standardization module 300. The address output module 400 receives control commands from the address recognition and standardization module 300 and outputs the standardized address.
The address metadata dictionary module 200 is shown in Fig. 2, the structural diagram of the address metadata dictionary module of a preferred embodiment. It comprises a hierarchical address metadata dictionary 210 and an address alias metadata dictionary 220. The hierarchical address metadata dictionary 210 stores hierarchical address metadata and may be either a four-level or a six-level hierarchical address metadata dictionary.
The four-level hierarchical address metadata dictionary is divided by administrative region. This example provides a four-level hierarchical address model for building the dictionary, shown in Table 1, "four-level hierarchical address model (a)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the central government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures form the second level. Districts of sub-provincial cities and districts of prefecture-level cities form the third level. Townships and towns, roads, natural villages, related numbers, and building names form the fourth level. This model is commonly used for identity-card addresses and for addresses as they are generally written.
Table 1. Four-level hierarchical address model (a)
Embodiments of the invention may also use another four-level hierarchical address model to build the four-level dictionary, shown in Table 2, "four-level hierarchical address model (b)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures form the second level. Districts of municipalities directly under the central government, districts of sub-provincial cities, districts of prefecture-level cities, counties, county-level cities, autonomous counties, and banners form the third level. Townships and towns, roads, natural villages, related numbers, and building names form the fourth level. This model is more rigorous: it strictly follows the levels of the national administrative divisions and is commonly used for address classification and hierarchical information on the Internet.
Table 2. Four-level hierarchical address model (b)
The six-level hierarchical address metadata dictionary is likewise divided by administrative region. Embodiments of the invention may use a six-level hierarchical address model to build it, shown in Table 3, "six-level hierarchical address model (a)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the central government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures form the second level. Districts of sub-provincial cities and districts of prefecture-level cities form the third level. Townships and towns form the fourth level. Streets, roads, and natural villages form the fifth level. Numbers form the sixth level. This model is also commonly used for identity-card addresses and for addresses as they are generally written.
Table 3. Six-level hierarchical address model (a)
Embodiments of the invention may also use another six-level hierarchical address model to build the six-level dictionary, shown in Table 4, "six-level hierarchical address model (b)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures form the second level. Districts of municipalities directly under the central government, districts of sub-provincial cities, districts of prefecture-level cities, counties, county-level cities, autonomous counties, and banners form the third level. Townships and towns form the fourth level. Streets, roads, and natural villages form the fifth level. Numbers form the sixth level. This model is likewise more rigorous: it strictly follows the levels of the national administrative divisions and is commonly used for address classification and hierarchical information on the Internet.
Table 4. Six-level hierarchical address model (b)
As the tables of the hierarchical address models show, a six-level hierarchical address metadata dictionary carries more detailed and explicit address information than a four-level one, and its dictionary is correspondingly larger.
The hierarchical address metadata dictionary 210 may use a Trie tree storage structure, which may be implemented with a double array. In a Trie implemented with a double array, all entries are compiled into a dictionary tree, and this dictionary tree is a deterministic finite automaton (DFA).
The address alias metadata dictionary 220 stores address alias metadata and maintains a mapping to the metadata in the hierarchical address metadata dictionary 210. Entries in the hierarchical address metadata dictionary 210 and entries in the address alias metadata dictionary 220 are in a one-to-many relationship. For example, the address metadata "Anhui Province" in the hierarchical address metadata dictionary 210 corresponds to the address alias metadata "Anhui" and "Wan" (皖) in the address alias metadata dictionary 220. Within the set of place-name metadata of a single level, an alias corresponds to exactly one address metadata item, so within a level a mapping can be established from aliases to address metadata. Mapping each alias to its address metadata allows address metadata to be processed uniformly.
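The alias mapping described above can be sketched as one lookup table per address level. This is only an illustration under stated assumptions, not the patented implementation: the names `ALIASES` and `canonical` and the sample entries are hypothetical.

```python
# Hypothetical alias dictionary: one table per address level, so that an
# alias maps to exactly one canonical name within its level.
ALIASES = {
    1: {
        "Anhui": "Anhui Province",
        "Wan": "Anhui Province",
        "Guangxi": "Guangxi Zhuang Autonomous Region",
        "Gui": "Guangxi Zhuang Autonomous Region",
    },
    2: {"Shenzhen": "Shenzhen City"},
}


def canonical(name: str, level: int) -> str:
    """Return the canonical name for an alias at the given level.

    Names without an alias entry are returned unchanged.
    """
    return ALIASES.get(level, {}).get(name, name)


print(canonical("Wan", 1))
print(canonical("Gui", 1))
```

Keeping a separate table per level reflects the statement that an alias is unambiguous only within one level; the same string may appear in several levels with different targets.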
The address recognition and standardization module 300 is shown in Fig. 4, the structural diagram of the address recognition and standardization module of a preferred embodiment. It comprises an address segmentation module 310, an address labeling module 320, a weight module 330, and an address standardization module 340.
The address segmentation module 310 receives the user-entered address forwarded by the address input module 100, segments it, and generates a set of segmented address metadata. Segmentation may use the hierarchical address metadata dictionary 210 of the address metadata dictionary module 200, matching and segmenting the user-entered address by forward (left-to-right) maximum matching.
The hierarchical address metadata dictionary 210 uses a Trie tree storage structure implemented with a double array. A double-array Trie consists of two arrays. The array Base[] stores, for the current state, the starting offset used with the next input variable (chosen so that conflicts are avoided). The array Check[] stores the position of the predecessor state of the current state. All stored entries are compiled into a dictionary tree, and this dictionary tree is a deterministic finite automaton (DFA). In the preferred embodiment, a forward maximum matching algorithm may be used to match and segment the user-entered address. Taking the Unicode code point of each character of the input string as the DFA input variable as an example (each byte of a UTF-8, GBK, or GB18030 encoding could equally serve as the input variable), the forward maximum matching algorithm proceeds as follows:
Starting from the initial state, the next state is obtained from the value of the input variable using the following formulas:
Base[s] + c = t    Formula (1)
Check[Base[s] + c] = s    Formula (2)
where s is the current state, c is the value of the input variable, and t is the position of the next state. Formula (2) is a verification: it states that the predecessor of the next state is the current state. If the current state is an end state, the value at its position in the Base array is negative; otherwise it is positive (the current state is a non-end, or intermediate, state).
Following this calculation, the Unicode value of each character of the string is read in turn as the input variable. The next state is computed from the current state, the next state then becomes the current state, and the step repeats until matching ends. Matching ends in either of the following cases:
1) the end of the input string is reached;
2) the verification condition of formula (2) is not satisfied.
When matching ends, the input position of the last end state passed is the maximum matching position, and its distance from the starting point is returned as the maximum matching length.
It follows from this procedure that the complexity of the algorithm is O(n), where n is the length of the string being searched. The complexity is independent of the size of the dictionary (the number of entries it contains).
The Trie tree schematic of Fig. 3 helps in understanding how the forward maximum matching algorithm above works. For example, if the user-entered address is "ShenZhen,GuangDong Bao'an Xixiang", matching and segmenting it against the address metadata dictionary by forward maximum matching yields the address metadata "Guangdong", "Shenzhen", "Bao'an", and "Xixiang".
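The matching procedure of formulas (1) and (2) can be sketched as follows. This is a minimal illustration under several assumptions, not the patented implementation: the function names, the sample entries, and the naive first-fit construction are hypothetical, and the absolute value of Base[s] is used as the offset because end states store a negative Base value.

```python
from collections import deque


def build_double_array(words):
    """Compile words into Base/Check arrays (naive first-fit placement)."""
    # Step 1: ordinary nested-dict trie; the key None marks the end of an entry.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True

    base, check = [1], [0]  # state 0 is the initial state

    def grow(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)  # -1 marks a free slot

    # Step 2: place the children of every state, breadth-first.
    queue = deque([(0, root)])
    while queue:
        s, node = queue.popleft()
        codes = sorted(ord(ch) for ch in node if ch is not None)
        b = 1
        if codes:
            grow(b + codes[-1] + 1)
            while any(check[b + c] != -1 for c in codes):
                b += 1
                grow(b + codes[-1] + 1)
            for c in codes:
                check[b + c] = s  # predecessor of the child state is s
                queue.append((b + c, node[chr(c)]))
        base[s] = -b if None in node else b  # negative Base marks an end state
    return base, check


def longest_match(base, check, text, start):
    """Length of the longest dictionary entry starting at text[start]."""
    s, best = 0, 0
    for i in range(start, len(text)):
        t = abs(base[s]) + ord(text[i])        # formula (1)
        if t >= len(check) or check[t] != s:   # formula (2) fails
            break
        s = t
        if base[s] < 0:                        # passed an end state
            best = i - start + 1
    return best


def segment(base, check, text):
    """Forward maximum matching; unknown characters become single tokens."""
    out, i = [], 0
    while i < len(text):
        n = longest_match(base, check, text, i) or 1
        out.append(text[i:i + n])
        i += n
    return out


ENTRIES = ["广东", "广东省", "深圳", "深圳市", "宝安", "宝安区", "西乡", "西乡镇"]
BASE, CHECK = build_double_array(ENTRIES)
print(segment(BASE, CHECK, "广东深圳宝安西乡"))
```

Each call to `longest_match` does one array lookup per character, which is why the cost depends on the length of the input rather than on the number of dictionary entries.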
The address labeling module 320 uses the address metadata dictionary module 200 to label the set of segmented address metadata according to predefined label attributes, and generates a set of labeled address metadata. Because user input is arbitrary, address information may well be mistyped, or arbitrary text may be mixed in between otherwise correct address levels; the system therefore needs to be robust. To this end, the preferred embodiment provides an address labeling method. The method varies with the model of the hierarchical address metadata dictionary. In the preferred embodiment, taking Table 1, "four-level hierarchical address model (a)", as an example, it works as follows:
The label attributes are predefined as follows:
1) Provinces, autonomous regions, and municipalities directly under the central government are the first level; the label "(1)" is appended to the segmented address metadata, marking a first-level address.
2) Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the central government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures are the second level; the label "(2)" is appended to the segmented address metadata, marking a second-level address.
3) Districts of sub-provincial cities and districts of prefecture-level cities are the third level; the label "(3)" is appended to the segmented address metadata, marking a third-level address.
4) Townships and towns, roads, natural villages, building names, and the like are the fourth level; the label "(4)" is appended to the segmented address metadata, marking a fourth-level address.
5) Data that is not address information, or address information that the system cannot recognize, receives the label "(0)", marking a level-0 address.
Under these definitions, if the user-entered address is "ShenZhen,GuangDong Bao'an Xixiang", matching and segmentation against the address metadata dictionary by forward maximum matching yields the address metadata "Guangdong", "Shenzhen", "Bao'an", and "Xixiang", and the labeled address can be "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)". Because address metadata at different levels can share a name, labeling the input address can produce several results. The input above has four possible labelings: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)", and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)".
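The way several labelings arise from identically named metadata can be sketched by taking the Cartesian product of each token's candidate levels. The candidate levels below are read off the four labelings listed above; the names `CANDIDATES` and `labelings` are hypothetical, and a real system would look the candidates up in the address metadata dictionary.

```python
from itertools import product

# Candidate levels per segmented token, taken from the example above:
# "Shenzhen" and "Xixiang" each exist at two levels.
CANDIDATES = [
    ("Guangdong", [1]),
    ("Shenzhen", [2, 4]),
    ("Bao'an", [3]),
    ("Xixiang", [4, 2]),
]


def labelings(candidates):
    """Yield every labeled form of the address, one per level combination."""
    names = [name for name, _ in candidates]
    for levels in product(*(levels for _, levels in candidates)):
        yield " ".join(f"{n} ({lv})" for n, lv in zip(names, levels))


for line in labelings(CANDIDATES):
    print(line)
```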
Accordingly, under the definitions above, typical label orders include the types shown in Table 5, "examples of typical label orders":
Table 5. Examples of typical label orders
Address information for Table 2, "four-level hierarchical address model (b)", Table 3, "six-level hierarchical address model (a)", and Table 4, "six-level hierarchical address model (b)", can be labeled in a similar way. The difference is that the label types of a six-level hierarchical address metadata dictionary are more detailed than those of a four-level one.
The weight module 330 computes a weight for each set of labeled address metadata and outputs the set with the maximum weight. After processing by the address segmentation module 310 and the address labeling module 320, an input address yields one or more sets of labeled address metadata. The input address "ShenZhen,GuangDong Bao'an Xixiang", for example, yields four labelings: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)", and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)". This happens because address metadata at different levels can share a name. In embodiments of the invention, the weights can be computed with a dynamic programming algorithm, and the set of address metadata with the maximum weight is output.
For the dynamic programming algorithm, embodiments of the invention may use a classic one, the Viterbi algorithm, to compute the optimal sequence of address-level labels. Both the observations and the states of the algorithm are address levels. The algorithm comprises the following:
An initial state distribution:
$$\Pi_{n \times 1} = (\pi_1, \pi_2, \pi_3, \ldots, \pi_n)^{T}$$  Formula (3)
where $\pi_i$ is the probability that the address level is $i$. The values in $\Pi$ are set empirically and follow this principle: the higher the administrative level of an address, the higher its probability; for example, the provincial-level probability is greater than the city-level probability.
A state transition probability matrix $A_{n \times n}$:
$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} \end{pmatrix}$$
where
$$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le n$$
is the probability that the next address level is $j$ given that the current address level is $i$. Each value in $A$ is set empirically. The values $a_{ij}$ should follow these principles:
1) In general, when $(i, j)$ forms a reversed order ($i \ge j$), $a_{ij}$ should in principle be smaller than the values for all non-reversed pairs ($i < j$). This ensures that the final label sequence runs in the direction of increasing address level.
2) When the hierarchical address model allows a level to repeat consecutively, for example "(4)+", meaning that fourth-level addresses may occur consecutively, the corresponding $a_{44}$ is set slightly larger; otherwise it is set slightly smaller.
The constraints are:
$$a_{ij} \ge 0, \quad \forall i, j$$
$$\sum_{j=1}^{n} a_{ij} = 1, \quad \forall i$$
The Viterbi algorithm executes as follows:
1) Initialization
$$\delta_1(i) = \pi_i, \quad 1 \le i \le n$$  Formula (4)
$$\psi_1(i) = 0$$  Formula (5)
2) Recursion
$$\delta_t(j) = \max_{1 \le i \le n} [\delta_{t-1}(i)\, a_{ij}]$$  Formula (6)
$$\psi_t(j) = \arg\max_{1 \le i \le n} [\delta_{t-1}(i)\, a_{ij}]$$  Formula (7)
where $2 \le t \le T$, $1 \le j \le n$, and $\psi_t(j)$ records the best predecessor of state $j$ at step $t$.
3) Termination
$$P^{*} = \max_{1 \le i \le n} [\delta_T(i)]$$  Formula (8)
$$q_T^{*} = \arg\max_{1 \le i \le n} [\delta_T(i)]$$  Formula (9)
The best label sequence is obtained by backtracking:
$$q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \quad t = T-1, T-2, \ldots, 1$$  Formula (10)
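The recursion and backtracking steps can be sketched in code as follows. This is a minimal sketch under stated assumptions: the function name `viterbi` is hypothetical, the maximization is restricted to the candidate levels of each token (rather than all n levels), and the values of `PI` and `A` are the example values this embodiment gives for Table 1, indexed by address level 0 to 4.

```python
# Example initial distribution and transition matrix from this embodiment
# (index = address level 0..4).
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def viterbi(candidates, pi, a):
    """Return (best level sequence, its weight).

    candidates[t] lists the levels allowed for the t-th token.
    """
    # Initialization: delta_1(i) = pi_i
    delta = {i: pi[i] for i in candidates[0]}
    backpointers = []
    # Recursion: delta_t(j) = max_i delta_{t-1}(i) * a_ij
    for allowed in candidates[1:]:
        new_delta, psi = {}, {}
        for j in allowed:
            best_i = max(delta, key=lambda i: delta[i] * a[i][j])
            new_delta[j] = delta[best_i] * a[best_i][j]
            psi[j] = best_i
        backpointers.append(psi)
        delta = new_delta
    # Termination and backtracking
    last = max(delta, key=delta.get)
    path = [last]
    for psi in reversed(backpointers):
        path.append(psi[path[-1]])
    path.reverse()
    return path, delta[last]


path, weight = viterbi([[1], [2, 4], [3], [4, 2]], PI, A)
print(path, weight)
```

The candidate lists here correspond to an address whose second token may be level 2 or 4 and whose fourth token may be level 4 or 2.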
The following example illustrates the algorithm. Because the models of the hierarchical address metadata dictionary differ, the specific values of Pi and A used by the Viterbi algorithm differ as well. In this embodiment, taking Table 1, "four-level hierarchical address model (a)", as an example, Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
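As a quick check, the example values can be tested against the constraints that every probability is non-negative and that Pi and each row of A sum to 1. The helper name `is_distribution` is hypothetical; the numbers are copied from the values above.

```python
import math

PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def is_distribution(values):
    """True if all values are non-negative and sum to 1 (within tolerance)."""
    return all(v >= 0 for v in values) and math.isclose(sum(values), 1.0)


print(is_distribution(PI), all(is_distribution(row) for row in A))
```

Both Pi and A have five entries per row because index 0 stands for the "(0)" label in addition to levels 1 through 4.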
If the input address is "ShenZhen,GuangDong Bao'an Xixiang", processing by the address segmentation module 310 and the address labeling module 320 yields four labelings: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)", and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)". The Viterbi algorithm gives the weights of the four labelings:
1) Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4); P=0.030375
2) Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2); P=0.0030375
3) Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4); P=0.001125
4) Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2); P=1.125E-4
The label sequence with the highest probability is the first one, so the dynamic programming algorithm outputs the first labeling, "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)".
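The four weights above can be reproduced by multiplying the initial probability of the first level by the transition probabilities along each label sequence. The function name `path_weight` is hypothetical; `PI` and `A` are the example values of this embodiment, indexed by address level 0 to 4.

```python
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def path_weight(levels, pi, a):
    """Weight of one label sequence: pi[first] times each transition a[i][j]."""
    weight = pi[levels[0]]
    for i, j in zip(levels, levels[1:]):
        weight *= a[i][j]
    return weight


SEQUENCES = [(1, 2, 3, 4), (1, 2, 3, 2), (1, 4, 3, 4), (1, 4, 3, 2)]
for seq in SEQUENCES:
    print(seq, path_weight(seq, PI, A))

best = max(SEQUENCES, key=lambda seq: path_weight(seq, PI, A))
```

For the first sequence the product is 0.45 × 0.45 × 0.30 × 0.50 = 0.030375, which matches the value listed above.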
Weights for models based on "table 2, four-stage grading address model (b)", "table 3, six grades of classifications address model (a)" and "table 4, six grades of classifications address model (b)" can be processed in the same way.
The address aliases metadata dictionary 220 stores address alias metadata and has a mapping relationship with the hierarchical address metadata dictionary 210. Using the address aliases metadata dictionary 220, the address standardization module 340 can standardize address aliases in the input; the standardized name of an address is its full official name. For example, "Shanghai" is standardized as "Shanghai City", "Guangdong" as "Guangdong Province", "Guangxi" as "Guangxi Zhuang Autonomous Region", and "Beijing" as "Beijing City".
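As a rough illustration of the alias lookup, the mapping can be modeled as a plain dictionary from alias to full official name. The structure and the names `ALIAS_TO_OFFICIAL` and `standardize_name` are assumptions made for this sketch; the text does not specify how the address aliases metadata dictionary 220 is stored.

```python
# Hypothetical in-memory stand-in for the address aliases metadata dictionary;
# the entries are the examples given in the text.
ALIAS_TO_OFFICIAL = {
    "Shanghai": "Shanghai City",
    "Guangdong": "Guangdong Province",
    "Guangxi": "Guangxi Zhuang Autonomous Region",
}


def standardize_name(name: str) -> str:
    """Return the full official name for an alias; unknown names pass through."""
    return ALIAS_TO_OFFICIAL.get(name, name)


print(standardize_name("Guangxi"))
print(standardize_name("Guangdong Province"))
```

Returning unknown names unchanged keeps already-standard names and unrecognized text intact.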
After the address metadata group with the maximum weight has been generated, the address standardization module 340 standardizes it using the address aliases metadata dictionary 220, generates the standardized address, and sends a control command to the address output module 400. For the input address "Guangdong Shenzhen Bao'an Xixiang" above, processing by the address cutting module 310, the address labeling module 320 and the weights module 330, followed by standardization in the address standardization module 340, yields the standardized address "Guangdong Province, Shenzhen City, Bao'an District, Xixiang Town". The address standardization module 340 handles real-time addresses (that is, address text labeled "0") as follows:
1), The real-time address appears before the first occurrence of a level-4 address. Because the level-1, level-2 and level-3 address data is relatively complete, it is rare for an address at one of the first three levels to be missing from the dictionary. A real-time address in this position is therefore usually caused by a user input error. In this embodiment of the invention, the address standardization module 340 by default standardizes the other, non-zero-level addresses before and after the real-time address. In other embodiments, the address text labeled "(0)" may instead be deleted, or simply left unstandardized.
2), The real-time address appears after the first occurrence of a level-4 address. This may be caused by incomplete address data (for example, with an address metadata dictionary built on "table 1, four-stage grading address model (a)", data after the "road" level, such as landmark buildings and residential community names, cannot be recognized) or by an input error. In the preferred embodiment of the invention, none of the labeled metadata that follows the last item labeled "(4)" before the first real-time address is standardized. Consider the address "Hubei Province, Wuhan City, Wuchang District, Luo Yu Road No. 1037, Wan Xin Garden 7, Unit 3, Room 203". Its labeled sequence is: "Hubei Province (1) Wuhan City (2) Wuchang District (3) Luo Yu Road (4) No. 1037 (4) Wan (1) Xin Garden (0) 7 (4) Unit 3 (4) Room 203 (4)". Here "Wan", the first character of the community name, is matched as the alias of Anhui Province and labeled "(1)". When the data in the address database is incomplete, this case is handled as follows: the first item labeled as a real-time address is "Xin Garden", and the last item labeled as a level-4 address before it is "No. 1037", so nothing after "No. 1037" is standardized. The final result for "No. 1037 Wan Xin Garden 7, Unit 3, Room 203" is therefore "No. 1037 Wan Xin Garden 7, Unit 3, Room 203", not "No. 1037 Anhui Province Xin Garden 7, Unit 3, Room 203".
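The two cases for real-time addresses can be sketched as a single cutoff rule. This is an illustrative reading of the text, not the patent's code: the function name `standardize_tokens`, the `(text, level)` token representation and the `aliases` dictionary are all hypothetical, and the token "Wan" stands for the alias that the dictionary maps to Anhui Province.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (address text, label level); level 0 = real-time address


def standardize_tokens(tokens: List[Token], aliases: Dict[str, str]) -> List[str]:
    """Standardize labeled tokens while honoring the real-time address rules."""
    levels = [level for _, level in tokens]
    cutoff = len(tokens)  # tokens[:cutoff] are eligible for standardization

    if 0 in levels:
        first_zero = levels.index(0)
        fours_before = [i for i in range(first_zero) if levels[i] == 4]
        if fours_before:
            # Case 2: a level-4 item precedes the first real-time address, so
            # nothing after the last such level-4 item is standardized.
            cutoff = fours_before[-1] + 1
        # Case 1 (no level-4 item before it): keep cutoff at the end, so all
        # non-zero items before and after the real-time address are standardized.

    result = []
    for i, (text, level) in enumerate(tokens):
        if i < cutoff and level != 0:
            result.append(aliases.get(text, text))
        else:
            result.append(text)
    return result


aliases = {"Hubei": "Hubei Province", "Wan": "Anhui Province"}
tokens = [("Hubei", 1), ("Wuhan City", 2), ("Wuchang District", 3),
          ("Luo Yu Road", 4), ("No. 1037", 4), ("Wan", 1),
          ("Xin Garden", 0), ("7", 4), ("Unit 3", 4), ("Room 203", 4)]
print(standardize_tokens(tokens, aliases))
```

In this run "Hubei" is expanded to its official name, while "Wan" is left untouched because it falls after the last level-4 item that precedes the first real-time address.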
In another preferred embodiment of the invention, the address recognition and standardization module 300 comprises, in addition to the address cutting module 310, the address labeling module 320, the weights module 330 and the address standardization module 340, a correcting module 350, as shown in Fig. 5, the structural diagram of the address recognition and standardization module of another preferred embodiment of the invention. The correcting module uses the address metadata dictionary to determine whether the labeled sequence of the address metadata group with the maximum weight is consistent with the constraint conditions of the address metadata dictionary. If it is not, the correcting module revises that labeled sequence, generates the revised address metadata group, and outputs it to the address standardization module.
In another preferred embodiment of the invention, the address metadata dictionary module 200 further comprises an address metadata correction dictionary 230, as shown in Fig. 6, the structural diagram of the address metadata dictionary module of another preferred embodiment of the invention. The address metadata correction dictionary 230 stores constraint condition data of the following two types:
1), When a district shares a name with a county or county-level city, a constraint condition is stored for it. For example, in the four-level address models (a, b) and the six-level address models (a, b), the alias of a district can be identical to the alias of a county or county-level city. "Taihe County" (under Fuyang City, Anhui Province) and "Taihe District" (under Jinzhou City, Liaoning Province) both have the alias "Taihe". When the optimal labeled sequence is computed by the dynamic programming algorithm, "Taihe" is always labeled at the district level, so the optimal candidate result must also be corrected using a constraint condition. In this embodiment of the invention, the stored constraint condition is "Taihe → Fuyang City", which indicates that "Taihe" is a county under "Fuyang City".
2), When a township or town shares a name with a county or county-level city, a constraint condition is stored for it. For example, in the four-level address model (a) and the six-level address model (a), a township or town can have the same name as a county or county-level city. In the address "Fujian Longyan Changting Heping Road", "Changting" would be labeled at the township level, but it is in fact a county under "Longyan", namely "Changting County", and not "Changting Town" under "Hailin City, Mudanjiang City, Heilongjiang Province". The labeling must therefore also be checked against a constraint condition. The constraint condition is "Changting → Longyan", which indicates that "Changting" is a county under "Longyan".
Using the address metadata correction dictionary 230, the correcting module 350 judges whether the address metadata group with the maximum weight is consistent, that is, whether it satisfies the constraint conditions. If it does not, the selected optimal result is problematic, and the correcting module 350 revises the labels of that address metadata group, generates the revised address metadata group, and outputs it to the address standardization module.
The following example illustrates an implementation of this other preferred embodiment. In this embodiment, for "table 1, four-stage grading address model (a)", Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
If the address input by the user is "Heilongjiang Heihe Wudalianchi Xinfa Township", then after processing by the address cutting module 310, the address labeling module 320 and the weights module 330, the labelings of the input address and their weights are as follows:
1), "Heilongjiang (1) Heihe (2) Wudalianchi (4) Xinfa Township (4)" P=0.0200475;
2), "Heilongjiang (1) Heihe (2) Wudalianchi (2) Xinfa Township (4)" P=0.0111375;
3), "Heilongjiang (1) Heihe (4) Wudalianchi (4) Xinfa Township (4)" P=0.0091125;
4), "Heilongjiang (1) Heihe (4) Wudalianchi (2) Xinfa Township (4)" P=0.001485.
The metadata group with the maximum weight is then output: "Heilongjiang (1) Heihe (2) Wudalianchi (4) Xinfa Township (4)". Using the address metadata correction dictionary 230 in the address metadata dictionary module 200, the correcting module determines that, in this metadata group, "Heihe (2) Wudalianchi (4)" is inconsistent with the constraint condition "Wudalianchi → Heihe City". The correcting module therefore revises the labels of the maximum-weight address metadata group "Heilongjiang (1) Heihe (2) Wudalianchi (4) Xinfa Township (4)", generates the revised address metadata group "Heilongjiang (1) Heihe (2) Wudalianchi (2) Xinfa Township (4)", and outputs it to the address standardization module.
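A minimal sketch of the correction step, under stated assumptions: the text gives constraints only in the form "child → parent" and does not say how the correction dictionary encodes the corrected label, so the sketch stores the expected level alongside the parent name. The names `CONSTRAINTS` and `correct_labels`, the token strings and this storage layout are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (address text, label level)

# Hypothetical encoding of the constraint "Wudalianchi -> Heihe City":
# child name -> (parent name, level the child should carry under that parent).
CONSTRAINTS: Dict[str, Tuple[str, int]] = {
    "Wudalianchi": ("Heihe", 2),
}


def correct_labels(tokens: List[Token]) -> List[Token]:
    """Relabel any token that violates a stored constraint with its parent."""
    corrected = list(tokens)
    for i in range(1, len(corrected)):
        text, level = corrected[i]
        rule = CONSTRAINTS.get(text)
        if rule is None:
            continue
        parent, expected_level = rule
        # The constraint applies only when the stated parent directly precedes.
        if corrected[i - 1][0] == parent and level != expected_level:
            corrected[i] = (text, expected_level)
    return corrected


best = [("Heilongjiang", 1), ("Heihe", 2), ("Wudalianchi", 4), ("Xinfa Township", 4)]
print(correct_labels(best))
```

Applied to the maximum-weight labeling from the example, only the label of "Wudalianchi" changes, from 4 to 2, which matches the revised metadata group described in the text.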

Claims (9)

1. An address recognition and standardization system, characterized in that: the system comprises an address load module, an address metadata dictionary module, an address recognition and standardization module, and an address output module;
the address load module is configured to receive an address input by a user and to transmit the address input by the user to the address recognition and standardization module;
the address recognition and standardization module is connected to the address load module, the address metadata dictionary module and the address output module, and is configured to receive the address input by the user as transmitted by the address load module, to recognize and standardize the address input by the user, and to produce a standardized address;
the address metadata dictionary module is configured to store address metadata, and to receive and respond to control commands from the address recognition and standardization module;
the address output module is configured to receive control commands from the address recognition and standardization module and to output the standardized address;
the address metadata dictionary module comprises a hierarchical address metadata dictionary and an address aliases metadata dictionary, the address aliases metadata dictionary being configured to store alias metadata of addresses, and the metadata in the address aliases metadata dictionary having a mapping relationship with the metadata in the hierarchical address metadata dictionary;
the address recognition and standardization module comprises an address cutting module, an address labeling module, a weights module and an address standardization module;
the address cutting module is configured to receive the address input by the user as transmitted by the address load module, to cut the address input by the user using the address metadata dictionary module, and to generate a cut address metadata group;
the address labeling module is configured to label the cut address metadata group using the address metadata dictionary module, and to generate labeled address metadata groups;
the weights module is configured to calculate the weight corresponding to each labeled address metadata group and to output the address metadata group with the maximum weight;
the address standardization module is configured to standardize the address metadata group with the maximum weight using the address metadata dictionary module, to generate the standardized address, and to send a control command to the address output module;
the address standardization module standardizes the address aliases in the input by means of the address aliases metadata dictionary, the standardized name of an address being the full official name of the address.
2. The address recognition and standardization system as claimed in claim 1, characterized in that: the cutting may use a rightward (forward) maximum matching algorithm to match and cut the address input by the user.
3. The address recognition and standardization system as claimed in claim 1, characterized in that: the hierarchical address metadata dictionary is configured to store hierarchical address metadata, and may be a four-level hierarchical address metadata dictionary or a six-level hierarchical address metadata dictionary.
4. The address recognition and standardization system as claimed in claim 3, characterized in that: the hierarchical address metadata dictionary may use a Trie tree storage structure.
5. The address recognition and standardization system as claimed in claim 4, characterized in that: the Trie tree storage structure may be implemented using the double-array method.
6. The address recognition and standardization system as claimed in claim 1, characterized in that: the weights module may use a dynamic programming algorithm to output the address metadata group with the maximum weight.
7. The address recognition and standardization system as claimed in claim 6, characterized in that: the dynamic programming algorithm may be implemented using the Viterbi algorithm.
8. The address recognition and standardization system as claimed in claim 1, characterized in that: the address recognition and standardization module further comprises a correcting module; the correcting module uses the address metadata dictionary module to determine whether the address metadata group with the maximum weight satisfies the constraint conditions; if it does not, the correcting module revises the labels of the address metadata group with the maximum weight, generates a revised address metadata group, and outputs it to the address standardization module.
9. The address recognition and standardization system as claimed in claim 8, characterized in that: the address metadata dictionary module further comprises an address metadata correction dictionary for storing constraint condition data.
CN201110255616.XA 2011-08-31 2011-08-31 A kind of address identification, standardized system Active CN102955832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110255616.XA CN102955832B (en) 2011-08-31 2011-08-31 A kind of address identification, standardized system

Publications (2)

Publication Number Publication Date
CN102955832A CN102955832A (en) 2013-03-06
CN102955832B CN102955832B (en) 2015-11-25

Family

ID=47764642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110255616.XA Active CN102955832B (en) 2011-08-31 2011-08-31 A kind of address identification, standardized system

Country Status (1)

Country Link
CN (1) CN102955832B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: 518057, building 713, room 7, building 9, high tech, central high tech Zone, Shenzhen, Guangdong

Applicant after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: 607, room 29, overseas student Pioneer Building, 518057 South Ring Road, Nanshan District hi tech Zone, Guangdong, Shenzhen

Applicant before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: Room 713, 7/F, Software Building, No. 9, High-tech Middle Road, Central District, Shenzhen, Guangdong 518057

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.
