V. Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Figure 1 is a structural diagram of the address recognition and standardization system of the present invention. The system comprises an address input module 100, an address recognition and standardization module 300, an address metadata dictionary module 200, and an address output module 400. The address input module 100 receives the address entered by the user and passes it to the address recognition and standardization module 300. The address recognition and standardization module 300 is connected to the address input module 100, the address metadata dictionary module 200, and the address output module 400; it receives the user-entered address from the address input module 100, recognizes and standardizes it, and produces a standardized address. The address metadata dictionary module 200 stores address metadata, and receives and responds to control commands from the address recognition and standardization module 300. The address output module 400 receives control commands from the address recognition and standardization module 300 and outputs the standardized address.
Figure 2 is a structural diagram of the address metadata dictionary module 200 in a preferred embodiment of the present invention. The address metadata dictionary module 200 comprises a hierarchical address metadata dictionary 210 and an address alias metadata dictionary 220. The hierarchical address metadata dictionary 210 stores hierarchical address metadata and may be either a four-level or a six-level hierarchical address metadata dictionary.
The four-level hierarchical address metadata dictionary is divided by administrative region. An embodiment of the present invention provides a four-level hierarchical address model for building this dictionary, shown in "Table 1, four-level hierarchical address model (a)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the central government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures form the second level. Districts of sub-provincial cities and districts of prefecture-level cities form the third level. Townships and towns, roads, natural villages, associated numbers, and building names form the fourth level. This model is most often applied to addresses as written on identity cards and in everyday use.
Table 1. Four-level hierarchical address model (a)
An embodiment of the present invention may instead use another four-level hierarchical address model to build the four-level dictionary, shown in "Table 2, four-level hierarchical address model (b)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures form the second level. Districts of municipalities directly under the central government, districts of sub-provincial cities, districts of prefecture-level cities, counties, county-level cities, autonomous counties, and banners form the third level. Townships and towns, roads, natural villages, associated numbers, and building names form the fourth level. This model is more rigorous, strictly following the national grading of administrative regions, and is commonly used for address classification and grading information on the internet.
Table 2. Four-level hierarchical address model (b)
The six-level hierarchical address metadata dictionary is likewise divided by administrative region. An embodiment of the present invention may use a six-level hierarchical address model to build this dictionary, shown in "Table 3, six-level hierarchical address model (a)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the central government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures form the second level. Districts of sub-provincial cities and districts of prefecture-level cities form the third level. Townships and towns form the fourth level. Streets, roads, avenues, and natural villages form the fifth level. Numbers form the sixth level. This model is also most often applied to addresses as written on identity cards and in everyday use.
Table 3. Six-level hierarchical address model (a)
An embodiment of the present invention may instead use another six-level hierarchical address model to build the six-level dictionary, shown in "Table 4, six-level hierarchical address model (b)". Provinces, autonomous regions, and municipalities directly under the central government form the first level. Sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures form the second level. Districts of municipalities directly under the central government, districts of sub-provincial cities, districts of prefecture-level cities, counties, county-level cities, autonomous counties, and banners form the third level. Townships and towns form the fourth level. Streets, roads, avenues, and natural villages form the fifth level. Numbers form the sixth level. This model is also more rigorous, strictly following the national grading of administrative regions, and is commonly used for address classification and grading information on the internet.
Table 4. Six-level hierarchical address model (b)
As the tables of the hierarchical address models above show, the six-level hierarchical address metadata dictionary carries more detailed and more explicit address information than the four-level dictionary, and the dictionary is correspondingly larger.
The hierarchical address metadata dictionary 210 may be stored as a trie, and the trie may be implemented as a double array. In a double-array trie, all entries are compiled into a dictionary tree, and this dictionary tree is a deterministic finite automaton (DFA).
The address alias metadata dictionary 220 stores address alias metadata, which is mapped to the metadata in the hierarchical address metadata dictionary 210. The relation between the hierarchical address metadata dictionary 210 and the address alias metadata dictionary 220 is one-to-many. For example, the address metadata "Anhui Province" in the hierarchical address metadata dictionary 210 corresponds to the address alias metadata "Anhui" and "Wan" in the address alias metadata dictionary 220, which is a one-to-many relation. Within the set of place-name metadata at one level, an address alias can correspond to only one address metadata entry. A mapping can therefore be established, within each level, from address alias metadata to address metadata, so that aliases are mapped to address metadata and address metadata is handled uniformly.
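The alias mapping can be pictured as one lookup table per address level. The following is a minimal sketch under stated assumptions: the table contents, the names `ALIASES_BY_LEVEL`, `canonical_name`, and `aliases_of`, and the fallback of returning an unknown alias unchanged are all illustrative and not taken from the embodiment.

```python
# Hypothetical per-level alias tables: alias -> official address metadata.
# Within one level an alias maps to exactly one entry, so a plain dict suffices.
ALIASES_BY_LEVEL = {
    1: {
        "Anhui": "Anhui Province",
        "Wan": "Anhui Province",
        "Guangdong": "Guangdong Province",
    },
    2: {
        "Shenzhen": "Shenzhen City",
    },
}


def canonical_name(alias, level):
    """Map an alias to its address metadata at the given level.

    Assumption: an alias that is not in the table is returned unchanged.
    """
    return ALIASES_BY_LEVEL.get(level, {}).get(alias, alias)


def aliases_of(name, level):
    """List every alias of one address metadata entry (the one-to-many direction)."""
    table = ALIASES_BY_LEVEL.get(level, {})
    return sorted(alias for alias, target in table.items() if target == name)
```

Keying the tables by level is what allows the same alias text to resolve differently at different levels without ambiguity.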
Figure 4 is a structural diagram of the address recognition and standardization module 300 in a preferred embodiment of the present invention. The address recognition and standardization module 300 comprises an address segmentation module 310, an address labeling module 320, a weight module 330, and an address standardization module 340.
The address segmentation module 310 receives the user-entered address passed on by the address input module 100, segments it, and generates the segmented address metadata set. Segmentation may use the hierarchical address metadata dictionary 210 of the address metadata dictionary module 200, matching and segmenting the user-entered address by forward maximum matching (maximum matching to the right).
The hierarchical address metadata dictionary 210 is stored as a trie, and the trie is implemented as a double array. The double-array trie consists of two arrays. The array Base[] stores, for the current state, the starting offset applied to the next input variable (chosen so that states do not collide). The array Check[] stores the position of the predecessor state of the current state. All stored entries are compiled into a dictionary tree, and this dictionary tree is a deterministic finite automaton (DFA). In a preferred embodiment of the present invention, a forward maximum matching algorithm may be used to match and segment the user-entered address. Taking the Unicode code of each character of the input string as the input variable of the DFA as an example (each byte of a UTF-8, GBK, or GB18030 encoding can serve as the input variable in the same way), the forward maximum matching algorithm proceeds as follows:
Starting from the initial state, the next state is obtained from the value of the input variable by the following formulas:
Base[s] + c = t    formula (1)
Check[Base[s] + c] = s    formula (2)
Here s is the current state, c is the value of the input variable, and t is the position of the next state. Formula (2) is a check: it confirms that the predecessor of the next state is the current state. If the current state is an end state, the value at the corresponding position of the Base array is negative; otherwise it is positive (indicating that the current state is a non-end, or intermediate, state).
Following the calculation above, the Unicode value of each character of the string is read in turn as the input variable and the next state is computed from the current state; the next state then becomes the current state and the step is repeated until termination. Termination occurs in either of the following cases:
1) the input string reaches its end;
2) the check condition of formula (2) is not satisfied.
On termination, the input position of the last end state passed is the maximum matching position, and its distance from the starting point is returned as the maximum matching length.
From the procedure above it is easy to see that the complexity of this algorithm is O(n), where n is the length of the string being searched. The complexity is independent of the size of the dictionary (that is, the number of entries the dictionary contains).
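The transition and check of formulas (1) and (2) can be tried on a very small double array. The sketch below is illustrative only: the arrays are built by hand for the two entries "ab" and "ac", and it assumes that the root is state 0, that characters are encoded as a=1, b=2, c=3 rather than as Unicode values, that unused Check slots hold -1, and that an end state stores a negative Base whose absolute value is the offset. None of these conventions is specified by the embodiment.

```python
# Hand-built double array for the entries "ab" and "ac" (assumed layout):
#   state 0 = root, state 2 = "a", state 4 = "ab" (end), state 5 = "ac" (end)
CODE = {"a": 1, "b": 2, "c": 3}   # assumed character codes
BASE = [1, 0, 2, 0, -1, -1]       # negative value marks an end state
CHECK = [-1, -1, 0, -1, 2, 2]     # predecessor state, -1 for unused slots


def next_state(s, c):
    """Return the next state from state s on input c, or None if the check fails."""
    t = abs(BASE[s]) + c                         # formula (1)
    if 0 <= t < len(CHECK) and CHECK[t] == s:    # formula (2)
        return t
    return None


def longest_match(text, start=0):
    """Length of the longest entry that begins at text[start]; 0 if none matches."""
    s, best = 0, 0
    for i in range(start, len(text)):
        c = CODE.get(text[i])
        if c is None:
            break                                # character outside the alphabet
        t = next_state(s, c)
        if t is None:
            break                                # termination case 2: check fails
        s = t
        if BASE[s] < 0:                          # end state: remember its position
            best = i - start + 1
    return best                                  # termination case 1 or 2
```

Each input character costs one addition and one comparison, which is why the running time depends on the length of the string and not on the number of entries.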
The schematic diagram of the trie structure in Figure 3 (a preferred embodiment of the present invention) helps in understanding the procedure and principle of the forward maximum matching algorithm above. For example, if the user enters the address "Guangdong Shenzhen Bao'an Xixiang" as one unsegmented string, then matching and segmenting it against the address metadata dictionary by forward maximum matching yields the elements "Guangdong", "Shenzhen", "Bao'an", and "Xixiang".
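The segmentation step itself can be sketched independently of the storage structure. The example below is a simplified illustration, not the embodiment's implementation: it swaps the double-array trie for a plain Python set, so each lookup is a slice-and-test rather than a DFA walk, and it assumes that text matching no entry is collected into a single unrecognized run. The function name `segment` and the romanized sample entries are hypothetical.

```python
def segment(text, entries):
    """Forward maximum matching: at each position take the longest matching entry."""
    max_len = max(len(e) for e in entries)
    pieces, unknown, i = [], "", 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in entries:
                if unknown:                # flush unrecognized text seen so far
                    pieces.append(unknown)
                    unknown = ""
                pieces.append(text[i:i + n])
                i += n
                break
        else:
            unknown += text[i]             # no entry starts here
            i += 1
    if unknown:
        pieces.append(unknown)
    return pieces


ENTRIES = {"Guang", "Guangdong", "Shenzhen", "Bao'an", "Xixiang"}
print(segment("GuangdongShenzhenBao'anXixiang", ENTRIES))
```

Because "Guangdong" is tried before the shorter entry "Guang", the longer match wins, which is the defining property of maximum matching.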
The address labeling module 320 uses the address metadata dictionary module 200 to label the segmented address metadata set according to predefined label attributes, and generates the labeled address metadata set. Because user input is arbitrary, address information may well be entered incorrectly, or stray text may be mixed in between otherwise correct address levels, so the system must be robust. To this end, a preferred embodiment of the present invention provides an address labeling method. The method differs according to the model of the hierarchical address metadata dictionary; in this preferred embodiment, for "Table 1, four-level hierarchical address model (a)", it is as follows:
The label attributes are predefined as follows:
1) Provinces, autonomous regions, and municipalities directly under the central government are the first level; the label "(1)" is appended to a segmented address metadata element of this level, i.e. a level-1 address.
2) Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the central government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures are the second level; the label "(2)" is appended to a segmented address metadata element of this level, i.e. a level-2 address.
3) Districts of sub-provincial cities and districts of prefecture-level cities are the third level; the label "(3)" is appended to a segmented address metadata element of this level, i.e. a level-3 address.
4) Townships and towns, roads, natural villages, building names, and the like are the fourth level; the label "(4)" is appended to a segmented address metadata element of this level, i.e. a level-4 address.
5) Data that is not address information, or address information the system cannot recognize, receives the label "(0)", i.e. a level-0 address.
Under these definitions, if the user enters the address "Guangdong Shenzhen Bao'an Xixiang", matching and segmenting it against the address metadata dictionary by forward maximum matching yields the elements "Guangdong", "Shenzhen", "Bao'an", and "Xixiang", and the labeled address may be "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)". Because address metadata at different levels can share the same name, labeling the entered address can produce several labelings. The input above has four: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)", and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)".
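The four labelings arise as the Cartesian product of the levels at which each element is found. A minimal sketch, assuming a hypothetical lookup `LEVELS` that lists those levels per element (chosen here to reproduce the example) and assuming that an element found at no level is labeled 0:

```python
from itertools import product

# Hypothetical lookup: the levels at which each element occurs in the dictionary.
LEVELS = {
    "Guangdong": [1],
    "Shenzhen": [2, 4],
    "Bao'an": [3],
    "Xixiang": [4, 2],
}


def candidate_labelings(elements, levels=LEVELS):
    """Enumerate every labeling; elements missing from the lookup get level 0."""
    options = [levels.get(e, [0]) for e in elements]
    return [list(zip(elements, combo)) for combo in product(*options)]


def render(labeling):
    """Format a labeling in the notation used in the description."""
    return " ".join(f"{name} ({level})" for name, level in labeling)


for labeling in candidate_labelings(["Guangdong", "Shenzhen", "Bao'an", "Xixiang"]):
    print(render(labeling))
```

The number of candidates grows multiplicatively with the number of ambiguous elements, which is one reason a dynamic programming search is preferable to scoring every candidate separately.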
Accordingly, under the definitions above, typical label sequences fall into the following types, shown in "Table 5, examples of typical label sequences":
Table 5. Examples of typical label sequences
Address information for "Table 2, four-level hierarchical address model (b)", "Table 3, six-level hierarchical address model (a)", and "Table 4, six-level hierarchical address model (b)" can be labeled in a similar way. The difference is that the label types of a six-level hierarchical address metadata dictionary are more detailed than those of a four-level one.
The weight module 330 computes the weight of each labeled address metadata set and outputs the set with the maximum weight. After processing by the address segmentation module 310 and the address labeling module 320, an entered address yields one or more labeled address metadata sets. For the input address "Guangdong Shenzhen Bao'an Xixiang", four labelings are obtained: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)", and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)". This happens because address metadata at different levels can share the same name. In an embodiment of the present invention, the weights may be computed by a dynamic programming algorithm, and the address metadata set with the maximum weight is output.
In an embodiment of the present invention the dynamic programming algorithm may be a classic one, the Viterbi algorithm, which computes the optimal sequence of address-level labels; both the observations and the states of the algorithm are address levels. The algorithm comprises the following:
An initial state vector:
$Pi_{n \times 1} = (\pi_1, \pi_2, \pi_3, \ldots, \pi_n)^{T}$    formula (3)
where $\pi_i$ is the probability that the address level is $i$. The values in Pi are set empirically and follow this principle: the higher the administrative level of an address, the higher its probability; for example, the probability of the provincial level is greater than that of the city level.
A probability transition matrix $A_{n \times n} = (a_{ij})$, where
$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le n$
is the probability that the next address level is $j$ given that the current address level is $i$. Each value in the matrix A is set empirically. For example, the values $a_{ij}$ should follow these principles:
1) In general, when $(i, j)$ forms a reversed pair ($i \ge j$), the value of $a_{ij}$ should in principle be smaller than the values of all non-reversed pairs ($i < j$). This condition ensures that the final label sequence runs in the direction of increasing address level.
2) When the address hierarchy model allows a level to repeat, for example "(4)+" indicating that fourth-level addresses may occur consecutively, the corresponding value $a_{44}$ is somewhat larger; otherwise it is somewhat smaller.
The matrix A is subject to the constraint that each row sums to 1: $\sum_{j=1}^{n} a_{ij} = 1, \quad 1 \le i \le n$
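The Viterbi algorithm proceeds as follows:
1) Initialization
$\delta_1(i) = \pi_i, \quad 1 \le i \le n$    formula (4)
$\psi_1(i) = 0$    formula (5)
2) Recursion
$\delta_t(j) = \max_{1 \le i \le n} \left[ \delta_{t-1}(i)\, a_{ij} \right]$    formula (6)
$\psi_t(j) = \arg\max_{1 \le i \le n} \left[ \delta_{t-1}(i)\, a_{ij} \right]$    formula (7)
where $2 \le t \le T$, $1 \le j \le n$
3) Termination
$P^{*} = \max_{1 \le i \le n} \delta_T(i)$    formula (8)
$q_T^{*} = \arg\max_{1 \le i \le n} \delta_T(i)$    formula (9)
The best label sequence is obtained by backtracking:
$q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \quad t = T-1, T-2, \ldots, 1$    formula (10)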
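The initialization, recursion, termination, and backtracking steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the embodiment's code: each position t carries a list of candidate levels produced by the labeling step and the search is restricted to those levels, levels 0 to 4 are used directly as array indices, and the names `viterbi` and `candidates` are hypothetical. The values of Pi and A are the ones the description gives for four-level model (a).

```python
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def viterbi(candidates, pi, a):
    """Return (best level sequence, its weight) over the candidate levels.

    candidates[t] lists the levels allowed for the element at position t.
    """
    delta = {i: pi[i] for i in candidates[0]}          # initialization
    back = []                                          # one psi table per step
    for t in range(1, len(candidates)):
        new_delta, psi = {}, {}
        for j in candidates[t]:                        # recursion
            i_best = max(delta, key=lambda i: delta[i] * a[i][j])
            new_delta[j] = delta[i_best] * a[i_best][j]
            psi[j] = i_best
        delta = new_delta
        back.append(psi)
    last = max(delta, key=delta.get)                   # termination
    weight = delta[last]
    path = [last]
    for psi in reversed(back):                         # backtracking
        path.append(psi[path[-1]])
    path.reverse()
    return path, weight


# Guangdong [1], Shenzhen [2 or 4], Bao'an [3], Xixiang [4 or 2]
print(viterbi([[1], [2, 4], [3], [4, 2]], PI, A))
```

Restricting each step to the candidate levels keeps the search aligned with what the dictionary actually allows, while the recursion avoids scoring every complete labeling separately.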
An example illustrates the implementation of the algorithm above. Because the models of the hierarchical address metadata dictionary differ, the specific values of Pi and A in the Viterbi algorithm differ as well. In an embodiment of the present invention, for "Table 1, four-level hierarchical address model (a)", Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
If the entered address is "Guangdong Shenzhen Bao'an Xixiang", processing by the address segmentation module 310 and the address labeling module 320 yields four labelings: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)", and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)". The Viterbi algorithm gives the weights of the four labelings:
1) Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4); P=0.030375
2) Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2); P=0.0030375
3) Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4); P=0.001125
4) Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2); P=1.125E-4
The label sequence with the highest probability is the first labeling. The dynamic programming algorithm therefore outputs the first labeling, "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)".
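The four weights can be reproduced by multiplying the initial probability of the first level by the transition probabilities along each labeling; for the first labeling, P = π₁ · a₁₂ · a₂₃ · a₃₄ = 0.45 × 0.45 × 0.30 × 0.50 = 0.030375. The sketch below assumes that levels 0 to 4 index Pi and A directly, an assumption that is consistent with all four published values; the helper name `path_weight` is illustrative.

```python
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def path_weight(levels):
    """Weight of one labeling: initial probability times each transition."""
    weight = PI[levels[0]]
    for i, j in zip(levels, levels[1:]):
        weight *= A[i][j]
    return weight


CANDIDATES = [(1, 2, 3, 4), (1, 2, 3, 2), (1, 4, 3, 4), (1, 4, 3, 2)]
WEIGHTS = {levels: path_weight(levels) for levels in CANDIDATES}
BEST = max(WEIGHTS, key=WEIGHTS.get)

for levels, weight in WEIGHTS.items():
    print(levels, weight)
```

Scoring each candidate in full, as here, is a brute-force check on the Viterbi recursion, which reaches the same maximum without enumerating every labeling.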
Weights for "Table 2, four-level hierarchical address model (b)", "Table 3, six-level hierarchical address model (a)", and "Table 4, six-level hierarchical address model (b)" can be computed in a similar way.
The address alias metadata dictionary 220 stores address alias metadata, which is mapped to the hierarchical address metadata dictionary 210. Using the address alias metadata dictionary 220, the address standardization module 340 can standardize an entered address alias; the standardized name is the official full name of the address. For example, "Shanghai" is standardized as "Shanghai City", "Guangdong" as "Guangdong Province", "Guangxi" as "Guangxi Zhuang Autonomous Region", and "Beijing" as "Beijing City".
Once the maximum-weight address metadata set has been generated, the address standardization module 340 standardizes it using the address alias metadata dictionary 220, generates the standardized address, and sends a control command to the address output module 400. For the input address "Guangdong Shenzhen Bao'an Xixiang" above, processing by the address segmentation module 310, the address labeling module 320, and the weight module 330, followed by standardization in the address standardization module 340, yields the standardized address "Guangdong Province, Shenzhen City, Bao'an District, Xixiang Town". The address standardization module 340 handles level-0 addresses (that is, elements labeled "(0)") as follows:
1) The level-0 address appears before the first level-4 address. Because the collection of first-, second-, and third-level addresses is relatively complete, it is unusual for an address of the first three levels to be missing from the dictionary. A level-0 address that appears before the first level-4 address is therefore generally the result of a user input error. In an embodiment of the present invention, the address standardization module 340 by default standardizes the other, non-zero-level addresses before and after the level-0 address. In other embodiments, the address text labeled "(0)" may instead be deleted, or simply left unstandardized.
2) The level-0 address appears after the first level-4 address. This may happen because the address data is incomplete (for example, with an address metadata dictionary built on "Table 1, four-level hierarchical address model (a)", data following a road, such as typical building names or residential-community names, may go unrecognized), or because of an input error. In this preferred embodiment, none of the labeled metadata elements that follow the last element labeled "(4)" before the first level-0 address are standardized. Take the following address as an example: "Hubei Province Wuhan City Wuchang District Luoyu Road No. 1037 Wan Xin Garden 7 Unit 3 Room 203". Its label sequence is: "Hubei Province (1) Wuhan City (2) Wuchang District (3) Luoyu Road (4) No. 1037 (4) Wan (1) Xin Garden (0) 7 (4) Unit 3 (4) Room 203 (4)". When the data in the address database is incomplete, this case is handled as follows: the first element labeled as a level-0 address is "Xin Garden", and the last element labeled as a level-4 address before it is "No. 1037", so no element after "No. 1037" is standardized. The final result for "No. 1037 Wan Xin Garden 7 Unit 3 Room 203" should therefore be "No. 1037 Wan Xin Garden 7 Unit 3 Room 203", not "No. 1037 Anhui Province Xin Garden 7 Unit 3 Room 203".
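The two cases can be sketched as a single pass over the labeled elements. This is an illustration under stated assumptions: the alias table `OFFICIAL` and the function name `standardize` are hypothetical, elements are written in romanized form, and in case 1 the level-0 text is simply passed through unchanged (one of the options the description allows).

```python
# Hypothetical alias table keyed by (text, level); not taken from the embodiment.
OFFICIAL = {
    ("Wan", 1): "Anhui Province",
    ("Hubei", 1): "Hubei Province",
    ("Wuhan", 2): "Wuhan City",
}


def standardize(labeled):
    """Standardize (text, level) pairs, honoring the level-0 rules.

    Case 2: if a level-0 element follows a level-4 element, everything after
    the last level-4 element preceding the first level-0 element is left as is.
    Case 1: otherwise every element is looked up; unknown text passes through.
    """
    levels = [level for _, level in labeled]
    freeze_from = len(labeled)                 # default: nothing is frozen
    if 0 in levels:
        first_zero = levels.index(0)
        fours = [k for k in range(first_zero) if levels[k] == 4]
        if fours:
            freeze_from = fours[-1] + 1
    return [
        text if k >= freeze_from else OFFICIAL.get((text, level), text)
        for k, (text, level) in enumerate(labeled)
    ]


LABELED = [
    ("Hubei", 1), ("Wuhan", 2), ("Wuchang District", 3), ("Luoyu Road", 4),
    ("No. 1037", 4), ("Wan", 1), ("Xin Garden", 0), ("7", 4),
    ("Unit 3", 4), ("Room 203", 4),
]
RESULT = standardize(LABELED)
print(RESULT)
```

Here "Wan" stays untouched because it follows "No. 1037", the last level-4 element before the first level-0 element, whereas the same alias at the start of an address would be expanded to "Anhui Province".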
In another preferred embodiment of the present invention, the address recognition and standardization module 300 comprises, in addition to the address segmentation module 310, the address labeling module 320, the weight module 330, and the address standardization module 340, a correction module 350, as shown in Figure 5, a structural diagram of the address recognition and standardization module in this other preferred embodiment. The correction module uses the address metadata dictionary to determine whether the label sequence of the maximum-weight address metadata set is consistent with the constraints of the address metadata dictionary. If it is not, the correction module corrects the label sequence of the maximum-weight address metadata set, generates the corrected address metadata set, and outputs it to the address standardization module.
In this other preferred embodiment, the address metadata dictionary module 200 further comprises an address metadata correction dictionary 230, as shown in Figure 6, a structural diagram of the address metadata dictionary module in this other preferred embodiment. The address metadata correction dictionary 230 stores constraint data of the following two types:
1) Constraints for the case where a district shares a name with a county or county-level city. For example, in the four-level address models (a, b) and the six-level address models (a, b), the alias of a district can be identical to the alias of a county or county-level city. "Taihe County" (under Fuyang City, Anhui Province) and "Taihe District" (under Jinzhou City, Liaoning Province) both have the alias "Taihe". When the optimal label sequence is computed by the dynamic programming algorithm, "Taihe" is always labeled at the district level, so the optimal candidate must additionally be corrected using a constraint. In an embodiment of the present invention the stored constraint is "Taihe → Fuyang City", which states that "Taihe" is a county under "Fuyang City".
2) Constraints for the case where a township or town shares a name with a county or county-level city. For example, in four-level address model (a) and six-level address model (a), a township or town can share a name with a county or county-level city. In the address "Fujian Longyan Changting Heping Road", "Changting" would be labeled at the township level, whereas it is in fact a county under "Longyan", namely "Changting County", and not "Changting Town" under Hailin City, Mudanjiang City, Heilongjiang Province. A constraint is therefore also needed for this judgment. The constraint is "Changting → Longyan", which states that "Changting" is a county under "Longyan".
Using the address metadata correction dictionary 230, the correction module 350 determines whether the maximum-weight address metadata set is consistent, that is, whether it satisfies the constraints. If it is inconsistent, that is, if it does not satisfy the constraints, the selected optimal result is faulty, and the correction module 350 corrects the labels of the maximum-weight address metadata set, generates the corrected address metadata set, and outputs it to the address standardization module.
An example illustrates the implementation of this other preferred embodiment. In this embodiment, for "Table 1, four-level hierarchical address model (a)", Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
If the user enters the address "Heilongjiang Heihe Wudalianchi Xinfa Township", processing by the address segmentation module 310, the address labeling module 320, and the weight module 330 gives the following labelings and weights for the input address:
1) "Heilongjiang (1) Heihe (2) Wudalianchi (4) Xinfa Township (4)" P=0.0200475;
2) "Heilongjiang (1) Heihe (2) Wudalianchi (2) Xinfa Township (4)" P=0.0111375;
3) "Heilongjiang (1) Heihe (4) Wudalianchi (4) Xinfa Township (4)" P=0.0091125;
4) "Heilongjiang (1) Heihe (4) Wudalianchi (2) Xinfa Township (4)" P=0.001485.
The maximum-weight metadata set output at this point is "Heilongjiang (1) Heihe (2) Wudalianchi (4) Xinfa Township (4)". Using the address metadata correction dictionary 230 in the address metadata dictionary module 200, the correction module finds that, in this maximum-weight set, "Heihe (2) Wudalianchi (4)" is inconsistent with the constraint "Wudalianchi → Heihe City". The correction module therefore corrects the labels of the maximum-weight address metadata set "Heilongjiang (1) Heihe (2) Wudalianchi (4) Xinfa Township (4)", generates the corrected address metadata set "Heilongjiang (1) Heihe (2) Wudalianchi (2) Xinfa Township (4)", and outputs it to the address standardization module.
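The correction step can be sketched as a lookup against the stored constraints. The description specifies only the form "child → parent" of a constraint; storing the level that the child takes under that parent (level 2 for a county or county-level city in four-level model (a)) is an assumption made here for illustration, as are the names `CONSTRAINTS` and `correct` and the rule that the parent must be the immediately preceding element.

```python
# Hypothetical correction dictionary: child -> (parent, level of child under parent).
# The level value is an assumption; the description gives only "child -> parent".
CONSTRAINTS = {
    "Wudalianchi": ("Heihe", 2),
    "Changting": ("Longyan", 2),
    "Taihe": ("Fuyang", 2),
}


def correct(labeled):
    """Relabel an element when its constraint parent directly precedes it."""
    fixed = list(labeled)
    for k in range(1, len(fixed)):
        name, level = fixed[k]
        rule = CONSTRAINTS.get(name)
        if rule is None:
            continue
        parent, required_level = rule
        if fixed[k - 1][0] == parent and level != required_level:
            fixed[k] = (name, required_level)   # label was inconsistent: correct it
    return fixed


BEST = [("Heilongjiang", 1), ("Heihe", 2), ("Wudalianchi", 4), ("Xinfa Township", 4)]
print(correct(BEST))
```

With these assumptions the output matches the corrected set in the example; a labeling whose parent element differs from the one named in the constraint is left unchanged.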