V. Detailed Description of the Embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Fig. 1 is a structural diagram of an address recognition and standardization system of the present invention. The system comprises an address input module 100, an address recognition and standardization module 300, an address metadata dictionary module 200, and an address output module 400. The address input module 100 receives the address entered by the user and sends it to the address recognition and standardization module 300. The address recognition and standardization module 300 is connected to the address input module 100, the address metadata dictionary module 200, and the address output module 400; it receives the user-entered address sent by the address input module 100, recognizes and standardizes it, and produces a standardized address. The address metadata dictionary module 200 stores address metadata and receives and responds to control commands from the address recognition and standardization module 300. The address output module 400 receives control commands from the address recognition and standardization module 300 and outputs the standardized address.
The address metadata dictionary module 200 is shown in Fig. 2, a structural diagram of the address metadata dictionary module of a preferred embodiment of the present invention. The module comprises a hierarchical address metadata dictionary 210 and an address alias metadata dictionary 220. The hierarchical address metadata dictionary 210 stores hierarchical address metadata and may be either a four-level or a six-level hierarchical address metadata dictionary.
The four-level hierarchical address metadata dictionary is divided by administrative region. In one example of the invention, a four-level hierarchical address model is provided to build this dictionary, as shown in "Table 1, four-level hierarchical address model (a)". Provinces, autonomous regions, and municipalities directly under the Central Government form the first level. Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the Central Government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures form the second level. Districts of sub-provincial cities and of prefecture-level cities form the third level. Townships, roads, natural villages, house numbers, and building names form the fourth level. This model is most often applied to addresses on identity cards and to addresses as they are commonly written.
Table 1, four-level hierarchical address model (a)
An embodiment of the invention may also use another four-level hierarchical address model to build the four-level hierarchical address metadata dictionary, as shown in "Table 2, four-level hierarchical address model (b)". Provinces, autonomous regions, and municipalities directly under the Central Government form the first level. Sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures form the second level. Districts of municipalities directly under the Central Government, districts of sub-provincial cities and of prefecture-level cities, counties, county-level cities, autonomous counties, and banners form the third level. Townships, roads, natural villages, house numbers, and building names form the fourth level. This model is more rigorous: it strictly follows the levels of the national administrative divisions and is commonly used for address classification and level information on the Internet.
Table 2, four-level hierarchical address model (b)
The six-level hierarchical address metadata dictionary is likewise divided by administrative region. An embodiment of the invention may use a six-level hierarchical address model to build this dictionary, as shown in "Table 3, six-level hierarchical address model (a)". Provinces, autonomous regions, and municipalities directly under the Central Government form the first level. Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the Central Government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures form the second level. Districts of sub-provincial cities and of prefecture-level cities form the third level. Townships form the fourth level. Streets, avenues, roads, and natural villages form the fifth level. Numbers form the sixth level. This model, too, is most often applied to addresses on identity cards and to addresses as they are commonly written.
Table 3, six-level hierarchical address model (a)
An embodiment of the invention may also use another six-level hierarchical address model to build the six-level hierarchical address metadata dictionary, as shown in "Table 4, six-level hierarchical address model (b)". Provinces, autonomous regions, and municipalities directly under the Central Government form the first level. Sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures form the second level. Districts of municipalities directly under the Central Government, districts of sub-provincial cities and of prefecture-level cities, counties, county-level cities, autonomous counties, and banners form the third level. Townships form the fourth level. Streets, avenues, roads, and natural villages form the fifth level. Numbers form the sixth level. This model is likewise more rigorous: it strictly follows the levels of the national administrative divisions and is commonly used for address classification and level information on the Internet.
Table 4, six-level hierarchical address model (b)
As the tables of the hierarchical address models above show, the six-level hierarchical address metadata dictionary holds more detailed and more explicit address data than the four-level one, and the dictionary is correspondingly larger.
The hierarchical address metadata dictionary 210 may use a Trie storage structure, which may be implemented as a double array. In a double-array Trie, all entries are compiled into a dictionary tree, and this tree is a deterministic finite automaton (DFA).
The address alias metadata dictionary 220 stores address alias metadata, which is mapped to the metadata in the hierarchical address metadata dictionary 210. The relation between the hierarchical address metadata dictionary 210 and the address alias metadata dictionary 220 is one-to-many: for example, the address metadata "Anhui Province" in the hierarchical address metadata dictionary 210 corresponds to several items of alias metadata in the address alias metadata dictionary 220, such as "Anhui" and other short forms of the name. Within the place-name metadata set of a single level, an alias can correspond to only one item of address metadata. A mapping can therefore be established, level by level, from each alias to its address metadata, so that all address metadata is processed in a uniform way.
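The per-level alias mapping described above can be sketched as follows. This is only an illustrative sketch: the names `ALIASES` and `canonical_name`, and the use of a nested Python dictionary keyed by level, are assumptions made for the example and are not taken from the invention.

```python
# Hypothetical alias table: level -> {alias -> official full name}.
# Within one level an alias maps to exactly one address metadata item.
ALIASES = {
    1: {
        "Anhui": "Anhui Province",
        "Guangdong": "Guangdong Province",
        "Shanghai": "Shanghai City",
    },
    2: {
        "Shenzhen": "Shenzhen City",
    },
}


def canonical_name(level, name, aliases=ALIASES):
    """Map an alias to its full name at the given level.

    Names that are not known aliases at that level are returned unchanged.
    """
    return aliases.get(level, {}).get(name, name)
```

Keying the table by level first reflects the statement that an alias is unambiguous only inside a single level; the same short name may legitimately appear under another level with a different full name.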
The address recognition and standardization module 300 is shown in Fig. 4, a structural diagram of the address recognition and standardization module of a preferred embodiment of the present invention. The module comprises an address segmentation module 310, an address labeling module 320, a weight module 330, and an address standardization module 340.
The address segmentation module 310 receives the user-entered address sent by the address input module 100, segments it, and generates a set of segmented address metadata. Segmentation may use the hierarchical address metadata dictionary 210 of the address metadata dictionary module 200 and apply forward (left-to-right) maximum matching to match and segment the user-entered address.
The hierarchical address metadata dictionary 210 uses a Trie storage structure implemented as a double array. The double-array Trie comprises two arrays: the array Base[], which stores, for the current state, the base offset from which the next state is located for each input value (chosen so that entries do not collide); and the array Check[], which stores the position of the predecessor state of the current state. All stored entries are compiled into a dictionary tree, and this tree is a deterministic finite automaton (DFA). In a preferred embodiment of the present invention, forward maximum matching may be used to match and segment the user-entered address. Taking as an example the case in which the Unicode code of each character of the input string is the input variable of the DFA (the same applies when each byte of a UTF-8, GBK, or GB 18030 encoding is used as the input variable), the forward maximum matching algorithm is carried out as follows:
Starting from the initial state, the next state is obtained from the value of the input variable by the following formulas:
Base[s] + c = t    formula (1)
Check[Base[s] + c] = s    formula (2)
Here s is the current state, c is the value of the input variable, and t is the position of the next state. Formula (2) is a check: it expresses that the predecessor of the next state is the current state. If the current state is an end state, the value at the corresponding position of the Base array is negative; otherwise it is positive (the current state is a non-end, or intermediate, state).
Following this calculation, the Unicode value of each character of the string is read in turn as the value of the input variable, the next state is computed from the current state, and the next state then becomes the current state; this repeats until matching terminates. Matching terminates in either of the following cases:
1) the end of the input string is reached;
2) the check condition of formula (2) is not satisfied.
When matching terminates, the input position corresponding to the last end state passed through is the maximum matching position, and its distance from the starting point is returned as the maximum matching length.
It is easy to see from the above that the complexity of this algorithm is O(n), where n is the length of the string being searched. The complexity is independent of the size of the dictionary (the number of entries it contains).
The implementation and principle of the forward maximum matching algorithm above can be understood with reference to Fig. 3, a schematic diagram of the Trie structure of a preferred embodiment of the present invention. For example, if the user enters the address "Guangdong Shenzhen Bao'an Xixiang" as one unsegmented string, forward maximum matching against the address metadata dictionary yields the segmented address "Guangdong / Shenzhen / Bao'an / Xixiang".
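The matching procedure of formulas (1) and (2) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the construction routine `build_double_array` is a naive, unoptimised one chosen for brevity; the absolute value of Base[s] is used as the offset when Base[s] is negative; characters that match no entry are emitted one at a time; and the romanised, unsegmented test strings stand in for the original address text. All function names are hypothetical.

```python
def build_double_array(words):
    """Compile the entries into Base[] / Check[] (naive construction)."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ord(ch), {})
        node[None] = True                      # end-of-entry flag

    base, check = [1], [-1]                    # state 0 is the root; -1 = free slot

    def ensure(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)

    stack = [(0, trie)]
    while stack:
        s, node = stack.pop()
        codes = sorted(k for k in node if k is not None)
        b = 1
        while codes:                           # find an offset with no collision
            ensure(b + codes[-1] + 1)
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        for c in codes:
            check[b + c] = s                   # predecessor of state b + c is s
            stack.append((b + c, node[c]))
        base[s] = -b if None in node else b    # negative Base marks an end state
    return base, check


def longest_match(base, check, text, start=0):
    """Length of the longest entry that begins at text[start]."""
    s, best = 0, 0
    for i in range(start, len(text)):
        t = abs(base[s]) + ord(text[i])        # formula (1)
        if t >= len(check) or check[t] != s:   # formula (2) not satisfied
            break
        s = t
        if base[s] < 0:                        # passed through an end state
            best = i - start + 1
    return best


def segment(base, check, text):
    """Forward maximum matching over the whole string."""
    pieces, i = [], 0
    while i < len(text):
        n = longest_match(base, check, text, i) or 1   # assumption: unknown char = 1 piece
        pieces.append(text[i:i + n])
        i += n
    return pieces
```

For instance, with the entries "guangdong", "shenzhen", "shen", "baoan", and "xixiang", the string "guangdongshenzhenbaoanxixiang" is segmented into those four place names, and "shenzhen" is preferred over the shorter entry "shen" because the last end state passed through is kept.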
The address labeling module 320 uses the address metadata dictionary module 200 to label the set of segmented address metadata according to predefined label attributes and generates a set of labeled address metadata. Because user input is arbitrary, address information may well be mistyped, or stray text may be mixed in between correct addresses of different levels, so the system must be robust in its processing. To this end, a preferred embodiment of the invention provides an address labeling method. The method varies with the model of the hierarchical address metadata dictionary; taking "Table 1, four-level hierarchical address model (a)" as the example, it proceeds as follows:
The label attributes are predefined as follows:
1) Provinces, autonomous regions, and municipalities directly under the Central Government are the first level; the label "(1)" is appended after the segmented first-level address metadata, marking a first-level address.
2) Sub-provincial cities, prefecture-level cities, districts of municipalities directly under the Central Government, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures are the second level; the label "(2)" is appended after the segmented second-level address metadata, marking a second-level address.
3) Districts of sub-provincial cities and of prefecture-level cities are the third level; the label "(3)" is appended after the segmented third-level address metadata, marking a third-level address.
4) Townships, roads, natural villages, building names, and the like are the fourth level; the label "(4)" is appended after the segmented fourth-level address metadata, marking a fourth-level address.
5) For data that is not address information, or address information that the system cannot recognize, the label "(0)" is appended, marking a real-time address.
Under these definitions, if the user enters the address "Guangdong Shenzhen Bao'an Xixiang", forward maximum matching against the address metadata dictionary segments it into "Guangdong / Shenzhen / Bao'an / Xixiang", and the labeled address may be "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)". Because address metadata at different levels can share the same name, labeling an input address can produce several labelings. The address above has four: "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)", "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2)", "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4)", and "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2)".
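The way same-named entries at different levels multiply the labelings can be sketched as follows. The lookup table `LEVELS` and the functions `candidate_labelings` and `render` are hypothetical names introduced for the example; the levels listed for each name are taken from the four labelings given above, and assigning "(0)" to unknown text follows attribute 5).

```python
from itertools import product

# Hypothetical lookup: place name -> levels at which an entry of that name exists.
LEVELS = {
    "Guangdong": [1],
    "Shenzhen": [2, 4],
    "Bao'an": [3],
    "Xixiang": [4, 2],
}


def candidate_labelings(tokens, levels=LEVELS):
    """Return every labeling of the segmented tokens.

    A token found at several levels contributes one alternative per level;
    a token that is not in the dictionary receives the label 0.
    """
    options = [levels.get(tok, [0]) for tok in tokens]
    return [list(zip(tokens, combo)) for combo in product(*options)]


def render(labeling):
    """Format a labeling in the "Name(level)" notation used in the text."""
    return " ".join(f"{tok}({lv})" for tok, lv in labeling)
```

Two ambiguous tokens with two levels each give 2 x 2 = 4 labelings, which is why a weighting step is needed to choose among them.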
Accordingly, under the above definitions, the general label orders are of the following types, as shown in "Table 5, examples of general label orders":
Table 5, examples of general label orders
Address labeling based on "Table 2, four-level hierarchical address model (b)", "Table 3, six-level hierarchical address model (a)", or "Table 4, six-level hierarchical address model (b)" can proceed in a similar way. The difference is that the label types of a six-level hierarchical address metadata dictionary are more detailed than those of a four-level one.
The weight module 330 computes the weight of each labeled address metadata set and outputs the set with the largest weight. After processing by the address segmentation module 310 and the address labeling module 320, an input address yields one or more labeled address metadata sets. For example, the input address "Guangdong Shenzhen Bao'an Xixiang" yields four labelings: "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)", "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2)", "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4)", and "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2)". This ambiguity arises because address metadata at different levels can share the same name. In an embodiment of the invention, the weights may be computed by a dynamic programming algorithm, and the address metadata set with the largest weight is output.
The dynamic programming algorithm used in an embodiment of the invention may be a classic one, the Viterbi algorithm, which computes the optimal sequence of address-level labels; both the observations and the states in the algorithm are address levels. The algorithm comprises the following:
An initial state vector:
$\Pi_{n \times 1} = (\pi_1, \pi_2, \pi_3, \ldots, \pi_n)^{T}$    formula (3)
where $\pi_i$ is the initial probability that the address level is i. The values in Pi are set empirically and follow this principle: the higher the administrative level of an address, the higher its initial probability; for example, the initial probability of the provincial level is greater than that of the city level.
A probability transition matrix $A_{n \times n}$, where
$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le n$
is the probability that the next address level is j given that the current address level is i. Each value in matrix A is set empirically. For example, the values $a_{ij}$ should follow these principles:
1) In general, when (i, j) forms a reverse order (i >= j), $a_{ij}$ should in principle be smaller than every non-reverse value (i < j). This condition ensures that the final label sequence runs in the direction of increasing address level.
2) In the hierarchical address model, a given level may occur several times in a row; for example, "(4)+" denotes that fourth-level addresses may occur consecutively. For such a level the corresponding value (here $a_{44}$) is set somewhat larger; otherwise it is set somewhat smaller.
The matrix is subject to the constraint:
$\sum_{j=1}^{n} a_{ij} = 1, \quad 1 \le i \le n$
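The Viterbi algorithm is executed as follows:
1) Initialization
$\delta_1(i) = \pi_i, \quad 1 \le i \le n$    formula (4)
$\psi_1(i) = 0, \quad 1 \le i \le n$    formula (5)
2) Recursion
$\delta_t(j) = \max_{1 \le i \le n} \left[ \delta_{t-1}(i)\, a_{ij} \right]$    formula (6)
$\psi_t(j) = \arg\max_{1 \le i \le n} \left[ \delta_{t-1}(i)\, a_{ij} \right]$    formula (7)
where 2 ≤ t ≤ T and 1 ≤ j ≤ n.
3) Termination
$P^{*} = \max_{1 \le i \le n} \delta_T(i)$    formula (8)
$q_T^{*} = \arg\max_{1 \le i \le n} \delta_T(i)$    formula (9)
The best label sequence is obtained by backtracking, as follows:
$q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \quad t = T-1, T-2, \ldots, 1$    formula (10)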
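The initialization, recursion, termination, and backtracking steps can be sketched as follows. This is a sketch under stated assumptions rather than the implementation of the invention: the function name `viterbi` is hypothetical, the states considered for each token are restricted to the levels at which that token exists in the dictionary (an assumption that mirrors the enumerated labelings), and the values of `PI` and `A` are the example values given for four-level hierarchical address model (a), indexed by level 0 to 4.

```python
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def viterbi(candidates, pi=PI, a=A):
    """Return (best level sequence, its weight).

    candidates[t] is the collection of levels allowed for token t.
    """
    # Initialization: delta_1(i) = pi_i for each allowed level i.
    delta = {i: pi[i] for i in candidates[0]}
    backpointers = []

    # Recursion: keep, for every allowed level j, the best predecessor i.
    for allowed in candidates[1:]:
        new_delta, psi = {}, {}
        for j in allowed:
            best_i = max(delta, key=lambda i: delta[i] * a[i][j])
            new_delta[j] = delta[best_i] * a[best_i][j]
            psi[j] = best_i
        delta = new_delta
        backpointers.append(psi)

    # Termination: pick the most probable final level.
    last = max(delta, key=delta.get)
    weight = delta[last]

    # Backtracking: follow the stored predecessors from the end to the start.
    path = [last]
    for psi in reversed(backpointers):
        path.append(psi[path[-1]])
    path.reverse()
    return path, weight
```

Calling `viterbi([[1], [2, 4], [3], [4, 2]])` for "Guangdong / Shenzhen / Bao'an / Xixiang" selects the level sequence 1, 2, 3, 4 without enumerating all four labelings, which is the benefit of dynamic programming when many tokens are ambiguous.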
An example illustrates the implementation of the above algorithm. The specific values of Pi and A in the Viterbi algorithm differ with the model of the hierarchical address metadata dictionary. In an embodiment of the invention, taking "Table 1, four-level hierarchical address model (a)" as the example, Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
Suppose the input address is "Guangdong Shenzhen Bao'an Xixiang". After processing by the address segmentation module 310 and the address labeling module 320, four labelings are obtained: "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)", "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2)", "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4)", and "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2)". According to the Viterbi algorithm, the weights of the four labelings are:
1) Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4); P=0.030375
2) Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2); P=0.0030375
3) Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4); P=0.001125
4) Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2); P=1.125E-4
The label sequence with the highest probability is the first one. The dynamic programming algorithm therefore outputs the first labeling, "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)".
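The four weights can be reproduced by multiplying the initial probability of the first level by the transition probabilities between consecutive levels. The sketch below assumes that Pi and A are indexed by level 0 to 4, which is the reading under which the example values match the stated weights; the function name `sequence_weight` is hypothetical.

```python
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def sequence_weight(levels, pi=PI, a=A):
    """Weight of one labeling: pi[first] times the product of a[prev][next]."""
    weight = pi[levels[0]]
    for prev, cur in zip(levels, levels[1:]):
        weight *= a[prev][cur]
    return weight


labelings = [(1, 2, 3, 4), (1, 2, 3, 2), (1, 4, 3, 4), (1, 4, 3, 2)]
best = max(labelings, key=sequence_weight)
```

For the first labeling the product is 0.45 x 0.45 x 0.30 x 0.50 = 0.030375: the initial probability of level 1, then the transitions 1 to 2, 2 to 3, and 3 to 4.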
Weights based on "Table 2, four-level hierarchical address model (b)", "Table 3, six-level hierarchical address model (a)", or "Table 4, six-level hierarchical address model (b)" can be computed in a similar way.
The address alias metadata dictionary 220 stores address alias metadata, which is mapped to the hierarchical address metadata dictionary 210. Using the address alias metadata dictionary 220, the address standardization module 340 can standardize an address alias in the input; the standardized name is the official full name of the address. For example, "Shanghai" is standardized as "Shanghai City", "Guangdong" as "Guangdong Province", "Guangxi" as "Guangxi Zhuang Autonomous Region", and "Beijing" as "Beijing City".
Once the address metadata set with the largest weight has been produced, the address standardization module 340 standardizes it using the address alias metadata dictionary 220, generates the standardized address, and sends a control command to the address output module 400. For the input address "Guangdong Shenzhen Bao'an Xixiang" above, processing by the address segmentation module 310, the address labeling module 320, and the weight module 330, followed by standardization by the address standardization module 340, yields the standardized address "Xixiang Town, Bao'an District, Shenzhen City, Guangdong Province". The address standardization module 340 handles real-time addresses (those labeled "(0)") as follows:
1) The real-time address appears before the first occurrence of a fourth-level address. Because the collection of first-, second-, and third-level addresses is relatively complete, an unrecorded address of the first three levels generally does not occur. If a real-time address appears before the first fourth-level address, the cause is generally a user input error. In an embodiment of the invention, the address standardization module 340 by default standardizes the non-zero-level addresses before and after the real-time address. In other embodiments, the address text labeled "(0)" may instead be deleted, or may simply be left unstandardized.
2) The real-time address appears after the first fourth-level address. This may be because the address data is incomplete (for example, with an address metadata dictionary built on "Table 1, four-level hierarchical address model (a)", data following a "road" entry, typically building or residential-compound names, may not be recognized) or because of an input error. In a preferred embodiment of the invention, in this case none of the labeled metadata that follows the last item labeled "(4)" before the first real-time address is standardized. Consider, for example, the address "Hubei Province Wuhan City Wuchang District Luoyu Road No. 1037 Anhui New Garden Building 7 Unit 3 Room 203". Its label sequence is "Hubei Province(1) Wuhan City(2) Wuchang District(3) Luoyu Road(4) No. 1037(4) Anhui(1) New Garden(0) Building 7(4) Unit 3(4) Room 203(4)". The data in the address database is incomplete, and the case is handled as follows: the first item labeled as a real-time address is "New Garden", and the last item before it labeled as a fourth-level address is "No. 1037", so none of the labeled items after "No. 1037" is standardized. The final result for "Anhui New Garden Building 7 Unit 3 Room 203" is therefore "Anhui New Garden Building 7 Unit 3 Room 203", not "Anhui Province New Garden Building 7 Unit 3 Room 203".
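The two cases above can be sketched as a single cutoff rule. This is an illustrative sketch only: the function `standardize`, the representation of a labeled address as a list of (text, level) pairs, and the `canonical` table keyed by (level, alias) are all assumptions made for the example. For case 1 the sketch implements the default behaviour, in which text labeled "(0)" is kept unchanged and every other item is standardized.

```python
def standardize(tokens, canonical):
    """Standardize labeled tokens, honouring the real-time-address rules.

    tokens    -- list of (text, level) pairs in input order
    canonical -- hypothetical table {(level, alias): official full name}
    """
    first_zero = next(
        (i for i, (_, level) in enumerate(tokens) if level == 0), None
    )
    cutoff = len(tokens)                 # by default every token is eligible
    if first_zero is not None:
        fours = [i for i in range(first_zero) if tokens[i][1] == 4]
        if fours:
            # Case 2: nothing after the last level-4 token that precedes
            # the first real-time address is standardized.
            cutoff = fours[-1] + 1
        # Case 1 (no level-4 token yet): cutoff stays at len(tokens).

    result = []
    for i, (text, level) in enumerate(tokens):
        if i < cutoff and level != 0:
            result.append(canonical.get((level, text), text))
        else:
            result.append(text)          # left exactly as entered
    return result
```

In the Wuhan example the cutoff falls immediately after "No. 1037", so the later "Anhui(1)" is left as entered even though an alias entry for it exists.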
In another preferred embodiment of the present invention, the address recognition and standardization module 300 comprises, in addition to the address segmentation module 310, the address labeling module 320, the weight module 330, and the address standardization module 340, a correction module 350, as shown in Fig. 5, a structural diagram of the address recognition and standardization module of this other preferred embodiment. The correction module uses the address metadata dictionary to determine whether the label sequence of the address metadata set with the largest weight is consistent with the constraints of the address metadata dictionary. If it is not, the correction module revises the label sequence of that set, generates a corrected address metadata set, and outputs it to the address standardization module.
In another preferred embodiment of the present invention, the address metadata dictionary module 200 further comprises an address metadata correction dictionary 230, as shown in Fig. 6, a structural diagram of the address metadata dictionary module of this other preferred embodiment. The address metadata correction dictionary 230 stores constraint data of the following two types:
1) Where a district has the same name as a county or county-level city, the corresponding constraint is stored. For example, in the four-level address models (a, b) and the six-level address models (a, b), the alias of a district may be identical to the alias of a county or county-level city. "Taihe County" (under Fuyang City, Anhui Province) and "Taihe District" (under Jinzhou City, Liaoning Province) both have the alias "Taihe". When the optimal label sequence is computed by the dynamic programming algorithm, "Taihe" is always labeled at the district level, so the optimal candidate result must additionally be corrected with a constraint. In an embodiment of the invention, the stored constraint is "Taihe → Fuyang City", which expresses that "Taihe" is a county under "Fuyang City".
2) Where a township or town has the same name as a county or county-level city, the corresponding constraint is stored. For example, in the four-level address model (a) and the six-level address model (a), a township or town may share its name with a county or county-level city. Take the address "Fujian Longyan Changting Heping Road": here "Changting" would be labeled at the township level, but it is in fact a county under "Longyan", namely "Changting County", and not "Changting Town" under Hailin City, Mudanjiang City, Heilongjiang Province. The labeling must therefore also be judged against a constraint, which is "Changting → Longyan"; it expresses that "Changting" is a county under "Longyan".
Using the address metadata correction dictionary 230, the correction module 350 can judge whether the address metadata set with the largest weight is consistent, that is, whether it satisfies the constraints. If it is inconsistent, that is, it does not satisfy the constraints, the selected optimal result is faulty; the correction module 350 then revises the labels of that address metadata set, generates a corrected address metadata set, and outputs it to the address standardization module.
An example illustrates the implementation of this other preferred embodiment. In it, taking "Table 1, four-level hierarchical address model (a)" as the example, Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
Suppose the user enters the address "Heilongjiang Heihe Wudalianchi New Township". After processing by the address segmentation module 310, the address labeling module 320, and the weight module 330, the labelings of the input address and their weights are as follows:
1) "Heilongjiang(1) Heihe(2) Wudalianchi(4) New Township(4)" P=0.0200475;
2) "Heilongjiang(1) Heihe(2) Wudalianchi(2) New Township(4)" P=0.0111375;
3) "Heilongjiang(1) Heihe(4) Wudalianchi(4) New Township(4)" P=0.0091125;
4) "Heilongjiang(1) Heihe(4) Wudalianchi(2) New Township(4)" P=0.001485.
The metadata set with the largest weight is then output: "Heilongjiang(1) Heihe(2) Wudalianchi(4) New Township(4)". Using the address metadata correction dictionary 230 in the address metadata dictionary module 200, the correction module judges this set and finds that "Heihe(2) Wudalianchi(4)" is inconsistent with the constraint "Wudalianchi → Heihe City". The correction module therefore revises the labels of the set "Heilongjiang(1) Heihe(2) Wudalianchi(4) New Township(4)", generates the corrected address metadata set "Heilongjiang(1) Heihe(2) Wudalianchi(2) New Township(4)", and outputs it to the address standardization module.
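The correction step can be sketched as follows. The sketch rests on several assumptions that go beyond the text: the constraint "X → Y" is stored as a plain mapping from the name to its parent, a constraint is taken to apply when the parent immediately precedes the name in the labeled sequence, and the corrected label is the county level, which is level 2 in four-level hierarchical address model (a). The names `CONSTRAINTS`, `COUNTY_LEVEL`, and `correct` are hypothetical.

```python
COUNTY_LEVEL = 2  # counties and county-level cities in four-level model (a)

# Hypothetical correction dictionary: name -> parent it belongs to as a county.
CONSTRAINTS = {
    "Wudalianchi": "Heihe",
    "Changting": "Longyan",
    "Taihe": "Fuyang",
}


def correct(tokens, constraints=CONSTRAINTS, county_level=COUNTY_LEVEL):
    """Relabel a token that violates a stored constraint.

    tokens is a list of (name, level) pairs; a new list is returned and the
    input is left unmodified.
    """
    fixed = list(tokens)
    for i in range(1, len(fixed)):
        name, level = fixed[i]
        parent = constraints.get(name)
        follows_parent = parent is not None and fixed[i - 1][0] == parent
        if follows_parent and level != county_level:
            fixed[i] = (name, county_level)   # label was inconsistent; revise it
    return fixed
```

Applied to "Heilongjiang(1) Heihe(2) Wudalianchi(4) New Township(4)", the sketch changes only the label of "Wudalianchi", giving the corrected sequence stated above; a "Changting" that does not follow "Longyan" is left untouched.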