Detailed Description of the Embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Depending on the destination of a letter, sorting is divided into two modes: local sorting and transit sorting. Local sorting applies to letters addressed to the area where the sorting machine is located, and must be accurate to the delivery sub-office or delivery road segment. Transit sorting applies to letters addressed to any region other than the one where the sorting machine is located, and is accurate to the provincial, prefecture, or county/district level as required by the sorting plan. Both modes require a corresponding address database to drive Chinese address recognition for the letters.
Transit address storage format: the structure of the transit address table is shown in Figure 1. It consists of the three levels of national administrative divisions: the 31 provincial-level divisions (such as Zhejiang Province), the prefecture-level divisions under each province (such as Hangzhou City), and the county/district-level divisions under each prefecture-level division (such as Yuhang District). Each administrative division is annotated with the postcode range of its area. For example, the first four digits of the Yuhang District postcode are 3111 and the last two are undetermined, so the range is written 3111xx, where x stands for any digit from 0 to 9. Note that the postcode annotated for a prefecture-level city is the postcode of its urban center. This yields a national gazetteer that is accurate to county/district-level administrative divisions, together with the corresponding postcodes. In the present invention, each address entry in this table is called a query address.
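The postcode range notation can be checked mechanically. The following is a minimal sketch, assuming postcodes are 6-digit strings and that a range is written with "x" as the wildcard, as in the 3111xx example; the function name `postcode_matches` is hypothetical and not part of the original text.

```python
def postcode_matches(pattern: str, postcode: str) -> bool:
    """Return True if a 6-digit postcode fits a range pattern such as '3111xx'.

    In the pattern, 'x' stands for any digit from 0 to 9.
    """
    if len(pattern) != 6 or len(postcode) != 6 or not postcode.isdigit():
        return False
    # Every position must either be a wildcard or the same digit.
    return all(p in ("x", "X") or p == c for p, c in zip(pattern, postcode))


print(postcode_matches("3111xx", "311121"))
print(postcode_matches("3111xx", "310012"))
```

A pattern with fewer fixed digits (for example one covering a whole province) is handled by the same comparison, because only the non-wildcard positions are checked.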
Local address storage format: local addresses are divided into two kinds, road addresses and organization (unit) addresses.
Road address information is represented as a road name combined with the associated house numbers. Organization information is more varied: it covers every form of address expression other than a road name, such as a residential community, company, building, government office, school, or minor place name. Redundant information must first be removed from these addresses. For example, for local unit information in Hangzhou, Zhejiang, common words such as "Zhejiang", "Zhejiang Province", "Hangzhou", "Hangzhou City", "Company", and "Co., Ltd." are defined as redundant. These words are removed only where doing so does not affect the expression of the address. The expression that remains after removal is called the abbreviation of the address. For example:
"Zhejiang Provincial People's Government" is abbreviated to "Provincial People's Government".
"Zhejiang University" keeps the abbreviation "Zhejiang University".
"Hangzhou Alibaba Co., Ltd." is abbreviated to "Alibaba".
The advantage of abbreviations is as follows. Local unit names are long, so matching them in full is very difficult; moreover, phrases such as "Zhejiang Province" occur so frequently in addresses that they seriously impair the discrimination between two similar addresses. Using abbreviations removes most of this interference and improves the matching degree.
Once redundancy has been removed and the address entry has been reduced to its abbreviation, a keyword is extracted as a search term in preparation for the subsequent matching. A keyword is defined here as a string of three consecutive characters in an address whose frequency of occurrence in other addresses is the lowest.
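The keyword definition can be illustrated with a short sketch. It assumes that "frequency" means the number of addresses in which a three-character substring occurs, and that ties are broken by taking the leftmost substring; both choices, the function name `extract_keywords`, and the sample names are assumptions made for illustration only.

```python
from collections import Counter


def extract_keywords(names, k=3):
    """Pick, for every name, the k-character substring that is rarest
    across the whole list of names (hypothetical helper)."""
    # Count in how many names each k-character substring occurs.
    doc_freq = Counter()
    for name in names:
        doc_freq.update({name[i:i + k] for i in range(len(name) - k + 1)})

    keywords = {}
    for name in names:
        grams = [name[i:i + k] for i in range(len(name) - k + 1)]
        # Names shorter than k characters are used whole.
        keywords[name] = min(grams, key=lambda g: doc_freq[g]) if grams else name
    return keywords


# Illustrative abbreviations written in Chinese characters, since the
# keyword is defined over characters rather than translated words.
sample = ["西湖区人民法院", "下城区人民法院", "浙江大学"]
for name, keyword in extract_keywords(sample).items():
    print(name, keyword)
```

Substrings such as "人民法" occur in two of the sample names and are therefore avoided in favour of substrings that identify a single entry.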
Road address information is represented as follows:
No. | Road name | Parity flag | Start number | End number | Postcode | District | Delivery sub-office | Road segment number | Search keyword
1 | Chaowang Road | Odd | 7 | 53 | 210014 | Xiacheng District | | |
2 | Desheng Road | Even | 328 | 10000 | 310015 | Gongshu District | | |
"Road name": the name of the road, for example Zhongshan North Road, Century Avenue, Changqing Street, Zumiao Lane, or Taohua Alley. Note that no punctuation mark may appear in the name.
"Parity flag": indicates whether the delivery numbers of this road section are odd only, even only, or all consecutive numbers.
"Start number": the first delivery number of this road section.
"End number": the last delivery number of this road section. If the end of the road section is unknown, it is set to "9999" (for odd or all numbers) or "9998" (for even numbers).
"Postcode": the postcode of this road section.
"District": the district in which this road section lies, where "district" means a municipal district, county-level city, county, and so on.
"Road segment number": the number of the delivery road segment to which this road section belongs.
Organization information is represented as follows:
Each unit record is stored as one row with nine attributes: "Unit name", "Postcode", "District", "Actual address", "Abbreviation 1", "Abbreviation 2", "Abbreviation 3", "Road segment number", and "Remarks". The delivery information for unit addresses is shown in the following table:
  | A | B | C | D | E | F | G | H | I
1 | Unit name | Postcode | District | Actual address | Abbreviation 1 | Abbreviation 2 | Abbreviation 3 | Road segment number | Remarks
2 | Hangzhou Xihu District People's Court | 310012 | Xihu District | No. 9 Wen'er West Road | Xihu District People's Court | Xihu District Court | | |
"Unit name": the name of the unit, for example Zhejiang University, Hengli Building, Hangzhou Xihu District People's Court, or Zhongrong City Garden. The name must be written in full: "Zhejiang Provincial Higher People's Court" may not be written as "Provincial High Court", although "Provincial High Court" may be entered in the "Abbreviation 1" column. Note that no punctuation mark may appear in the name.
"Postcode": the postcode of the place where this unit is located.
"District": the district in which this unit is located, where "district" means a municipal district, county-level city, county, and so on.
"Actual address": the actual address of the unit that this record represents. See Figure 10.
"Abbreviation 1": an abbreviation of the unit that this record represents; empty if there is none. See Figure 10.
"Abbreviation 2": an abbreviation of the unit that this record represents; empty if there is none.
"Abbreviation 3": an abbreviation of the unit that this record represents; empty if there is none.
"Road segment number": the number of the delivery road segment to which the actual address of this unit belongs.
"Remarks": remark information.
The special cases are as follows:
[A] Residential quarters such as XX Li, XX Fang, XX Yuan, XX Village, XX Garden, XX Community, and the like are in most cases stored in the "Unit" worksheet. If, however, a residential quarter is split between different delivery offices, it is stored in the "Road" worksheet. For example, if Nos. 1-20 of Baima Garden belong to delivery office A and Nos. 21-40 belong to delivery office B, Baima Garden is stored in the "Road" worksheet rather than the "Unit" worksheet. The storage format is as follows:
"Road" worksheet in the .xls file of delivery office A
  | A | B | C | D | E | F | G | H
1 | Road name | Odd/even/all | Start number | End number | Postcode | District | Road segment number | Remarks
2 | Baima Garden | All | 1 | 20 | 111111 | Baiyun District | |
"Road" worksheet in the .xls file of delivery office B
  | A | B | C | D | E | F | G | H
1 | Road name | Odd/even/all | Start number | End number | Postcode | District | Road segment number | Remarks
2 | Baima Garden | All | 21 | 40 | 111111 | Baiyun District | |
[B] Because no punctuation mark may appear in the "Unit name", a name written with an internal punctuation mark, such as "East Lake · Xiangxie Waterfront", must be stored without it as "East Lake Xiangxie Waterfront". For a name of the form "Shahu Village (formerly Shahu Community)", the parentheses are removed and the annotation "formerly Shahu Community" is placed in the corresponding "Remarks" column.
[C] Dedicated postal mailboxes must be stored in the "Road" worksheet, in the format below. Other mailboxes must not be stored in the "Road" worksheet.
Storage format for dedicated postal mailboxes
  | A | B | C | D | E | F | G | H
1 | Road name | Odd/even/all | Start number | End number | Postcode | District | Road segment number | Remarks
2 | XX City Dedicated Postal Mailbox | Odd | 1521 | 1521 | 111111 | Baiyun District | |
[D] Military units designated by Arabic numerals, for example "Unit 73022", must be stored in the "Road" worksheet, in the format shown in the table below.
Military units designated by Arabic numerals
  | A | B | C | D | E | F | G | H
1 | Road name | Odd/even/all | Start number | End number | Postcode | District | Road segment number | Remarks
2 | Army unit | Even | 73022 | 73022 | 111111 | Baiyun District | |
[E] Other military units, such as "Armed Police Eighth Squadron", are stored in the "Unit" worksheet, as shown in the table below.
Military units without an Arabic numeral designation
  | A | B | C | D | E | F | G | H | I
1 | Unit name | Postcode | District | Actual address | Abbreviation 1 | Abbreviation 2 | Abbreviation 3 | Road segment number | Remarks
2 | Armed Police Eighth Squadron | 111111 | Baiyun District | No. 222 Baiyun Road | | | | |
To obtain the final sorting information for a letter image, the image is processed with image analysis, Chinese character recognition, and database queries. Figure 2 shows the basic steps. First, the letter image is analyzed to locate the recipient address region. The Chinese text in the address region is then segmented into lines. Next, two Chinese character segmentation algorithms, A and B, each divide every text line into single characters. The single characters produced by algorithm A are recognized separately by the L and W Chinese character recognition algorithms, and those produced by algorithm B are recognized by the H Chinese character recognition algorithm. Finally, the address-database-driven algorithm combines the recognition results of the L, W, and H algorithms with the address database information to obtain the sorting information required by the sorting machine. Algorithms A and B may each be any Chinese character segmentation algorithm, and L, W, and H may each be any Chinese character recognition algorithm.
The core of the technical scheme proposed by the present invention is address database driving. The basic idea is to first build an address database of standard address entries, and to obtain from each letter image, by pattern recognition, a recognition result that contains address information. Every query address entry in the address database is matched against the recognition result to find the entry with the highest matching degree. The matching degree and related information are then analyzed: if the requirements are met, the entry and its corresponding sorting information are output; otherwise no information is output. The basic flow of address database driving is shown in Figure 3.
The input to address database driving is the three Chinese character recognition results (from the H, L, and W algorithms). As Figure 2 shows, the L and W algorithms share one character segmentation algorithm while the H algorithm uses another. The recognition result strings of L and W therefore have the same length, whereas the length of the H result string may differ from them. The three results are first aligned to produce a character set sequence D, in which each position holds 1 to 3 candidate characters (the recognition results of H, L, and W). If transit sorting is required, the transit address table entries are matched against D and a decision yields the sorting information. If local sorting is required, the local address tables are matched and a decision yields the local sorting information. For mixed local and transit sorting, transit address recognition is performed first, and local address recognition follows only if the result shows that the letter is local.
The issues involved in processing the recognition results are described in turn below.
1. Alignment of recognition results and construction of the character set sequence
To simplify address-database-driven matching and make full use of the three recognition results (H, L, and W), the three results are first merged into an optimized character set sequence D. Each position of D holds 1 to 3 candidate characters (the recognition results of the H, L, and W algorithms), sorted by priority. Let the recognition result strings of the H, L, and W algorithms be Hr, Lr, and Wr, with lengths Hl, Ll, and Wl; then Ll equals Wl, while Hl is not necessarily equal to them. To ensure that no useful information is discarded, the string length Dl after alignment is the maximum of Hl, Ll, and Wl, that is,
Dl=max(Hl,Ll)
The Needleman-Wunsch algorithm is used to align the recognition results. Because Lr and Wr are already aligned with each other, only Hr needs to be aligned against Lr and Wr. During matching, a character of Hr is considered to match if it is identical to either of the two characters at the same position in Lr and Wr. The Needleman-Wunsch algorithm has been modified for this purpose; the modified algorithm is as follows:
Initial conditions:
M(i,0)=M(0,j)=0     (0≤i≤Ll, 0≤j≤Hl)
Tx(i,0)=Tx(0,j)=0     (0≤i≤Ll, 0≤j≤Hl)
Ty(i,0)=Ty(0,j)=0     (0≤i≤Ll, 0≤j≤Hl)
Recurrence condition:
M(i,j)=max{M(i-1,j-1)+σ(i,j), M(i-1,j)+W, M(i,j-1)+W}
Here M, Tx, and Ty are (Ll+1)×(Hl+1) matrices. M is the matching score matrix. Tx and Ty are the traceback matrices, which record from which adjacent cell each cell of M was obtained: Tx records the position in the x direction and Ty records the position in the y direction. σ is the scoring function: when Hr(j) equals either Lr(i) or Wr(i), the match score is Mat; otherwise the mismatch score is Mis. The penalty for inserting a gap is W. The value of each cell of M thus depends on the values of its left, upper-left, and upper neighbors. Here Mat is set to 2, Mis to -1, and W to -2. Tracing back from cell (Ll,Hl) to (0,0) along the directions stored in the traceback matrices gives the aligned strings Hd, Wd, and Ld, which together form the character set sequence D of length Dl=max(Hl,Ll).
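The alignment step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a single back-pointer table stands in for the Tx and Ty matrices, ties between moves are resolved in favour of the upper-left move, border cells are traced straight back to the origin, and Latin letters stand in for recognized Chinese characters. The names `align`, `MAT`, `MIS`, and `GAP` are hypothetical; the score values 2, -1, and -2 are the ones given in the text.

```python
MAT, MIS, GAP = 2, -1, -2   # Mat, Mis and W as given in the text


def align(hr, lr, wr, pad="-"):
    """Align Hr against the already aligned pair (Lr, Wr).

    Returns one (h, l, w) tuple per aligned position; `pad` marks a gap.
    """
    assert len(lr) == len(wr), "Lr and Wr share one segmentation"
    n, m = len(lr), len(hr)
    M = [[0] * (m + 1) for _ in range(n + 1)]         # border cells stay 0
    back = [[None] * (m + 1) for _ in range(n + 1)]   # role of Tx and Ty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A character of Hr matches if it equals Lr(i) or Wr(i).
            s = MAT if hr[j - 1] in (lr[i - 1], wr[i - 1]) else MIS
            moves = [
                (M[i - 1][j - 1] + s, (i - 1, j - 1)),   # upper-left
                (M[i - 1][j] + GAP, (i - 1, j)),         # upper: gap in Hr
                (M[i][j - 1] + GAP, (i, j - 1)),         # left: gap in Lr/Wr
            ]
            M[i][j], back[i][j] = max(moves, key=lambda mv: mv[0])

    # Trace back from (Ll, Hl) to (0, 0).
    i, j, cols = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pi, pj = back[i][j]
        elif i > 0:
            pi, pj = i - 1, j
        else:
            pi, pj = i, j - 1
        cols.append((
            hr[j - 1] if pj < j else pad,
            lr[i - 1] if pi < i else pad,
            wr[i - 1] if pi < i else pad,
        ))
        i, j = pi, pj
    return cols[::-1]


for column in align("ABCDE", "ABDE", "ABDX"):
    print(column)
```

Each returned tuple corresponds to one position of the character set sequence D; dropping the pad symbol from a tuple leaves the 1 to 3 candidate characters for that position.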
2. Transit address recognition
Transit address recognition comprises two parts: matching against the transit address database, and deciding on the transit matching results.
2.1. Matching against the transit address database
As Figure 1 shows, the transit address table contains three types of query address: provincial, prefecture-level, and county/district-level. Each query address can be parsed into two parts, called here the place name and the level name. In "Beijing City", for example, "Beijing" is the place name and "City" is the level name. The place name carries most of the information in an address, whereas the same level name is shared by many addresses; in the transit table the level names are mainly "City", "Province", "Autonomous Region", "County", "District", and the like. In general, the recognized character set sequence D must be matched against every query address and a matching degree computed for each. The Smith-Waterman algorithm is used to compute the matching score. The query sequence input to the algorithm is an address from the transit table, and the database sequence is the character set sequence D, which has up to three candidate characters per position; the Smith-Waterman algorithm has therefore been modified.
The modified Smith-Waterman algorithm is introduced first. Let an address in the transit table be the string Q of length Ql. The modified algorithm is defined by the following formulas:
Initial conditions:
M(i,0)=E(i,0)=F(i,0)=0     (0≤i≤Ql)
M(0,j)=E(0,j)=F(0,j)=0     (0≤j≤Dl)
Recurrence conditions:
E(i,j)=max{E(i,j-1)-r, M(i,j-1)-q-r, 0}     (5)
F(i,j)=max{F(i-1,j)-r, M(i-1,j)-q-r, 0}     (6)
M(i,j)=max{0, M(i-1,j-1)+σ(Q(i),D(j)), E(i,j), F(i,j)}     (7)
Here M, E, and F are (Ql+1)×(Dl+1) matrices, σ is the scoring function, q is the gap opening penalty, r is the gap extension penalty, Mat is the match score, and Mis is the mismatch score.
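Formulas (5) to (7) can be sketched directly in code. The sketch below assumes that σ returns Mat when Q(i) equals any candidate character at position j of D and Mis otherwise, and it uses illustrative values for Mat, Mis, q, and r, since the text does not state the values used for this algorithm. The function name `smith_waterman` is hypothetical, and the traceback that recovers the matched substring R is omitted for brevity; only the best score and its end position in D are returned.

```python
def smith_waterman(query, d, mat=2, mis=-1, q=1, r=1):
    """Local alignment of a query address against the character set
    sequence D, given as a list of candidate sets (one per position).

    Returns (best score, 1-based end position of the best match in D).
    The values of mat, mis, q and r are assumptions for illustration.
    """
    ql, dl = len(query), len(d)
    M = [[0] * (dl + 1) for _ in range(ql + 1)]
    E = [[0] * (dl + 1) for _ in range(ql + 1)]
    F = [[0] * (dl + 1) for _ in range(ql + 1)]
    best, best_end = 0, 0
    for i in range(1, ql + 1):
        for j in range(1, dl + 1):
            E[i][j] = max(E[i][j - 1] - r, M[i][j - 1] - q - r, 0)   # (5)
            F[i][j] = max(F[i - 1][j] - r, M[i - 1][j] - q - r, 0)   # (6)
            # A query character matches if it is any candidate at D(j).
            s = mat if query[i - 1] in d[j - 1] else mis
            M[i][j] = max(0, M[i - 1][j - 1] + s, E[i][j], F[i][j])  # (7)
            if M[i][j] > best:
                best, best_end = M[i][j], j
    return best, best_end


# Latin letters stand in for recognized characters; position 5 of D
# carries two candidates from different recognizers.
D = [{"a"}, {"b"}, {"c"}, {"h"}, {"z", "x"}, {"s"}]
print(smith_waterman("hzs", D))
```

Because a position of D is treated as a set, a query character is credited as soon as any one of the recognizers produced it, which is the point of the modification described above.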
For each query address in the transit address table, the Smith-Waterman computation yields the substring R of the character set sequence D that has the greatest matching degree with that query address, together with the position of R in D. Because the three address levels (province, prefecture, county/district) are related by subordination, the transit address table is matched according to the flow in Figure 4 so as to reduce the number of matching operations.
After matching against the transit address table, the addresses whose matching degree exceeds the set threshold form the set DA, which contains all query address entries of every level (province, prefecture, county/district) that satisfy the threshold. Query addresses in DA that are related by subordination in the transit address table are combined into a single entry. For example, if DA contains the three entries "Zhejiang Province", "Hangzhou City", and "Taizhou City", they are combined into the two entries "Zhejiang Province Hangzhou City" and "Zhejiang Province Taizhou City". The set obtained from DA by this combination is called DB.
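The combination of DA into DB can be sketched as follows, assuming the subordination relation is available as a mapping from each query address to its direct superior. The function name `combine` and the `parent` mapping are hypothetical, and the sketch only links an address to a direct superior that is itself in DA.

```python
from typing import Dict, List, Optional, Set


def combine(da: Set[str], parent: Dict[str, Optional[str]]) -> List[List[str]]:
    """Merge matched query addresses that are related by subordination.

    Each returned list is one address string, ordered from the highest
    administrative level to the lowest.
    """
    strings = []
    for addr in da:
        # An address with a matched subordinate is absorbed into that
        # subordinate's address string instead of standing alone.
        if any(parent.get(other) == addr for other in da):
            continue
        chain = [addr]
        while parent.get(chain[0]) in da:
            chain.insert(0, parent[chain[0]])
        strings.append(chain)
    return sorted(strings)


parent = {
    "Zhejiang Province": None,
    "Hangzhou City": "Zhejiang Province",
    "Taizhou City": "Zhejiang Province",
}
da = {"Zhejiang Province", "Hangzhou City", "Taizhou City"}
for address_string in combine(da, parent):
    print(" ".join(address_string))
```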
2.2. Decision on the transit matching results
Each entry in the set DB is called an address string. An address string may consist of 1 to 3 query addresses; for example, "Beijing City", "Shanghai City Pudong New Area", and "Zhejiang Province Hangzhou City Yuhang District" are address strings made up of 1, 2, and 3 query addresses respectively. DB contains one or more address strings, and to choose the correct one an evaluation model of the matching results must be built and a decision made. Each query address carries the following information: matching degree, matched position, and postcode. If the recognition result contains a postcode, the recognized postcode information can also be extracted.
The model first needs a scoring scheme for the matching degree. The steps are as follows:
[A] Split the query address into two parts, place name + level name, of lengths a1 and a2.
[B] Find the numbers b1 and b2 of characters of the place name and the level name, respectively, that are matched in the matched string R (of length Rl).
[C] Set the weights for a complete match of the place name and of the level name to c1 and c2, where c1=4 and c2=1.
[D] Compute the matching score
S1=(c1*b1/a1+c2*b2/a2)/(c1+c2)     (9)
[E] Apply a reward for a complete match of the place name, giving the score S2.
By the formula, S2 equals 1.0 when the query address matches completely.
[F] Set the weights for a complete and an incomplete match of the query address to m1 and m2 respectively, where m1=100 and m2=20; applying them to S2 gives the score S3.
Complete and incomplete matches are weighted differently because, when a query address matches completely, this recognition information is considered unambiguous. S3 is the score of each query address and its maximum is 100. By the formula, S3≥16 if the place name of the query address is matched in the character set sequence D. Because the place name generally reflects the address information, the threshold MT1=16 is chosen, at or above which the query address is considered basically trustworthy. When the level name matches completely (b2/a2=1) and the place name matching degree is b1/a1=0.5, for example when the content of D is "Zhejiang Province Hangchuan City", the score of the query address "Hangzhou City" is S3=12. In this case the query address is considered to contain partial address information, and other information such as the postcode, the association with its superior and subordinate addresses, and the uniqueness of the place name may confirm that "Hangzhou City" is correct. The threshold MT2=12 is therefore chosen, at or above which the query address is considered to carry usable address information.
[G] The query addresses whose S3 is at or above the threshold MT2 form the set DA, and DA is combined according to the subordination of the query addresses to obtain the set DB. The model next scores each address string in DB. Let the score of an address string be S4, and let the scores of the at most three levels of query address it contains (provincial, prefecture-level, county/district-level) be ss1, ss2, and ss3 (0 when a level is absent). The following criteria apply:
(1) When the score of any query address in the address string is greater than or equal to MT1,
S4=ss1+ss2+ss3     (12)
(2) When the scores of all query addresses in the address string are less than MT1 (each score that exists being greater than or equal to MT2), the value depends on whether the matched positions of the query addresses in D follow the way Chinese addresses are written, that is, in the order province, prefecture, county/district:
S4=ss1+ss2+ss3     if the matched positions follow the writing order     (13)
S4=max(ss1,ss2,ss3)     if the matched positions do not follow the writing order     (14)
These criteria give the score of each address string. For the example "Zhejiang Province Hangchuan City" above, S4 is 112.
[H] If no postcode recognition information is available, S4 is the final score of the address string. If it is available, it is added to the scoring of the address string. The recognized postcode is compared with the postcode of each level of query address in the address string to find the lowest level of query address that matches. For example, postcode "310001" matches down to the prefecture level of "Zhejiang Province Hangzhou City", whereas "320001" matches only at the provincial level. For an address string, if the postcode of one of its levels matches the recognized postcode, an additive reward is applied to its score. The base reward for a match at a given level is MW, and five different weights are applied to MW depending on the level of the postcode match and on the matching score S3 of the query address. Because a prefecture-level or county/district-level postcode match is a 4-digit match whereas a provincial match is a 2-digit match, prefecture-level and county/district-level matches carry a higher weight than provincial matches; a query address with S3≥MT1 that is also confirmed by the postcode likewise carries a higher weight. The specific rules are as follows:
When the match is at the provincial level:
When the match is at the prefecture level:
When the match is at the county/district level:
(20)
The value of MW is set according to the accuracy of postcode recognition. Here MW is set to 40, so that even a query address in DB that is matched by the postcode alone is relatively credible.
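Steps [A] to [G] can be sketched as below. The reward formula for S2 and the formula for S3 are not reproduced in this text, so the sketch makes two loudly labeled assumptions: S2 is taken to equal S1 (the reward is omitted), and S3 is taken to be m1*S2 for a complete match and m2*S2 otherwise, which reproduces the values S3=100, S3≥16, and S3=12 quoted above. The postcode reward of step [H] is left out because its formulas are likewise not available. The function names `s3_score` and `s4_score` are hypothetical.

```python
C1, C2 = 4, 1        # weights of place name and level name, step [C]
M1, M2 = 100, 20     # weights for complete / incomplete match, step [F]
MT1, MT2 = 16, 12    # thresholds chosen in the text


def s3_score(a1, b1, a2, b2):
    """Score of one query address (assumed form, see lead-in)."""
    s1 = (C1 * b1 / a1 + C2 * b2 / a2) / (C1 + C2)   # formula (9)
    s2 = s1   # assumption: the complete-place-name reward is omitted
    complete = b1 == a1 and b2 == a2
    return round((M1 if complete else M2) * s2, 6)


def s4_score(scores, positions):
    """Score of an address string, criteria (12) to (14).

    scores:    S3 per level in the order province, prefecture, county
               (0 when the level is absent).
    positions: matched start position in D per level (None when absent).
    """
    if any(s >= MT1 for s in scores):
        return sum(scores)                                   # (12)
    present = [p for s, p in zip(scores, positions) if s > 0]
    in_order = all(a < b for a, b in zip(present, present[1:]))
    return sum(scores) if in_order else max(scores)          # (13), (14)


# "Zhejiang Province Hangchuan City": the province matches completely,
# the city matches one of two place-name characters plus the level name.
province = s3_score(2, 2, 1, 1)
city = s3_score(2, 1, 1, 1)
print(province, city, s4_score([province, city, 0], [0, 3, None]))
```

With these assumptions the example string scores 100 + 12 = 112, in agreement with the value stated for "Zhejiang Province Hangchuan City".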
This completes the matching result evaluation model; through it, every address string in DB receives an evaluation score. The next step is to decide which address string in DB accurately represents the recipient address of the letter. The simplest decision method is chosen here: the address strings are sorted by their evaluation score, the scores of their query addresses, their matched positions in the character set sequence D, and so on, and the 1 to 2 top-ranked address strings are analyzed to obtain the final result. The detailed flow is shown in Figure 4.
Note: MT3 is the threshold on the evaluation score in the final decision, and takes one of two values. When the postcode recognition result is not incorporated into the evaluation model, MT3=MT1+1; when it is incorporated, MT3=MW+MT1+1.
The decision process and results of the evaluation and decision model above are illustrated with several examples:
Example 1: "Shanghai City Fuzhou Road". "Shanghai City" scores 100, which is greater than the score of 17 for "Fuzhou", so the result is "Shanghai City".
Example 2: "442000 Xiamen Electric Power Company". "Xiamen" scores 16; because a postcode is present and does not match, the result is rejected.
Example 3: "Zhejiang Province Hangchuan City". "Zhejiang Province Hangzhou City" scores 112, so the result is "Zhejiang Province Hangzhou City".
Example 4: "Shanghai City Quanshan District". "Shanghai City Jinshan District" scores 112 and "Shanghai City Baoshan District" scores 112, so the result is "Shanghai City".
3. Local address recognition
Local address recognition matches the recognition result character set sequence D against the local address tables to obtain the delivery sub-office or delivery road segment of the matching address. The local address tables are stored as two tables, road addresses and organizations, because a recipient address may be expressed in either of these two forms. Local address recognition likewise comprises two parts, matching and decision.
4. The basic flow of local address recognition is shown in Figure 5. It comprises matching against the road address table and matching against the organization table. Matching against each table is further divided into fuzzy matching and exact matching, and the multiple candidate matching results from the two tables are judged together, using the consistency of their delivery information and the recognized postcode, to obtain the sorting information. Each step is described below.
4.1. Fuzzy matching
Matching is done in two steps, fuzzy matching and exact matching. The main reason is that exact matching is very time-consuming and the road address table and organization table contain very many entries; to improve speed, a fast fuzzy matching algorithm is used to provide exact matching with a comparatively small candidate set. Before fuzzy matching, search keywords must be extracted from the road address table and the organization table. A search keyword is a string of length 3 extracted from a road name or unit name, chosen so that the keywords extracted from a table are as dissimilar to one another as possible. Fuzzy matching matches the search keywords against the Chinese recognition result using a fast direct-search comparison, and selects the entries whose matching degree exceeds a given threshold as the candidate set for exact matching. Fuzzy matching is applied to the road address table and the organization table separately, giving two candidate sets called the road fuzzy match set and the unit fuzzy match set.
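A fuzzy matching pass of this kind can be sketched as a sliding-window comparison of each keyword against the character set sequence D. The text does not define the fuzzy matching degree or the threshold, so the sketch assumes the degree is the best fraction of keyword characters found at consecutive positions of D, and uses 0.5 as an illustrative threshold. The function name `fuzzy_candidates` is hypothetical.

```python
def fuzzy_candidates(keywords, d, threshold=0.5):
    """Select table entries whose keyword roughly occurs in D.

    keywords:  mapping from entry name to its 3-character keyword.
    d:         character set sequence, one set of candidates per position.
    threshold: assumed cut-off; entries must score strictly above it.
    """
    selected = []
    for entry, kw in keywords.items():
        best = 0.0
        # Slide the keyword over D and count position-wise hits.
        for start in range(len(d) - len(kw) + 1):
            hits = sum(ch in d[start + k] for k, ch in enumerate(kw))
            best = max(best, hits / len(kw))
        if best > threshold:
            selected.append(entry)
    return selected


# Latin letters stand in for recognized characters.
D = [{"a"}, {"b"}, {"c", "x"}, {"d"}]
table = {"road1": "bcd", "road2": "bxd", "road3": "qqq", "road4": "bqq"}
print(fuzzy_candidates(table, D))
```

Only the entries that survive this cheap pass are handed to the slower exact matching step.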
4.2. Exact matching
Because fuzzy matching uses keywords of length 3 in place of the actual road names or unit names, it only selects two fuzzy candidate sets and does not measure the matching degree of the real road or unit names. Exact matching matches each entry in the fuzzy candidate sets against the character set sequence D once more, using the modified Smith-Waterman algorithm introduced above. The local address matching degree Sl is computed as follows:
Sl=Match/max(Lin,Rl)     (21)
where Match is the number of matched characters, Lin is the string length of the road name or unit name, and Rl is the length of the matched string R output by the Smith-Waterman algorithm. Because road names and unit names are diverse and often similar to one another, only completely matching entries (Sl=1.0) are kept as results after exact matching. After the two tables have been matched, the result may therefore contain zero to several road names and zero to several unit names. Several road names arise because the character set sequence D itself may contain more than one road name: "the intersection of Renmin Road and Zhongshan Road" contains both "Renmin Road" and "Zhongshan Road", and "Zhongshan Xilu" (Zhongshan West Road) contains both "Zhongshan Xilu" and "Shanxi Lu" (Shanxi Road) as character substrings. The same situations arise for unit names. Ambiguity in the matching results also arises when the organization table contains several units with the same name located in different places in the same city, or when several roads share the same name.
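Formula (21) and the "complete matches only" rule can be sketched as follows. The candidate tuples are assumed to carry the matched character count and the length of R as produced by the exact matching step; the function names and the sample values are hypothetical.

```python
def match_degree(match, lin, rl):
    """Local address matching degree Sl, formula (21)."""
    return match / max(lin, rl)


def exact_results(candidates):
    """Keep only the entries that match completely (Sl = 1.0).

    candidates: iterable of (name, Match, Rl) triples, where Match is the
    number of matched characters and Rl the length of the matched string.
    """
    return [
        name
        for name, match, rl in candidates
        if match_degree(match, len(name), rl) == 1.0
    ]


# Illustrative road names in Chinese characters: a 3-character name
# matched completely, a 4-character name with only 3 matched characters,
# and a 3-character name with 2 matched characters.
candidates = [("中山路", 3, 3), ("中山西路", 3, 3), ("人民路", 2, 3)]
print(exact_results(candidates))
```

Dividing by the larger of Lin and Rl penalizes both a name that is only partly covered and a matched string that is padded with gaps or mismatches.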
For a matched road, different house numbers on the same road may belong to different delivery sub-offices or delivery road segments, so the house number must be extracted. The house number is taken to be the digit string immediately following the road name. If a number is extracted the result is road + number; otherwise it is the road alone. Looking up a road + number result in the road address table may yield uniquely determined delivery information, or several different items of delivery information (when several roads share the name). Looking up a road name alone in the road address table may yield unique delivery information (the road belongs to only one delivery sub-office or road segment), several items of delivery information (several roads of the same name), or undetermined delivery information (the road belongs to several delivery sub-offices or road segments). The outcomes of a road lookup are thus summarized as three kinds, called here the determined road matching result, the repeated road matching result, and the undetermined road matching result. A unit lookup has only two outcomes: the determined unit matching result and the undetermined unit matching result.
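The number extraction and the road table lookup can be sketched together. The record layout follows the road address table described earlier (parity flag, start number, end number); the class name `RoadSegment`, the field `segment`, the office labels, and the Chinese rendering of the Baima Garden example are assumptions for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class RoadSegment:
    road_name: str
    parity: str     # "odd", "even" or "all"
    start: int
    end: int
    segment: str    # delivery sub-office or road segment (hypothetical label)


def extract_number(text, road_name):
    """Return the digit string immediately following road_name, or None."""
    m = re.search(re.escape(road_name) + r"(\d+)", text)
    return int(m.group(1)) if m else None


def lookup(table, road_name, number):
    """Return the set of delivery labels compatible with road (+ number)."""
    hits = set()
    for seg in table:
        if seg.road_name != road_name:
            continue
        if number is None:
            # Road alone: every section of the road is a possible answer.
            hits.add(seg.segment)
        elif seg.start <= number <= seg.end and (
            seg.parity == "all" or (seg.parity == "odd") == (number % 2 == 1)
        ):
            hits.add(seg.segment)
    return hits


# Baima Garden split between two delivery offices, as in special case [A].
table = [
    RoadSegment("白马花园", "all", 1, 20, "Office A"),
    RoadSegment("白马花园", "all", 21, 40, "Office B"),
]
number = extract_number("白马花园15号", "白马花园")
print(number, lookup(table, "白马花园", number))
print(lookup(table, "白马花园", None))
```

A lookup that returns exactly one label corresponds to a determined result, while the road-only lookup in the last line returns two labels and is therefore undetermined.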
4.3. Decision on the exact matching results
Because exact matching can produce the various outcomes analyzed above, a combined decision using the postcode, the district, the matched position, and other information is needed to obtain the correct sorting information.
As shown in Figure 6, when the road table or the unit table yields several results after exact matching, the results are cross-checked using postcode matching, district matching, and comparison of the delivery sub-offices or road segments of the matching results; inaccurate or redundant information is eliminated in order to obtain a unique delivery sub-office or road segment. After the checks of Figure 6, the road address table and the organization table have each yielded one or more delivery sub-offices or road segments. If only one of the two (road name matching or unit name matching) has produced a delivery sub-office or road segment result, that sorting information is output if the delivery sub-office or road segment is unique; otherwise no information is output. If both road name matching and unit name matching have produced results, the final sorting information is obtained by checking them against each other, as shown in Figure 7. The two sets of results are compared by delivery sub-office or road segment. If they share exactly one delivery sub-office or road segment, it is output as the sorting information. Otherwise, if the delivery information matched by the road address is itself unique, that information is used as the sorting information. In all other cases the information is considered undetermined and no delivery sub-office or road segment can be determined.
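The final cross-check between the road results and the unit results can be sketched as a small decision function. It assumes that the checks of Figure 6 have already reduced each side to a set of delivery sub-office or road segment labels; the function name `decide` and the labels are hypothetical.

```python
def decide(road, unit):
    """Combine road and unit results into one sorting decision.

    road, unit: sets of delivery sub-office / road segment labels.
    Returns the chosen label, or None when the result is undetermined.
    """
    if not road or not unit:
        # Only one side produced results: accept it if it is unique.
        only = road or unit
        return next(iter(only)) if len(only) == 1 else None
    common = road & unit
    if len(common) == 1:
        # Exactly one label is confirmed by both tables.
        return next(iter(common))
    if len(road) == 1:
        # Fall back on the road result when it is unique on its own.
        return next(iter(road))
    return None


print(decide({"S1", "S2"}, {"S2", "S3"}))
print(decide({"S1"}, {"S2"}))
print(decide({"S1", "S2"}, {"S3"}))
```

The asymmetry in the fallback reflects the rule above that a unique road result is trusted when the two tables disagree, whereas a unique unit result alone is not.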
The above describes the address-database-driven method for automatic letter recognition and sorting, which is applied in the recognition module of a letter sorting machine. Practice shows that, provided the address database is reasonably accurate and complete and the recognition rate is basically adequate, the method effectively analyzes and corrects the recognition results and obtains accurate results. The key to applying the method successfully is the accuracy of the address database, especially the completeness of the road address table and the selective entry of the organization table in the local address database. Better results are also obtained when the letter image contains both a postcode and a full address.