Summary of the invention
To overcome the shortcomings of existing Chinese geocoding methods, namely a single address model, an insufficient matching rate, and slow processing, the invention provides a fuzzy-matching-based method for determining Chinese geocodes that has a reasonable address model, a higher matching rate, and good speed.
The technical solution adopted by the invention to solve this technical problem is as follows:
A method for determining Chinese geocodes based on fuzzy matching, comprising the following steps:
A1. Read in descriptive Chinese address information and, using the administrative-division levels as breakpoints, segment the original address with the forward maximum search method to obtain the original address element array;
A2. Standardize the original address elements using the address dictionary;
A3. Read the standard address tree and match the original address element array with a branch-and-bound algorithm: an address database is built in address-tree storage format; following the hierarchical division of China's administrative regions, a tree-shaped address store is built in which the highest-level administrative unit is the root node of the address tree and its subordinate administrative regions are stored as child nodes. Based on the address elements and house number obtained by segmenting the descriptive Chinese address information, the matching process first reads the standard address tree R, takes the keyword of the highest administrative level among the segmented candidate address elements, and matches it against the address nodes of the corresponding administrative level in R. After a successful match, the unrelated branch subtrees are discarded and the related branch subtree is kept for matching at the next administrative level;
Meanwhile, fuzzy rules are used to control the matching operation. After the keywords have been obtained from the segmented original address, the method further comprises:
The matching operation is optimized with fuzzy matching rules, defined as follows: let the field to be matched be the string address, of length h, and let the standard field be the string std_address, of length H. The set of std_address strings satisfying address ∩ std_address ≠ Φ is defined as the set that satisfies the matching condition, where address ∩ std_address ≠ Φ means that the string address and the standard field string std_address have a non-empty intersection; finally, the set elements with a high membership degree are kept. The matching rules are defined as follows:
1. If i characters of the standard string std_address are identical to characters in the matched string address, the membership degree is i/H;
2. If the standard string std_address contains the matched string address, the membership degree is 1;
After the membership degree has been obtained, let μ be the matching membership degree; it is converted into a quantified score according to the mapping rule f: sc → μ, with mapping function f(μ) = 10 × μ, and sc is taken as the evaluation score of this candidate record;
The candidate with the highest evaluation score is taken as the closest matching result, which yields a more accurate matching address.
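The two membership rules and the mapping f(μ) = 10 × μ can be sketched in Python as below. This is a minimal sketch, not the patented implementation: the function names are hypothetical, "identical characters" is assumed to mean a multiset intersection of characters, and the Chinese strings (东湖 for East Lake, 西湖 for West Lake) are an assumed rendering chosen only to make the character-level comparison visible.

```python
from collections import Counter


def membership(address: str, std_address: str) -> float:
    """Membership degree of std_address with respect to the entered address."""
    if not address or not std_address:
        return 0.0
    # Rule 2: the standard string contains the entered string.
    if address in std_address:
        return 1.0
    # Rule 1: i shared characters out of the H characters of std_address.
    # Assumption: shared characters are counted as a multiset intersection.
    shared = sum((Counter(address) & Counter(std_address)).values())
    return shared / len(std_address)


def score(mu: float) -> float:
    """Mapping function f(mu) = 10 * mu."""
    return 10 * mu


def best_matches(address, candidates):
    """Return (score, std_address) pairs with a non-empty overlap, best first."""
    scored = [(score(membership(address, c)), c) for c in candidates]
    return sorted([p for p in scored if p[0] > 0], key=lambda p: -p[0])
```

Candidates that share no character with the entered string are dropped, which corresponds to the condition address ∩ std_address ≠ Φ.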
As a preferred scheme, the method for determining Chinese geocodes further comprises:
A4. If the matching address contains a house number, spatial positioning is performed. Urban house numbers are assumed to be distributed on the two sides of the road by the odd-even rule: either odd numbers on the left and even numbers on the right, or odd numbers on the right and even numbers on the left. The house numbers at the inflection points of the road are recorded together with their geographic coordinates. After the house number in the original address has been obtained, the two inflection points between which it lies are determined; supposing the matching house number lies between inflection points A and B, A and B are taken as reference points and least-squares linear interpolation is performed to obtain the specific geographic coordinates of that house number on the road, which is finally located on the map.
Further, in step A3, the candidate address array obtained by standardizing the original address is defined as address[i], 0<i<N. The matching score between a standard address node and the candidate element of the corresponding level is denoted sc_i, where i is the level to which the node belongs and N is the depth of the initial address tree. The matching judgment rules are as follows:
Rule 1: does the address-tree node match the candidate element exactly? Y → exact match; N → fuzzy match;
Rule 2: is a feasible solution found after exact matching? Y → the matching algorithm moves down one level; N → return to the upper-level node and search for an approximate solution;
Rule 3: is there a missing (default) item? Y → keep the upper-level branch subtree; N → keep the current-level branch subtree;
Rule 4: if there is a missing (default) item, sc_i = 0, where i is the level at which the missing item lies;
Rule 5: the final score of a candidate record is the sum of its node matching scores over all levels:
sc = ∑ sc_i.
Further still, in step A3, an auxiliary place-name database is provided: a separate database is built for important, frequently used geographic locations that also have a second characteristic identity.
In step A1, for the original address obtained, the first character of the original address is taken as the starting point and the address database is searched for a corresponding standard address name. If one exists, the address information is read and kept, and the matched characters are removed from the original address string; otherwise the next character is read and joined to the preceding characters to form a string, and the search for a corresponding standard address name in the address database continues. Reading proceeds in this way until the address elements of all administrative levels have been determined.
In step A2, if an item is missing from the segmented candidate address array, its superior address is obtained from the address database according to the next-lower-level address element and written into the candidate address element array.
In step A2, an address abbreviation and alias information database is designed: a dedicated database that stores all current standard address information together with its aliases and abbreviations.
In step A2, wrongly written characters in the segmented address elements are corrected. Suppose the entered address information contains a wrongly written character, so that a segmented address element has no exactly corresponding standard address name in the address dictionary; the standard address name closest to the entered address information is then retrieved and returned, and it replaces the entered address information.
The technical concept of the invention is as follows: the originally entered address information is obtained first, and the text-entered original address is then segmented with a word segmentation algorithm to obtain the keywords that describe the spatial position corresponding to the original address. The standard address data of a city are stored as a K-ary tree, where K is determined by the actual number of administrative units at each level. The keywords obtained are matched in the standard address tree; during matching, a branch-and-bound algorithm optimizes the matching algorithm, while fuzzy rules control the matching operation precisely and the matching results are scored and screened, yielding at least one address record that agrees fully or approximately with the original address. Applying the branch-and-bound matching algorithm to the tree-shaped address storage model reduces the size of the address tree to be searched, optimizes the algorithmic complexity of the address matching process, and improves the efficiency and accuracy of address matching.
The main beneficial effects of the invention are that it optimizes the algorithmic complexity of the geocoding process and improves the efficiency and accuracy of geocoding.
Embodiment
The invention is further described below with reference to the accompanying drawings.
With reference to Fig. 1 to Fig. 8, a Chinese geocoding method based on fuzzy matching, as shown in Fig. 1, comprises the following steps:
A1. Read in descriptive Chinese address information and, using the administrative-division levels as breakpoints, segment the original address with the forward maximum search method to obtain the original address element array.
A2. Standardize the original address elements using the address dictionary, obtaining an address element array that has undergone standardization operations such as abbreviation or alias correction, correction of wrongly written characters, and filling of missing items.
A3. Read the standard address tree and match the original address element array with a branch-and-bound algorithm, while using fuzzy rules to control the matching operation, to obtain a more accurate matching address.
A4. For a house number contained in the matching address, perform spatial positioning with the inflection-point reference interpolation algorithm.
In step A1 of the method, a standard entry pattern is defined for Chinese address information with reference to China's administrative division standard:
Administrative address pattern: province (municipality directly under the central government) → city → district (county, county-level city); local address pattern: street (town) → village (road) name → house number. An example of standard address information is: Zhejiang Province, Hangzhou City, Xihu District, Liuxia Town, Liuhe Road, No. 288.
In step A1 of the method, for the original address obtained, the first character of the original address is taken as the starting point and the address database is searched for a corresponding standard address name. If one exists, the address information is read and kept, and the matched characters are removed from the original address string; otherwise the next character is read and joined to the preceding characters to form a string, and the search for a corresponding standard address name in the address database continues. Reading proceeds in this way until the address elements of all administrative levels have been determined.
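The segmentation of step A1 can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the common longest-first formulation of forward maximum matching rather than the character-by-character growth described above, the breakpoint characters in LEVEL_SUFFIXES and the small dictionary are hypothetical, and the Chinese address is an assumed rendering of the embodiment's example.

```python
# Assumed breakpoint characters for administrative levels (province, city,
# district, county, town, street, village, road, house number).
LEVEL_SUFFIXES = "省市区县镇街村路号"


def segment(address, dictionary):
    """Split an address string into elements, scanning from left to right."""
    max_len = max(map(len, dictionary), default=1)
    elements, pos = [], 0
    while pos < len(address):
        # Forward maximum matching: try the longest dictionary entry first.
        for size in range(min(max_len, len(address) - pos), 0, -1):
            if address[pos:pos + size] in dictionary:
                break
        else:
            # No dictionary entry starts here: cut at the next level suffix,
            # or take the rest of the string if no suffix follows.
            size = next((k + 1 for k, ch in enumerate(address[pos:])
                         if ch in LEVEL_SUFFIXES), len(address) - pos)
        elements.append(address[pos:pos + size])
        pos += size
    return elements
```

Elements that are not in the dictionary, such as a misspelled road name or the house number, still come out as separate elements, so that step A2 can standardize them afterwards.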
In step A2 of the method, if an item is missing from the segmented candidate address array, its superior address is obtained from the address database according to the next-lower-level address element and written into the candidate address element array.
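Filling a missing item from its next-lower-level element might look like the sketch below. The PARENT_OF table stands in for the address database and is hypothetical, as is the convention of marking a missing item with None.

```python
# Hypothetical child -> parent lookup standing in for the address database.
PARENT_OF = {
    "Hangzhou": "Zhejiang",
    "Xihu": "Hangzhou",
    "Liuxia": "Xihu",
}


def fill_missing(levels):
    """Fill None entries in a list ordered from the highest to the lowest level."""
    filled = list(levels)
    # Walk upward from the lowest level so that consecutive gaps are filled too.
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None and filled[i + 1] is not None:
            filled[i] = PARENT_OF.get(filled[i + 1])
    return filled
```

An item whose lower-level neighbour is unknown to the lookup stays None, so the matching stage can still treat it as missing.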
In step A2 of the method, an address abbreviation and alias information database is designed: a dedicated database that stores all current standard address information together with its aliases and abbreviations. If a segmented candidate address element is an alias or an abbreviation, it is standardized to the standard name; for example, the abbreviation for Shandong is standardized as "Shandong" and the abbreviation for Shanghai is standardized as "Shanghai".
In step A2 of the method, wrongly written characters in the segmented address elements are corrected. Suppose the entered address information contains a wrongly written character, so that a segmented address element has no exactly corresponding standard address name in the address dictionary; the standard address name closest to the entered address information is then retrieved and returned, and it replaces the entered address information. For example, if a miswritten form of "Liuhe Road" is entered that does not exist in the address dictionary, and the dictionary contains only the correctly written "Liuhe Road", the dictionary entry replaces the entered form.
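Alias replacement and correction of wrongly written characters could be combined as in the sketch below. It is only an illustration: the source does not say how the "closest" name is found, so Python's standard difflib.get_close_matches is swapped in here, and the alias table, the name list, and the Chinese strings (鲁 and 沪 as abbreviations, 留合路 as a miswritten 留和路) are assumptions.

```python
import difflib

# Hypothetical alias table: abbreviation -> standard name.
ALIASES = {"鲁": "山东", "沪": "上海"}
# Hypothetical address dictionary of standard names.
STANDARD_NAMES = ["山东", "上海", "留和路", "留下镇"]


def standardize(element, cutoff=0.6):
    """Return the standard name for one element, or None if nothing is close."""
    element = ALIASES.get(element, element)
    if element in STANDARD_NAMES:
        return element
    # Assumption: similarity is difflib's ratio, not the patent's own measure.
    close = difflib.get_close_matches(element, STANDARD_NAMES, n=1, cutoff=cutoff)
    return close[0] if close else None
```

The cutoff keeps unrelated names from being substituted; an element with no sufficiently close standard name is returned as None rather than guessed.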
In step A3 of the method, the address database is read and stored in the form of an address tree: the highest-level administrative unit is the root node of the address tree and its subordinate administrative regions are stored as child nodes, as shown in Fig. 2.
Step A3 of the method further comprises the following: with the address information stored as a tree, a branch-and-bound algorithm optimizes the matching process. The keyword of the highest administrative level among the candidate address elements is first matched against the address information of the corresponding level in the address tree; if the match succeeds, the matched node and its branch subtree are kept, and the other unrelated nodes at the same level and their branch subtrees are discarded. The candidate address array obtained by standardizing the original address is defined as address[i], 0<i<N. The matching score between a standard address node and the candidate element of the corresponding level is denoted sc_i, where i is the level to which the node belongs and N is the depth of the initial address tree. The matching judgment rules are as follows:
Rule 1: does the address-tree node match the candidate element exactly? Y → exact match; N → fuzzy match;
Rule 2: is a feasible solution found after exact matching? Y → the matching algorithm moves down one level; N → return to the upper-level node and search for an approximate solution;
Rule 3: is there a missing (default) item? Y → keep the upper-level branch subtree; N → keep the current-level branch subtree;
Rule 4: if there is a missing (default) item, sc_i = 0, where i is the level at which the missing item lies;
Rule 5: the final score of a candidate record is the sum of its node matching scores over all levels:
sc = ∑ sc_i.
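Rules 1 to 5 can be sketched as a level-by-level tree search that prunes non-matching sibling subtrees, which is the sense in which the text uses "branch and bound". This is a minimal sketch, not the patented implementation: Node, membership, and match are hypothetical names, exact matches are scored 10 in line with f(μ) = 10 × μ, a missing item is marked with None, and the Chinese names are assumed renderings of the embodiment's data.

```python
from collections import Counter


class Node:
    """One node of the standard address tree."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def membership(address, std):
    """Fuzzy membership: 1 if std contains address, else shared chars / len(std)."""
    if address in std:
        return 1.0
    return sum((Counter(address) & Counter(std)).values()) / len(std)


def match(children, elements):
    """Return (total score, matched path) records for the remaining elements."""
    if not elements:
        return [(0.0, [])]
    head, rest = elements[0], elements[1:]
    if head is None:
        # Rules 3 and 4: a missing item scores 0 and keeps every branch.
        attempts = [[(c, 0.0) for c in children]]
    else:
        exact = [(c, 10.0) for c in children if c.name == head]
        fuzzy = [(c, 10 * membership(head, c.name))
                 for c in children if c.name != head]
        # Rule 1: try exact matches first, then fuzzy matches.
        attempts = [exact, [(c, s) for c, s in fuzzy if s > 0]]
    for kept in attempts:
        # Only kept branches are explored; all other siblings are pruned.
        results = [(s + rs, [c.name] + rp)           # Rule 5: sum of scores
                   for c, s in kept
                   for rs, rp in match(c.children, rest)]
        if results:
            return results
        # Rule 2: no feasible solution below, so fall back to the next attempt.
    return []
```

With a tree in which an "East Lake" branch exists but contains no matching town, the search backs out of that branch and continues through the fuzzy match "West Lake", and the record with the highest summed score can then be selected with max.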
Step A3 of the method further comprises using fuzzy rules to control the matching operation: if no node at the same level in the address tree matches completely, the fuzzy rules are enabled and an approximate matching result is obtained. For example, if the county-level keyword entered is "Donghu" (East Lake) and the county-level nodes in the address tree contain only "Xihu" (West Lake), the node "Xihu" and its branch subtree are kept as the matching result, and the other nodes at the same level and their branch subtrees are discarded.
Step A3 of the method further comprises quantified scoring of the matching results. Complete matches and approximate matches are given different scores; the result with the highest score is returned as the closest match, and results with lower scores are returned as relatively close matches. The quantization rules are as follows:
Let the field to be matched be the string address, of length h, and let the standard field be the string std_address, of length H. The set of std_address strings satisfying address ∩ std_address ≠ Φ is defined as the set that satisfies the matching condition, where address ∩ std_address ≠ Φ means that the string address and the standard field string std_address have a non-empty intersection; finally, the set elements with a high membership degree are kept. The matching rules are defined as follows (Fig. 3):
1. If i characters of the standard string std_address are identical to characters in the matched string address, the membership degree is i/H;
2. If the standard string std_address contains the matched string address, the membership degree is 1.
After the membership degree has been obtained, let μ be the matching membership degree; it is converted into a quantified score according to the mapping rule f: sc → μ, with mapping function f(μ) = 10 × μ, and sc is taken as the evaluation score of this candidate record.
Step A3 of the method further comprises providing an auxiliary place-name database: a separate database is built for important, frequently used geographic locations that also have a second characteristic identity. For example, the second characteristic identity of "Zhejiang Province, Hangzhou City, Xihu District, Liuxia Town, Liuhe Road, No. 288" is "Pingfeng Campus of Zhejiang University of Technology"; if the original address information entered is "Pingfeng Campus of Zhejiang University of Technology", the geographic position of "Zhejiang Province, Hangzhou City, Xihu District, Liuxia Town, Liuhe Road, No. 288" is located directly.
In step A4 of the method, after the final matching result has been obtained, spatial interpolation positioning is performed according to the house number information. If there is no house number information, the position is set to the geometric center of the smallest administrative unit in the original address information; for example, if the original address information is accurate to the street, the position is set to the geometric center of that street. If there is house number information, urban house numbers are assumed to be distributed on the two sides of the road by the odd-even rule: either odd numbers on the left and even numbers on the right, or odd numbers on the right and even numbers on the left (Fig. 4). The house numbers at the inflection points of the road are recorded together with their geographic coordinates. After the house number in the original address has been obtained, the two inflection points between which it lies are determined; supposing the matching house number lies between inflection points A and B, A and B are taken as reference points and least-squares linear interpolation is performed to obtain the specific geographic coordinates of that house number on the road, and the resulting geographic coordinates are finally located on the map.
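The interpolation between two inflection points can be sketched as below. With only two reference points, a least-squares line passes through both of them, so the sketch uses plain linear interpolation; it also assumes that house numbers increase strictly along the road and are evenly spaced between neighbouring inflection points. The function name and the coordinate values in any call are hypothetical.

```python
def locate(number, inflections):
    """Interpolate the coordinates of a house number along a road.

    inflections: list of (house_number, x, y) tuples, sorted by house number,
    with strictly increasing house numbers (an assumption of this sketch).
    """
    for (na, xa, ya), (nb, xb, yb) in zip(inflections, inflections[1:]):
        if na <= number <= nb:
            # Fraction of the way from inflection point A to inflection point B.
            t = (number - na) / (nb - na)
            return (xa + t * (xb - xa), ya + t * (yb - ya))
    raise ValueError("house number lies outside the recorded inflection points")
```

A number outside the recorded range raises an error instead of being extrapolated, since the text only describes positioning between two known inflection points.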
The average time complexity of the branch-and-bound matching algorithm based on the tree-shaped address storage model in the invention is log_K N, where N denotes the number of leaf nodes of the K-ary address tree.
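As a rough illustration of this figure, assume a full K-ary tree in which exactly one branch survives at each level, so that the number of levels visited equals the depth of the tree. The values K = 10 and N = 10^5 are hypothetical and chosen only for the arithmetic.

```latex
\text{levels visited} = \log_K N, \qquad
K = 10,\; N = 10^{5} \;\Longrightarrow\; \log_{10} 10^{5} = 5
```

Under these assumptions a record among 100,000 leaf addresses is reached after five level-by-level matches, instead of a comparison against every leaf.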
In this embodiment, the original entered address is "Zhejiang Province, Hangzhou City, Donghu District, Liuxia Town, Liuhe Road, No. 288"; after segmentation, the candidate address array address[] is obtained (Table 1).
Table 1. Candidate address array
Level | Province | City | District | Town | Road | House number
Value | Zhejiang | Hangzhou | Donghu (East Lake) | Liuxia | Liuhe | 288
To illustrate the idea of the algorithm more clearly, some interfering data are added to the matching address tree. With the branch-and-bound algorithm introduced, the matching process is as follows:
Step 1: load the initial address tree and test address[1] = "Zhejiang". After the exact match succeeds, extract the branch subtree rooted at "Zhejiang" and delete the invalid branch subtrees, where sc denotes the total score after each node has been matched against the candidate address segments, as shown in Fig. 5.
Step 2: test address[2] = "Hangzhou". After the exact match succeeds, extract the branch subtree rooted at "Hangzhou". Test address[3] = "Donghu" (East Lake). After the exact match succeeds, extract the branch subtree rooted at "Donghu", as shown in Fig. 6.
Step 3: test address[4] = "Liuxia". The current branch subtree has no feasible solution, so return to the parent node of the current root node "Donghu", enable the fuzzy matching mode to obtain the branch subtrees that satisfy the partial matching condition, and match the keyword "Liuxia" again, as shown in Fig. 7.
Step 4: test address[5] = "Liuhe". The child nodes of the current branch subtree's root node cannot be matched exactly, so the fuzzy matching mode is started and the partially matching branch subtrees are obtained. Test address[6] = "288" against all partially matching branch subtrees, as shown in Fig. 8.
After all segments in the candidate address array have been matched, the final evaluation scores of the address records are sorted, and the address record with the highest score is returned as the final matching result, shown as the solid-line part of Fig. 9.
Step 5: obtain the house number information and read the geographic information of the street in the final matching address, including the inflection-point house number data, as shown in Fig. 9. The original house number "No. 288" is determined to lie between inflection point A ("No. 268") and inflection point B ("No. 296"). With inflection points A and B as reference points, least-squares interpolation is performed to obtain the spatial position of the original house number on the street; see the position marked "*" in Fig. 10.
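For the house numbers in this step, the interpolation fraction can be worked out directly. The sketch assumes that house numbers are evenly spaced along the segment between A and B, which the text does not state; the coordinates of A and B are not given in the source, so the position P is left in symbolic form.

```latex
t = \frac{288 - 268}{296 - 268} = \frac{20}{28} = \frac{5}{7} \approx 0.714,
\qquad P = A + t\,(B - A)
```

Under that assumption, No. 288 lies about 71% of the way from inflection point A to inflection point B.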
The above describes the good optimization effect shown by one embodiment of the invention. Obviously, the invention is not limited to the above embodiment; many variations can be made and implemented without departing from the essential spirit of the invention and without going beyond its substantive content.