CN102955832A - Correspondence address identifying and standardizing system - Google Patents


Info

Publication number
CN102955832A
CN102955832A CN201110255616XA CN201110255616A CN102955832A CN 102955832 A CN102955832 A CN 102955832A CN 201110255616X A CN201110255616X A CN 201110255616XA CN 201110255616 A CN201110255616 A CN 201110255616A CN 102955832 A CN102955832 A CN 102955832A
Authority
CN
China
Prior art keywords
address
module
metadata
standardized
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201110255616XA
Other languages
Chinese (zh)
Other versions
CN102955832B (en
Inventor
王国印
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaao Data Technology Co Ltd filed Critical Shenzhen Huaao Data Technology Co Ltd
Priority to CN201110255616.XA priority Critical patent/CN102955832B/en
Publication of CN102955832A publication Critical patent/CN102955832A/en
Application granted granted Critical
Publication of CN102955832B publication Critical patent/CN102955832B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a correspondence address identifying and standardizing system which is used for solving the problem of correspondence address identification and standardization. The correspondence address identifying and standardizing system comprises a correspondence address input module, an address metadata dictionary module, an address identification and standardization module and a correspondence address output module, wherein the address identification and standardization module is connected with the correspondence address input module, the address metadata dictionary module and the correspondence address output module and is used for receiving a correspondence address input by a user and transmitted by the correspondence address input module, identifying and standardizing the correspondence address input by the user, and generating a standardized correspondence address; and the address metadata dictionary module is used for storing address metadata, and receiving and responding to a control command of the address identification and standardization module. By adopting the correspondence address identifying and standardizing system, the precision of correspondence address processing is improved, and the throughput rate and recall rate of correspondence address processing are relatively high.

Description

An address identification and standardization system
1. Technical field
The present invention relates to address processing technology, and in particular to an address identification and standardization system.
2. Background
Address technology is widely used and closely tied to daily life. In the postal field, mail delivery requires addresses to be identified and processed; a geocoding system must first standardize addresses; and a banking system must store, identify, and update address data. Address technology is also used in networking, e-commerce, electronic maps, and similar fields.
Non-standard or irregular addresses cause considerable inconvenience. For example, because user addresses are not standardized, a mail system must invest substantial manpower and resources in identifying the correct, standard address; otherwise mail is misdelivered or delivered repeatedly. As the volume of postal data grows, this investment grows with it and eventually becomes unsustainable for the mail system. Banking systems face the same user-address problem. If a banking system does not standardize user addresses (and in practice many do not: user addresses are entered and updated manually), then as business data grows and different databases prove incompatible, the system faces slow processing, low efficiency, and corrupted business data, which can easily lead to the loss of customers and to financial loss.
All of these problems can be handled by address technology. Since the founding of the country, as China's administrative regions were established, Chinese mailing addresses have developed certain rules and characteristics. At the same time, because China has a long history and a vast territory, many Chinese place names are shared by more than one place. The characteristics and rules of Chinese mailing addresses can be summarized as follows:
1) Address information is hierarchical, following administrative regions and ranks, for example province, city, district, road ("Shennan Road, Nanshan District, Shenzhen City, Guangdong Province") or province, county, township, village ("Leng Jiaba village, Shuan Hechang township, Yuezhi County, Sichuan Province").
2) Address data (place names) at different levels can share the same name, which easily causes ambiguity and misunderstanding. If the address given is "Wuhu", it is hard to tell whether it means "Wuhu City, Anhui Province" or "Wuhu County, Anhui Province".
3) The same address metadata (place name) can be expressed in several ways, that is, it can have aliases. For example, "Guangxi Zhuang Autonomous Region" may be written as "Gui" (桂), "Guangxi", or "Guangxi Autonomous Region".
The prior art offers some approaches to these problems. For example, patent application No. 200910156650.4, entitled "A method for determining Chinese geocodes based on fuzzy matching", proposes the following: read in descriptive Chinese address information (entered in a set standard pattern that follows China's administrative division standard); taking administrative regions as breakpoints, segment the address with a forward maximum matching method to obtain an array of original address elements; then standardize that array against an address dictionary to obtain the standard address. This scheme depends heavily on the accuracy of the input address, has difficulty handling names shared by address data at different levels, uses a relatively crude procedure, and achieves comparatively low accuracy.
In view of the above problems and the shortcomings of the prior art, the present invention provides a new approach to address identification and standardization.
3. Summary of the invention
The invention addresses the shortcomings of the prior art, in particular the difficulty of processing addresses in which address data at different levels share the same name. It aims to improve the accuracy of address processing and the throughput and recall of a transaction processing system in large-data environments. To this end, the invention provides an address identification and standardization system.
The system comprises an address input module, an address metadata dictionary module, an address identification and standardization module, and an address output module. The address input module receives the address entered by the user and sends it to the address identification and standardization module. The address identification and standardization module is connected to the address input module, the address metadata dictionary module, and the address output module; it receives the user-entered address sent by the address input module, identifies and standardizes it, and produces a standardized address. The address metadata dictionary module stores address metadata and receives and responds to control commands from the address identification and standardization module. The address output module receives control commands from the address identification and standardization module and outputs the standardized address.
The invention improves the accuracy of address processing. It remains applicable in large-data processing environments, where it achieves relatively high throughput and recall. Its processing uses little memory, and the address metadata dictionary is easy to update and maintain.
It should be understood that both the foregoing general description and the following detailed description are illustrative and exemplary, and are intended to provide further explanation of the invention as claimed.
4. Brief description of the drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification. They illustrate embodiments of the invention and, together with the description, serve to explain its principles.
Fig. 1 is a structural diagram of the address identification and standardization system of the invention.
Fig. 2 is a structural diagram of the address metadata dictionary module of a preferred embodiment of the invention.
Fig. 3 is a schematic diagram of the Trie tree structure of a preferred embodiment of the invention.
Fig. 4 is a structural diagram of the address identification and standardization module of a preferred embodiment of the invention.
Fig. 5 is a structural diagram of the address identification and standardization module of another preferred embodiment of the invention.
Fig. 6 is a structural diagram of the address metadata dictionary module of another preferred embodiment of the invention.
5. Detailed description of the embodiments
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Fig. 1 is a structural diagram of the address identification and standardization system of the invention. The system comprises an address input module 100, an address identification and standardization module 300, an address metadata dictionary module 200, and an address output module 400. The address input module 100 receives the address entered by the user and sends it to the address identification and standardization module 300. The address identification and standardization module 300 is connected to the address input module 100, the address metadata dictionary module 200, and the address output module 400; it receives the user-entered address sent by the address input module 100, identifies and standardizes it, and produces a standardized address. The address metadata dictionary module 200 stores address metadata and receives and responds to control commands from the address identification and standardization module 300. The address output module 400 receives control commands from the address identification and standardization module 300 and outputs the standardized address.
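The wiring of the four modules can be sketched roughly as follows. This is only an illustration of the data flow, not the patented implementation: the class names, the `lookup` method, and the whitespace-splitting placeholder logic are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class AddressMetadataDictionary:
    """Stands in for module 200; only stores and answers lookups here."""
    # Hypothetical content: raw name -> standard name.
    entries: Dict[str, str] = field(default_factory=dict)

    def lookup(self, name: str) -> str:
        # Respond to a query from module 300; unknown names pass through.
        return self.entries.get(name, name)


@dataclass
class AddressSystem:
    dictionary: AddressMetadataDictionary      # module 200
    read_input: Callable[[], str]              # module 100
    write_output: Callable[[str], None]        # module 400

    def identify_and_standardize(self, raw: str) -> str:
        # Stands in for module 300. Placeholder logic: split on whitespace
        # and look each token up in the dictionary.
        return " ".join(self.dictionary.lookup(tok) for tok in raw.split())

    def run(self) -> None:
        self.write_output(self.identify_and_standardize(self.read_input()))
```

Passing the input and output modules as plain callables keeps the sketch small; module 300 is the only component that talks to all three others, as in Fig. 1.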
The address metadata dictionary module 200 is shown in Fig. 2, the structural diagram of the address metadata dictionary module of a preferred embodiment. It comprises a hierarchical address metadata dictionary 210 and an address alias metadata dictionary 220. The hierarchical address metadata dictionary 210 stores hierarchical address metadata and may be a four-level or a six-level hierarchical address metadata dictionary.
The four-level hierarchical address metadata dictionary is divided by administrative region. One embodiment builds it on the four-level address model shown in Table 1, "Four-level address model (a)":
1) first level: provinces, autonomous regions, and municipalities directly under the central government;
2) second level: sub-provincial cities, prefecture-level cities, districts of municipalities, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures;
3) third level: districts of sub-provincial cities and districts of prefecture-level cities;
4) fourth level: townships, roads, natural villages, associated numbers, and building names.
This model is most often used for addresses on identity cards and in everyday writing.
Table 1. Four-level address model (a)
Embodiments may instead build the four-level hierarchical address metadata dictionary on another four-level address model, shown in Table 2, "Four-level address model (b)":
1) first level: provinces, autonomous regions, and municipalities directly under the central government;
2) second level: sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures;
3) third level: districts of municipalities, districts of sub-provincial cities, districts of prefecture-level cities, counties, county-level cities, autonomous counties, and banners;
4) fourth level: townships, roads, natural villages, associated numbers, and building names.
This model is more rigorous: it strictly follows the levels of the national administrative divisions, and it is commonly used for address classification and level information on the Internet.
Table 2. Four-level address model (b)
The six-level hierarchical address metadata dictionary is likewise divided by administrative region. Embodiments may build it on the six-level address model shown in Table 3, "Six-level address model (a)":
1) first level: provinces, autonomous regions, and municipalities directly under the central government;
2) second level: sub-provincial cities, prefecture-level cities, districts of municipalities, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures;
3) third level: districts of sub-provincial cities and districts of prefecture-level cities;
4) fourth level: townships;
5) fifth level: streets, roads, and natural villages;
6) sixth level: numbers.
This model, too, is most often used for addresses on identity cards and in everyday writing.
Table 3. Six-level address model (a)
Embodiments may instead build the six-level hierarchical address metadata dictionary on another six-level address model, shown in Table 4, "Six-level address model (b)":
1) first level: provinces, autonomous regions, and municipalities directly under the central government;
2) second level: sub-provincial cities, prefecture-level cities, autonomous prefectures, and prefectures;
3) third level: districts of municipalities, districts of sub-provincial cities, districts of prefecture-level cities, counties, county-level cities, autonomous counties, and banners;
4) fourth level: townships;
5) fifth level: streets, roads, and natural villages;
6) sixth level: numbers.
This model is also more rigorous: it strictly follows the levels of the national administrative divisions, and it is commonly used for address classification and level information on the Internet.
Table 4. Six-level address model (b)
As the tables of the above hierarchical address models show, a six-level address metadata dictionary carries more detailed and explicit address information than a four-level one, and the dictionary is correspondingly larger.
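The two four-level models differ mainly in where counties and similar units sit. A minimal sketch of the models as lookup tables follows; the English category strings, the `level_of` helper, and the use of 0 for an unrecognized category are illustrative assumptions, not names taken from the patent.

```python
# Level -> kinds of administrative unit, following the prose for models (a) and (b).
FOUR_LEVEL_A = {
    1: {"province", "autonomous region", "municipality"},
    2: {"sub-provincial city", "prefecture-level city", "municipality district",
        "county", "autonomous county", "county-level city", "banner",
        "autonomous prefecture", "prefecture"},
    3: {"sub-provincial city district", "prefecture-level city district"},
    4: {"township", "road", "natural village", "number", "building name"},
}

FOUR_LEVEL_B = {
    1: {"province", "autonomous region", "municipality"},
    2: {"sub-provincial city", "prefecture-level city",
        "autonomous prefecture", "prefecture"},
    3: {"municipality district", "sub-provincial city district",
        "prefecture-level city district", "county", "county-level city",
        "autonomous county", "banner"},
    4: {"township", "road", "natural village", "number", "building name"},
}


def level_of(model, kind):
    """Return the level a kind of unit belongs to, or 0 if it is not in the model."""
    for level, kinds in model.items():
        if kind in kinds:
            return level
    return 0
```

For example, `level_of(FOUR_LEVEL_A, "county")` gives the second level while `level_of(FOUR_LEVEL_B, "county")` gives the third, which is the distinction the two tables draw.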
The hierarchical address metadata dictionary 210 may use a Trie tree storage structure, which may be implemented as a double array. In a double-array Trie, all entries are compiled into a dictionary tree, which is a deterministic finite automaton (DFA).
The address alias metadata dictionary 220 stores address alias metadata and maintains a mapping to the metadata in the hierarchical address metadata dictionary 210. The relationship between entries of the hierarchical address metadata dictionary 210 and entries of the address alias metadata dictionary 220 is one-to-many: for example, the address metadata "Anhui Province" in dictionary 210 corresponds to the aliases "Anhui" and "Wan" (皖) in dictionary 220. Within the set of place-name metadata of a single level, an alias can correspond to only one address metadata entry. A mapping from aliases to address metadata can therefore be established within each level, so that every alias is mapped to its address metadata and address metadata is processed uniformly.
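The one-to-many relation and the per-level uniqueness of aliases can be sketched as below. The sample entries and the function name `build_alias_index` are assumptions made for illustration; the patent does not specify how dictionary 220 is stored.

```python
from collections import defaultdict

# Hypothetical sample data: level -> official name -> list of aliases (one-to-many).
CANONICAL_TO_ALIASES = {
    1: {
        "安徽省": ["安徽", "皖"],
        "广西壮族自治区": ["广西", "桂"],
    },
}


def build_alias_index(canonical_to_aliases):
    """Invert the one-to-many table into level -> alias -> official name.

    Within one level an alias may point to only one official name;
    a second, different target is rejected.
    """
    index = defaultdict(dict)
    for level, names in canonical_to_aliases.items():
        for canonical, aliases in names.items():
            for name in [canonical, *aliases]:
                if index[level].get(name, canonical) != canonical:
                    raise ValueError(f"alias {name!r} is ambiguous at level {level}")
                index[level][name] = canonical
    return index
```

Keying the index by level first means the same string can still be an alias at two different levels without conflict.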
The address identification and standardization module 300 is shown in Fig. 4, the structural diagram of the address identification and standardization module of a preferred embodiment. It comprises an address segmentation module 310, an address labeling module 320, a weight module 330, and an address standardization module 340.
The address segmentation module 310 receives the user-entered address sent by the address input module 100, segments it, and produces a segmented address metadata group. The segmentation may use the hierarchical address metadata dictionary 210 of the address metadata dictionary module 200 and apply a rightward maximum matching method to match and segment the user-entered address.
The hierarchical address metadata dictionary 210 uses a Trie tree storage structure implemented as a double array. The double-array Trie comprises two arrays: Base[], which stores for each state the base offset to which the next input variable is added (chosen so that positions do not collide), and Check[], which stores the position of the predecessor state of each state. All stored entries are compiled into a dictionary tree, which is a deterministic finite automaton (DFA). In a preferred embodiment, a rightward maximum matching algorithm matches and segments the user-entered address. The example below takes the Unicode code of each character of the input string as the input variable of the DFA (each byte of a UTF-8, GBK, or GB 18030 encoding can serve as the input variable in the same way). The algorithm proceeds as follows.
Starting from the initial state, the next state is obtained from the value of the input variable using:
Base[s] + c = t          formula (1)
Check[Base[s] + c] = s   formula (2)
where s is the current state, c is the value of the input variable, and t is the position of the next state. Formula (2) is a check that the predecessor of the next state is the current state. If the current state is an end state, the value at its position in the Base array is negative; otherwise (the state is a non-end or intermediate state) it is positive.
Each character of the string is read in turn, its Unicode value is taken as the input variable, and the next state is computed from the current state as above; the next state then becomes the current state and the process repeats until it ends. It ends in either of the following cases:
1) the input string reaches its end position;
2) the condition of check formula (2) is not satisfied.
When the process ends, the input variable of the last end state passed through marks the maximum match position, and its distance from the starting point is returned as the maximum match length.
From this procedure, the complexity of the algorithm is O(n), where n is the length of the string being searched; it is independent of the size of the dictionary (the number of entries it contains).
The Trie tree schematic of Fig. 3 helps in understanding the procedure and principle of the rightward maximum matching algorithm. For example, if the user enters "Guangdong Shenzhen Bao'an Xixiang" (written without separators in Chinese), matching and segmenting it by rightward maximum matching against the address metadata dictionary yields the segments "Guangdong", "Shenzhen", "Bao'an", "Xixiang".
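The lookup described by formulas (1) and (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the construction strategy (breadth-first, first free offset), the use of |Base[s]| for transitions out of end states, the single-character fallback for unmatched text, and the Chinese sample entries are all choices made here.

```python
from collections import deque


def build_double_array(words):
    """Compile words into Base/Check arrays; state 0 is the initial state."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = True                       # marks the end of an entry
    base, check = [0], [-1]                   # check == -1 means "slot unused"
    queue = deque([(0, root)])
    while queue:
        s, node = queue.popleft()
        codes = [ord(ch) for ch in node if ch]
        b = 1
        while True:                           # find an offset with no collision
            need = b + max(codes, default=0) + 1
            if need > len(base):
                base.extend([0] * (need - len(base)))
                check.extend([-1] * (need - len(check)))
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        for ch, child in node.items():
            if ch:
                check[b + ord(ch)] = s        # predecessor of the child is s
                queue.append((b + ord(ch), child))
        base[s] = -b if "" in node else b     # negative Base marks an end state
    return base, check


def longest_match(base, check, text, start=0):
    """Length of the longest dictionary entry starting at text[start], or 0."""
    s, best = 0, 0
    for i in range(start, len(text)):
        t = abs(base[s]) + ord(text[i])       # formula (1)
        if t >= len(check) or check[t] != s:  # formula (2) fails: stop
            break
        s = t
        if base[s] < 0:                       # passed through an end state
            best = i - start + 1
    return best


def segment(base, check, text):
    """Rightward maximum matching; unmatched characters are emitted singly."""
    pieces, i = [], 0
    while i < len(text):
        n = longest_match(base, check, text, i) or 1
        pieces.append(text[i:i + n])
        i += n
    return pieces
```

With a dictionary containing both "广东" and "广东省", `longest_match` keeps walking past the first end state and returns the longer entry when the input allows it. Each character is examined once per match attempt, independent of how many entries the dictionary holds.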
The address labeling module 320 uses the address metadata dictionary module 200 to label the segmented address metadata group according to predefined label attributes, and produces a labeled address metadata group. Because user input is arbitrary, address information may be mistyped, or stray text may be mixed in between correct address levels; the system must therefore be robust. To this end, a preferred embodiment provides an address labeling method. The method varies with the hierarchical address dictionary model; taking the four-level address model (a) of Table 1 as the example, it works as follows.
The label attributes are predefined as:
1) Provinces, autonomous regions, and municipalities directly under the central government are the first level; the label "(1)", meaning first-level address, is appended to the segmented first-level address metadata.
2) Sub-provincial cities, prefecture-level cities, districts of municipalities, counties, autonomous counties, county-level cities, banners, autonomous prefectures, and prefectures are the second level; the label "(2)", meaning second-level address, is appended to the segmented second-level address metadata.
3) Districts of sub-provincial cities and districts of prefecture-level cities are the third level; the label "(3)", meaning third-level address, is appended to the segmented third-level address metadata.
4) Townships, roads, natural villages, building names, and the like are the fourth level; the label "(4)", meaning fourth-level address, is appended to the segmented fourth-level address metadata.
5) Data that is not address information, or address information the system cannot identify, receives the label "(0)", meaning a level-0 (unidentified) address.
Under these definitions, if the user enters "Guangdong Shenzhen Bao'an Xixiang", rightward maximum matching against the address metadata dictionary segments it into "Guangdong", "Shenzhen", "Bao'an", "Xixiang", and labeling can then yield "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)". Because address metadata at different levels can share a name, labeling an input address can produce several results. The address above has four: "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)", "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2)", "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4)", and "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2)".
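The way several labelings arise from one segmentation can be sketched as a Cartesian product over each segment's candidate levels. The candidate table below is a hypothetical stand-in for a dictionary lookup, chosen so that it reproduces the four labelings of the example; the function names are likewise assumptions.

```python
from itertools import product

# Hypothetical lookup result: levels at which each segment appears in the dictionary.
CANDIDATES = {
    "Guangdong": [1],
    "Shenzhen": [2, 4],
    "Bao'an": [3],
    "Xixiang": [4, 2],
}


def enumerate_labelings(segments, candidates):
    """Return every combination of candidate levels; unknown segments get level 0."""
    options = [candidates.get(seg, [0]) for seg in segments]
    return [list(zip(segments, combo)) for combo in product(*options)]


def render(labeling):
    """Format a labeling in the 'name(level)' style used in the description."""
    return " ".join(f"{seg}({lvl})" for seg, lvl in labeling)
```

Two segments with two candidate levels each give 2 × 2 = 4 labelings, and a segment missing from the table falls back to the "(0)" label.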
Accordingly, under the above definitions, the general label sequences are of the types shown in Table 5.
Table 5. Examples of general label sequences
The labeling methods for the four-level address model (b) of Table 2 and the six-level address models (a) and (b) of Tables 3 and 4 are analogous. The difference is that the label types of a six-level address metadata dictionary are more detailed than those of a four-level one.
The weight module 330 computes a weight for each labeled address metadata group and outputs the group with the largest weight. After the address segmentation module 310 and the address labeling module 320 have processed an input address, one or more labeled address metadata groups result. For the input "Guangdong Shenzhen Bao'an Xixiang", the four labelings are "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(4)", "Guangdong(1) Shenzhen(2) Bao'an(3) Xixiang(2)", "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(4)", and "Guangdong(1) Shenzhen(4) Bao'an(3) Xixiang(2)". This happens because address metadata at different levels can share a name. In embodiments of the invention, a dynamic programming algorithm computes the weights and outputs the group with the largest weight.
The dynamic programming algorithm may be the classic Viterbi algorithm, which computes the optimal sequence of address-level labels; both the observations and the states of the algorithm are address levels. The algorithm comprises the following.
An initial state vector:
$$\mathrm{Pi}_{n \times 1} = (\pi_1, \pi_2, \pi_3, \ldots, \pi_n)^{T} \qquad \text{formula (3)}$$
where $\pi_i$ is the initial probability that the address level is $i$. The values in Pi are set empirically according to the following principle: the higher the administrative level of the address, the higher its initial probability; for example, the initial probability of the provincial level is greater than that of the city level.
A probability transition matrix $A_{n \times n}$:
$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} \end{pmatrix}$$
where
$$a_{ij} = P(q_t = j \mid q_{t-1} = i), \qquad 1 \le i, j \le n$$
is the probability that the next address level is $j$ given that the current address level is $i$. Each value in A is set empirically, for example according to these principles:
1) In general, when (i, j) forms a reverse order (i >= j), the value of $a_{ij}$ should in principle be smaller than every non-reverse value (i < j). This condition ensures that the final label sequence runs in the direction of increasing address level.
2) Some address hierarchy models allow a level to repeat consecutively; for example, "(4)+" denotes consecutive fourth-level addresses. In that case the corresponding $a_{44}$ is slightly larger; otherwise it is slightly smaller.
The matrix is subject to the constraints:
$$a_{ij} \ge 0, \qquad \forall i, j$$
$$\sum_{j=1}^{n} a_{ij} = 1, \qquad \forall i$$
The Viterbi algorithm proceeds as follows:
1) Initialization
$$\delta_1(i) = \pi_i, \quad 1 \le i \le n \qquad \text{Formula (4)}$$
[Figure BSA00000566046700121] Formula (5)
2) Recursion
$$\delta_t(j) = \max_{1 \le i \le n} \left[ \delta_{t-1}(i)\, a_{ij} \right] \qquad \text{Formula (6)}$$
[Figure BSA00000566046700123] Formula (7)
where 2 ≤ t ≤ T and 1 ≤ j ≤ n.
3) Termination
$$P^{*} = \max_{1 \le i \le n} \left[ \delta_T(i) \right] \qquad \text{Formula (8)}$$
$$q_T^{*} = \arg\max_{1 \le i \le n} \left[ \delta_T(i) \right] \qquad \text{Formula (9)}$$
The optimal label sequence is then obtained by backtracking, according to the following formula:
[Figure BSA00000566046700126], for t = T-1, T-2, ..., 1 Formula (10)
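The flow above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `viterbi`, the `candidates` interface (the set of levels the dictionary allows for each token) and the toy three-level numbers are all assumptions made for the example. The back-pointer table follows the textbook form of the algorithm, since the formulas rendered as images in the source are not available here.

```python
def viterbi(candidates, pi, A):
    """Return (best_path, best_weight) for a sequence of tokens.

    candidates[t] is the set of levels allowed for token t (assumed
    interface); pi[i] and A[i][j] are indexed by address level.
    """
    # Initialization: delta_1(i) = pi_i for every admissible level i.
    delta = {i: pi[i] for i in candidates[0]}
    backpointers = []
    # Recursion: delta_t(j) = max_i delta_{t-1}(i) * a_ij, remembering the argmax.
    for allowed in candidates[1:]:
        new_delta, psi = {}, {}
        for j in allowed:
            best_i = max(delta, key=lambda i: delta[i] * A[i][j])
            new_delta[j] = delta[best_i] * A[best_i][j]
            psi[j] = best_i
        delta = new_delta
        backpointers.append(psi)
    # Termination: the level with the largest final weight ends the path.
    last = max(delta, key=delta.get)
    best_weight = delta[last]
    # Backtracking: follow the stored argmax pointers from the end to the start.
    path = [last]
    for psi in reversed(backpointers):
        path.append(psi[path[-1]])
    path.reverse()
    return path, best_weight


# Hypothetical three-level toy model (not taken from the patent).
toy_pi = [0.5, 0.3, 0.2]
toy_A = [[0.2, 0.6, 0.2],
         [0.1, 0.3, 0.6],
         [0.3, 0.3, 0.4]]
path, weight = viterbi([{0}, {1, 2}, {2}], toy_pi, toy_A)
print(path, weight)
```

Restricting each step to the levels the dictionary allows keeps the search over only those label sequences the labeling step could actually produce.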
The implementation of the above algorithm is illustrated by an example. Because different hierarchical address metadata dictionaries use different models, the specific values of Pi and A in the Viterbi algorithm differ accordingly. In this embodiment, taking "Table 1, four-level hierarchical address model (a)" as an example, Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
For an input address such as "Guangdong Shenzhen Bao'an Xixiang", processing by the address cutting module 310 and the address labeling module 320 yields the following four labelings: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)" and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)". Applying the Viterbi algorithm gives the weights of the four labelings:
1) Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4); P=0.030375
2) Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2); P=0.0030375
3) Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4); P=0.001125
4) Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2); P=1.125E-4
The label sequence with the highest probability is the first one. The dynamic programming algorithm therefore outputs the first labeling, "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)".
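The four weights listed above can be reproduced by multiplying the initial probability of the first level by the transition probabilities along each label sequence. The sketch below enumerates the candidates directly instead of running the dynamic program. It assumes that the entries of Pi and the rows and columns of A are indexed by address level 0 to 4, an indexing that is consistent with the listed weights; the names `path_weight` and `candidates` are illustrative.

```python
from itertools import product

# Values quoted from the example; index = address level 0..4 (assumed indexing).
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [[0.05, 0.45, 0.25, 0.15, 0.10],
     [0.05, 0.23, 0.45, 0.17, 0.10],
     [0.05, 0.18, 0.25, 0.30, 0.22],
     [0.05, 0.35, 0.05, 0.05, 0.50],
     [0.05, 0.30, 0.15, 0.05, 0.45]]


def path_weight(levels):
    """Weight of one label sequence: pi of the first level times the transitions."""
    weight = PI[levels[0]]
    for prev, cur in zip(levels, levels[1:]):
        weight *= A[prev][cur]
    return weight


# Candidate levels per token: Guangdong, Shenzhen, Bao'an, Xixiang.
candidates = [(1,), (2, 4), (3,), (4, 2)]
weights = {path: path_weight(path) for path in product(*candidates)}
best = max(weights, key=weights.get)

for path, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(path, weight)
```

Enumeration is only practical for a handful of candidates; the Viterbi recursion gives the same maximum without listing every sequence.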
The weights for "Table 2, four-level hierarchical address model (b)", "Table 3, six-level hierarchical address model (a)" and "Table 4, six-level hierarchical address model (b)" can be processed in a similar way.
The address alias metadata dictionary 220 stores address alias metadata and has a mapping relationship with the hierarchical address metadata dictionary 210. Using the address alias metadata dictionary 220, the address standardization module 340 can standardize the address aliases in the input; the standardized name is the full official name of the address. For example, "Shanghai" is standardized as "Shanghai City", "Guangdong" as "Guangdong Province", "Guangxi" as "Guangxi Zhuang Autonomous Region", "Beijing" as "Beijing City", and so on.
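As a rough illustration of the alias lookup, the mapping can be pictured as a plain dictionary from alias to official name. The sketch below uses only examples given in the text; the names `ALIAS_TO_OFFICIAL` and `standardize_name` are hypothetical, and a real dictionary would also carry the link to the hierarchical address metadata dictionary.

```python
# Alias -> official full name, taken from the examples in the text.
ALIAS_TO_OFFICIAL = {
    "Shanghai": "Shanghai City",
    "Guangdong": "Guangdong Province",
    "Guangxi": "Guangxi Zhuang Autonomous Region",
}


def standardize_name(name):
    """Return the official name for a known alias, or the input unchanged."""
    return ALIAS_TO_OFFICIAL.get(name, name)


print(standardize_name("Guangxi"))
print(standardize_name("Unknown Place"))
```

Returning unknown names unchanged mirrors the idea that text without a dictionary entry is passed through rather than guessed at.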
After the address metadata set with the maximum weight has been produced, the address standardization module 340 standardizes it using the address alias metadata dictionary 220, generates a standardized address, and sends a control command to the address output module 400. For the input address "Guangdong Shenzhen Bao'an Xixiang" above, processing by the address cutting module 310, the address labeling module 320 and the weights module 330, followed by standardization in the address standardization module 340, yields the standardized address "Xixiang Town, Bao'an District, Shenzhen City, Guangdong Province". The address standardization module 340 handles real-time addresses (that is, address text labeled "0") as follows:
1) The real-time address appears before the first occurrence of a level-four address. Because the collected level-one, level-two and level-three addresses are relatively complete, an uncollected address among the first three levels generally does not occur. A real-time address that appears before the first level-four address is therefore generally caused by a user input error. In this embodiment, the address standardization module 340 by default standardizes the other, non-zero-level addresses before and after the real-time address. In other embodiments, the address text labeled "(0)" may instead be deleted, or it may simply be left unstandardized.
2) The real-time address appears after the first occurrence of a level-four address. This may be caused by incomplete address data (for example, with an address metadata dictionary based on "Table 1, four-level hierarchical address model (a)", data after the "road" level, such as typical buildings and residential community names, cannot be recognized) or by an input error. In the preferred embodiment, none of the labeled metadata that follows the last item labeled "(4)" before the first real-time address is standardized. For example, take the address "Hubei Province Wuhan City Wuchang District Luoyu Road No. 1037 Anhui New Garden 7 Unit 3 Room 203". Its label sequence is: "Hubei Province (1) Wuhan City (2) Wuchang District (3) Luoyu Road (4) No. 1037 (4) Anhui (1) New Garden (0) 7 (4) Unit 3 (4) Room 203 (4)". Because the data in the address database are incomplete, this case is handled as follows: the first item labeled as a real-time address is "New Garden"; the last item labeled as a level-four address before it is "No. 1037"; therefore nothing after "No. 1037" is standardized. The final result for "Anhui New Garden 7 Unit 3 Room 203" should thus remain "Anhui New Garden 7 Unit 3 Room 203", not "Anhui Province New Garden 7 Unit 3 Room 203".
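The two rules for real-time (level "0") text can be sketched as a single cutoff computation. This is an illustrative reading of the rules, not the patented implementation: the function name `standardize_labeled`, the `(text, level)` token format and the small alias table are assumptions, and the token strings are English renderings of the Wuhan example.

```python
def standardize_labeled(tokens, alias):
    """Standardize labeled tokens, honoring the level-0 rules.

    tokens: list of (text, level) pairs in address order; level 0 marks
    unrecognized ("real-time") text. alias: alias -> official name.
    """
    levels = [level for _, level in tokens]
    cutoff = len(tokens)  # by default every non-zero token may be standardized
    if 4 in levels:
        first_four = levels.index(4)
        for k in range(first_four + 1, len(tokens)):
            if levels[k] == 0:
                # Rule 2: stop right after the last level-4 token before
                # the first level-0 token that follows a level-4 token.
                last_four = max(i for i in range(k) if levels[i] == 4)
                cutoff = last_four + 1
                break
    result = []
    for k, (text, level) in enumerate(tokens):
        if k < cutoff and level != 0:
            result.append(alias.get(text, text))
        else:
            result.append(text)  # left untouched
    return result


alias = {"Hubei": "Hubei Province", "Wuhan": "Wuhan City",
         "Wuchang": "Wuchang District", "Anhui": "Anhui Province"}
tokens = [("Hubei", 1), ("Wuhan", 2), ("Wuchang", 3), ("Luoyu Road", 4),
          ("No. 1037", 4), ("Anhui", 1), ("New Garden", 0),
          ("7", 4), ("Unit 3", 4), ("Room 203", 4)]
print(standardize_labeled(tokens, alias))
```

In this run "Anhui" stays as typed because it sits after "No. 1037", the last level-four token before the first real-time token.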
In another preferred embodiment of the present invention, the address identification and standardization module 300 comprises, in addition to the address cutting module 310, the address labeling module 320, the weights module 330 and the address standardization module 340, a correction module 350, as shown in Fig. 5, the structural diagram of the address identification and standardization module of another preferred embodiment. The correction module uses the address metadata dictionary to determine whether the label sequence of the maximum-weight address metadata set is consistent with the constraint conditions of the address metadata dictionary. If it is not, the correction module corrects the label sequence of the maximum-weight address metadata set, generates a corrected address metadata set, and outputs it to the address standardization module.
In another preferred embodiment of the present invention, the address metadata dictionary module 200 further comprises an address metadata correction dictionary 230, as shown in Fig. 6, the structural diagram of the address metadata dictionary module of another preferred embodiment. The address metadata correction dictionary 230 stores the following two types of constraint condition data:
1) Where a district shares its name with a county or county-level city, the constraint condition is stored. For example, in the four-level address models (a, b) and the six-level address models (a, b), the alias of a district may be identical to the alias of a county or county-level city. "Taihe County" (under Fuyang City, Anhui Province) and "Taihe District" (under Jinzhou City, Liaoning Province) both have the alias "Taihe". When the optimal label sequence is computed by the dynamic programming algorithm, "Taihe" is always labeled at the district level, so the optimal candidate result must additionally be corrected using a constraint condition. In this embodiment the stored constraint condition is "Taihe → Fuyang City", which indicates that "Taihe" is a county under "Fuyang City".
2) Where a township or town shares its name with a county or county-level city, the constraint condition is stored. For example, in the four-level address model (a) and the six-level address model (a), a township or town may have the same name as a county or county-level city. In the address "Fujian Longyan Changting Heping Road", "Changting" would be labeled at the township level, whereas it is in fact a county under "Longyan", namely "Changting County", and not the "Changting Town" under "Hailin City, Mudanjiang City, Heilongjiang Province". It must therefore also be judged against a constraint condition. The constraint condition is "Changting → Longyan", which indicates that "Changting" is a county under "Longyan".
Using the address metadata correction dictionary 230, the correction module 350 can judge whether the maximum-weight address metadata set is consistent, that is, whether it satisfies the constraint conditions. If it is inconsistent, that is, it does not satisfy the constraint conditions, the selected optimal result is problematic, and the correction module 350 corrects the labels of the maximum-weight address metadata set, generates a corrected address metadata set, and outputs it to the address standardization module.
The implementation of this other preferred embodiment is illustrated by an example. Taking "Table 1, four-level hierarchical address model (a)" as an example, Pi and A may take the following initial values:
Pi={0.05,0.45,0.25,0.15,0.1};
A={{0.05,0.45,0.25,0.15,0.10},
{0.05,0.23,0.45,0.17,0.10},
{0.05,0.18,0.25,0.30,0.22},
{0.05,0.35,0.05,0.05,0.50},
{0.05,0.30,0.15,0.05,0.45}};
Suppose the address input by the user is "Heilongjiang Heihe Wudalianchi New Township". After processing by the address cutting module 310, the address labeling module 320 and the weights module 330, the labelings of the input address and their weights are as follows:
1) "Heilongjiang (1) Heihe (2) Wudalianchi (4) New Township (4)" P=0.0200475;
2) "Heilongjiang (1) Heihe (2) Wudalianchi (2) New Township (4)" P=0.0111375;
3) "Heilongjiang (1) Heihe (4) Wudalianchi (4) New Township (4)" P=0.0091125;
4) "Heilongjiang (1) Heihe (4) Wudalianchi (2) New Township (4)" P=0.001485.
The metadata set with the maximum weight is output at this point: "Heilongjiang (1) Heihe (2) Wudalianchi (4) New Township (4)". Using the address metadata correction dictionary 230 in the address metadata dictionary module 200, the correction module checks this maximum-weight metadata set and finds that "Heihe (2) Wudalianchi (4)" is inconsistent with the constraint condition "Wudalianchi → Heihe City". The correction module therefore corrects the labels of the maximum-weight address metadata set "Heilongjiang (1) Heihe (2) Wudalianchi (4) New Township (4)", generates the corrected address metadata set "Heilongjiang (1) Heihe (2) Wudalianchi (2) New Township (4)", and outputs it to the address standardization module.

Claims (12)

1. An address identification and standardization system, characterized in that the system comprises an address input module, an address metadata dictionary module, an address identification and standardization module, and an address output module;
the address input module is configured to receive an address input by a user and to send the user-input address to the address identification and standardization module;
the address identification and standardization module is connected to the address input module, the address metadata dictionary module and the address output module, and is configured to receive the user-input address sent by the address input module, to identify and standardize the user-input address, and to generate a standardized address;
the address metadata dictionary module is configured to store address metadata and to receive and respond to control commands from the address identification and standardization module;
the address output module is configured to receive control commands from the address identification and standardization module and to output the standardized address.
2. The address identification and standardization system according to claim 1, characterized in that the address identification and standardization module comprises an address cutting module, an address labeling module, a weights module and an address standardization module;
the address cutting module is configured to receive the user-input address sent by the address input module, to cut the user-input address using the address metadata dictionary module, and to generate a cut address metadata set;
the address labeling module is configured to label the cut address metadata set using the address metadata dictionary module and to generate a labeled address metadata set;
the weights module is configured to calculate the weights corresponding to the labeled address metadata set and to output the address metadata set with the maximum weight;
the address standardization module is configured to standardize the maximum-weight address metadata set using the address metadata dictionary module, to generate a standardized address, and to send a control command to the address output module.
3. The address identification and standardization system according to claim 2, characterized in that the cutting may use a rightward (left-to-right) maximum matching algorithm to match and cut the user-input address.
4. The address identification and standardization system according to claim 1 or 2, characterized in that the address metadata dictionary module comprises a hierarchical address metadata dictionary and an address alias metadata dictionary.
5. The address identification and standardization system according to claim 4, characterized in that the hierarchical address metadata dictionary is configured to store hierarchical address metadata and may be a four-level hierarchical address metadata dictionary or a six-level hierarchical address metadata dictionary.
6. The address identification and standardization system according to claim 5, characterized in that the hierarchical address metadata dictionary may use a Trie tree storage structure.
7. The address identification and standardization system according to claim 6, characterized in that the Trie tree storage structure may be implemented as a double-array Trie.
8. The address identification and standardization system according to claim 4, characterized in that the address alias metadata dictionary is configured to store alias metadata of addresses, and the metadata in the address alias metadata dictionary have a mapping relationship with the metadata in the hierarchical address metadata dictionary.
9. The address identification and standardization system according to claim 2, characterized in that the weights module may use a dynamic programming algorithm to calculate and output the address metadata set with the maximum weight.
10. The address identification and standardization system according to claim 9, characterized in that the dynamic programming algorithm may be implemented using the Viterbi algorithm.
11. The address identification and standardization system according to claim 2, characterized in that the address identification and standardization module further comprises a correction module; the correction module uses the address metadata dictionary module to determine whether the maximum-weight address metadata set is consistent; if it is inconsistent, the labels of the maximum-weight address metadata set are corrected; and a corrected address metadata set is generated and output to the address standardization module.
12. The address identification and standardization system according to claim 11, characterized in that the address metadata dictionary module further comprises an address metadata correction dictionary for storing the constraint condition data.
CN201110255616.XA 2011-08-31 2011-08-31 A kind of address identification, standardized system Active CN102955832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110255616.XA CN102955832B (en) 2011-08-31 2011-08-31 A kind of address identification, standardized system

Publications (2)

Publication Number Publication Date
CN102955832A true CN102955832A (en) 2013-03-06
CN102955832B CN102955832B (en) 2015-11-25

Family

ID=47764642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110255616.XA Active CN102955832B (en) 2011-08-31 2011-08-31 A kind of address identification, standardized system

Country Status (1)

Country Link
CN (1) CN102955832B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350012A (en) * 2007-07-18 2009-01-21 北京灵图软件技术有限公司 Method and system for matching address
US20090150393A1 (en) * 2007-12-11 2009-06-11 Pitney Bowes Inc. Method for assignment of point level address geocodes to street networks
CN101719128A (en) * 2009-12-31 2010-06-02 浙江工业大学 Fuzzy matching-based Chinese geo-code determination method
CN101996247A (en) * 2010-11-10 2011-03-30 百度在线网络技术(北京)有限公司 Method and device for constructing address database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜锋: "基于条件随机场的中文分词研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 2, 15 February 2007 (2007-02-15), pages 138 - 698 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252507B (en) * 2013-06-28 2017-06-27 北京华傲达数据技术有限公司 A kind of business data matching process and device
CN104252507A (en) * 2013-06-28 2014-12-31 北京华傲达数据技术有限公司 Enterprise data matching method and device
WO2015027836A1 (en) * 2013-08-27 2015-03-05 深圳市华傲数据技术有限公司 Method and system for place name entity recognition
CN103440311A (en) * 2013-08-27 2013-12-11 深圳市华傲数据技术有限公司 Method and system for identifying geographical name entities
WO2016127904A1 (en) * 2015-02-13 2016-08-18 阿里巴巴集团控股有限公司 Text address processing method and apparatus
WO2016165538A1 (en) * 2015-04-13 2016-10-20 阿里巴巴集团控股有限公司 Address data management method and device
CN104951508A (en) * 2015-05-21 2015-09-30 腾讯科技(深圳)有限公司 Time information identification method and device
CN106407475A (en) * 2016-11-18 2017-02-15 广州爱九游信息技术有限公司 Content screening method, device and server
CN107145577A (en) * 2017-05-08 2017-09-08 上海东方网络金融服务有限公司 Address standardization method, device, storage medium and computer
CN110210020A (en) * 2019-05-22 2019-09-06 武汉虹信通信技术有限责任公司 The standardized system and method for address
CN110210020B (en) * 2019-05-22 2023-06-20 武汉虹旭信息技术有限责任公司 Communication address standardization system and method thereof
CN110569322A (en) * 2019-07-26 2019-12-13 苏宁云计算有限公司 Address information analysis method, device and system and data acquisition method
CN110851559A (en) * 2019-10-14 2020-02-28 中科曙光南京研究院有限公司 Automatic data element identification method and identification system
CN113434708A (en) * 2021-05-25 2021-09-24 北京百度网讯科技有限公司 Address information detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN102955832B (en) 2015-11-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: Room 713, 7/F, Software Building, No. 9, High-tech Middle Road, Central District, Shenzhen, Guangdong 518057

Applicant after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: Room 607, Overseas Student Pioneer Building, No. 29 South Ring Road, High-tech Zone, Nanshan District, Shenzhen, Guangdong 518057

Applicant before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: Room 713, 7/F, Software Building, No. 9, High-tech Middle Road, Central District, Shenzhen, Guangdong 518057

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.
