CN102024024A - Method and device for constructing address database - Google Patents

Publication number
CN102024024A
Authority
CN
China
Prior art keywords
address
normal form
original
information
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010540110
Other languages
Chinese (zh)
Other versions
CN102024024B (en)
Inventor
时金
万鑫
张传明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN 201010540110 (patent CN102024024B)
Publication of CN102024024A
Application granted
Publication of CN102024024B
Legal status: Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing an address database. The method comprises the following steps: acquiring original address data; using a word segmentation model to segment and classify the original address data and generate a normal form address; and filing the normal form address into a normal form address database. The invention also discloses a device for constructing the address database. Because the word segmentation model segments and classifies each address to be classified by its address attributes before the address is stored in the normal form address database, the address database is constructed with both high efficiency and high accuracy.

Description

Method and device for constructing an address database
[technical field]
The present invention relates to a method and device for constructing an address database, and in particular to a method and device for constructing an address database based on a learning model.
[background technology]
Over the past ten or so years, with the development of Internet technology, people have come to rely more and more on the rich, fast and timely information that the Internet provides. How to find the information one is looking for in this vast sea of information has become a problem in urgent need of a solution. Numerous Internet search engines and corresponding websites have emerged in response, the most prominent of which include Baidu Search (www.baidu.com) from Baidu and Google Search (www.google.cn) from Google.
Among the many kinds of information to be searched, one important class is address information, and the demand for it is especially evident when searching online electronic maps. Compared with traditional paper maps or stand-alone electronic maps, an online electronic map is updated promptly, is easy to query, is concise and intuitive to use, and provides rich information. Online electronic map providers that are currently widely recommended in China include Baidu Map (map.baidu.com) from Baidu and Google Maps (ditu.google.cn) from Google; of these, Baidu Map better suits the usage habits of Chinese users and has been widely adopted.
When a user of an online electronic map enters an address to be queried into the map's address search box, that address is looked up in the constructed address database.
However, existing techniques for constructing an address database have some defects. When an existing address database is constructed, the received address data is segmented using only a dictionary, a vocabulary, a suffix keyword list and manually summarized rules, and is then filed into the address database; attributes are usually assigned by manually adapting to the received address data. For example, suppose the received address is "Zhongguancun Street South No. 100". It is first segmented using the dictionary, vocabulary and suffix keyword list. The suffix keyword list may contain entries such as "street", "road" and "number"; whenever one of these keywords is encountered, the address is cut immediately after it. The address is thus segmented into "Zhongguancun Street", "South" and "No. 100". After segmentation, attributes are added to the segments by manual adaptation. The attribute labeling order here is road name, direction name, house number: "Zhongguancun Street" is given the attribute road name, "South" the attribute direction name, and "No. 100" the attribute house number. If, however, the received address is "Zhongguancun Street No. 100 South", it is segmented in the same way into "Zhongguancun Street", "No. 100" and "South", and a new attribute labeling order must be added for these segments: road name, house number, direction name. The attributes are then assigned accordingly: "Zhongguancun Street" is given the attribute road name, "No. 100" the attribute house number, and "South" the attribute direction name.
Because this construction method must keep adding new attribute labeling orders, processing is comparatively complex and inefficient. In addition, segmenting addresses using only a dictionary, a vocabulary and suffix keywords yields low segmentation accuracy.
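The suffix-keyword approach described above can be illustrated with a minimal Python sketch. It is not code from the patent: the Chinese strings are an assumed back-translation of the Zhongguancun Street example, and `SUFFIX_KEYWORDS`, `DIRECTION_WORDS`, `LABEL_ORDERS`, `suffix_segment` and `label` are hypothetical names chosen for illustration.

```python
# Hypothetical suffix keyword list and dictionary (assumed, not from the patent).
SUFFIX_KEYWORDS = ("街", "路", "号")        # street, road, number
DIRECTION_WORDS = ("东", "南", "西", "北")  # east, south, west, north

# Hand-maintained attribute labeling orders; each new word order needs a new entry.
LABEL_ORDERS = {
    "road-direction-number": ("road name", "direction name", "house number"),
    "road-number-direction": ("road name", "house number", "direction name"),
}


def suffix_segment(address):
    """Cut after each suffix keyword; a direction word at a segment start stands alone."""
    segments, current = [], ""
    for ch in address:
        if ch in DIRECTION_WORDS and not current:
            segments.append(ch)
            continue
        current += ch
        if ch in SUFFIX_KEYWORDS:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


def label(segments, order_name):
    """Attach attributes purely by position, as the manual approach does."""
    return list(zip(LABEL_ORDERS[order_name], segments))


first = suffix_segment("中关村大街南100号")   # Zhongguancun Street South No. 100
second = suffix_segment("中关村大街100号南")  # Zhongguancun Street No. 100 South
print(label(first, "road-direction-number"))
print(label(second, "road-number-direction"))
```

Labeling `second` with the first order would mark "100号" as a direction name, which is why every new word order forces a new hand-written entry. The rule-based cut is also fragile: `suffix_segment("北京路")` (Beijing Road) returns `["北", "京路"]`, splitting the place name because its first character is also a direction word.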
Therefore, an improved method and device for constructing an address database are needed.
[summary of the invention]
An object of the present invention is to provide an improved method for constructing an address database, which builds a normal form address database from a large amount of input original address data.
Another object of the present invention is to provide an improved device for constructing an address database, which builds a normal form address database from a large amount of input original address data.
Accordingly, a method for constructing an address database according to one embodiment of the present invention, namely a method for constructing a normal form address database, comprises:
S1: obtaining original address data;
S2: segmenting and classifying the original address data with a word segmentation model to produce a normal form address;
S3: filing the normal form address into the normal form address database.
As a further improvement of the present invention, S2 comprises the following steps:
the word segmentation model segments the original address;
the normal form address is produced from the resulting segments.
As a further improvement of the present invention, S1 comprises:
judging whether the original address data matches the format of a normal form address;
if it matches, outputting the original address data directly as the normal form address.
As a further improvement of the present invention, S1 comprises:
judging whether the original address data matches the format of a normal form address;
if it does not match, proceeding to S2.
As a further improvement of the present invention, an address statistical analysis step follows S1: the address statistical analysis step performs statistical analysis on the original address data and produces the normal form address.
As a further improvement of the present invention, S1 comprises:
judging whether the original address data matches the format of a normal form address;
if it does not match, proceeding to the address statistical analysis step.
As a further improvement of the present invention, the address statistical analysis step comprises:
identifying first address information that precedes the unknown address information;
identifying second address information that follows the unknown address information;
counting, in an address data resource library, the address type information that occurs between the first address information and the second address information, and calculating the probability with which each item of address type information occurs;
comparing the address type information with the highest probability against a preset threshold and, if the probability is higher than the threshold, combining that address type information with the first address information and the second address information to produce the normal form address.
As a further improvement of the present invention, the address statistical analysis step comprises:
if the probability is lower than the threshold, proceeding to step S2.
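The address statistical analysis step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the resource library is modeled as lists of `(text, type)` pairs, the first and second address information are matched by their type, and `RESOURCE_LIBRARY`, `infer_unknown_type` and the 0.6 threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical address data resource library: each entry is a segmented address.
RESOURCE_LIBRARY = [
    [("Haidian District", "district"), ("Shangdi 10th Street", "road"), ("No. 10", "house_number")],
    [("Haidian District", "district"), ("Zhongguancun Street", "road"), ("No. 100", "house_number")],
    [("Chaoyang District", "district"), ("Jianguo Road", "road"), ("No. 1", "house_number")],
    [("Haidian District", "district"), ("Baidu Building", "landmark"), ("No. 10", "house_number")],
]


def infer_unknown_type(before_type, after_type, library, threshold=0.6):
    """Guess the type of an unknown segment from its neighbours.

    Returns the most probable type if its probability exceeds the threshold,
    otherwise None, meaning the address should fall through to step S2.
    """
    counts = Counter()
    for entry in library:
        types = [seg_type for _, seg_type in entry]
        for i in range(1, len(types) - 1):
            if types[i - 1] == before_type and types[i + 1] == after_type:
                counts[types[i]] += 1
    total = sum(counts.values())
    if total == 0:
        return None  # no evidence in the library
    best_type, best_count = counts.most_common(1)[0]
    probability = best_count / total
    return best_type if probability > threshold else None


print(infer_unknown_type("district", "house_number", RESOURCE_LIBRARY))
```

With this toy library, three of the four segments found between a district and a house number are roads, so the probability of "road" is 0.75. That clears the assumed 0.6 threshold but would not clear a threshold of 0.8, in which case the function returns `None` and the address would be handed to the word segmentation model.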
As a further improvement of the present invention, the following steps are performed before S2:
address data acquisition: obtaining original address data;
corpus generation: segmenting a number of the original address data into a corpus according to a defined normal form standard;
corpus learning: building the word segmentation model from the corpus by machine learning.
As a further improvement of the present invention, the machine learning method is a conditional random field.
As a further improvement of the present invention, the machine learning method is a support vector machine.
As a further improvement of the present invention, the machine learning method is a hidden Markov model.
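One common way to turn segmented addresses into training material for sequence models such as conditional random fields or hidden Markov models is to tag every character with its position inside a typed segment. The sketch below assumes that scheme; the patent does not specify a tag format, and `to_tagged_characters`, `to_segments` and the type names are hypothetical. The Chinese strings are an assumed sample address (Shangdi 10th Street, No. 10).

```python
def to_tagged_characters(segments):
    """Convert [(text, type), ...] into [(char, tag), ...] with B-/I- prefixes."""
    tagged = []
    for text, seg_type in segments:
        for i, ch in enumerate(text):
            prefix = "B" if i == 0 else "I"  # B = begins a segment, I = inside it
            tagged.append((ch, f"{prefix}-{seg_type}"))
    return tagged


def to_segments(tagged):
    """Inverse of to_tagged_characters: rebuild typed segments from character tags."""
    segments = []
    for ch, tag in tagged:
        prefix, seg_type = tag.split("-", 1)
        if prefix == "B" or not segments or segments[-1][1] != seg_type:
            segments.append([ch, seg_type])
        else:
            segments[-1][0] += ch
    return [(text, seg_type) for text, seg_type in segments]


# One manually segmented address becomes one training sequence.
sample = [("上地十街", "road"), ("10号", "house_number")]
corpus_line = to_tagged_characters(sample)
print(corpus_line)
```

A trained sequence model would predict such tags for a new address, and `to_segments` would turn the predicted tags back into typed segments.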
As a further improvement of the present invention, S3 specifically comprises the following steps:
address base establishing step: establishing a normal form address base with a tree structure;
address input step: receiving the normal form address;
address classification step: analyzing the normal form address and filing it into the normal form address base according to the tree structure.
As a further improvement of the present invention, the normal form address base has a number of branches, and the end of each branch has at least one leaf node.
As a further improvement of the present invention, the address classification step further comprises classifying the normal form address into at least one leaf node of the normal form address base.
As a further improvement of the present invention, the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer based on the logical hierarchy of addresses.
As a further improvement of the present invention, the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
As a further improvement of the present invention, the sub-address layer comprises at least one of road-class addresses, area-class addresses and landmark-class addresses.
As a further improvement of the present invention, a road-class address is used to define a specific address headed by a road.
As a further improvement of the present invention, an area-class address is used to define a specific address headed by a residential community.
As a further improvement of the present invention, a landmark-class address is used to define a specific location point.
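The tree-structured normal form address base described above can be sketched as nested nodes. This is an illustrative data structure only: `Node`, `file_address`, `ADMIN_LEVELS` and `SUB_ADDRESS_TYPES` are hypothetical names, and allowing an address to skip administrative levels is an assumption.

```python
from dataclasses import dataclass, field

ADMIN_LEVELS = (
    "province / autonomous region / municipality",
    "city / autonomous prefecture",
    "district / county",
    "township / town / subdistrict",
)
SUB_ADDRESS_TYPES = ("road", "area", "landmark")


@dataclass
class Node:
    name: str
    children: dict = field(default_factory=dict)

    def child(self, name):
        """Return the named child, creating it on first use."""
        return self.children.setdefault(name, Node(name))


def file_address(root, admin_path, sub_type, detail):
    """Walk the administrative layer, then file the detail under its sub-address type."""
    if len(admin_path) > len(ADMIN_LEVELS):
        raise ValueError("at most four administrative levels")
    if sub_type not in SUB_ADDRESS_TYPES:
        raise ValueError(f"unknown sub-address type: {sub_type}")
    node = root
    for name in admin_path:
        node = node.child(name)
    return node.child(sub_type).child(detail)  # the leaf node


root = Node("root")
leaf = file_address(root, ["Beijing", "Haidian District"], "road", "No. 10 Shangdi 10th Street")
print(leaf.name)
```

Filing the same address twice returns the same leaf, so the structure also deduplicates addresses that normalize to the same path.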
A method for constructing an address database according to another embodiment of the present invention comprises:
S1: obtaining original address data;
S2: segmenting and classifying the original address data with a word segmentation model to produce a candidate normal form address;
S3: filing the candidate normal form address into the normal form address database.
As a further improvement of the present invention, S2 comprises the following steps:
the word segmentation model segments the original address;
the candidate normal form address is produced from the resulting segments.
As a further improvement of the present invention, S3 comprises the following steps:
processing the candidate normal form address into a normal form address;
filing the normal form address into the normal form address database.
As a further improvement of the present invention, S1 comprises:
judging whether the original address data matches the format of a candidate normal form address;
if it matches, outputting the original address data directly as the candidate normal form address.
As a further improvement of the present invention, S1 comprises:
judging whether the original address data matches the format of a candidate normal form address;
if it does not match, proceeding to S2.
As a further improvement of the present invention, an address statistical analysis step follows S1: the address statistical analysis step performs statistical analysis on the original address data and produces the normal form address.
As a further improvement of the present invention, S1 comprises:
judging whether the original address data matches the format of a candidate normal form address;
if it does not match, proceeding to the address statistical analysis step.
As a further improvement of the present invention, the address statistical analysis step comprises:
identifying first address information that precedes the unknown address information;
identifying second address information that follows the unknown address information;
counting, in an address data resource library, the address type information that occurs between the first address information and the second address information, and calculating the probability with which each item of address type information occurs;
comparing the address type information with the highest probability against a preset threshold and, if the probability is higher than the threshold, combining that address type information with the first address information and the second address information to produce the candidate normal form address.
As a further improvement of the present invention, the address statistical analysis step comprises:
if the probability is lower than the threshold, proceeding to step S2.
As a further improvement of the present invention, the following steps are performed before S2:
address data acquisition: obtaining original address data;
corpus generation: segmenting a number of the original address data into a corpus according to a defined normal form standard;
corpus learning: building the word segmentation model from the corpus by machine learning.
As a further improvement of the present invention, the machine learning method is a conditional random field.
As a further improvement of the present invention, the machine learning method is a support vector machine.
As a further improvement of the present invention, the machine learning method is a hidden Markov model.
As a further improvement of the present invention, the following steps are performed before S3:
address base establishing step: establishing a normal form address base with a tree structure;
address input step: receiving the normal form address;
address classification step: analyzing the normal form address and filing it into the normal form address base according to the tree structure.
As a further improvement of the present invention, the normal form address base has a number of branches, and the end of each branch has at least one leaf node.
As a further improvement of the present invention, the address classification step further comprises classifying the normal form address into at least one leaf node of the normal form address base.
As a further improvement of the present invention, the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer based on the logical hierarchy of addresses.
As a further improvement of the present invention, the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
As a further improvement of the present invention, the sub-address layer comprises at least one of road-class addresses, area-class addresses and landmark-class addresses.
As a further improvement of the present invention, a road-class address is used to define a specific address headed by a road.
As a further improvement of the present invention, an area-class address is used to define a specific address headed by a residential community.
As a further improvement of the present invention, a landmark-class address is used to define a specific location point.
Correspondingly, an address database construction device according to one embodiment of the present invention comprises:
a raw data acquisition module, used to obtain original address data;
a word segmentation model module, used to segment and classify the original address data and produce a normal form address;
a normal form address generation module, used to file the normal form address into the normal form address database.
As a further improvement of the present invention, the original address information in the raw data acquisition module comprises text information and coordinate information.
As a further improvement of the present invention, the address database construction device further comprises an address statistical analysis module, used to perform statistical analysis on the original address data and produce the normal form address.
As a further improvement of the present invention, the address database construction device further comprises:
a corpus generation module, used to segment a number of the original address data into a corpus according to a defined normal form standard;
a corpus learning module, used to build the word segmentation model from the corpus by machine learning.
As a further improvement of the present invention, the machine learning method is a conditional random field.
As a further improvement of the present invention, the machine learning method is a support vector machine.
As a further improvement of the present invention, the machine learning method is a hidden Markov model.
As a further improvement of the present invention, the normal form address generation module further comprises:
an address base establishing unit, used to establish a normal form address base with a tree structure;
an address input unit, used to receive the candidate normal form address;
an address classification unit, used to analyze the candidate normal form address and file it into the normal form address base according to the tree structure.
As a further improvement of the present invention, the normal form address base has a number of branches, and the end of each branch has at least one leaf node.
As a further improvement of the present invention, the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer based on the logical hierarchy of addresses.
As a further improvement of the present invention, the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
As a further improvement of the present invention, the sub-address layer comprises at least one of road-class addresses, area-class addresses and landmark-class addresses.
An address database construction device according to another embodiment of the present invention comprises:
a raw data acquisition module, used to obtain original address data;
a word segmentation model module, in which a word segmentation model segments and classifies the original address data and produces a candidate normal form address;
a normal form address generation module, used to file the candidate normal form address into the normal form address database.
As a further improvement of the present invention, the original address information in the raw data acquisition module comprises text information and coordinate information.
As a further improvement of the present invention, the address database construction device further comprises an address statistical analysis module, used to perform statistical analysis on the original address data and produce the candidate normal form address.
As a further improvement of the present invention, the address database construction device further comprises:
a corpus generation module, used to segment a number of the original address data into a corpus according to a defined normal form standard;
a corpus learning module, used to build the word segmentation model from the corpus by machine learning.
As a further improvement of the present invention, the machine learning method is a conditional random field.
As a further improvement of the present invention, the machine learning method is a support vector machine.
As a further improvement of the present invention, the machine learning method is a hidden Markov model.
As a further improvement of the present invention, the normal form address generation module further comprises:
an address base establishing unit, used to establish a normal form address base with a tree structure;
an address input unit, used to receive the normal form address;
an address classification unit, used to analyze the normal form address and file it into the normal form address base according to the tree structure.
As a further improvement of the present invention, the normal form address base has a number of branches, and the end of each branch has at least one leaf node.
As a further improvement of the present invention, the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer based on the logical hierarchy of addresses.
As a further improvement of the present invention, the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
As a further improvement of the present invention, the sub-address layer comprises at least one of road-class addresses, area-class addresses and landmark-class addresses.
The beneficial effects of the present invention are as follows: a word segmentation model is used to segment and classify each address to be classified according to its address attributes, and the result is stored in the normal form address database, so that the address database is constructed with higher efficiency and higher accuracy.
[description of drawings]
Fig. 1 is a flowchart of the address database construction method according to one embodiment of the present invention.
Fig. 2 is a flowchart of the address database construction method according to another embodiment of the present invention.
Fig. 3 is a structural diagram of the address database construction device according to one embodiment of the present invention.
Fig. 4 is a flowchart of the address database construction method according to one embodiment of the present invention.
Fig. 5 is a flowchart of the address database construction method according to another embodiment of the present invention.
Fig. 6 is a structural diagram of the address database construction device according to another embodiment of the present invention.
Fig. 7 is a structural diagram of the normal form address generation module of the present invention.
Fig. 8 is a flowchart of the normal form address generation method of the present invention.
Fig. 9 is a diagram of the normal form address base architecture of the address base establishing unit of the present invention.
Fig. 10 is a flowchart of constructing the word segmentation model of the present invention.
Fig. 11 is a diagram of the module structure for constructing the word segmentation model of the present invention.
[embodiment]
For a clearer understanding of the technical features, objects and effects of the invention, specific embodiments of the present invention are now described with reference to the drawings, in which identical reference labels denote identical steps or parts. In this document, "schematic" means "serving as an example, instance or illustration"; any illustration or embodiment described herein as "schematic" should not be construed as a preferred or more advantageous technical solution.
Referring first to Fig. 1, the address database construction method according to one embodiment of the present invention comprises the following steps:
S1: obtaining original address data. The original address data comprises the text information and the coordinate information of an address. The text information is any specific address that can represent at least one of a road-class address, an area-class address and a landmark-class address; the coordinate information is the specific coordinate point of the original address data. For example, if the original address data is "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)", then "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing" is its text information and (x, y) is its coordinate information.
S2: the word segmentation model segments the original address data and produces a normal form address. How the word segmentation model is built, and what segmentation rules it learns, is disclosed later in this specification.
S3: filing the normal form address into the normal form address database. Note that the same original address data may correspond to several storage addresses when it is stored in the normal form address database. For example, the original address data "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)" yields "Haidian District, Beijing", "No. 10 Shangdi 10th Street" and "Baidu Building" after segmentation. When stored in the database, it may then have two storage addresses: the first is "No. 10 Shangdi 10th Street, Haidian District, Beijing" and the second is "Baidu Building, Haidian District, Beijing". These follow the classification and storage rules administrative region + road-class address and administrative region + landmark-class address. In this example the administrative region is Haidian District, Beijing; the road-class address is No. 10 Shangdi 10th Street; and the landmark-class address is Baidu Building. The storage scheme is disclosed in detail later in this specification.
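The rule that one original address may yield several storage addresses can be sketched as follows. This is an illustration under assumptions rather than the patent's storage scheme: `RawAddress`, `storage_addresses`, `store`, the tuple-shaped keys and the segment type names are hypothetical, and the coordinate is a placeholder standing in for (x, y).

```python
from dataclasses import dataclass

SUB_ADDRESS_TYPES = ("road", "area", "landmark")


@dataclass(frozen=True)
class RawAddress:
    text: str          # text information of the address
    coordinate: tuple  # coordinate information (x, y)


def storage_addresses(segments):
    """Build one key per sub-address segment: administrative region + sub-address."""
    admin = tuple(text for text, kind in segments if kind == "admin")
    return [admin + (kind, text) for text, kind in segments if kind in SUB_ADDRESS_TYPES]


def store(database, raw, segments):
    """File the coordinate of one original address under every storage address."""
    for key in storage_addresses(segments):
        database[key] = raw.coordinate
    return database


raw = RawAddress(
    "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing",
    (1.0, 2.0),  # placeholder for (x, y)
)
segments = [
    ("Haidian District, Beijing", "admin"),
    ("No. 10 Shangdi 10th Street", "road"),
    ("Baidu Building", "landmark"),
]
database = store({}, raw, segments)
for key in database:
    print(key)
```

Both keys map to the same coordinate, so a query by street address and a query by landmark name resolve to the same point.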
Referring next to Fig. 2, the address database construction method according to another embodiment of the present invention comprises the following steps:
S1': obtaining original address data. The original address data comprises the text information and the coordinate information of an address. The text information is any specific address that can represent at least one of a road-class address, an area-class address and a landmark-class address; the coordinate information is the specific coordinate point of the original address data. For example, if the original address data is "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)", then "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing" is its text information and (x, y) is its coordinate information.
S2': the word segmentation model segments the original address data and produces a candidate normal form address. The candidate normal form address is processed in the subsequent step S3' and then filed into the normal form address database. How the word segmentation model is built, and what segmentation rules it learns, is disclosed later in this specification.
S3': processing the candidate normal form address and filing it into the normal form address database. Here, processing means matching the candidate normal form address against the tree diagram of the normal form address database and adjusting its format so that it fully conforms to a branch or leaf node in that tree diagram. Note that the same original address data may correspond to several storage addresses when it is stored in the normal form address database. For example, the original address data "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)" yields "Haidian District, Beijing", "No. 10 Shangdi 10th Street" and "Baidu Building" after segmentation. When stored in the database, it may then have two storage addresses: the first is "No. 10 Shangdi 10th Street, Haidian District, Beijing" and the second is "Baidu Building, Haidian District, Beijing". These follow the classification and storage rules administrative region + road-class address and administrative region + landmark-class address. In this example the administrative region is Haidian District, Beijing; the road-class address is No. 10 Shangdi 10th Street; and the landmark-class address is Baidu Building. The storage scheme is disclosed in detail later in this specification.
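The processing in S3', adjusting a candidate normal form address until it conforms to a branch of the tree, could look like the sketch below. The patent does not say how the adjustment is done; matching each component to the unique node name that starts with it is an assumption made here for illustration, and `TREE` and `conform` are hypothetical names.

```python
# Hypothetical fragment of the normal form address tree, as nested dicts.
TREE = {
    "Beijing City": {
        "Haidian District": {},
        "Chaoyang District": {},
    },
}


def conform(candidate, tree):
    """Map candidate components onto node names of the tree, level by level.

    Returns the conforming path, or None if a component matches no node
    or matches more than one node at its level.
    """
    path, level = [], tree
    for part in candidate:
        matches = [name for name in level if name.startswith(part)]
        if len(matches) != 1:
            return None
        path.append(matches[0])
        level = level[matches[0]]
    return path


print(conform(["Beijing", "Haidian"], TREE))
```

A candidate such as `["Beijing", "Pudong"]` has no matching node at the district level, so `conform` returns `None` and the address would need other handling.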
Correspondingly, referring to Fig. 3, the address database construction device of one embodiment of the present invention comprises a raw data acquisition module 1, a participle model module 2, and a normal form address generation module 4.
The raw data acquisition module 1 obtains original address data containing a large amount of address information. The original address data comprise the text information and the coordinate information of an address. The text information is any specific address that can represent at least one of a road class address, a regional class address, and a landmark class address; the coordinate information is the specific coordinate point of the original address data. For example, if the original address data are "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)", then "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing" is the text information of the original address data and (x, y) is its coordinate information.
The participle model module 2 segments the original address data into words and produces the normal form address or the candidate normal form address. How the participle model is built, and which word segmentation rules it learns, is disclosed later in this specification.
The normal form address generation module 4 classifies the normal form address into the normal form address database. Note that in another embodiment of the present invention this module receives a candidate normal form address, which it must process before storing it in the normal form address database. Here, processing means comparing the candidate normal form address against the tree diagram of the normal form address database and adjusting its format so that it fully matches a branch or leaf node of that tree. The "normal form address" refers to address information, obtained through the raw data acquisition module 1, the participle model module 2, and the normal form address generation module 4, that conforms to the normal form database format. According to the format requirements described for Fig. 9, this address information is classified into the address type under the corresponding sub-address layer; this is described in detail later in the text on Fig. 9. Note that the same original address data may yield several storage addresses when stored in the address database. For example, the original address data "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)" yield "Haidian District, Beijing", "No. 10 Shangdi 10th Street", and "Baidu Building" after word segmentation. When stored in the database, there may then be two storage addresses: first, "No. 10 Shangdi 10th Street, Haidian District, Beijing"; second, "Baidu Building, Haidian District, Beijing". These are classified and stored according to the rules "administrative region + road class address" and "administrative region + landmark class address". In this example, the administrative region is Haidian District, Beijing; the road class address is No. 10 Shangdi 10th Street; and the landmark class address is Baidu Building.
Referring to Fig. 4, in one embodiment of the present invention, the construction method of the address database can be further expanded from the above steps into the following detailed workflow:
Step S10: Obtain the original address data. The original address data comprise the text information and the coordinate information of an address. The text information is any specific address that can represent at least one of a road class address, a regional class address, and a landmark class address; the coordinate information is the specific coordinate point of the original address data. For example, if the original address data are "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)", then "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing" is the text information of the original address data and (x, y) is its coordinate information.
Step S11: For a specific piece of address information, judge whether it meets the requirements of a normal form address. If it does, proceed directly to step S16; if it does not, proceed to step S12.
Step S12: Address statistical analysis step. The large amount of address information is statistically analyzed against an existing address data resource library, and a normal form address is produced based on the frequency with which a given piece of address information occurs among all the address information. This step is needed because the original address information does not necessarily consist entirely of complete normal form addresses that can be used directly in step S16. Very commonly, original address information obtained through various channels (for example, internet data collection) is incomplete. Such incomplete address information does not meet the format requirements of the normal form address of step S16 and must be processed further by statistical analysis. The statistical analysis method is: identify the first address information, which precedes the unknown address information; identify the second address information, which follows the unknown address information; count, in the address data resource library, the address type information that occurs between the first address information and the second address information, and calculate the probability with which each address type occurs; compare the address type information with the highest probability against a preset threshold. For example, if the original address information is "No. 13 Xishi Hutong, Zhongguancun Street, Haidian District, Beijing", the address is first identified from front to back. "Haidian District, Beijing" and "Zhongguancun Street" can be identified through the address data resource library as an administrative region address and a road class address respectively, but "Xishi Hutong" cannot be identified. Identification is then carried out in reverse, from back to front, and "No. 13" is identified as a house number address. Statistics are then gathered in the address data resource library on which address types appear between a road class address and a house number class address. If the statistics show that the hutong (alley) class address has the highest probability, that probability is compared with the preset threshold, and the method proceeds to step S13.
Step S13: If the probability is higher than the preset threshold, the address information is treated as a normal form address and the method proceeds directly to step S16; if the probability is lower than the preset threshold, the address information cannot be used as a normal form address and the method proceeds to step S14.
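The statistical analysis of steps S12 and S13 can be sketched as below. This is an illustrative sketch under stated assumptions: the resource library is modeled as a small hypothetical list of address-type sequences, the label names and the threshold values are invented for the example, and a probability exactly equal to the threshold is treated as not passing, a case the source does not specify.

```python
from collections import Counter

# Hypothetical address data resource library: each entry is the sequence of
# address types of one already-known address.
RESOURCE_LIBRARY = [
    ["region", "road", "hutong", "house_number"],
    ["region", "road", "hutong", "house_number"],
    ["region", "road", "hutong", "house_number"],
    ["region", "road", "community", "house_number"],
    ["region", "road", "house_number"],
]


def infer_middle_type(before, after, library):
    """Return (most likely type, probability) of one unknown segment that
    sits between a `before`-typed and an `after`-typed segment."""
    counts = Counter()
    for labels in library:
        for i in range(len(labels) - 2):
            if labels[i] == before and labels[i + 2] == after:
                counts[labels[i + 1]] += 1
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    best, n = counts.most_common(1)[0]
    return best, n / total


def decide(before, after, library, threshold):
    """Step S13: accept the inferred type only above the threshold."""
    best, probability = infer_middle_type(before, after, library)
    if best is not None and probability > threshold:
        return "normal_form", best          # continue with step S16
    return "needs_segmentation", None       # continue with step S14


print(infer_middle_type("road", "house_number", RESOURCE_LIBRARY))
print(decide("road", "house_number", RESOURCE_LIBRARY, threshold=0.6))
```

In the "Xishi Hutong" example, `before` corresponds to the road class address "Zhongguancun Street" and `after` to the house number "No. 13".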
Step S14: Participle model segmentation step. The address information that still cannot be handled after step S13 is analyzed, and a normal form address is produced based on a predefined participle model. In one embodiment of the invention, the participle model is produced by a conditional random field (CRF) method that learns from a corpus. Word segmentation by this participle model produces the normal form address and can simultaneously output the word segments and the attribute labeling information of the normal form address.
Step S16: Normal form address generation step. The normal form address is classified and filed in the corresponding normal form address database. The "normal form address" refers to address information, obtained through step S11, step S13, or step S14, that conforms to the normal form database format. According to the format requirements described for Fig. 9, this address information is classified into the address type under the corresponding sub-address layer; this is described in detail later in the text on Fig. 9.
Note that in another embodiment of the present invention, if the requirements are not met in step S11, the method may also proceed directly to step S14. The specific judgment and processing are the same as in the steps above and are not repeated here.
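The control flow of steps S11 through S14 (the output of each branch then goes to step S16), including the variant that skips the statistical analysis, can be sketched as a small dispatcher. All names here are hypothetical: the three callables stand in for the format check, the statistical analysis, and the participle model, none of which the source specifies as code, and the default threshold is an arbitrary placeholder.

```python
def build_normal_form(address, is_normal_form, statistical_analysis,
                      segment_with_model, threshold=0.6, skip_statistics=False):
    """Return (normal form address, name of the step that produced it)."""
    # Step S11: already a normal form address?
    if is_normal_form(address):
        return address, "S11"
    if not skip_statistics:
        # Step S12: statistical analysis yields a candidate and its probability.
        candidate, probability = statistical_analysis(address)
        # Step S13: accept only above the preset threshold.
        if candidate is not None and probability > threshold:
            return candidate, "S13"
    # Step S14: fall back to the participle model.
    return segment_with_model(address), "S14"


# Usage with trivial stand-in callables.
result = build_normal_form(
    "raw address",
    is_normal_form=lambda a: False,
    statistical_analysis=lambda a: ("fixed address", 0.9),
    segment_with_model=lambda a: "segmented " + a,
)
print(result)
```

Passing `skip_statistics=True` models the alternative embodiment in which an address that fails step S11 goes straight to step S14.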
Referring to Fig. 5, in another embodiment of the present invention, the construction method of the address database can be further expanded from the above steps into the following detailed workflow:
Step S10': Obtain the original address data. The original address data comprise the text information and the coordinate information of an address. The text information is any specific address that can represent at least one of a road class address, a regional class address, and a landmark class address; the coordinate information is the specific coordinate point of the original address data. For example, if the original address data are "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)", then "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing" is the text information of the original address data and (x, y) is its coordinate information.
Step S11': For a specific piece of address information, judge whether it meets the requirements of a candidate normal form address. If it does, proceed directly to step S15'; if it does not, proceed to step S12'.
Step S12': Address statistical analysis step. The large amount of address information is statistically analyzed against an existing address data resource library, and a candidate normal form address is produced based on the frequency with which a given piece of address information occurs among all the address information. This step is needed because the original address information does not necessarily consist entirely of complete candidate normal form addresses that can be used directly in step S15'. Very commonly, original address information obtained through various channels (for example, internet data collection) is incomplete. Such incomplete address information does not meet the format requirements of the candidate normal form address of step S15' and must be processed further by statistical analysis. The statistical analysis method is: identify the first address information, which precedes the unknown address information; identify the second address information, which follows the unknown address information; count, in the address data resource library, the address type information that occurs between the first address information and the second address information, and calculate the probability with which each address type occurs; compare the address type information with the highest probability against a preset threshold. For example, if the original address information is "No. 13 Xishi Hutong, Zhongguancun Street, Haidian District, Beijing", the address is first identified from front to back. "Haidian District, Beijing" and "Zhongguancun Street" can be identified through the address data resource library as an administrative region address and a road class address respectively, but "Xishi Hutong" cannot be identified. Identification is then carried out in reverse, from back to front, and "No. 13" is identified as a house number address. Statistics are then gathered in the address data resource library on which address types appear between a road class address and a house number class address. If the statistics show that the hutong (alley) class address has the highest probability, that probability is compared with the preset threshold, and the method proceeds to step S13'.
Step S13': If the probability is higher than the preset threshold, the address information is treated as a candidate normal form address and the method proceeds directly to step S15'; if the probability is lower than the preset threshold, the address information cannot be used as a candidate normal form address and the method proceeds to step S14'.
Step S14': Participle model segmentation step. The address information that still cannot be handled after step S13' is analyzed, and a candidate normal form address is produced based on a predefined participle model. In one embodiment of the invention, the participle model is produced by a conditional random field (CRF) method that learns from a corpus. Word segmentation by this participle model produces the candidate normal form address and can simultaneously output the word segments and the attribute labeling information of the candidate normal form address.
Step S15': Collect the candidate normal form address information produced by step S11', step S13', and step S14'. Note that the same original address data may produce several candidate normal form addresses, each comprising text information and coordinate information. For example, the complete original address data "Block B 213-406, Hailong Building, No. 3 Zhongguancun Street, Haidian District, Beijing (x, y)" may, after processing, output two candidate normal form addresses. The first is a road class candidate normal form address, comprising the text information "No. 3 Zhongguancun Street, Haidian District, Beijing" and the coordinate information (x, y). The second is a landmark class candidate normal form address, comprising the text information "Hailong Building, Haidian District, Beijing" and the coordinate information (x, y). The coordinate (x, y) is the same in both, which indicates that the road class and landmark class candidate normal form addresses are in substance the same specific address.
Step S16': Normal form address generation step. The candidate normal form address is classified and filed in the corresponding normal form address database. The "candidate normal form address" refers to address information, obtained through step S11', step S13', or step S14', that conforms to the normal form database format. According to the format requirements described for Fig. 9, this address information is classified into the address type under the corresponding sub-address layer; this is described in detail later in the text on Fig. 9.
Note that in another embodiment of the present invention, if the requirements are not met in step S11', the method may also proceed directly to step S14'. The specific judgment and processing are the same as in the steps above and are not repeated here.
Correspondingly, referring to Fig. 6, the address database construction device of the present invention may, in expanded form, comprise: a raw data acquisition module 10, an address statistical analysis module 11, a participle model module 12, and a normal form address generation module 13.
The raw data acquisition module 10 obtains original address data containing a large amount of address information. The original address data comprise the text information and the coordinate information of an address. The text information is any specific address that can represent at least one of a road class address, a regional class address, and a landmark class address; the coordinate information is the specific coordinate point of the original address data. For example, if the original address data are "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing + (x, y)", then "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing" is the text information of the original address data and (x, y) is its coordinate information.
The address statistical analysis module 11 comprises a statistical analysis unit and an address data resource library unit (not shown). It statistically analyzes the large amount of address information against an existing address data resource library and, based on the frequency with which a given piece of address information occurs among all the address information, produces a normal form address or a candidate normal form address. This module is needed because the original address information does not necessarily consist entirely of complete candidate normal form addresses or normal form addresses that can be used directly. Very commonly, original address information obtained through various channels (for example, internet data collection) is incomplete. Such incomplete address information does not meet the format requirements of a candidate normal form address or normal form address and must be processed further by the statistical analysis module, as follows: identify the first address information, which precedes the unknown address information; identify the second address information, which follows the unknown address information; count, in the address data resource library, the address type information that occurs between the first address information and the second address information, and calculate the probability with which each address type occurs; compare the address type information with the highest probability against a preset threshold. For example, if the original address information is "No. 13 Xishi Hutong, Zhongguancun Street, Haidian District, Beijing", the address is first identified from front to back. "Haidian District, Beijing" and "Zhongguancun Street" can be identified through the address data resource library as an administrative region address and a road class address respectively, but "Xishi Hutong" cannot be identified. Identification is then carried out in reverse, from back to front, and "No. 13" is identified as a house number address. Statistics are then gathered in the address data resource library on which address types appear between a road class address and a house number class address. If the statistics show that the hutong (alley) class address has the highest probability, that probability is compared with the preset threshold to judge whether this address information can serve as a candidate normal form address or normal form address.
The participle model module 12 analyzes the address information that still cannot be handled by the address statistical analysis module 11 and, based on a predefined participle model, produces a candidate normal form address or normal form address. Here, "address information that cannot be handled" means address information whose probability, after statistical analysis by the address statistical analysis module 11, is lower than the preset threshold. In one embodiment of the invention, the "predefined participle model" is produced by a conditional random field (CRF) method that learns from a corpus. Word segmentation by this participle model can simultaneously output the word segments and the attribute labeling information of the normal form address or candidate normal form address. For the working principle of CRF, see the introduction in Baidu Baike (http://baike.baidu.com/view/2510459.htm); it is not repeated here. Note that in other embodiments of the present invention, the address learning model may also be built with a support vector machine (SVM) or a hidden Markov model (HMM). The principles of these methods are well established in the industry and are not repeated here.
The normal form address generation module 13 assembles the word segmentation result into a candidate normal form address or normal form address and stores it in the address database. The "candidate normal form address" or "normal form address" refers to address information, obtained through the raw data acquisition module 10, the address statistical analysis module 11, the participle model module 12, and the normal form address generation module 13, that conforms to the normal form database format. According to the format requirements described for Fig. 9, this address information is classified into the address type under the corresponding sub-address layer; this is described in detail later in the text on Fig. 9.
Referring to Fig. 7, the normal form address generation module of the present invention comprises an address base establishing unit 100, an address receiving unit 101, and an address classification unit 102.
The address base establishing unit 100 establishes a tree-structured standard normal form address base. This tree-shaped standard normal form address base has several branches, and the end of each branch has at least one leaf node. The specific structure of the standard normal form address base is described in detail in a later paragraph with reference to Fig. 9.
The address receiving unit 101 receives a normal form address or a candidate normal form address. Once the address base establishing unit 100 has established the classification standard of the standard normal form addresses, any candidate normal form address or normal form address that is received by the address receiving unit 101 and input into the standard normal form address base can, in theory, find a corresponding storage position. Determining that storage position is the task of the address classification unit 102.
The address classification unit 102 analyzes the normal form address or candidate normal form address and classifies it into a branch of the standard normal form address base.
Correspondingly, referring to Fig. 8, the normal form address generating method corresponding to the normal form address generation module can be decomposed into: an address base establishing step S100, an address input step S101, and an address classification step S102.
The address base establishing step S100 establishes a tree-structured standard normal form address base. This tree-shaped standard normal form address base has several branches, and the end of each branch has at least one leaf node. The specific structure of the standard normal form address base is described in detail in a later paragraph with reference to Fig. 9 and is not repeated here.
The address input step S101 receives a normal form address or a candidate normal form address. Once the address base establishing unit 100 has established the classification standard of the standard normal form addresses, any candidate normal form address or normal form address that is received by the address receiving unit 101 and input into the standard normal form address base can, in theory, find a corresponding storage position. Determining that storage position is the task of the address classification unit 102.
The address classification step S102 analyzes the normal form address or candidate normal form address and classifies it into a branch of the standard normal form address base.
Referring to Fig. 9, to describe more clearly the specific structure of the standard normal form address base in the address base establishing unit 100, the following takes as an example the establishment of a standard normal form address base for an electronic map within the administrative regions of the People's Republic of China. In general, China's administrative divisions comprise four levels: the first level is province / autonomous region / municipality directly under the central government; the second level is city / autonomous prefecture; the third level is district / county; the fourth level is township / town / subdistrict. These four levels are relatively fixed, and their number and names can easily be compiled from the place names of each region. Therefore, in the standard normal form address base, these four levels are merged and collectively referred to as the first layer of the tree structure of the standard normal form address base, namely the administrative region layer 90. In Fig. 9 they are labeled accordingly: the first level is province / autonomous region / municipality 91; the second level is city / autonomous prefecture 92; the third level is district / county 93; the fourth level is township / town / subdistrict 94. The specific address names below the fourth level are extremely numerous and varied, yet they can be reduced to three address types: road class address 81, regional class address 82, and landmark class address 83. These three classes are collectively referred to as the second layer of the tree structure of the standard normal form address base, namely the sub-address layer 80. Of course, the sub-address layer 80 may also include only one or two of these three address types. The road class address 81 defines specific addresses headed by a road, for example: No. b, Road a; Lane b, Road a. The regional class address 82 defines specific addresses headed by a residential community, for example: Block b, Community a; Phase b, Community a. The landmark class address 83 defines a specific location point, for example: Building a, Park b. Note that the above division into levels is based on only one embodiment of the present invention, namely the division of address levels within the administrative regions of the People's Republic of China. The division for other countries or regions may differ, provided that it is based on logical address levels, which can be understood as narrowing step by step from a larger address range to a smaller one.
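The two-layer tree described for Fig. 9 can be sketched with nested dictionaries. This is a minimal illustration under assumptions: the node layout, the `new_node` and `classify` helpers, the type labels, and the placeholder coordinates are hypothetical and are not taken from the patent.

```python
# The three address types of the sub-address layer 80.
SUB_ADDRESS_TYPES = ("road", "regional", "landmark")


def new_node():
    """One administrative region node with its own sub-address layer."""
    return {"children": {}, "sub": {t: [] for t in SUB_ADDRESS_TYPES}}


def classify(tree, admin_path, addr_type, text, coord):
    """Walk (or create) the administrative branch, then file the address
    under the matching address type of that branch."""
    if addr_type not in SUB_ADDRESS_TYPES:
        raise ValueError(f"unknown address type: {addr_type}")
    node = tree
    for name in admin_path:  # from the larger region down to the smaller one
        node = node["children"].setdefault(name, new_node())
    node["sub"][addr_type].append((text, coord))
    return node


tree = new_node()
path = ["Beijing", "Haidian District"]
classify(tree, path, "road", "No. 10 Shangdi 10th Street", (1.0, 2.0))
classify(tree, path, "landmark", "Baidu Building", (1.0, 2.0))

haidian = tree["children"]["Beijing"]["children"]["Haidian District"]
print(haidian["sub"]["road"])
print(haidian["sub"]["landmark"])
```

Walking the path from the largest region to the smallest mirrors the stated principle of narrowing step by step from a larger address range to a smaller one.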
Referring to Fig. 10, the participle model of the present invention is obtained by the following method:
S1000: Obtain the original address data;
S1001: Segment a number of original address data into a corpus according to the established normal form standard, where the "normal form standard" is as described above for Fig. 9.
S1002: Based on the corpus, build the participle model by machine learning. The machine learning method may be a conditional random field (CRF) method that learns from the corpus to produce the "predefined participle model". Word segmentation by this participle model can simultaneously output the word segments and the attribute labeling information of the normal form address or candidate normal form address. For the working principle of CRF, see the introduction in Baidu Baike (http://baike.baidu.com/view/2510459.htm); it is not repeated here. Note that in other embodiments of the present invention, the address learning model may also be built with a support vector machine (SVM) or a hidden Markov model (HMM). The principles of these methods are well established in the industry and are not repeated here.
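As a rough illustration of step S1001, the sketch below turns one manually segmented address into character-level training labels of the kind commonly fed to sequence models such as CRF or HMM. The B/I tagging scheme, the label names, and the `to_bio` helper are assumptions made for this example; the source does not state which labeling scheme is used. The Chinese string is an assumed back-translation of the Baidu Building example.

```python
def to_bio(segments):
    """Convert [(segment text, address type), ...] into [(char, tag), ...],
    where the first character of a segment is tagged B-<type> and the
    remaining characters are tagged I-<type>."""
    tagged = []
    for text, label in segments:
        for i, ch in enumerate(text):
            prefix = "B" if i == 0 else "I"
            tagged.append((ch, f"{prefix}-{label}"))
    return tagged


# One corpus entry, segmented according to the normal form standard of Fig. 9
# (administrative region, road class address, landmark class address).
sample = [
    ("北京市海淀区", "REGION"),
    ("上地十街10号", "ROAD"),
    ("百度大厦", "LANDMARK"),
]
corpus = to_bio(sample)
for ch, tag in corpus:
    print(ch, tag)
```

A model trained on such sequences predicts one tag per character; segment boundaries and address types, that is, the word segments and their attribute labels, can then be read off the predicted tags.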
Correspondingly, referring to Fig. 11, constructing the participle model of the present invention involves the following modules:
Address data acquisition module 1000: obtains the original address data;
Corpus generation module 1001: segments a number of original address data into a corpus according to the established normal form standard, where the "normal form standard" is as described above for Fig. 9.
Corpus learning module 1002: based on the corpus, builds the participle model by machine learning. The machine learning method may be a conditional random field (CRF) method that learns from the corpus to produce the "predefined participle model". Word segmentation by this participle model can simultaneously output the word segments and the attribute labeling information of the normal form address or candidate normal form address. For the working principle of CRF, see the introduction in Baidu Baike (http://baike.baidu.com/view/2510459.htm); it is not repeated here. Note that in other embodiments of the present invention, the address learning model may also be built with a support vector machine (SVM) or a hidden Markov model (HMM). The principles of these methods are well established in the industry and are not repeated here.
From the above description it can be concluded that using the participle model to segment the addresses to be classified according to their address attributes, and storing the result in the standard normal form address database, makes the address database construction of the present invention both more efficient and more accurate.
It should be understood that although this specification is described by embodiment, not every embodiment contains only one independent technical solution. This manner of description is adopted only for clarity. Those skilled in the art should treat the specification as a whole; the technical solutions of the embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its scope of protection. All equivalent embodiments or changes that do not depart from the technical spirit of the present invention shall be included within its scope of protection.

Claims (67)

1. A method for constructing a normal form address database, characterized in that the method comprises:
S1: obtaining original address data;
S2: classifying, by a participle model, the original address data and producing a normal form address;
S3: classifying the normal form address into the normal form address database.
2. the method for claim 1 is characterized in that, described S2 may further comprise the steps:
Described participle model carries out participle to described original address;
Produce described normal form address by described participle.
3. the method for claim 1 is characterized in that, described S1 comprises:
Judge described original address data whether with the format match of normal form address;
If coupling is then directly exported described original address data as the normal form address.
4. the method for claim 1 is characterized in that, described S1 comprises:
Judge described original address data whether with the format match of normal form address;
If do not match, then enter S2.
5. the method for claim 1 is characterized in that, also comprises statistical study step in address behind described S1: described address statistical study step is carried out statistical study to the original address data, produces the normal form address.
6. method as claimed in claim 5 is characterized in that, described S1 comprises:
Judge described original address data whether with the format match of normal form address;
If do not match, then enter address statistical study step.
7. The method of claim 5, characterized in that the address statistical analysis step comprises:
identifying first address information preceding the unknown address information;
identifying second address information following the unknown address information;
counting, in an address data resource library, the address style information that occurs between the first address information and the second address information, and calculating the probability with which each address style information occurs;
comparing the address style information with the highest probability against a preset threshold and, if the probability is higher than the threshold, combining that address style information with the first address information and the second address information to produce the normal form address.
8. The method of claim 7, characterized in that the address statistical analysis step comprises:
if the probability is lower than the threshold, proceeding to step S2.
9. the method for claim 1 is characterized in that, and is before described S2, further comprising the steps of:
Address date obtains: obtain the original address data;
Generate language material: some described original address data are become language material according to the normal form standard participle of formulating;
Study language material:, make up described participle model by the machine learning mode based on described language material.
10. method as claimed in claim 9 is characterized in that, described machine learning mode is the condition random field type.
11. method as claimed in claim 9 is characterized in that, described machine learning mode is the support vector machine mode.
12. method as claimed in claim 9 is characterized in that, described machine learning mode is a hidden Markov model.
13. the method for claim 1 is characterized in that, described S3 specifically may further comprise the steps:
Address base is set up step: the normal form address base of setting up a tree structure;
Address input step: receive described normal form address;
Address sort step: analyze described normal form address, and described normal form address is sorted out to described normal form address base according to described tree structure.
14. method as claimed in claim 13 is characterized in that, described normal form address base has some branches, and the end of each branch has at least one leaf node.
15. method as claimed in claim 14 is characterized in that, described address sort step also comprises described normal form address sort in the described standard normal form address base at least one leaf node.
16. The method of claim 13, characterized in that the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer, both based on the logical hierarchy of addresses.
17. The method of claim 16, characterized in that the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
18. The method of claim 16, characterized in that the sub-address layer comprises at least one of a road-class address, an area-class address and a landmark-class address.
19. The method of claim 18, characterized in that the road-class address is used to define a specific address headed by a road.
20. The method of claim 18, characterized in that the area-class address is used to define a specific address headed by a residential community.
21. The method of claim 18, characterized in that the landmark-class address is used to define a specific location point.
22. A method for constructing a normal form address database, characterized in that the method comprises:
S1, obtaining original address data;
S2, classifying the original address data with a word segmentation model and producing a candidate normal form address;
S3, sorting the candidate normal form address into the normal form address database.
23. The method of claim 22, characterized in that S2 comprises the following steps:
segmenting the original address with the word segmentation model;
producing the candidate normal form address from the resulting segments.
24. The method of claim 22 or 23, characterized in that S3 comprises the following steps:
processing the candidate normal form address into a normal form address;
sorting the normal form address into the normal form address database.
25. The method of claim 22, characterized in that S1 comprises:
determining whether the original address data matches the format of a candidate normal form address;
if it matches, outputting the original address data directly as the candidate normal form address.
26. The method of claim 22, characterized in that S1 comprises:
determining whether the original address data matches the format of a candidate normal form address;
if it does not match, proceeding to S2.
27. The method of claim 22, characterized in that an address statistical analysis step follows S1: the address statistical analysis step performs statistical analysis on the original address data and produces a normal form address.
28. The method of claim 27, characterized in that S1 comprises:
determining whether the original address data matches the format of a candidate normal form address;
if it does not match, proceeding to the address statistical analysis step.
29. The method of claim 27, characterized in that the address statistical analysis step comprises:
identifying first address information preceding the unknown address information;
identifying second address information following the unknown address information;
counting, in an address data resource library, the address style information that occurs between the first address information and the second address information, and calculating the probability with which each address style information occurs;
comparing the address style information with the highest probability against a preset threshold and, if the probability is higher than the threshold, combining that address style information with the first address information and the second address information to produce the candidate normal form address.
30. The method of claim 29, characterized in that the address statistical analysis step comprises:
if the probability is lower than the threshold, proceeding to step S2.
31. The method of claim 22, characterized in that, before S2, the method further comprises the following steps:
address data acquisition: obtaining original address data;
corpus generation: segmenting a number of the original address data into a corpus according to an established normal form standard;
corpus learning: building the word segmentation model from the corpus by machine learning.
32. The method of claim 31, characterized in that the machine learning approach is a conditional random field.
33. The method of claim 31, characterized in that the machine learning approach is a support vector machine.
34. The method of claim 31, characterized in that the machine learning approach is a hidden Markov model.
35. The method of claim 22, characterized in that, before S3, the method further comprises the following steps:
address base building step: building a normal form address base with a tree structure;
address input step: receiving the normal form address;
address sorting step: analyzing the normal form address and sorting it into the normal form address base according to the tree structure.
36. The method of claim 35, characterized in that the normal form address base has a number of branches, and the end of each branch has at least one leaf node.
37. The method of claim 36, characterized in that the address sorting step further comprises sorting the normal form address into at least one leaf node of the standard normal form address base.
38. The method of claim 35, characterized in that the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer, both based on the logical hierarchy of addresses.
39. The method of claim 38, characterized in that the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
40. The method of claim 38, characterized in that the sub-address layer comprises at least one of a road-class address, an area-class address and a landmark-class address.
41. The method of claim 40, characterized in that the road-class address is used to define a specific address headed by a road.
42. The method of claim 40, characterized in that the area-class address is used to define a specific address headed by a residential community.
43. The method of claim 40, characterized in that the landmark-class address is used to define a specific location point.
44. An address database construction device, characterized in that the device comprises:
a raw data acquisition module, configured to obtain original address data;
a word segmentation model module, configured to classify the original address data and produce a normal form address;
a normal form address generation module, configured to sort the normal form address into a normal form address database.
45. The device of claim 44, characterized in that the original address information in the raw data acquisition module comprises text information and coordinate information.
46. The device of claim 44, characterized in that the address database construction device further comprises an address statistical analysis module, configured to perform statistical analysis on the original address data and produce the normal form address.
47. The device of claim 44, characterized in that the address database construction device further comprises:
a corpus generation module, configured to segment a number of the original address data into a corpus according to an established normal form standard;
a corpus learning module, configured to build the word segmentation model from the corpus by machine learning.
48. The device of claim 47, characterized in that the machine learning approach is a conditional random field.
49. The device of claim 47, characterized in that the machine learning approach is a support vector machine.
50. The device of claim 47, characterized in that the machine learning approach is a hidden Markov model.
51. The device of claim 44, characterized in that the normal form address generation module comprises:
an address base building unit, configured to build a normal form address base with a tree structure;
an address input unit, configured to receive the normal form address;
an address sorting unit, configured to analyze the normal form address and sort it into the normal form address base according to the tree structure.
52. The device of claim 51, characterized in that the normal form address base has a number of branches, and the end of each branch has at least one leaf node.
53. The device of claim 51, characterized in that the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer, both based on the logical hierarchy of addresses.
54. The device of claim 53, characterized in that the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
55. The device of claim 53, characterized in that the sub-address layer comprises at least one of a road-class address, an area-class address and a landmark-class address.
56. An address database construction device, characterized in that the device comprises:
a raw data acquisition module, configured to obtain original address data;
a word segmentation model module, in which a word segmentation model classifies the original address data and produces a candidate normal form address;
a normal form address generation module, configured to sort the candidate normal form address into a normal form address database.
57. The device of claim 56, characterized in that the original address information in the raw data acquisition module comprises text information and coordinate information.
58. The device of claim 56, characterized in that the address database construction device further comprises an address statistical analysis module, configured to perform statistical analysis on the original address data and produce the candidate normal form address.
59. The device of claim 56, characterized in that the address database construction device further comprises:
a corpus generation module, configured to segment a number of the original address data into a corpus according to an established normal form standard;
a corpus learning module, configured to build the word segmentation model from the corpus by machine learning.
60. The device of claim 59, characterized in that the machine learning approach is a conditional random field.
61. The device of claim 59, characterized in that the machine learning approach is a support vector machine.
62. The device of claim 59, characterized in that the machine learning approach is a hidden Markov model.
63. The device of claim 56, characterized in that the normal form address generation module comprises:
an address base building unit, configured to build a normal form address base with a tree structure;
an address input unit, configured to receive the candidate normal form address;
an address sorting unit, configured to analyze the candidate normal form address and sort it into the normal form address base according to the tree structure.
64. The device of claim 63, characterized in that the normal form address base has a number of branches, and the end of each branch has at least one leaf node.
65. The device of claim 63, characterized in that the tree structure of the normal form address base comprises an administrative region layer and a sub-address layer, both based on the logical hierarchy of addresses.
66. The device of claim 65, characterized in that the administrative region layer comprises four levels: the first level is province/autonomous region/municipality directly under the central government; the second level is city/autonomous prefecture; the third level is district/county; the fourth level is township/town/subdistrict.
67. The device of claim 65, characterized in that the sub-address layer comprises at least one of a road-class address, an area-class address and a landmark-class address.
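The address statistical analysis step recited in claims 7-8 (and 29-30) can be illustrated with a short sketch. It is a hedged reading of the claim language, not the patent's implementation: the resource library is assumed to be a list of (before, middle, after) triples of attribute labels, and the function name, the labels and the 0.6 threshold are all hypothetical. The function counts what occurs between the two known neighbours, takes the most frequent candidate, and accepts it only if its probability exceeds the threshold; returning None stands for falling back to the segmentation model in S2.

```python
from collections import Counter

def infer_unknown(first, second, resource_library, threshold=0.6):
    """Guess the unknown middle element from its known neighbours.

    resource_library: iterable of (before, middle, after) triples (assumed shape).
    Returns (middle, probability) if the best candidate beats the threshold,
    otherwise None, meaning the address should go on to the segmentation model.
    """
    counts = Counter(
        middle
        for before, middle, after in resource_library
        if before == first and after == second
    )
    total = sum(counts.values())
    if total == 0:            # these neighbours were never observed together
        return None
    best, n = counts.most_common(1)[0]
    probability = n / total
    return (best, probability) if probability > threshold else None

# Toy resource library: between a district and a house number, a road is seen
# 8 times out of 10 and an area 2 times out of 10.
library = [("district", "road", "number")] * 8 + [("district", "area", "number")] * 2

print(infer_unknown("district", "number", library))        # accepted: above 0.6
print(infer_unknown("district", "number", library, 0.9))   # rejected: below 0.9
```

Keeping the threshold as a parameter makes the trade-off explicit: a higher value sends more addresses to the slower segmentation model, a lower value accepts more statistical guesses.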
CN 201010540110 2010-11-10 2010-11-10 Method and device for constructing address database Active CN102024024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010540110 CN102024024B (en) 2010-11-10 2010-11-10 Method and device for constructing address database

Publications (2)

Publication Number Publication Date
CN102024024A true CN102024024A (en) 2011-04-20
CN102024024B CN102024024B (en) 2013-07-10

Family

ID=43865322

Country Status (1)

Country Link
CN (1) CN102024024B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant