CN103440311A - Method and system for identifying geographical name entities - Google Patents
- Publication number
- CN103440311A (application number CN201310377702.7)
- Authority
- CN
- China
- Prior art keywords
- address
- cutting
- place name
- grade
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Character Discrimination (AREA)
Abstract
The invention provides a method for identifying geographical name entities. The method comprises: inputting an address text and pre-processing it; segmenting the address text according to dictionary metadata; labeling the segmentation results and obtaining an optimal address-grade labeling sequence; and correcting the labeling sequence according to context and outputting the optimal labeling result. After segmentation, each segment is labeled with its geographical name categories according to a geographical name category definition table, the labeling sequence is optimized with the Viterbi algorithm, and the sequence is then corrected according to context to obtain the final labeling result. The identification result is therefore accurate and highly practical. The invention also provides a system for identifying geographical name entities.
Description
Technical field
The present invention relates to the field of geographic information, and in particular to a method and system for place name entity recognition.
Background
With the development of geographic information systems (GIS), remote sensing (RS), and the Global Positioning System (GPS), and especially the widespread use of mobile Internet location-based services (LBS), applications based on geographic information are increasingly part of everyday life. An important part of such applications, particularly where addresses are concerned, is place name entity recognition. In many current natural language processing platforms, the named entity recognition component does not recognize place name entities well enough. The main shortcomings are: first, the place name entity category is a single attribute, so every place name is simply tagged as a place name, with no finer division by grade (provincial, prefecture, county, township, community/villagers' committee, road, village, building, and so on); second, the recognition rate is low for place names at and below the township level; third, the case where different place names share the same abbreviation is not handled, for example "Jilin" can be Jilin City or Jilin Province; fourth, different descriptions of the same place (place name aliases) are recognized poorly.
A place name entity recognition method with a higher recognition rate is therefore needed to solve these problems.
Summary of the invention
To this end, the present invention aims to remedy at least one of the above deficiencies.
Accordingly, the invention provides a method and system for place name entity recognition. The address text is segmented according to dictionary metadata; each segment is then labeled with its place name categories according to the place name category definition table; the labeling sequence is optimized with the Viterbi algorithm and corrected according to context to obtain the final labeling result. The place name entity recognition result is thus accurate and practical.
One embodiment of the invention provides a method for place name entity recognition comprising the following steps: inputting an address text and pre-processing it; segmenting the address text according to dictionary metadata; labeling the segmentation results and obtaining the optimal address-grade labeling sequence; and correcting the labeling sequence according to context and outputting the optimal labeling result.
Preferably, the method is dictionary-based, and the dictionary is stored in a double-array Trie structure.
Preferably, the method builds an address metadata library based on the Trie in advance, divides the place names in the address metadata into 12 grades, and builds the corresponding place name category definition table from the address metadata library.
In one embodiment of the invention, address segmentation uses a Trie-based reverse maximum matching algorithm that scans the input address text from right to left.
In one embodiment of the invention, address labeling comprises: labeling each segment with its place name categories according to the place name category definition table; and labeling any segment that cannot be found in the address metadata as grade 0.
Preferably, address labeling obtains the optimal address-grade labeling sequence with the Viterbi algorithm.
Another embodiment of the invention provides a system for place name entity recognition. The system comprises: an address text input system, which obtains the input address text and pre-processes it; an address segmentation system, which segments the address text obtained by the input system according to the address metadata; an address labeling system, which obtains the optimal address-grade labeling sequence with the Viterbi algorithm; and an address correction system, which corrects the labeling sequence according to context and obtains the optimal labeling result.
In another embodiment, the pre-processing performed by the address text input system comprises at least: deleting superfluous spaces and converting full-width digits and letters to half-width characters.
In another embodiment, the address segmentation system segments the address text according to an address metadata library built in advance on the Trie, using the Trie-based reverse maximum matching algorithm.
In another embodiment, the address labeling system labels each segment with its place name categories according to the attributes of the place name in the address metadata; in addition, it labels any segment that cannot be found in the address metadata as grade 0; it then obtains the optimal address-grade labeling sequence with the Viterbi algorithm. The invention segments the address text according to dictionary metadata, labels each segment with its place name categories according to the place name category definition table, optimizes the labeling sequence with the Viterbi algorithm, and corrects the sequence according to context to obtain the final labeling result, so that the place name entity recognition result is accurate and practical.
Brief description of the drawings
Fig. 1 is a flow chart of a place name entity recognition method according to an embodiment of the invention.
Fig. 2 is a schematic diagram of the working principle of the Trie used in an embodiment of the invention.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and are not intended to limit it.
The method and system for place name entity recognition provided by the invention segment the address text according to dictionary metadata, label each segment with its place name categories according to the place name category definition table, optimize the labeling sequence with the Viterbi algorithm, and correct the sequence according to context to obtain the final labeling result, so that the place name entity recognition result is accurate and practical.
Fig. 1 is a flow chart of a place name entity recognition method according to an embodiment of the invention. The method comprises the following steps:
Step S110: input the address text and pre-process it.
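The pre-processing named in this step (deleting superfluous spaces and converting full-width digits and letters to half-width) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess` is hypothetical, and treating every whitespace character as superfluous is an assumption.

```python
import re


def preprocess(text: str) -> str:
    """Convert full-width digits/letters to half-width and drop whitespace.

    Assumption: all whitespace (including the ideographic space U+3000)
    counts as superfluous in an address string.
    """
    converted = []
    for ch in text:
        cp = ord(ch)
        is_fullwidth_digit = 0xFF10 <= cp <= 0xFF19
        is_fullwidth_upper = 0xFF21 <= cp <= 0xFF3A
        is_fullwidth_lower = 0xFF41 <= cp <= 0xFF5A
        if is_fullwidth_digit or is_fullwidth_upper or is_fullwidth_lower:
            # Full-width forms sit exactly 0xFEE0 above their ASCII twins.
            ch = chr(cp - 0xFEE0)
        converted.append(ch)
    return re.sub(r"\s+", "", "".join(converted))


print(preprocess("广东省 深圳市　２９号Ａ座"))
```

Only digits and letters are converted, as the text specifies; full-width punctuation is left untouched.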
Given the characteristics of Chinese place names, both address segmentation and place name entity recognition in this embodiment are dictionary-based. Dictionary-based word segmentation usually uses either forward (left-to-right) matching or reverse (right-to-left) matching. Reverse matching generally has about half the segmentation error rate of forward matching and is better at resolving overlapping ambiguity. Overlapping ambiguity is defined as follows: for three consecutive characters ABC, both AB and BC can form a word; in Chinese, BC is generally the more likely word. In this embodiment, address segmentation is based on the address metadata dictionary and uses the reverse maximum matching algorithm to scan the user's input address text from right to left. To speed up segmentation, the dictionary is stored in a Trie based on a double array.
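Reverse maximum matching as described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: a plain Python `set` stands in for the double-array Trie, the function name `reverse_max_match` is hypothetical, and a character that matches nothing is emitted as a one-character segment.

```python
def reverse_max_match(text, dictionary):
    """Segment `text` by scanning right to left, longest match first."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, then shorter ones.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()  # segments were collected right to left
    return tokens


words = {"广东", "广东省", "深圳", "深圳市", "南山区"}
print(reverse_max_match("广东省深圳市南山区", words))
```

Swapping the `set` for a Trie changes only the membership test; the right-to-left scan stays the same.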
In this step, an address metadata library based on the Trie must be built in advance. The address metadata dictionary mainly contains various kinds of place name data, such as provincial, prefecture-level, county-level, and township-level administrative regions, communities/villagers' committees, roads, buildings, residential estates, villages, and institutions. Place name data for administrative divisions can be obtained directly from the data on China's administrative divisions on Wikipedia and the website of the National Bureau of Statistics; other data can be collected manually or extracted from complete postal addresses using address segmentation and recognition techniques.
The place name metadata mainly comprises: names of provincial administrative regions (provinces, autonomous regions, municipalities, and special administrative regions); names of prefecture-level administrative regions (prefecture-level cities, autonomous prefectures, prefectures, leagues); names of county-level administrative regions (municipal districts, county-level cities, counties, autonomous counties, banners, autonomous banners, special districts, and forestry districts); names of township-level administrative regions (townships, towns, subdistricts, sumu, district offices); and other address data (road names, village names, residential estate names, building and square names, institution names), etc.
Addresses in current use mainly follow two patterns:
Pattern 1: addresses located by road. The usual structure is: provincial administrative region + prefecture-level administrative region + county-level administrative region + road + house number + building name + room number. Example: Room 2208, Overseas Students Venture Building, No. 29 Gaoxin South Ring Road, Nanshan District, Shenzhen City, Guangdong Province. This pattern is common in electronic maps such as Baidu Maps and Google Maps.
Pattern 2: addresses located by administrative division. The usual structure is: provincial administrative region + prefecture-level administrative region + county-level administrative region + township/town/subdistrict + neighbourhood (villagers') committee + residential estate/natural village. Example: Baomin Garden, Liutang Neighbourhood Committee, Xixiang Subdistrict, Bao'an District, Shenzhen City, Guangdong Province. This pattern is common in government departments, such as civil affairs departments.
To be compatible with both address patterns, this embodiment divides the place names in an address into 12 grades according to their characteristics, as shown in Table 1 below.
grade | administrative region | examples |
grade 1 | provinces, autonomous regions, municipalities | Guangdong Province, Inner Mongolia Autonomous Region, Guangxi Zhuang Autonomous Region, Beijing, Chongqing, etc. |
grade 2 | prefecture-level cities, autonomous prefectures, prefectures, leagues | Shenzhen, Guangzhou, Wuhan, Wenzhou, Yanbian Korean Autonomous Prefecture, Shigatse Prefecture, Hotan Prefecture, Turpan Prefecture, Xilingol League, etc. |
grade 3 | districts of prefecture-level cities, districts of municipalities, counties directly under a province, autonomous counties, county-level cities, banners | Futian District, Nanshan District, Pudong New Area, Chongming County, Haidian District, Wenchang City, Ding'an County, Qingyuan Manchu Autonomous County, Conghua City, Horqin Right Front Banner, etc. |
grade 4 | townships, towns, subdistricts, sumu, ethnic townships, district offices | Chengguan Town, Wu Town, Yuehai Subdistrict, Sanguanmiao Township, Bayan Hushu Sumu, Xujiaqiao Hui and Uyghur Ethnic Township, etc. |
grade 5 | communities, villagers' committees | Liutang Community, Shi Village Villagers' Committee, etc. |
grade 6 | roads | Shennan Avenue, Keyuan South Road, Xizhimenwai Street, etc. |
grade 7 | house numbers | the number immediately following a road name, e.g. "No. 208" in "... Road No. 208" and "Lane 223" in "Yangjing Road Lane 223", etc. |
grade 8 | residential estates, villages (natural villages), industrial zones | BOHO TOWN, Liutang Village, etc. |
grade 9 | roads, lanes, alleys, and residential building numbers inside an estate or village | This field stores roads, lanes, alleys, and similar items inside a village or estate. Typical feature words include unit, lane, alley, block, and building. |
grade 10 | building names | Software Building, Overseas Students Venture Building, etc. |
grade 11 | room numbers | e.g. in "Room 2208", floor 22 (floor, F), room 08, etc. |
grade 12 | other names not covered above | institution names, company names, non-place names, etc. |
Table 1: Definition of the twelve-grade address classification model.
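For use in code, the twelve grades of Table 1 plus the grade 0 that the method assigns to unrecognized segments could be written as an enumeration. The member names below are hypothetical shorthand for the rows of Table 1, not identifiers from the patent.

```python
from enum import IntEnum


class AddressGrade(IntEnum):
    """Address grades of Table 1; 0 marks a segment not in the dictionary."""
    UNKNOWN = 0
    PROVINCE = 1
    PREFECTURE = 2
    COUNTY = 3
    TOWNSHIP = 4
    COMMUNITY = 5
    ROAD = 6
    HOUSE_NUMBER = 7
    ESTATE_OR_VILLAGE = 8
    INNER_LANE_OR_BLOCK = 9
    BUILDING = 10
    ROOM = 11
    OTHER = 12


print(AddressGrade(4).name)
```

An `IntEnum` keeps the numeric ordering, so grades can be compared or used directly as indices.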
The address metadata dictionary contains not only place name vocabulary but also the attribute of each place name, namely its place name category. The dictionary format is defined as follows. The dictionary consists of multiple lines; each line is one entry (Term). Each Term has two fields: the place name and the set of address categories (Categories) for that place name. The place name is the key, and the address grades are the key's attribute set, or Categories. The two fields are separated by a half-width semicolon ";". Some place names belong to several place name categories (for example, an alias of one standard address may also be an alias of another standard address); the different categories are separated by a half-width comma ",".
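A parser for this line format might look like the sketch below. The function name `parse_dictionary` and the category names in the sample lines (`PROVINCE`, `PROVINCE_ABBR`, `TOWNSHIP_ABBR`) are hypothetical placeholders; the source does not list the actual category identifiers.

```python
def parse_dictionary(lines):
    """Parse 'place name;CAT1,CAT2' lines into {place name: set of categories}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Half-width semicolon separates the key from its category set.
        name, _, categories = line.partition(";")
        # Half-width commas separate multiple categories for one name.
        entries.setdefault(name, set()).update(
            cat for cat in categories.split(",") if cat
        )
    return entries


sample = ["广东省;PROVINCE", "河北;PROVINCE_ABBR,TOWNSHIP_ABBR"]
print(parse_dictionary(sample))
```

Storing the categories as a set makes a repeated entry for the same place name merge rather than overwrite.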
According to the characteristics of place names and the way they are used, the official standard name of a place is its full name, and any other name is an alias; for example, "Guangdong" (广东) and "Yue" (粤) are aliases of "Guangdong Province" (广东省). By their literal form, aliases fall into two kinds. The first is a contiguous substring of the standard name, called an abbreviation or short name, such as "Guangdong" for "Guangdong Province". The second has no literal association with the standard name, or is not a substring of it at all, such as "Yue" for "Guangdong Province". To distinguish the two kinds, the category of the first kind is defined as the standard name's category followed by "_ABBR", and the category of the second kind as the standard name's category followed by "_ALIAS". The place name categories are therefore defined as shown in Table 2 below.
Table 2: Place name category definition table.
Step S120: segment the address text according to the dictionary metadata.
In this embodiment the dictionary uses a Trie based on a double array. For ease of understanding, Fig. 2 illustrates the principle of the Trie, taking forward maximum matching as the example.
As Fig. 2 shows, a Trie is a deterministic finite automaton (DFA). Each node represents one state of the automaton; the state changes according to the input, each transition is verified against the transition path, and the query ends when an end state is reached or no transition is possible. A Trie query has two main steps: first, from the current state, make a transition according to the current input character and obtain the position of the immediate successor state; second, verify the predecessor of the state reached, that is, determine which state it was reached from and whether that state is its direct predecessor.
It follows that a Trie must store the direct predecessor of each state. The Trie implementations that are currently popular are generally based on a double array. The two arrays are named base[] and check[]; each element subscript i corresponds to a node number of the Trie, or its storage position in the double array, and is also called the state number.
base[i]: stores the smallest conflict-free offset from the current state i to all of its successor states;
check[i]: stores the direct predecessor information of the current state i, that is, which state the current state was reached from;
base and check are paired: base[i] and check[i] describe the same state.
If the current state is s, the input character is c, and the next state is t (a non-leaf node), the constraints of the query are:
check[base[s] + c] = s    (Formula 1);
base[s] + c = t    (Formula 2);
the base[s] value of each state is unique.
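The transition rule of Formulas 1 and 2 can be tried out with the small sketch below. It is an illustration under assumptions, not the patented structure: the builder is a naive one that picks the smallest collision-free base per state, word ends are marked with a terminator character instead of negative base values, and all names (`build_double_array`, `transition`, `contains`) are hypothetical.

```python
from collections import deque

END = "\0"  # terminator appended to every word (an assumption of this sketch)


def build_double_array(words):
    """Naively build base/check arrays so that check[base[s] + c] == s."""
    trie = {}
    for word in words:
        node = trie
        for ch in word + END:
            node = node.setdefault(ch, {})
    alphabet = sorted({ch for w in words for ch in w} | {END})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}  # codes start at 1
    base, check = [0], [-1]  # state 0 is the root; -1 marks a free slot
    queue = deque([(0, trie)])
    while queue:
        s, node = queue.popleft()
        if not node:
            continue
        codes = [code[ch] for ch in node]
        b = 1
        # Smallest offset whose target slots are all free.
        while any(b + c < len(check) and check[b + c] != -1 for c in codes):
            b += 1
        grow = b + max(codes) + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([-1] * grow)
        base[s] = b
        for ch, child in node.items():
            t = b + code[ch]       # Formula 2
            check[t] = s           # Formula 1
            queue.append((t, child))
    return base, check, code


def transition(base, check, s, c):
    """Return the next state t, or None if the transition is invalid."""
    t = base[s] + c
    return t if 0 <= t < len(check) and check[t] == s else None


def contains(base, check, code, word):
    s = 0
    for ch in word + END:
        if ch not in code:
            return False
        s = transition(base, check, s, code[ch])
        if s is None:
            return False
    return True


base, check, code = build_double_array(["广东", "广东省", "深圳"])
print(contains(base, check, code, "广东"), contains(base, check, code, "广"))
```

Each lookup costs one addition and one comparison per character, which is why the query time does not depend on the dictionary size.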
If the current state s can transfer to a leaf node t, the constraints are:
base[s] = t    (Formula 3);
t = check[t]    (Formula 4);
base[t] < 0, and the value of base[t] is the negative of the position, in the set of all entries sorted in lexicographic order, of the entry formed by the characters on the path from the DFA's initial node 0 to the current leaf node.
Queries against a Trie structure are efficient: the time for one query is independent of the size of the dictionary and depends only on the length of the query string. The best case for one query is O(1), when the first character of the text string cannot be found in the first level of the Trie. The worst-case time complexity of one query is O(n), where n depends only on the depth of the Trie and the length of the query text; the depth of the tree is determined by the longest entry in the dictionary.
For ease of implementation, this embodiment stores check and base in a single array: the base array occupies the even positions and the check array the odd positions, that is, base[i] -> array[2*i] and check[i] -> array[2*i+1]. If the current state is s, the input character is c, and the next state is t (a non-leaf node), the query constraints of this method become:
array[2 * (array[2*s] + c) + 1] = array[2*s]    (Formula 5);
array[2*s] + c = t    (Formula 6);
the values at the valid even positions of array are mutually unequal, that is, unique.
If the current state s can transfer to a leaf node t, the constraints are:
array[2*s] = t    (Formula 7);
t = array[2*t + 1]    (Formula 8);
array[2*t] < 0, and the value of array[2*t] is the negative of the position, in the set of all entries sorted in lexicographic order, of the entry formed by the characters on the path from the DFA's initial node 0 to the current leaf node.
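The interleaved layout, with base at the even positions and check at the odd positions, can be sketched as below. The helper names (`interleave`, `base_of`, `check_of`) and the sample values are hypothetical; the sketch shows only the index mapping base[i] -> array[2*i] and check[i] -> array[2*i+1], not a full query.

```python
def interleave(base, check):
    """Pack two equal-length arrays into one: base at even, check at odd slots."""
    if len(base) != len(check):
        raise ValueError("base and check must have the same length")
    array = [0] * (2 * len(base))
    array[0::2] = base    # base[i]  -> array[2*i]
    array[1::2] = check   # check[i] -> array[2*i + 1]
    return array


def base_of(array, i):
    return array[2 * i]


def check_of(array, i):
    return array[2 * i + 1]


packed = interleave([1, 5, -2], [-1, 0, 1])
print(packed)
```

Keeping both values of a state next to each other means one state's base and check are read from adjacent memory, which is a plausible motivation for the single-array layout.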
Step S130: label the segmentation results and obtain the optimal address-grade labeling sequence.
After the address text has been segmented in step S120 by the Trie-based reverse maximum matching algorithm, each resulting segment is labeled with its place name categories. The categories are read from the attributes of each place name in the address metadata dictionary. If a segment is not in the dictionary, it is an unrecognized address and its address grade is labeled 0. The Viterbi algorithm is then applied to these labels to obtain the optimal address-grade labeling sequence.
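The labeling rule described above (look up each segment's candidate grades, fall back to grade 0 when the segment is unknown) can be sketched as follows. The function name `label_segments` and the small lookup table are hypothetical; the candidate grades for "Shenzhen" and "Xixiang" follow the worked example later in this step.

```python
def label_segments(segments, grades_by_name):
    """Attach the candidate grades to each segment; unknown segments get [0]."""
    return [
        (segment, sorted(grades_by_name.get(segment, {0})))
        for segment in segments
    ]


grades_by_name = {
    "Guangdong": {1},
    "Shenzhen": {2, 4},
    "Bao'an": {3},
    "Xixiang": {2, 4},
}
print(label_segments(["Guangdong", "Shenzhen", "Bao'an", "Xixiang", "Foo"], grades_by_name))
```

The result is a lattice of candidate grades per segment, which is the input a Viterbi search needs.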
The above step is illustrated with an example. A probability model for the Viterbi algorithm is built from prior knowledge; Pi and A may take the following initial values:
Pi = {0.05, 0.45, 0.25, 0.15, 0.10};
A = {{0.05, 0.45, 0.25, 0.15, 0.10},
     {0.05, 0.23, 0.45, 0.17, 0.10},
     {0.05, 0.18, 0.25, 0.30, 0.22},
     {0.05, 0.35, 0.05, 0.05, 0.50},
     {0.05, 0.30, 0.15, 0.05, 0.45}};
Suppose the input address is "Guangdong Shenzhen Bao'an Xixiang". After address segmentation and address labeling, the following four labeling sequences are obtained: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)", and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)". According to the Viterbi algorithm, the weights of the four labeling sequences are:
Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4): p = 0.030375;
Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2): p = 0.0030375;
Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4): p = 0.001125;
Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2): p = 1.125E-4;
The labeling sequence with the highest probability is the first one, so the dynamic programming algorithm outputs the first labeling sequence, "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)".
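The weights above can be reproduced with a short Viterbi sketch over the candidate grades. It assumes that index g of Pi and A stands for grade g (0 to 4) and that no emission probabilities are involved; both are assumptions, chosen because they reproduce the four weights listed in the text. The function name `viterbi` is hypothetical.

```python
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def viterbi(candidates, pi, a):
    """Return (probability, grade path) of the best path through the lattice.

    candidates[i] lists the grades allowed for the i-th segment.
    """
    # best[g] = (probability, path) of the best path ending in grade g
    best = {g: (pi[g], [g]) for g in candidates[0]}
    for grades in candidates[1:]:
        step = {}
        for g in grades:
            prev = max(best, key=lambda p: best[p][0] * a[p][g])
            prob, path = best[prev]
            step[g] = (prob * a[prev][g], path + [g])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Guangdong {1}, Shenzhen {2, 4}, Bao'an {3}, Xixiang {4, 2}
prob, path = viterbi([[1], [2, 4], [3], [4, 2]], PI, A)
print(path, prob)
```

For the best path the product is Pi[1] * A[1][2] * A[2][3] * A[3][4] = 0.45 * 0.45 * 0.30 * 0.50 = 0.030375, matching the first weight in the list.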
Based on context step S140 proofreaies and correct the mark sequence, and exports optimum annotation results.
Can't resolve the another name situation identical with the another name of county or county-level city in a prefecture-level city area under one's jurisdiction in step S130, for example " Taihe County " (being subordinate to Anhui Province's Fuyang City) and " Taihe District " (being subordinate to Jinzhou City, Liaoning Province), their another name is all " Taihe county ", but they belong to different address ratings.When " Taihe county, ”He“ Jinzhou, Taihe county, Fuyang (city) (city) " occurring, be labeled in maximum probability on the 3rd location, polar region rank according to algorithm and probability model " Taihe county " now, solve problems will be according to it address name above judge that its address rank is " 2 " or " 3 ", the correction that is marked sequence as special circumstances like that.Be exemplified below:
The address of input is: " ancient month of the Pingshan Mountain, Shijiazhuang, Hebei ".
The address sequence of mark is: " Hebei (1; 2; 4) Shijiazhuang (2; the 4) Pingshan Mountain (2; 3,4) Gu Yue (4) ", and in this mark sequence, the mark grade of each address is interpreted as: " Hebei " can be the another name in " Hebei province ", can be also “ Hebei District, Tianjin " another name, can be also the another name in " Hebei township "; " Shijiazhuang " can be the another name in " Shijiazhuang City " and " Shijiazhuang town "; " Pingshan Mountain " can be the another name in " Pingshan County " or " ”Huo“ Pingshan Mountain town, Pingshan District ".
Optimum mark sequence is: " Pingshan Mountain, Shijiazhuang, Hebei (1) (2) (3) Gu Yue (4) ".
Based on context the mark sequence after proofreading and correct is: " Pingshan Mountain, Shijiazhuang, Hebei (1) (2) (2) Gu Yue (4) ", because " Pingshan Mountain " now is " Pingshan County ".
This shows and call identical the time when the another name in a prefecture-level city area under one's jurisdiction and county or county-level city, whether the affiliated prefecture-level city that is noted as third-level address its direct precursor address, if not being proofreaied and correct.In order to facilitate contextual rule to adopt the mode of above-mentioned contrary rule to store, the record another name is context for the another name of prefecture-level city under county or county-level city, for example (Taihe county → Fuyang).Therefore when meeting this context, revise the grade of mark, do not make any modification while not meeting.
Meanwhile also there is two-level address and level Four address situation of the same name, mainly appear at the another name in county-level city or county and the another name situation of the same name in small towns, because the level Four address can occur repeatedly continuously in a sufficient address, therefore sometimes can be labeled in two-level address on level Four.Now also to based on context be differentiated, be revised the sequence of mark.
As the address of inputting is: " He Min village, new township, Heihe In The Heilongjiang River Wudalianchi ";
Optimum mark sequence is: " new township, Wudalianchi, Heihe, Heilungkiang (1) (2) (4) (4) He Min village (0) ",
" Wudalianchi " now is labeled on the rank of fourth stage address, in fact it is a county-level city, and the mark sequence after based on context proofreading and correct is: " new township, Wudalianchi, Heihe, Heilungkiang (1) (2) (2) (4) He Min village (0) ".
The solution is similar to the one for districts with identical aliases. When a township and a county share a name, the rule retained by the system uses as context the alias of the prefecture-level city to which the county or county-level city belongs, for example (Wudalianchi → Heihe). When an address matches this context, the labeled level is revised; when it does not, no modification is made.
Thus, for certain special cases, a mechanism is provided that corrects the optimal label sequence according to context; at the same time, the address context is used to eliminate the ambiguity introduced by aliases (one alias corresponding to addresses at several levels). The result obtained in this way is more accurate.
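The context-based correction described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the rule table `CONTEXT_RULES`, the function `correct_labels`, and the choice of level 2 as the corrected level (taken from the Wudalianchi example) are hypothetical names and values, not the patent's actual implementation.

```python
# Hypothetical context rules in the "inverse" form described above:
# ambiguous alias -> alias of the prefecture-level city it belongs to.
CONTEXT_RULES = {
    "Taihe": "Fuyang",
    "Wudalianchi": "Heihe",
}

# Assumption: level written when the context matches, as in the Wudalianchi example.
COUNTY_LEVEL = 2


def correct_labels(segments, levels, rules=CONTEXT_RULES, county_level=COUNTY_LEVEL):
    """Return a copy of `levels` with context-based corrections applied."""
    corrected = list(levels)
    for i in range(1, len(segments)):
        parent = rules.get(segments[i])
        # Revise the level only when the direct predecessor is the expected city;
        # otherwise the label is left unchanged.
        if parent is not None and segments[i - 1] == parent:
            corrected[i] = county_level
    return corrected


segments = ["Heilongjiang", "Heihe", "Wudalianchi", "new township", "He Min village"]
levels = [1, 2, 4, 4, 0]
print(correct_labels(segments, levels))  # [1, 2, 2, 4, 0]
```

Keying the table by the ambiguous alias keeps the lookup to a single dictionary access per segment, and a segment whose predecessor does not match is simply left as labeled.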
Another embodiment of the present invention provides a system for place name entity recognition. The system comprises: an address text input system, for obtaining the input address text and preprocessing it; an address segmentation system, for segmenting the address text obtained by the address text input system according to the address metadata; an address labeling system, for obtaining the optimal address level label sequence by the Viterbi algorithm; and an address correction system, for correcting the label sequence according to context and obtaining the optimal labeling result.
In another embodiment, the preprocessing that the address text input system performs on the address text comprises at least: deleting superfluous spaces and converting full-width digits or letters to half-width characters.
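A minimal sketch of this preprocessing step is given below. The function name `preprocess_address` is hypothetical, and treating every whitespace character as superfluous is an assumption that suits Chinese address strings; the patent does not spell out which spaces are removed.

```python
import re


def preprocess_address(text: str) -> str:
    """Convert full-width digits and letters to half-width, then drop whitespace."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\u3000":
            # Ideographic (full-width) space becomes an ordinary space.
            out.append(" ")
        elif (0xFF10 <= code <= 0xFF19        # full-width digits 0-9
              or 0xFF21 <= code <= 0xFF3A     # full-width letters A-Z
              or 0xFF41 <= code <= 0xFF5A):   # full-width letters a-z
            # Full-width forms sit exactly 0xFEE0 above their ASCII counterparts.
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    # Assumption: all whitespace in an address string is superfluous.
    return re.sub(r"\s+", "", "".join(out))


# "\uff21" is full-width A, "\uff11\uff12" are full-width 1 and 2.
print(preprocess_address("\uff21 \u3000\uff11\uff12"))
```

Only digits and letters are converted here, as in the text; full-width punctuation is left untouched.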
In another embodiment, the address segmentation system segments the address text according to an address metadata library built in advance on a Trie tree; the address segmentation system performs the address segmentation using a reverse maximum matching algorithm based on the Trie tree.
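Reverse maximum matching over a Trie can be sketched as follows. Note the substitutions: this illustration uses a plain nested-dictionary trie rather than the double-array Trie named in the claims, and the class `ReverseTrie`, the function `segment`, and the sample dictionary are hypothetical. Names are stored reversed so that the right-to-left scan can walk the trie directly.

```python
class ReverseTrie:
    """Trie that stores each place name reversed, for right-to-left matching."""

    END = "$end"  # multi-character key, so it cannot collide with a single character

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node[self.END] = True

    def longest_suffix(self, text, end):
        """Length of the longest dictionary word that ends just before index `end`."""
        node, best, i = self.root, 0, end - 1
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if self.END in node:
                best = end - i
            i -= 1
        return best


def segment(text, trie):
    """Scan from the right, always taking the longest dictionary match."""
    pieces, unknown, end = [], "", len(text)
    while end > 0:
        n = trie.longest_suffix(text, end)
        if n == 0:
            # No dictionary word ends here: collect the character as unknown text.
            unknown = text[end - 1] + unknown
            end -= 1
            continue
        if unknown:
            pieces.append(unknown)
            unknown = ""
        pieces.append(text[end - n:end])
        end -= n
    if unknown:
        pieces.append(unknown)
    return pieces[::-1]


trie = ReverseTrie(["广东省", "深圳市", "南山区"])
print(segment("广东省深圳市南山区科技园", trie))
```

Runs of characters that match no dictionary entry are kept together as a single unknown piece, which a later labeling step can mark as level 0.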
In another embodiment, the address labeling system labels each segmentation result with its corresponding place name category according to the attributes of the place name in the address metadata; in addition, the address labeling system labels the address level of any segmentation result that cannot be found in the address metadata as level 0; the address labeling system then obtains the optimal address level label sequence by the Viterbi algorithm. The present invention segments the address text according to the dictionary metadata, labels each segmentation result with its corresponding place name category according to the place name category definition table, optimizes the label sequence by the Viterbi algorithm, and corrects the label sequence according to context to obtain the final labeling result, which makes the place name entity recognition result accurate and practical.
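The Viterbi step of the labeling system can be illustrated with the sketch below. It assumes each segment arrives with a list of candidate levels looked up from the metadata (unknown segments carry only level 0). The transition scores in `transition_log_prob` are invented for illustration and are not the patent's values, and emission scores are omitted for brevity.

```python
import math


def transition_log_prob(prev_level, level):
    """Hypothetical transition model; the numbers are illustrative only."""
    if level == 0 or prev_level == 0:
        return math.log(0.5)    # unknown text may appear anywhere
    if level == prev_level + 1:
        return math.log(0.6)    # descending one level is most likely
    if level > prev_level:
        return math.log(0.3)    # skipping levels is possible
    if level == prev_level:
        return math.log(0.08)   # repeating a level is uncommon
    return math.log(0.02)       # moving back up the hierarchy is rare


def viterbi_levels(candidates):
    """candidates[i] lists the possible levels of segment i; return the best sequence."""
    # best[level] = (log score of the best path ending in `level`, that path)
    best = {lvl: (0.0, [lvl]) for lvl in candidates[0]}
    for options in candidates[1:]:
        step = {}
        for lvl in options:
            score, path = max(
                (s + transition_log_prob(prev, lvl), p)
                for prev, (s, p) in best.items()
            )
            step[lvl] = (score, path + [lvl])
        best = step
    return max(best.values())[1]


# The third segment is ambiguous between level 2 and level 3.
print(viterbi_levels([[1], [2], [2, 3], [4], [0]]))  # [1, 2, 3, 4, 0]
```

Working in log space turns the product of transition probabilities into a sum, which avoids numerical underflow on long addresses.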
Claims (10)
1. A method for place name entity recognition, characterized in that the method comprises the following steps:
Inputting an address text and preprocessing it;
Segmenting the address text according to dictionary metadata;
Labeling the segmentation results and obtaining an optimal address level label sequence;
Correcting the label sequence according to context, and outputting the optimal labeling result.
2. The method according to claim 1, characterized in that the method is carried out in a dictionary-based manner, wherein the dictionary is stored in a double-array Trie tree structure.
3. The method according to claim 1 or 2, characterized in that the method comprises:
Building in advance an address metadata library based on a Trie tree;
Dividing the place names in the address metadata into 12 levels;
Building a corresponding place name category definition table according to the address metadata library.
4. The method according to claim 1, characterized in that the address segmentation comprises: scanning the input address text from right to left with a Trie-tree-based reverse maximum matching algorithm to divide the address.
5. The method according to claim 1, characterized in that the address labeling comprises:
Labeling each segmentation result with its corresponding place name category according to the place name category definition table;
And labeling the address level of any segmentation result that cannot be found in the address metadata as level 0.
6. The method according to claim 1 or 5, characterized in that the address labeling comprises: obtaining the optimal address level label sequence by the Viterbi algorithm.
7. A system for place name entity recognition, characterized in that the system comprises:
An address text input system, for obtaining the input address text and preprocessing it;
An address segmentation system, for segmenting the address text obtained by the address text input system according to the address metadata;
An address labeling system, for obtaining the optimal address level label sequence by the Viterbi algorithm;
An address correction system, for correcting the label sequence according to context and obtaining the optimal labeling result.
8. The system according to claim 7, characterized in that the preprocessing performed on the address text by the address text input system comprises at least: deleting superfluous spaces and converting full-width digits or letters to half-width characters.
9. The system according to claim 7, characterized in that the address segmentation system segments the address text according to an address metadata library built in advance on a Trie tree; the address segmentation system performs the address segmentation using a reverse maximum matching algorithm based on the Trie tree.
10. The system according to claim 7, characterized in that:
The address labeling system labels each segmentation result with its corresponding place name category according to the attributes of the place name in the address metadata;
In addition, the address labeling system labels the address level of any segmentation result that cannot be found in the address metadata as level 0;
The address labeling system then obtains the optimal address level label sequence by the Viterbi algorithm.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013103777027A CN103440311A (en) | 2013-08-27 | 2013-08-27 | Method and system for identifying geographical name entities |
PCT/CN2014/084609 WO2015027836A1 (en) | 2013-08-27 | 2014-08-18 | Method and system for place name entity recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013103777027A CN103440311A (en) | 2013-08-27 | 2013-08-27 | Method and system for identifying geographical name entities |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103440311A (en) | 2013-12-11 |
Family
ID=49694004
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013103777027A Pending CN103440311A (en) | 2013-08-27 | 2013-08-27 | Method and system for identifying geographical name entities |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN103440311A (en) |
WO (1) | WO2015027836A1 (en) |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015027836A1 (en) * | 2013-08-27 | 2015-03-05 | 深圳市华傲数据技术有限公司 | Method and system for place name entity recognition |
CN104933024A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
CN104933023A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
CN105045888A (en) * | 2015-07-28 | 2015-11-11 | 浪潮集团有限公司 | Participle training corpus tagging method for HMM (Hidden Markov Model) |
CN105704258A (en) * | 2014-11-28 | 2016-06-22 | 北京山海经纬信息技术有限公司 | Address recognition method and equipment |
WO2016138773A1 (en) * | 2015-03-05 | 2016-09-09 | 深圳市华傲数据技术有限公司 | Address knowledge processing method and device based on graphs |
CN106156145A (en) * | 2015-04-13 | 2016-11-23 | 阿里巴巴集团控股有限公司 | The management method of a kind of address date and device |
CN106155998A (en) * | 2015-04-09 | 2016-11-23 | 腾讯科技(深圳)有限公司 | A kind of data processing method and device |
CN106326206A (en) * | 2015-06-24 | 2017-01-11 | 北京京东尚科信息技术有限公司 | Entity extraction method based on grammar templates |
CN106557574A (en) * | 2016-11-23 | 2017-04-05 | 广东电网有限责任公司佛山供电局 | Destination address matching process and system based on tree construction |
CN106970918A (en) * | 2016-01-13 | 2017-07-21 | 阿里巴巴集团控股有限公司 | Generate the method and device of international address unique identifier |
CN107133215A (en) * | 2017-05-20 | 2017-09-05 | 复旦大学 | A kind of Chinese canonical address recognition methods of offline handwriting |
CN107247792A (en) * | 2017-06-16 | 2017-10-13 | 中国电子技术标准化研究院 | Match method, device and the computer equipment of functional department |
CN107305540A (en) * | 2016-04-20 | 2017-10-31 | 顺丰科技有限公司 | Address cutting recognition methods |
CN108509505A (en) * | 2018-03-05 | 2018-09-07 | 昆明理工大学 | A kind of character string retrieving method and device based on subregion even numbers group Trie |
CN108920457A (en) * | 2018-06-15 | 2018-11-30 | 腾讯大地通途(北京)科技有限公司 | Address Recognition method and apparatus and storage medium |
CN109033225A (en) * | 2018-06-29 | 2018-12-18 | 福州大学 | Chinese address identifying system |
CN109145095A (en) * | 2017-06-16 | 2019-01-04 | 贵州小爱机器人科技有限公司 | Information of place names matching process, information matching method, device and computer equipment |
CN109255564A (en) * | 2017-07-13 | 2019-01-22 | 菜鸟智能物流控股有限公司 | Pick-up point address recommendation method and device |
CN109299469A (en) * | 2018-10-29 | 2019-02-01 | 复旦大学 | A method of identifying complicated address in long text |
CN110210020A (en) * | 2019-05-22 | 2019-09-06 | 武汉虹信通信技术有限责任公司 | The standardized system and method for address |
WO2019205308A1 (en) * | 2018-04-27 | 2019-10-31 | 平安科技(深圳)有限公司 | Information input method and apparatus, and terminal device and medium |
WO2020057432A1 (en) * | 2018-09-17 | 2020-03-26 | 阿里巴巴集团控股有限公司 | Address standardization method and device, storage medium and computer terminal |
CN111324679A (en) * | 2018-12-14 | 2020-06-23 | 阿里巴巴集团控股有限公司 | Method, device and system for processing address information |
CN111931478A (en) * | 2020-07-16 | 2020-11-13 | 丰图科技(深圳)有限公司 | Address interest plane model training method, address prediction method and device |
CN112052673A (en) * | 2020-08-28 | 2020-12-08 | 丰图科技(深圳)有限公司 | Logistics network point identification method and device, computer equipment and storage medium |
CN112633003A (en) * | 2020-12-30 | 2021-04-09 | 平安科技(深圳)有限公司 | Address recognition method and device, computer equipment and storage medium |
CN112966511A (en) * | 2021-02-08 | 2021-06-15 | 广州探迹科技有限公司 | Entity word recognition method and device |
CN113220836A (en) * | 2021-05-08 | 2021-08-06 | 北京百度网讯科技有限公司 | Training method and device of sequence labeling model, electronic equipment and storage medium |
WO2024000656A1 (en) * | 2022-06-29 | 2024-01-04 | 青岛海尔科技有限公司 | Place name recognition method, system and apparatus, and storage medium |
CN112633003B (en) * | 2020-12-30 | 2024-05-31 | 平安科技(深圳)有限公司 | Address recognition method and device, computer equipment and storage medium |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114417022B (en) * | 2022-03-30 | 2022-06-28 | 阿里巴巴(中国)有限公司 | Model training method, data processing method and device |
CN117131867B (en) * | 2022-05-17 | 2024-05-14 | 贝壳找房(北京)科技有限公司 | Method, apparatus, computer program product and storage medium for splitting house address |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007213211A (en) * | 2006-02-08 | 2007-08-23 | Fujifilm Corp | Retrieval database, address retrieval device and address retrieval method |
CN101719128A (en) * | 2009-12-31 | 2010-06-02 | 浙江工业大学 | Fuzzy matching-based Chinese geo-code determination method |
CN102298585A (en) * | 2010-06-24 | 2011-12-28 | 高德软件有限公司 | Address splitting and level marking method and device |
CN102955833A (en) * | 2011-08-31 | 2013-03-06 | 深圳市华傲数据技术有限公司 | Correspondence address identifying and standardizing method |
CN102955832A (en) * | 2011-08-31 | 2013-03-06 | 深圳市华傲数据技术有限公司 | Correspondence address identifying and standardizing system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103440311A (en) * | 2013-08-27 | 2013-12-11 | 深圳市华傲数据技术有限公司 | Method and system for identifying geographical name entities |
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015027836A1 (en) * | 2013-08-27 | 2015-03-05 | 深圳市华傲数据技术有限公司 | Method and system for place name entity recognition |
CN105704258B (en) * | 2014-11-28 | 2019-11-29 | 方正国际软件(北京)有限公司 | A kind of method and apparatus of Address Recognition |
CN105704258A (en) * | 2014-11-28 | 2016-06-22 | 北京山海经纬信息技术有限公司 | Address recognition method and equipment |
WO2016138773A1 (en) * | 2015-03-05 | 2016-09-09 | 深圳市华傲数据技术有限公司 | Address knowledge processing method and device based on graphs |
CN106155998A (en) * | 2015-04-09 | 2016-11-23 | 腾讯科技(深圳)有限公司 | A kind of data processing method and device |
CN106155998B (en) * | 2015-04-09 | 2019-03-26 | 腾讯科技(深圳)有限公司 | A kind of data processing method and device |
CN106156145A (en) * | 2015-04-13 | 2016-11-23 | 阿里巴巴集团控股有限公司 | The management method of a kind of address date and device |
CN104933024B (en) * | 2015-05-12 | 2017-09-01 | 深圳市华傲数据技术有限公司 | Chinese address participle mask method |
CN104933023A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
CN104933023B (en) * | 2015-05-12 | 2017-09-01 | 深圳市华傲数据技术有限公司 | Chinese address participle mask method |
CN104933024A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
CN106326206A (en) * | 2015-06-24 | 2017-01-11 | 北京京东尚科信息技术有限公司 | Entity extraction method based on grammar templates |
CN106326206B (en) * | 2015-06-24 | 2021-01-26 | 北京京东尚科信息技术有限公司 | Entity extraction method based on grammar template |
CN105045888A (en) * | 2015-07-28 | 2015-11-11 | 浪潮集团有限公司 | Participle training corpus tagging method for HMM (Hidden Markov Model) |
CN106970918A (en) * | 2016-01-13 | 2017-07-21 | 阿里巴巴集团控股有限公司 | Generate the method and device of international address unique identifier |
CN106970918B (en) * | 2016-01-13 | 2020-10-27 | 菜鸟智能物流控股有限公司 | Method and device for generating unique identifier of international address |
CN107305540A (en) * | 2016-04-20 | 2017-10-31 | 顺丰科技有限公司 | Address cutting recognition methods |
CN106557574A (en) * | 2016-11-23 | 2017-04-05 | 广东电网有限责任公司佛山供电局 | Destination address matching process and system based on tree construction |
CN106557574B (en) * | 2016-11-23 | 2020-02-04 | 广东电网有限责任公司佛山供电局 | Target address matching method and system based on tree structure |
CN107133215A (en) * | 2017-05-20 | 2017-09-05 | 复旦大学 | A kind of Chinese canonical address recognition methods of offline handwriting |
CN109145095A (en) * | 2017-06-16 | 2019-01-04 | 贵州小爱机器人科技有限公司 | Information of place names matching process, information matching method, device and computer equipment |
CN109145095B (en) * | 2017-06-16 | 2024-03-29 | 贵州小爱机器人科技有限公司 | Place name information matching method, information matching device and computer equipment |
CN107247792A (en) * | 2017-06-16 | 2017-10-13 | 中国电子技术标准化研究院 | Match method, device and the computer equipment of functional department |
CN107247792B (en) * | 2017-06-16 | 2021-01-15 | 中国电子技术标准化研究院 | Method and device for matching functional departments and computer equipment |
CN109255564A (en) * | 2017-07-13 | 2019-01-22 | 菜鸟智能物流控股有限公司 | Pick-up point address recommendation method and device |
CN108509505B (en) * | 2018-03-05 | 2022-04-12 | 昆明理工大学 | Character string retrieval method and device based on partition double-array Trie |
CN108509505A (en) * | 2018-03-05 | 2018-09-07 | 昆明理工大学 | A kind of character string retrieving method and device based on subregion even numbers group Trie |
WO2019205308A1 (en) * | 2018-04-27 | 2019-10-31 | 平安科技(深圳)有限公司 | Information input method and apparatus, and terminal device and medium |
CN108920457A (en) * | 2018-06-15 | 2018-11-30 | 腾讯大地通途(北京)科技有限公司 | Address Recognition method and apparatus and storage medium |
CN109033225A (en) * | 2018-06-29 | 2018-12-18 | 福州大学 | Chinese address identifying system |
WO2020057432A1 (en) * | 2018-09-17 | 2020-03-26 | 阿里巴巴集团控股有限公司 | Address standardization method and device, storage medium and computer terminal |
CN109299469A (en) * | 2018-10-29 | 2019-02-01 | 复旦大学 | A method of identifying complicated address in long text |
CN111324679B (en) * | 2018-12-14 | 2023-04-11 | 阿里巴巴集团控股有限公司 | Method, device and system for processing address information |
CN111324679A (en) * | 2018-12-14 | 2020-06-23 | 阿里巴巴集团控股有限公司 | Method, device and system for processing address information |
CN110210020A (en) * | 2019-05-22 | 2019-09-06 | 武汉虹信通信技术有限责任公司 | The standardized system and method for address |
CN110210020B (en) * | 2019-05-22 | 2023-06-20 | 武汉虹旭信息技术有限责任公司 | Communication address standardization system and method thereof |
CN111931478B (en) * | 2020-07-16 | 2023-11-10 | 丰图科技(深圳)有限公司 | Training method of address interest surface model, and prediction method and device of address |
CN111931478A (en) * | 2020-07-16 | 2020-11-13 | 丰图科技(深圳)有限公司 | Address interest plane model training method, address prediction method and device |
CN112052673A (en) * | 2020-08-28 | 2020-12-08 | 丰图科技(深圳)有限公司 | Logistics network point identification method and device, computer equipment and storage medium |
CN112633003A (en) * | 2020-12-30 | 2021-04-09 | 平安科技(深圳)有限公司 | Address recognition method and device, computer equipment and storage medium |
CN112633003B (en) * | 2020-12-30 | 2024-05-31 | 平安科技(深圳)有限公司 | Address recognition method and device, computer equipment and storage medium |
CN112966511A (en) * | 2021-02-08 | 2021-06-15 | 广州探迹科技有限公司 | Entity word recognition method and device |
CN112966511B (en) * | 2021-02-08 | 2024-03-15 | 广州探迹科技有限公司 | Entity word recognition method and device |
CN113220836A (en) * | 2021-05-08 | 2021-08-06 | 北京百度网讯科技有限公司 | Training method and device of sequence labeling model, electronic equipment and storage medium |
CN113220836B (en) * | 2021-05-08 | 2024-04-09 | 北京百度网讯科技有限公司 | Training method and device for sequence annotation model, electronic equipment and storage medium |
WO2024000656A1 (en) * | 2022-06-29 | 2024-01-04 | 青岛海尔科技有限公司 | Place name recognition method, system and apparatus, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2015027836A1 (en) | 2015-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103440311A (en) | Method and system for identifying geographical name entities | |
CN109145169B (en) | Address matching method based on statistical word segmentation | |
CN102955833B (en) | A kind of address identification, standardized method | |
CN103440312B (en) | A kind of system and terminal of mailing address inquiry postcode | |
CN101313300B (en) | Local search | |
CN107145577A (en) | Address standardization method, device, storage medium and computer | |
CN106528526B (en) | A kind of Chinese address semanteme marking method based on Bayes's segmentation methods | |
CN109033086A (en) | A kind of address resolution, matched method and device | |
CN102955832B (en) | A kind of address identification, standardized system | |
CN108369582B (en) | Address error correction method and terminal | |
CN109344263B (en) | Address matching method | |
CN109657074B (en) | News knowledge graph construction method based on address tree | |
CN102289467A (en) | Method and device for determining target site | |
CN104866593A (en) | Database searching method based on knowledge graph | |
CN109344213B (en) | Chinese geocoding method based on dictionary tree | |
CN108763215B (en) | Address storage method and device based on address word segmentation and computer equipment | |
CN106909611B (en) | Hotel automatic matching method based on text information extraction | |
CN105224622A (en) | The place name address extraction of Internet and standardized method | |
CN110472066A (en) | A kind of construction method of urban geography semantic knowledge map | |
CN109933797A (en) | Geocoding and system based on Jieba participle and address dictionary | |
CN101777082A (en) | Correlation method of text information and geological information and system | |
CN101206121B (en) | Placename retrieval device | |
CN110442603A (en) | Address matching method, apparatus, computer equipment and storage medium | |
CN101093478A (en) | Method and system for identifying Chinese full name based on Chinese shortened form of entity | |
CN106874287A (en) | A kind of processing method and processing device of point of interest POI geocodings |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20131211 |