CN103440311A - Method and system for identifying geographical name entities - Google Patents

Publication number
CN103440311A
Authority
CN
China
Prior art keywords
address
cutting
place name
grade
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103777027A
Other languages
Chinese (zh)
Inventor
王国印
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaao Data Technology Co Ltd
Priority to CN2013103777027A
Publication of CN103440311A
Priority to PCT/CN2014/084609 (WO2015027836A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a method for identifying geographical name entities. The method comprises the steps that address texts are input and pre-processing is conducted on the address texts; address segmentation is conducted on the address texts according to dictionary metadata; address labeling is conducted on segmentation results, and an optimal address level labeling sequence is obtained; according to the context, the labeling sequence is revised, and an optimal labeling result is output. According to the dictionary metadata, the address segmentation is conducted on the address texts, then corresponding geographical name categories are labeled on the segmentation result according to a geographical name category definition table, the labeling sequence is optimized through a Viterbi algorithm, the labeling sequence is revised according to the context, and the final labeling result is obtained. Therefore, the geographical name entity identification result is accurate, and practicability is high. In addition, the invention provides a system for identifying the geographical name entities.

Description

Method and system for place name entity recognition
Technical field
The present invention relates to the field of geographic information, and in particular to a method and system for place name entity recognition.
Background
With the development of geographic information systems (GIS), remote sensing (RS) and the Global Positioning System (GPS), and especially the wide adoption of mobile location-based services (LBS), applications built on geographic information are increasingly part of everyday life. An important component of such applications, particularly those that handle addresses, is place name entity recognition. In many current natural language processing platforms, the named entity recognition component recognizes place name entities with insufficient accuracy, mainly in four respects. First, the category attribute of place name entities is too coarse: every place name is tagged simply as a place name, with no finer division by level (provincial, prefectural, county, township, community/villagers' committee, road, village, building, and so on). Second, the recognition rate is low for place names at or below the township level. Third, the case where different place names share the same abbreviation is not handled; for example, "Jilin" may refer to Jilin City or to Jilin Province. Fourth, different descriptions of the same place (place name aliases) are recognized poorly.
A place name entity recognition method with a higher recognition rate is therefore needed to solve these problems.
Summary of the invention
The present invention aims to remedy at least one of the above deficiencies.
To that end, the invention provides a method and system for place name entity recognition. The address text is segmented according to dictionary metadata; each segment is labeled with its place name category according to a place name category definition table; the label sequence is optimized with the Viterbi algorithm and then corrected according to context to obtain the final labeling result, so that place name entity recognition is accurate and practical.
One embodiment of the invention provides a method for place name entity recognition comprising the following steps: inputting and pre-processing the address text; segmenting the address text according to dictionary metadata; labeling the segmentation result with address levels and obtaining the optimal address level label sequence; and correcting the label sequence according to context and outputting the optimal labeling result.
Preferably, the method is dictionary-based, and the dictionary is stored in a double-array Trie structure.
Preferably, the method builds a Trie-based address metadata database in advance, divides the place names in the address metadata into 12 levels, and builds the corresponding place name category definition table from the address metadata database.
In one embodiment of the invention, address segmentation uses a Trie-based reverse maximum matching algorithm that scans the input address text from right to left.
In one embodiment of the invention, address labeling comprises: labeling each segment with its place name category according to the place name category definition table; and labeling any segment that cannot be found in the address metadata as level 0.
Preferably, address labeling obtains the optimal address level label sequence with the Viterbi algorithm.
Another embodiment of the invention provides a system for place name entity recognition comprising: an address text input system, which obtains the input address text and pre-processes it; an address segmentation system, which segments the address text obtained by the input system according to the address metadata; an address labeling system, which obtains the optimal address level label sequence with the Viterbi algorithm; and an address correction system, which corrects the label sequence according to context and obtains the optimal labeling result.
In another embodiment, the pre-processing performed by the address text input system comprises at least: deleting redundant spaces, and converting full-width (double-byte) digits and letters to half-width characters.
In another embodiment, the address segmentation system segments the address text according to the Trie-based address metadata database built in advance, using the Trie-based reverse maximum matching algorithm.
In another embodiment, the address labeling system labels each segment with its place name category according to the attributes of the place name in the address metadata; it labels any segment that cannot be found in the address metadata as level 0; and it then obtains the optimal address level label sequence with the Viterbi algorithm. In summary, the invention segments the address text according to dictionary metadata, labels each segment with its place name category according to the place name category definition table, optimizes the label sequence with the Viterbi algorithm, and corrects the label sequence according to context to obtain the final labeling result, so that place name entity recognition is accurate and practical.
Brief description of the drawings
Fig. 1 is a flowchart of a method for place name entity recognition according to an embodiment of the invention.
Fig. 2 is a schematic diagram of the working principle of the Trie used in an embodiment of the invention.
Detailed description
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and are not intended to limit it.
The method and system for place name entity recognition provided by the invention segment the address text according to dictionary metadata, label each segment with its place name category according to the place name category definition table, optimize the label sequence with the Viterbi algorithm, and correct the label sequence according to context to obtain the final labeling result, so that place name entity recognition is accurate and practical.
Fig. 1 is a flowchart of a method for place name entity recognition according to an embodiment of the invention. The method comprises the following steps:
Step S110: Input the address text and pre-process it.
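The summary states that pre-processing at least deletes redundant spaces and converts full-width digits and letters to half-width characters. A minimal Python sketch of that step follows. The function name `preprocess` is hypothetical, and two choices are assumptions rather than details from the patent: runs of whitespace are collapsed to one space (not removed entirely), and the full-width space U+3000 is treated as ordinary whitespace.

```python
import re


def preprocess(text: str) -> str:
    """Normalize an address string (illustrative sketch, not the patented code)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Full-width (ideographic) space becomes an ordinary space.
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width ASCII variants sit 0xFEE0 above their half-width forms.
            half = chr(code - 0xFEE0)
            # Only digits and letters are converted, as the text describes.
            out.append(half if half.isalnum() else ch)
        else:
            out.append(ch)
    # Collapse runs of whitespace and trim both ends (an assumption).
    return re.sub(r"\s+", " ", "".join(out)).strip()
```

For example, `preprocess("  广东省  深圳市１２３Ａ号　")` yields `"广东省 深圳市123A号"`.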
Given the characteristics of Chinese place names, both address segmentation and place name entity recognition in this embodiment are dictionary-based. Dictionary-based word segmentation usually uses either forward (left-to-right) matching or reverse (right-to-left) matching. Reverse matching generally has about half the segmentation error rate of forward matching and handles overlapping ambiguity better. Overlapping ambiguity is defined as follows: for three consecutive characters ABC, both AB and BC can form a word; in Chinese, BC is generally the more likely word. In this embodiment, address segmentation uses the address metadata dictionary and a reverse maximum matching algorithm that scans the user's input address text from right to left. To speed up segmentation, the dictionary uses a double-array Trie data structure.
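The right-to-left scan can be sketched as follows. This is a simplified illustration: a Python `set` stands in for the double-array Trie, the function name `reverse_max_match` and the sample dictionary are hypothetical, and the fallback of emitting a single character when nothing matches is an assumption.

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Segment `text` by scanning right to left, preferring the longest match."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Try the longest candidate ending at `end`, then progressively shorter ones.
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        # If nothing matched, this is a single unmatched character.
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens


# Hypothetical sample dictionary of place names.
SAMPLE_DICT = {"广东省", "深圳市", "南山区", "高新南环路"}
```

With this dictionary, `reverse_max_match("广东省深圳市南山区", SAMPLE_DICT)` returns `["广东省", "深圳市", "南山区"]`. Characters not covered by any dictionary entry come out as single-character tokens.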
This step requires a Trie-based address metadata database built in advance. The address metadata dictionary mainly contains place name data of various kinds, such as provincial, prefectural, county-level and township-level administrative areas, communities/villagers' committees, roads, buildings, residential communities, villages and organizations. Place name data for administrative divisions can be obtained directly from the data on China's administrative divisions published on Wikipedia and the National Bureau of Statistics website. Other data can be collected manually or extracted from complete postal addresses using address segmentation and recognition techniques.
The place name metadata mainly comprises: provincial-level administrative area names (provinces, autonomous regions, municipalities and special administrative regions); prefecture-level administrative area names (prefecture-level cities, autonomous prefectures, prefectures, leagues); county-level administrative area names (municipal districts, county-level cities, counties, autonomous counties, banners, autonomous banners, special districts and forestry districts); township-level administrative area names (townships, towns, subdistricts, sumu, district offices); and other address data (road names, village names, residential community names, building names, square names and organization names).
Two address description patterns are in common use:
Pattern 1: addresses centered on a road. The usual composition rule is: provincial administrative area + prefecture-level administrative area + county-level administrative area + road + number + building name + room number. Example: Guangdong Province, Shenzhen City, Nanshan District, Gaoxin South Ring Road, No. 29, Overseas Students Venture Building, Room 2208. This pattern is common in electronic maps such as Baidu Maps and Google Maps.
Pattern 2: addresses centered on administrative divisions. The usual composition rule is: provincial administrative area + prefecture-level administrative area + county-level administrative area + township/town/subdistrict + residents' (villagers') committee + residential community/natural village. Example: Guangdong Province, Shenzhen City, Bao'an District, Xixiang Subdistrict, Liutang Residents' Committee, Baomin Garden. This pattern is common in government departments such as civil affairs departments.
To accommodate both patterns, this embodiment divides the place names in an address into 12 levels according to their characteristics, as shown in Table 1 below.
Level | Category | Examples
1 | Provinces, autonomous regions, municipalities | Guangdong Province, Inner Mongolia Autonomous Region, Guangxi Zhuang Autonomous Region, Beijing, Chongqing, etc.
2 | Prefecture-level cities, autonomous prefectures, prefectures, leagues | Shenzhen, Guangzhou, Wuhan, Wenzhou, Yanbian Korean Autonomous Prefecture, Shigatse Prefecture, Hotan Prefecture, Turpan Prefecture, Xilingol League, etc.
3 | Districts of prefecture-level cities, districts of municipalities, counties directly under a province, autonomous counties, county-level cities, banners | Futian District, Nanshan District, Pudong New Area, Chongming County, Haidian District, Wenchang City, Ding'an County, Qingyuan Manchu Autonomous County, Conghua City, Horqin Right Front Banner, etc.
4 | Townships and towns, subdistricts, sumu, ethnic townships, district offices | Wu Town, Yuehai Subdistrict, Sanguanmiao Township, Bayan Hushu Sumu, Xujiaqiao Hui and Uygur Ethnic Township, etc.
5 | Communities, villagers' committees | Liutang Community, Shi Village villagers' committee, etc.
6 | Roads | Shennan Avenue, Keyuan South Road, Xizhimenwai Dajie, etc.
7 | Numbers | The number immediately following a road, e.g. No. 208 of a road, Lane 223 of Yangjing Road, etc.
8 | Residential communities, villages (natural villages), industrial zones | BOHO TOWN, Liutang Village, etc.
9 | Roads, lanes, alleys and residential building numbers inside a community or village | This field stores roads, lanes and alleys inside a village or community. Typical feature words: unit, lane, alley, block, building, etc.
10 | Building names | Software Building, Overseas Students Venture Building, etc.
11 | Room numbers | E.g. Room 2208 means floor 22 (floor, F), room 08, etc.
12 | Other names not covered above | Organization names, company names, non-place-names, etc.
Table 1: Definition of the 12-level address classification model.
The address metadata dictionary contains not only place name vocabulary but also the attribute of each place name, namely its place name category. The dictionary format is defined as follows. The dictionary consists of multiple lines, each line being one entry (Term). Each Term has two fields: the place name and the set of address categories (Categories) for that place name. The place name is the key, and the address levels form the key's attribute set, the Categories. The two fields are separated by a half-width semicolon ";". Some place names have several categories (for example, an alias of one standard address may also be an alias of another standard address); different categories are separated by a half-width comma ",".
By the nature of place names and the way they are used, the official standard name of a place is its full name, and any other name is an alias. For example, "Guangdong" and "Yue" are aliases of "Guangdong Province". Aliases fall into two kinds by their literal form. The first kind is a contiguous substring of the standard name, called an abbreviation or short name, such as "Guangdong" for "Guangdong Province". The second kind has no visible literal relation to the standard name, or is not a substring of it at all, such as "Yue" for "Guangdong Province". To distinguish the two, the category of the first kind is defined as the standard name's category followed by "_ABBR", and the category of the second kind as the standard name's category followed by "_ALIAS". The place name categories are therefore defined as shown in Table 2 below.
[Table 2 appears as an image in the original publication and is not reproduced here.]
Table 2: Place name category definitions.
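The entry format described above (place name, a half-width semicolon, then comma-separated categories, with "_ABBR" and "_ALIAS" suffixes for the two kinds of alias) can be parsed with a few lines of Python. This is a sketch only. The category labels used here (`PROVINCE`, `CITY`) are hypothetical placeholders, because the actual labels are in Table 2, which is not reproduced; the function names and the `"STANDARD"` marker are likewise assumptions.

```python
def parse_entry(line):
    """Split one dictionary line 'name;CAT1,CAT2' into (name, [categories])."""
    name, _, cats = line.strip().partition(";")
    return name, [c for c in cats.split(",") if c]


def split_category(category):
    """Separate a category into (base category, name kind)."""
    for suffix in ("_ABBR", "_ALIAS"):
        if category.endswith(suffix):
            return category[: -len(suffix)], suffix[1:]
    # No suffix: the entry is a standard (full) name. "STANDARD" is a made-up marker.
    return category, "STANDARD"
```

For instance, a line such as `吉林;PROVINCE_ABBR,CITY_ABBR` would record that "Jilin" abbreviates both a province name and a city name, which is the shared-abbreviation case mentioned in the background section.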
Step S120: Segment the address text according to the dictionary metadata.
In this embodiment the dictionary uses a double-array Trie data structure. For ease of understanding, Fig. 2 illustrates the principle of the Trie using forward maximum matching as the example.
As the schematic in Fig. 2 shows, a Trie is a deterministic finite automaton (DFA). Each node represents a state of the automaton. The automaton moves from state to state according to the input, verifies the transition path at each move, and completes the query when it reaches an end state or cannot move further. A Trie query has two main steps: first, in the current state, make a transition according to the current input character to obtain the position of the immediate successor state; second, verify the predecessor of that state, that is, confirm that the state it was reached from is its direct predecessor.
Constructing a Trie therefore requires storing the direct predecessor of each state. Popular Trie implementations are generally based on a double array. The two arrays are named base[] and check[]. Each index i in the arrays corresponds to a Trie node number, or equivalently to its storage position in the double array, and is also called the state number.
base[i]: stores the smallest conflict-free offset from the current state i to all of its successor states;
check[i]: stores the direct predecessor of the current state i, that is, the state from which state i was reached;
base and check are paired: base[i] and check[i] are attributes of the same state.
If the current state is s, the input character is c, and the next state t is a non-leaf node, the query must satisfy:
check[base[s] + c] = s    (Formula 1);
base[s] + c = t    (Formula 2);
The base[s] value of each state is unique.
If the current state s can move to a leaf node t, the constraints are:
base[s] = t    (Formula 3);
t = check[t]    (Formula 4);
base[t] < 0, and the value of base[t] is the negative of the position, in the lexicographically sorted set of all entries, of the entry formed by the characters on the path from the DFA's initial node 0 to the current leaf node.
Trie-based lookup is efficient: the time for one query is independent of the size of the dictionary and depends only on the length of the query string. The best case for a single query is O(1), when the first character of the text cannot be found at the first level of the Trie. The worst case is O(n), where n depends only on the depth of the Trie and the length of the query text, and the depth of the tree is determined by the longest entry in the dictionary.
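To make Formulas 1 and 2 concrete, here is a small Python sketch that builds base and check from a word list and then walks them. It is an illustration under several assumptions, not the patented implementation: character codes come from `ord`, sparse dicts stand in for the two arrays, word ends are tracked in a separate `terminal` set (the negative-base leaf convention of Formulas 3 and 4 is not implemented), and all function names are hypothetical.

```python
from collections import deque


def build_double_array(words):
    """Build sparse base/check tables for a list of words (illustrative)."""
    children, is_word = [{}], [False]       # an ordinary trie, built first
    for w in words:
        n = 0
        for ch in w:
            c = ord(ch)
            if c not in children[n]:
                children.append({})
                is_word.append(False)
                children[n][c] = len(children) - 1
            n = children[n][c]
        is_word[n] = True

    base, check, terminal = {0: 0}, {0: -1}, set()
    queue = deque([(0, 0)])                 # (trie node, double-array state)
    while queue:
        n, s = queue.popleft()
        if is_word[n]:
            terminal.add(s)
        if not children[n]:
            continue
        b = 1                               # smallest offset with no slot conflict
        while any((b + c) in check for c in children[n]):
            b += 1
        base[s] = b
        for c, child in children[n].items():
            check[b + c] = s                # Formula 1: check[base[s] + c] = s
            queue.append((child, b + c))    # Formula 2: t = base[s] + c
    return base, check, terminal


def step(base, check, s, ch):
    """Return the next state for character `ch`, or None if there is no transition."""
    if s not in base:
        return None
    t = base[s] + ord(ch)
    return t if check.get(t) == s else None


def contains(base, check, terminal, word):
    s = 0
    for ch in word:
        s = step(base, check, s, ch)
        if s is None:
            return False
    return s in terminal
```

Each lookup touches one base entry and one check entry per character, which is why query time depends on the length of the query rather than on the size of the dictionary.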
For ease of implementation, this embodiment stores check and base in a single array, with the base array at the even positions and the check array at the odd positions: base[i] -> array[2*i], check[i] -> array[2*i+1]. If the current state is s, the input character is c, and the next state t is a non-leaf node, the query constraints of this method become:
array[2*(array[2*s] + c) + 1] = s    (Formula 5);
array[2*s] + c = t    (Formula 6);
The values at the valid even positions of array are mutually distinct, i.e. unique.
If the current state s can move to a leaf node t, the constraints are:
array[2*s] = t    (Formula 7);
t = array[2*t+1]    (Formula 8);
array[2*t] < 0, and the value of array[2*t] is the negative of the position, in the lexicographically sorted set of all entries, of the entry formed by the characters on the path from the DFA's initial node 0 to the current leaf node.
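The interleaved layout can be checked on a tiny hand-built example. The sketch below is illustrative only: the character codes (a = 1, b = 2), the four-state tables, the use of -1 for unused check slots and the function names are all assumptions chosen for the demonstration, and the negative-base leaf convention is not modeled.

```python
def interleave(base, check):
    """Pack base at even positions and check at odd positions of one array."""
    array = [0] * (2 * len(base))
    for i, (b, k) in enumerate(zip(base, check)):
        array[2 * i] = b        # base[i]  -> array[2*i]
        array[2 * i + 1] = k    # check[i] -> array[2*i+1]
    return array


def step(array, s, c):
    """Transition from state s on character code c using the single array."""
    t = array[2 * s] + c                       # t = base[s] + c
    in_range = t >= 0 and 2 * t + 1 < len(array)
    if in_range and array[2 * t + 1] == s:     # check[t] must equal s
        return t
    return None


# Hand-built trie for the word "ab": state 0 --a--> state 2 --b--> state 3.
BASE = [1, 0, 1, 0]
CHECK = [-1, -1, 0, 2]
ARRAY = interleave(BASE, CHECK)
```

Here `step(ARRAY, 0, 1)` gives state 2 and `step(ARRAY, 2, 2)` gives state 3, while `step(ARRAY, 0, 2)` returns `None` because slot 3 belongs to state 2, not state 0. Keeping both values of a state adjacent in memory means one transition reads two neighboring array cells.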
Step S130: Label the segmentation result with address levels and obtain the optimal address level label sequence.
After step S120 has segmented the address text with the Trie-based reverse maximum matching algorithm, each segment is labeled with its place name category, which is read from the attributes of the corresponding place name in the address metadata dictionary. A segment that does not exist in the dictionary is an unrecognized address and is labeled level 0. The Viterbi algorithm is then applied to these labels to obtain the optimal address level label sequence.
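The lookup that produces candidate levels for each segment, with level 0 as the fallback for unknown segments, might look like the following Python sketch. The dictionary contents and the name `candidate_levels` are hypothetical; a real system would read the levels from the address metadata dictionary.

```python
def candidate_levels(tokens, level_dict):
    """Return the sorted candidate levels of each token; unknown tokens get [0]."""
    return [sorted(level_dict.get(tok, {0})) for tok in tokens]


# Hypothetical metadata: some names can belong to more than one level.
LEVELS = {
    "Guangdong": {1},
    "Shenzhen": {2, 4},
    "Bao'an": {3},
    "Xixiang": {2, 4},
}
```

A token with several candidate levels is ambiguous at this stage; choosing among the candidates is left to the Viterbi step.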
The above step is illustrated with an example. The probability model for the Viterbi algorithm is built from prior knowledge; Pi and A may take the following initial values:
Pi = {0.05, 0.45, 0.25, 0.15, 0.10};
A = {{0.05, 0.45, 0.25, 0.15, 0.10},
{0.05, 0.23, 0.45, 0.17, 0.10},
{0.05, 0.18, 0.25, 0.30, 0.22},
{0.05, 0.35, 0.05, 0.05, 0.50},
{0.05, 0.30, 0.15, 0.05, 0.45}};
Suppose the input address is "Guangdong Shenzhen Bao'an Xixiang". After address segmentation and address labeling, four candidate label sequences are obtained: "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)", "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2)", "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4)" and "Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2)". The Viterbi algorithm gives the weights of the four label sequences:
Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4); p = 0.030375;
Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (2); p = 0.0030375;
Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (4); p = 0.001125;
Guangdong (1) Shenzhen (4) Bao'an (3) Xixiang (2); p = 1.125E-4;
The most probable label sequence is the first one, so the dynamic programming algorithm outputs the first label sequence, "Guangdong (1) Shenzhen (2) Bao'an (3) Xixiang (4)".
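The weights in this example can be reproduced with a short Python sketch. It assumes that Pi and A are indexed by level 0 to 4 and that a sequence's weight is Pi of the first level times the product of the transition probabilities; this reading is an interpretation, although it does reproduce the four weights listed (for the first sequence, 0.45 × 0.45 × 0.30 × 0.50 = 0.030375). The function name and the candidate list are illustrative.

```python
PI = [0.05, 0.45, 0.25, 0.15, 0.10]
A = [
    [0.05, 0.45, 0.25, 0.15, 0.10],
    [0.05, 0.23, 0.45, 0.17, 0.10],
    [0.05, 0.18, 0.25, 0.30, 0.22],
    [0.05, 0.35, 0.05, 0.05, 0.50],
    [0.05, 0.30, 0.15, 0.05, 0.45],
]


def viterbi(candidates):
    """Return (probability, levels) of the best level sequence.

    `candidates[i]` lists the levels allowed for the i-th segment.
    """
    # best[level] = (probability, path) of the best sequence ending in `level`
    best = {level: (PI[level], [level]) for level in candidates[0]}
    for levels in candidates[1:]:
        nxt = {}
        for level in levels:
            prob, path = max(
                ((p * A[prev][level], prev_path) for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            nxt[level] = (prob, path + [level])
        best = nxt
    return max(best.values(), key=lambda item: item[0])


# Guangdong: 1; Shenzhen: 2 or 4; Bao'an: 3; Xixiang: 4 or 2.
EXAMPLE = [[1], [2, 4], [3], [4, 2]]
```

Calling `viterbi(EXAMPLE)` selects the level sequence `[1, 2, 3, 4]` with probability of about 0.030375, matching the first labeling above. Dynamic programming keeps only the best partial sequence per level at each position, so it avoids enumerating every combination.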
Step S140: Correct the label sequence according to context and output the optimal labeling result.
Step S130 cannot resolve the case where the alias of a district under a prefecture-level city is the same as the alias of a county or county-level city. For example, "Taihe County" (under Fuyang City, Anhui Province) and "Taihe District" (under Jinzhou City, Liaoning Province) share the alias "Taihe", yet they belong to different address levels. When "Fuyang (City) Taihe" or "Jinzhou (City) Taihe" occurs, the algorithm and probability model give "Taihe" the highest probability at the level-3 address position. To resolve this, the address name preceding it is used to decide whether its level is "2" or "3", and the label sequence is corrected as a special case. For example:
The input address is: "Hebei Shijiazhuang Pingshan Guyue".
The labeled address sequence is: "Hebei (1, 2, 4) Shijiazhuang (2, 4) Pingshan (2, 3, 4) Guyue (4)". The candidate levels are interpreted as follows: "Hebei" may be an alias of "Hebei Province", of "Hebei District, Tianjin", or of "Hebei Township"; "Shijiazhuang" may be an alias of "Shijiazhuang City" or "Shijiazhuang Town"; "Pingshan" may be an alias of "Pingshan County", "Pingshan District" or "Pingshan Town".
The optimal label sequence is: "Hebei (1) Shijiazhuang (2) Pingshan (3) Guyue (4)".
The label sequence after context-based correction is: "Hebei (1) Shijiazhuang (2) Pingshan (2) Guyue (4)", because "Pingshan" here is "Pingshan County".
Thus, when the alias of a district under a prefecture-level city is the same as that of a county or county-level city, the segment labeled as a level-3 address is checked to see whether its immediate predecessor is the prefecture-level city to which that district belongs; if it is not, the label is corrected. For convenience, the context rules are stored in the inverse form: each rule records, as the context of an alias, the alias of the prefecture-level city to which the county or county-level city belongs, for example (Taihe → Fuyang). When this context is matched, the labeled level is revised; otherwise no change is made.
Level-2 and level-4 addresses may also share a name, mainly when the alias of a county-level city or county is the same as the alias of a township. Because level-4 addresses can occur several times in succession within a complete address, a level-2 address is sometimes labeled as level 4. Here too the context must be examined and the label sequence revised.
Suppose the input address is: "Heilongjiang Heihe Wudalianchi New Township Hemin Village";
The optimal label sequence is: "Heilongjiang (1) Heihe (2) Wudalianchi (4) New Township (4) Hemin Village (0)",
Here "Wudalianchi" is labeled as a level-4 address, although it is in fact a county-level city. The label sequence after context-based correction is: "Heilongjiang (1) Heihe (2) Wudalianchi (2) New Township (4) Hemin Village (0)".
The solution is similar to that for districts with identical aliases. For a township that shares its name with a county, the rule kept by the system takes as context the alias of the prefecture-level city to which the county or county-level city belongs, for example (Wudalianchi → Heihe). When this context is matched, the labeled level is revised; otherwise no change is made.
For such special cases, a mechanism is therefore provided to correct the best label sequence according to context. It uses the address context to eliminate the ambiguity introduced by aliases (one alias corresponding to addresses at several levels), which makes the result more accurate.
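The context rules can be sketched as a lookup from an alias to the prefecture-level city that must immediately precede it. The Python below is illustrative: the rule table mixes the two rules named in the text, (Taihe → Fuyang) and (Wudalianchi → Heihe), with a (Pingshan → Shijiazhuang) rule inferred from the worked example, and the corrected level of 2 simply follows the corrected sequences shown above. The function and parameter names are hypothetical.

```python
# alias of a county or county-level city -> alias of its prefecture-level city
CONTEXT_RULES = {
    "Taihe": "Fuyang",
    "Wudalianchi": "Heihe",
    "Pingshan": "Shijiazhuang",   # inferred from the worked example
}


def correct(tokens, levels, rules=CONTEXT_RULES, corrected_level=2):
    """Revise levels 3 or 4 when a token's predecessor matches its context rule."""
    fixed = list(levels)
    for i in range(1, len(tokens)):
        context_matches = rules.get(tokens[i]) == tokens[i - 1]
        if context_matches and fixed[i] in (3, 4):
            fixed[i] = corrected_level
    return fixed
```

Applied to the tokens `["Hebei", "Shijiazhuang", "Pingshan", "Guyue"]` with levels `[1, 2, 3, 4]`, the function returns `[1, 2, 2, 4]`. For "Jinzhou" followed by "Taihe" no rule matches, so the levels are returned unchanged.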
Another embodiment of the present invention provides a kind of system of place name Entity recognition, and this system comprises: the address text input system, in order to obtain the input message of address text, and carry out pre-service; The address cutting system, in order to carry out cutting according to the address metadata to the address text obtained in the input system of address; The address label injection system, in order to the address grade mark sequence of the acquisition optimum by the Viterbi algorithm; The ADDRESS HYGIENE system, in order to based on context to proofread and correct the mark sequence, and obtain optimum annotation results.
In another embodiment, the preprocessing that the address text input system performs on the address text comprises at least: deleting redundant spaces and converting full-width digits or letters to half-width characters.
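A minimal sketch of such a preprocessing step, written in Python for illustration. The function names `to_half_width` and `preprocess` are hypothetical, and two choices are assumptions of this sketch rather than statements of the patent: the ideographic space (U+3000) is treated as an ordinary space, and runs of spaces are collapsed to one rather than removed entirely.

```python
import re


def to_half_width(ch):
    """Map a full-width digit or letter to its half-width form."""
    code = ord(ch)
    if code == 0x3000:  # ideographic space (assumption: treat as a space)
        return " "
    is_digit = 0xFF10 <= code <= 0xFF19
    is_upper = 0xFF21 <= code <= 0xFF3A
    is_lower = 0xFF41 <= code <= 0xFF5A
    if is_digit or is_upper or is_lower:
        # Full-width ASCII variants sit at a fixed offset from ASCII.
        return chr(code - 0xFEE0)
    return ch


def preprocess(text):
    """Convert full-width digits/letters and remove redundant spaces."""
    converted = "".join(to_half_width(ch) for ch in text)
    return re.sub(r"\s+", " ", converted).strip()


cleaned = preprocess("  \uff21\uff11\uff12  Road\u3000")
```

Other full-width symbols, such as the full-width period, are deliberately left untouched here because the text only mentions digits and letters.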
In another embodiment, the address segmentation system segments the address text according to an address metadata database built in advance on a Trie; the address segmentation system uses a Trie-based reverse maximum matching algorithm to perform the address segmentation.
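The following Python sketch illustrates reverse maximum matching over a Trie. It is an assumption-laden illustration, not the patented implementation: it uses a plain dictionary-based trie instead of a double-array Trie, the class `ReverseTrie` and function `segment` are hypothetical names, and lowercase pinyin strings stand in for Chinese characters. Characters that match no dictionary word are grouped into a single unmatched token.

```python
class ReverseTrie:
    """Dict-based trie over reversed words (a stand-in for a double-array Trie)."""

    END = "$end"  # multi-character key, so it cannot clash with a single character

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in reversed(word):
                node = node.setdefault(ch, {})
            node[self.END] = True

    def longest_suffix(self, text, end):
        """Length of the longest dictionary word ending at text[:end], or 0."""
        node, best, i = self.root, 0, end
        while i > 0 and text[i - 1] in node:
            node = node[text[i - 1]]
            i -= 1
            if self.END in node:
                best = end - i
        return best


def segment(text, trie):
    """Scan the text from right to left, always taking the longest match."""
    tokens = []
    end = len(text)
    unmatched_end = end  # right edge of the current run of unmatched characters
    while end > 0:
        n = trie.longest_suffix(text, end)
        if n == 0:
            end -= 1  # no word ends here; extend the unmatched run leftwards
            continue
        if unmatched_end > end:
            tokens.append(text[end:unmatched_end])
        tokens.append(text[end - n:end])
        end -= n
        unmatched_end = end
    if unmatched_end > 0:
        tokens.append(text[:unmatched_end])
    tokens.reverse()
    return tokens


trie = ReverseTrie(["heilongjiang", "longjiang", "heihe", "wudalianchi"])
tokens = segment("heilongjiangheihewudalianchihemincun", trie)
```

Because the longest match is preferred, "heilongjiang" is kept whole even though the shorter dictionary word "longjiang" also ends at the same position.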
In another embodiment, the address labeling system labels each segmentation result with its place name category according to the attributes of the place name in the address metadata; in addition, the address labeling system labels as grade 0 any segmentation result that cannot be found in the address metadata; the address labeling system then obtains the optimal address-grade label sequence by means of the Viterbi algorithm.
The present invention segments the address text according to the dictionary metadata, labels each segmentation result with its place name category according to the place name category definition table, optimizes the label sequence with the Viterbi algorithm, and corrects the label sequence according to context to obtain the final labeling result, so that the place name entity recognition result is accurate and practical.
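The labeling step can be illustrated with a small Viterbi search in Python. Everything numeric here is an assumption: this passage does not give the scoring model, so the hypothetical `transition_score` simply rewards a grade that increases along the address, penalizes a decrease, and treats grade 0 (not found in the metadata) as neutral. The token names and the metadata table are likewise invented for the example.

```python
def transition_score(prev, cur):
    """Hypothetical score for moving from grade `prev` to grade `cur`."""
    if prev == 0 or cur == 0:
        return 0.0  # unknown tokens carry no evidence either way
    if cur > prev:
        return 1.0  # finer-grained unit follows a coarser one
    if cur == prev:
        return 0.0
    return -1.0     # coarser unit after a finer one is unlikely


def candidate_grades(tokens, metadata):
    """Look up candidate grades; tokens missing from the metadata get grade 0."""
    return [metadata.get(token, [0]) for token in tokens]


def viterbi(candidates):
    """Return the grade sequence with the highest total transition score."""
    if not candidates:
        return []
    # best[g] = (score of the best path ending in grade g, that path)
    best = {g: (0.0, [g]) for g in candidates[0]}
    for grades in candidates[1:]:
        step = {}
        for g in grades:
            score, path = max(
                ((s + transition_score(p, g), prev_path)
                 for p, (s, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[g] = (score, path + [g])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical metadata: token "B" is ambiguous between grades 2 and 4.
metadata = {"A": [1], "B": [2, 4], "C": [3]}
grades = viterbi(candidate_grades(["A", "B", "C", "zzz"], metadata))
```

With these assumed scores the ambiguous token is resolved to grade 2, because choosing grade 4 would force a decrease to grade 3 at the next token.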

Claims (10)

1. A method for place name entity recognition, characterized in that the method comprises the following steps:
inputting an address text and preprocessing it;
segmenting the address text according to dictionary metadata;
labeling the segmentation result with address grades and obtaining an optimal address-grade label sequence;
correcting the label sequence according to context and outputting the optimal labeling result.
2. The method according to claim 1, characterized in that the method is dictionary-based, wherein the dictionary is stored in a double-array Trie structure.
3. The method according to claim 1 or 2, characterized in that the method comprises:
building in advance an address metadata database based on a Trie;
dividing the place names in the address metadata into 12 grades;
building a corresponding place name category definition table according to the address metadata database.
4. The method according to claim 1, characterized in that the address segmentation uses a Trie-based reverse maximum matching algorithm that scans the input address text from right to left to divide the address.
5. The method according to claim 1, characterized in that the address labeling comprises:
labeling each segmentation result with its place name category according to the place name category definition table;
and labeling as grade 0 any segmentation result that cannot be found in the address metadata.
6. The method according to claim 1 or 5, characterized in that the address labeling obtains the optimal address-grade label sequence by means of the Viterbi algorithm.
7. A system for place name entity recognition, characterized in that the system comprises:
an address text input system, for obtaining the input address text and preprocessing it;
an address segmentation system, for segmenting the address text obtained by the address text input system according to the address metadata;
an address labeling system, for obtaining the optimal address-grade label sequence by means of the Viterbi algorithm;
an address correction system, for correcting the label sequence according to context and obtaining the optimal labeling result.
8. The system according to claim 7, characterized in that the preprocessing performed on the address text by the address text input system comprises at least: deleting redundant spaces and converting full-width digits or letters to half-width characters.
9. The system according to claim 7, characterized in that the address segmentation system segments the address text according to an address metadata database built in advance on a Trie; and the address segmentation system uses a Trie-based reverse maximum matching algorithm to perform the address segmentation.
10. The system according to claim 7, characterized in that:
the address labeling system labels each segmentation result with its place name category according to the attributes of the place name in the address metadata;
in addition, the address labeling system labels as grade 0 any segmentation result that cannot be found in the address metadata;
the address labeling system then obtains the optimal address-grade label sequence by means of the Viterbi algorithm.
CN2013103777027A 2013-08-27 2013-08-27 Method and system for identifying geographical name entities Pending CN103440311A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2013103777027A CN103440311A (en) 2013-08-27 2013-08-27 Method and system for identifying geographical name entities
PCT/CN2014/084609 WO2015027836A1 (en) 2013-08-27 2014-08-18 Method and system for place name entity recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103777027A CN103440311A (en) 2013-08-27 2013-08-27 Method and system for identifying geographical name entities

Publications (1)

Publication Number Publication Date
CN103440311A true CN103440311A (en) 2013-12-11

Family

ID=49694004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103777027A Pending CN103440311A (en) 2013-08-27 2013-08-27 Method and system for identifying geographical name entities

Country Status (2)

Country Link
CN (1) CN103440311A (en)
WO (1) WO2015027836A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417022B (en) * 2022-03-30 2022-06-28 阿里巴巴(中国)有限公司 Model training method, data processing method and device
CN117131867B (en) * 2022-05-17 2024-05-14 贝壳找房(北京)科技有限公司 Method, apparatus, computer program product and storage medium for splitting house address

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007213211A (en) * 2006-02-08 2007-08-23 Fujifilm Corp Retrieval database, address retrieval device and address retrieval method
CN101719128A (en) * 2009-12-31 2010-06-02 浙江工业大学 Fuzzy matching-based Chinese geo-code determination method
CN102298585A (en) * 2010-06-24 2011-12-28 高德软件有限公司 Address splitting and level marking method and device
CN102955833A (en) * 2011-08-31 2013-03-06 深圳市华傲数据技术有限公司 Correspondence address identifying and standardizing method
CN102955832A (en) * 2011-08-31 2013-03-06 深圳市华傲数据技术有限公司 Correspondence address identifying and standardizing system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440311A (en) * 2013-08-27 2013-12-11 深圳市华傲数据技术有限公司 Method and system for identifying geographical name entities

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015027836A1 (en) * 2013-08-27 2015-03-05 深圳市华傲数据技术有限公司 Method and system for place name entity recognition
CN105704258B (en) * 2014-11-28 2019-11-29 方正国际软件(北京)有限公司 A kind of method and apparatus of Address Recognition
CN105704258A (en) * 2014-11-28 2016-06-22 北京山海经纬信息技术有限公司 Address recognition method and equipment
WO2016138773A1 (en) * 2015-03-05 2016-09-09 深圳市华傲数据技术有限公司 Address knowledge processing method and device based on graphs
CN106155998A (en) * 2015-04-09 2016-11-23 腾讯科技(深圳)有限公司 A kind of data processing method and device
CN106155998B (en) * 2015-04-09 2019-03-26 腾讯科技(深圳)有限公司 A kind of data processing method and device
CN106156145A (en) * 2015-04-13 2016-11-23 阿里巴巴集团控股有限公司 The management method of a kind of address date and device
CN104933024B (en) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address participle mask method
CN104933023A (en) * 2015-05-12 2015-09-23 深圳市华傲数据技术有限公司 Chinese address word segmentation and annotation method
CN104933023B (en) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address participle mask method
CN104933024A (en) * 2015-05-12 2015-09-23 深圳市华傲数据技术有限公司 Chinese address word segmentation and annotation method
CN106326206A (en) * 2015-06-24 2017-01-11 北京京东尚科信息技术有限公司 Entity extraction method based on grammar templates
CN106326206B (en) * 2015-06-24 2021-01-26 北京京东尚科信息技术有限公司 Entity extraction method based on grammar template
CN105045888A (en) * 2015-07-28 2015-11-11 浪潮集团有限公司 Participle training corpus tagging method for HMM (Hidden Markov Model)
CN106970918A (en) * 2016-01-13 2017-07-21 阿里巴巴集团控股有限公司 Generate the method and device of international address unique identifier
CN106970918B (en) * 2016-01-13 2020-10-27 菜鸟智能物流控股有限公司 Method and device for generating unique identifier of international address
CN107305540A (en) * 2016-04-20 2017-10-31 顺丰科技有限公司 Address cutting recognition methods
CN106557574A (en) * 2016-11-23 2017-04-05 广东电网有限责任公司佛山供电局 Destination address matching process and system based on tree construction
CN106557574B (en) * 2016-11-23 2020-02-04 广东电网有限责任公司佛山供电局 Target address matching method and system based on tree structure
CN107133215A (en) * 2017-05-20 2017-09-05 复旦大学 A kind of Chinese canonical address recognition methods of offline handwriting
CN109145095A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Information of place names matching process, information matching method, device and computer equipment
CN109145095B (en) * 2017-06-16 2024-03-29 贵州小爱机器人科技有限公司 Place name information matching method, information matching device and computer equipment
CN107247792A (en) * 2017-06-16 2017-10-13 中国电子技术标准化研究院 Match method, device and the computer equipment of functional department
CN107247792B (en) * 2017-06-16 2021-01-15 中国电子技术标准化研究院 Method and device for matching functional departments and computer equipment
CN109255564A (en) * 2017-07-13 2019-01-22 菜鸟智能物流控股有限公司 Pick-up point address recommendation method and device
CN108509505B (en) * 2018-03-05 2022-04-12 昆明理工大学 Character string retrieval method and device based on partition double-array Trie
CN108509505A (en) * 2018-03-05 2018-09-07 昆明理工大学 A kind of character string retrieving method and device based on subregion even numbers group Trie
WO2019205308A1 (en) * 2018-04-27 2019-10-31 平安科技(深圳)有限公司 Information input method and apparatus, and terminal device and medium
CN108920457A (en) * 2018-06-15 2018-11-30 腾讯大地通途(北京)科技有限公司 Address Recognition method and apparatus and storage medium
CN109033225A (en) * 2018-06-29 2018-12-18 福州大学 Chinese address identifying system
WO2020057432A1 (en) * 2018-09-17 2020-03-26 阿里巴巴集团控股有限公司 Address standardization method and device, storage medium and computer terminal
CN109299469A (en) * 2018-10-29 2019-02-01 复旦大学 A method of identifying complicated address in long text
CN111324679B (en) * 2018-12-14 2023-04-11 阿里巴巴集团控股有限公司 Method, device and system for processing address information
CN111324679A (en) * 2018-12-14 2020-06-23 阿里巴巴集团控股有限公司 Method, device and system for processing address information
CN110210020A (en) * 2019-05-22 2019-09-06 武汉虹信通信技术有限责任公司 The standardized system and method for address
CN110210020B (en) * 2019-05-22 2023-06-20 武汉虹旭信息技术有限责任公司 Communication address standardization system and method thereof
CN111931478B (en) * 2020-07-16 2023-11-10 丰图科技(深圳)有限公司 Training method of address interest surface model, and prediction method and device of address
CN111931478A (en) * 2020-07-16 2020-11-13 丰图科技(深圳)有限公司 Address interest plane model training method, address prediction method and device
CN112052673A (en) * 2020-08-28 2020-12-08 丰图科技(深圳)有限公司 Logistics network point identification method and device, computer equipment and storage medium
CN112633003A (en) * 2020-12-30 2021-04-09 平安科技(深圳)有限公司 Address recognition method and device, computer equipment and storage medium
CN112633003B (en) * 2020-12-30 2024-05-31 平安科技(深圳)有限公司 Address recognition method and device, computer equipment and storage medium
CN112966511A (en) * 2021-02-08 2021-06-15 广州探迹科技有限公司 Entity word recognition method and device
CN112966511B (en) * 2021-02-08 2024-03-15 广州探迹科技有限公司 Entity word recognition method and device
CN113220836A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Training method and device of sequence labeling model, electronic equipment and storage medium
CN113220836B (en) * 2021-05-08 2024-04-09 北京百度网讯科技有限公司 Training method and device for sequence annotation model, electronic equipment and storage medium
WO2024000656A1 (en) * 2022-06-29 2024-01-04 青岛海尔科技有限公司 Place name recognition method, system and apparatus, and storage medium

Also Published As

Publication number Publication date
WO2015027836A1 (en) 2015-03-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131211