CN101477570A - Self-learning Chinese address judging method - Google Patents

Publication number
CN101477570A
Authority
CN
China
Prior art keywords
address
information
standard
standard degree
redundant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2009100953779A
Other languages
Chinese (zh)
Inventor
胡天磊
陈珂
陈刚
周佳庆
寿黎但
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2009-01-12
Filing date
2009-01-12
Publication date
2009-07-08
Application filed by Zhejiang University ZJU
Priority to CNA2009100953779A
Publication of CN101477570A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a self-learning method for detecting duplicate Chinese addresses. All address data to be deduplicated is first preprocessed: the standard degree of each address is computed with the standard degree formula, redundant information is extracted from the addresses that meet the standard degree condition, and a confidence level is computed for each piece of redundant information. Credible redundant information is then used to replace the matching text in subsequent address data before duplicate detection is performed. The method does not rely on domain knowledge and, while maintaining analysis precision, markedly reduces the proportion of false and missed duplicate judgments.

Description

A self-learning method for Chinese address duplicate detection
Technical field
The present invention relates to the cleaning and duplicate detection (deduplication) of massive data, and in particular to a duplicate detection method for Chinese address data that does not rely on domain knowledge.
Background art
With the rapid development of Chinese search engines and massive-data mining, efficient duplicate detection for Chinese addresses, a key technology in both, has drawn wide attention from industry and academia and has become a research focus. Chinese addresses are written flexibly and their semantics vary, so compared with deduplicating English addresses, Chinese address duplicate detection faces new requirements and challenges.
Existing data deduplication methods concentrate mainly on judging text similarity, judging dependencies between data items, resolving data abbreviations, and reducing algorithmic complexity when handling massive data. These methods and their variants handle regular English data effectively, but for Chinese data, and Chinese addresses in particular, they can only judge repetition mechanically from the literal similarity of the text, which is a considerable limitation. For example, the two addresses "38 Zheda Road, Hangzhou" and "Zhejiang University Yuquan Campus, Hangzhou" in fact refer to the same place, but because they are written differently, existing programs cannot automatically judge them identical. An accurate judgment is possible only if externally predefined domain knowledge converts "Zhejiang University Yuquan Campus" into "38 Zheda Road". Such domain knowledge is very large, however, and predefining it externally is rarely feasible in practice. For smaller deduplication applications, using a very large body of domain knowledge is also clearly inappropriate.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing a self-learning method for Chinese address duplicate detection.
The object is achieved through the following technical solution. A self-learning method for Chinese address duplicate detection comprises the following steps:
(1) Preprocess all address data to extract redundant information. The concrete steps are as follows:
(A) Address segmentation: cut each complete address into sub-address information at each level.
(B) Address standard degree calculation: compute the standard degree of each address by computing the standard degree of each level of sub-address information separately and summing the results with weights to obtain the standard degree of the whole address. The standard degree of a sub-address is computed as follows:
First step: in the corresponding subfield of the address data already accepted as standard, count how many times this value occurs. At the same time analyze the structure of the sub-address; if it is composed of finer subfields, split it a second time and compute the standard degree of each subfield separately. The rule is that the more often a value occurs, the more standard it is.
Second step: segment the sub-address information into words and use the average number of characters per word as the measure. A smaller average means the sub-address contains fewer recognizable words, so it is less likely to be standard.
Third step: analyze basic textual information of the sub-address, such as its character count, and compute how likely the field is to be valid.
Combine the standard degree information from the three steps. If the occurrence count from the first step exceeds a threshold, use the first-step result alone as the standard degree of the sub-address. If it is below the threshold, obtain the standard degree as a weighted sum of the results of steps one and two, or of steps one, two and three, depending on the situation.
(C) Extract redundant information from the address data whose standard degree exceeds a given threshold, and save it as data pairs of the form {canonical data, redundant data, occurrence count} for later retrieval.
(D) Screen all redundant pairs and mark those whose occurrence count exceeds a given threshold as credible redundant pairs.
(2) Traverse all addresses awaiting duplicate detection. If an address contains redundant data extracted in step (1) and that redundant information belongs to a credible redundant pair, replace it with the corresponding canonical data. Then run duplicate detection on all addresses after replacement.
(3) For addresses that are added dynamically later, process each address in sequence: compute its standard degree, extract redundant information and update the redundant pairs, replace redundant information, and perform duplicate detection.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention is a duplicate detection method that is accurate without relying on domain knowledge. By using information obtained through self-learning, it achieves considerably higher accuracy in address duplicate detection than traditional methods that do not self-learn.
(2) The invention needs no dedicated address knowledge base, so maintenance cost is low and operation is simple. Unlike traditional methods that depend on an address knowledge base, it is widely applicable, for example in vertical search engines, data warehouses and mail systems.
The invention is therefore a method suited to the Internet environment for accurately and efficiently detecting duplicates among massive numbers of Chinese addresses.
Description of drawings
Fig. 1 is a flowchart of the self-learning Chinese address duplicate detection method.
Embodiment
In applications that need address duplicate detection, such as vertical search engines and data integration systems, using this method for the actual deduplication work yields more accurate results than traditional methods that do not use domain knowledge. The concrete implementation steps are as follows:
1. Preprocess all pending addresses once. This mainly involves the following tasks:
1) Cut each address into subfields:
Methods such as keyword matching can be used. Because of the semantic diversity of Chinese, the quality of the segmentation should be ensured as far as possible. For example, the address "38 Zheda Road, Xihu District, Hangzhou City, Zhejiang Province" is cut into the fields "Zhejiang Province", "Hangzhou City", "Xihu District", "Zheda Road" and "No. 38".
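The keyword-matching segmentation described above might be sketched as follows. This is a minimal illustration, not the patent's implementation: the suffix table `LEVEL_SUFFIXES`, the function name `split_address`, and the Chinese form of the example address (浙江省杭州市西湖区浙大路38号, a back-translation of the example in the text) are all assumptions.

```python
# Hypothetical suffix keywords that close each address level, coarse to fine.
LEVEL_SUFFIXES = [
    ("province", "省"),
    ("city", "市"),
    ("district", "区"),
    ("road", "路"),
    ("number", "号"),
]


def split_address(address: str) -> dict[str, str]:
    """Cut an address into subfields by scanning for level suffixes in order."""
    fields: dict[str, str] = {}
    rest = address
    for level, suffix in LEVEL_SUFFIXES:
        pos = rest.find(suffix)
        if pos == -1:
            continue  # this level is absent, try the next finer level
        fields[level] = rest[: pos + 1]  # keep the suffix character in the field
        rest = rest[pos + 1:]
    if rest:
        fields["remainder"] = rest  # text no keyword could classify
    return fields


print(split_address("浙江省杭州市西湖区浙大路38号"))
```

Fixed suffix matching misfires whenever a suffix character occurs inside a name, which is one reason the text asks for care with segmentation quality.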
2) Compute the standard degree of each segmented address:
The standard degree of an address is the weighted sum of the standard degrees of its subfields. The standard degree of each subfield can be computed along the following three lines:
A) In the corresponding subfield of the address set already accepted as standard, count how many times this value occurs. The count is denoted fre_i (fre_i ∈ N). The more often the value occurs, the more credible the field is as standard.
B) Analyze the content structure of the value:
For example, if the road-name field contains a compound road name such as "Wen San Road, 'ask the intelligence' lane" (as machine-translated), split it a second time into "Wen San Road" and "'ask the intelligence' lane", and evaluate each part as in A). In addition, apply Chinese word segmentation and use the average number of characters per word as the measure: a smaller average means fewer recognizable words, so the field is less likely to be standard. This value is denoted seg_i (0 < seg_i < 1); the concrete way to compute it can be chosen per application.
C) Analyze information such as the character count of the field and compute how likely the field is to be valid. This value is denoted wc_i:
A fairly simple approach is wc_i = (number of characters in the field) - 1. The larger wc_i is, the more likely the field is non-standard.
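The two text-based measures can be illustrated with a short sketch. The text leaves the computation of seg_i to the application, so the mapping avg / (avg + 1) used here is purely an assumption chosen to keep the value inside (0, 1). The word list is assumed to come from some external Chinese word segmenter, and the function names are hypothetical.

```python
def seg_score(words: list[str]) -> float:
    """Map the average word length in characters into the open interval (0, 1).

    Assumed mapping: avg / (avg + 1), which grows with the average length.
    """
    if not words or any(len(w) == 0 for w in words):
        raise ValueError("need a non-empty list of non-empty words")
    avg = sum(len(w) for w in words) / len(words)
    return avg / (avg + 1.0)


def wc_score(field: str) -> int:
    """The simple measure from the text: character count of the field minus 1."""
    return len(field) - 1


# A field split into single characters scores lower than one split into real words.
print(seg_score(["文", "三", "路"]), seg_score(["文三", "路"]), wc_score("文三路"))
```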
Compute r_i, the standard degree of subfield i, from the three steps above in turn. If fre_i is 3 or more, set r_i directly to 1. If 0 < fre_i < 3, combine the values of fre_i and seg_i:
$$r_i = \frac{1}{3}\,fre_i \times 70\% + seg_i \times 30\% \qquad (0 < fre_i < 3)$$
The boundary of 3 for fre_i comes mainly from experimental experience. Many Chinese road names and similar items are rare and do not form words, so if fre_i = 0, the values of seg_i and wc_i can be combined by weight:
$$r_i = seg_i \times 70\% + \frac{2}{wc_i} \times 30\% \qquad (fre_i = 0)$$
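The piecewise rule for r_i and the weighted sum over subfields can be sketched as below. Several points are assumptions rather than statements of the source: the wc_i term of the last formula is garbled in the source text and is read here as 2 / wc_i, capped at 1 so that r_i stays within [0, 1]; the address-level weights are normalised by their sum; and both function names are hypothetical.

```python
def subfield_standard_degree(fre: int, seg: float, wc: int) -> float:
    """Standard degree r_i of one subfield from fre_i, seg_i and wc_i."""
    if fre < 0:
        raise ValueError("fre must be non-negative")
    if fre >= 3:
        return 1.0  # seen often enough in the standard set
    if fre > 0:
        return fre / 3 * 0.7 + seg * 0.3
    # fre == 0: assumed reading of the garbled term is 2 / wc, capped at 1.
    wc_term = 1.0 if wc <= 2 else 2.0 / wc
    return seg * 0.7 + wc_term * 0.3


def address_standard_degree(degrees: list[float], weights: list[float]) -> float:
    """Weighted sum of subfield standard degrees (weights normalised, an assumption)."""
    if len(degrees) != len(weights) or not degrees:
        raise ValueError("degrees and weights must be non-empty and equally long")
    total = sum(weights)
    return sum(d * w for d, w in zip(degrees, weights)) / total


print(address_standard_degree([1.0, subfield_standard_degree(0, 0.5, 4)], [1, 1]))
```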
3) Traverse all addresses whose standard degree has been computed. For those whose standard degree exceeds a given threshold, extract the redundant information they contain and save it:
For example, the redundant information extracted from the address "38 Zheda Road, Hangzhou City, Zhejiang Province, Zhejiang University Yuquan Campus" forms the pair {38 Zheda Road, Zhejiang University Yuquan Campus, N}, where N is the number of times this pair has occurred. N also serves to judge the validity of the pair: the larger N is, the more valid the pair.
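One way to hold the {canonical, redundant, N} pairs is a counter keyed by the (canonical, redundant) tuple. The sketch below assumes that segmentation and standard degree calculation have already produced `(degree, canonical, redundant)` triples; that input shape, the threshold value 0.8, and the name `collect_pairs` are all hypothetical.

```python
from collections import Counter


def collect_pairs(parsed: list[tuple[float, str, str]],
                  min_degree: float = 0.8) -> Counter:
    """Count (canonical, redundant) pairs from sufficiently standard addresses.

    parsed holds (standard_degree, canonical_part, redundant_part) triples;
    an empty redundant_part means the address carried no redundant text.
    """
    pairs: Counter = Counter()
    for degree, canonical, redundant in parsed:
        if degree > min_degree and canonical and redundant:
            pairs[(canonical, redundant)] += 1  # N for this pair
    return pairs


parsed = [
    (0.90, "浙大路38号", "浙江大学玉泉校区"),
    (0.95, "浙大路38号", "浙江大学玉泉校区"),
    (0.40, "浙大路38号", "玉泉"),  # standard degree too low, ignored
    (0.90, "文三路1号", ""),        # no redundant text, ignored
]
print(collect_pairs(parsed))
```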
4) Screen the redundant information:
Using N from step 3) as the criterion, filter out all redundant pairs whose occurrence count does not meet a given threshold.
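The screening step amounts to a threshold filter over the pair counts. In this sketch the threshold value, the choice of "at least `min_count`" as the test, and the rule of keeping the most frequent canonical form when one redundant string maps to several are assumptions; the text only states that N is the criterion.

```python
def credible_pairs(pairs: dict[tuple[str, str], int],
                   min_count: int = 3) -> dict[str, str]:
    """Return a redundant -> canonical map for pairs seen at least min_count times."""
    best: dict[str, tuple[str, int]] = {}
    for (canonical, redundant), count in pairs.items():
        if count < min_count:
            continue  # not credible enough
        # Assumed tie-break: keep the canonical form with the highest count.
        if redundant not in best or count > best[redundant][1]:
            best[redundant] = (canonical, count)
    return {redundant: canonical for redundant, (canonical, _) in best.items()}


pairs = {
    ("浙大路38号", "浙江大学玉泉校区"): 10,
    ("浙大路83号", "浙江大学玉泉校区"): 3,   # probably a typo, outvoted
    ("文三路1号", "某大厦"): 1,              # below the threshold
}
print(credible_pairs(pairs))
```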
2. After preprocessing, replace redundant information in all addresses:
Check each address. If it contains redundant information that passed the screening in preprocessing, replace that information with the real address information. For example, if the redundant information includes the pair {38 Zheda Road, Zhejiang University Yuquan Campus, 10} and the keyword "Zhejiang University Yuquan Campus" appears in the actual address "Zhejiang University Yuquan Campus, Hangzhou", the address is finally replaced by "38 Zheda Road, Hangzhou".
After all addresses have been checked, run duplicate detection once over all of them. Methods such as clustering-based duplicate detection can be used.
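Replacement followed by duplicate detection might look like the sketch below. Exact string equality after replacement stands in for the clustering-based detection the text mentions, which is a deliberate simplification. Dropping the redundant text when the canonical form is already present is likewise an assumption, and the function names are hypothetical.

```python
def replace_redundant(address: str, credible: dict[str, str]) -> str:
    """Rewrite an address using a redundant -> canonical map."""
    # Longest redundant strings first, so a short one cannot split a long one.
    for redundant in sorted(credible, key=len, reverse=True):
        if redundant not in address:
            continue
        canonical = credible[redundant]
        # Assumption: if the canonical form is already there, just drop the alias.
        replacement = "" if canonical in address else canonical
        address = address.replace(redundant, replacement)
    return address


def group_duplicates(addresses: list[str],
                     credible: dict[str, str]) -> dict[str, list[str]]:
    """Group raw addresses by their form after replacement (exact match)."""
    groups: dict[str, list[str]] = {}
    for raw in addresses:
        groups.setdefault(replace_redundant(raw, credible), []).append(raw)
    return groups


credible = {"浙江大学玉泉校区": "浙大路38号"}
print(group_duplicates(["杭州市浙大路38号", "杭州市浙江大学玉泉校区", "杭州市文三路1号"], credible))
```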
3. For new addresses that arrive later, apply steps 1 and 2 above together, specifically:
1) Cut the address into subfields.
2) Compute the standard degree of the address from the segmented subfields.
3) Extract redundant information from the address, if any. If the address contains known redundant information, replace it with the actual address.
4) Run duplicate detection between this address and all existing addresses.
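The incremental flow for newly arriving addresses can be sketched as a small stateful class. Segmentation and standard degree calculation are assumed to happen outside the class, which receives their results through `learn`. The class name, both thresholds, and exact matching as the duplicate test are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


class IncrementalDeduplicator:
    """Keeps learned redundant pairs and the addresses seen so far."""

    def __init__(self, min_count: int = 2, min_degree: float = 0.8) -> None:
        self.min_count = min_count
        self.min_degree = min_degree
        self.pair_counts: Counter = Counter()
        self.seen: dict[str, str] = {}  # replaced form -> first raw address

    def learn(self, degree: float, canonical: str, redundant: str) -> None:
        """Update pair counts from one sufficiently standard address."""
        if degree > self.min_degree and canonical and redundant:
            self.pair_counts[(canonical, redundant)] += 1

    def normalise(self, address: str) -> str:
        """Replace every credible redundant string with its canonical form."""
        for (canonical, redundant), count in self.pair_counts.items():
            if count >= self.min_count and redundant in address:
                address = address.replace(redundant, canonical)
        return address

    def add(self, address: str) -> Optional[str]:
        """Return the earlier duplicate of this address, or None if it is new."""
        key = self.normalise(address)
        if key in self.seen:
            return self.seen[key]
        self.seen[key] = address
        return None


dedup = IncrementalDeduplicator(min_count=2)
dedup.learn(0.9, "浙大路38号", "浙江大学玉泉校区")
dedup.learn(0.9, "浙大路38号", "浙江大学玉泉校区")
print(dedup.add("杭州市浙大路38号"), dedup.add("杭州市浙江大学玉泉校区"))
```

Addresses stored before a pair became credible are not re-keyed in this sketch, so a fuller version would need to revisit `seen` whenever the set of credible pairs changes.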

Claims (1)

1. A self-learning method for Chinese address duplicate detection, characterized by comprising the following steps:
(1) Preprocess all address data to extract redundant information.
(2) Traverse all addresses awaiting duplicate detection. If an address contains redundant data extracted in step (1) and that redundant information belongs to a credible redundant pair, replace it with the corresponding canonical data. Then run duplicate detection on all addresses after replacement.
(3) For addresses that are added dynamically later, process each address in sequence: compute its standard degree, extract redundant information and update the redundant pairs, replace redundant information, and perform duplicate detection.
2. The self-learning method for Chinese address duplicate detection according to claim 1, characterized in that step (1) comprises the following concrete steps:
(A) Address segmentation: cut each complete address into sub-address information at each level.
(B) Address standard degree calculation: compute the standard degree of each address by computing the standard degree of each level of sub-address information separately and summing the results with weights to obtain the standard degree of the whole address. The standard degree of a sub-address is computed as follows:
First step: in the corresponding subfield of the address data already accepted as standard, count how many times this value occurs. At the same time analyze the structure of the sub-address; if it is composed of finer subfields, split it a second time and compute the standard degree of each subfield separately. The rule is that the more often a value occurs, the more standard it is.
Second step: segment the sub-address information into words and use the average number of characters per word as the measure. A smaller average means the sub-address contains fewer recognizable words, so it is less likely to be standard.
Third step: analyze basic textual information of the sub-address and compute how likely the field is to be valid.
Combine the standard degree information from the three steps. If the occurrence count from the first step exceeds a threshold, use the first-step result alone as the standard degree of the sub-address. If it is below the threshold, obtain the standard degree as a weighted sum of the results of steps one and two, or of steps one, two and three, depending on the situation.
(C) Extract redundant information from the address data whose standard degree exceeds a given threshold, and save it as data pairs of the form {canonical data, redundant data, occurrence count} for later retrieval.
(D) Screen all redundant pairs and mark those whose occurrence count exceeds a given threshold as credible redundant pairs.
CNA2009100953779A 2009-01-12 2009-01-12 Self-learning Chinese address judging method Pending CN101477570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2009100953779A CN101477570A (en) 2009-01-12 2009-01-12 Self-learning Chinese address judging method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2009100953779A CN101477570A (en) 2009-01-12 2009-01-12 Self-learning Chinese address judging method

Publications (1)

Publication Number Publication Date
CN101477570A true CN101477570A (en) 2009-07-08

Family

ID=40838286

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2009100953779A Pending CN101477570A (en) 2009-01-12 2009-01-12 Self-learning Chinese address judging method

Country Status (1)

Country Link
CN (1) CN101477570A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750852A (en) * 2015-04-14 2015-07-01 海量云图(北京)数据技术有限公司 Method for finding and classifying Chinese address data
CN104750852B (en) * 2015-04-14 2018-03-09 海量云图(北京)数据技术有限公司 The discovery of Chinese address data and sorting technique
CN108090221A (en) * 2018-01-02 2018-05-29 北京市燃气集团有限责任公司 A kind of correlating method of combustion gas card data and user management data
CN108090221B (en) * 2018-01-02 2019-05-10 北京市燃气集团有限责任公司 A kind of correlating method of combustion gas card data and user management data
CN109582969A (en) * 2018-12-04 2019-04-05 联想(北京)有限公司 Methodology for Entities Matching, device and electronic equipment
CN113255398A (en) * 2020-02-10 2021-08-13 百度在线网络技术(北京)有限公司 Interest point duplicate determination method, device, equipment and storage medium
CN113255398B (en) * 2020-02-10 2023-08-18 百度在线网络技术(北京)有限公司 Point of interest weight judging method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090708