CN105068989A - Place name and address extraction method and apparatus - Google Patents


Info

Publication number
CN105068989A
Authority
CN
China
Prior art keywords
place name
feature words
name address
suffix
prefix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510437893.0A
Other languages
Chinese (zh)
Other versions
CN105068989B (en
Inventor
刘纪平
罗安
王勇
王克永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Academy of Surveying and Mapping
Original Assignee
Chinese Academy of Surveying and Mapping
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Academy of Surveying and Mapping filed Critical Chinese Academy of Surveying and Mapping
Priority to CN201510437893.0A priority Critical patent/CN105068989B/en
Publication of CN105068989A publication Critical patent/CN105068989A/en
Application granted granted Critical
Publication of CN105068989B publication Critical patent/CN105068989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the field of information technology, and in particular to a place name and address extraction method and apparatus. The place name and address extraction method comprises: performing word segmentation on a target text to obtain a to-be-matched word set; performing text matching on the prefixes and suffixes of the to-be-matched word set using, respectively, the prefix feature words and suffix feature words in a prefix-suffix identification window, and obtaining candidate place names and addresses according to the text matching result; and extracting screening feature words from the candidate place names and addresses, and filtering and screening the candidate place names and addresses according to the screening feature words. The method and apparatus alleviate the relatively high difficulty of obtaining place names and addresses from massive web page text.

Description

Place name address extraction method and device
Technical field
The present invention relates to the field of information technology, and in particular to a place name address extraction method and device.
Background technology
With the development of Internet and computer technology, massive amounts of Internet information now touch every aspect of users' lives. Users can obtain news, reports, military affairs, daily-life information and the like from the Internet, and can find in web page text the time and place at which the reported events occurred. As the volume of Internet content grows, more and more place name addresses are expressed through Internet news and similar information. Such information is updated promptly, is large in volume and rich in content, and has become an important way in which place name addresses are expressed. At the same time, the geographic information industry places ever higher demands on the currency of geographic information data. Place name addresses extracted from massive web page text can not only enrich the content of geographic information but also support the analysis, research and decision-making of events of concern to government, and are gradually becoming an important data source for acquiring geographic information data. How to obtain accurate place name address data from web text has therefore become an important and urgent problem in acquiring and updating geographic information data. At present, traditional place name address extraction methods are mainly based on dictionaries, statistics, rules and machine learning. These methods depend heavily on a traditional place name address database and have great difficulty recognizing fuzzy place name addresses or place name addresses that are not yet recorded in the database.
Summary of the invention
The object of the present invention is to provide a place name address extraction method and device that alleviate the difficulty of obtaining place name addresses from massive web page text.
In a first aspect, an embodiment of the present invention provides a place name address extraction method, comprising: segmenting a target text into words to obtain a set of words to be matched; performing text matching on the prefix and suffix of the words to be matched using, respectively, the prefix feature words and suffix feature words in a prefix-suffix identification window, and obtaining candidate place name addresses according to the result of the text matching; and extracting screening feature words from the candidate place name addresses, and filtering and screening the candidate place name addresses according to the screening feature words.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein, before the target text is segmented, the method further comprises: using web crawler technology to capture the web page text of a target web page, and taking the captured web page text as the target text.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein the method further comprises: extracting prefix feature words and suffix feature words, respectively, from the place name address text data in a corpus text library containing place name addresses; and forming the prefix-suffix identification window from the extracted prefix feature words and suffix feature words.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein forming the prefix-suffix identification window from the extracted prefix feature words and suffix feature words comprises: performing frequency statistics on the prefix feature words and suffix feature words extracted from the corpus text library, and assigning weights to the prefix feature words and suffix feature words in the window according to the result of the frequency statistics; and determining the matching order of the prefix feature words and suffix feature words in the window according to their weights.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein performing text matching on the prefix and suffix of the words to be matched using, respectively, the prefix feature words and suffix feature words in the prefix-suffix identification window comprises: performing text matching on the prefix and suffix of the words to be matched in the matching order determined for the prefix feature words and suffix feature words in the window.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, wherein performing text matching on the prefix and suffix of the words to be matched using, respectively, the prefix feature words and suffix feature words in the prefix-suffix identification window comprises: matching the prefix of the words to be matched using the prefix feature words in the window; and, after the prefix has been matched, matching the suffix of the words to be matched using the suffix feature words in the window.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation of the first aspect, wherein extracting screening feature words from the candidate place name addresses and filtering and screening the candidate place name addresses according to the screening feature words comprises: when the screening feature words extracted from a candidate place name address include at least one of an administrative division element, a proper noun term, latitude and longitude information, and an enterprise or public institution feature word, determining that the candidate is a place name address that conforms to the place name address rules; and when the screening feature words extracted from a candidate place name address include both a surname and a person description, or both a personal pronoun and a person description, rejecting the candidate.
In a second aspect, an embodiment of the present invention further provides a place name address extraction device, comprising: a word segmentation module, configured to segment a target text into words to obtain a set of words to be matched; a prefix-suffix matching module, configured to perform text matching on the prefix and suffix of the words to be matched using, respectively, the prefix feature words and suffix feature words in a prefix-suffix identification window, and to obtain candidate place name addresses according to the result of the text matching; and a filtering and screening module, configured to extract screening feature words from the candidate place name addresses and to filter and screen the candidate place name addresses according to the screening feature words.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the device further comprises: a text capture module, configured to use web crawler technology, before the target text is segmented, to capture the web page text of a target web page and take the captured web page text as the target text.
With reference to the second aspect, an embodiment of the present invention provides a second possible implementation of the second aspect, wherein the device further comprises: a feature word extraction module, configured to extract prefix feature words and suffix feature words, respectively, from the place name address text data in a corpus text library containing place name addresses; and an identification window forming module, configured to form the prefix-suffix identification window from the extracted prefix feature words and suffix feature words.
In the place name address extraction method and device of the embodiments of the present invention, the target text of a web page is segmented into independent words or characters; the prefix-suffix identification window is then matched against the segmented text to obtain candidate place name addresses; finally, the candidates are filtered and screened according to the feature words they contain to obtain the final place name addresses. With this method, place name addresses can be extracted from massive web page text with relative ease, which alleviates the difficulty, present in the prior art, of extracting place name addresses from massive web page text.
To make the above objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope. A person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 shows a flowchart of a place name address extraction method in an embodiment of the present invention;
Fig. 2 shows a schematic structural diagram of a prefix-suffix identification window in an embodiment of the present invention;
Fig. 3 shows another flowchart of the place name address extraction method in an embodiment of the present invention;
Fig. 4 shows another schematic structural diagram of the prefix-suffix identification window in an embodiment of the present invention;
Fig. 5 shows a schematic structural diagram of a place name address extraction device in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments, as generally described and illustrated in the drawings, can be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
To extract place name addresses from web pages, an embodiment of the present invention provides a place name address extraction method. As shown in Fig. 1, the main processing steps are as follows.
Step S11: segment the target text into words to obtain the set of words to be matched.
In the present invention, segmenting the target text mainly means segmenting Chinese text. Chinese word segmentation uses a segmentation algorithm to cut Chinese web page text data into independent words or Chinese characters, which is more complex than splitting an English string. This step uses a mature, widely used Chinese word segmentation algorithm and analyzes and verifies the segmentation result to reduce segmentation errors.
Step S12: perform text matching on the prefix and suffix of the words to be matched using, respectively, the prefix feature words and suffix feature words in the prefix-suffix identification window, and obtain candidate place name addresses according to the result of the text matching.
In web page text, a place name address generally appears together with corresponding prefix and suffix words. For example, in a phrase meaning "a library located in somewhere", the word "located in" can serve as the prefix word of "somewhere" and the word immediately after "somewhere" can serve as its suffix word; the place name address lies between the prefix word and the suffix word. The embodiment of the present invention exploits exactly this syntactic structure when identifying place name addresses in web page text, and uses the prefix-suffix identification window to extract them.
Fig. 2 shows one structure of the prefix-suffix identification window. As the figure shows, the window comprises prefix feature words (for example, "located in") and suffix feature words, and the number of prefix feature words and suffix feature words is not fixed. When the window is used to match the words to be matched, the prefix feature words are tried first, one after another; once a prefix feature word matches, the suffix feature words are tried one after another; and once a suffix feature word also matches, the words between the matched prefix and suffix are extracted as a candidate place name address.
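The matching procedure of step S12 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `extract_candidates`, the `max_span` limit, and the sample suffix word "的" are assumptions introduced here and do not appear in the source.

```python
from typing import List, Sequence


def extract_candidates(tokens: Sequence[str],
                       prefixes: Sequence[str],
                       suffixes: Sequence[str],
                       max_span: int = 6) -> List[str]:
    """Return the text lying between a prefix feature word and the next suffix feature word.

    max_span (hypothetical) caps how many tokens a candidate may contain.
    """
    candidates: List[str] = []
    for i, token in enumerate(tokens):
        # Step 1: the prefix feature words are tried first.
        if token not in prefixes:
            continue
        # Step 2: only after a prefix hit, scan forward for a suffix feature word.
        last = min(len(tokens), i + max_span + 2)
        for j in range(i + 1, last):
            if tokens[j] in suffixes:
                if j > i + 1:  # keep only non-empty spans
                    candidates.append("".join(tokens[i + 1:j]))
                break
    return candidates


# Segmented sample sentence; "位于" means "located in", "的" is an assumed suffix word.
tokens = ["图书馆", "位于", "北京", "海淀", "的", "一座", "大楼"]
print(extract_candidates(tokens, prefixes=["位于"], suffixes=["的"]))
```

The sketch treats the window as two plain word lists; ordering the lists by weight, as the later steps describe, would only change the order in which the words are tried.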
Step S13: extract the screening feature words from the candidate place name addresses, and filter and screen the candidates according to the screening feature words.
In the present invention, a place name address extraction rule library is established, and the extraction rules for place name addresses are defined in it. Screening feature words are extracted from each candidate place name address, and it is determined whether they satisfy the preset place name address extraction rules. If they do not, the candidate is rejected; if they do, the candidate is taken as an extraction result.
The above method can automatically screen and extract place name addresses from massive web page text.
An embodiment of the present invention further provides a specific implementation of the place name address extraction method. As shown in Fig. 3, the main processing steps are as follows.
Step S31: capture the target text with a web crawler.
Web crawler technology is used to capture the web page text of target web pages, and the captured web page text is taken as the target text.
A web crawler, also called a web spider or web robot, is a program or script that automatically captures information from the Internet according to certain rules. According to the structure of a website, the crawler captures the text content of its web pages and stores it in a database.
Because the HTML documents on the web are connected by hyperlinks, a crawler can crawl along this web: for each page it visits, it uses a capture program to capture the page text and extracts the hyperlinks in the page as leads for further crawling.
Specifically, the crawler starts from a set of website links to be visited, visits these links, and identifies all hyperlinks in the visited pages. It then adds these hyperlinks to a list of websites and, following a certain strategy, repeatedly visits the addresses in the list to capture web page text from the corresponding pages.
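The crawl loop described above can be sketched with the Python standard library. The sketch makes several assumptions not found in the source: a breadth-first visiting strategy, a `max_pages` limit, and a caller-supplied `fetch` function (hypothetical) that returns the HTML of a URL, which keeps the example runnable without network access.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects text fragments and hyperlink targets from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], str],
          max_pages: int = 100) -> Dict[str, str]:
    """Visit pages breadth-first from the seed links and return {url: page text}."""
    queue = deque(seeds)
    seen = set(queue)
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        parser = _PageParser()
        parser.feed(fetch(url))
        pages[url] = " ".join(parser.text)
        # Newly found hyperlinks become leads for further crawling.
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Usage with an in-memory stand-in for the web (placeholder URLs).
site = {
    "http://example.test/a": '<p>Page A</p><a href="/b">next</a>',
    "http://example.test/b": '<p>Page B</p><a href="/a">back</a>',
}
print(crawl(["http://example.test/a"], lambda url: site[url]))
```

A production crawler would also need politeness delays, error handling and storage in a database, none of which is shown here.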
Step S32: perform Chinese word segmentation on the target text.
Chinese word segmentation uses a segmentation algorithm to cut the target text obtained by the web crawler (in the present invention, generally Chinese text) into independent words or Chinese characters, which is considerably more complex than splitting an English string. Open-source segmentation algorithms, both domestic and international, have reached very high accuracy. The present invention uses a mature, widely used Chinese word segmentation algorithm, analyzes and verifies the segmentation result, and removes redundant words unrelated to place name addresses, thereby reducing segmentation errors.
Segmenting the target text yields a queue of words to be matched.
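The source does not name the segmentation algorithm it uses. Purely as an illustration of what cutting Chinese text into words involves, the sketch below uses forward maximum matching against a small dictionary; the algorithm choice, the function name `forward_max_match` and the sample vocabulary are all assumptions, and a real system would more likely rely on an established segmentation library.

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text greedily, preferring the longest dictionary word at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word is found;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini dictionary: "located in", "Beijing", "Haidian District", "library".
vocabulary = {"位于", "北京", "海淀区", "图书馆"}
print(forward_max_match("位于北京海淀区的图书馆", vocabulary))
```

The resulting token list plays the role of the queue of words to be matched that the later steps consume.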
Step S33: form the prefix-suffix identification window.
In this step, statistical analysis is performed on the place name address text data in a corpus text library containing place name addresses, and the prefix feature words and suffix feature words of the place name addresses in the corpus are extracted. The extracted prefix feature words and suffix feature words are then used to form the prefix-suffix identification window.
When the window is formed, frequency statistics are performed on the prefix feature words and suffix feature words extracted from the corpus text library, and weights are assigned to the prefix feature words and suffix feature words in the window according to the result of the frequency statistics. The matching order of the prefix feature words and suffix feature words in the window is then determined according to their weights.
An embodiment of the present invention further provides an example of assigning weights to the prefix and suffix feature words and forming the prefix-suffix identification window, as follows.
As shown in Table 1, a list of prefix word types is given according to the part of speech of the prefix words, for example the verbs "located in" and "comprising" and the prepositions "along with" and "in"; they are not enumerated in full here.
Table 1 List of prefix words
As shown in Table 2, a list of suffix word types is given according to the part of speech of the suffix words, for example the verbs "hold" and "carry out" and auxiliary words such as "etc."; they are not enumerated in full here.
Table 2 List of suffix words
Based on the prefix feature words and suffix feature words shown in Table 1 and Table 2, the frequencies with which the prefix feature words and suffix feature words of place name addresses occur in the place name address corpus text library are counted, and the weight of each prefix or suffix feature word is determined by the formula weight = frequency(pos) + frequency(word), where:
frequency(pos) denotes the frequency of the part of speech of the suffix word, given the part of speech of the prefix word;
frequency(word) denotes the frequency with which the current suffix word occurs under that part of speech, given the part of speech of the suffix word.
The matching order of the prefix feature words and suffix feature words in the prefix-suffix identification window is determined according to the calculated weights. As shown in Fig. 4, the resulting window comprises prefix words and suffix words. The prefix words comprise prefix word 1, prefix word 2, ..., prefix word n, arranged by calculated weight. The suffix words comprise suffix word 1, suffix word 2, ..., suffix word n, arranged according to their calculated weights and their correspondence with the prefix words. During matching, the order in which the words are tried is essentially the same as the order in which they are arranged.
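The weighting formula can be illustrated with a short sketch. The source does not say whether the frequencies are raw counts or relative frequencies, nor how the counts are conditioned on the matched prefix; the version below assumes relative frequencies over a flat list of (part of speech, word) observations. The names `affix_weights` and `matching_order` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def affix_weights(observations: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute weight = frequency(pos) + frequency(word) for each observed word.

    observations holds one (part_of_speech, word) pair per corpus occurrence.
    Assumption: both terms are relative frequencies.
    """
    pairs = list(observations)
    total = len(pairs)
    pos_counts = Counter(pos for pos, _ in pairs)
    pair_counts = Counter(pairs)
    weights: Dict[str, float] = {}
    for (pos, word), count in pair_counts.items():
        freq_pos = pos_counts[pos] / total   # share of this part of speech
        freq_word = count / pos_counts[pos]  # share of this word within its part of speech
        # If a word occurs under several parts of speech, keep its largest weight.
        weights[word] = max(weights.get(word, 0.0), freq_pos + freq_word)
    return weights


def matching_order(weights: Dict[str, float]) -> List[str]:
    """Words with larger weights are tried first; ties are broken alphabetically."""
    return sorted(weights, key=lambda word: (-weights[word], word))


# Made-up suffix observations using the translated example words.
observed = [("verb", "hold"), ("verb", "hold"), ("verb", "carry out"), ("aux", "etc.")]
print(matching_order(affix_weights(observed)))
```

Sorting once when the window is built means the matching step can simply iterate over the lists in order.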
Step S34: extract candidate place name addresses.
The prefix-suffix identification window formed in the preceding step is used to perform text matching on the prefix and suffix of the words to be matched, respectively, and candidate place name addresses are extracted. When the words to be matched are processed, they are matched in queue order against the prefix feature words and suffix feature words according to the frequency characteristics of those feature words. That is, prefix feature words with larger weights are matched first; suffix words are then matched in the order of the weights of the suffix words that correspond to the matched prefix feature word. Only after a suffix word has been fully matched is the text between the prefix and the suffix extracted as a candidate place name address. The extracted candidates may be added to a candidate place name address library.
Step S35: verify against the rules and extract place name addresses.
According to the composition rules of place name addresses, a place name address rule library is built, and the entries in the candidate place name address library are matched against it one by one. Noise that contains no place name address element and does not conform to the word-formation rules of place name addresses is rejected, and place name addresses that contain address element feature words are extracted, which ensures the correctness and efficiency of place name address recognition and extraction. The process mainly comprises feature word extraction and feature word filtering.
Feature word extraction:
(1) A candidate place name address that contains an administrative division element is taken as place name address information. In the corresponding formula, adminLib is the administrative division library (accurate to village level), for example "Beijing, Jinan, Haidian ..."; i is an element of the set; and Loc(y) is defined as the set of place name addresses.
(2) A candidate place name address that contains a proper noun term is extracted as a place name address, where Loclist is the set of proper noun terms, for example "river, lake, road ...".
(3) A candidate place name address that contains latitude and longitude information is taken as a place name address, where Lonlat[i] is a latitude or longitude word, for example "east longitude, north latitude, west longitude, south latitude".
(4) A candidate place name address that contains an enterprise or public institution feature word is taken as a place name address, where unit[i] is an enterprise or public institution word, for example "company, school, passenger station, exhibition center, bank ...".
Feature word filtering:
(5) A candidate place name address that contains both a surname and a person description word is judged to be a non-place-name address, where Familyname is the set of surnames, for example "Zhao, Qian, Sun ..."; Figurelist is the set of person description suffix words, for example "Ms., Mr., uncle, auntie ..."; and Loc(n) denotes the set of non-place-name addresses.
(6) A candidate place name address that contains both a personal pronoun and a person description suffix word is judged to be a non-place-name address, where pronlist is the set of personal pronouns, for example "you, we, he ...".
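Rules (1) to (6) can be sketched as simple set lookups. The sketch rests on assumptions that the source does not confirm: the Chinese entries are back-translations of the translated examples, membership is tested by substring search, and the rejection rules (5) and (6) are applied before the acceptance rules (1) to (4). The function names are hypothetical.

```python
from typing import Iterable

# Sample entries back-translated from the examples in the text (assumed, not exhaustive).
ADMIN_LIB = {"北京", "济南", "海淀"}              # administrative divisions
LOC_LIST = {"河", "湖", "路"}                     # river, lake, road
LONLAT = {"东经", "北纬", "西经", "南纬"}         # longitude and latitude words
UNIT = {"公司", "学校", "客运站", "银行"}         # company, school, passenger station, bank
FAMILY_NAME = {"赵", "钱", "孙"}                  # surnames Zhao, Qian, Sun
FIGURE_LIST = {"女士", "先生", "叔叔", "阿姨"}    # Ms., Mr., uncle, auntie
PRON_LIST = {"你", "我们", "他"}                  # you, we, he


def contains_any(text: str, words: Iterable[str]) -> bool:
    """True if any of the given words occurs in text."""
    return any(word in text for word in words)


def is_place_name_address(candidate: str) -> bool:
    """Apply the filtering rules (5)-(6), then the extraction rules (1)-(4)."""
    if contains_any(candidate, FIGURE_LIST) and (
        contains_any(candidate, FAMILY_NAME) or contains_any(candidate, PRON_LIST)
    ):
        return False  # describes a person, not a place
    return any(
        contains_any(candidate, library)
        for library in (ADMIN_LIB, LOC_LIST, LONLAT, UNIT)
    )


for text in ["北京海淀", "赵先生", "东经116度", "昨天下午"]:
    print(text, is_place_name_address(text))
```

Checking the rejection rules first means a candidate such as a person's company is dropped even though it contains an institution word; the opposite order would be an equally plausible reading of the text.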
Corresponding to the above place name address extraction method, an embodiment of the present invention further provides a place name address extraction device. As shown in Fig. 5, the device comprises a word segmentation module 41, a prefix-suffix matching module 42 and a filtering and screening module 43. The word segmentation module 41 is configured to segment the target text into words to obtain the set of words to be matched. The prefix-suffix matching module 42 is configured to perform text matching on the prefix and suffix of the words to be matched using, respectively, the prefix feature words and suffix feature words in the prefix-suffix identification window, and to obtain candidate place name addresses according to the result of the text matching. The filtering and screening module 43 is configured to extract the screening feature words from the candidate place name addresses and to filter and screen the candidates according to the screening feature words.
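The three-module structure of the device can be pictured as a small pipeline in which each module is a pluggable callable. The class name `PlaceNameExtractor`, its field names and the English stand-in functions in the usage lines are assumptions made for illustration; the source describes the modules only functionally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlaceNameExtractor:
    """Chains the three modules of the device; each field is supplied by the caller."""

    segment: Callable[[str], List[str]]              # word segmentation module 41
    match_affixes: Callable[[List[str]], List[str]]  # prefix-suffix matching module 42
    keep: Callable[[str], bool]                      # filtering and screening module 43

    def extract(self, text: str) -> List[str]:
        tokens = self.segment(text)
        candidates = self.match_affixes(tokens)
        return [candidate for candidate in candidates if self.keep(candidate)]


def words_after_in(tokens: List[str]) -> List[str]:
    """Toy matcher: treat the word following "in" as a candidate."""
    return [tokens[i + 1] for i, token in enumerate(tokens[:-1]) if token == "in"]


# English stand-ins keep the usage example short; real modules would handle Chinese text.
extractor = PlaceNameExtractor(
    segment=str.split,
    match_affixes=words_after_in,
    keep=lambda candidate: candidate[0].isupper(),
)
print(extractor.extract("Mr. Zhao spoke in Beijing and in private"))
```

Keeping the modules as separate callables mirrors the logical division the description allows for, in which units may be combined or distributed differently in an actual implementation.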
The target text must be obtained before it is segmented, and it is obtained by capturing web page text with web crawler technology. To capture web pages, the device further comprises: a text capture module, configured to use web crawler technology, before the target text is segmented, to capture the web page text of a target web page and take the captured web page text as the target text.
The device further comprises: a feature word extraction module, configured to extract prefix feature words and suffix feature words, respectively, from the place name address text data in a corpus text library containing place name addresses; and an identification window forming module, configured to form the prefix-suffix identification window from the extracted prefix feature words and suffix feature words.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, device and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; as a further example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connections shown or discussed may be realized through communication interfaces, and the indirect coupling or communication connections between devices or units may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a portable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (10)

1. A place name address extraction method, characterized by comprising:
performing word segmentation on a target text to obtain phrases to be matched;
performing text matching on the prefixes and suffixes of the phrases to be matched using, respectively, the prefix feature words and suffix feature words in a prefix-suffix identification window, and obtaining candidate place name addresses according to the result of the text matching;
extracting screening feature words from the candidate place name addresses, and filtering and screening the candidate place name addresses according to the screening feature words.
2. The method according to claim 1, characterized in that, before the word segmentation of the target text, the method further comprises: using web crawler technology to capture web page text from a target web page, and taking the captured web page text as the target text.
3. The method according to claim 1, characterized in that the method further comprises: extracting prefix feature words and suffix feature words, respectively, from place name address text data in a corpus text library containing place name addresses;
forming the prefix-suffix identification window from the extracted prefix feature words and suffix feature words.
4. The method according to claim 3, characterized in that forming the prefix-suffix identification window from the extracted prefix feature words and suffix feature words comprises:
performing frequency statistics on the prefix feature words and suffix feature words extracted from the corpus text library, and assigning weights to the prefix feature words and suffix feature words in the prefix-suffix identification window according to the result of the frequency statistics;
determining the matching order of the prefix feature words and suffix feature words in the prefix-suffix identification window according to the magnitudes of their weights.
5. The method according to claim 4, characterized in that performing text matching on the prefixes and suffixes of the phrases to be matched using, respectively, the prefix feature words and suffix feature words in the prefix-suffix identification window comprises:
performing text matching on the prefixes and suffixes of the phrases to be matched in the matching order determined for the prefix feature words and suffix feature words in the prefix-suffix identification window.
6. The method according to claim 5, characterized in that performing text matching on the prefixes and suffixes of the phrases to be matched using, respectively, the prefix feature words and suffix feature words in the prefix-suffix identification window comprises:
matching the prefixes of the phrases to be matched using the prefix feature words in the prefix-suffix identification window;
after the prefix has been matched successfully, matching the suffixes of the phrases to be matched using the suffix feature words in the prefix-suffix identification window.
7. The method according to claim 1, characterized in that extracting the screening feature words from the candidate place name addresses and filtering and screening the candidate place name addresses according to the screening feature words comprises:
when the screening feature words extracted from a candidate place name address include at least one of an administrative division element, a proper noun, latitude and longitude information, and an enterprise or institution feature word, determining that the candidate place name address is a place name address that conforms to the place name address rules;
when the screening feature words extracted from a candidate place name address include both a surname and a person description, or include both a personal pronoun and a person description, rejecting the candidate place name address.
8. A place name address extraction device, characterized by comprising:
a word segmentation module, configured to perform word segmentation on a target text to obtain phrases to be matched;
a prefix-suffix matching module, configured to perform text matching on the prefixes and suffixes of the phrases to be matched using, respectively, the prefix feature words and suffix feature words in a prefix-suffix identification window, and to obtain candidate place name addresses according to the result of the text matching;
a filtering and screening module, configured to extract screening feature words from the candidate place name addresses and to filter and screen the candidate place name addresses according to the screening feature words.
9. The device according to claim 8, characterized by further comprising: a text capture module, configured to, before the word segmentation of the target text, use web crawler technology to capture web page text from a target web page and take the captured web page text as the target text.
10. The device according to claim 8, characterized by further comprising:
a feature word extraction module, configured to extract prefix feature words and suffix feature words, respectively, from place name address text data in a corpus text library containing place name addresses;
an identification window forming module, configured to form the prefix-suffix identification window from the extracted prefix feature words and suffix feature words.
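
For illustration only, and not as part of the claims, the flow of claims 1 and 3 to 7 might be sketched in Python as below. Everything named here is an assumption made for the sketch: the function names `build_window`, `match_candidates` and `filter_candidates`, the tiny word lists, the choice of the first and last token of each corpus address as its prefix and suffix feature word, relative frequency as the weight, and the decision to apply the rejection rule before the acceptance rule. Word segmentation is assumed to have been done already, and the proper-noun and latitude/longitude checks of claim 7 are omitted for brevity.

```python
from collections import Counter

# Hypothetical screening word lists; a real system would load far larger ones.
ADMIN_WORDS = {"省", "市", "区", "县"}          # administrative division elements
ORG_WORDS = {"公司", "医院", "学校"}            # enterprise/institution feature words
SURNAMES = {"王", "李", "张"}
PERSON_PRONOUNS = {"他", "她"}
PERSON_DESCRIPTIONS = {"先生", "女士", "老师"}


def build_window(segmented_addresses):
    """Claims 3-4 (sketch): collect prefix/suffix feature words from a corpus
    of pre-segmented addresses and order them by descending weight."""
    # Assumption: first token = prefix feature word, last token = suffix feature word.
    prefix_counts = Counter(t[0] for t in segmented_addresses if t)
    suffix_counts = Counter(t[-1] for t in segmented_addresses if t)

    def ordered(counts):
        total = sum(counts.values())
        # Weight = relative frequency; most_common() yields descending counts.
        return [(word, n / total) for word, n in counts.most_common()]

    return ordered(prefix_counts), ordered(suffix_counts)


def match_candidates(phrases, prefixes, suffixes):
    """Claims 5-6 (sketch): try prefix feature words in weight order; only
    when a prefix matches, try suffix feature words in weight order."""
    candidates = []
    for phrase in phrases:
        if not any(phrase.startswith(word) for word, _ in prefixes):
            continue  # prefix did not match, so the suffix is never tested
        if any(phrase.endswith(word) for word, _ in suffixes):
            candidates.append(phrase)
    return candidates


def filter_candidates(candidates):
    """Claim 7 (sketch): reject person-like candidates, keep address-like ones."""
    def has(words, text):
        return any(word in text for word in words)

    kept = []
    for text in candidates:
        person_like = has(PERSON_DESCRIPTIONS, text) and (
            has(SURNAMES, text) or has(PERSON_PRONOUNS, text)
        )
        if person_like:
            continue  # surname or pronoun together with a person description
        if has(ADMIN_WORDS, text) or has(ORG_WORDS, text):
            kept.append(text)
    return kept


# Toy data standing in for the corpus text library and the segmented target text.
corpus = [
    ["北京市", "海淀区", "中关村", "大街"],
    ["北京市", "朝阳区", "建国", "路"],
    ["上海市", "浦东新区", "世纪", "大道"],
    ["北京市", "西城区", "金融", "大街"],
]
phrases = ["北京市海淀区学院路", "北京市王先生大街", "他昨天去了公司", "上海市浦东新区世纪大道"]

prefixes, suffixes = build_window(corpus)
result = filter_candidates(match_candidates(phrases, prefixes, suffixes))
print(result)
```

Ordering the feature words by descending weight means the most frequent prefixes and suffixes are tried first, so common address patterns are recognised after fewer comparisons. Substring checks stand in for the extraction of screening feature words and would need to be replaced by proper lexicon lookups in practice.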
CN201510437893.0A 2015-07-23 2015-07-23 Place name address extraction method and device Active CN105068989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510437893.0A CN105068989B (en) 2015-07-23 2015-07-23 Place name address extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510437893.0A CN105068989B (en) 2015-07-23 2015-07-23 Place name address extraction method and device

Publications (2)

Publication Number Publication Date
CN105068989A (en) 2015-11-18
CN105068989B (en) 2018-05-04

Family

ID=54498363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510437893.0A Active CN105068989B (en) 2015-07-23 2015-07-23 Place name address extraction method and device

Country Status (1)

Country Link
CN (1) CN105068989B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6052683A (en) * 1998-02-24 2000-04-18 Nortel Networks Corporation Address lookup in packet data communication networks
CN1929447A (en) * 2006-06-01 2007-03-14 华为技术有限公司 Method and device for searching address prefixion and message transfer method and system
CN101299217A (en) * 2008-06-06 2008-11-05 北京搜狗科技发展有限公司 Method, apparatus and system for processing map information
CN102169498A (en) * 2011-04-14 2011-08-31 中国测绘科学研究院 Address model constructing method and address matching method and system
CN102479230A (en) * 2010-11-29 2012-05-30 北京四维图新科技股份有限公司 Method and device for extracting geographical feature words
CN103853738A (en) * 2012-11-29 2014-06-11 中国科学院计算机网络信息中心 Identification method for webpage information related region
CN103854064A (en) * 2012-11-29 2014-06-11 中国科学院计算机网络信息中心 Event occurrence risk prediction and early warning method targeted to specific zone
CN103853700A (en) * 2012-11-29 2014-06-11 中国科学院计算机网络信息中心 Event forewarning method based on regions and object information discovery

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550372A (en) * 2016-01-28 2016-05-04 浪潮软件集团有限公司 Sentence training device and method and information extraction system
CN107784478A (en) * 2016-08-31 2018-03-09 北京国双科技有限公司 The treating method and apparatus of administrative organization's information
CN107784478B (en) * 2016-08-31 2020-09-15 北京国双科技有限公司 Method and device for processing administrative institution information
CN107368471A (en) * 2017-06-29 2017-11-21 中国测绘科学研究院 The extracting method of place name address in a kind of web page text
CN107368471B (en) * 2017-06-29 2020-11-27 中国测绘科学研究院 Method for extracting place name address from webpage text
CN110134935A (en) * 2018-02-08 2019-08-16 株式会社理光 A kind of method, device and equipment for extracting font style characteristic
CN110134935B (en) * 2018-02-08 2023-08-11 株式会社理光 Method, device and equipment for extracting character form characteristics
CN108763212A (en) * 2018-05-23 2018-11-06 北京神州泰岳软件股份有限公司 A kind of address information extraction method and device
CN109918480A (en) * 2019-03-01 2019-06-21 陈包容 A method of address is extracted from text
CN110134664A (en) * 2019-04-12 2019-08-16 中国平安财产保险股份有限公司 Acquisition methods, device and the computer equipment in Data Migration path
CN112015888A (en) * 2019-05-31 2020-12-01 百度在线网络技术(北京)有限公司 Abstract information extraction method and abstract information extraction system
CN112015888B (en) * 2019-05-31 2023-08-18 百度在线网络技术(北京)有限公司 Abstract information extraction method and abstract information extraction system
CN111476037A (en) * 2020-04-14 2020-07-31 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN111476037B (en) * 2020-04-14 2023-03-31 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN111625732B (en) * 2020-05-25 2023-06-23 鼎富智能科技有限公司 Address matching method and device
CN111625732A (en) * 2020-05-25 2020-09-04 鼎富智能科技有限公司 Address matching method and device
CN112417812A (en) * 2020-11-26 2021-02-26 新智认知数据服务有限公司 Address standardization method and system and electronic equipment
CN112417812B (en) * 2020-11-26 2024-05-17 新智认知数据服务有限公司 Address standardization method and system and electronic equipment
CN114091454A (en) * 2021-11-29 2022-02-25 重庆市地理信息和遥感应用中心 Method for extracting place name information and positioning space in internet text
CN116258141A (en) * 2023-05-16 2023-06-13 青岛海信网络科技股份有限公司 Text data processing method, server and device
CN116258141B (en) * 2023-05-16 2023-09-26 青岛海信网络科技股份有限公司 Text data processing method, server and device

Also Published As

Publication number Publication date
CN105068989B (en) 2018-05-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant