CN104850538A - Chinese address compound word segmentation technology based on rules and statistic model - Google Patents


Info

Publication number
CN104850538A
Authority
CN
China
Prior art keywords
address
word segmentation
rule
chinese
random field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510230116.9A
Other languages
Chinese (zh)
Inventor
沈启明
密铁宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pei Keming Management Consulting (shanghai) Co Ltd
Original Assignee
Pei Keming Management Consulting (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pei Keming Management Consulting (shanghai) Co Ltd
Priority to CN201510230116.9A
Publication of CN104850538A
Legal status: Pending


Abstract

The invention discloses a compound Chinese address word segmentation technique based on rules and a statistical model. Addresses are segmented by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. The CRF model extracts the associated features within the address information and is trained on a data set built during a preprocessing stage, which gives the technique the ability to segment address information automatically and to identify address elements. The CRF model has strong pattern recognition ability: it can identify residential communities that are missing from the database, and it discriminates ambiguous addresses well, so that address elements can be distinguished correctly. The MMSEG algorithm is fast and accurate when it is backed by good dictionary data. Combining the two algorithms lets them complement and cross-check each other, which improves address matching accuracy and raises segmentation accuracy without sacrificing efficiency.

Description

Compound Chinese address word segmentation technique based on rules and a statistical model
Technical field
The invention belongs to the technical field of geographic information, and specifically relates to a compound Chinese address word segmentation technique based on rules and a statistical model.
Background art
Address matching is the process of establishing a correspondence between a textual description of an address and the geographic coordinates of the location it refers to. An address matching service searches for the matching object in fixed steps: the address is first standardized; the server then searches the address matching reference data for potential locations; each candidate location is assigned a score according to how close it is to the address; finally, the address is matched to the candidate with the highest score. Existing address matching methods use a variety of address segmentation modes. Some rely on a single mode and do not combine different techniques, or combine them inefficiently. As a result, segmentation based on rules alone cannot effectively recognize new words, while segmentation based on a statistical model alone is comparatively slow. A way of combining the two that improves segmentation accuracy while preserving efficiency is therefore urgently needed.
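The candidate scoring step described above can be illustrated with a small sketch. It is not taken from the patent: the similarity measure (the ratio from Python's standard `difflib.SequenceMatcher`), the function name `best_match`, and the reference table with its placeholder coordinates are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def best_match(query, reference):
    """Score every reference address against the query and return the best one.

    `reference` maps a standardized address string to its coordinates.
    The score is a generic string similarity in [0, 1]; a real matching
    service would use its own closeness measure.
    """
    scored = [
        (SequenceMatcher(None, query, address).ratio(), address, coords)
        for address, coords in reference.items()
    ]
    score, address, coords = max(scored)
    return address, coords, score


# Hypothetical reference data; the coordinates are placeholders, not real positions.
REFERENCE = {
    "上海市浦东新区世纪大道100号": (121.0, 31.0),
    "上海市徐汇区漕溪北路88号": (121.1, 31.1),
}

address, coords, score = best_match("浦东新区世纪大道100号", REFERENCE)
print(address, coords, round(score, 3))
```

Whole-string similarity is only a stand-in here. The point of the segmentation technique in this patent is to score candidates element by element (district, road, house number) rather than on the raw string.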
Summary of the invention
To overcome the shortcomings of the prior art, the invention provides a compound Chinese address word segmentation technique, based on rules and a statistical model, that improves segmentation accuracy while preserving efficiency.
The invention is realized through the following technical solution: a compound Chinese address word segmentation technique based on rules and a statistical model, which segments addresses by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. Using the CRF model requires extracting the associated features within the address information and training the model on the training data set created in the preprocessing stage, so that the model can segment address information automatically and identify address elements. The CRF model has strong pattern recognition ability: it can identify residential communities that are missing from the database, and it discriminates ambiguous addresses well, which helps distinguish address elements correctly.
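The patent does not specify the label scheme or the feature set used to train the CRF model. As a minimal sketch, assuming the widely used character-level B/M/E/S tagging (begin, middle, end, single) and a few simple context features, one training example could be prepared from a pre-segmented address as follows. The function names and the feature keys are hypothetical.

```python
def to_bmes(words):
    """Convert a segmented address into one B/M/E/S label per character."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


def char_features(chars, i):
    """Context features for character i: itself, its neighbours, a digit flag."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<BOS>",
        "next": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
        "is_digit": chars[i].isdigit(),
    }


def make_training_example(words):
    """Return (feature dicts, labels) for one pre-segmented address."""
    chars = list("".join(words))
    features = [char_features(chars, i) for i in range(len(chars))]
    return features, to_bmes(words)


features, labels = make_training_example(["上海市", "浦东新区", "世纪大道", "100号"])
print(labels)
```

Sequences of feature dictionaries paired with label sequences are the usual input format for CRF toolkits. To identify address elements as well as word boundaries, the labels could be extended with an element type (for example a road or district tag), which is again an assumption rather than something the patent states.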
The rule-optimized maximum matching algorithm is the MMSEG algorithm. MMSEG is based on forward maximum matching, supplemented with disambiguation rules, and works with a dictionary to segment address information and identify address elements. Given good dictionary data, MMSEG is fast and accurate. By combining the two algorithms, the invention lets them complement and cross-check each other, which effectively improves address matching accuracy.
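Forward maximum matching, the base that MMSEG builds on, can be sketched in a few lines. The dictionary below is a hypothetical toy example, and the fallback of emitting an unknown character as a one-character word is an assumption of this sketch.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    if max_len is None:
        max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


DICTIONARY = {"上海市", "浦东新区", "世纪大道", "世纪", "大道"}
print(forward_max_match("上海市浦东新区世纪大道100号", DICTIONARY))
# → ['上海市', '浦东新区', '世纪大道', '1', '0', '0', '号']
```

The house number falls apart into single characters because it is not in the dictionary. This illustrates why a purely dictionary-based segmenter needs extra rules, or a statistical model, for tokens it has never seen.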
MMSEG is a common dictionary-based algorithm for Chinese word segmentation. It is simple and intuitive, performs relatively well, is not complicated to implement, and runs comparatively fast. It is one of the older segmentation algorithms: it distills people's word-usage habits from the viewpoint of a non-linguist and derives its own disambiguation rules from them, which makes it a simple and practical segmentation algorithm.
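The patent does not list the disambiguation rules. In the commonly described form of MMSEG, candidate chunks of up to three consecutive words are compared by (1) greatest total length, (2) greatest average word length, (3) smallest variance of word lengths, and (4) greatest morphemic freedom of one-character words. The sketch below is a simplified version that applies only the first three rules; the toy dictionary and all function names are hypothetical.

```python
def candidate_words(text, start, dictionary, max_len):
    """Dictionary words beginning at `start`; a single character is always allowed."""
    words = [text[start]]
    for size in range(2, min(max_len, len(text) - start) + 1):
        if text[start:start + size] in dictionary:
            words.append(text[start:start + size])
    return words


def chunks(text, start, dictionary, max_len, depth=3):
    """All chunks of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, dictionary, max_len):
        for rest in chunks(text, start + len(word), dictionary, max_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_key(chunk):
    """Sort key applying rules 1 to 3 in order (variance negated: smaller is better)."""
    lengths = [len(word) for word in chunk]
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return (sum(lengths), mean, -variance)


def mmseg(text, dictionary):
    """Emit the first word of the best chunk, then repeat from the next position."""
    max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        best = max(chunks(text, i, dictionary, max_len), key=chunk_key)
        words.append(best[0])
        i += len(best[0])
    return words


DICTIONARY = {"北京", "北京大学", "大学生", "生活", "活动", "中心", "活动中心"}
print(mmseg("北京大学生活动中心", DICTIONARY))
# → ['北京', '大学生', '活动中心']
```

With this dictionary, plain forward maximum matching would start with 北京大学 and leave 动 stranded as a single character. The chunk rules prefer 北京 / 大学生 / 活动中心 because its word lengths (2, 3, 4) have a smaller variance than those of 北京大学 / 生 / 活动中心 (4, 1, 4).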
The beneficial effects of the invention are as follows. The address matching method of the invention segments addresses by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. The CRF model has strong pattern recognition ability: it can identify residential communities missing from the database, and it discriminates ambiguous addresses well, which helps distinguish address elements correctly. The rule-optimized maximum matching algorithm is the MMSEG algorithm, which is fast and accurate given good dictionary data. By combining the two algorithms, the invention lets them complement and cross-check each other, which effectively improves address matching accuracy. The invention thereby addresses two shortcomings: segmentation based on rules alone cannot effectively recognize new words, and segmentation based on a statistical model alone is slow. By using both together, the compound Chinese address word segmentation technique improves segmentation accuracy while preserving efficiency.
Embodiment
The invention is described in detail below with reference to an embodiment.
A compound Chinese address word segmentation technique based on rules and a statistical model segments addresses by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. Using the CRF model requires extracting the associated features within the address information and training the model on the training data set created in the preprocessing stage, so that the model can segment address information automatically and identify address elements. The CRF model has strong pattern recognition ability: it can identify residential communities missing from the database, and it discriminates ambiguous addresses well, which helps distinguish address elements correctly. The rule-optimized maximum matching algorithm is the MMSEG algorithm. MMSEG is based on forward maximum matching, supplemented with disambiguation rules, and works with a dictionary to segment address information and identify address elements. Given good dictionary data, MMSEG is fast and accurate. By combining the two algorithms, the invention lets them complement and cross-check each other, which effectively improves address matching accuracy.
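The patent does not describe how the two segmentation results are reconciled. The sketch below shows one possible policy, stated purely as an assumption: accept the result when both segmenters agree, trust the dictionary result when every word it produced is a known dictionary entry, and otherwise fall back to the CRF result. The CRF output is represented by a hand-written B/M/E/S label sequence, since no trained model is available here; all names are hypothetical.

```python
def words_from_bmes(text, labels):
    """Rebuild words from per-character B/M/E/S labels (a word ends at E or S)."""
    words, current = [], ""
    for char, label in zip(text, labels):
        current += char
        if label in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a label sequence that ends mid-word
        words.append(current)
    return words


def combine(dict_words, model_words, dictionary):
    """Pick one segmentation and report which source it came from."""
    if dict_words == model_words:
        return dict_words, "agreed"
    if all(word in dictionary for word in dict_words):
        return dict_words, "dictionary"
    return model_words, "model"


DICTIONARY = {"上海市", "浦东新区"}
text = "上海市浦东新区阳光花园"

# Dictionary-based result: the community name is unknown and falls apart.
dict_words = ["上海市", "浦东新区", "阳", "光", "花", "园"]
# Stand-in for the CRF prediction on the same text.
model_words = words_from_bmes(text, list("BMEBMMEBMME"))

print(combine(dict_words, model_words, DICTIONARY))
```

In this example the community name 阳光花园 is missing from the dictionary, so the policy selects the model's segmentation, which mirrors the patent's claim that the CRF model can recover communities omitted from the database.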
Finally, it should be noted that the above content only illustrates the technical solution of the invention and does not limit its scope of protection. Simple modifications or equivalent replacements made to the technical solution by a person of ordinary skill in the art do not depart from the essence and scope of the technical solution of the invention.

Claims (2)

1. A compound Chinese address word segmentation technique based on rules and a statistical model, characterized in that: the technique segments addresses by jointly applying a conditional random field model and a rule-optimized maximum matching algorithm; using the conditional random field model requires extracting the associated features within the address information and training the model on the training data set created in the preprocessing stage, so that the model can segment address information automatically and identify address elements.
2. The compound Chinese address word segmentation technique based on rules and a statistical model according to claim 1, characterized in that: the rule-optimized maximum matching algorithm is the MMSEG algorithm, which is based on forward maximum matching, supplemented with disambiguation rules, and works with a dictionary to segment address information and identify address elements.
CN201510230116.9A 2015-05-08 2015-05-08 Chinese address compound word segmentation technology based on rules and statistic model Pending CN104850538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510230116.9A CN104850538A (en) 2015-05-08 2015-05-08 Chinese address compound word segmentation technology based on rules and statistic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510230116.9A CN104850538A (en) 2015-05-08 2015-05-08 Chinese address compound word segmentation technology based on rules and statistic model

Publications (1)

Publication Number Publication Date
CN104850538A true CN104850538A (en) 2015-08-19

Family

ID=53850188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510230116.9A Pending CN104850538A (en) 2015-05-08 2015-05-08 Chinese address compound word segmentation technology based on rules and statistic model

Country Status (1)

Country Link
CN (1) CN104850538A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996247A (en) * 2010-11-10 2011-03-30 百度在线网络技术(北京)有限公司 Method and device for constructing address database
CN104598573A (en) * 2015-01-13 2015-05-06 北京京东尚科信息技术有限公司 Method for extracting life circle of user and system thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
梧桐话: "《http://www.360doc.com/content/13/0217/15/11619026_266141425.shtml》", 17 February 2013 *
程昌秀等: "一种基于规则的模糊中文地址分词匹配方法", 《地理与地理信息科学》 *
蒋建洪等: "词典与统计方法结合的中文分词模型研究及应用", 《计算机工程与设计》 *
谭侃侃: "基于规则的中文地址分词与匹配的方法", 《中国优秀硕士学位论文全文数据库基础科学辑》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528526A (en) * 2016-10-09 2017-03-22 武汉工程大学 A Chinese address semantic tagging method based on the Bayes word segmentation algorithm
CN106528526B (en) * 2016-10-09 2019-05-28 武汉工程大学 A kind of Chinese address semanteme marking method based on Bayes's segmentation methods
CN110826318A (en) * 2019-10-14 2020-02-21 浙江数链科技有限公司 Method, device, computer device and storage medium for logistics information identification

Similar Documents

Publication Publication Date Title
CN108133045B (en) Keyword extraction method and system, and keyword extraction model generation method and system
US10783171B2 (en) Address search method and device
Jiang et al. R 2 cnn: Rotational region cnn for arbitrarily-oriented scene text detection
CN111625635A (en) Question-answer processing method, language model training method, device, equipment and storage medium
CN105005577A (en) Address matching method
WO2018177316A1 (en) Information identification method, computing device, and storage medium
CN104050256A (en) Initiative study-based questioning and answering method and questioning and answering system adopting initiative study-based questioning and answering method
CN107256230B (en) Fusion method based on diversified geographic information points
TW201907325A (en) Risk address identification method, device and electronic device
CN105243389A (en) Industry classification tag determining method and apparatus for company name
CN105045847B (en) A kind of method that Chinese institutional units title is extracted from text message
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
CN113223013B (en) Method, device, equipment and storage medium for pulmonary vessel segmentation positioning
CN111488468A (en) Geographic information knowledge point extraction method and device, storage medium and computer equipment
CN111309910A (en) Text information mining method and device
Rousell et al. Extraction of landmarks from OpenStreetMap for use in navigational instructions
CN111198946A (en) Network news hotspot mining method and device
Zhou et al. Icdar 2015 text reading in the wild competition
CN104850538A (en) Chinese address compound word segmentation technology based on rules and statistic model
CN107463624A (en) A kind of method and system that city interest domain identification is carried out based on social media data
CN103176953B (en) A kind of text handling method and system
CN108153860A (en) A kind of geolocation analysis method based on multilingual news
CN105354264B (en) A kind of quick adding method of theme label based on local sensitivity Hash
CN112381162A (en) Information point identification method and device and electronic equipment
Zhao et al. One‐shot video‐based person re‐identification with variance subsampling algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150819
