CN104850538A - Chinese address compound word segmentation technology based on rules and statistic model - Google Patents
- Publication number
- CN104850538A (application number CN201510230116.9A)
- Authority
- CN
- China
- Prior art keywords
- address
- word segmentation
- rule
- chinese
- random field
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a compound Chinese address word segmentation technique based on rules and a statistical model. Addresses are segmented by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. The CRF model extracts correlated features from within the address information and is trained on a data set created in a preprocessing stage, which gives the technique the ability to segment address information automatically and to identify address elements. The CRF model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates ambiguous addresses well, so address elements can be distinguished successfully. The MMSEG algorithm is fast and accurate when it is backed by good dictionary data. The technique combines the two algorithms so that they complement and cross-check each other, which effectively improves address matching accuracy and improves segmentation accuracy while preserving efficiency.
Description
Technical field
The invention belongs to the technical field of geographic information and specifically relates to a compound Chinese address word segmentation technique based on rules and a statistical model.
Background technology
Address matching is the process of establishing a correspondence between a textual address description and its geographic coordinates. An address matching service searches for matching objects in specific steps: first, the address is standardized; next, the server searches the address matching reference data for potential locations; then each candidate location is assigned a score according to how closely it matches the address; finally, the address is matched to the candidate with the highest score. Existing address matching methods use a variety of address word segmentation modes. Some of them are rather limited: they do not combine different techniques, or they combine them inefficiently. As a result, segmentation based on rules alone cannot effectively recognize new words, while segmentation based on a statistical model alone is comparatively slow. Combining the two so that segmentation accuracy improves while efficiency is preserved has become a pressing need.
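The four matching steps described above (standardize, search the reference data, score each candidate, keep the best) can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the standardization rules or the scoring function, so the whitespace and full-width digit normalization, the use of `difflib.SequenceMatcher` as the closeness score, and the `REFERENCE` entries with their coordinates are all made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical reference data: standardized address text -> (longitude, latitude).
# The addresses and coordinates are invented for illustration only.
REFERENCE = {
    "杭州市西湖区文三路100号": (120.13, 30.28),
    "杭州市西湖区文二路100号": (120.13, 30.29),
    "杭州市上城区解放路100号": (120.17, 30.25),
}

# Translation table from full-width digits to ASCII digits.
FULLWIDTH_DIGITS = {ord(c): ord("0") + i for i, c in enumerate("０１２３４５６７８９")}


def standardize(address):
    """Toy standardization (an assumption): drop whitespace, normalize digits."""
    return "".join(address.split()).translate(FULLWIDTH_DIGITS)


def match_address(query, reference):
    """Score every candidate against the query and return the best one."""
    q = standardize(query)
    scored = [(SequenceMatcher(None, q, cand).ratio(), cand) for cand in reference]
    score, best = max(scored)  # highest score wins
    return best, reference[best], score


best, coords, score = match_address("杭州市 西湖区文三路１００号", REFERENCE)
print(best, coords, score)
```

A production matcher would score candidates on segmented address elements rather than on raw character similarity, which is where the word segmentation described in this patent comes in.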
Summary of the invention
To overcome the deficiencies of the prior art, the present invention provides a compound Chinese address word segmentation technique, based on rules and a statistical model, that improves segmentation accuracy while preserving efficiency.
The present invention is achieved by the following technical solution: a compound Chinese address word segmentation technique based on rules and a statistical model, which segments addresses by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. Applying the CRF model requires extracting correlated features from within the address information and training the model on the training data set created in the preprocessing stage, so that it gains the ability to segment address information automatically and to identify address elements. The CRF model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates ambiguous addresses well, which helps distinguish address elements.
The rule-optimized maximum matching algorithm is the MMSEG algorithm. MMSEG is based on forward maximum matching, supplemented by disambiguation rules, and works with a dictionary to segment address information and identify address elements. MMSEG is fast and accurate when it is backed by good dictionary data. The present invention combines the two algorithms so that they complement and verify each other, which effectively improves address matching accuracy.
MMSEG is a common dictionary-based algorithm for Chinese word segmentation. It is simple and performs relatively well; because it is simple and intuitive, it is not complicated to implement and it runs fairly fast. It is an older algorithm that distills people's word usage habits from a non-linguist's point of view and derives its own disambiguation rules from them, which makes it a simple and practical segmentation algorithm.
The beneficial effects of the invention are as follows. The address matching method of the present invention segments addresses by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. The CRF model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates ambiguous addresses well, which helps distinguish address elements. The rule-optimized maximum matching algorithm is the MMSEG algorithm, which is fast and accurate when it is backed by good dictionary data. The present invention combines the two algorithms so that they complement and verify each other, which effectively improves address matching accuracy. It overcomes both the inability of purely rule-based segmentation to recognize new words and the slowness of segmentation based on a statistical model alone: by combining the two, the compound Chinese address word segmentation technique improves segmentation accuracy while preserving efficiency.
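To make the CRF training step more concrete, the sketch below shows one common way to turn a pre-segmented address into character-level training data. The patent does not list its features or its tag set, so everything here is an assumption: the B/M/E/S position tags joined with an element type, the element type names (`city`, `district`, `road`, `number`), the `SUFFIX_CHARS` set, and the context-window features are illustrative choices, not the patent's method.

```python
# Characters that often end a Chinese address element (an illustrative assumption).
SUFFIX_CHARS = set("省市区县镇乡村路街巷弄号栋幢室")


def to_char_labels(segmented):
    """Convert [(word, element_type), ...] into [(char, label), ...].

    Labels combine a position tag (B=begin, M=middle, E=end, S=single)
    with the element type, e.g. "B-road".
    """
    out = []
    for word, kind in segmented:
        if len(word) == 1:
            out.append((word, f"S-{kind}"))
            continue
        for i, ch in enumerate(word):
            pos = "B" if i == 0 else "E" if i == len(word) - 1 else "M"
            out.append((ch, f"{pos}-{kind}"))
    return out


def char_features(chars, i):
    """Features for the character at index i, using a one-character window."""
    ch = chars[i]
    prev_ch = chars[i - 1] if i > 0 else "<BOS>"
    next_ch = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    return {
        "char": ch,
        "prev": prev_ch,
        "next": next_ch,
        "prev+char": prev_ch + ch,
        "char+next": ch + next_ch,
        "is_digit": ch.isdigit(),
        "is_suffix": ch in SUFFIX_CHARS,
    }


# A hand-segmented training example (made up for illustration).
example = [("杭州市", "city"), ("西湖区", "district"), ("文三路", "road"), ("100号", "number")]
labels = to_char_labels(example)
chars = [ch for ch, _ in labels]
features = [char_features(chars, i) for i in range(len(chars))]
print(labels[:3])
print(features[2])
```

The per-character feature dictionaries and label sequences produced this way are the kind of input a CRF toolkit is typically trained on. Decoding the predicted tags back into words then yields both the segmentation and the element type of each segment.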
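The difference between plain forward maximum matching and MMSEG can be illustrated with a short sketch. It follows the general shape of the MMSEG "complex" matcher: enumerate chunks of up to three consecutive words, then pick a chunk by maximum total length, then largest average word length, then smallest variance of word lengths. The original algorithm has a fourth rule based on single-character morpheme frequencies, which is omitted here because it needs frequency data. The dictionary `DICT` and the sample text are made up for illustration, and nothing below is taken from the patent's own implementation.

```python
from statistics import pvariance

# Hypothetical dictionary of address elements (illustrative only).
DICT = {"解放路", "解放路口", "口腔医院", "医院"}


def forward_max_match(text, dictionary):
    """Plain forward maximum matching: always take the longest word."""
    max_len = max(map(len, dictionary), default=1)
    out, pos = [], 0
    while pos < len(text):
        for n in range(min(max_len, len(text) - pos), 0, -1):
            word = text[pos:pos + n]
            if n == 1 or word in dictionary:  # single characters always match
                out.append(word)
                pos += n
                break
    return out


def candidate_words(text, start, dictionary, max_len):
    """Dictionary words starting at `start`; a single character is always allowed."""
    words = [text[start:start + 1]]
    for n in range(2, max_len + 1):
        word = text[start:start + n]
        if len(word) == n and word in dictionary:
            words.append(word)
    return words


def chunks(text, start, dictionary, max_len, depth=3):
    """Enumerate chunks of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, dictionary, max_len):
        for rest in chunks(text, start + len(word), dictionary, max_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_key(chunk):
    """Rules 1-3 as a sort key: total length, average length, low variance."""
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    return (total, total / len(lengths), -pvariance(lengths))


def mmseg(text, dictionary):
    """Emit the first word of the best chunk, then repeat from the new position."""
    max_len = max(map(len, dictionary), default=1)
    out, pos = [], 0
    while pos < len(text):
        best = max(chunks(text, pos, dictionary, max_len), key=chunk_key)
        out.append(best[0])
        pos += len(best[0])
    return out


text = "解放路口腔医院"
print(forward_max_match(text, DICT))
print(mmseg(text, DICT))
```

In this example, forward maximum matching greedily takes 解放路口 and leaves a stray single character behind, while the chunk-level rules prefer the reading with fewer, longer words (解放路 followed by 口腔医院). Both readings cover the same total length, so the decision falls to the average word length rule.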
Embodiment
The present invention is described in detail below with reference to an embodiment.
A compound Chinese address word segmentation technique based on rules and a statistical model segments addresses by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. Applying the CRF model requires extracting correlated features from within the address information and training the model on the training data set created in the preprocessing stage, so that it gains the ability to segment address information automatically and to identify address elements. The CRF model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates ambiguous addresses well, which helps distinguish address elements. The rule-optimized maximum matching algorithm is the MMSEG algorithm. MMSEG is based on forward maximum matching, supplemented by disambiguation rules, and works with a dictionary to segment address information and identify address elements. MMSEG is fast and accurate when it is backed by good dictionary data. The present invention combines the two algorithms so that they complement and verify each other, which effectively improves address matching accuracy.
Finally, it should be noted that the above content only illustrates the technical solution of the present invention and does not limit its scope of protection. Simple modifications or equivalent replacements made to the technical solution by those of ordinary skill in the art do not depart from the essence and scope of the technical solution of the present invention.
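The patent states that the two algorithms complement and verify each other but does not say how their outputs are reconciled. The sketch below is one possible policy, offered purely as an assumption: accept the result when both segmentations agree; when they differ, keep the dictionary result only if every token is a known dictionary word, and otherwise fall back to the CRF result, which is the model expected to handle names missing from the database. The function names, the dictionary, and the hand-written stand-in for CRF output are all hypothetical.

```python
def boundaries(tokens):
    """Set of character offsets at which a token ends."""
    cuts, pos = set(), 0
    for token in tokens:
        pos += len(token)
        cuts.add(pos)
    return cuts


def combine(dict_tokens, crf_tokens, dictionary):
    """Reconcile two segmentations of the same text (policy is an assumption)."""
    if "".join(dict_tokens) != "".join(crf_tokens):
        raise ValueError("segmentations cover different text")
    if boundaries(dict_tokens) == boundaries(crf_tokens):
        return dict_tokens, "agreed"      # mutual verification succeeded
    if all(token in dictionary for token in dict_tokens):
        return dict_tokens, "dictionary"  # fully covered by known words
    return crf_tokens, "crf"              # unknown words: trust the model


# Hypothetical dictionary that lacks the community name 翠苑小区.
dictionary = {"文三路", "100号"}
# Dictionary-based output falls apart on the unknown name.
dict_tokens = ["翠", "苑", "小", "区", "文三路", "100号"]
# Hand-written stand-in for what a trained CRF might return.
crf_tokens = ["翠苑小区", "文三路", "100号"]

print(combine(dict_tokens, crf_tokens, dictionary))
```

Other policies are equally plausible, for example resolving each disagreement span separately or weighting the CRF's confidence, and the choice affects both accuracy and speed.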
Claims (2)
1. A compound Chinese address word segmentation technique based on rules and a statistical model, characterized in that: the technique segments addresses by jointly applying a conditional random field model and a rule-optimized maximum matching algorithm; applying the conditional random field model requires extracting correlated features from within the address information and training the model on the training data set created in the preprocessing stage, so that it gains the ability to segment address information automatically and to identify address elements.
2. The compound Chinese address word segmentation technique based on rules and a statistical model according to claim 1, characterized in that: the rule-optimized maximum matching algorithm is the MMSEG algorithm, which is based on forward maximum matching, supplemented by disambiguation rules, and works with a dictionary to segment address information and identify address elements.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510230116.9A CN104850538A (en) | 2015-05-08 | 2015-05-08 | Chinese address compound word segmentation technology based on rules and statistic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510230116.9A CN104850538A (en) | 2015-05-08 | 2015-05-08 | Chinese address compound word segmentation technology based on rules and statistic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104850538A true CN104850538A (en) | 2015-08-19 |
Family
ID=53850188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510230116.9A Pending CN104850538A (en) | 2015-05-08 | 2015-05-08 | Chinese address compound word segmentation technology based on rules and statistic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104850538A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106528526A (en) * | 2016-10-09 | 2017-03-22 | 武汉工程大学 | A Chinese address semantic tagging method based on the Bayes word segmentation algorithm |
CN110826318A (en) * | 2019-10-14 | 2020-02-21 | 浙江数链科技有限公司 | Method, device, computer device and storage medium for logistics information identification |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996247A (en) * | 2010-11-10 | 2011-03-30 | 百度在线网络技术(北京)有限公司 | Method and device for constructing address database |
CN104598573A (en) * | 2015-01-13 | 2015-05-06 | 北京京东尚科信息技术有限公司 | Method for extracting life circle of user and system thereof |
Non-Patent Citations (4)
Title |
---|
梧桐话: "《http://www.360doc.com/content/13/0217/15/11619026_266141425.shtml》", 17 February 2013 * |
程昌秀等: "一种基于规则的模糊中文地址分词匹配方法", 《地理与地理信息科学》 * |
蒋建洪等: "词典与统计方法结合的中文分词模型研究及应用", 《计算机工程与设计》 * |
谭侃侃: "基于规则的中文地址分词与匹配的方法", 《中国优秀硕士学位论文全文数据库基础科学辑》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106528526A (en) * | 2016-10-09 | 2017-03-22 | 武汉工程大学 | A Chinese address semantic tagging method based on the Bayes word segmentation algorithm |
CN106528526B (en) * | 2016-10-09 | 2019-05-28 | 武汉工程大学 | A kind of Chinese address semanteme marking method based on Bayes's segmentation methods |
CN110826318A (en) * | 2019-10-14 | 2020-02-21 | 浙江数链科技有限公司 | Method, device, computer device and storage medium for logistics information identification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108133045B (en) | Keyword extraction method and system, and keyword extraction model generation method and system | |
US10783171B2 (en) | Address search method and device | |
Jiang et al. | R 2 cnn: Rotational region cnn for arbitrarily-oriented scene text detection | |
CN111625635A (en) | Question-answer processing method, language model training method, device, equipment and storage medium | |
CN105005577A (en) | Address matching method | |
WO2018177316A1 (en) | Information identification method, computing device, and storage medium | |
CN104050256A (en) | Initiative study-based questioning and answering method and questioning and answering system adopting initiative study-based questioning and answering method | |
CN107256230B (en) | Fusion method based on diversified geographic information points | |
TW201907325A (en) | Risk address identification method, device and electronic device | |
CN105243389A (en) | Industry classification tag determining method and apparatus for company name | |
CN105045847B (en) | A kind of method that Chinese institutional units title is extracted from text message | |
CN113657274B (en) | Table generation method and device, electronic equipment and storage medium | |
CN113223013B (en) | Method, device, equipment and storage medium for pulmonary vessel segmentation positioning | |
CN111488468A (en) | Geographic information knowledge point extraction method and device, storage medium and computer equipment | |
CN111309910A (en) | Text information mining method and device | |
Rousell et al. | Extraction of landmarks from OpenStreetMap for use in navigational instructions | |
CN111198946A (en) | Network news hotspot mining method and device | |
Zhou et al. | Icdar 2015 text reading in the wild competition | |
CN104850538A (en) | Chinese address compound word segmentation technology based on rules and statistic model | |
CN107463624A (en) | A kind of method and system that city interest domain identification is carried out based on social media data | |
CN103176953B (en) | A kind of text handling method and system | |
CN108153860A (en) | A kind of geolocation analysis method based on multilingual news | |
CN105354264B (en) | A kind of quick adding method of theme label based on local sensitivity Hash | |
CN112381162A (en) | Information point identification method and device and electronic equipment | |
Zhao et al. | One‐shot video‐based person re‐identification with variance subsampling algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20150819 |