CN104850538A

CN104850538A - Chinese address compound word segmentation technology based on rules and statistic model

Info

Publication number: CN104850538A
Application number: CN201510230116.9A
Authority: CN
Inventors: 沈启明; 密铁宾
Original assignee: Pei Keming Management Consulting (shanghai) Co Ltd
Current assignee: Pei Keming Management Consulting (shanghai) Co Ltd
Priority date: 2015-05-08
Filing date: 2015-05-08
Publication date: 2015-08-19

Abstract

The invention discloses a compound word segmentation technique for Chinese addresses based on rules and a statistical model. Addresses are segmented by jointly applying a conditional random field (CRF) model and a rule-optimized maximum matching algorithm. The CRF model extracts the relevant features from within the address information and is trained on a training data set created in a preprocessing stage, so that the technique can automatically segment address information and identify address elements. The CRF model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates well between ambiguous addresses, which helps distinguish address elements correctly. The MMSEG algorithm is fast and accurate when it is supported by good dictionary data. By combining the two algorithms so that they complement and verify each other, the technique effectively improves address matching accuracy and improves segmentation accuracy while preserving efficiency.

Description

Chinese address compound word segmentation technique based on rules and a statistical model

Technical field

The invention belongs to the technical field of geographic information, and specifically relates to a Chinese address compound word segmentation technique based on rules and a statistical model.

Background technology

Address matching is the process of establishing a correspondence between a textual description of an address and the geographic coordinates of its location in space. An address matching service searches for the matching object in specific steps: first, the address is standardized; then the server searches the address matching reference data for potential locations; each candidate location is assigned a score according to how close it is to the address; finally, the candidate with the highest score is matched to the address. The address matching methods in the prior art use a variety of address segmentation modes. In some of them the segmentation mode is too narrow: different techniques are not used in combination, or they are combined inefficiently. As a result, segmentation based on rules alone cannot effectively recognize new words, while segmentation based on a statistical model alone is relatively slow. Combining the two, so that segmentation accuracy improves while efficiency is preserved, has therefore become an urgent need.
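The four steps described above (standardize, search the reference data, score each candidate, take the highest score) can be sketched as follows. This is a minimal illustration and not the method of the invention: the reference table, its coordinates, the `standardize` rules and the use of `difflib.SequenceMatcher` as the closeness score are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical reference data: standardized address -> (longitude, latitude).
# The coordinates are illustrative placeholders, not surveyed values.
REFERENCE = {
    "上海市浦东新区世纪大道100号": (121.50, 31.23),
    "上海市浦东新区世纪大道1号": (121.51, 31.24),
    "上海市徐汇区漕溪北路88号": (121.44, 31.19),
}

# Map full-width digits to ASCII digits (one assumed standardization rule).
FULLWIDTH_DIGITS = str.maketrans("０１２３４５６７８９", "0123456789")


def standardize(address: str) -> str:
    """Toy standardization: drop whitespace and normalize full-width digits."""
    return "".join(address.split()).translate(FULLWIDTH_DIGITS)


def match_address(address: str, reference: dict):
    """Score every candidate by string similarity and return the best one."""
    query = standardize(address)
    scored = [
        (SequenceMatcher(None, query, candidate).ratio(), candidate)
        for candidate in reference
    ]
    score, best = max(scored)  # highest score wins
    return best, reference[best], score


best, coords, score = match_address("上海市 浦东新区世纪大道１００号", REFERENCE)
print(best, coords, score)
```

A production service would restrict the candidate set with an index rather than scoring every reference entry, and would score on segmented address elements rather than on raw strings.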

Summary of the invention

To overcome the deficiencies of the prior art, the present invention provides a Chinese address compound word segmentation technique based on rules and a statistical model that improves segmentation accuracy while preserving efficiency.

The present invention is achieved by the following technical solution: a Chinese address compound word segmentation technique based on rules and a statistical model, which jointly applies a conditional random field model and a rule-optimized maximum matching algorithm to segment addresses. Applying the conditional random field model requires extracting the relevant features from within the address information and training the model on the training data set created in the preprocessing stage, so that it can automatically segment address information and identify address elements. The conditional random field model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates well between ambiguous addresses, which helps distinguish address elements correctly.
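The text does not specify the tag scheme or the features used by the conditional random field model. As a sketch under stated assumptions, the preprocessing stage could turn a manually segmented address into per-character training data like this. The B/M/E/S tag scheme, the element type names, the `SUFFIXES` set and the feature names are all hypothetical choices for illustration.

```python
# Hypothetical set of characters that often end an address element.
SUFFIXES = set("省市区县镇乡村路街道巷弄号楼室")


def to_char_tags(segments):
    """Convert [(word, element_type), ...] into per-character (char, tag) pairs.

    Tags follow an assumed B/M/E/S scheme: Begin, Middle, End of a
    multi-character element, or Single for a one-character element.
    """
    tagged = []
    for word, kind in segments:
        if len(word) == 1:
            tagged.append((word, f"S-{kind}"))
            continue
        tagged.append((word[0], f"B-{kind}"))
        for ch in word[1:-1]:
            tagged.append((ch, f"M-{kind}"))
        tagged.append((word[-1], f"E-{kind}"))
    return tagged


def char_features(chars, i):
    """Assumed feature set for the character at position i."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<BOS>",
        "next": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
        "is_digit": chars[i].isdigit(),
        "is_suffix": chars[i] in SUFFIXES,
    }


def to_training_pair(segments):
    """Return (feature dicts, tags) for one labeled address."""
    tagged = to_char_tags(segments)
    chars = [ch for ch, _ in tagged]
    features = [char_features(chars, i) for i in range(len(chars))]
    tags = [tag for _, tag in tagged]
    return features, tags


features, tags = to_training_pair(
    [("上海市", "city"), ("世纪大道", "road"), ("1", "number"), ("号", "unit")]
)
print(tags)
```

Each labeled address yields one sequence of feature dictionaries and one parallel tag sequence, which is the general shape of input that sequence-labeling toolkits expect. Because the model predicts tags from character context rather than from a dictionary lookup, it can label a community name it has never seen.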

The rule-optimized maximum matching algorithm is the MMSEG algorithm. MMSEG is based on forward maximum matching, supplemented by disambiguation rules, and works with a dictionary to segment address information and identify address elements. Given good dictionary data, MMSEG is fast and accurate. The present invention combines the two algorithms so that they complement and verify each other, effectively improving address matching accuracy.

MMSEG is a common dictionary-based segmentation algorithm for Chinese. It is simple and intuitive, gives relatively good results, is not complicated to implement, and runs fairly fast. It is an older algorithm that distills people's word-usage habits from a non-linguist's point of view and derives its own disambiguation rules from them, which makes it a simple and practical segmentation algorithm.
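A simplified sketch of the idea follows. It assumes the commonly described form of MMSEG, in which chunks of up to three words are enumerated at each position and ranked by three rules: greatest total length, greatest average word length, and smallest variance of word lengths. The fourth rule, which relies on single-character word frequencies, is omitted, and the dictionaries are hypothetical. It is an illustration, not the implementation used by the invention.

```python
from statistics import pvariance


def candidate_words(text, start, dictionary, max_len):
    """Dictionary words starting at `start`; a single character is always allowed."""
    words = [text[start:start + 1]]
    for end in range(start + 2, min(len(text), start + max_len) + 1):
        if text[start:end] in dictionary:
            words.append(text[start:end])
    return words


def chunks(text, start, dictionary, max_len, depth=3):
    """Enumerate chunks of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, dictionary, max_len):
        for rest in chunks(text, start + len(word), dictionary, max_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_rank(chunk):
    """Ranking key: larger is better for every component."""
    lengths = [len(word) for word in chunk]
    return (
        sum(lengths),                 # rule 1: greatest total chunk length
        sum(lengths) / len(lengths),  # rule 2: greatest average word length
        -pvariance(lengths),          # rule 3: smallest variance of word lengths
    )


def mmseg(text, dictionary):
    """Segment `text`, emitting the first word of the best chunk at each step."""
    max_len = max(map(len, dictionary), default=1)
    output, pos = [], 0
    while pos < len(text):
        best = max(chunks(text, pos, dictionary, max_len), key=chunk_rank)
        output.append(best[0])
        pos += len(best[0])
    return output


# Hypothetical address dictionary.
ADDRESS_WORDS = {"南京", "南京市", "市长", "长江", "长江路"}
print(mmseg("南京市长江路1号", ADDRESS_WORDS))
```

With this dictionary the address is cut as 南京市 / 长江路 / 1 / 号. Characters that are not covered by any dictionary word, such as the digit here, come out as single-character tokens, which is why the quality of the dictionary matters so much for this approach.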

The beneficial effects of the invention are as follows. The address matching method of the present invention jointly applies a conditional random field model and a rule-optimized maximum matching algorithm to segment addresses. The conditional random field model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates well between ambiguous addresses, which helps distinguish address elements correctly. The rule-optimized maximum matching algorithm is the MMSEG algorithm, which is fast and accurate when it is supported by good dictionary data. The present invention combines the two algorithms so that they complement and verify each other, effectively improving address matching accuracy. The invention thereby overcomes both the inability of purely rule-based segmentation to recognize new words and the slowness of segmentation based purely on a statistical model: by using the two together, the Chinese address compound word segmentation technique improves segmentation accuracy while preserving efficiency.

Embodiment

The present invention is described in detail below with reference to an embodiment.

A Chinese address compound word segmentation technique based on rules and a statistical model jointly applies a conditional random field model and a rule-optimized maximum matching algorithm to segment addresses. Applying the conditional random field model requires extracting the relevant features from within the address information and training the model on the training data set created in the preprocessing stage, so that it can automatically segment address information and identify address elements. The conditional random field model has strong pattern recognition capability: it can identify residential communities that are missing from the database, and it discriminates well between ambiguous addresses, which helps distinguish address elements correctly. The rule-optimized maximum matching algorithm is the MMSEG algorithm. MMSEG is based on forward maximum matching, supplemented by disambiguation rules, and works with a dictionary to segment address information and identify address elements. Given good dictionary data, MMSEG is fast and accurate. The present invention combines the two algorithms so that they complement and verify each other, effectively improving address matching accuracy.
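The text does not state how the two results are combined. One possible reading of "complement and verify each other" is sketched below purely as an assumption: accept the result when both segmenters agree, and otherwise fall back to the statistical result only when the dictionary result contains a run of characters that the dictionary does not cover. The function name, the threshold and the example community name are all hypothetical.

```python
def combine(dict_seg, crf_seg, dictionary, min_unknown=2):
    """Pick one segmentation and report which source was used.

    dict_seg: tokens from the dictionary-based segmenter.
    crf_seg:  tokens from the statistical (CRF) segmenter.
    Assumed policy, not taken from the source text.
    """
    if dict_seg == crf_seg:
        # Mutual verification: both methods produced the same cut.
        return dict_seg, "agree"
    # Single characters absent from the dictionary suggest an unknown word,
    # for example a community name missing from the database.
    unknown = [w for w in dict_seg if len(w) == 1 and w not in dictionary]
    if len(unknown) >= min_unknown:
        return crf_seg, "crf"
    return dict_seg, "dictionary"


# Hypothetical dictionary and segmenter outputs for an address whose
# community name (阳光苑) is not in the dictionary.
DICTIONARY = {"上海市", "浦东新区"}
dict_seg = ["上海市", "浦东新区", "阳", "光", "苑"]
crf_seg = ["上海市", "浦东新区", "阳光苑"]
print(combine(dict_seg, crf_seg, DICTIONARY))
```

Under this policy the faster dictionary result is kept whenever the dictionary covers the address, and the statistical model is consulted only for the cases the dictionary cannot handle, which is one way to preserve efficiency while improving accuracy.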

Finally, it should be noted that the above is intended only to illustrate the technical solution of the present invention and not to limit its scope of protection. Simple modifications or equivalent substitutions made to the technical solution by those of ordinary skill in the art do not depart from the essence and scope of the technical solution of the present invention.

Claims

1. A Chinese address compound word segmentation technique based on rules and a statistical model, characterized in that: the technique jointly applies a conditional random field model and a rule-optimized maximum matching algorithm to segment addresses; applying the conditional random field model requires extracting the relevant features from within the address information and training the model on the training data set created in the preprocessing stage, so that it can automatically segment address information and identify address elements.

2. The Chinese address compound word segmentation technique based on rules and a statistical model according to claim 1, characterized in that: the rule-optimized maximum matching algorithm is the MMSEG algorithm, which is based on forward maximum matching, is supplemented by disambiguation rules, and works with a dictionary to segment address information and identify address elements.