CN106933799A

CN106933799A - Chinese word segmentation method and device for point-of-interest (POI) names

Info

Publication number: CN106933799A
Application number: CN201511029875.5A
Authority: CN
Inventors: 史川
Original assignee: Navinfo Co Ltd
Current assignee: Navinfo Co Ltd
Priority date: 2015-12-31
Filing date: 2015-12-31
Publication date: 2017-07-07

Abstract

The present invention provides a Chinese word segmentation method and device for point-of-interest (POI) names. The method includes: obtaining a segmentation dictionary produced by processing a predetermined total sample of POI names, the segmentation dictionary including keywords extracted from the POI names in the sample and the word frequency of each keyword in the sample; and performing full segmentation on a first POI name to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI name yields multiple keywords under different segmentation modes, the keyword with the highest word frequency in the predetermined total sample is taken as the segmentation result for that character. The method and device resolve the segmentation ambiguity that arises for a single character when POI names are segmented, make the segmentation result more reasonable, and ensure segmentation accuracy.

Description

Chinese word segmentation method and device for point-of-interest (POI) names

Technical field

The present invention relates to the field of word segmentation technology, and in particular to a Chinese word segmentation method and device for point-of-interest (POI) names.

Background technology

With the rapid development of the Internet, the amount of information people can access is expanding sharply. While this mass of information makes resources easier to obtain, its mixed nature also makes information harder to filter. Word segmentation technology allows information to be filtered and organized so that people obtain more accurate and reasonable resources, which brings greater convenience to work and life and greatly improves efficiency. Because Chinese words are not separated from one another, applying existing Chinese word segmentation technology to point-of-interest (POI) names leads to phrase segmentation ambiguity. The segmentation result then deviates from the actual meaning, which directly affects subsequent information processing and retrieval.

Summary of the invention

The technical problem to be solved by the present invention is to provide a Chinese word segmentation method and device for point-of-interest (POI) names, so as to solve the problem of segmentation ambiguity in the Chinese word segmentation of POI names.

In one aspect, an embodiment of the present invention provides a Chinese word segmentation method for point-of-interest (POI) names, including:

obtaining a segmentation dictionary produced by processing a predetermined total sample of POI names, the segmentation dictionary including keywords extracted from the POI names in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;

performing full segmentation on a first POI name to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI name yields multiple keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.

When the first POI name contains non-Chinese characters, the method further includes:

converting the first POI name to half-width characters, extracting all non-Chinese character groups in the first POI name, marking the position of each non-Chinese character group, and adding the non-Chinese character groups to the first segmentation result.

After the first segmentation result is obtained, the method further includes:

determining whether the keywords in the first segmentation result include an unregistered word that is not present in the segmentation dictionary;

if so, counting the word frequency of the unregistered word in the predetermined total sample of POI names and, when that frequency is higher than a predetermined threshold, adding the unregistered word to the segmentation dictionary.

The step of performing full segmentation on the first POI name to be segmented to obtain the first segmentation result includes:

matching the first POI name against the segmentation dictionary according to a maximum matching method to obtain a first matching result;

correcting the first matching result according to the principle of fewest single-character segments to obtain the first segmentation result.

The segmentation dictionary further includes a national road name database and a neighborhood name configuration table.

In another aspect, to implement the above method, an embodiment of the present invention also provides a Chinese word segmentation device for point-of-interest (POI) names, including:

an acquisition module, configured to obtain a segmentation dictionary produced by processing a predetermined total sample of POI names, the segmentation dictionary including keywords extracted from the POI names in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;

a first segmentation module, configured to perform full segmentation on a first POI name to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI name yields multiple keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.

The device further includes:

a second segmentation module, configured to convert the first POI name to half-width characters, extract all non-Chinese character groups in the first POI name, mark the position of each non-Chinese character group, and add the non-Chinese character groups to the first segmentation result.

The device further includes:

a judgment module, configured to determine whether the keywords in the first segmentation result obtained by the segmentation module include an unregistered word that is not present in the segmentation dictionary;

a statistics and addition module, configured to, if the result of the judgment module is yes, count the word frequency of the unregistered word in the predetermined total sample of POI names and, when that frequency is higher than a predetermined threshold, add the unregistered word to the segmentation dictionary.

The first segmentation module includes:

a matching unit, configured to match the first POI name against the segmentation dictionary according to a maximum matching method to obtain a first matching result;

a correction unit, configured to correct the first matching result according to the principle of fewest single-character segments to obtain the first segmentation result.

The above technical solution of the present invention has at least the following beneficial effects:

According to the word frequencies in the segmentation dictionary of the different keywords that can be obtained by segmenting a given single character in a POI name, the above technical solution takes the keyword with the highest word frequency as the segmentation result for that character. This resolves the segmentation ambiguity that arises for a single character when POI names are segmented, makes the segmentation result more reasonable, and ensures segmentation accuracy.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of the Chinese word segmentation method for POI names according to the method embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the Chinese word segmentation device for POI names according to the device embodiment of the present invention;

Fig. 3 is another schematic structural diagram of the Chinese word segmentation device for POI names according to the device embodiment of the present invention;

Fig. 4 is a further schematic structural diagram of the Chinese word segmentation device for POI names according to the device embodiment of the present invention;

Fig. 5 is an example flow of the Chinese word segmentation method for POI names according to a specific embodiment of the present invention.

Detailed description of the embodiments

To make the technical problem to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.

Method embodiment

Referring to Fig. 1, which shows a schematic flow chart of the Chinese word segmentation method for POI names according to the method embodiment of the present invention, the Chinese word segmentation method for point-of-interest (POI) names provided by the method embodiment may include:

Step S101: obtain a segmentation dictionary produced by processing a predetermined total sample of POI names, the segmentation dictionary including keywords extracted from the POI names in the predetermined total sample and the word frequency of each keyword in the predetermined total sample.

In the above embodiment, the predetermined total sample of POI names is collated to obtain a segmentation dictionary for segmenting POI names. The collation may be performed manually; this embodiment does not limit the specific processing mode. The segmentation dictionary includes the keywords extracted from the POI names in the predetermined total sample and the word frequency of each keyword based on that sample. The keywords may be extracted according to attributes predefined for the POI names and stored in the segmentation dictionary according to those attributes.

In addition, the predetermined total sample of POI names is a set of POI names collected and recorded in advance. The sample contains a sufficiently large number of POI names and covers a sufficiently wide range. The embodiments of the present invention do not limit how the sample is collected or recorded.
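A minimal sketch of how such a dictionary might be tabulated, assuming the keyword list has already been prepared (for example by manual collation) and that "word frequency" means the number of occurrences across the sample; the text fixes neither point. The function name, the sample names and the keywords are hypothetical.

```python
from collections import Counter


def build_segmentation_dictionary(sample_names, keywords):
    """Map each keyword to its number of occurrences in the sample of POI names."""
    frequency = Counter({keyword: 0 for keyword in keywords})
    for name in sample_names:
        for keyword in keywords:
            # str.count counts non-overlapping occurrences of the keyword.
            frequency[keyword] += name.count(keyword)
    return dict(frequency)


# Hypothetical sample and keyword list, for illustration only.
sample_names = ["新洋大药房", "康宁大药房", "北大书店"]
keywords = ["大药房", "药房", "北大"]
dictionary = build_segmentation_dictionary(sample_names, keywords)
```

Counting occurrences rather than the number of names containing a keyword is a design choice of this sketch; either definition gives a usable ranking for disambiguation.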

Step S102: perform full segmentation on a first POI name to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI name yields multiple keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.

In the above embodiment, during full segmentation of the first POI name to be segmented, if a single character yields several different keywords under different segmentation modes, that is, if the segmentation of that character is ambiguous, the word frequencies recorded in the segmentation dictionary for those keywords in the predetermined total sample are consulted. The word frequency indicates how often each keyword is used in the sample, and the keyword with the highest word frequency is taken as the segmentation result for the character, which yields a more accurate segmentation of the characters in the first POI name. For example, full segmentation of the POI name "multiple Beijing University pharmacy Eastern Han Dynasty Yang Lu shops" (a literal rendering of what is presumably 复北大药房东汉阳路店) yields "multiple" (复), "Beijing University" (北大), "big pharmacy" (大药房), "pharmacy" (药房), "Eastern Han Dynasty Yang Lu" (东汉阳路, a road name) and "shop" (店). Here the single character "big" (大) belongs to two different keywords, "Beijing University" and "big pharmacy", depending on the segmentation mode. The segmentation dictionary gives the word frequencies of these two keywords in the predetermined total sample as 213 and 43782 respectively, so the segmentation result for the character "big" is confirmed to be "big pharmacy".
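The disambiguation rule can be sketched as follows. This is an illustrative reading, not the patented implementation: the function names are hypothetical, the Chinese strings are an assumed back-translation of the example, and the dictionary is reduced to the two keywords whose frequencies the text states (213 and 43782).

```python
def full_segmentation(name, dictionary):
    """Return every (start, end, word) span of `name` that is a dictionary keyword."""
    spans = []
    for start in range(len(name)):
        for end in range(start + 1, len(name) + 1):
            word = name[start:end]
            if word in dictionary:
                spans.append((start, end, word))
    return spans


def resolve_character(index, spans, dictionary):
    """Among the keywords covering the character at `index`, pick the most frequent one."""
    candidates = [word for start, end, word in spans if start <= index < end]
    if not candidates:
        return None
    return max(candidates, key=lambda word: dictionary[word])


# Frequencies taken from the example; the Chinese keywords are assumed.
dictionary = {"北大": 213, "大药房": 43782}
name = "复北大药房东汉阳路店"
spans = full_segmentation(name, dictionary)
winner = resolve_character(name.index("大"), spans, dictionary)
```

Here the character 大 is covered by both candidate keywords, and the higher frequency decides in favour of 大药房.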

In one possible implementation of the method embodiment of the present invention, when the first POI name contains non-Chinese characters, the method further includes:

converting the first POI name to be segmented to half-width characters, extracting all non-Chinese character groups in the first POI name, marking the positions of the non-Chinese character groups, and adding the non-Chinese character groups to the first segmentation result. The marked positions then serve as natural delimiters when the remaining Chinese characters in the first POI name are segmented.
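A short sketch of this preprocessing step, under two assumptions that the text does not state: "Chinese character" is approximated by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and half-width conversion covers the full-width ASCII forms plus the ideographic space. The function names and the sample name are hypothetical.

```python
import re


def to_half_width(text):
    """Convert full-width ASCII forms (U+FF01..U+FF5E) and U+3000 to half-width."""
    converted = []
    for char in text:
        code = ord(char)
        if code == 0x3000:            # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width form -> ASCII counterpart
            code -= 0xFEE0
        converted.append(chr(code))
    return "".join(converted)


# One or more consecutive characters outside the basic CJK block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def extract_non_chinese_groups(name):
    """Return the half-width name, the non-Chinese groups with positions, and the Chinese runs."""
    half = to_half_width(name)
    groups = [(m.start(), m.end(), m.group()) for m in NON_CHINESE.finditer(half)]
    # The groups act as natural delimiters between the remaining Chinese runs.
    chinese_runs = [run for run in NON_CHINESE.split(half) if run]
    return half, groups, chinese_runs


half, groups, chinese_runs = extract_non_chinese_groups("ＫＦＣ肯德基（东汉阳路店）")
```

Each Chinese run can then be segmented independently, since no keyword can span a non-Chinese group.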

In one possible implementation of the method embodiment of the present invention, after the first segmentation result is obtained, the method further includes:

determining whether the first segmentation result contains unregistered words, that is, words not present in the segmentation dictionary, and counting the word frequency of each unregistered word in the predetermined total sample of POI names. When the word frequency of an unregistered word is higher than a predetermined threshold, the unregistered word is added to the segmentation dictionary, which expands the keywords of the segmentation dictionary.
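The dictionary expansion step might look like the sketch below. It assumes frequency is counted as occurrences across the sample and that the new entry stores that count; the function name, the threshold value and the sample data are hypothetical.

```python
def register_new_words(segments, dictionary, sample_names, threshold):
    """Add unregistered segments whose sample frequency exceeds `threshold`."""
    added = []
    for word in segments:
        if word in dictionary:
            continue  # already a registered keyword
        frequency = sum(name.count(word) for name in sample_names)
        if frequency > threshold:
            dictionary[word] = frequency
            added.append(word)
    return added


# Hypothetical data: "新洋" is not yet in the dictionary.
dictionary = {"大药房": 43782}
sample_names = ["新洋大药房", "新洋超市", "新洋宾馆"]
added = register_new_words(["新洋", "大药房"], dictionary, sample_names, threshold=2)
```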

In step S102 above, the step of performing full segmentation on the first POI name to be segmented to obtain the first segmentation result may include:

matching the first POI name against the segmentation dictionary according to a maximum matching method to obtain a first matching result; and correcting the first matching result according to the principle of fewest single-character segments to obtain the first segmentation result.

In the above embodiment, the first POI name is matched against the keywords in the segmentation dictionary according to the maximum matching method, which yields a first matching result made up of the matched dictionary keywords. Then, following the principle of fewest single-character segments, adjacent single characters in the first matching result are merged into one segment, which corrects the first matching result and yields the first segmentation result. For example, matching "new ocean big pharmacy" (presumably 新洋大药房) gives the first matching result "new / ocean / big pharmacy", and correcting it according to the principle of fewest single-character segments gives the first segmentation result "new ocean / big pharmacy". In this embodiment, the maximum matching method may be one or more of the forward maximum matching method, the reverse maximum matching method and the bidirectional maximum matching method.
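The two stages can be illustrated with a plain forward maximum matching pass followed by a merge of adjacent single characters. This is a sketch under assumptions: the maximum word length, the function names and the Chinese strings (an assumed back-translation of the example) are not from the source, and frequency-based disambiguation is deliberately left out.

```python
def forward_maximum_match(text, dictionary, max_len=4):
    """Greedy left-to-right matching that always takes the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word starts here.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def merge_adjacent_singles(tokens):
    """Merge runs of adjacent single-character tokens into one token."""
    merged = []
    buffer = ""
    for token in tokens:
        if len(token) == 1:
            buffer += token
        else:
            if buffer:
                merged.append(buffer)
                buffer = ""
            merged.append(token)
    if buffer:
        merged.append(buffer)
    return merged


dictionary = {"大药房", "药房"}
first_match = forward_maximum_match("新洋大药房", dictionary)
first_result = merge_adjacent_singles(first_match)
```

With this small dictionary the intermediate and final results correspond to "new / ocean / big pharmacy" and "new ocean / big pharmacy" in the example above.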

In summary, the Chinese word segmentation method for POI names provided by the method embodiment of the present invention uses the word frequencies in the segmentation dictionary of the different keywords that can be obtained by segmenting a given single character in a POI name, and takes the keyword with the highest word frequency as the segmentation result for that character. This resolves the segmentation ambiguity that arises for a single character when POI names are segmented, makes the segmentation result more reasonable, and ensures segmentation accuracy.

The present invention is described in more detail below by way of a specific implementation example.

Referring to Fig. 5, which shows an example flow of the Chinese word segmentation method for POI names according to a specific embodiment of the present invention, the steps of the method in the specific embodiment include:

A. Keywords are extracted according to the characteristics of the POI names in the predetermined total sample. For example, from a name such as "Shengda hardware, electrical equipment, household supplies and tools firm", the keywords "hardware", "electrical equipment", "household supplies", "tools" and "firm" need to be extracted. These keywords are stored in the segmentation dictionary according to preset attributes, which include part of speech, whether the keyword is a brand, whether it is a place, and so on. The word frequency of each keyword based on the predetermined total sample of POI names is added to the segmentation dictionary alongside it. The storage format is shown in the following table:

Columns: IDCODE, NAME, LOCTION, ADJECTIVE, BRAND, NOUN, FREQUENCY

K00270 Cinema Y N Y 2065345

K00271 KFC N Y 7844

K00272 Hardware N Y 48732

K00273 Company Y N Y 884245

K00274 Fast N Y N 1045623

In addition, a national road name database and a neighborhood name configuration table are added to the segmentation dictionary.
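One way to hold such records in memory is sketched below. The class and field names are hypothetical, derived from the column headings; the attribute flags are left unset in the sample entries because the text does not state unambiguously which flag belongs to which keyword. Only the identifiers, names and frequencies are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DictionaryEntry:
    idcode: str
    name: str
    frequency: int
    # Attribute flags; None means "not specified".
    is_location: Optional[bool] = None
    is_adjective: Optional[bool] = None
    is_brand: Optional[bool] = None
    is_noun: Optional[bool] = None


entries = [
    DictionaryEntry("K00271", "KFC", 7844),
    DictionaryEntry("K00272", "Hardware", 48732),
]

# Index by keyword so that a lookup during segmentation is a single dict access.
lookup = {entry.name: entry for entry in entries}
```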

B. The POI name to be segmented is preprocessed. The name is converted to half-width characters; the various separator symbols in it, such as dashes and brackets, are recorded as segmentation marks; and the English words, numerals and the like that it contains are extracted and their positions marked.

C. Chinese word segmentation is performed on the POI name to be segmented. The POI name is fully segmented according to the forward maximum matching algorithm. For example, based on the segmentation dictionary, full segmentation of "multiple Beijing University pharmacy Eastern Han Dynasty Yang Lu shops" (a literal rendering of what is presumably 复北大药房东汉阳路店) yields "multiple" (复), "Beijing University" (北大), "big pharmacy" (大药房), "pharmacy" (药房), "Eastern Han Dynasty Yang Lu" (东汉阳路) and "shop" (店). Here, according to the national road name database, the road name "Eastern Han Dynasty Yang Lu" is not split. The ambiguity that arises in segmentation is resolved by word frequency: in the predetermined total sample of POI names the word frequency of "Beijing University" is 213 and that of "big pharmacy" is 43782, so the result is "big pharmacy", and the first matching result is "multiple / north / big pharmacy / Eastern Han Dynasty Yang Lu / shop". The first matching result is then corrected according to the principle of fewest single-character segments, giving the first segmentation result "multiple north / big pharmacy / Eastern Han Dynasty Yang Lu / shop".
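The worked example can be traced with the sketch below. It is one possible reading, not the patented algorithm: it assumes that road names are protected first and that the remaining overlaps are resolved greedily by descending word frequency, a strategy the text does not spell out. The function names are hypothetical, the Chinese strings are an assumed back-translation, and the dictionary holds only the two keywords whose frequencies are stated.

```python
from itertools import groupby


def first_matching_result(name, frequency, road_names):
    """Pick non-overlapping words, preferring road names, then higher word frequency."""
    spans = [
        (start, end)
        for start in range(len(name))
        for end in range(start + 1, len(name) + 1)
        if name[start:end] in frequency or name[start:end] in road_names
    ]

    def priority(span):
        word = name[span[0]:span[1]]
        return (word in road_names, frequency.get(word, 0))

    chosen = []
    for span in sorted(spans, key=priority, reverse=True):
        # Keep a span only if it overlaps none of the spans already chosen.
        if all(span[1] <= other[0] or span[0] >= other[1] for other in chosen):
            chosen.append(span)

    tokens, position = [], 0
    for start, end in sorted(chosen):
        tokens.extend(name[position:start])  # uncovered characters become single tokens
        tokens.append(name[start:end])
        position = end
    tokens.extend(name[position:])
    return tokens


def merge_single_characters(tokens):
    """Apply the fewest-single-characters correction by merging adjacent singles."""
    merged = []
    for is_single, group in groupby(tokens, key=lambda token: len(token) == 1):
        group = list(group)
        merged.extend(["".join(group)] if is_single else group)
    return merged


frequency = {"北大": 213, "大药房": 43782}
road_names = {"东汉阳路"}
first_match = first_matching_result("复北大药房东汉阳路店", frequency, road_names)
result = merge_single_characters(first_match)
```

Under these assumptions the two intermediate lists correspond to the first matching result and the first segmentation result given in step C.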

D. Unregistered words are processed. It is determined whether the first segmentation result contains unregistered words that are not present in the segmentation dictionary. If so, the word frequency of each unregistered word in the predetermined total sample of POI names is counted, and when the word frequency reaches the predetermined threshold, the unregistered word is added to the segmentation dictionary.

Device embodiment

Referring to Fig. 2, which shows a schematic structural diagram of the Chinese word segmentation device for POI names according to the device embodiment of the present invention: to implement the above method embodiment, the device embodiment provides a Chinese word segmentation device for point-of-interest (POI) names, which may include:

an acquisition module 210, configured to obtain a segmentation dictionary produced by processing a predetermined total sample of POI names, the segmentation dictionary including keywords extracted from the POI names in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;

a first segmentation module 220, configured to perform full segmentation on a first POI name to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI name yields multiple keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.

On the basis of Fig. 2, and referring to Fig. 3, which shows another schematic structural diagram of the Chinese word segmentation device for POI names according to the device embodiment, the device may further include:

a second segmentation module 230, configured to convert the first POI name to half-width characters, extract all non-Chinese character groups in the first POI name, mark the position of each non-Chinese character group, and add the non-Chinese character groups to the first segmentation result.

On the basis of Fig. 2, and referring to Fig. 4, which shows a further schematic structural diagram of the Chinese word segmentation device for POI names according to the device embodiment, the device may further include:

a judgment module 240, configured to determine whether the keywords in the first segmentation result obtained by the segmentation module include an unregistered word that is not present in the segmentation dictionary;

a statistics and addition module 250, configured to, if the result of the judgment module is yes, count the word frequency of the unregistered word in the predetermined total sample of POI names and, when that frequency is higher than a predetermined threshold, add the unregistered word to the segmentation dictionary.

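The first segmentation module 220 may include:

a matching unit, configured to match the first POI name against the segmentation dictionary according to a maximum matching method to obtain a first matching result;

a correction unit, configured to correct the first matching result according to the principle of fewest single-character segments to obtain the first segmentation result.

The Chinese word segmentation device for POI names provided by the above device embodiment is based on the same concept as the above method embodiment. For its specific implementation, refer to the method embodiment; the details are not repeated here.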

In summary, the Chinese word segmentation method and device for POI names provided by the above embodiments of the present invention use the word frequencies in the segmentation dictionary of the different keywords that can be obtained by segmenting a given single character in a POI name, and take the keyword with the highest word frequency as the segmentation result for that character. This resolves the segmentation ambiguity that arises for a single character when POI names are segmented, makes the segmentation result more reasonable, and ensures segmentation accuracy.

The above is the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

It should be noted that, for brevity, the foregoing embodiments are all described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions involved are not necessarily required by the present invention.

In addition, in the embodiments of the invention, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations.

Claims

1. A Chinese word segmentation method for point-of-interest (POI) names, characterized by including:

obtaining a segmentation dictionary produced by processing a predetermined total sample of POI names, the segmentation dictionary including keywords extracted from the POI names in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;

performing full segmentation on a first POI name to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI name yields multiple keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for the single character.

2. The method according to claim 1, characterized in that, when the first POI name contains non-Chinese characters, the method further includes:

converting the first POI name to half-width characters, extracting all non-Chinese character groups in the first POI name, marking the positions of the non-Chinese character groups, and adding the non-Chinese character groups to the first segmentation result.

3. The method according to claim 1, characterized in that, after the first segmentation result is obtained, the method further includes:

determining whether the keywords in the first segmentation result include an unregistered word that is not present in the segmentation dictionary;

if so, counting the word frequency of the unregistered word in the predetermined total sample of POI names and, when the frequency of the unregistered word is higher than a predetermined threshold, adding the unregistered word to the segmentation dictionary.

4. The method according to claim 1, characterized in that the step of performing full segmentation on the first POI name to be segmented to obtain the first segmentation result includes:

matching the first POI name against the segmentation dictionary according to a maximum matching method to obtain a first matching result;

correcting the first matching result according to the principle of fewest single-character segments to obtain the first segmentation result.

5. The method according to claim 1, characterized in that the segmentation dictionary further includes a national road name database and a neighborhood name configuration table.

6. A Chinese word segmentation device for point-of-interest (POI) names, characterized by including:

an acquisition module, configured to obtain a segmentation dictionary produced by processing a predetermined total sample of POI names, the segmentation dictionary including keywords extracted from the POI names in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;

a first segmentation module, configured to perform full segmentation on a first POI name to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI name yields multiple keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for the single character.

7. The device according to claim 6, characterized in that the device further includes:

a second segmentation module, configured to convert the first POI name to half-width characters, extract all non-Chinese character groups in the first POI name, mark the positions of the non-Chinese character groups, and add the non-Chinese character groups to the first segmentation result.

8. The device according to claim 6, characterized in that the device further includes:

a judgment module, configured to determine whether the keywords in the first segmentation result obtained by the segmentation module include an unregistered word that is not present in the segmentation dictionary;

a statistics and addition module, configured to, if the result of the judgment module is yes, count the word frequency of the unregistered word in the predetermined total sample of POI names and, when the frequency of the unregistered word is higher than a predetermined threshold, add the unregistered word to the segmentation dictionary.

9. The device according to claim 6, characterized in that the first segmentation module includes:

a matching unit, configured to match the first POI name against the segmentation dictionary according to a maximum matching method to obtain a first matching result;

a correction unit, configured to correct the first matching result according to the principle of fewest single-character segments to obtain the first segmentation result.

10. The device according to claim 6, characterized in that the segmentation dictionary further includes a national road name database and a neighborhood name configuration table.