CN107329950A

CN107329950A - A dictionary-free Chinese address word segmentation method

Info

Publication number: CN107329950A
Application number: CN201710441735.1A
Authority: CN
Inventors: 谢婷婷; 李晓林; 严柯; 张懿; 刘志杰
Original assignee: Wuhan Institute of Technology
Current assignee: Wuhan Institute of Technology
Priority date: 2017-06-13
Filing date: 2017-06-13
Publication date: 2017-11-07
Anticipated expiration: 2037-06-13
Also published as: CN107329950B

Abstract

The invention discloses a dictionary-free Chinese address word segmentation method comprising the following steps: 1) statistically obtaining, from a training corpus, the word frequency, mutual information and information entropy of every character string whose length is greater than 1 and less than or equal to 8; 2) pre-processing the address string with regular expressions and performing full segmentation on the input address string; 3) obtaining the segmentation scheme with the minimum arc cost from the mutual information and information entropy; 4) performing a second pass over the set of strings in that segmentation scheme using a confidence method to judge whether each string is a true entry, thereby obtaining the optimal segmentation scheme.

Description

A dictionary-free Chinese address word segmentation method

Technical field

The present invention relates to the fields of Internet technology and data mining, and in particular to a dictionary-free Chinese address word segmentation method that uses the mutual information, information entropy and confidence of Chinese addresses to split a Chinese address into its address elements.

Background art

With the rapid development of Internet technology, the network has become an important platform for disseminating and exchanging information. Large amounts of data and information are produced in cyberspace every day, most of it in the form of natural-language text, and how to mine useful information from it is a current research focus. These texts contain a great deal of spatial information: according to sampling statistics, about 70% of web pages worldwide contain location information. Compared with traditional geographic information or data, however, the geographic information in text is unstructured and can only be analysed and mined after it has been formalized. Formalizing the spatial information in text involves place-name address segmentation, spatial relation extraction and event extraction. Place-name address segmentation is the most basic, lowest-level task in this formalization, and its accuracy directly affects the effectiveness of the subsequent work.

Place-name address segmentation is the application of Chinese word segmentation to place-name addresses: it is the process of splitting a place-name address string into a number of geographic elements. Chinese word segmentation algorithms fall roughly into three classes: dictionary-based methods, statistics-based methods and understanding-based methods. Because address names in China are numerous and irregular, no complete dictionary covers all address information. The present invention therefore proposes a dictionary-free Chinese address word segmentation method for place-name address strings.

Summary of the invention

In view of the problems of the prior art, the object of the invention is to provide a dictionary-free Chinese address word segmentation method: the word frequency, mutual information and information entropy of an address corpus are counted; full segmentation is performed on the character string to obtain the set of all segmentation schemes; the scheme with the minimum arc cost is computed; and confidence processing is then applied to that scheme to perform a second segmentation, yielding the optimal result.

The technical solution adopted by the present invention to solve the above technical problem is as follows:

The present invention provides a dictionary-free Chinese address word segmentation method comprising the following steps:

S1, counting the word frequency, mutual information and information entropy of every character string in the address corpus whose length is greater than 1 and less than or equal to 8;

S2, pre-processing the input address string with regular expressions, and performing full segmentation on the resulting string to obtain a segmentation set;

S3, computing the segmentation scheme with the minimum arc cost from the mutual information and information entropy of the character strings counted in step S1;

S4, performing a second pass over the set of strings in the segmentation scheme obtained in step S3 using a confidence method, judging whether each string is a true entry, and obtaining the optimal segmentation scheme.

Preferably, step S1 comprises the following sub-steps:

S11, counting the frequency of every string of more than 1 and at most 8 characters in each address of the address corpus, and storing it in the word-frequency dictionary Word_dic;

S12, computing the mutual information between character strings using formula (1) and storing it in MI_map;

where p(xy) is the probability that characters x and y occur together in the corpus, p(x) is the probability that character x occurs on its own, and p(y) is the probability that character y occurs on its own;

S13, computing the left entropy and right entropy of each character string using formulas (2) and (3) and storing them in LR_map, where the left entropy and right entropy are the information entropy of the string's left boundary and right boundary respectively;

$$E_L(w) = -\sum_{a \in A} P(aw \mid w)\,\log_2 P(aw \mid w) \tag{2}$$

$$E_R(w) = -\sum_{b \in B} P(wb \mid w)\,\log_2 P(wb \mid w) \tag{3}$$

where w denotes a character string, A denotes the set of left-adjacent characters of the string, a denotes a left-adjacent character, B denotes the set of right-adjacent characters of the string, b denotes a right-adjacent character, and aw and wb denote the strings formed by joining w with its left-adjacent character a and its right-adjacent character b respectively.
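The statistics of step S1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the mutual information is taken to be the common pointwise form log2(p(xy) / (p(x) p(y))) over adjacent characters because formula (1) is not reproduced in this text, and P(aw|w) and P(wb|w) are estimated from the observed neighbours of w.

```python
import math
from collections import Counter, defaultdict

MIN_LEN, MAX_LEN = 2, 8  # string lengths greater than 1 and at most 8


def substrings(address):
    """Yield (start, end, text) for every substring of length 2..8."""
    for start in range(len(address)):
        for length in range(MIN_LEN, MAX_LEN + 1):
            end = start + length
            if end > len(address):
                break
            yield start, end, address[start:end]


def count_substrings(corpus):
    """Word_dic: frequency of every substring of length 2..8."""
    word_dic = Counter()
    for address in corpus:
        word_dic.update(text for _, _, text in substrings(address))
    return word_dic


def char_mutual_information(corpus):
    """MI_map under the ASSUMED form log2(p(xy) / (p(x) * p(y)))."""
    chars, pairs = Counter(), Counter()
    for address in corpus:
        chars.update(address)
        pairs.update(address[i:i + 2] for i in range(len(address) - 1))
    n_chars, n_pairs = sum(chars.values()), sum(pairs.values())
    mi_map = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x, p_y = chars[pair[0]] / n_chars, chars[pair[1]] / n_chars
        mi_map[pair] = math.log2(p_xy / (p_x * p_y))
    return mi_map


def boundary_entropies(corpus):
    """LR_map: (left entropy, right entropy) per formulas (2) and (3)."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for address in corpus:
        for start, end, w in substrings(address):
            if start > 0:
                left[w][address[start - 1]] += 1
            if end < len(address):
                right[w][address[end]] += 1

    def entropy(neighbours):
        total = sum(neighbours.values())
        if total == 0:  # string only seen at an address boundary
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in neighbours.values())

    words = set(left) | set(right)
    return {w: (entropy(left[w]), entropy(right[w])) for w in words}
```

A string that is followed by many different characters gets a high right entropy, which suggests a word boundary after it; a string that is always followed by the same character gets a right entropy of zero.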

Preferably, step S2 is specifically: pre-processing the input address string with regular expressions, and performing full segmentation on the processed string W, with no separator inserted inside a run of consecutive digits, to obtain the segmentation set W = {w_i}, 1 ≤ i ≤ 2^(l-1), where l denotes the length of the string.
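As a minimal sketch of step S2, assuming that pre-processing simply strips whitespace and common punctuation (the actual regular expressions are not given in this text), full segmentation can be enumerated by deciding, for each gap between adjacent units, whether to cut there. The function names are hypothetical.

```python
import re
from itertools import product


def preprocess(address):
    """Hypothetical clean-up: drop whitespace and common punctuation."""
    return re.sub(r"[\s,，。.;；:：()（）]+", "", address)


def atomic_units(address):
    """Split into single characters, keeping each run of digits as one unit."""
    return re.findall(r"\d+|\D", address)


def full_segmentation(address):
    """Yield every way to cut the address; no cut falls inside a digit run."""
    units = atomic_units(preprocess(address))
    if not units:
        return
    # One boolean per gap between units: True means "cut here".
    for cuts in product((False, True), repeat=len(units) - 1):
        words, current = [], units[0]
        for unit, cut_before in zip(units[1:], cuts):
            if cut_before:
                words.append(current)
                current = unit
            else:
                current += unit
        words.append(current)
        yield words
```

A string of l units yields 2^(l-1) candidates, so exhaustive enumeration of this kind is only practical for short strings such as single addresses.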

Preferably, step S3 is specifically: using the word frequencies in the word-frequency dictionary obtained in step S1, the mutual information between character strings and the information entropy of the character strings, the probability of each w_i in the segmentation set W = {w_i} obtained in step S2 is computed with formula (4) and the results are saved; the segmentation scheme with the smallest computed value is selected and denoted segment_result;

where M denotes the string to the left of a cut point in the address string, N denotes the string to the right of the cut point, and m and n denote the rightmost character of the left string and the leftmost character of the right string respectively.
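The selection in step S3 can be illustrated with the sketch below. Formula (4) is not reproduced in this text, so `cut_cost` is a purely hypothetical stand-in that follows only the stated intuition: a cut between M and N is cheap when the mutual information of the boundary characters m and n is low and the boundary entropies of M and N are high. The names `cut_cost` and `best_segmentation` are assumptions, as is the layout of `mi_map` and `lr_map`.

```python
def cut_cost(left, right, mi_map, lr_map):
    """HYPOTHETICAL stand-in for formula (4), not the patented formula.

    left, right: the strings M and N on either side of a cut point.
    mi_map: maps a two-character string "mn" to its mutual information.
    lr_map: maps a string to a (left entropy, right entropy) pair.
    """
    m, n = left[-1], right[0]
    mi = mi_map.get(m + n, 0.0)
    right_entropy_of_left = lr_map.get(left, (0.0, 0.0))[1]
    left_entropy_of_right = lr_map.get(right, (0.0, 0.0))[0]
    return mi - right_entropy_of_left - left_entropy_of_right


def best_segmentation(candidates, mi_map, lr_map):
    """Return the candidate whose summed cut cost is smallest."""
    def total(words):
        return sum(cut_cost(a, b, mi_map, lr_map)
                   for a, b in zip(words, words[1:]))
    return min(candidates, key=total)
```

Because each candidate is a path through the cut points of the string, picking the candidate with the smallest summed cost corresponds to choosing the minimum-cost path described in step S3.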

Preferably, step S4 is specifically: using confidence formula (5), judging in turn whether each of the strings T₁, T₂, ..., T_n cut out in segment_result is a true entry, placing the true entries in the result set last_result, and outputting it;

where fre(w₁) and fre(w) denote the number of times strings w₁ and w occur in the corpus, and conf(w₁|w) denotes the confidence of entry w₁ relative to entry w.

Specifically, step S4 comprises the following sub-steps:

S41, setting a threshold α below which the longer string is kept and a threshold β above which the shorter string is kept;

S42, for a string T₁ = Q₁Q₂...Q_n, where each Q_i denotes a single character and n is the length of T₁: if n = 2, T₁ is placed in the result set last_result; otherwise go to step S43;

S43, define firstword = Q₁Q₂ and secondword = Q₁Q₂Q₃, and compute the confidence with formula (5). If conf < α, secondword is retained; otherwise go to step S44. At the same time, check whether secondword equals T₁: if so, the loop ends and last_result is output; otherwise an extension comparison (appending one more character) is performed by setting firstword = secondword and secondword = Q₁Q₂Q₃Q₄, and step S43 is executed again;

S44, if conf > β, firstword is retained; otherwise go to step S45. At the same time, check whether secondword equals T₁: if so, firstword is placed in the result set last_result and T₁ is set to the string that remains after firstword is removed; otherwise an extension comparison is performed, keeping firstword unchanged and setting secondword = Q₁Q₂Q₃Q₄, and the method returns to step S43;

S45, if α < conf < β, the word frequencies of the two strings are compared. If fre(firstword) > fre(secondword), firstword is placed in the result set last_result, T₁ is set to the string that remains after firstword is removed, and the method returns to step S42. If fre(firstword) < fre(secondword), check whether secondword equals T₁: if so, secondword is placed in the result set last_result, the loop ends and last_result is output; otherwise an extension comparison is performed by setting firstword = secondword and secondword = Q₁Q₂Q₃Q₄, and the method returns to step S43.
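Steps S42 to S45 can be sketched as the loop below. Several points are assumptions rather than statements of the patent: formula (5) is not reproduced in this text, so `confidence` uses (fre(firstword) - fre(secondword)) / fre(firstword) as a stand-in; ties and unseen strings are handled in an arbitrary way; leftovers shorter than 2 characters are emitted as they are; and the limit of 3 comparisons per firstword is read as "emit firstword and continue with the remainder". All function and parameter names are hypothetical.

```python
import re


def confidence(first, second, fre):
    """ASSUMED stand-in for formula (5): the share of occurrences of
    `first` that are not continued as `second`."""
    f_first = fre.get(first, 0)
    if f_first == 0:  # unseen string: assumption, no evidence for a cut
        return 0.0
    return (f_first - fre.get(second, 0)) / f_first


def refine(segment, fre, alpha=0.3, beta=0.8, max_compares=3):
    """Split one coarse segment into entries, following steps S42-S45."""
    result, rest = [], segment
    while rest:
        if len(rest) <= 2:                    # S42 (n = 2, or a shorter leftover)
            result.append(rest)
            break
        first, end, compares = rest[:2], 3, 0
        while True:
            second = rest[:end]
            conf = confidence(first, second, fre)
            compares += 1
            if conf > beta:                   # S44: keep the shorter string
                keep_first = True
            elif conf < alpha:                # S43: keep the longer string
                keep_first = False
            else:                             # S45: raw frequencies decide
                keep_first = fre.get(first, 0) > fre.get(second, 0)
            if keep_first:
                if conf > beta and second != rest and compares < max_compares:
                    end += 1                  # same firstword, longer secondword
                    continue
                result.append(first)          # emit firstword, restart on the rest
                rest = rest[len(first):]
                break
            if second == rest:                # the longer string is the whole segment
                result.append(second)
                rest = ""
                break
            first, end, compares = second, end + 1, 0
    return result


def refine_all(segments, fre, **thresholds):
    """Refine every coarse segment; segments that start with a digit, such
    as a house number, pass through untouched (a hypothetical check)."""
    out = []
    for seg in segments:
        if re.match(r"\d", seg):
            out.append(seg)
        else:
            out.extend(refine(seg, fre, **thresholds))
    return out
```

Under the assumed confidence, a firstword that occurs 22 times and a secondword that occurs 10 times give (22 - 10) / 22 ≈ 0.545, which falls between the thresholds 0.3 and 0.8 and is therefore settled by comparing frequencies.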

The beneficial effects of the invention are as follows:

The invention is mainly applied to parsing Chinese addresses in geographic location information services. The method segments Chinese addresses and is both feasible and effective.

Embodiment

The invention is further described below with reference to an embodiment.

The present invention provides a dictionary-free Chinese address word segmentation method. The Chinese address "武汉市洪山区落雁路郑家湾105号" (No. 105, Zhengjiawan, Luoyan Road, Hongshan District, Wuhan City) is used here to illustrate a specific implementation of the invention.

S1, data preparation:

(1) The frequency of every string of more than 1 and at most 8 characters in each address of the address corpus is counted and stored in the word-frequency dictionary Word_dic.

(2) The mutual information between character strings is computed and stored in MI_map.

(3) The left entropy and right entropy of each character string are computed and stored in LR_map; the left entropy and right entropy are the information entropy of the string's left boundary and right boundary respectively.

S2, the input address string is pre-processed with regular expressions, and full segmentation is performed on the processed string W, with no separator inserted inside a run of consecutive digits, to obtain the segmentation set W = {w_i}, 1 ≤ i ≤ 2^(l-1), where l denotes the length of the string.

S3, using the word frequencies in the word-frequency dictionary obtained in step S1, the mutual information between character strings and the information entropy of the character strings, the probability of each w_i in the segmentation set W = {w_i} obtained in step S2 is computed with formula (4) and the results are saved; the segmentation scheme with the smallest computed value is selected and denoted segment_result;

Pro("武汉市洪山区 | 落雁路郑家湾 | 105号") = 1.1029722727130447E5;

"武汉市洪山区 | 落雁路郑家湾 | 105号" is therefore recorded as segment_result.

S4, using confidence formula (5), it is judged in turn whether each of the strings T₁, T₂, ..., T_n cut out in segment_result is a true entry; digits and combinations of digits with "号" (number) are not processed.

T₁ = "武汉市洪山区" (Wuhan City, Hongshan District), T₂ = "落雁路郑家湾" (Luoyan Road, Zhengjiawan), T₃ = "105号" (No. 105).

String T₁ = "武汉市洪山区", n = 6.

a) firstword = "武汉", secondword = "武汉市". Formula (5) gives conf = 0.0019 < 0.3, so "武汉市" is retained. The extension comparison continues with firstword = "武汉市", secondword = "武汉市洪"; computing the confidence of firstword relative to secondword with formula (5) gives conf = 0.818 > 0.8, so "武汉市" is retained. The extension comparison continues with firstword = "武汉市", secondword = "武汉市洪山"; formula (5) gives conf = 0.818 > 0.8, so "武汉市" is retained. The extension comparison continues with firstword = "武汉市", secondword = "武汉市洪山区"; formula (5) gives conf = 0.818 > 0.8, so "武汉市" is retained. The number of extension comparisons has now reached 3 and secondword equals T₁, so "武汉市" is placed in the result set last_result.

b) T₁ = "洪山区", n = 3, firstword = "洪山", secondword = "洪山区". Formula (5) gives conf = 0.028 < 0.3, so "洪山区" is retained; since secondword = T₁, "洪山区" is placed in the result set last_result.

String T₂ = "落雁路郑家湾", n = 6.

a) firstword = "落雁", secondword = "落雁路". Formula (5) gives conf = 0.0 < 0.3, so "落雁路" is retained. The extension comparison continues with firstword = "落雁路", secondword = "落雁路郑"; formula (5) gives conf = 0.54, which is greater than 0.3 and less than 0.8, so the word frequencies of the two strings are compared: fre("落雁路") = 22 > fre("落雁路郑") = 10, so "落雁路" is retained and placed in the result set last_result.

b) T₁ = "郑家湾", n = 3, firstword = "郑家", secondword = "郑家湾". Formula (5) gives conf = 0.0 < 0.3, so "郑家湾" is retained; since secondword = T₁, "郑家湾" is placed in the result set last_result.

String T₃ = "105号" is a combination of digits and "号", so it is not processed and is placed directly in the result set last_result.

S5, output last_result = "武汉市 | 洪山区 | 落雁路 | 郑家湾 | 105号".
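The threshold decisions in this worked example can be replayed with a few lines of Python. The sketch uses only the confidence values and frequencies reported above together with the thresholds 0.3 and 0.8; the function name `decide` and the handling of equal frequencies are assumptions.

```python
ALPHA, BETA = 0.3, 0.8  # thresholds used in the worked example


def decide(conf, fre_first=None, fre_second=None):
    """Map a confidence value to the action taken in steps S43-S45."""
    if conf < ALPHA:
        return "keep secondword"
    if conf > BETA:
        return "keep firstword"
    # Between the thresholds the raw frequencies decide (step S45);
    # equal frequencies are treated as "keep secondword" by assumption.
    return "keep firstword" if fre_first > fre_second else "keep secondword"


# (conf, fre(firstword), fre(secondword)) in the order reported in the example.
worked_example = [
    (0.0019, None, None),  # T1 a), first comparison
    (0.818, None, None),   # T1 a), each of the three later comparisons
    (0.028, None, None),   # T1 b)
    (0.0, None, None),     # T2 a), first comparison
    (0.54, 22, 10),        # T2 a), second comparison
    (0.0, None, None),     # T2 b)
]
actions = [decide(*row) for row in worked_example]
```

Each entry of `actions` names the string that the corresponding step of the example retains.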

Parts not described in the specification are prior art or common knowledge. This embodiment merely illustrates the invention and does not limit its scope; equivalent substitutions and modifications made to the invention by those skilled in the art are considered to fall within the scope of protection of the claims.

Claims

1. A dictionary-free Chinese address word segmentation method, characterized by comprising the following steps:

S1, counting the word frequency, mutual information and information entropy of every character string in the address corpus whose length is greater than 1 and less than or equal to 8;

S2, pre-processing the input address string with regular expressions, and performing full segmentation on the resulting string to obtain a segmentation set;

S3, computing the segmentation scheme with the minimum arc cost from the mutual information and information entropy of the character strings counted in step S1;

S4, performing a second pass over the set of strings in the segmentation scheme obtained in step S3 using a confidence method, judging whether each string is a true entry, and obtaining the optimal segmentation scheme.

2. The dictionary-free Chinese address word segmentation method according to claim 1, characterized in that step S1 comprises the following sub-steps:

S11, counting the frequency of every string of more than 1 and at most 8 characters in each address of the address corpus, and storing it in the word-frequency dictionary Word_dic;

S12, computing the mutual information between character strings using formula (1) and storing it in MI_map;

where p(xy) is the probability that characters x and y occur together in the corpus, p(x) is the probability that character x occurs on its own, and p(y) is the probability that character y occurs on its own;

S13, computing the left entropy and right entropy of each character string using formulas (2) and (3) and storing them in LR_map, where the left entropy and right entropy are the information entropy of the string's left boundary and right boundary respectively;

$$E_L(w) = -\sum_{a \in A} P(aw \mid w)\,\log_2 P(aw \mid w) \tag{2}$$

$$E_R(w) = -\sum_{b \in B} P(wb \mid w)\,\log_2 P(wb \mid w) \tag{3}$$

where w denotes a character string, A denotes the set of left-adjacent characters of the string, a denotes a left-adjacent character, B denotes the set of right-adjacent characters of the string, b denotes a right-adjacent character, and aw and wb denote the strings formed by joining w with its left-adjacent character a and its right-adjacent character b respectively.

3. The dictionary-free Chinese address word segmentation method according to claim 2, characterized in that step S2 is specifically: pre-processing the input address string with regular expressions, and performing full segmentation on the processed string W, with no separator inserted inside a run of consecutive digits, to obtain the segmentation set W = {w_i}, 1 ≤ i ≤ 2^(l-1), where l denotes the length of the string.

4. The dictionary-free Chinese address word segmentation method according to claim 3, characterized in that step S3 is specifically: using the word frequencies in the word-frequency dictionary obtained in step S1, the mutual information between character strings and the information entropy of the character strings, the probability of each w_i in the segmentation set W = {w_i} obtained in step S2 is computed with formula (4) and the results are saved; the segmentation scheme with the smallest computed value is selected and denoted segment_result;

where M denotes the string to the left of a cut point in the address string, N denotes the string to the right of the cut point, and m and n denote the rightmost character of the left string and the leftmost character of the right string respectively.

5. The dictionary-free Chinese address word segmentation method according to claim 4, characterized in that step S4 is specifically: using confidence formula (5), judging in turn whether each of the strings T₁, T₂, ..., T_n cut out in segment_result is a true entry, placing the true entries in the result set last_result, and outputting it;

where fre(w₁) and fre(w) denote the number of times strings w₁ and w occur in the corpus, and conf(w₁|w) denotes the confidence of entry w₁ relative to entry w.

6. The dictionary-free Chinese address word segmentation method according to claim 5, characterized in that step S4 comprises the following sub-steps:

S41, setting a threshold α below which the longer string is kept and a threshold β above which the shorter string is kept;

S42, for a string T₁ = Q₁Q₂...Q_n, where each Q_i denotes a single character and n is the length of T₁: if n = 2, T₁ is placed in the result set last_result; otherwise go to step S43;

S43, define firstword = Q₁Q₂ and secondword = Q₁Q₂Q₃, and compute the confidence with formula (5). If conf < α, secondword is retained; otherwise go to step S44. At the same time, check whether secondword equals T₁: if so, the loop ends and last_result is output; otherwise an extension comparison (appending one more character) is performed by setting firstword = secondword and secondword = Q₁Q₂Q₃Q₄, and step S43 is executed again;

S44, if conf > β, firstword is retained; otherwise go to step S45. At the same time, check whether secondword equals T₁: if so, firstword is placed in the result set last_result and T₁ is set to the string that remains after firstword is removed; otherwise an extension comparison is performed, keeping firstword unchanged and setting secondword = Q₁Q₂Q₃Q₄, and the method returns to step S43;

S45, if α < conf < β, the word frequencies of the two strings are compared. If fre(firstword) > fre(secondword), firstword is placed in the result set last_result, T₁ is set to the string that remains after firstword is removed, and the method returns to step S42. If fre(firstword) < fre(secondword), check whether secondword equals T₁: if so, secondword is placed in the result set last_result, the loop ends and last_result is output; otherwise an extension comparison is performed by setting firstword = secondword and secondword = Q₁Q₂Q₃Q₄, and the method returns to step S43.

7. The dictionary-free Chinese address word segmentation method according to claim 6, characterized in that, in steps S43 to S45, if a given word has been compared 3 times in the extension comparison, the loop is exited and last_result is output.