CN107329950A - A dictionary-free Chinese address word segmentation method - Google Patents
- Publication number
- CN107329950A (application number CN201710441735.1A)
- Authority
- CN
- China
- Prior art keywords
- mrow
- character string
- msub
- word
- secondword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses a dictionary-free Chinese address word segmentation method comprising the following steps: 1) for every character string in the training corpus whose length is greater than 1 and less than or equal to 8, obtain its word frequency, mutual information, and information entropy by counting; 2) preprocess the address string with a regular expression and fully segment the input address string; 3) obtain the segmentation scheme with the minimum arc cost from the mutual information and information entropy; 4) apply a second pass to the set of strings in that segmentation scheme using the confidence method, judging whether each string is a real entry, to obtain the optimal segmentation scheme.
Description
Technical field
The present invention relates to the fields of Internet technology and data mining, and in particular to a dictionary-free Chinese address word segmentation method that uses the mutual information, information entropy, and confidence of Chinese addresses to segment the address elements of a Chinese address.
Background
With the rapid development of Internet technology, the network has become an important platform for spreading and exchanging information. Large amounts of data and information are produced in cyberspace every day, most of it in the form of natural-language text, and mining useful information from that text is a current research focus. These texts contain a great deal of spatial information: according to sampling statistics, about 70% of web pages worldwide contain location information. Compared with traditional geographic information or data, however, the geographic information in text is unstructured and can be analyzed and mined only after it has been formalized. Formalizing the spatial information in text involves place-name and address word segmentation, spatial relation extraction, and event extraction. Place-name and address segmentation is the most basic, lowest-level task in this formalization, and its accuracy directly affects the effectiveness of the subsequent work.
Place-name and address segmentation is the application of Chinese word segmentation to place names and addresses: the process of splitting an address string into a number of geographic elements. Chinese word segmentation algorithms fall roughly into three classes: dictionary-based methods, statistics-based methods, and understanding-based methods. Because Chinese address names are numerous and irregular, no single dictionary contains all address information. For address strings, this document therefore proposes a dictionary-free Chinese address word segmentation method.
Summary of the invention
In view of the problems of the prior art, the object of the invention is to provide a dictionary-free Chinese address word segmentation method. The method gathers word frequency, mutual information, and information entropy statistics from an address corpus, fully segments the input string to obtain the set of all possible segmentations, computes the segmentation with the minimum arc cost, and then applies confidence processing to that segmentation to segment it a second time and obtain the optimal result.
The technical solution adopted by the invention to solve the above technical problem is as follows.
The invention provides a dictionary-free Chinese address word segmentation method comprising the following steps:
S1, for every character string in the address corpus whose length is greater than 1 and less than or equal to 8, count its word frequency, mutual information, and information entropy;
S2, preprocess the input address string with a regular expression, and fully segment the preprocessed string to obtain a segmentation set;
S3, using the mutual information and information entropy counted in step S1, compute the segmentation scheme with the minimum arc cost;
S4, apply a second pass to the set of strings in the segmentation scheme obtained in step S3 using the confidence method, judging whether each string is a real entry, to obtain the optimal segmentation scheme.
Preferably, step S1 comprises the following sub-steps:
S11, for every address in the address corpus, count the frequency of every substring whose length is greater than 1 and less than or equal to 8, and store the counts in the word frequency dictionary Word_dic;
S12, compute the mutual information between strings using formula (1) and store it in MI_map;
$$MI(x,y)=\log_2\frac{p(xy)}{p(x)\,p(y)} \tag{1}$$
where p(xy) is the probability that character x and character y occur together in the corpus, p(x) is the probability that character x occurs alone, and p(y) is the probability that character y occurs alone;
S13, compute the left entropy and right entropy of each string using formulas (2) and (3) and store them in LR_map; the left entropy and right entropy are the information entropy of the string's left boundary and right boundary, respectively;
$$E_L(w)=-\sum_{a\in A}P(aw\mid w)\log_2 P(aw\mid w) \tag{2}$$
$$E_R(w)=-\sum_{b\in B}P(wb\mid w)\log_2 P(wb\mid w) \tag{3}$$
where w is a string, A is the set of characters adjacent to w on the left, a is one left-adjacent character, B is the set of characters adjacent to w on the right, b is one right-adjacent character, and aw and wb are the strings formed by joining w with its left-adjacent character a and its right-adjacent character b, respectively.
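The counting in S11 and the mutual information in S12 could be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `count_substrings` and `mutual_information` are hypothetical, probabilities are assumed to be estimated as relative frequencies, and p(xy) is assumed to be taken over adjacent character pairs.

```python
from collections import Counter
from math import log2


def count_substrings(addresses, min_len=2, max_len=8):
    """Count every substring of length min_len..max_len in each address (S11)."""
    word_dic = Counter()
    for addr in addresses:
        for start in range(len(addr)):
            for length in range(min_len, max_len + 1):
                if start + length > len(addr):
                    break
                word_dic[addr[start:start + length]] += 1
    return word_dic


def mutual_information(addresses):
    """MI(x, y) = log2(p(xy) / (p(x) p(y))) for adjacent characters x, y (S12).

    Assumption: probabilities are relative frequencies over the corpus.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for addr in addresses:
        char_counts.update(addr)
        pair_counts.update(addr[i:i + 2] for i in range(len(addr) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    mi_map = {}
    for pair, n in pair_counts.items():
        p_xy = n / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        mi_map[pair] = log2(p_xy / (p_x * p_y))
    return mi_map
```

Single characters are deliberately absent from `word_dic`, because the source counts only strings longer than 1 character.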
Preferably, step S2 is specifically: preprocess the input address string with a regular expression, then fully segment the preprocessed string W, inserting no separator inside a run of consecutive digits, to obtain the segmentation set $W=\{w_i\}$, $1 \le i \le 2^{l-1}$, where l is the length of the string.
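Full segmentation with the digit rule could be sketched as below. The helper names `atomic_units` and `full_segmentation` are hypothetical, and "digit" is assumed to mean whatever Python's `str.isdigit` accepts; the source does not specify the regular expression used for preprocessing, so none is shown.

```python
from itertools import product


def atomic_units(text):
    """Split text into units; a run of consecutive digits stays one unit."""
    units = []
    for ch in text:
        if ch.isdigit() and units and units[-1].isdigit():
            units[-1] += ch
        else:
            units.append(ch)
    return units


def full_segmentation(text):
    """Yield every segmentation of text as a list of strings.

    Each gap between two units is either cut or not, giving 2**(k-1)
    segmentations for k units.
    """
    units = atomic_units(text)
    if not units:
        return
    for cuts in product((False, True), repeat=len(units) - 1):
        segments = [units[0]]
        for unit, cut in zip(units[1:], cuts):
            if cut:
                segments.append(unit)
            else:
                segments[-1] += unit
        yield segments
```

Enumerating all candidates grows exponentially with the address length, so a sketch like this is only practical for short strings.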
Preferably, step S3 is specifically: using the word frequencies in the word frequency dictionary obtained in step S1, the mutual information between strings, and the information entropy of the strings, compute the probability of each $w_i$ in the segmentation set $W=\{w_i\}$ obtained in step S2 with formula (4), save the results, and select the segmentation scheme with the smallest result, denoted segment_result;
$$c(M,N)=\frac{MI(m,n)}{E_R(M)\,E_L(N)} \tag{4}$$
where M is the string to the left of a cut point in the address string, N is the string to the right of the cut point, and m and n are the rightmost character of the left string and the leftmost character of the right string, respectively.
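One way to score candidates with formula (4) is sketched below. Several points are assumptions rather than statements from the source: the costs of the individual cut points are combined by summation, a small `eps` guards against zero entropy in the denominator, missing statistics default to 0, and `lr_map` stores `(left_entropy, right_entropy)` tuples. The source does not say how candidates with different numbers of cuts are compared, and under plain summation an unsplit string scores 0.

```python
def cut_cost(left, right, mi_map, lr_map, eps=1e-9):
    """Formula (4): c(M, N) = MI(m, n) / (E_R(M) * E_L(N)).

    m is the last character of the left string M, n the first character of
    the right string N. eps is an assumed guard against division by zero.
    """
    m, n = left[-1], right[0]
    mi = mi_map.get(m + n, 0.0)
    e_right = lr_map.get(left, (0.0, 0.0))[1]
    e_left = lr_map.get(right, (0.0, 0.0))[0]
    return mi / (e_right * e_left + eps)


def segmentation_cost(segments, mi_map, lr_map):
    """Assumed aggregation: sum the cost of every cut point."""
    return sum(cut_cost(a, b, mi_map, lr_map)
               for a, b in zip(segments, segments[1:]))


def best_segmentation(candidates, mi_map, lr_map):
    """Return the candidate with the smallest total cost."""
    return min(candidates, key=lambda s: segmentation_cost(s, mi_map, lr_map))
```

The intuition behind the ratio is that a good cut point has low mutual information across the boundary and high boundary entropy on both sides, which makes the cost small.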
Preferably, step S4 is specifically: use the confidence formula (5) to judge in turn whether each of the strings $T_1, T_2, \ldots, T_n$ cut out in segment_result is a real entry, put the real entries into the result set last_result, and output it;
$$conf(w_1\mid w)=\frac{fre(w_1)-fre(w)}{fre(w_1)} \tag{5}$$
where $fre(w_1)$ and $fre(w)$ are the numbers of times the strings $w_1$ and $w$ occur in the corpus, and $conf(w_1\mid w)$ is the confidence of entry $w_1$ relative to entry $w$.
Specifically, step S4 comprises the following sub-steps:
S41, set a take-large threshold α (below which the longer string is kept) and a take-small threshold β (above which the shorter string is kept);
S42, for the string $T_1 = Q_1 Q_2 \ldots Q_n$, where each $Q_i$ is a single character and n is the length of $T_1$: if n = 2, put $T_1$ into the result set last_result; otherwise go to step S43;
S43, define firstword = $Q_1Q_2$ and secondword = $Q_1Q_2Q_3$ and compute the confidence with formula (5). If conf < α, keep secondword; otherwise go to step S44. At the same time, check whether secondword equals $T_1$: if so, the loop ends and last_result is output; otherwise extend by one character for the next comparison, setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and repeat step S43;
S44, if conf > β, keep firstword; otherwise go to step S45. At the same time, check whether secondword equals $T_1$: if so, put firstword into the result set last_result and set $T_1$ to the string that remains after removing firstword; otherwise extend by one character, keeping firstword unchanged and setting secondword = $Q_1Q_2Q_3Q_4$, and go to step S43;
S45, if α < conf < β, compare the word frequencies of the two strings. If fre(firstword) > fre(secondword), put firstword into the result set last_result, set $T_1$ to the string that remains after removing firstword, and go to step S42. If fre(firstword) < fre(secondword), check whether secondword equals $T_1$: if so, put secondword into the result set last_result, end the loop, and output last_result; otherwise extend by one character, setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and go to step S43.
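Sub-steps S41 to S45 could be read as the loop sketched below. This is one interpretation under stated assumptions, not the definitive procedure: `refine` and its parameters are hypothetical names, an unseen firstword is given confidence 1.0, a confidence exactly equal to α or β is treated as falling in the middle band, and the limit of 3 comparisons per firstword is used to cut after firstword and continue with the remainder. The frequency table in the usage lines is hypothetical, chosen to reproduce the confidences quoted for the example address; only the counts 22 and 10 appear in the source.

```python
def conf(first, second, fre):
    """Formula (5): (fre(first) - fre(second)) / fre(first)."""
    f1 = fre.get(first, 0)
    if f1 == 0:
        return 1.0  # assumption: an unseen prefix gives no support for extending
    return (f1 - fre.get(second, 0)) / f1


def refine(segment, fre, alpha=0.3, beta=0.8, max_compares=3):
    """Second-pass segmentation of one string by character extension."""
    result = []
    rest = segment
    while rest:
        if len(rest) <= 2:  # S42: nothing left to compare
            result.append(rest)
            break
        first_len, second_len, compares = 2, 3, 0
        while True:
            first, second = rest[:first_len], rest[:second_len]
            c = conf(first, second, fre)
            compares += 1
            at_end = second == rest
            longer_more_frequent = fre.get(first, 0) < fre.get(second, 0)
            if c < alpha or (c <= beta and longer_more_frequent):
                # S43, or second branch of S45: the longer string wins
                if at_end:
                    result.append(second)
                    rest = ""
                    break
                first_len, second_len, compares = second_len, second_len + 1, 0
            elif c > beta:
                # S44: keep firstword; stop at the end or after max_compares
                if at_end or compares >= max_compares:
                    result.append(first)
                    rest = rest[first_len:]
                    break
                second_len += 1
            else:
                # S45, first branch: firstword is more frequent, cut after it
                result.append(first)
                rest = rest[first_len:]
                break
    return result


# Hypothetical counts (only 22 and 10 are taken from the source).
fre = {
    "武汉": 10000, "武汉市": 9981,
    "武汉市洪": 1817, "武汉市洪山": 1817, "武汉市洪山区": 1817,
    "洪山": 1000, "洪山区": 972,
    "落雁": 22, "落雁路": 22, "落雁路郑": 10,
    "郑家": 10, "郑家湾": 10,
}
print(refine("武汉市洪山区", fre))  # -> ['武汉市', '洪山区']
print(refine("落雁路郑家湾", fre))  # -> ['落雁路', '郑家湾']
```

Because secondword always contains firstword as a prefix, substring counts give fre(firstword) ≥ fre(secondword), so the confidence stays between 0 and 1 and the second branch of S45 is kept here only for fidelity to the text.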
The beneficial effects of the invention are:
The invention is mainly applied to parsing Chinese addresses in geographic location information services. The method segments Chinese addresses and is highly feasible and effective.
Embodiment
The invention is further described below with reference to an embodiment.
The invention provides a dictionary-free Chinese address word segmentation method. The Chinese address "Wuhan City Hongshan District Luoyan Road Zhengjiawan No. 105" (武汉市洪山区落雁路郑家湾105号) is used here to illustrate a specific implementation of the invention.
S1, data preparation:
(1) For every address in the address corpus, count the frequency of every substring whose length is greater than 1 and less than or equal to 8, and store the counts in the word frequency dictionary Word_dic.
(2) Compute the mutual information between strings using formula (1) and store it in MI_map.
$$MI(x,y)=\log_2\frac{p(xy)}{p(x)\,p(y)} \tag{1}$$
where p(xy) is the probability that character x and character y occur together in the corpus, p(x) is the probability that character x occurs alone, and p(y) is the probability that character y occurs alone.
(3) Compute the left entropy and right entropy of each string using formulas (2) and (3) and store them in LR_map; the left entropy and right entropy are the information entropy of the string's left boundary and right boundary, respectively.
$$E_L(w)=-\sum_{a\in A}P(aw\mid w)\log_2 P(aw\mid w) \tag{2}$$
$$E_R(w)=-\sum_{b\in B}P(wb\mid w)\log_2 P(wb\mid w) \tag{3}$$
where w is a string, A is the set of characters adjacent to w on the left, a is one left-adjacent character, B is the set of characters adjacent to w on the right, b is one right-adjacent character, and aw and wb are the strings formed by joining w with its left-adjacent character a and its right-adjacent character b, respectively.
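The left and right entropy of step (3) could be computed as in the sketch below. The name `branch_entropies` is hypothetical. As an assumption, P(aw|w) is estimated as the share of occurrences of w preceded by a among all occurrences of w that have a left neighbour, and likewise for the right side; occurrences at the very start or end of an address contribute no neighbour.

```python
from collections import Counter, defaultdict
from math import log2


def branch_entropies(addresses, min_len=2, max_len=8):
    """Return {w: (left_entropy, right_entropy)} for substrings of the corpus."""
    left_neighbors = defaultdict(Counter)
    right_neighbors = defaultdict(Counter)
    for addr in addresses:
        for start in range(len(addr)):
            for length in range(min_len, max_len + 1):
                end = start + length
                if end > len(addr):
                    break
                w = addr[start:end]
                if start > 0:
                    left_neighbors[w][addr[start - 1]] += 1
                if end < len(addr):
                    right_neighbors[w][addr[end]] += 1

    def entropy(counter):
        # Shannon entropy (base 2) of the neighbour distribution; 0 if empty.
        total = sum(counter.values())
        return -sum((n / total) * log2(n / total) for n in counter.values())

    words = set(left_neighbors) | set(right_neighbors)
    return {
        w: (entropy(left_neighbors.get(w, Counter())),
            entropy(right_neighbors.get(w, Counter())))
        for w in words
    }
```

A string that is always followed by the same character has right entropy 0, which signals that its right edge is unlikely to be a word boundary.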
S2, preprocess the input address string with a regular expression, then fully segment the preprocessed string W, inserting no separator inside a run of consecutive digits, to obtain the segmentation set $W=\{w_i\}$, $1 \le i \le 2^{l-1}$, where l is the length of the string.
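As a rough size check for the example address, assume it is written with 16 characters: 12 Chinese characters for the city, district, road, and locality, the three digits 105, and one house-number character. That character count is an assumption inferred from the lengths n = 6 given later for the first two segments. Treating the digit run as a single unit leaves 14 units:

```latex
\begin{align*}
\text{units} &= 12 + 1 + 1 = 14, \\
|W| &= 2^{14-1} = 8192 \quad \text{(digit run kept whole)}, \\
2^{16-1} &= 32768 \quad \text{(if every character gap could be cut)}.
\end{align*}
```

Under this assumption the digit rule reduces the candidate set by a factor of four.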
S3, using the word frequencies in the word frequency dictionary obtained in step S1, the mutual information between strings, and the information entropy of the strings, compute the probability of each $w_i$ in the segmentation set $W=\{w_i\}$ obtained in step S2 with formula (4), save the results, and select the segmentation scheme with the smallest result, denoted segment_result;
$$c(M,N)=\frac{MI(m,n)}{E_R(M)\,E_L(N)} \tag{4}$$
where M is the string to the left of a cut point in the address string, N is the string to the right of the cut point, and m and n are the rightmost character of the left string and the leftmost character of the right string, respectively.
Pro("Wuhan City Hongshan District | Luoyan Road Zhengjiawan | No. 105") = 1.1029722727130447E5;
"Wuhan City Hongshan District | Luoyan Road Zhengjiawan | No. 105" is therefore recorded as segment_result.
S4, use the confidence formula (5) to judge in turn whether each of the strings $T_1, T_2, \ldots, T_n$ cut out in segment_result is a real entry; digits and combinations of digits with the number marker are not processed.
$T_1$ = "Wuhan City Hongshan District" (武汉市洪山区), $T_2$ = "Luoyan Road Zhengjiawan" (落雁路郑家湾), $T_3$ = "No. 105" (105号).
$$conf(w_1\mid w)=\frac{fre(w_1)-fre(w)}{fre(w_1)} \tag{5}$$
where $fre(w_1)$ and $fre(w)$ are the numbers of times the strings $w_1$ and $w$ occur in the corpus, and $conf(w_1\mid w)$ is the confidence of entry $w_1$ relative to entry $w$.
String $T_1$ = "Wuhan City Hongshan District" (武汉市洪山区), n = 6.
a) firstword = "Wuhan" (武汉), secondword = "Wuhan City" (武汉市). Formula (5) gives conf = 0.0019 < 0.3, so "Wuhan City" is kept. Extend by one character: firstword = "Wuhan City" (武汉市), secondword = 武汉市洪. The confidence of firstword relative to secondword from formula (5) is conf = 0.818 > 0.8, so "Wuhan City" is kept. Extend again: firstword = "Wuhan City", secondword = 武汉市洪山; formula (5) gives conf = 0.818 > 0.8, so "Wuhan City" is kept. Extend again: firstword = "Wuhan City", secondword = "Wuhan City Hongshan District" (武汉市洪山区); formula (5) gives conf = 0.818 > 0.8, so "Wuhan City" is kept. The number of extensions has reached 3 and secondword equals $T_1$, so "Wuhan City" is put into the result set last_result.
b) $T_1$ = "Hongshan District" (洪山区), n = 3, firstword = "Hongshan" (洪山), secondword = "Hongshan District" (洪山区). Formula (5) gives conf = 0.028 < 0.3, so "Hongshan District" is kept; secondword = $T_1$, so "Hongshan District" is put into the result set last_result.
String $T_2$ = "Luoyan Road Zhengjiawan" (落雁路郑家湾), n = 6.
a) firstword = "Luoyan" (落雁), secondword = "Luoyan Road" (落雁路). Formula (5) gives conf = 0.0 < 0.3, so "Luoyan Road" is kept. Extend by one character: firstword = "Luoyan Road" (落雁路), secondword = 落雁路郑. Formula (5) gives conf = 0.54, which is greater than 0.3 and less than 0.8, so the word frequencies of the two strings are compared: fre(落雁路) = 22 > fre(落雁路郑) = 10, so "Luoyan Road" is kept and put into the result set last_result.
b) The remaining string is "Zhengjiawan" (郑家湾), n = 3, firstword = "Zhengjia" (郑家), secondword = "Zhengjiawan" (郑家湾). Formula (5) gives conf = 0.0 < 0.3, so "Zhengjiawan" is kept; secondword equals the remaining string, so "Zhengjiawan" is put into the result set last_result.
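The middle-band confidence quoted for $T_2$ can be checked directly from formula (5) and the two frequencies given in the text, taking the thresholds 0.3 and 0.8 used throughout this example:

```latex
\begin{align*}
conf(\text{Luoyan Road} \mid \text{Luoyan Road} + \text{next character})
  &= \frac{22 - 10}{22} = \frac{12}{22} \approx 0.545, \\
0.3 &< 0.545 < 0.8 .
\end{align*}
```

The value 0.54 in the text is this figure truncated to two decimals. Because it lies between the two thresholds, the decision falls to the frequency comparison, and 22 > 10 places the cut after "Luoyan Road".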
String $T_3$ = "No. 105" (105号) is a combination of digits and the number marker, so it is not processed and is put directly into the result set last_result.
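Skipping numeric segments in the second pass could look like the sketch below. The pattern is an assumption: the source only says that digits and digit-plus-number-marker combinations are left alone, so `NUMERIC_SEGMENT`, `is_numeric_segment`, and `second_pass` are hypothetical names, and the `refine` argument stands for whatever confidence-based routine is applied to the other segments.

```python
import re

# Assumed pattern: one or more digits, optionally followed by the marker 号.
NUMERIC_SEGMENT = re.compile(r"[0-9]+号?")


def is_numeric_segment(segment):
    """True if the whole segment is digits, optionally ending in 号."""
    return NUMERIC_SEGMENT.fullmatch(segment) is not None


def second_pass(segments, refine):
    """Apply refine to every non-numeric segment; pass numeric ones through."""
    out = []
    for seg in segments:
        if is_numeric_segment(seg):
            out.append(seg)
        else:
            out.extend(refine(seg))
    return out


# Stand-in refine that cuts after the third character, for illustration only.
parts = second_pass(["武汉市洪山区", "落雁路郑家湾", "105号"],
                    refine=lambda s: [s[:3], s[3:]])
print("|".join(parts))
```

Keeping the numeric check separate from the confidence logic means house numbers never depend on corpus frequencies, which are unlikely to be informative for arbitrary digit strings.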
S5, output last_result = "Wuhan City | Hongshan District | Luoyan Road | Zhengjiawan | No. 105" (武汉市|洪山区|落雁路|郑家湾|105号).
Parts not described in the specification are prior art or common knowledge. This embodiment only illustrates the invention and does not limit its scope; equivalent replacements and modifications made to the invention by those skilled in the art are considered to fall within the scope protected by the claims of the invention.
Claims (7)
1. A dictionary-free Chinese address word segmentation method, characterized by comprising the following steps:
S1, for every character string in the address corpus whose length is greater than 1 and less than or equal to 8, count its word frequency, mutual information, and information entropy;
S2, preprocess the input address string with a regular expression, and fully segment the preprocessed string to obtain a segmentation set;
S3, using the mutual information and information entropy counted in step S1, compute the segmentation scheme with the minimum arc cost;
S4, apply a second pass to the set of strings in the segmentation scheme obtained in step S3 using the confidence method, judging whether each string is a real entry, to obtain the optimal segmentation scheme.
2. The dictionary-free Chinese address word segmentation method according to claim 1, characterized in that step S1 comprises the following sub-steps:
S11, for every address in the address corpus, count the frequency of every substring whose length is greater than 1 and less than or equal to 8, and store the counts in the word frequency dictionary Word_dic;
S12, compute the mutual information between strings using formula (1) and store it in MI_map;
$$MI(x,y)=\log_2\frac{p(xy)}{p(x)\,p(y)} \tag{1}$$
where p(xy) is the probability that character x and character y occur together in the corpus, p(x) is the probability that character x occurs alone, and p(y) is the probability that character y occurs alone;
S13, compute the left entropy and right entropy of each string using formulas (2) and (3) and store them in LR_map; the left entropy and right entropy are the information entropy of the string's left boundary and right boundary, respectively;
$$E_L(w)=-\sum_{a\in A}P(aw\mid w)\log_2 P(aw\mid w) \tag{2}$$
$$E_R(w)=-\sum_{b\in B}P(wb\mid w)\log_2 P(wb\mid w) \tag{3}$$
where w is a string, A is the set of characters adjacent to w on the left, a is one left-adjacent character, B is the set of characters adjacent to w on the right, b is one right-adjacent character, and aw and wb are the strings formed by joining w with its left-adjacent character a and its right-adjacent character b, respectively.
3. The dictionary-free Chinese address word segmentation method according to claim 2, characterized in that step S2 is specifically: preprocess the input address string with a regular expression, then fully segment the preprocessed string W, inserting no separator inside a run of consecutive digits, to obtain the segmentation set $W=\{w_i\}$, $1 \le i \le 2^{l-1}$, where l is the length of the string.
4. The dictionary-free Chinese address word segmentation method according to claim 3, characterized in that step S3 is specifically: using the word frequencies in the word frequency dictionary obtained in step S1, the mutual information between strings, and the information entropy of the strings, compute the probability of each $w_i$ in the segmentation set $W=\{w_i\}$ obtained in step S2 with formula (4), save the results, and select the segmentation scheme with the smallest result, denoted segment_result;
$$c(M,N)=\frac{MI(m,n)}{E_R(M)\,E_L(N)} \tag{4}$$
where M is the string to the left of a cut point in the address string, N is the string to the right of the cut point, and m and n are the rightmost character of the left string and the leftmost character of the right string, respectively.
5. The dictionary-free Chinese address word segmentation method according to claim 4, characterized in that step S4 is specifically: use the confidence formula (5) to judge in turn whether each of the strings $T_1, T_2, \ldots, T_n$ cut out in segment_result is a real entry, put the real entries into the result set last_result, and output it;
$$conf(w_1\mid w)=\frac{fre(w_1)-fre(w)}{fre(w_1)} \tag{5}$$
where $fre(w_1)$ and $fre(w)$ are the numbers of times the strings $w_1$ and $w$ occur in the corpus, and $conf(w_1\mid w)$ is the confidence of entry $w_1$ relative to entry $w$.
6. The dictionary-free Chinese address word segmentation method according to claim 5, characterized in that step S4 comprises the following sub-steps:
S41, set a take-large threshold α and a take-small threshold β;
S42, for the string $T_1 = Q_1 Q_2 \ldots Q_n$, where each $Q_i$ is a single character and n is the length of $T_1$: if n = 2, put $T_1$ into the result set last_result; otherwise go to step S43;
S43, define firstword = $Q_1Q_2$ and secondword = $Q_1Q_2Q_3$ and compute the confidence with formula (5). If conf < α, keep secondword; otherwise go to step S44. At the same time, check whether secondword equals $T_1$: if so, the loop ends and last_result is output; otherwise extend by one character for the next comparison, setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and repeat step S43;
S44, if conf > β, keep firstword; otherwise go to step S45. At the same time, check whether secondword equals $T_1$: if so, put firstword into the result set last_result and set $T_1$ to the string that remains after removing firstword; otherwise extend by one character, keeping firstword unchanged and setting secondword = $Q_1Q_2Q_3Q_4$, and go to step S43;
S45, if α < conf < β, compare the word frequencies of the two strings. If fre(firstword) > fre(secondword), put firstword into the result set last_result, set $T_1$ to the string that remains after removing firstword, and go to step S42. If fre(firstword) < fre(secondword), check whether secondword equals $T_1$: if so, put secondword into the result set last_result, end the loop, and output last_result; otherwise extend by one character, setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and go to step S43.
7. The dictionary-free Chinese address word segmentation method according to claim 6, characterized in that, in steps S43 to S45, if a word has been compared 3 times during character extension, the loop is exited and last_result is output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710441735.1A CN107329950B (en) | 2017-06-13 | 2017-06-13 | Chinese address word segmentation method based on no dictionary |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107329950A true CN107329950A (en) | 2017-11-07 |
CN107329950B CN107329950B (en) | 2021-01-05 |
Family
ID=60195558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710441735.1A Expired - Fee Related CN107329950B (en) | 2017-06-13 | 2017-06-13 | Chinese address word segmentation method based on no dictionary |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107329950B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007005884A2 (en) * | 2005-07-01 | 2007-01-11 | Microsoft Corporation | Generating chinese language couplets |
CN104572622A (en) * | 2015-01-05 | 2015-04-29 | 语联网(武汉)信息技术有限公司 | Term filtering method |
CN106528526A (en) * | 2016-10-09 | 2017-03-22 | 武汉工程大学 | A Chinese address semantic tagging method based on the Bayes word segmentation algorithm |
Non-Patent Citations (2)
Title |
---|
LI XIAOLIN ET AL.: "Interpretation of Chinese Address Information Based on Multi-factor Inference", 《2016 15TH INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED COMPUTING》 * |
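Zhao Weifeng et al.: "Automatic recognition of unstructured Chinese natural-language address descriptions", Computer Engineering and Applications (《计算机工程与应用》) * |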
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108563631A (en) * | 2018-03-23 | 2018-09-21 | 江苏速度信息科技股份有限公司 | A kind of automatic identifying method of natural language address descriptor |
CN108647263A (en) * | 2018-04-28 | 2018-10-12 | 淮阴工学院 | A kind of network address method for evaluating confidence crawled based on segmenting web page |
CN108647263B (en) * | 2018-04-28 | 2022-04-12 | 淮阴工学院 | Network address confidence evaluation method based on webpage segmentation crawling |
CN108776653A (en) * | 2018-05-25 | 2018-11-09 | 南京大学 | A kind of text segmenting method of the judgement document based on PageRank and comentropy |
CN109190997A (en) * | 2018-09-18 | 2019-01-11 | 广东电网有限责任公司 | Chinese address hierarchical analysis and standard processing method and system |
CN109190997B (en) * | 2018-09-18 | 2021-03-12 | 广东电网有限责任公司 | Chinese address hierarchical analysis and standard processing method and system |
CN109614396A (en) * | 2018-12-17 | 2019-04-12 | 广东电网有限责任公司 | A kind of method for cleaning of address data structure and standardization |
CN109902290A (en) * | 2019-01-23 | 2019-06-18 | 广州杰赛科技股份有限公司 | A kind of term extraction method, system and equipment based on text information |
CN109902290B (en) * | 2019-01-23 | 2023-06-30 | 广州杰赛科技股份有限公司 | Text information-based term extraction method, system and equipment |
CN110032730A (en) * | 2019-02-18 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A kind of processing method of text data, device and equipment |
CN110032730B (en) * | 2019-02-18 | 2023-09-05 | 创新先进技术有限公司 | Text data processing method, device and equipment |
CN113591004A (en) * | 2021-08-04 | 2021-11-02 | 北京小米移动软件有限公司 | Game tag generation method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107329950B (en) | 2021-01-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210105 |