CN107329950A - Dictionary-free Chinese address word segmentation method - Google Patents

Info

Publication number
CN107329950A
Authority
CN
China
Prior art keywords
character string
word
secondword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710441735.1A
Other languages
Chinese (zh)
Other versions
CN107329950B (en)
Inventor
谢婷婷
李晓林
严柯
张懿
刘志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Technology
Priority to CN201710441735.1A
Publication of CN107329950A
Application granted
Publication of CN107329950B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a dictionary-free Chinese address word segmentation method comprising the following steps: 1) compute, by statistics over a training corpus, the word frequency, mutual information and information entropy of every character string whose length is greater than 1 and less than or equal to 8; 2) preprocess the input address string with regular expressions and perform full segmentation on it; 3) obtain the segmentation scheme with the least arc cost according to the mutual information and information entropy; 4) apply a second calculation, based on a confidence measure, to the set of strings in that scheme to judge whether each string is a true entry, yielding the optimal segmentation scheme.

Description

A dictionary-free Chinese address word segmentation method
Technical field
The present invention relates to the fields of Internet technology and data mining, and in particular to a dictionary-free Chinese address word segmentation method that uses the mutual information, information entropy and confidence of Chinese addresses to segment the address elements in a Chinese address.
Background
With the rapid development of Internet technology, the network has become an important platform for disseminating and exchanging information. A large amount of data and information is produced in cyberspace every day, most of it in the form of natural-language text, and how to mine useful information from it is a current research focus. These texts contain a great deal of spatial information: according to sampling statistics, about 70% of web pages worldwide contain location information. Compared with traditional geographic information or data, however, the geographic information in text is unstructured and can only be analysed and mined after it has been formalized. Formalizing the spatial information in text covers place-name address segmentation, spatial relation extraction and event extraction. Place-name address segmentation is the most basic, lowest-level task in this formalization, and its accuracy directly affects the validity of the subsequent work.
Place-name address segmentation is the application of Chinese word segmentation to place-name addresses: the process of splitting a place-name address string into geographic elements. Chinese word segmentation algorithms fall roughly into three classes: dictionary-based methods, statistics-based methods and understanding-based methods. Because Chinese address names are numerous and irregular, no dictionary is complete enough to contain all address information. A dictionary-free Chinese address segmentation method is therefore proposed here for place-name address strings.
Summary of the invention
In view of the problems of the prior art, the object of the invention is to provide a dictionary-free Chinese address word segmentation method. The word frequency, mutual information and information entropy of an address corpus are computed; the input string is fully segmented to obtain the set of all possible segmentations; the segmentation with the least arc cost is calculated; and a confidence-based second segmentation is then applied to that result to obtain the optimal result.
The technical solution adopted by the invention to solve the above technical problem is as follows:
The invention provides a dictionary-free Chinese address word segmentation method comprising the following steps:
S1, compute the word frequency, mutual information and information entropy of every character string in the address corpus whose length is greater than 1 and less than or equal to 8;
S2, preprocess the input address string with regular expressions, and perform full segmentation on the resulting string to obtain the segmentation set;
S3, using the mutual information and information entropy computed in step S1, calculate the segmentation scheme with the least arc cost;
S4, apply a second calculation, based on a confidence measure, to the set of strings in the scheme obtained in step S3, judging whether each string is a true entry, to obtain the optimal segmentation scheme.
Preferably, step S1 comprises the following sub-steps:
S11, count the frequency of every substring of length greater than 1 and less than or equal to 8 in each address of the address corpus, and store it in the word frequency dictionary Word_dic;
S12, compute the mutual information between character strings using formula (1) and store it in MI_map;
$$MI(x,y)=\log_2\frac{p(xy)}{p(x)\,p(y)} \tag{1}$$
where p(xy) is the probability that characters x and y occur together in the corpus, p(x) is the probability that character x occurs on its own, and p(y) is the probability that character y occurs on its own;
S13, compute the left entropy and right entropy of each string using formulas (2) and (3) and store them in LR_map; the left entropy and right entropy are the information entropy of the string's left boundary and right boundary respectively;
$$E_L(w)=-\sum_{a\in A}P(aw\mid w)\log_2 P(aw\mid w) \tag{2}$$
$$E_R(w)=-\sum_{b\in B}P(wb\mid w)\log_2 P(wb\mid w) \tag{3}$$
where w denotes a string, A is the set of characters adjacent to w on the left, a is a left-adjacent character, B is the set of characters adjacent to w on the right, b is a right-adjacent character, and aw and wb are the strings formed by joining w with a and with b respectively.
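The counting in S11 and the boundary entropies of S13 can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names `build_stats` and `entropy` are hypothetical, and P(aw|w) is estimated here as the share of each neighbour among the observed neighbours of w, which is an assumption about how the probabilities are obtained.

```python
import math
from collections import Counter, defaultdict


def build_stats(corpus, min_len=2, max_len=8):
    """Count substrings of length min_len..max_len and their adjacent characters."""
    word_dic = Counter()            # substring -> frequency (Word_dic in the text)
    left = defaultdict(Counter)     # substring -> counts of left-adjacent characters
    right = defaultdict(Counter)    # substring -> counts of right-adjacent characters
    for addr in corpus:
        for i in range(len(addr)):
            for n in range(min_len, max_len + 1):
                if i + n > len(addr):
                    break
                w = addr[i:i + n]
                word_dic[w] += 1
                if i > 0:
                    left[w][addr[i - 1]] += 1
                if i + n < len(addr):
                    right[w][addr[i + n]] += 1
    return word_dic, left, right


def entropy(neighbours):
    """Entropy (base 2) of a neighbour distribution, as in formulas (2) and (3)."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0  # no observed neighbour: treated as zero entropy (assumption)
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())
```

Calling `entropy(left[w])` gives a left entropy and `entropy(right[w])` a right entropy for a string w; storing both per string would correspond to LR_map.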
Preferably, step S2 is specifically: preprocess the input address string with regular expressions, then perform full segmentation on the processed string W, never inserting a separator inside a run of consecutive digits, to obtain the segmentation set $W=\{w_i\}$, $1\le i\le 2^{l-1}$, where l is the length of the string.
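Full segmentation can be sketched by treating every gap between adjacent units as either cut or uncut. The sketch below is illustrative only; `atomic_units` and `full_segmentation` are hypothetical names. A run of digits is kept as one unit so that no separator falls inside it, which means a string with digit runs yields 2^(k-1) schemes for k units rather than the full 2^(l-1).

```python
import re
from itertools import product


def atomic_units(s):
    """Split s into single characters, keeping each run of digits as one unit."""
    return re.findall(r"\d+|\D", s)


def full_segmentation(s):
    """Return every segmentation of s that never cuts inside a digit run."""
    units = atomic_units(s)
    if not units:
        return []
    schemes = []
    # One boolean per gap between units: True means "cut here".
    for mask in product((False, True), repeat=len(units) - 1):
        segments, current = [], units[0]
        for unit, cut in zip(units[1:], mask):
            if cut:
                segments.append(current)
                current = unit
            else:
                current += unit
        segments.append(current)
        schemes.append(segments)
    return schemes
```

Enumerating all schemes grows exponentially with the string length, which is acceptable for short address strings but would need pruning for long text.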
Preferably, step S3 is specifically: using the word frequencies in the word frequency dictionary, the mutual information between strings and the information entropy of the strings obtained in step S1, compute with formula (4) the probability value of each $w_i$ in the segmentation set $W=\{w_i\}$ obtained in step S2 and save the results; the segmentation scheme with the smallest result is chosen and denoted segment_result;
$$c(M,N)=\frac{MI(m,n)}{E_R(M)\,E_L(N)} \tag{4}$$
where M is the string to the left of a cut point in the address string, N is the string to the right of the cut point, and m and n are the rightmost character of the left string and the leftmost character of the right string respectively.
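Formula (4) for a single cut point can be sketched as below. The text does not spell out how the values of several cut points are combined into the cost of a whole scheme, so this sketch only ranks the possible single cuts of one string. The names `cut_cost` and `cheapest_cut`, the dictionary-based lookups, and the small `eps` added to avoid division by zero are all assumptions.

```python
def cut_cost(left, right, mi_map, right_entropy, left_entropy, eps=1e-9):
    """c(M, N) = MI(m, n) / (E_R(M) * E_L(N)) for one cut between left and right."""
    m, n = left[-1], right[0]  # boundary characters on either side of the cut
    denom = (right_entropy.get(left, 0.0) + eps) * (left_entropy.get(right, 0.0) + eps)
    return mi_map.get((m, n), 0.0) / denom


def cheapest_cut(s, mi_map, right_entropy, left_entropy):
    """Return the (left, right) split of s with the smallest cost; s needs >= 2 characters."""
    candidates = [(s[:i], s[i:]) for i in range(1, len(s))]
    return min(
        candidates,
        key=lambda lr: cut_cost(lr[0], lr[1], mi_map, right_entropy, left_entropy),
    )
```

A cut is cheap when the two boundary characters have low mutual information and the strings on either side have high boundary entropy, which matches the choice of the minimum in step S3.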
Preferably, step S4 is specifically: use the confidence formula (5) to judge in turn whether each string $T_1,T_2,\ldots,T_n$ cut out in segment_result is a true entry, put the true entries into the result set last_result, and output it;
$$conf(w_1\mid w)=\frac{fre(w_1)-fre(w)}{fre(w_1)} \tag{5}$$
where $fre(w_1)$ and $fre(w)$ are the numbers of times strings $w_1$ and $w$ occur in the corpus, and $conf(w_1\mid w)$ is the confidence of entry $w_1$ relative to entry $w$.
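Formula (5) is a one-line computation. The sketch below evaluates it for the two frequencies quoted in the embodiment, 22 for the shorter string and 10 for its one-character extension; the function name `conf` is taken from the formula, while passing raw counts as arguments is an illustrative choice.

```python
def conf(fre_w1, fre_w):
    """Formula (5): confidence of the shorter entry w1 relative to its extension w."""
    return (fre_w1 - fre_w) / fre_w1


# Frequencies quoted in the embodiment: fre(w1) = 22, fre(w) = 10.
value = conf(22, 10)
print(round(value, 3))
```

The value is 12/22, roughly 0.545, which lies between the thresholds 0.3 and 0.8 used in the embodiment and agrees with the conf = 0.54 reported there. A value near 0 means the shorter string almost always appears as part of the longer one, so the longer string is preferred.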
Specifically, step S4 comprises the following sub-steps:
S41, set the threshold α for taking the longer word and the threshold β for taking the shorter word;
S42, for the string $T_1=Q_1Q_2\ldots Q_n$, where each $Q_i$ is a single character and n is the length of $T_1$: if n = 2, put $T_1$ into the result set last_result; otherwise go to step S43;
S43, define firstword = $Q_1Q_2$ and secondword = $Q_1Q_2Q_3$ and compute the confidence with formula (5). If conf < α, keep secondword; otherwise go to step S44. At the same time, check whether secondword equals $T_1$: if so, the loop ends and last_result is output; otherwise perform a word-extension comparison by setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and repeat step S43;
S44, if conf > β, keep firstword; otherwise go to step S45. At the same time, check whether secondword equals $T_1$: if so, put firstword into the result set last_result and set $T_1$ to the string that remains after removing firstword; otherwise perform a word-extension comparison, keeping firstword unchanged and setting secondword = $Q_1Q_2Q_3Q_4$, and go to step S43;
S45, if α < conf < β, compare the word frequencies of the two strings. If fre(firstword) > fre(secondword), put firstword into the result set last_result, set $T_1$ to the string that remains after removing firstword, and go to step S42. If fre(firstword) < fre(secondword), check whether secondword equals $T_1$: if so, put secondword into the result set last_result, end the loop and output last_result; otherwise perform a word-extension comparison by setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and go to step S43.
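One possible reading of sub-steps S41 to S45 is sketched below. It is an interpretation, not the authoritative procedure: the function name `refine`, the handling of conf values exactly equal to a threshold, the treatment of an unseen firstword, and emitting firstword once it has been compared `max_compare` times are all assumptions. The Chinese strings are a back-translation of the example address in the embodiment, and every frequency except 22 and 10 is invented for illustration.

```python
def refine(segment, fre, alpha=0.3, beta=0.8, max_compare=3):
    """Split one coarse segment into entries using the confidence of formula (5)."""
    result, t = [], segment
    while len(t) > 2:
        first_len, second_len, compares = 2, 3, 0
        emitted = None
        while emitted is None:
            first, second = t[:first_len], t[:second_len]
            f1, f2 = fre.get(first, 0), fre.get(second, 0)
            conf = (f1 - f2) / f1 if f1 else 1.0  # unseen firstword: assumption
            at_end = second_len == len(t)
            if conf < alpha or (conf <= beta and f2 > f1):
                # S43, or second branch of S45: the longer string wins.
                if at_end:
                    emitted = second
                else:
                    first_len, second_len, compares = second_len, second_len + 1, 0
            elif conf > beta:
                # S44: keep firstword and try a longer secondword.
                compares += 1
                if at_end or compares >= max_compare:
                    emitted = first
                else:
                    second_len += 1
            else:
                # First branch of S45: firstword is more frequent, so it is an entry.
                emitted = first
        result.append(emitted)
        t = t[len(emitted):]
    if t:
        result.append(t)  # a remainder of one or two characters is kept as is
    return result


# Hypothetical frequencies; only 22 and 10 come from the embodiment.
fre = {"武汉": 1000, "武汉市": 998, "武汉市洪": 180, "武汉市洪山": 180,
       "武汉市洪山区": 180, "洪山": 185, "洪山区": 180,
       "落雁": 22, "落雁路": 22, "落雁路郑": 10, "郑家": 10, "郑家湾": 10}
print(refine("武汉市洪山区", fre))
print(refine("落雁路郑家湾", fre))
```

With these toy counts the two calls split the strings the same way as the embodiment does; segments made of digits plus a suffix are meant to bypass this function, since the text leaves them unprocessed.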
The beneficial effects of the invention are:
The invention is mainly applied to parsing Chinese addresses in geographic location information services. The method segments Chinese addresses and is both feasible and effective.
Detailed description of the embodiments
The invention is further described below with reference to an embodiment.
The invention provides a dictionary-free Chinese address word segmentation method. The Chinese address "武汉市洪山区落雁路郑家湾105号" (No. 105, Zhengjiawan, Luoyan Road, Hongshan District, Wuhan City) is used here to illustrate a specific implementation of the invention.
S1, data preparation:
(1) Count the frequency of every substring of length greater than 1 and less than or equal to 8 in each address of the address corpus, and store it in the word frequency dictionary Word_dic.
(2) Compute the mutual information between character strings and store it in MI_map.
$$MI(x,y)=\log_2\frac{p(xy)}{p(x)\,p(y)} \tag{1}$$
where p(xy) is the probability that characters x and y occur together in the corpus, p(x) is the probability that character x occurs on its own, and p(y) is the probability that character y occurs on its own;
(3) Compute the left entropy and right entropy of each string and store them in LR_map; the left entropy and right entropy are the information entropy of the string's left boundary and right boundary respectively.
$$E_L(w)=-\sum_{a\in A}P(aw\mid w)\log_2 P(aw\mid w) \tag{2}$$
$$E_R(w)=-\sum_{b\in B}P(wb\mid w)\log_2 P(wb\mid w) \tag{3}$$
where w denotes a string, A is the set of characters adjacent to w on the left, a is a left-adjacent character, B is the set of characters adjacent to w on the right, b is a right-adjacent character, and aw and wb are the strings formed by joining w with a and with b respectively.
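The mutual information of formula (1) for adjacent character pairs could be computed as in the sketch below. It is illustrative only: the function name `mutual_information` is hypothetical, and estimating p(x), p(y) and p(xy) as relative frequencies of characters and of adjacent pairs is an assumption, since the text does not say how the probabilities are estimated.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Return a map (x, y) -> log2(p(xy) / (p(x) * p(y))) for adjacent characters."""
    char_count = Counter()
    pair_count = Counter()
    for addr in corpus:
        char_count.update(addr)
        pair_count.update(zip(addr, addr[1:]))  # adjacent character pairs
    n_chars = sum(char_count.values())
    n_pairs = sum(pair_count.values())
    mi_map = {}
    for (x, y), c in pair_count.items():
        p_xy = c / n_pairs
        p_x = char_count[x] / n_chars
        p_y = char_count[y] / n_chars
        mi_map[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi_map
```

A high value means the two characters tend to occur together and probably belong to the same entry; a low value marks a plausible boundary. The returned dictionary plays the role of MI_map.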
S2, preprocess the input address string with regular expressions, then perform full segmentation on the processed string W, never inserting a separator inside a run of consecutive digits, to obtain the segmentation set $W=\{w_i\}$, $1\le i\le 2^{l-1}$, where l is the length of the string.
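The text does not say what the regular-expression preprocessing removes. A plausible sketch, under the assumption that it normalizes full-width characters and strips whitespace and common punctuation, is shown below; the function name `preprocess` and the exact character class are hypothetical.

```python
import re
import unicodedata


def preprocess(address):
    """Normalize an address string before segmentation (assumed behaviour)."""
    # NFKC folds full-width digits, letters and punctuation to their ASCII forms.
    s = unicodedata.normalize("NFKC", address)
    # Drop whitespace and a few common punctuation marks.
    return re.sub(r"[\s,.;:()、。]+", "", s)


print(preprocess(" 武汉市　洪山区，１０５号 "))
```

After this step the digits are plain ASCII, so a pattern such as `\d+` can keep each run of consecutive digits together during full segmentation.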
S3, using the word frequencies in the word frequency dictionary, the mutual information between strings and the information entropy of the strings obtained in step S1, compute with formula (4) the probability value of each $w_i$ in the segmentation set $W=\{w_i\}$ obtained in step S2 and save the results; the segmentation scheme with the smallest result is chosen and denoted segment_result;
$$c(M,N)=\frac{MI(m,n)}{E_R(M)\,E_L(N)} \tag{4}$$
where M is the string to the left of a cut point in the address string, N is the string to the right of the cut point, and m and n are the rightmost character of the left string and the leftmost character of the right string respectively.
Pro("武汉市洪山区 | 落雁路郑家湾 | 105号") := 1.1029722727130447E5;
"武汉市洪山区 | 落雁路郑家湾 | 105号" (Wuhan City Hongshan District | Luoyan Road Zhengjiawan | No. 105) is therefore recorded as segment_result.
S4, use the confidence formula (5) to judge in turn whether each string $T_1,T_2,\ldots,T_n$ cut out in segment_result is a true entry; combinations of digits and "号" (No.) are not processed.
$$conf(w_1\mid w)=\frac{fre(w_1)-fre(w)}{fre(w_1)} \tag{5}$$
where $fre(w_1)$ and $fre(w)$ are the numbers of times strings $w_1$ and $w$ occur in the corpus, and $conf(w_1\mid w)$ is the confidence of entry $w_1$ relative to entry $w$.
$T_1$ = "武汉市洪山区" (Wuhan City Hongshan District), $T_2$ = "落雁路郑家湾" (Luoyan Road Zhengjiawan), $T_3$ = "105号" (No. 105).
String $T_1$ = "武汉市洪山区" (Wuhan City Hongshan District), n = 6.
a) firstword = "武汉" (Wuhan), secondword = "武汉市" (Wuhan City); formula (5) gives conf = 0.0019 < 0.3, so "武汉市" is kept. The word-extension comparison continues with firstword = "武汉市", secondword = "武汉市洪"; the confidence of firstword relative to secondword from formula (5) is conf = 0.818 > 0.8, so "武汉市" is kept. It continues with firstword = "武汉市", secondword = "武汉市洪山"; formula (5) gives conf = 0.818 > 0.8, so "武汉市" is kept. It continues with firstword = "武汉市", secondword = "武汉市洪山区"; formula (5) gives conf = 0.818 > 0.8, so "武汉市" is kept. The word has now been extended 3 times and secondword equals $T_1$, so "武汉市" is put into the result set last_result.
b) $T_1$ = "洪山区" (Hongshan District), n = 3, firstword = "洪山", secondword = "洪山区"; formula (5) gives conf = 0.028 < 0.3, so "洪山区" is kept; since secondword = $T_1$, "洪山区" is put into the result set last_result.
String $T_2$ = "落雁路郑家湾" (Luoyan Road Zhengjiawan), n = 6.
a) firstword = "落雁", secondword = "落雁路" (Luoyan Road); formula (5) gives conf = 0.0 < 0.3, so "落雁路" is kept. The word-extension comparison continues with firstword = "落雁路", secondword = "落雁路郑"; formula (5) gives conf = 0.54, which is greater than 0.3 and less than 0.8, so the word frequencies of the two strings are compared: fre("落雁路") = 22 > fre("落雁路郑") = 10, so "落雁路" is kept and put into the result set last_result.
b) $T_1$ = "郑家湾" (Zhengjiawan), n = 3, firstword = "郑家", secondword = "郑家湾"; formula (5) gives conf = 0.0 < 0.3, so "郑家湾" is kept; since secondword = $T_1$, "郑家湾" is put into the result set last_result.
String $T_3$ = "105号" (No. 105) is a combination of digits and "号", so it is not processed and is put directly into the result set last_result.
S5, output last_result = "武汉市 | 洪山区 | 落雁路 | 郑家湾 | 105号" (Wuhan City | Hongshan District | Luoyan Road | Zhengjiawan | No. 105).
Parts not described in the specification are prior art or common knowledge. This embodiment only illustrates the invention and does not limit its scope; equivalent replacements and modifications made by those skilled in the art are considered to fall within the scope protected by the claims of the invention.

Claims (7)

1. A dictionary-free Chinese address word segmentation method, characterized by comprising the following steps:
S1, compute the word frequency, mutual information and information entropy of every character string in the address corpus whose length is greater than 1 and less than or equal to 8;
S2, preprocess the input address string with regular expressions, and perform full segmentation on the resulting string to obtain the segmentation set;
S3, using the mutual information and information entropy computed in step S1, calculate the segmentation scheme with the least arc cost;
S4, apply a second calculation, based on a confidence measure, to the set of strings in the scheme obtained in step S3, judging whether each string is a true entry, to obtain the optimal segmentation scheme.
2. The dictionary-free Chinese address word segmentation method according to claim 1, characterized in that step S1 comprises the following sub-steps:
S11, count the frequency of every substring of length greater than 1 and less than or equal to 8 in each address of the address corpus, and store it in the word frequency dictionary Word_dic;
S12, compute the mutual information between character strings using formula (1) and store it in MI_map;
$$MI(x,y)=\log_2\frac{p(xy)}{p(x)\,p(y)} \tag{1}$$
where p(xy) is the probability that characters x and y occur together in the corpus, p(x) is the probability that character x occurs on its own, and p(y) is the probability that character y occurs on its own;
S13, compute the left entropy and right entropy of each string using formulas (2) and (3) and store them in LR_map; the left entropy and right entropy are the information entropy of the string's left boundary and right boundary respectively;
$$E_L(w)=-\sum_{a\in A}P(aw\mid w)\log_2 P(aw\mid w) \tag{2}$$
$$E_R(w)=-\sum_{b\in B}P(wb\mid w)\log_2 P(wb\mid w) \tag{3}$$
where w denotes a string, A is the set of characters adjacent to w on the left, a is a left-adjacent character, B is the set of characters adjacent to w on the right, b is a right-adjacent character, and aw and wb are the strings formed by joining w with a and with b respectively.
3. The dictionary-free Chinese address word segmentation method according to claim 2, characterized in that step S2 is specifically: preprocess the input address string with regular expressions, then perform full segmentation on the processed string W, never inserting a separator inside a run of consecutive digits, to obtain the segmentation set $W=\{w_i\}$, $1\le i\le 2^{l-1}$, where l is the length of the string.
4. The dictionary-free Chinese address word segmentation method according to claim 3, characterized in that step S3 is specifically: using the word frequencies in the word frequency dictionary, the mutual information between strings and the information entropy of the strings obtained in step S1, compute with formula (4) the probability value of each $w_i$ in the segmentation set $W=\{w_i\}$ obtained in step S2 and save the results; the segmentation scheme with the smallest result is chosen and denoted segment_result;
$$c(M,N)=\frac{MI(m,n)}{E_R(M)\,E_L(N)} \tag{4}$$
where M is the string to the left of a cut point in the address string, N is the string to the right of the cut point, and m and n are the rightmost character of the left string and the leftmost character of the right string respectively.
5. The dictionary-free Chinese address word segmentation method according to claim 4, characterized in that step S4 is specifically: use the confidence formula (5) to judge in turn whether each string $T_1,T_2,\ldots,T_n$ cut out in segment_result is a true entry, put the true entries into the result set last_result, and output it;
$$conf(w_1\mid w)=\frac{fre(w_1)-fre(w)}{fre(w_1)} \tag{5}$$
where $fre(w_1)$ and $fre(w)$ are the numbers of times strings $w_1$ and $w$ occur in the corpus, and $conf(w_1\mid w)$ is the confidence of entry $w_1$ relative to entry $w$.
6. The dictionary-free Chinese address word segmentation method according to claim 5, characterized in that step S4 comprises the following sub-steps:
S41, set the threshold α for taking the longer word and the threshold β for taking the shorter word;
S42, for the string $T_1=Q_1Q_2\ldots Q_n$, where each $Q_i$ is a single character and n is the length of $T_1$: if n = 2, put $T_1$ into the result set last_result; otherwise go to step S43;
S43, define firstword = $Q_1Q_2$ and secondword = $Q_1Q_2Q_3$ and compute the confidence with formula (5). If conf < α, keep secondword; otherwise go to step S44. At the same time, check whether secondword equals $T_1$: if so, the loop ends and last_result is output; otherwise perform a word-extension comparison by setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and repeat step S43;
S44, if conf > β, keep firstword; otherwise go to step S45. At the same time, check whether secondword equals $T_1$: if so, put firstword into the result set last_result and set $T_1$ to the string that remains after removing firstword; otherwise perform a word-extension comparison, keeping firstword unchanged and setting secondword = $Q_1Q_2Q_3Q_4$, and go to step S43;
S45, if α < conf < β, compare the word frequencies of the two strings. If fre(firstword) > fre(secondword), put firstword into the result set last_result, set $T_1$ to the string that remains after removing firstword, and go to step S42. If fre(firstword) < fre(secondword), check whether secondword equals $T_1$: if so, put secondword into the result set last_result, end the loop and output last_result; otherwise perform a word-extension comparison by setting firstword = secondword and secondword = $Q_1Q_2Q_3Q_4$, and go to step S43.
7. The dictionary-free Chinese address word segmentation method according to claim 6, characterized in that: in steps S43 to S45, if a word has been compared 3 times in the word-extension comparison, the loop is exited and last_result is output.
CN201710441735.1A 2017-06-13 2017-06-13 Chinese address word segmentation method based on no dictionary Expired - Fee Related CN107329950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710441735.1A CN107329950B (en) 2017-06-13 2017-06-13 Chinese address word segmentation method based on no dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710441735.1A CN107329950B (en) 2017-06-13 2017-06-13 Chinese address word segmentation method based on no dictionary

Publications (2)

Publication Number Publication Date
CN107329950A true CN107329950A (en) 2017-11-07
CN107329950B CN107329950B (en) 2021-01-05

Family

ID=60195558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710441735.1A Expired - Fee Related CN107329950B (en) 2017-06-13 2017-06-13 Chinese address word segmentation method based on no dictionary

Country Status (1)

Country Link
CN (1) CN107329950B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563631A (en) * 2018-03-23 2018-09-21 江苏速度信息科技股份有限公司 A kind of automatic identifying method of natural language address descriptor
CN108647263A (en) * 2018-04-28 2018-10-12 淮阴工学院 A kind of network address method for evaluating confidence crawled based on segmenting web page
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN109190997A (en) * 2018-09-18 2019-01-11 广东电网有限责任公司 Chinese address hierarchical analysis and standard processing method and system
CN109614396A (en) * 2018-12-17 2019-04-12 广东电网有限责任公司 A kind of method for cleaning of address data structure and standardization
CN109902290A (en) * 2019-01-23 2019-06-18 广州杰赛科技股份有限公司 A kind of term extraction method, system and equipment based on text information
CN110032730A (en) * 2019-02-18 2019-07-19 阿里巴巴集团控股有限公司 A kind of processing method of text data, device and equipment
CN113591004A (en) * 2021-08-04 2021-11-02 北京小米移动软件有限公司 Game tag generation method and device, storage medium and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007005884A2 (en) * 2005-07-01 2007-01-11 Microsoft Corporation Generating chinese language couplets
CN104572622A (en) * 2015-01-05 2015-04-29 语联网(武汉)信息技术有限公司 Term filtering method
CN106528526A (en) * 2016-10-09 2017-03-22 武汉工程大学 A Chinese address semantic tagging method based on the Bayes word segmentation algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI XIAOLIN ET AL.: "Interpretation of Chinese Address Information Based on Multi-factor Inference", 《2016 15TH INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED COMPUTING》 *
赵卫锋 等: "非结构化中文自然语言地址描述的自动识别", 《计算机工程与应用》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563631A (en) * 2018-03-23 2018-09-21 江苏速度信息科技股份有限公司 A kind of automatic identifying method of natural language address descriptor
CN108647263A (en) * 2018-04-28 2018-10-12 淮阴工学院 A kind of network address method for evaluating confidence crawled based on segmenting web page
CN108647263B (en) * 2018-04-28 2022-04-12 淮阴工学院 Network address confidence evaluation method based on webpage segmentation crawling
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN109190997A (en) * 2018-09-18 2019-01-11 广东电网有限责任公司 Chinese address hierarchical analysis and standard processing method and system
CN109190997B (en) * 2018-09-18 2021-03-12 广东电网有限责任公司 Chinese address hierarchical analysis and standard processing method and system
CN109614396A (en) * 2018-12-17 2019-04-12 广东电网有限责任公司 A kind of method for cleaning of address data structure and standardization
CN109902290A (en) * 2019-01-23 2019-06-18 广州杰赛科技股份有限公司 A kind of term extraction method, system and equipment based on text information
CN109902290B (en) * 2019-01-23 2023-06-30 广州杰赛科技股份有限公司 Text information-based term extraction method, system and equipment
CN110032730A (en) * 2019-02-18 2019-07-19 阿里巴巴集团控股有限公司 A kind of processing method of text data, device and equipment
CN110032730B (en) * 2019-02-18 2023-09-05 创新先进技术有限公司 Text data processing method, device and equipment
CN113591004A (en) * 2021-08-04 2021-11-02 北京小米移动软件有限公司 Game tag generation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN107329950B (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN107329950A (en) It is a kind of based on the Chinese address segmenting method without dictionary
CN107797991B (en) Dependency syntax tree-based knowledge graph expansion method and system
CN104572622B (en) A kind of screening technique of term
CN107885999A (en) A kind of leak detection method and system based on deep learning
US20150207704A1 (en) Public opinion information display system and method
CN107992481A (en) A kind of matching regular expressions method, apparatus and system based on multiway tree
US20120278339A1 (en) Query parsing for map search
CN105630941A (en) Statistics and webpage structure based Wen body text content extraction method
CN102289467A (en) Method and device for determining target site
CN105573979B (en) A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character
CN107046586A (en) A kind of algorithm generation domain name detection method based on natural language feature
CN107220300A (en) Information mining method, electronic installation and readable storage medium storing program for executing
CN106803035A (en) A kind of password conjecture set creation method and password cracking method based on username information
US20190147038A1 (en) Preserving and processing ambiguity in natural language
US11323403B2 (en) System and method for detecting geo-locations in social media
CN103955450A (en) Automatic extraction method of new words
CN104346382B (en) Use the text analysis system and method for language inquiry
CN107291730B (en) Method and device for providing correction suggestion for query word and probability dictionary construction method
CN111400998B (en) Text display method and device, electronic equipment and readable storage medium
CN106446139A (en) Webpage content extracting method and device
CN104516859B (en) A kind of word modification method and system
CN103455572B (en) Obtain the method and device of video display main body in webpage
CN112069305B (en) Data screening method and device and electronic equipment
CN106940711A (en) A kind of URL detection methods and detection means
CN104572787A (en) Method and device for recognizing pseudo original website

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210105