CN106610937A - Information theory-based Chinese automatic word segmentation method - Google Patents
Information theory-based Chinese automatic word segmentation method
- Publication number
- CN106610937A (application number CN201610831711.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- word segmentation
- path
- dictionary
- paths
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
An information theory-based Chinese automatic word segmentation method comprises the steps of: matching the sentence to be segmented against the words in a corpus that has already been initialized; splitting the sentence to be segmented into a mesh structure according to a probability-statistical method; computing a weight for each edge of the mesh structure by an information-theoretic method, the path with the maximum weight being the segmentation result of the sentence; and judging the segmentation quality by precision and recall. Chinese pre-processing is faster and more precise than with a method based on a segmentation dictionary, and more accurate than with a statistics-based method; the method is more practical, agrees better with empirical values, and provides great application value for subsequent natural language processing technology.
Description
Technical field
The present invention relates to the field of Chinese semantic web technology, and in particular to an information theory-based Chinese automatic word segmentation algorithm.
Background art
At present, understanding-based Chinese automatic word segmentation is still at the experimental stage, and methods based on a segmentation dictionary and methods based on probability statistics are the mainstream of Chinese automatic word segmentation technology. Dictionary-based methods are simple to port and do not need to consider adaptivity problems when moving between domains; however, they are relatively weak at handling the ambiguities that arise during automatic segmentation and problems such as named entity recognition. Statistics-based methods rely on powerful mathematical statistical models and greatly improve segmentation performance, but they perform poorly across domains and depend heavily on the corpus: a different corpus has to be prepared for each domain in order to train that domain's statistical segmentation model. As a result, after a change of domain, a segmentation corpus for the new domain must be provided. Building and maintaining the annotated corpus needed for segmentation training, however, requires a great deal of manpower and material resources. By contrast, the dictionary-based method has some advantage in domain adaptation: when the target domain changes, it only needs a dictionary for the corresponding domain to be added, and a domain dictionary is much easier to obtain than a corpus. Combining the segmentation dictionary with probability-statistical methods has therefore become the mainstream of current word segmentation. In order to realize Chinese automatic word segmentation and improve the accuracy of the segmentation result, the present invention proposes an information theory-based Chinese automatic word segmentation algorithm.
Summary of the invention
To realize Chinese automatic word segmentation and to address the low accuracy of segmentation results, the present invention provides an information theory-based Chinese automatic word segmentation algorithm.
In order to solve the above problems, the present invention is achieved by the following technical solution:
Step 1: Initialize the training model, which can be a 《Segmentation Dictionary》, a corpus of the relevant domain, or a model combining both.
Step 2: According to the 《Segmentation Dictionary》, find the words in the sentence to be segmented that match words in the dictionary.
Step 3: According to probability statistics, split the sentence to be segmented into a mesh structure, which yields the n possible combined sentence structures; the nodes of each structure are defined in order as $S, M_1, M_2, M_3, M_4, M_5, E$.
Step 4: Based on information-theoretic methods, assign a weight to each edge of the above mesh structure.
Step 5: Find the path with the maximum weight, which is the segmentation result of the sentence to be segmented.
Step 6: Verify the accuracy rate and recall rate of this segmentation result.
The present invention has the following advantages:
1. Chinese pre-processing is faster than with a method based on a segmentation dictionary.
2. The method is more precise than a method based on a segmentation dictionary.
3. The method is more accurate than a statistics-based method.
4. The method is more practical and agrees better with empirical values.
5. The method provides great application value for subsequent natural language processing technology.
Brief description of the drawings
Fig. 1 is a structural flow chart of the information theory-based Chinese automatic word segmentation algorithm.
Fig. 2 is a schematic illustration of the n-gram segmentation method.
Detailed description of the embodiments
In order to improve the accuracy of Chinese automatic word segmentation, the present invention is described in detail below with reference to Fig. 1 and Fig. 2. The specific implementation steps are as follows:
Step 1: Initialize the training model, which can be a 《Segmentation Dictionary》, a corpus of the relevant domain, or a model combining both.
Step 2: According to the 《Segmentation Dictionary》, find the words in the sentence to be segmented that match words in the dictionary, as follows:
The Chinese character string to be segmented is scanned once from start to finish and looked up in the system dictionary. Whenever a word that exists in the dictionary is encountered, it is identified as a word; if there is no match in the dictionary, the single character is simply split off as a word. This continues until the Chinese character string is empty.
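The scan in Step 2 can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the text does not say which match wins when several dictionary words start at the same position, so the sketch assumes longest-match-first, and the function name `match_dictionary`, the `max_len` parameter, and the toy dictionary are all hypothetical.

```python
def match_dictionary(sentence, dictionary, max_len=4):
    """Scan left to right and cut off one word per step.

    Assumption: the longest dictionary word starting at the current
    position is preferred; a single character is emitted when no
    dictionary word matches.
    """
    words = []
    i = 0
    while i < len(sentence):  # stop once the remaining string is empty
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(match_dictionary("研究生命起源", toy_dictionary))
```

With this toy dictionary the greedy scan cuts "研究生" first and leaves "命" as a single character, which is the kind of ambiguity the later path-weighting steps are meant to resolve.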
Step 3: According to probability statistics, split the sentence to be segmented into a mesh structure, which yields the n possible combined sentence structures; the nodes of each structure are defined in order as $S, M_1, M_2, M_3, M_4, M_5, E$. The structure is shown in Fig. 2.
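One way to picture the mesh structure of Step 3 is to enumerate every start-to-end path through it. The sketch below is an assumption-laden illustration: it treats each dictionary word or single character as an edge, omits the virtual start node S and end node E, and uses the hypothetical names `build_paths` and `max_len`.

```python
def build_paths(sentence, dictionary, max_len=4):
    """Return every candidate segmentation (path) of `sentence`.

    Each piece of a path is either a dictionary word or a single
    character, so at least one path always exists.
    """
    if not sentence:
        return [[]]
    paths = []
    for size in range(1, min(max_len, len(sentence)) + 1):
        head = sentence[:size]
        if size == 1 or head in dictionary:
            # Extend every segmentation of the remainder with this head.
            for tail in build_paths(sentence[size:], dictionary, max_len):
                paths.append([head] + tail)
    return paths


toy_dictionary = {"研究", "研究生", "生命"}
for path in build_paths("研究生命", toy_dictionary):
    print(path)
```

The recursion is exponential in the worst case; it is only meant to make the n candidate paths concrete for short sentences.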
Step 4: Based on information-theoretic methods, assign a weight to each edge of the above mesh structure. The calculation proceeds as follows:
According to the dictionary words matched in the 《Segmentation Dictionary》 and the unmatched single characters, the $i$-th path contains $n_i$ words; that is, the word counts of the $n$ paths form the set $(n_1, n_2, \ldots, n_n)$.
Obtain $\min(n_1, n_2, \ldots, n_n)$.
For the remaining $(n-m)$ paths, solve for the weight between each pair of adjacent words on a path.
In the statistical corpus, compute the information content $X(C_i)$ of each word, then solve for the co-occurrence information content $X(C_i, C_{i+1})$ of adjacent words on a path. The formulas are as follows:
$$X(C_i) = \left| x(C_i)_1 - x(C_i)_2 \right|$$
where $x(C_i)_1$ is the information content of word $C_i$ in the text corpus and $x(C_i)_2$ is the information content of the texts containing word $C_i$.
$$x(C_i)_1 = -p(C_i)_1 \ln p(C_i)_1$$
where $p(C_i)_1$ is the probability of $C_i$ in the text corpus, and $n$ is the number of texts in the corpus that contain word $C_i$.
$$x(C_i)_2 = -p(C_i)_2 \ln p(C_i)_2$$
where $p(C_i)_2$ is the probability of the texts containing word $C_i$, and $N$ is the total number of texts in the statistical corpus.
Similarly,
$$X(C_i, C_{i+1}) = \left| x(C_i, C_{i+1})_1 - x(C_i, C_{i+1})_2 \right|$$
where $x(C_i, C_{i+1})_1$ is the co-occurrence information content of the words $(C_i, C_{i+1})$ in the text corpus and $x(C_i, C_{i+1})_2$ is the information content of the texts in which the adjacent words $(C_i, C_{i+1})$ co-occur.
Similarly,
$$x(C_i, C_{i+1})_1 = -p(C_i, C_{i+1})_1 \ln p(C_i, C_{i+1})_1$$
where $p(C_i, C_{i+1})_1$ is the co-occurrence probability of the words $(C_i, C_{i+1})$ in the text corpus, and $m$ is the number of texts in the corpus in which the words $(C_i, C_{i+1})$ co-occur.
$$x(C_i, C_{i+1})_2 = -p(C_i, C_{i+1})_2 \ln p(C_i, C_{i+1})_2$$
where $p(C_i, C_{i+1})_2$ is the probability of the texts in the corpus in which the adjacent words $(C_i, C_{i+1})$ co-occur.
In summary, the weight between each pair of adjacent words is
$$w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2X(C_i, C_{i+1})$$
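The edge-weight formula of Step 4 can be sketched in Python as below. The text does not spell out how the four probabilities are estimated from the corpus, so this sketch simply takes them as inputs; the function names `entropy_term`, `information`, and `edge_weight`, and the convention that a zero probability contributes zero, are assumptions made for illustration.

```python
import math


def entropy_term(p):
    """Return -p * ln(p); a probability of 0 is assumed to contribute 0."""
    return -p * math.log(p) if p > 0 else 0.0


def information(p1, p2):
    """X = |x_1 - x_2|, where x_k = -p_k ln p_k."""
    return abs(entropy_term(p1) - entropy_term(p2))


def edge_weight(p_a, p_b, p_ab):
    """w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2 X(C_i, C_{i+1}).

    Each argument is a (p_1, p_2) pair: the word-level probability and
    the text-level probability for C_i, C_{i+1}, and the pair.
    """
    return information(*p_a) + information(*p_b) - 2 * information(*p_ab)


# Made-up probabilities, purely to exercise the formula.
print(edge_weight((0.1, 0.2), (0.1, 0.2), (0.05, 0.05)))
```

Passing the probabilities in explicitly keeps the sketch independent of any particular corpus format or counting scheme.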
Step 5: Find the path with the maximum weight, which is the segmentation result of the sentence to be segmented. The calculation proceeds as follows:
There are $n$ paths, each of a different length; let the set of path lengths be $(L_1, L_2, \ldots, L_n)$.
Suppose that the operation of taking the minimum number of words in a path eliminates $m$ paths, $m < n$. Then $(n-m)$ paths remain; let the set of their path lengths be:
The weight of each path is then:
In the above formula, the terms are respectively the weights of the 1st, 2nd, up to the last edge of the path, each of which can be computed according to Step 4; the index of the last edge is the length of the $S_j$-th path among the remaining $(n-m)$ paths.
The path with the maximum weight:
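The selection in Step 5 might look like the Python sketch below. Two points are assumptions, because the corresponding formulas are not preserved in this text: the weight of a path is taken to be the sum of the weights between its adjacent words, and the minimum-word-count operation is read as keeping only the paths with the fewest words. The names `best_path` and `toy_weight` and the toy weights are hypothetical.

```python
def best_path(paths, edge_weight):
    """Pick the maximum-weight path among those with the fewest words.

    Assumptions: path weight = sum of adjacent-word edge weights, and
    paths with more than the minimum number of words are eliminated.
    """
    fewest = min(len(path) for path in paths)
    kept = [path for path in paths if len(path) == fewest]

    def path_weight(path):
        return sum(edge_weight(a, b) for a, b in zip(path, path[1:]))

    return max(kept, key=path_weight)


# Toy edge weights standing in for the Step 4 computation.
toy_weights = {("研究", "生命"): 0.8, ("研究生", "命"): 0.1}


def toy_weight(a, b):
    return toy_weights.get((a, b), 0.0)


candidates = [["研", "究", "生", "命"], ["研究", "生命"], ["研究生", "命"]]
print(best_path(candidates, toy_weight))
```

For long sentences a dynamic-programming search over the mesh would avoid enumerating every path, but that goes beyond what the text describes.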
Step 6: Verify the accuracy rate and recall rate of this segmentation result.
Accuracy rate:
In the above formula, $n_{\text{know}}$ is the number of dictionary words from the 《Segmentation Dictionary》 identified in the sentence to be segmented, and $n_z$ is the number of words correctly segmented by the method.
Recall rate:
In the above formula, $n_{\text{total}}$ is the total number of words in the sentence to be segmented.
Finally, the two factors are considered together to judge the correctness of the system's segmentation result, that is,
$$d = \left| \mathrm{zhaorate} - \mathrm{rate} \right| \le \varepsilon$$
where zhaorate is the recall rate and rate is the accuracy rate. $\varepsilon$ is a very small threshold given by experts. When $d$ satisfies the above condition, the segmentation result is considered satisfactory.
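The check in Step 6 can be sketched as follows. The accuracy and recall formulas themselves are not preserved in this text, so the sketch assumes the conventional ratio forms: accuracy as the number of correctly segmented words over the number of identified dictionary words, and recall as the number of correctly segmented words over the total number of words. The function name `evaluate`, its parameter names, and the sample counts are hypothetical.

```python
def evaluate(n_z, n_know, n_total, epsilon):
    """Return (accuracy, recall, satisfactory).

    Assumed forms (not confirmed by the text):
      accuracy = n_z / n_know
      recall   = n_z / n_total
    The result is judged satisfactory when |recall - accuracy| <= epsilon.
    """
    accuracy = n_z / n_know
    recall = n_z / n_total
    d = abs(recall - accuracy)
    return accuracy, recall, d <= epsilon


# Made-up counts and an arbitrary threshold, for illustration only.
print(evaluate(n_z=9, n_know=10, n_total=10, epsilon=0.05))
print(evaluate(n_z=8, n_know=10, n_total=16, epsilon=0.05))
```

Note that this criterion only measures how close the two rates are to each other; two equally low rates would also pass, so in practice the rates themselves would presumably be inspected as well.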
Claims (3)
1. An information theory-based Chinese automatic word segmentation algorithm, the invention relating to the field of Chinese semantic web technology and in particular to an information theory-based Chinese automatic word segmentation algorithm, characterized in that it comprises the following steps:
Step 1: Initialize the training model, which can be a 《Segmentation Dictionary》, a corpus of the relevant domain, or a model combining both.
Step 2: According to the 《Segmentation Dictionary》, find the words in the sentence to be segmented that match words in the dictionary.
Step 3: According to probability statistics, split the sentence to be segmented into a mesh structure, which yields the n possible combined sentence structures; the nodes of each structure are defined in order as $S, M_1, M_2, M_3, M_4, M_5, E$.
Step 4: Based on information-theoretic methods, assign a weight to each edge of the above mesh structure.
Step 5: Find the path with the maximum weight, which is the segmentation result of the sentence to be segmented.
Step 6: Verify the accuracy rate and recall rate of this segmentation result.
Accuracy rate:
In the above formula, $n_{\text{know}}$ is the number of dictionary words from the 《Segmentation Dictionary》 identified in the sentence to be segmented, and $n_z$ is the number of words correctly segmented by the method.
Recall rate:
In the above formula, $n_{\text{total}}$ is the total number of words in the sentence to be segmented.
Finally, the two factors are considered together to judge the correctness of the system's segmentation result, that is,
$$d = \left| \mathrm{zhaorate} - \mathrm{rate} \right| \le \varepsilon$$
where $\varepsilon$ is a very small threshold given by experts; when $d$ satisfies the above condition, the segmentation result is considered satisfactory.
2. The information theory-based Chinese automatic word segmentation algorithm according to claim 1, characterized in that the calculation in Step 4 above proceeds as follows:
Step 4: Based on information-theoretic methods, assign a weight to each edge of the above mesh structure. The calculation proceeds as follows:
According to the dictionary words matched in the 《Segmentation Dictionary》 and the unmatched single characters, the $i$-th path contains $n_i$ words; that is, the word counts of the $n$ paths form the set $(n_1, n_2, \ldots, n_n)$.
For the remaining $(n-m)$ paths, solve for the weight between each pair of adjacent words on a path.
In the statistical corpus, compute the information content of each word, then solve for the co-occurrence information content of adjacent words on a path. The formulas are as follows:
$$X(C_i) = \left| x(C_i)_1 - x(C_i)_2 \right|$$
where $x(C_i)_1$ is the information content of word $C_i$ in the text corpus and $x(C_i)_2$ is the information content of the texts containing word $C_i$.
$$x(C_i)_1 = -p(C_i)_1 \ln p(C_i)_1$$
where $p(C_i)_1$ is the probability of $C_i$ in the text corpus, and $n$ is the number of texts in the corpus that contain word $C_i$.
$$x(C_i)_2 = -p(C_i)_2 \ln p(C_i)_2$$
where $p(C_i)_2$ is the probability of the texts containing word $C_i$, and $N$ is the total number of texts in the statistical corpus.
Similarly,
$$X(C_i, C_{i+1}) = \left| x(C_i, C_{i+1})_1 - x(C_i, C_{i+1})_2 \right|$$
where $x(C_i, C_{i+1})_1$ is the co-occurrence information content of the words $(C_i, C_{i+1})$ in the text corpus and $x(C_i, C_{i+1})_2$ is the information content of the texts in which the adjacent words $(C_i, C_{i+1})$ co-occur.
Similarly,
$$x(C_i, C_{i+1})_1 = -p(C_i, C_{i+1})_1 \ln p(C_i, C_{i+1})_1$$
where $p(C_i, C_{i+1})_1$ is the co-occurrence probability of the words $(C_i, C_{i+1})$ in the text corpus, and $m$ is the number of texts in the corpus in which the words $(C_i, C_{i+1})$ co-occur.
$$x(C_i, C_{i+1})_2 = -p(C_i, C_{i+1})_2 \ln p(C_i, C_{i+1})_2$$
where $p(C_i, C_{i+1})_2$ is the probability of the texts in the corpus in which the adjacent words $(C_i, C_{i+1})$ co-occur.
In summary, the weight between each pair of adjacent words is
$$w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2X(C_i, C_{i+1})$$
3. The information theory-based Chinese automatic word segmentation algorithm according to claim 1, characterized in that the calculation in Step 5 above proceeds as follows:
There are $n$ paths, each of a different length; let the set of path lengths be $(L_1, L_2, \ldots, L_n)$.
Suppose that the operation of taking the minimum number of words in a path eliminates $m$ paths, $m < n$; that is, $(n-m)$ paths remain; let the set of their path lengths be:
The weight of each path is then:
In the above formula, the terms are respectively the weights of the 1st, 2nd, up to the last edge of the path, each of which can be computed according to Step 4; the index of the last edge is the length of the path in question among the remaining $(n-m)$ paths.
The path with the maximum weight:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610831711.2A CN106610937A (en) | 2016-09-19 | 2016-09-19 | Information theory-based Chinese automatic word segmentation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610831711.2A CN106610937A (en) | 2016-09-19 | 2016-09-19 | Information theory-based Chinese automatic word segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106610937A true CN106610937A (en) | 2017-05-03 |
Family
ID=58614954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610831711.2A Pending CN106610937A (en) | 2016-09-19 | 2016-09-19 | Information theory-based Chinese automatic word segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106610937A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291837A (en) * | 2017-05-31 | 2017-10-24 | 北京大学 | A kind of segmenting method of the network text based on field adaptability |
CN108874956A (en) * | 2018-06-05 | 2018-11-23 | 中国平安人寿保险股份有限公司 | Mass file search method, device, computer equipment and storage medium |
CN109033085A (en) * | 2018-08-02 | 2018-12-18 | 北京神州泰岳软件股份有限公司 | The segmenting method of Chinese automatic word-cut and Chinese text |
CN109190124A (en) * | 2018-09-14 | 2019-01-11 | 北京字节跳动网络技术有限公司 | Method and apparatus for participle |
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN110781204A (en) * | 2019-09-09 | 2020-02-11 | 腾讯大地通途(北京)科技有限公司 | Identification information determination method, device, equipment and storage medium of target object |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101950284A (en) * | 2010-09-27 | 2011-01-19 | 北京新媒传信科技有限公司 | Chinese word segmentation method and system |
CN103970733A (en) * | 2014-04-10 | 2014-08-06 | 北京大学 | New Chinese word recognition method based on graph structure |
-
2016
- 2016-09-19 CN CN201610831711.2A patent/CN106610937A/en active Pending
Non-Patent Citations (2)
Title |
---|
BECK_ZHOU: "中文分词语言模型和动态规划", 《CSDN博客 HTTPS://BLOG.CSDN.NET/ZHOUBL668/ARTICLE/DETAILS/6896438》 * |
蒋建洪 等: "词典与统计方法结合的中文分词模型研究及应用", 《计算机工程与设计》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291837A (en) * | 2017-05-31 | 2017-10-24 | 北京大学 | A kind of segmenting method of the network text based on field adaptability |
CN107291837B (en) * | 2017-05-31 | 2020-04-03 | 北京大学 | Network text word segmentation method based on field adaptability |
CN108874956A (en) * | 2018-06-05 | 2018-11-23 | 中国平安人寿保险股份有限公司 | Mass file search method, device, computer equipment and storage medium |
CN109033085A (en) * | 2018-08-02 | 2018-12-18 | 北京神州泰岳软件股份有限公司 | The segmenting method of Chinese automatic word-cut and Chinese text |
CN109190124A (en) * | 2018-09-14 | 2019-01-11 | 北京字节跳动网络技术有限公司 | Method and apparatus for participle |
CN109190124B (en) * | 2018-09-14 | 2019-11-26 | 北京字节跳动网络技术有限公司 | Method and apparatus for participle |
WO2020052069A1 (en) * | 2018-09-14 | 2020-03-19 | 北京字节跳动网络技术有限公司 | Method and apparatus for word segmentation |
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN109858011B (en) * | 2018-11-30 | 2022-08-19 | 平安科技(深圳)有限公司 | Standard word bank word segmentation method, device, equipment and computer readable storage medium |
CN110781204A (en) * | 2019-09-09 | 2020-02-11 | 腾讯大地通途(北京)科技有限公司 | Identification information determination method, device, equipment and storage medium of target object |
CN110781204B (en) * | 2019-09-09 | 2024-02-20 | 腾讯大地通途(北京)科技有限公司 | Identification information determining method, device, equipment and storage medium of target object |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111046946B (en) | Burma language image text recognition method based on CRNN | |
CN110019839B (en) | Medical knowledge graph construction method and system based on neural network and remote supervision | |
CN106610937A (en) | Information theory-based Chinese automatic word segmentation method | |
CN106919673B (en) | Text mood analysis system based on deep learning | |
CN107608949B (en) | A kind of Text Information Extraction method and device based on semantic model | |
CN109359291A (en) | A kind of name entity recognition method | |
CN110287480B (en) | Named entity identification method, device, storage medium and terminal equipment | |
CN109543181B (en) | Named entity model and system based on combination of active learning and deep learning | |
CN110134946B (en) | Machine reading understanding method for complex data | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN107168957A (en) | A kind of Chinese word cutting method | |
CN113901797B (en) | Text error correction method, device, equipment and storage medium | |
CN109635288A (en) | A kind of resume abstracting method based on deep neural network | |
CN105373529A (en) | Intelligent word segmentation method based on hidden Markov model | |
CN105068997B (en) | The construction method and device of parallel corpora | |
CN109783809B (en) | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus | |
CN110222328B (en) | Method, device and equipment for labeling participles and parts of speech based on neural network and storage medium | |
CN106611041A (en) | New text similarity solution method | |
CN108664474A (en) | A kind of resume analytic method based on deep learning | |
CN104699797B (en) | A kind of web page data structured analysis method and device | |
CN103955450A (en) | Automatic extraction method of new words | |
CN108829823A (en) | A kind of file classification method | |
CN110134934A (en) | Text emotion analysis method and device | |
CN109255117A (en) | Chinese word cutting method and device | |
CN110222338A (en) | A kind of mechanism name entity recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20170503 |