CN106610937A - Information theory-based Chinese automatic word segmentation method - Google Patents

Information theory-based Chinese automatic word segmentation method

Info

Publication number
CN106610937A
Authority
CN
China
Prior art keywords
word
word segmentation
path
dictionary
paths
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610831711.2A
Other languages
Chinese (zh)
Inventor
金平艳
胡成华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yonglian Information Technology Co Ltd
Original Assignee
Sichuan Yonglian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yonglian Information Technology Co Ltd
Priority to CN201610831711.2A
Publication of CN106610937A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An information theory-based Chinese automatic word segmentation method comprises the steps of: matching the sentence to be segmented against the words of a corpus that has already been initialized; splitting the sentence into a mesh (lattice) structure according to a probabilistic statistical method; computing a weight for each edge of the mesh by an information-theoretic method, the path with the maximum weight being the segmentation result for the sentence; and judging the segmentation quality by precision rate and recall rate. The method is stated to pre-process Chinese faster and more precisely than a method based on a word segmentation dictionary and more accurately than a purely statistical method, to be more practical and closer to empirical values, and to be of great application value for subsequent natural language processing.

Description

An information theory-based Chinese automatic word segmentation algorithm
Technical field
The present invention relates to the technical field of Chinese semantic networks, and in particular to an information theory-based Chinese automatic word segmentation algorithm.
Background
Chinese automatic word segmentation based on understanding is still at the experimental stage, so methods based on a word segmentation dictionary and methods based on probability statistics are the mainstream of current Chinese automatic word segmentation technology. Dictionary-based methods are simple to port and need not consider adaptivity when moving between domains; however, they are comparatively weak at resolving the ambiguities that arise during segmentation and at problems such as named entity recognition. Statistics-based methods rely on powerful mathematical statistical models and greatly improve segmentation performance, but they work poorly across domains and depend heavily on the corpus: a different corpus must be prepared for each domain in order to train a statistical segmentation model for that domain. After a change of domain, a segmentation corpus for the new domain must therefore be provided. Building and maintaining the annotated corpora needed for training, however, requires substantial manpower and material resources. By contrast, dictionary-based methods have an advantage in domain adaptation: when the target domain changes, only a dictionary for the new domain needs to be added, and a domain dictionary is much easier to obtain than a corpus. Combining a word segmentation dictionary with probabilistic statistical methods has therefore become the mainstream of current word segmentation. To realize Chinese automatic word segmentation and improve the accuracy of its results, the present invention proposes an information theory-based Chinese automatic word segmentation algorithm.
Summary of the invention
To realize Chinese automatic word segmentation and to address the low accuracy of segmentation results, the invention provides an information theory-based Chinese automatic word segmentation algorithm.
To solve the above problems, the present invention is realized by the following technical solution:
Step 1: Initialize the training model, which may be the word segmentation dictionary, a corpus of the relevant domain, or a model combining both.
Step 2: Using the word segmentation dictionary, find the words in the sentence to be segmented that match dictionary entries.
Step 3: According to probability statistics, split the sentence to be segmented into a network structure, which yields n candidate sentence structures; the nodes of this structure are denoted, in order, $S, M_1, M_2, M_3, M_4, M_5, E$.
Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure.
Step 5: Find the path of maximum weight; it is the segmentation result for the sentence.
Step 6: Verify the accuracy rate and recall rate of this segmentation result.
The beneficial effects of the invention are:
1. Chinese pre-processing is faster than with the dictionary-based method.
2. The method is more precise than the dictionary-based method.
3. The method is more accurate than the statistics-based method.
4. The method is more practical and agrees better with empirical values.
5. The method is of great application value for subsequent natural language processing techniques.
Description of the drawings
Fig. 1 is a structural flow chart of the information theory-based Chinese automatic word segmentation algorithm.
Fig. 2 illustrates the n-gram segmentation method.
Detailed description of the embodiments
To improve the accuracy of Chinese automatic word segmentation, the present invention is described in detail with reference to Fig. 1 and Fig. 2. The specific implementation steps are as follows:
Step 1: Initialize the training model, which may be the word segmentation dictionary, a corpus of the relevant domain, or a model combining both.
Step 2: Using the word segmentation dictionary, find the words in the sentence to be segmented that match dictionary entries, as follows:
The Chinese character string to be segmented is scanned once in full and looked up in the system dictionary. Whenever a word present in the dictionary is encountered, it is identified as a word; if the dictionary contains no match, the single character is split off as a word. This continues until the character string is empty.
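The dictionary scan in Step 2 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the longest-match-first policy, the `max_len` window, the function name `forward_max_match`, and the toy dictionary are all assumptions, since the text only states that dictionary words are identified and unmatched characters are split off singly.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right dictionary scan (longest match first, an assumption)."""
    words = []
    i = 0
    while i < len(sentence):  # stop once the remaining string is empty
        # Try the longest candidate first, shrinking down to two characters.
        for size in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: split off the single character.
            words.append(sentence[i])
            i += 1
    return words


# Toy dictionary, invented for this example only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
result = forward_max_match("研究生命起源", toy_dictionary)
print(result)
```

A greedy scan commits to "研究生" here and leaves "命" stranded as a single character, which is the kind of ambiguity the later path-weighting steps are meant to resolve.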
Step 3: According to probability statistics, split the sentence to be segmented into a network structure, which yields n candidate sentence structures; the nodes of this structure are denoted, in order, $S, M_1, M_2, M_3, M_4, M_5, E$. The structure is shown in Fig. 2.
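One way to picture the network structure is to enumerate every candidate segmentation of a short sentence, each candidate being one path from the start node to the end node. The sketch below is hedged: it builds candidates from dictionary matches plus single characters, whereas the source says only that the split follows probability statistics; `enumerate_paths`, `max_len`, and the toy dictionary are hypothetical names and data.

```python
def enumerate_paths(sentence, dictionary, max_len=4):
    """Return every segmentation whose multi-character words are dictionary entries."""
    if not sentence:
        return [[]]  # one empty path: nothing left to segment
    paths = []
    for size in range(1, min(max_len, len(sentence)) + 1):
        head = sentence[:size]
        # Single characters are always allowed; longer pieces must be in the dictionary.
        if size == 1 or head in dictionary:
            for tail in enumerate_paths(sentence[size:], dictionary, max_len):
                paths.append([head] + tail)
    return paths


# Toy dictionary, invented for this example only.
toy_dictionary = {"研究", "研究生", "生命"}
paths = enumerate_paths("研究生命", toy_dictionary)
for path in paths:
    print(" / ".join(path))
```

Plain recursion is exponential in the worst case; it is used here only because the example sentence is four characters long.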
Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure. The calculation proceeds as follows:
Counting both the dictionary words matched by the word segmentation dictionary and the unmatched single characters, the $i$-th path contains $n_i$ words, so the word counts of the $n$ paths form the set $(n_1, n_2, \ldots, n_n)$.
Take the minimum: $\min(n_1, n_2, \ldots, n_n)$.
This minimum-word-count operation eliminates $m$ paths ($m < n$). For the $(n-m)$ paths that remain, compute the weight of each edge joining adjacent words.
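The pruning by word count can be illustrated as below, assuming (the source does not say so explicitly) that the eliminated paths are those containing more words than the minimum. `keep_min_word_paths` is a hypothetical helper and the candidate paths are invented.

```python
def keep_min_word_paths(paths):
    """Keep only the candidate paths whose word count equals the minimum."""
    min_words = min(len(path) for path in paths)  # min(n_1, ..., n_n)
    return [path for path in paths if len(path) == min_words]


candidates = [
    ["研", "究", "生", "命"],  # n_1 = 4
    ["研究", "生命"],          # n_2 = 2
    ["研究生", "命"],          # n_3 = 2
]
remaining = keep_min_word_paths(candidates)  # one path is eliminated here
print(remaining)
```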
In the statistical corpus, compute the information content $X(C_i)$ of each word, then the co-occurrence information content $X(C_i, C_{i+1})$ of each pair of adjacent words on a path, using the following formulas:
$X(C_i) = \left| x(C_i)_1 - x(C_i)_2 \right|$
where $x(C_i)_1$ is the information content of the word $C_i$ in the text corpus and $x(C_i)_2$ is the information content of the texts containing $C_i$.
$x(C_i)_1 = -p(C_i)_1 \ln p(C_i)_1$
where $p(C_i)_1$ is the probability of $C_i$ in the text corpus and $n$ is the number of corpus texts containing $C_i$.
$x(C_i)_2 = -p(C_i)_2 \ln p(C_i)_2$
where $p(C_i)_2$ is the probability given by the number of texts containing $C_i$ and $N$ is the total number of texts in the statistical corpus.
Similarly, $X(C_i, C_{i+1}) = \left| x(C_i, C_{i+1})_1 - x(C_i, C_{i+1})_2 \right|$
where $x(C_i, C_{i+1})_1$ is the co-occurrence information content of the words $(C_i, C_{i+1})$ in the text corpus and $x(C_i, C_{i+1})_2$ is the information content of the texts in which the adjacent words $(C_i, C_{i+1})$ co-occur.
Similarly, $x(C_i, C_{i+1})_1 = -p(C_i, C_{i+1})_1 \ln p(C_i, C_{i+1})_1$
where $p(C_i, C_{i+1})_1$ is the co-occurrence probability of the words $(C_i, C_{i+1})$ in the text corpus and $m$ is the number of texts in the text library in which $(C_i, C_{i+1})$ co-occur.
$x(C_i, C_{i+1})_2 = -p(C_i, C_{i+1})_2 \ln p(C_i, C_{i+1})_2$
where $p(C_i, C_{i+1})_2$ is the probability given by the number of texts in the text library in which the adjacent words $(C_i, C_{i+1})$ co-occur.
In summary, the weight of each edge between adjacent words is
$w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2X(C_i, C_{i+1})$
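The edge-weight formula can be checked numerically with a short sketch. The probabilities are taken as inputs because the text does not preserve how they are estimated from the corpus; the function names and the sample values are assumptions made for illustration.

```python
import math


def entropy_term(p):
    """Return -p * ln(p); treated as 0 when p is not positive."""
    return 0.0 if p <= 0.0 else -p * math.log(p)


def information(p1, p2):
    """X = |x_1 - x_2|, where x_k = -p_k * ln(p_k)."""
    return abs(entropy_term(p1) - entropy_term(p2))


def edge_weight(p_left, p_right, p_pair):
    """w(C_i, C_i+1) = X(C_i) + X(C_i+1) - 2 * X(C_i, C_i+1).

    Each argument is a (p_1, p_2) tuple: the corpus-level probability and
    the text-count probability for the left word, the right word, and the pair.
    """
    return (information(*p_left) + information(*p_right)
            - 2.0 * information(*p_pair))


# Hypothetical probabilities, chosen only to exercise the formula.
w = edge_weight(p_left=(0.5, 1.0), p_right=(0.5, 1.0), p_pair=(1.0, 1.0))
print(w)
```

With these inputs each single-word term is $0.5 \ln 2$ and the pair term vanishes, so the weight is $\ln 2$.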
Step 5: Find the path of maximum weight, which is the segmentation result for the sentence. The calculation proceeds as follows:
There are $n$ paths, each of a different length; let the set of paths be $(L_1, L_2, \ldots, L_n)$.
Suppose the minimum-word-count operation on the paths eliminates $m$ paths, $m < n$. Then $(n-m)$ paths remain; let their set be: [formula missing in the source]
The weight of each path is then: [formula missing in the source]
where the terms of this formula [symbols missing in the source] are the weights of the 1st, 2nd, and so on up to the last edge of the path, each of which can be computed according to Step 4, and the upper index is the length of the $S_j$-th path among the remaining $(n-m)$ paths.
The path of maximum weight is: [formula missing in the source]
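Selecting the best path can be sketched as follows. Because the path-weight formula is not preserved in this text, the sketch assumes that a path's weight is the sum of its edge weights; edges touching the start and end nodes are ignored, and the weight table is made-up data. `path_weight`, `best_path`, and `toy_edge_weight` are hypothetical names.

```python
def path_weight(words, edge_weight):
    """Sum the weights of the edges between consecutive words (assumed rule)."""
    return sum(edge_weight(a, b) for a, b in zip(words, words[1:]))


def best_path(paths, edge_weight):
    """Return the candidate path with the largest total weight."""
    return max(paths, key=lambda words: path_weight(words, edge_weight))


# Made-up edge weights standing in for the values Step 4 would produce.
toy_weights = {
    ("研究", "生命"): 0.9, ("生命", "起源"): 0.8,
    ("研究生", "命"): 0.1, ("命", "起源"): 0.2,
}


def toy_edge_weight(left, right):
    """Look up an edge weight, defaulting to 0.0 for unseen pairs."""
    return toy_weights.get((left, right), 0.0)


paths = [["研究", "生命", "起源"], ["研究生", "命", "起源"]]
chosen = best_path(paths, toy_edge_weight)
print(chosen)
```

For long sentences the same maximisation is usually done with dynamic programming over the lattice rather than by listing every path; exhaustive comparison is used here only to keep the example short.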
Step 6: Verify the accuracy rate and recall rate of this segmentation result.
Accuracy rate: [formula missing in the source]
where $n_{\text{known}}$ is the number of dictionary words in the sentence to be segmented that the word segmentation dictionary identifies, and $n_z$ is the number of words the method segments correctly.
Recall rate: [formula missing in the source]
where $n_{\text{total}}$ is the total number of words in the sentence to be segmented.
Finally, the two factors are considered together to judge the correctness of the system's segmentation result, that is,
$d = \left| \text{zhaorate} - \text{rate} \right| \le \varepsilon$
where zhaorate and rate appear to denote the recall rate and the accuracy rate, and $\varepsilon$ is a very small threshold set by an expert. When $d$ satisfies this condition, the segmentation result is considered satisfactory.
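The final check can be sketched as below. The accuracy and recall formulas are not preserved in this text, so the ratios used here (correct words over dictionary-identified words, and correct words over total words) are assumptions, as are the function name `evaluate`, the parameter names, and the default threshold of 0.05; the source says only that the threshold is small and set by an expert.

```python
def evaluate(n_z, n_known, n_total, epsilon=0.05):
    """Return (rate, zhaorate, ok) for one segmented sentence.

    ASSUMPTION: rate = n_z / n_known and zhaorate = n_z / n_total; the
    source's own formulas are not available in this text.
    """
    rate = n_z / n_known       # accuracy rate (assumed definition)
    zhaorate = n_z / n_total   # recall rate (assumed definition)
    d = abs(zhaorate - rate)   # d = |zhaorate - rate|
    return rate, zhaorate, d <= epsilon


rate, zhaorate, ok = evaluate(n_z=9, n_known=10, n_total=10)
print(rate, zhaorate, ok)
```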

Claims (3)

1. An information theory-based Chinese automatic word segmentation algorithm, relating to the technical field of Chinese semantic networks, characterized in that it comprises the following steps:
Step 1: Initialize the training model, which may be the word segmentation dictionary, a corpus of the relevant domain, or a model combining both;
Step 2: Using the word segmentation dictionary, find the words in the sentence to be segmented that match dictionary entries;
Step 3: According to probability statistics, split the sentence to be segmented into a network structure, which yields n candidate sentence structures; the nodes of this structure are denoted, in order, $S, M_1, M_2, M_3, M_4, M_5, E$;
Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure;
Step 5: Find the path of maximum weight, which is the segmentation result for the sentence;
Step 6: Verify the accuracy rate and recall rate of this segmentation result;
Accuracy rate: [formula missing in the source]
where $n_{\text{known}}$ is the number of dictionary words in the sentence to be segmented that the word segmentation dictionary identifies, and $n_z$ is the number of words the method segments correctly;
Recall rate: [formula missing in the source]
where $n_{\text{total}}$ is the total number of words in the sentence to be segmented;
Finally, the two factors are considered together to judge the correctness of the system's segmentation result,
that is, $d = \left| \text{zhaorate} - \text{rate} \right| \le \varepsilon$,
where $\varepsilon$ is a very small threshold set by an expert; when $d$ satisfies this condition, the segmentation result is considered satisfactory.
2. The information theory-based Chinese automatic word segmentation algorithm according to claim 1, characterized in that the calculation in Step 4 proceeds as follows:
Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure, calculated as follows:
Counting both the dictionary words matched by the word segmentation dictionary and the unmatched single characters, the $i$-th path contains $n_i$ words, so the word counts of the $n$ paths form the set $(n_1, n_2, \ldots, n_n)$;
For the $(n-m)$ paths that remain, compute the weight of each edge joining adjacent words;
In the statistical corpus, compute the information content $X(C_i)$ of each word, then the co-occurrence information content $X(C_i, C_{i+1})$ of each pair of adjacent words on a path, using the following formulas:
$X(C_i) = \left| x(C_i)_1 - x(C_i)_2 \right|$
where $x(C_i)_1$ is the information content of the word $C_i$ in the text corpus and $x(C_i)_2$ is the information content of the texts containing $C_i$;
$x(C_i)_1 = -p(C_i)_1 \ln p(C_i)_1$
where $p(C_i)_1$ is the probability of $C_i$ in the text corpus and $n$ is the number of corpus texts containing $C_i$;
$x(C_i)_2 = -p(C_i)_2 \ln p(C_i)_2$
where $p(C_i)_2$ is the probability given by the number of texts containing $C_i$ and $N$ is the total number of texts in the statistical corpus;
Similarly, $X(C_i, C_{i+1}) = \left| x(C_i, C_{i+1})_1 - x(C_i, C_{i+1})_2 \right|$
where $x(C_i, C_{i+1})_1$ is the co-occurrence information content of the words $(C_i, C_{i+1})$ in the text corpus and $x(C_i, C_{i+1})_2$ is the information content of the texts in which the adjacent words $(C_i, C_{i+1})$ co-occur;
Similarly, $x(C_i, C_{i+1})_1 = -p(C_i, C_{i+1})_1 \ln p(C_i, C_{i+1})_1$
where $p(C_i, C_{i+1})_1$ is the co-occurrence probability of the words $(C_i, C_{i+1})$ in the text corpus and $m$ is the number of texts in the text library in which $(C_i, C_{i+1})$ co-occur;
$x(C_i, C_{i+1})_2 = -p(C_i, C_{i+1})_2 \ln p(C_i, C_{i+1})_2$
where $p(C_i, C_{i+1})_2$ is the probability given by the number of texts in the text library in which the adjacent words $(C_i, C_{i+1})$ co-occur;
In summary, the weight of each edge between adjacent words is
$w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2X(C_i, C_{i+1})$
3. The information theory-based Chinese automatic word segmentation algorithm according to claim 1, characterized in that the calculation in Step 5 proceeds as follows:
There are $n$ paths, each of a different length; let the set of paths be $(L_1, L_2, \ldots, L_n)$;
Suppose the minimum-word-count operation on the paths eliminates $m$ paths, $m < n$, so that $(n-m)$ paths remain; let the set of their path lengths be: [formula missing in the source]
The weight of each path is then: [formula missing in the source]
where the terms of this formula [symbols missing in the source] are the weights of the 1st, 2nd, and so on up to the last edge of the path, each of which can be computed according to Step 4, and the upper index is the length of the $S_j$-th path among the remaining $(n-m)$ paths; the path of maximum weight is: [formula missing in the source]
CN201610831711.2A 2016-09-19 2016-09-19 Information theory-based Chinese automatic word segmentation method Pending CN106610937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610831711.2A CN106610937A (en) 2016-09-19 2016-09-19 Information theory-based Chinese automatic word segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610831711.2A CN106610937A (en) 2016-09-19 2016-09-19 Information theory-based Chinese automatic word segmentation method

Publications (1)

Publication Number Publication Date
CN106610937A true CN106610937A (en) 2017-05-03

Family

ID=58614954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610831711.2A Pending CN106610937A (en) 2016-09-19 2016-09-19 Information theory-based Chinese automatic word segmentation method

Country Status (1)

Country Link
CN (1) CN106610937A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291837A (en) * 2017-05-31 2017-10-24 北京大学 A kind of segmenting method of the network text based on field adaptability
CN108874956A (en) * 2018-06-05 2018-11-23 中国平安人寿保险股份有限公司 Mass file search method, device, computer equipment and storage medium
CN109033085A (en) * 2018-08-02 2018-12-18 北京神州泰岳软件股份有限公司 The segmenting method of Chinese automatic word-cut and Chinese text
CN109190124A (en) * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for participle
CN109858011A (en) * 2018-11-30 2019-06-07 平安科技(深圳)有限公司 Standard dictionary segmenting method, device, equipment and computer readable storage medium
CN110781204A (en) * 2019-09-09 2020-02-11 腾讯大地通途(北京)科技有限公司 Identification information determination method, device, equipment and storage medium of target object

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950284A (en) * 2010-09-27 2011-01-19 北京新媒传信科技有限公司 Chinese word segmentation method and system
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BECK_ZHOU: "中文分词语言模型和动态规划", 《CSDN博客 HTTPS://BLOG.CSDN.NET/ZHOUBL668/ARTICLE/DETAILS/6896438》 *
蒋建洪 等: "词典与统计方法结合的中文分词模型研究及应用", 《计算机工程与设计》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291837A (en) * 2017-05-31 2017-10-24 北京大学 A kind of segmenting method of the network text based on field adaptability
CN107291837B (en) * 2017-05-31 2020-04-03 北京大学 Network text word segmentation method based on field adaptability
CN108874956A (en) * 2018-06-05 2018-11-23 中国平安人寿保险股份有限公司 Mass file search method, device, computer equipment and storage medium
CN109033085A (en) * 2018-08-02 2018-12-18 北京神州泰岳软件股份有限公司 The segmenting method of Chinese automatic word-cut and Chinese text
CN109190124A (en) * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for participle
CN109190124B (en) * 2018-09-14 2019-11-26 北京字节跳动网络技术有限公司 Method and apparatus for participle
WO2020052069A1 (en) * 2018-09-14 2020-03-19 北京字节跳动网络技术有限公司 Method and apparatus for word segmentation
CN109858011A (en) * 2018-11-30 2019-06-07 平安科技(深圳)有限公司 Standard dictionary segmenting method, device, equipment and computer readable storage medium
CN109858011B (en) * 2018-11-30 2022-08-19 平安科技(深圳)有限公司 Standard word bank word segmentation method, device, equipment and computer readable storage medium
CN110781204A (en) * 2019-09-09 2020-02-11 腾讯大地通途(北京)科技有限公司 Identification information determination method, device, equipment and storage medium of target object
CN110781204B (en) * 2019-09-09 2024-02-20 腾讯大地通途(北京)科技有限公司 Identification information determining method, device, equipment and storage medium of target object

Similar Documents

Publication Publication Date Title
CN111046946B (en) Burma language image text recognition method based on CRNN
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN106610937A (en) Information theory-based Chinese automatic word segmentation method
CN106919673B (en) Text mood analysis system based on deep learning
CN107608949B (en) A kind of Text Information Extraction method and device based on semantic model
CN109359291A (en) A kind of name entity recognition method
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN110134946B (en) Machine reading understanding method for complex data
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN107168957A (en) A kind of Chinese word cutting method
CN113901797B (en) Text error correction method, device, equipment and storage medium
CN109635288A (en) A kind of resume abstracting method based on deep neural network
CN105373529A (en) Intelligent word segmentation method based on hidden Markov model
CN105068997B (en) The construction method and device of parallel corpora
CN109783809B (en) Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus
CN110222328B (en) Method, device and equipment for labeling participles and parts of speech based on neural network and storage medium
CN106611041A (en) New text similarity solution method
CN108664474A (en) A kind of resume analytic method based on deep learning
CN104699797B (en) A kind of web page data structured analysis method and device
CN103955450A (en) Automatic extraction method of new words
CN108829823A (en) A kind of file classification method
CN110134934A (en) Text emotion analysis method and device
CN109255117A (en) Chinese word cutting method and device
CN110222338A (en) A kind of mechanism name entity recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170503