CN110414002A - Intelligent Chinese word segmentation method based on statistics and deep learning

Info

Publication number: CN110414002A
Application number: CN201910655795.2A
Authority: CN
Inventors: 徐建国; 刘梦凡; 刘泳慧
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2019-11-05
Anticipated expiration: 2039-07-19
Also published as: CN110414002B

Abstract

The invention discloses an intelligent Chinese word segmentation method based on statistics and deep learning, comprising data preprocessing, domain term set construction, segmentation method selection, and segmentation judgment. The beneficial effects of the invention are that the segmentation model, which combines a statistics-based segmentation method with deep learning, is widely applicable, segments the specialized terms of professional domains accurately, and is algorithmically simple and fast.

Description

Intelligent Chinese word segmentation method based on statistics and deep learning

Technical field

The invention belongs to the technical field of word segmentation and relates to a technique for improving the segmentation accuracy of technical terms in professional-domain documents.

Background art

Chinese word segmentation is the process of cutting a sequence of Chinese characters into individual words, and it is the basis of natural language processing. Chinese information processing, a branch of natural language processing, comprises three levels: lexical analysis, syntactic analysis, and semantic analysis; Chinese word segmentation is the first step of lexical analysis. Chinese word segmentation has a wide range of applications, from small tasks such as part-of-speech (POS) tagging and named entity recognition (NER) to large ones such as automatic classification, automatic proofreading, search engines, speech synthesis, and machine translation. Statistics-based Chinese word segmentation methods have limited accuracy and in particular struggle to segment the specialized terms of professional domains accurately; methods that rely on deep learning alone have high algorithmic complexity and segment slowly.

Summary of the invention

The purpose of the invention is to provide an intelligent Chinese word segmentation method based on statistics and deep learning that solves problems such as the high complexity and low speed of Chinese word segmentation when only a bidirectional LSTM algorithm is used. The beneficial effects of the invention are that the segmentation model, which combines a statistics-based segmentation method with deep learning, is widely applicable, segments the specialized terms of professional domains accurately, and is algorithmically simple and fast.

The technical solution adopted by the invention proceeds according to the following steps:

Step 1. Data preprocessing;

Step 2. Domain term set construction;

Step 3. Segmentation method selection;

Step 4. Segmentation judgment.

Further, in Step 1 the text document to be segmented is preprocessed: the document is cut at symbols that act as separators, such as the original punctuation marks and paragraph separators in the text, to obtain shorter sentences or character strings.

Further, in Step 2 the sub-disciplines of a given field are numbered from 1 to n and a terminology set $TS = \{TS_1, TS_2, \dots, TS_n\}$ is established. The m most common technical terms in each sub-discipline are counted, and the most common technical terms of each sub-discipline constitute the corresponding discipline terminology set $TS_i$.

Further, in Step 3 the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline terminology set $TS_i$ is extracted. $TS_i$ is traversed to count the technical terms of that discipline contained in the document and their quantities. The total number of technical-term occurrences in a paragraph of the document is $\sum_{j=1}^{m} num_j$, and the technical-term count threshold is defined as $\Gamma = k \cdot total\_num$. The segmentation method is selected as follows:

$$
\text{method} =
\begin{cases}
\text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\
\text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma
\end{cases}
$$

The total number of technical-term occurrences equals the sum of the occurrence counts of the individual technical terms, where $num_j$ denotes the number of times the j-th technical term appears in the document. In the threshold $\Gamma = k \cdot total\_num$, k is a proportionality coefficient and total_num is the total word count of the document. When the total number of technical-term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and the bidirectional LSTM algorithm should be used for segmentation to improve accuracy. When the total is less than the threshold, the paragraph is regarded as a general description that uses few technical terms, so a statistics-based segmentation method, namely the hidden Markov model, is used to segment the paragraph.

Further, Step 4 defines a word-formation information entropy Ψ in terms of the following quantities: p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur respectively, λ is a proportionality coefficient, and ε is the permitted error term. For a segmentation produced by the hidden Markov model method, Ψ must be computed to judge how tightly the characters x and y are bound and thereby determine whether they can form a word. The larger the value of the word-formation information entropy, the more strongly the two characters combine into a word; conversely, the smaller the value, the weaker the combination. Screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model. A bidirectional LSTM neural network can take the result of a first prediction as a new feature for subsequent predictions, and it has high accuracy and strong learning ability, so its segmentation results do not need to be judged again.

Brief description of the drawings

Fig. 1 is the overall flow chart of accurate segmentation of a domain document;

Fig. 2 is the flow chart of the data (text) preprocessing process;

Fig. 3 is the network structure of the Bi-LSTM (bidirectional LSTM).

Specific embodiments

The invention is described in detail below with reference to specific embodiments.

The flow of the intelligent Chinese word segmentation method based on statistics and deep learning is shown in Fig. 1; the steps are as follows:

Step 1. Data preprocessing. Fig. 2 is the flow chart of the data (text) preprocessing process. The text document to be segmented is preprocessed: the document is cut at symbols that act as separators, such as the original punctuation marks and paragraph separators in the text, to obtain shorter sentences or character strings.

Given the conventions of Chinese writing, authors usually place content that is similar or closely related logically within a single paragraph. The technical terms of a field therefore tend to recur heavily in one or a few paragraphs. For paragraphs in which technical terms occur in large numbers, a segmentation method with high accuracy and strong disambiguation ability should be selected. For paragraphs in which technical terms are not concentrated (such as background introductions, the author's opinions, and summary text), a statistics-based Chinese word segmentation method can achieve high accuracy while increasing segmentation speed and reducing algorithmic complexity.

Accurate segmentation of a professional-domain document comprises several steps: text preprocessing, counting the occurrences of technical terms in the document against the discipline terminology set, selecting the segmentation model, and completing the segmentation.

Text preprocessing is the premise and basis of segmentation. The paragraph is the smallest structural unit that expresses the hierarchy of an article, and the content within one paragraph varies little, so a single method can be used to segment it. The original document should therefore first be divided into paragraphs at the paragraph separators, with each paragraph serving as a data processing unit handled by one segmentation method. Each paragraph is then cut further at separators such as punctuation marks, and the original separators are replaced with spaces. The preprocessed text is a combination of shorter sentences or character strings organized by paragraph. During segmentation these short sentences or strings are processed one by one, which reduces the number of matching operations, improves segmentation efficiency, and lowers segmentation difficulty.
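The two-level cut described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `preprocess` and both separator sets are assumptions, since the text does not enumerate which symbols count as paragraph separators or punctuation.

```python
import re

# Assumed separator sets; the patent text does not list the exact symbols.
PARAGRAPH_SEP = re.compile(r"(?:\r?\n)+")
PUNCTUATION = re.compile(r"[，。！？；：、,.!?;:]+")


def preprocess(document: str) -> list[list[str]]:
    """Split a document into paragraphs, then each paragraph into short strings.

    Returns one list of short strings per non-empty paragraph.
    """
    paragraphs = [p.strip() for p in PARAGRAPH_SEP.split(document) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Replace every run of punctuation with a single space, as the text
        # describes, then split on whitespace to get the short strings.
        spaced = PUNCTUATION.sub(" ", paragraph)
        result.append(spaced.split())
    return result
```

Keeping the paragraph as the outer list preserves the unit on which a segmentation method is later chosen, while the inner strings are the pieces handed to the segmenter one by one.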

Step 2. Domain term set construction.

The sub-discipline to which a document belongs can usually be determined roughly from the document title. The previously built discipline terminology set of the relevant sub-discipline is extracted and traversed, the occurrences of common technical terms in each paragraph are counted, and the total number of technical-term occurrences in the paragraph is finally obtained.

The sub-disciplines of a given field are numbered from 1 to n and a terminology set $TS = \{TS_1, TS_2, \dots, TS_n\}$ is established. The m most common technical terms in each sub-discipline are counted (the value of m may differ between disciplines), and the most common technical terms of each sub-discipline constitute the corresponding discipline terminology set $TS_i$.
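One way to build the per-discipline sets is sketched below. It assumes that a list of observed term usages is already available for each sub-discipline; the text does not say how these usages are collected, and the names `build_term_sets`, `term_usages`, and `m_by_discipline` are hypothetical.

```python
from collections import Counter


def build_term_sets(
    term_usages: dict[int, list[str]],
    m_by_discipline: dict[int, int],
) -> dict[int, set[str]]:
    """Return TS_i for each sub-discipline i.

    term_usages maps a sub-discipline number (1..n) to every observed use of
    a technical term in that sub-discipline (assumed input).
    m_by_discipline maps the same numbers to m, which may differ per discipline.
    """
    term_sets = {}
    for i, usages in term_usages.items():
        # Keep the m most frequent terms of sub-discipline i.
        most_common = Counter(usages).most_common(m_by_discipline[i])
        term_sets[i] = {term for term, _count in most_common}
    return term_sets
```

The full set TS is then simply the collection of all returned values. Ties at the cut-off are broken by first occurrence, which is a property of `Counter.most_common` and not something the text specifies.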

Step 3. Segmentation method selection. The sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline terminology set $TS_i$ is extracted. $TS_i$ is traversed to count the technical terms of that discipline contained in the document and their quantities, which are represented as a matrix. The total number of technical-term occurrences in a paragraph of the document is $\sum_{j=1}^{m} num_j$, and the technical-term count threshold is defined as $\Gamma = k \cdot total\_num$. The segmentation method is selected as follows:

$$
\text{method} =
\begin{cases}
\text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\
\text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma
\end{cases}
$$

The total number of technical-term occurrences equals the sum of the occurrence counts of the individual technical terms, where $num_j$ denotes the number of times the j-th technical term appears in the document. In the threshold $\Gamma = k \cdot total\_num$, k is a proportionality coefficient and total_num is the total word count of the document. That is, when the total number of technical-term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and the bidirectional LSTM algorithm should be used for segmentation to improve accuracy. When the total is less than the threshold, the paragraph is regarded as a general description that uses few technical terms, so a statistics-based segmentation method, namely the hidden Markov model, is used to segment the paragraph. Fig. 3 shows the bidirectional LSTM network structure.
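The selection rule can be illustrated with a short Python sketch. The function names and the returned labels are hypothetical, and two details are assumptions because the text leaves them open: occurrences are counted by plain substring matching, and a total exactly equal to the threshold falls back to the hidden Markov model.

```python
def count_term_occurrences(paragraph: str, term_set: set[str]) -> dict[str, int]:
    """Count num_j for every term of TS_i within one paragraph.

    str.count counts non-overlapping matches; a term that is a substring of
    another term is counted for both (a simplification of this sketch).
    """
    return {term: paragraph.count(term) for term in term_set}


def select_method(paragraph: str, term_set: set[str], total_num: int, k: float) -> str:
    """Choose the segmenter for a paragraph by comparing sum(num_j) with k * total_num."""
    total_occurrences = sum(count_term_occurrences(paragraph, term_set).values())
    threshold = k * total_num  # Gamma = k * total_num
    # Assumption: equality is not covered by the text, so it is routed to the HMM.
    return "bilstm" if total_occurrences > threshold else "hmm"
```

For example, a paragraph containing three term occurrences in a 100-word document is routed to the bidirectional LSTM when k = 0.02 (threshold 2) and to the hidden Markov model when k = 0.05 (threshold 5).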

Step 4. Segmentation judgment. A word-formation information entropy Ψ is defined in terms of the following quantities: p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur respectively, λ is a proportionality coefficient, and ε is the permitted error term. For a segmentation produced by the hidden Markov model method, Ψ must be computed to judge how tightly the characters x and y are bound and thereby determine whether they can form a word. The larger the value of the word-formation information entropy, the more strongly the two characters combine into a word; conversely, the smaller the value, the weaker the combination. Screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model. A bidirectional LSTM neural network can take the result of a first prediction as a new feature for subsequent predictions, and it has high accuracy and strong learning ability, so its segmentation results do not need to be judged again.
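The formula for Ψ is not reproduced in this text, so the sketch below should be read only as one plausible instance. It assumes a pointwise-mutual-information form, Ψ = λ · log2(p(x, y) / (p(x) · p(y))) + ε, which uses exactly the quantities named above but is not confirmed by the source. The helper names and the way probabilities are estimated from a corpus are likewise assumptions.

```python
import math
from collections import Counter


def char_statistics(corpus: list[str]) -> tuple[Counter, Counter]:
    """Count single characters and adjacent character pairs in a list of strings."""
    chars, pairs = Counter(), Counter()
    for text in corpus:
        chars.update(text)
        pairs.update(zip(text, text[1:]))
    return chars, pairs


def psi(x: str, y: str, chars: Counter, pairs: Counter,
        lam: float = 1.0, eps: float = 0.0) -> float:
    """Assumed PMI-style score: lam * log2(p(x, y) / (p(x) * p(y))) + eps."""
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    if n_pairs == 0 or pairs[(x, y)] == 0:
        # The pair never co-occurs, so it cannot be supported as a word.
        return float("-inf")
    p_xy = pairs[(x, y)] / n_pairs
    p_x = chars[x] / n_chars
    p_y = chars[y] / n_chars
    return lam * math.log2(p_xy / (p_x * p_y)) + eps
```

Under this assumed form, a higher score means the two characters co-occur more often than their individual frequencies would predict, which matches the stated reading that a larger Ψ indicates a tighter combination. A cut-off on Ψ for accepting a word produced by the hidden Markov model would have to be chosen separately; the text does not give one.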

The above are only preferred embodiments of the invention and do not limit the invention in any form. Any simple modification, equivalent change, or variation made to the above embodiments according to the technical essence of the invention falls within the scope of the technical solution of the invention.

Claims

1. An intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that it proceeds according to the following steps:

Step 1. Data preprocessing;

Step 2. Domain term set construction;

Step 3. Segmentation method selection;

Step 4. Segmentation judgment.

2. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step 1 the text document to be segmented is preprocessed, and the document is cut at symbols that act as separators, such as the original punctuation marks and paragraph separators in the text, to obtain shorter sentences or character strings.

3. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step 2 the sub-disciplines of a given field are numbered from 1 to n and a terminology set $TS = \{TS_1, TS_2, \dots, TS_n\}$ is established; the m most common technical terms in each sub-discipline are counted, and the most common technical terms of each sub-discipline constitute the corresponding discipline terminology set $TS_i$.

4. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step 3 the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, the corresponding discipline terminology set $TS_i$ is extracted and traversed, and the technical terms of that discipline contained in the document and their quantities are counted; the total number of technical-term occurrences in a paragraph of the document is $\sum_{j=1}^{m} num_j$, the technical-term count threshold is defined as $\Gamma = k \cdot total\_num$, and the segmentation method is selected as follows:

$$
\text{method} =
\begin{cases}
\text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\
\text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma
\end{cases}
$$

the total number of technical-term occurrences equals the sum of the occurrence counts of the individual technical terms, where $num_j$ denotes the number of times the j-th technical term appears in the document; in the threshold $\Gamma = k \cdot total\_num$, k is a proportionality coefficient and total_num is the total word count of the document; when the total number of technical-term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and the bidirectional LSTM algorithm should be used for segmentation to improve accuracy; when the total is less than the threshold, the paragraph is regarded as a general description that uses few technical terms, so a statistics-based segmentation method, namely the hidden Markov model, is used to segment the paragraph.

5. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step 4 a word-formation information entropy Ψ is defined in terms of the following quantities: p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur respectively, λ is a proportionality coefficient, and ε is the permitted error term; for a segmentation produced by the hidden Markov model method, Ψ must be computed to judge how tightly the characters x and y are bound and thereby determine whether they can form a word; the larger the value of the word-formation information entropy, the more strongly the two characters combine into a word, and conversely, the smaller the value, the weaker the combination; screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model; a bidirectional LSTM neural network can take the result of a first prediction as a new feature for subsequent predictions, and it has high accuracy and strong learning ability, so its segmentation results do not need to be judged again.