CN110414002A - Intelligent Chinese-character segmenting method based on statistics and deep learning - Google Patents

Info

Publication number
CN110414002A
Authority
CN
China
Prior art keywords
document
technical term
word
segmenting method
statistics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910655795.2A
Other languages
Chinese (zh)
Other versions
CN110414002B (en)
Inventor
徐建国
刘梦凡
刘泳慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology
Priority to CN201910655795.2A
Publication of CN110414002A
Application granted
Publication of CN110414002B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an intelligent Chinese word segmentation method based on statistics and deep learning, comprising: data preprocessing; construction of the domain term set; selection of the segmentation method; and segmentation judgment. The beneficial effects of the invention are that the segmentation model, which combines a statistics-based segmentation method with deep learning, is widely applicable and can accurately segment the specialized terms of professional domains, while the algorithm is simple and segmentation is fast.

Description

Intelligent Chinese word segmentation method based on statistics and deep learning
Technical field
The invention belongs to the technical field of word segmentation and relates to a technique that improves the segmentation accuracy of technical terms in professional-domain documents.
Background art
Chinese word segmentation is the process of cutting a sequence of Chinese characters into individual words, and it is the basis of natural language processing. Chinese information processing, a branch of natural language processing, comprises three levels: lexical analysis, syntactic analysis, and semantic analysis; Chinese word segmentation is the first step of lexical analysis. Chinese word segmentation is applied very widely, from small tasks such as part-of-speech (POS) tagging and named entity recognition (NER) to large ones such as automatic classification, automatic proofreading, search engines, speech synthesis, and machine translation. Statistics-based Chinese word segmentation methods have limited accuracy and in particular struggle to segment the specialized terms of professional domains accurately; methods that rely on deep learning alone have high algorithmic complexity and segment slowly.
Summary of the invention
The purpose of the present invention is to provide an intelligent Chinese word segmentation method based on statistics and deep learning that solves problems such as the high complexity and slow segmentation speed encountered when Chinese word segmentation is performed with the bidirectional LSTM algorithm alone. The beneficial effects of the invention are that the segmentation model, which combines a statistics-based segmentation method with deep learning, is widely applicable and can accurately segment the specialized terms of professional domains, while the algorithm is simple and segmentation is fast.
The technical solution adopted by the invention comprises the following steps:
Step1. Data preprocessing;
Step2. Domain term set construction;
Step3. Segmentation method selection;
Step4. Segmentation judgment.
Further, in Step1 the text document to be segmented is preprocessed: the document is cut at symbols that already act as separators in the text, such as punctuation marks and paragraph separators, to obtain shorter sentences or character strings.
Further, in Step2 the sub-disciplines of a given domain are numbered from 1 to n and a term set TS is established. The m most common technical terms in each sub-discipline are counted, and the most common technical terms of each sub-discipline constitute the corresponding subject term set $TS_i$.
Further, in Step3 the sub-discipline to which the document belongs is determined from the title of the document to be segmented, and the corresponding subject term set $TS_i$ is extracted. $TS_i$ is traversed to count the sub-discipline technical terms contained in the document and their numbers of occurrences. The total number of occurrences of technical terms in a paragraph of the document is $\sum_{j} num_j$, and the technical term count threshold is defined as $\Gamma = k \cdot total\_num$. The segmentation method is selected as follows:
The total number of occurrences of technical terms is the sum of the occurrence counts of the individual technical terms, where $num_j$ denotes the number of times the j-th technical term occurs in the document. In the threshold $\Gamma = k \cdot total\_num$, k is a proportionality coefficient and total_num is the total number of words in the document. When the total number of technical term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the sub-discipline's technical terms, and the bidirectional LSTM algorithm should be used for segmentation in order to improve accuracy. When the total number of technical term occurrences in a paragraph is less than the threshold, the paragraph is regarded as a general description that uses few technical terms, and a statistics-based segmentation method, namely the hidden Markov model, is used to segment the paragraph.
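The selection rule described above can be written compactly. What follows is a reconstruction from the prose only, because the original formula is not reproduced in this text; the symbol $S$ for the paragraph's total term count and the label "method" are assumptions introduced for illustration.

```latex
S = \sum_{j=1}^{m} num_j, \qquad
\Gamma = k \cdot total\_num, \qquad
\text{method} =
\begin{cases}
\text{Bi-LSTM}, & S > \Gamma \\
\text{HMM},     & S < \Gamma
\end{cases}
```

The text does not say which method applies when $S = \Gamma$, so that case is left open here.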
Further, in Step4 a word-formation information entropy Ψ is defined,
where p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, λ is a proportionality coefficient, and ε is the allowed error term. For segmentation results produced by the hidden Markov model method, Ψ must be computed to judge how tightly characters x and y are bound, and thus whether they can form a word. The larger the value of Ψ, the more strongly the two characters combine into a word; conversely, the smaller the value, the weaker the combination. Screening by Ψ further improves the accuracy of hidden Markov model segmentation. A bidirectional LSTM neural network can feed the result of its first prediction back as a new feature for subsequent prediction, and it has high accuracy and strong learning ability, so its segmentation results need no further judgment.
Brief description of the drawings
Fig. 1 is the overall flowchart of accurate segmentation of a domain document;
Fig. 2 is the flowchart of the data (text) preprocessing process;
Fig. 3 is the network structure of the Bi-LSTM (bidirectional LSTM).
Detailed description of the embodiments
The present invention is described in detail below with reference to specific embodiments.
The flow of the intelligent Chinese word segmentation method based on statistics and deep learning of the present invention is shown in Fig. 1. The steps are as follows:
Step1. Data preprocessing. Fig. 2 is the flowchart of the data (text) preprocessing process. The text document to be segmented is preprocessed: the document can be cut at symbols that already act as separators in the text, such as punctuation marks and paragraph separators, to obtain shorter sentences or character strings.
Given the format and characteristics of Chinese writing, authors usually place content that is similar or closely linked logically within a single paragraph. The technical terms of a domain therefore tend to recur heavily in one or a few paragraphs. Paragraphs in which technical terms occur in large numbers should be processed with a segmentation method that has high accuracy and strong disambiguation ability. Paragraphs in which technical terms are not concentrated (such as background introductions, the author's views, and summary text) can be processed with a statistics-based Chinese word segmentation method, which achieves fairly high accuracy while increasing segmentation speed and reducing algorithmic complexity.
Accurate segmentation of a professional-domain document comprises several steps: text preprocessing, counting the occurrences of technical terms in the document against the subject term set, selecting the segmentation model, and completing the segmentation.
Text preprocessing is the premise and basis of segmentation. The paragraph is the smallest structural unit that expresses the hierarchy of an article; content within one paragraph varies little, so a single procedure can be chosen to segment it. The original document should therefore first be divided into paragraphs at paragraph separators, with each paragraph serving as one data processing unit to which a single segmentation method is applied. Each paragraph is then cut further at separators such as punctuation marks, and the original separators are replaced with spaces. The preprocessed text is a set of shorter sentences or character strings grouped by paragraph. During segmentation these short sentences or strings are processed one by one, which reduces the number of matching operations, improves segmentation efficiency, and lowers segmentation difficulty.
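The two-level cutting described above (paragraphs first, then punctuation replaced by spaces) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the choice of newline as paragraph separator, and the punctuation set are all assumptions, since the text does not enumerate the separators.

```python
import re

# Assumed separators; the source text does not list them explicitly.
PARAGRAPH_SEP = re.compile(r"\n+")
PUNCTUATION = re.compile(r"[，。！？；：、,.!?;:()（）《》]+")


def preprocess(document: str) -> list[list[str]]:
    """Split a document into paragraphs, then each paragraph into short strings."""
    paragraphs = []
    for raw in PARAGRAPH_SEP.split(document):
        # Replace every run of punctuation with a space, as the text describes.
        spaced = PUNCTUATION.sub(" ", raw)
        # Splitting on whitespace yields the short sentences or character strings.
        pieces = spaced.split()
        if pieces:
            paragraphs.append(pieces)
    return paragraphs


if __name__ == "__main__":
    doc = "中文分词是基础。它很重要！\n隐马尔可夫模型，速度快。"
    print(preprocess(doc))
```

Keeping the result grouped by paragraph matters for the later steps, because the segmentation method is chosen once per paragraph.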
Step2. Domain term set construction.
The sub-discipline of the domain to which a document belongs can usually be determined roughly from the document title. The subject term set already built for that sub-discipline is extracted and traversed, the occurrences of common technical terms in each paragraph are counted, and the total number of technical term occurrences in the paragraph is finally obtained.
The sub-disciplines of a given domain are numbered from 1 to n and a term set TS is established. The m most common technical terms in each sub-discipline are counted (m may differ between sub-disciplines), and the most common technical terms of each sub-discipline constitute the corresponding subject term set $TS_i$.
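One way to picture the term set construction is shown below. It is a sketch under stated assumptions: the function name `build_term_sets` and the input `term_usage` (a list of observed technical terms per sub-discipline number) are hypothetical, and a single m is used for all sub-disciplines for brevity even though the text allows m to vary.

```python
from collections import Counter


def build_term_sets(term_usage: dict[int, list[str]], m: int) -> dict[int, set[str]]:
    """Map each sub-discipline number i to TS_i, its m most common terms."""
    term_sets = {}
    for i, observed_terms in term_usage.items():
        counts = Counter(observed_terms)
        # most_common(m) returns at most m (term, count) pairs, highest count first.
        term_sets[i] = {term for term, _ in counts.most_common(m)}
    return term_sets


if __name__ == "__main__":
    usage = {
        1: ["分词", "分词", "词性", "语料", "词性", "分词"],
        2: ["钻井", "油藏", "钻井"],
    }
    print(build_term_sets(usage, m=2))
```

How the observed terms are collected in the first place (for example from domain glossaries or annotated corpora) is not specified in the text and is left outside this sketch.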
Step3. Segmentation method selection. The sub-discipline to which the document belongs is determined from the title of the document to be segmented, and the corresponding subject term set $TS_i$ is extracted. $TS_i$ is traversed to count the sub-discipline technical terms contained in the document and their numbers of occurrences, which are represented as a matrix. The total number of occurrences of technical terms in a paragraph of the document is $\sum_{j} num_j$, and the technical term count threshold is defined as $\Gamma = k \cdot total\_num$. The segmentation method is selected as follows:
The total number of occurrences of technical terms is the sum of the occurrence counts of the individual technical terms, where $num_j$ denotes the number of times the j-th technical term occurs in the document. In the threshold $\Gamma = k \cdot total\_num$, k is a proportionality coefficient and total_num is the total number of words in the document. That is, when the total number of technical term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the sub-discipline's technical terms, and the bidirectional LSTM algorithm should be used for segmentation in order to improve accuracy. When the total number of technical term occurrences in a paragraph is less than the threshold, the paragraph is regarded as a general description that uses few technical terms, and a statistics-based segmentation method, namely the hidden Markov model, is used to segment the paragraph. Fig. 3 shows the bidirectional LSTM network structure.
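The threshold test above can be sketched in a few lines. The names `count_term_occurrences` and `select_method` and the return labels "bilstm" and "hmm" are hypothetical. Two points are assumptions rather than statements of the source: occurrences are counted as non-overlapping substring matches, and a total exactly equal to the threshold falls through to the hidden Markov model, a case the text does not address.

```python
def count_term_occurrences(paragraph: str, term_set: set[str]) -> dict[str, int]:
    """Return num_j for every term in TS_i (non-overlapping substring matches)."""
    return {term: paragraph.count(term) for term in term_set}


def select_method(paragraph: str, term_set: set[str], k: float, total_num: int) -> str:
    """Choose the segmenter for one paragraph by comparing sum(num_j) with k * total_num."""
    total_terms = sum(count_term_occurrences(paragraph, term_set).values())
    threshold = k * total_num  # the threshold written as Γ in the text
    # Assumption: equality is treated like the "less than" case.
    return "bilstm" if total_terms > threshold else "hmm"


if __name__ == "__main__":
    paragraph = "油藏数值模拟中油藏压力影响钻井"
    terms = {"油藏", "钻井"}
    print(select_method(paragraph, terms, k=0.1, total_num=len(paragraph)))
```

Passing `total_num` explicitly leaves the caller free to decide what text the word count is taken over; suitable values of k are not given in the text.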
Step4. Segmentation judgment. A word-formation information entropy Ψ is defined,
where p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, λ is a proportionality coefficient, and ε is the allowed error term. For segmentation results produced by the hidden Markov model method, Ψ must be computed to judge how tightly characters x and y are bound, and thus whether they can form a word. The larger the value of Ψ, the more strongly the two characters combine into a word; conversely, the smaller the value, the weaker the combination. Screening by Ψ further improves the accuracy of hidden Markov model segmentation. A bidirectional LSTM neural network can feed the result of its first prediction back as a new feature for subsequent prediction, and it has high accuracy and strong learning ability, so its segmentation results need no further judgment.
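The exact formula for Ψ is not reproduced in this text. Purely as an illustration of how such a score behaves, the sketch below assumes a pointwise-mutual-information style form, Ψ = λ · log(p(x, y) / (p(x) · p(y))) + ε, which uses the same quantities the text names; this form, the function name `word_formation_score`, and the estimation of probabilities from raw counts over a sample text are all assumptions.

```python
import math
from collections import Counter


def word_formation_score(text: str, x: str, y: str,
                         lam: float = 1.0, eps: float = 0.0) -> float:
    """Assumed form: lam * log(p(x, y) / (p(x) * p(y))) + eps."""
    chars = Counter(text)
    # Adjacent character pairs stand in for "co-occurrence" in this sketch.
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    p_x = chars[x] / len(text)
    p_y = chars[y] / len(text)
    p_xy = bigrams[x + y] / max(len(text) - 1, 1)
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")  # never adjacent: no evidence that x and y form a word
    return lam * math.log(p_xy / (p_x * p_y)) + eps


if __name__ == "__main__":
    sample = "分词分词模型"
    print(word_formation_score(sample, "分", "词"))
    print(word_formation_score(sample, "词", "分"))
```

Under this assumed form a higher score means the pair is more tightly bound, matching the qualitative behaviour the text describes; the cutoff used to accept or reject a candidate word is not given in the text.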
The above are only preferred embodiments of the present invention and do not limit the invention in any form. Any simple modification, equivalent change, or adaptation made to the above embodiments in accordance with the technical essence of the invention falls within the scope of the technical solution of the present invention.

Claims (5)

1. An intelligent Chinese word segmentation method based on statistics and deep learning, characterized by comprising the following steps:
Step1. Data preprocessing;
Step2. Domain term set construction;
Step3. Segmentation method selection;
Step4. Segmentation judgment.
2. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step1 the text document to be segmented is preprocessed, and the document is cut at symbols that already act as separators in the text, such as punctuation marks and paragraph separators, to obtain shorter sentences or character strings.
3. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step2 the sub-disciplines of a given domain are numbered from 1 to n and a term set TS is established; the m most common technical terms in each sub-discipline are counted, and the most common technical terms of each sub-discipline constitute the corresponding subject term set $TS_i$.
4. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step3 the sub-discipline to which the document belongs is determined from the title of the document to be segmented, and the corresponding subject term set $TS_i$ is extracted; $TS_i$ is traversed to count the sub-discipline technical terms contained in the document and their numbers of occurrences; the total number of occurrences of technical terms in a paragraph of the document is $\sum_{j} num_j$, the technical term count threshold is defined as $\Gamma = k \cdot total\_num$, and the segmentation method is selected as follows:
The total number of occurrences of technical terms is the sum of the occurrence counts of the individual technical terms, where $num_j$ denotes the number of times the j-th technical term occurs in the document; in the threshold $\Gamma = k \cdot total\_num$, k is a proportionality coefficient and total_num is the total number of words in the document. When the total number of technical term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the sub-discipline's technical terms, and the bidirectional LSTM algorithm should be used for segmentation in order to improve accuracy; when the total number of technical term occurrences in a paragraph is less than the threshold, the paragraph is regarded as a general description that uses few technical terms, and a statistics-based segmentation method, namely the hidden Markov model, is used to segment the paragraph.
5. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in Step4 a word-formation information entropy Ψ is defined,
where p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, λ is a proportionality coefficient, and ε is the allowed error term; for segmentation results produced by the hidden Markov model method, Ψ must be computed to judge how tightly characters x and y are bound, and thus whether they can form a word; the larger the value of Ψ, the more strongly the two characters combine into a word, and conversely, the smaller the value, the weaker the combination; screening by Ψ further improves the accuracy of hidden Markov model segmentation; a bidirectional LSTM neural network can feed the result of its first prediction back as a new feature for subsequent prediction, and it has high accuracy and strong learning ability, so its segmentation results need no further judgment.
CN201910655795.2A 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning Active CN110414002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910655795.2A CN110414002B (en) 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910655795.2A CN110414002B (en) 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning

Publications (2)

Publication Number Publication Date
CN110414002A (en) 2019-11-05
CN110414002B (en) 2023-06-09

Family

ID=68360365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910655795.2A Active CN110414002B (en) 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning

Country Status (1)

Country Link
CN (1) CN110414002B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014164739A (en) * 2013-02-28 2014-09-08 National Institute Of Information & Communication Technology Parallel translation dictionary generation device, method, and computer program for the same
CN107818130A (en) * 2017-09-15 2018-03-20 深圳市电陶思创科技有限公司 The method for building up and system of a kind of search engine
CN108427670A (en) * 2018-04-08 2018-08-21 重庆邮电大学 A kind of sentiment analysis method based on context word vector sum deep learning
CN109388806A (en) * 2018-10-26 2019-02-26 北京布本智能科技有限公司 A kind of Chinese word cutting method based on deep learning and forgetting algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宫法明, 朱朋海: "Word segmentation of petroleum-domain documents based on an adaptive hidden Markov model", Computer Science (《计算机科学》) *

Also Published As

Publication number Publication date
CN110414002B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN106919673B (en) Text mood analysis system based on deep learning
CN110298033B (en) Keyword corpus labeling training extraction system
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN101599071B (en) Automatic extraction method of conversation text topic
CN109408642A (en) A kind of domain entities relation on attributes abstracting method based on distance supervision
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
CN111178074A (en) Deep learning-based Chinese named entity recognition method
CN106294322A (en) A kind of Chinese based on LSTM zero reference resolution method
CN105138514B (en) It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN109299480A (en) Terminology Translation method and device based on context of co-text
CN109960799A (en) A kind of Optimum Classification method towards short text
CN108845982A (en) A kind of Chinese word cutting method of word-based linked character
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN102063424A (en) Method for Chinese word segmentation
CN109657058A (en) A kind of abstracting method of notice information
CN105975475A (en) Chinese phrase string-based fine-grained thematic information extraction method
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN103324626A (en) Method for setting multi-granularity dictionary and segmenting words and device thereof
CN110134934A (en) Text emotion analysis method and device
CN113032541A (en) Answer extraction method based on bert and fusion sentence cluster retrieval
CN109190099A (en) Sentence mould extracting method and device
CN110502759A (en) The Chinese for incorporating classified dictionary gets over the outer word treatment method of hybrid network nerve machine translation set
CN101520775B (en) Chinese syntax parsing method with merged semantic information
CN112632969B (en) Incremental industry dictionary updating method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant