CN110414002A - Intelligent Chinese-character segmenting method based on statistics and deep learning - Google Patents
- Publication number
- CN110414002A (application CN201910655795.2A)
- Authority
- CN
- China
- Prior art keywords
- document
- technical term
- word
- segmenting method
- statistics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an intelligent Chinese word segmentation method based on statistics and deep learning, comprising data preprocessing, construction of domain term sets, selection of the segmentation method, and segmentation determination. The beneficial effects of the invention are that the segmentation model, which combines a statistics-based segmentation method with deep learning, is widely applicable, segments the specialized vocabulary of professional domains accurately, and, because the algorithm is simple, segments quickly.
Description
Technical field
The invention belongs to the field of word segmentation technology and relates to a technique for improving the accuracy of technical-term segmentation in professional-domain documents.
Background art
Chinese word segmentation is the process of cutting a sequence of Chinese characters into individual words; it is the basis of natural language processing. Chinese information processing, a branch of natural language processing, comprises three levels: lexical analysis, syntactic analysis, and semantic analysis, and Chinese word segmentation is the first step of lexical analysis. Chinese word segmentation has a very wide range of applications, from smaller tasks such as part-of-speech (POS) tagging and named entity recognition (NER) to larger ones such as automatic classification, automatic proofreading, search engines, speech synthesis, and machine translation. Statistics-based Chinese word segmentation methods have limited accuracy and in particular struggle to segment the specialized vocabulary of professional domains accurately; methods that rely solely on deep learning have high algorithmic complexity and segment slowly.
Summary of the invention
The purpose of the invention is to provide an intelligent Chinese word segmentation method based on statistics and deep learning that addresses the high complexity and slow segmentation speed of performing Chinese word segmentation with the bidirectional LSTM algorithm alone. The beneficial effects of the invention are that the segmentation model, which combines a statistics-based segmentation method with deep learning, is widely applicable, segments the specialized vocabulary of professional domains accurately, and, because the algorithm is simple, segments quickly.
The technical solution adopted by the invention follows these steps:
Step 1. Data preprocessing;
Step 2. Construction of domain term sets;
Step 3. Selection of the segmentation method;
Step 4. Segmentation determination.
Further, in Step 1 the text document to be segmented is preprocessed: the document is cut at symbols that act as separators, such as the punctuation marks and paragraph separators already present in the text, to obtain shorter sentences or character strings.
Further, in Step 2 the sub-disciplines of a given field are numbered from 1 to n and a term set TS is established, made up of the sub-discipline term sets TS_i (i = 1, ..., n). The m most common technical terms in each sub-discipline are counted, and the most common technical terms of each sub-discipline constitute the corresponding discipline term set TS_i.
Further, in Step 3 the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline term set TS_i is extracted. TS_i is traversed to count the technical terms of that discipline contained in the document and their numbers of occurrences. The total number of technical-term occurrences in a document paragraph is

$$\sum_{j=1}^{m} num_j$$

and the technical-term count threshold is defined as Γ = k · total_num. The segmentation method is selected as follows:

$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$

The total number of technical-term occurrences is the sum of the occurrences of each technical term, where num_j denotes the number of times the j-th technical term occurs in the document. In the threshold Γ = k · total_num, k is a proportionality coefficient and total_num is the total word count of the document. When the total number of technical-term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and to improve segmentation accuracy it should be segmented with the bidirectional LSTM algorithm. When the total is less than the threshold, the paragraph is regarded as general description that uses few technical terms, so it is segmented with a statistics-based method, namely the hidden Markov model.
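The text does not spell out how the hidden Markov model segments a paragraph. A common formulation, assumed here and not confirmed by the source, tags each character as B (begin), M (middle), E (end), or S (single-character word) and decodes the tags with the Viterbi algorithm. The following minimal Python sketch illustrates that formulation; the names `viterbi` and `tags_to_words` and all probability values are hypothetical toy choices, not trained parameters.

```python
import math

STATES = "BMES"  # assumed tag set: Begin, Middle, End, Single

# Toy, hand-set parameters for illustration only (not trained values).
START_P = {"B": 0.6, "S": 0.4}
TRANS_P = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.4, "E": 0.6},
    "E": {"B": 0.6, "S": 0.4},
    "S": {"B": 0.6, "S": 0.4},
}
EMIT_P = {s: {} for s in STATES}  # empty: every character gets the floor probability


def viterbi(chars, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable BMES tag sequence for chars (log-space Viterbi)."""
    if not chars:
        return []

    def lp(p):
        # Floor unseen events so that log() is always defined.
        return math.log(max(p, floor))

    # scores[t][s]: best log-probability of any tag path ending in state s at position t
    scores = [{s: lp(start_p.get(s, 0)) + lp(emit_p[s].get(chars[0], 0)) for s in STATES}]
    back = []
    for ch in chars[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: scores[-1][r] + lp(trans_p[r].get(s, 0)))
            row[s] = scores[-1][prev] + lp(trans_p[prev].get(s, 0)) + lp(emit_p[s].get(ch, 0))
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    # A word cannot stop on B or M, so the final tag is restricted to E or S.
    last = max("ES", key=lambda s: scores[-1][s])
    tags = [last]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return tags[::-1]


def tags_to_words(chars, tags):
    """Cut chars into words: a word closes at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:  # keep any unfinished trailing fragment
        words.append(current)
    return words
```

With these toy parameters the emissions are uniform, so the transitions alone decide the path: `viterbi("深度学习", START_P, TRANS_P, EMIT_P)` yields the tags B, E, B, E, which `tags_to_words` turns into `["深度", "学习"]`. A real system would estimate all three probability tables from a segmented corpus.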
Further, in Step 4 a word-formation information entropy Ψ is defined in terms of the following quantities: p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, respectively, λ is a proportionality coefficient, and ε is the permitted error term. For a segmentation produced by the hidden Markov model method, Ψ is computed to judge how tightly the characters x and y are bound and thereby determine whether they can form a word. The larger the value of Ψ, the more strongly the two characters combine into a word; conversely, the smaller the value, the weaker the combination. Screening by Ψ further improves the accuracy of hidden Markov model segmentation. A bidirectional LSTM network can feed the result of a first prediction back in as a new feature for subsequent predictions, and it has very high accuracy and strong learning ability, so its segmentation results need no second judgment.
Brief description of the drawings
Fig. 1 is a flowchart of the overall process for accurately segmenting a domain document;
Fig. 2 is a flowchart of the data (text) preprocessing process;
Fig. 3 is the network structure of the Bi-LSTM (bidirectional LSTM).
Detailed description of the embodiments
The invention is described in detail below with reference to specific embodiments.
The flow of the intelligent Chinese word segmentation method based on statistics and deep learning is shown in Fig. 1; the steps are as follows:
Step 1. Data preprocessing. Fig. 2 is a flowchart of the data (text) preprocessing process. The text document to be segmented is preprocessed: the document can be cut at symbols that act as separators, such as the punctuation marks and paragraph separators already present in the text, to obtain shorter sentences or character strings.
Given the conventions and characteristics of Chinese writing, an author usually gathers content that is similar or closely connected logically into one paragraph. The technical terms of a field therefore tend to recur heavily within one or a few paragraphs. Paragraphs in which technical terms occur heavily should be handled by a segmentation method with high accuracy and strong disambiguation ability. Paragraphs in which technical terms are not concentrated (such as background introduction, the author's views, or summarizing text) can be handled by a statistics-based Chinese word segmentation method, which achieves fairly high accuracy while increasing segmentation speed and reducing algorithmic complexity.
Accurately segmenting a professional-domain document involves several steps: text preprocessing, counting the occurrences of technical terms in the document against the discipline term set, selecting the segmentation model, and completing the segmentation.
Text preprocessing is the premise and basis of segmentation. The paragraph is the smallest structural unit that expresses the organization of an article; content varies little within a paragraph, so a single method can be chosen to segment all of it. The original document should therefore first be divided into paragraphs at the paragraph separators, with each paragraph serving as one data-processing unit of document segmentation to which a single segmentation method is applied. The paragraphs are then cut further at separators such as punctuation marks, and the original separators are replaced with spaces. After preprocessing, the text is a collection of shorter sentences or character strings organized by paragraph. During segmentation, these short sentences or strings are processed one by one, which reduces the number of matching operations, improves segmentation efficiency, and lowers segmentation difficulty.
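The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the text names only "punctuation marks, paragraph separators, etc.", so the newline-based paragraph separator and the specific punctuation set below are assumptions, as is the function name `preprocess`.

```python
import re

# Assumed delimiters: paragraphs end at line breaks, clauses at common
# Chinese and ASCII punctuation. The source does not enumerate them.
PARAGRAPH_SEP = re.compile(r"\n+")
PUNCTUATION = re.compile(r"[，。！？；：、,.!?;:]+")


def preprocess(document):
    """Split a document into paragraphs, then each paragraph into short clauses."""
    paragraphs = []
    for raw in PARAGRAPH_SEP.split(document):
        # Replace each run of punctuation with a space, then split on whitespace.
        clauses = PUNCTUATION.sub(" ", raw).split()
        if clauses:  # skip empty paragraphs
            paragraphs.append(clauses)
    return paragraphs
```

Each inner list corresponds to one paragraph, so a later stage can pick one segmentation method per paragraph and then process its clauses one by one. Note that treating `.` as a separator would also split decimal numbers; a production version would need a more careful rule.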
Step 2. Construction of domain term sets.
The sub-discipline to which a document belongs can usually be determined roughly from the document title. The discipline term set already built for that sub-discipline is extracted and traversed, the occurrences of common technical terms in each paragraph are counted, and the total number of technical-term occurrences in the paragraph is finally obtained.
The sub-disciplines of a given field are numbered from 1 to n and a term set TS is established, made up of the sub-discipline term sets TS_i (i = 1, ..., n). The m most common technical terms in each sub-discipline are counted (the value of m may differ between disciplines), and the most common technical terms of each sub-discipline constitute the corresponding discipline term set TS_i.
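One way to build the term sets is sketched below in Python. The input format is an assumption: the text does not say where the term frequencies come from, so the sketch takes, for each sub-discipline, a list of term occurrences gathered from some hypothetical reference corpus. The function name `build_term_sets` and the sample data are likewise illustrative only, and a single m is used for all sub-disciplines for brevity.

```python
from collections import Counter


def build_term_sets(term_occurrences, m):
    """Map each sub-discipline number i (1..n) to TS_i, its m most frequent terms.

    term_occurrences: dict of sub-discipline name -> list of observed term
    occurrences (hypothetical input; one list entry per occurrence).
    """
    term_sets = {}
    for i, occurrences in enumerate(term_occurrences.values(), start=1):
        top = Counter(occurrences).most_common(m)  # [(term, count), ...]
        term_sets[i] = [term for term, _count in top]
    return term_sets
```

For example, with made-up occurrence lists for two sub-disciplines, `build_term_sets({"drilling": ["钻井", "钻头", "钻井", "泥浆", "钻井", "钻头"], "geology": ["储层", "储层", "断层"]}, 2)` returns `{1: ["钻井", "钻头"], 2: ["储层", "断层"]}`.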
Step 3. Selection of the segmentation method. The sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline term set TS_i is extracted. TS_i is traversed to count the technical terms of that discipline contained in the document and their numbers of occurrences, which are represented as a matrix. The total number of technical-term occurrences in a document paragraph is

$$\sum_{j=1}^{m} num_j$$

and the technical-term count threshold is defined as Γ = k · total_num. The segmentation method is selected as follows:

$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$

The total number of technical-term occurrences is the sum of the occurrences of each technical term, where num_j denotes the number of times the j-th technical term occurs in the document. In the threshold Γ = k · total_num, k is a proportionality coefficient and total_num is the total word count of the document. That is, when the total number of technical-term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and to improve segmentation accuracy it should be segmented with the bidirectional LSTM algorithm. When the total is less than the threshold, the paragraph is regarded as general description that uses few technical terms, so it is segmented with a statistics-based method, namely the hidden Markov model. Fig. 3 shows the bidirectional LSTM network structure.
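The selection rule can be sketched as follows in Python. Several details are assumptions rather than statements of the source: occurrences are counted as non-overlapping substring matches, the case where the total equals the threshold (which the text leaves open) falls back to the hidden Markov model, and the function names and the returned labels `"bilstm"` and `"hmm"` are hypothetical.

```python
def count_term_occurrences(paragraph, term_set):
    """Return {term: num_j}, counting non-overlapping substring matches."""
    return {term: paragraph.count(term) for term in term_set}


def select_method(paragraph, term_set, k, total_num):
    """Choose the segmenter for one paragraph.

    k is the proportionality coefficient and total_num the document's total
    word count, both as defined in the text and passed in by the caller.
    """
    counts = count_term_occurrences(paragraph, term_set)
    total = sum(counts.values())   # sum of num_j over the term set
    threshold = k * total_num      # Γ = k · total_num
    # Assumption: a total exactly equal to the threshold is treated as "not greater".
    return "bilstm" if total > threshold else "hmm"
```

Substring counting has a known weakness: a term nested inside a longer term (for example 钻井 inside 钻井液) is counted for both if both are in the term set. The value of k is not given in the text and would have to be tuned.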
Step 4. Segmentation determination. A word-formation information entropy Ψ is defined in terms of the following quantities: p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, respectively, λ is a proportionality coefficient, and ε is the permitted error term. For a segmentation produced by the hidden Markov model method, Ψ is computed to judge how tightly the characters x and y are bound and thereby determine whether they can form a word. The larger the value of Ψ, the more strongly the two characters combine into a word; conversely, the smaller the value, the weaker the combination. Screening by Ψ further improves the accuracy of hidden Markov model segmentation. A bidirectional LSTM network can feed the result of a first prediction back in as a new feature for subsequent predictions, and it has very high accuracy and strong learning ability, so its segmentation results need no second judgment.
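The formula for Ψ is not reproduced in this text, so the following Python sketch is only an illustration of the idea. It assumes a pointwise-mutual-information form, Ψ = λ · log(p(x, y) / (p(x) · p(y))) + ε, which uses the quantities named above but may differ from the formula in the original filing. Probabilities are estimated from raw character and adjacent-pair counts; the function names and default values are hypothetical.

```python
import math
from collections import Counter


def char_statistics(corpus):
    """Count single characters and adjacent character pairs in a corpus string."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    return unigrams, bigrams


def word_formation_score(x, y, unigrams, bigrams, lam=1.0, eps=0.0):
    """Assumed PMI-style score: lam * log(p(x,y) / (p(x) * p(y))) + eps."""
    if bigrams[(x, y)] == 0:
        return float("-inf")  # the pair never co-occurs, so it is never bound
    p_xy = bigrams[(x, y)] / sum(bigrams.values())
    p_x = unigrams[x] / sum(unigrams.values())
    p_y = unigrams[y] / sum(unigrams.values())
    return lam * math.log(p_xy / (p_x * p_y)) + eps
```

A two-character candidate produced by the hidden Markov model could then be kept only when its score exceeds a cut-off. The text gives no such cut-off, so any value used would be a tuning choice. Scores of this kind are also inflated for rare characters, which is one reason a real implementation might prefer a different estimator.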
The above describes only preferred embodiments of the invention and does not limit the invention in any form. Any simple modification, equivalent change, or alteration made to the above embodiments in accordance with the technical essence of the invention falls within the scope of the technical solution of the invention.
Claims (5)
1. An intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that it follows these steps:
Step 1. Data preprocessing;
Step 2. Construction of domain term sets;
Step 3. Selection of the segmentation method;
Step 4. Segmentation determination.
2. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in said Step 1 the text document to be segmented is preprocessed, the document being cut at symbols that act as separators, such as the punctuation marks and paragraph separators already present in the text, to obtain shorter sentences or character strings.
3. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in said Step 2 the sub-disciplines of a given field are numbered from 1 to n and a term set TS is established, made up of the sub-discipline term sets TS_i (i = 1, ..., n); the m most common technical terms in each sub-discipline are counted, and the most common technical terms of each sub-discipline constitute the corresponding discipline term set TS_i.
4. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in said Step 3 the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, the corresponding discipline term set TS_i is extracted, and TS_i is traversed to count the technical terms of that discipline contained in the document and their numbers of occurrences; the total number of technical-term occurrences in a document paragraph is

$$\sum_{j=1}^{m} num_j$$

the technical-term count threshold is defined as Γ = k · total_num, and the segmentation method is selected as follows:

$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$

the total number of technical-term occurrences is the sum of the occurrences of each technical term, where num_j denotes the number of times the j-th technical term occurs in the document; in the threshold Γ = k · total_num, k is a proportionality coefficient and total_num is the total word count of the document; when the total number of technical-term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and to improve segmentation accuracy it should be segmented with the bidirectional LSTM algorithm; when the total is less than the threshold, the paragraph is regarded as general description that uses few technical terms, so it is segmented with a statistics-based method, namely the hidden Markov model.
5. The intelligent Chinese word segmentation method based on statistics and deep learning, characterized in that: in said Step 4 a word-formation information entropy Ψ is defined in terms of the following quantities: p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, respectively, λ is a proportionality coefficient, and ε is the permitted error term; for a segmentation produced by the hidden Markov model method, Ψ is computed to judge how tightly the characters x and y are bound and thereby determine whether they can form a word; the larger the value of Ψ, the more strongly the two characters combine into a word, and conversely, the smaller the value, the weaker the combination; screening by Ψ further improves the accuracy of hidden Markov model segmentation; a bidirectional LSTM network can feed the result of a first prediction back in as a new feature for subsequent predictions and has very high accuracy and strong learning ability, so its segmentation results need no second judgment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910655795.2A CN110414002B (en) | 2019-07-19 | 2019-07-19 | Intelligent Chinese word segmentation method based on statistics and deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110414002A (en) | 2019-11-05 |
CN110414002B (en) | 2023-06-09 |
Family
ID=68360365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910655795.2A Active CN110414002B (en) | 2019-07-19 | 2019-07-19 | Intelligent Chinese word segmentation method based on statistics and deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110414002B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014164739A (en) * | 2013-02-28 | 2014-09-08 | National Institute Of Information & Communication Technology | Parallel translation dictionary generation device, method, and computer program for the same |
CN107818130A (en) * | 2017-09-15 | 2018-03-20 | 深圳市电陶思创科技有限公司 | The method for building up and system of a kind of search engine |
CN108427670A (en) * | 2018-04-08 | 2018-08-21 | 重庆邮电大学 | A kind of sentiment analysis method based on context word vector sum deep learning |
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
Non-Patent Citations (1)
Title |
---|
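宫法明, 朱朋海: "基于自适应隐马尔可夫模型的石油领域文档分词" (Word segmentation of petroleum-domain documents based on an adaptive hidden Markov model), 《计算机科学》 (Computer Science) * |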
Also Published As
Publication number | Publication date |
---|---|
CN110414002B (en) | 2023-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106919673B (en) | Text mood analysis system based on deep learning | |
CN110298033B (en) | Keyword corpus labeling training extraction system | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN101599071B (en) | Automatic extraction method of conversation text topic | |
CN109408642A (en) | A kind of domain entities relation on attributes abstracting method based on distance supervision | |
CN109635297B (en) | Entity disambiguation method and device, computer device and computer storage medium | |
CN111178074A (en) | Deep learning-based Chinese named entity recognition method | |
CN106294322A (en) | A kind of Chinese based on LSTM zero reference resolution method | |
CN105138514B (en) | It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method | |
CN109299480A (en) | Terminology Translation method and device based on context of co-text | |
CN109960799A (en) | A kind of Optimum Classification method towards short text | |
CN108845982A (en) | A kind of Chinese word cutting method of word-based linked character | |
CN103970730A (en) | Method for extracting multiple subject terms from single Chinese text | |
CN102063424A (en) | Method for Chinese word segmentation | |
CN109657058A (en) | A kind of abstracting method of notice information | |
CN105975475A (en) | Chinese phrase string-based fine-grained thematic information extraction method | |
CN112101027A (en) | Chinese named entity recognition method based on reading understanding | |
CN103324626A (en) | Method for setting multi-granularity dictionary and segmenting words and device thereof | |
CN110134934A (en) | Text emotion analysis method and device | |
CN113032541A (en) | Answer extraction method based on bert and fusion sentence cluster retrieval | |
CN109190099A (en) | Sentence mould extracting method and device | |
CN110502759A (en) | The Chinese for incorporating classified dictionary gets over the outer word treatment method of hybrid network nerve machine translation set | |
CN101520775B (en) | Chinese syntax parsing method with merged semantic information | |
CN112632969B (en) | Incremental industry dictionary updating method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||