CN110414002B - Intelligent Chinese word segmentation method based on statistics and deep learning - Google Patents
Intelligent Chinese word segmentation method based on statistics and deep learning
- Publication number
- CN110414002B
- Authority
- CN
- China
- Prior art keywords
- word segmentation
- document
- word
- term
- paragraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses an intelligent Chinese word segmentation method based on statistics and deep learning, comprising: data preprocessing; constructing a domain term set; selecting a word segmentation method; and word segmentation judgment. By combining a statistics-based word segmentation method with deep learning, the model has a wide application range, segments technical terms in specialized fields accurately, and keeps the algorithm simple and the segmentation fast.
Description
Technical Field
The invention belongs to the technical field of word segmentation and relates to a technique for improving the accuracy with which technical terms are segmented in documents from specialized fields.
Background
Chinese word segmentation is the process of splitting a sequence of Chinese characters into individual words, and it is the basis of natural language processing. Chinese information processing is a branch of natural language processing and comprises three layers: lexical analysis, syntactic analysis, and semantic analysis; Chinese word segmentation is the first step of lexical analysis. Its applications are very broad, ranging from low-level tasks such as part-of-speech (POS) tagging and named entity recognition (NER) to larger applications such as automatic classification, automatic proofreading, search engines, speech synthesis, and machine translation. Statistics-based Chinese word segmentation has low accuracy and in particular struggles to segment technical terms in specialized fields correctly; using only a deep-learning-based method, on the other hand, leads to high algorithmic complexity and low segmentation speed.
Disclosure of Invention
The invention aims to provide an intelligent Chinese word segmentation method based on statistics and deep learning that solves the problems of high complexity and low segmentation speed that arise when only a bidirectional LSTM algorithm is used for Chinese word segmentation. By combining a statistics-based word segmentation method with deep learning, the model has a wide application range, segments technical terms in specialized fields accurately, and keeps the algorithm simple and the segmentation fast.
The technical scheme adopted by the invention proceeds in the following steps:
Step1, data preprocessing;
Step2, constructing a domain term set;
Step3, selecting a word segmentation method;
Step4, word segmentation judgment.
Further, in Step1 the text document to be segmented is preprocessed: the document is split at symbols that act as separators, such as the original punctuation marks and paragraph separators, yielding shorter sentences or character strings.
Further, in Step2 the sub-disciplines of a given domain are numbered from 1 to n and a term set $TS = \{TS_1, TS_2, \dots, TS_n\}$ is created; the m most commonly used technical terms of each sub-discipline are counted, and the most commonly used terms of sub-discipline i form the corresponding discipline term set $TS_i$.
Further, in Step3 the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline term set $TS_i$ is extracted. The term set $TS_i$ is traversed to count which technical terms of the discipline the document contains and how often each occurs. The total number of occurrences of technical terms in a document paragraph is $\sum_{j=1}^{m} num_j$. Defining the term quantity threshold as $\Gamma = k \cdot \mathrm{total\_num}$, the word segmentation method is chosen as follows:
$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$
The total number of occurrences of technical terms equals the cumulative sum of the occurrences of each term, where $num_j$ is the number of times the j-th technical term appears in the document. In the threshold $\Gamma = k \cdot \mathrm{total\_num}$, k is a proportionality coefficient and total_num is the total number of words in the document. When the total number of occurrences of technical terms in a paragraph of the document to be segmented is larger than the threshold, the paragraph uses a large number of technical terms of the discipline, and a bidirectional LSTM algorithm is used for segmentation to improve accuracy. When the total is smaller than the threshold, the paragraph can be regarded as general description that uses few technical terms, so its segmentation is completed with a statistics-based method, namely a hidden Markov model.
Further, Step4 defines a word-formation information entropy $\psi$ in terms of the following quantities: p(x, y) is the probability that Chinese character x and Chinese character y co-occur, p(x) and p(y) are the probabilities that x and y occur, λ is a proportionality coefficient, and ε is an allowable error term. For segmentation completed by the hidden Markov model, the tightness of the bond between characters x and y is judged by computing $\psi$, which determines whether they can form a word. The larger the value of the word-formation information entropy, the more strongly the two characters combine into a word; the smaller the value, the weaker the combination. Screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model. The bidirectional LSTM neural network feeds the result of an earlier prediction in as a new feature for subsequent predictions; it has high accuracy and strong learning ability, so its segmentation results need no further judgment.
Drawings
FIG. 1 is a general flow chart of accurate word segmentation of a domain document;
FIG. 2 is a flow chart of the data (text) preprocessing process;
FIG. 3 is a network structure diagram of the Bi-LSTM (bidirectional LSTM).
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
The flow of the intelligent Chinese word segmentation method based on statistics and deep learning is shown in FIG. 1, and the steps are as follows:
Step1. Data preprocessing. FIG. 2 is a flow chart of the data (text) preprocessing process. The text document to be segmented is preprocessed: the document is split at symbols that act as separators, such as the original punctuation marks and paragraph separators, yielding shorter sentences or character strings.
Given the conventions of Chinese writing, authors usually place similar or logically closely related content in one natural paragraph. The technical terms of a field therefore tend to recur heavily within one or a few paragraphs. Paragraphs in which technical terms appear in large numbers should be processed with a segmentation method that has high accuracy and strong disambiguation ability. Paragraphs in which technical terms are not concentrated (such as background introductions, author opinions, and summaries) can be processed with a statistics-based Chinese word segmentation method, which still achieves good accuracy while raising segmentation speed and lowering algorithmic complexity.
Accurate word segmentation of a specialized-field document comprises text preprocessing, counting the occurrences of technical terms in the document with the help of a discipline term set, selecting a word segmentation model, and completing the segmentation.
Text preprocessing is the premise and basis of word segmentation. The paragraph is the smallest structural unit that expresses the hierarchy of an article; content varies little within one natural paragraph, so a single method can be used to segment it. The original document is therefore first divided into paragraphs at the paragraph separators, and each paragraph serves as one data processing unit that is segmented with a single method. Each paragraph is then split further at separators such as punctuation marks, and the original separators are replaced by spaces. After preprocessing, each paragraph is a sequence of shorter sentences or character strings. During segmentation these short sentences or strings are processed one by one, which reduces the number of matching operations, improves segmentation efficiency, and lowers segmentation difficulty.
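The two-level split described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the use of a newline as the paragraph separator, and the particular punctuation list are all assumptions, since the text only speaks of "paragraph separators" and "punctuation marks and the like".

```python
import re

# Assumed separator set: common Chinese and ASCII punctuation.
# The description does not enumerate the separators, so this list is illustrative.
PUNCTUATION = "，。！？；：、（）《》“”‘’,.!?;:()\"'"
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]+")


def preprocess(document: str) -> list[list[str]]:
    """Split a document into paragraphs, then each paragraph into short strings."""
    paragraphs = []
    # Assumption: paragraphs are separated by one or more newlines.
    for raw in re.split(r"\n+", document):
        # Replace each run of separators with a single space, as described.
        spaced = _PUNCT_RE.sub(" ", raw)
        pieces = spaced.split()
        if pieces:  # skip paragraphs that contained only separators
            paragraphs.append(pieces)
    return paragraphs
```

Each inner list is one paragraph, so a later step can pick a single segmentation method per paragraph and then process its short strings one by one.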
Step2. Domain term set construction.
The sub-discipline to which the document belongs can be roughly judged from the document title. The discipline term set already constructed for that sub-discipline is then extracted and traversed, the number of occurrences of each commonly used technical term in each paragraph is counted, and finally the total number of occurrences of technical terms in the paragraph is obtained.
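The per-paragraph counting can be sketched as below. The function names are hypothetical, and plain substring counting is an assumption: the text does not say how a term occurrence is matched, and with this approach a short term nested inside a longer term is counted for both.

```python
def count_term_occurrences(paragraph: str, term_set: set[str]) -> dict[str, int]:
    """Count how often each term of the discipline term set appears in a paragraph."""
    counts = {}
    for term in term_set:
        # str.count returns the number of non-overlapping substring matches.
        n = paragraph.count(term)
        if n:
            counts[term] = n
    return counts


def total_occurrences(counts: dict[str, int]) -> int:
    """Cumulative sum of the per-term counts (the sum of num_j over all terms)."""
    return sum(counts.values())
```

Terms that never occur are left out of the result, so the dictionary lists exactly the technical terms the paragraph contains together with their counts.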
The sub-disciplines of a given domain are numbered from 1 to n and a term set $TS = \{TS_1, TS_2, \dots, TS_n\}$ is established. The m most commonly used technical terms of each sub-discipline are counted (m may differ from discipline to discipline), and the most commonly used terms of sub-discipline i form the corresponding discipline term set $TS_i$.
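One possible in-memory layout for the term set is a mapping from the sub-discipline number i to its set $TS_i$. The sketch below assumes that "most commonly used" is decided by frequency in a list of observed term usages per sub-discipline; the text does not say where these frequencies come from, so `term_usage`, `m`, and `build_term_sets` are illustrative names only.

```python
from collections import Counter


def build_term_sets(
    term_usage: dict[int, list[str]], m: dict[int, int]
) -> dict[int, set[str]]:
    """Return TS as a mapping i -> TS_i with the m[i] most frequent terms of sub-discipline i."""
    ts = {}
    for i, observed in term_usage.items():
        # most_common(m[i]) keeps the m[i] terms with the highest counts;
        # m is allowed to differ between sub-disciplines.
        most_common = Counter(observed).most_common(m[i])
        ts[i] = {term for term, _ in most_common}
    return ts
```

Storing each $TS_i$ as a set keeps the later traversal and membership checks simple.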
Step3. Selecting a word segmentation method. The sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline term set $TS_i$ is extracted. The term set $TS_i$ is traversed to count which technical terms of the discipline the document contains and how often each occurs, and the result is recorded as a matrix. The total number of occurrences of technical terms in a document paragraph is $\sum_{j=1}^{m} num_j$. The term quantity threshold is defined as $\Gamma = k \cdot \mathrm{total\_num}$. The word segmentation method is selected as follows:
$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$
The total number of occurrences of technical terms equals the cumulative sum of the occurrences of each term, where $num_j$ is the number of times the j-th technical term appears in the document. In the threshold $\Gamma = k \cdot \mathrm{total\_num}$, k is a proportionality coefficient and total_num is the total number of words in the document. That is, when the total number of occurrences of technical terms in a paragraph of the document to be segmented is larger than the threshold, the paragraph uses a large number of technical terms of the discipline, and a bidirectional LSTM algorithm is used for segmentation to improve accuracy. When the total is smaller than the threshold, the paragraph can be regarded as general description that uses few technical terms, so a statistics-based method, namely a hidden Markov model, is used to segment it. FIG. 3 is a structure diagram of the bidirectional LSTM network.
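The selection rule can be sketched as a small function. The name `choose_segmenter` and the string labels are hypothetical. The text only covers totals strictly above or strictly below the threshold; sending the equality case to the hidden Markov model is an assumption made here so that the function is total.

```python
def choose_segmenter(term_counts: list[int], k: float, total_num: int) -> str:
    """Pick a segmentation method for one paragraph from its term counts.

    term_counts holds num_j for each technical term, k is the proportionality
    coefficient, and total_num is the total word count used in the threshold.
    """
    total = sum(term_counts)   # cumulative sum of num_j
    gamma = k * total_num      # threshold: Gamma = k * total_num
    if total > gamma:
        return "bilstm"        # term-dense paragraph: bidirectional LSTM
    # Assumption: a total equal to the threshold is treated like the "smaller" case.
    return "hmm"               # general description: hidden Markov model
```

For example, with k = 0.05 and total_num = 100 the threshold is 5, so a paragraph with term counts 3, 2 and 1 is routed to the bidirectional LSTM, while one with counts 1 and 1 is routed to the hidden Markov model.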
Step4. Word segmentation judgment. A word-formation information entropy $\psi$ is defined in terms of the following quantities: p(x, y) is the probability that Chinese character x and Chinese character y co-occur, p(x) and p(y) are the probabilities that x and y occur, λ is a proportionality coefficient, and ε is an allowable error term. For segmentation completed by the hidden Markov model, the tightness of the bond between characters x and y is judged by computing $\psi$, which determines whether they can form a word. The larger the value of the word-formation information entropy, the more strongly the two characters combine into a word; the smaller the value, the weaker the combination. Screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model. The bidirectional LSTM neural network feeds the result of an earlier prediction in as a new feature for subsequent predictions; it has high accuracy and strong learning ability, so its segmentation results need no further judgment.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the invention in any way, and any simple modification, equivalent variation and modification made to the above embodiments according to the technical substance of the present invention falls within the scope of the technical solution of the present invention.
Claims (2)
1. The intelligent Chinese word segmentation method based on statistics and deep learning is characterized by comprising the following steps of:
step1, preprocessing data;
step2, constructing a domain term set;
step3, selecting a word segmentation method;
step4, word segmentation judgment;
in Step1, the text document to be segmented is preprocessed, and the document is split at the original symbols that act as separators, so as to obtain shorter sentences or character strings;
in Step2, the sub-disciplines of a given domain are numbered from 1 to n and a term set $TS = \{TS_1, TS_2, \dots, TS_n\}$ is created; the m most commonly used technical terms of each sub-discipline are counted, and the most commonly used terms of each sub-discipline form the corresponding discipline term set $TS_i$;
in Step3, the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, the corresponding discipline term set $TS_i$ is extracted, and the term set $TS_i$ is traversed to count which technical terms of the discipline the document contains and how often each occurs; the total number of occurrences of technical terms in a document paragraph is $\sum_{j=1}^{m} num_j$; defining the term quantity threshold as $\Gamma = k \cdot \mathrm{total\_num}$, the word segmentation method is chosen as follows:
$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$
the total number of occurrences of technical terms equals the cumulative sum of the occurrences of each term, where $num_j$ is the number of times the j-th technical term appears in the document; in the threshold $\Gamma = k \cdot \mathrm{total\_num}$, k is a proportionality coefficient and total_num is the total number of words in the document; when the total number of occurrences of technical terms in a paragraph of the document to be segmented is larger than the threshold, the paragraph uses a large number of technical terms of the discipline, and a bidirectional LSTM algorithm is used for segmentation to improve accuracy; when the total is smaller than the threshold, the paragraph can be regarded as general description that uses few technical terms, so its segmentation is completed with a statistics-based method, namely a hidden Markov model.
2. The intelligent Chinese word segmentation method based on statistics and deep learning as recited in claim 1, wherein: Step4 defines a word-formation information entropy $\psi$ in terms of the following quantities: p(x, y) is the probability that Chinese character x and Chinese character y co-occur, p(x) and p(y) are the probabilities that x and y occur, λ is a proportionality coefficient, and ε is an allowable error term; for segmentation completed by the hidden Markov model, the tightness of the bond between characters x and y is judged by computing $\psi$, which determines whether they can form a word; the larger the value of the word-formation information entropy, the more strongly the two characters combine into a word, and the smaller the value, the weaker the combination; screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model; the bidirectional LSTM neural network feeds the result of an earlier prediction in as a new feature for subsequent predictions, and has high accuracy and strong learning ability, so its segmentation results need no further judgment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910655795.2A CN110414002B (en) | 2019-07-19 | 2019-07-19 | Intelligent Chinese word segmentation method based on statistics and deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110414002A (en) | 2019-11-05 |
CN110414002B (en) | 2023-06-09 |
Family
ID=68360365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910655795.2A Active CN110414002B (en) | 2019-07-19 | 2019-07-19 | Intelligent Chinese word segmentation method based on statistics and deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110414002B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014164739A (en) * | 2013-02-28 | 2014-09-08 | National Institute Of Information & Communication Technology | Parallel translation dictionary generation device, method, and computer program for the same |
CN107818130A (en) * | 2017-09-15 | 2018-03-20 | 深圳市电陶思创科技有限公司 | The method for building up and system of a kind of search engine |
CN108427670A (en) * | 2018-04-08 | 2018-08-21 | 重庆邮电大学 | A kind of sentiment analysis method based on context word vector sum deep learning |
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
Non-Patent Citations (1)
Title |
---|
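Word segmentation of petroleum-domain documents based on an adaptive hidden Markov model; Gong Faming, Zhu Penghai; Computer Science (《计算机科学》); 2018-06-30; Chapter 3 *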
Also Published As
Publication number | Publication date |
---|---|
CN110414002A (en) | 2019-11-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021114745A1 (en) | Named entity recognition method employing affix perception for use in social media | |
CN111310471B (en) | Travel named entity identification method based on BBLC model | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
WO2022141878A1 (en) | End-to-end language model pretraining method and system, and device and storage medium | |
CN112101028B (en) | Multi-feature bidirectional gating field expert entity extraction method and system | |
CN106294320B (en) | A kind of terminology extraction method and system towards academic paper | |
CN116187163B (en) | Construction method and system of pre-training model for patent document processing | |
CN114254653A (en) | Scientific and technological project text semantic extraction and representation analysis method | |
CN110502744B (en) | Text emotion recognition method and device for historical park evaluation | |
CN111324742A (en) | Construction method of digital human knowledge map | |
CN111897917B (en) | Rail transit industry term extraction method based on multi-modal natural language features | |
CN113033183B (en) | Network new word discovery method and system based on statistics and similarity | |
CN111061882A (en) | Knowledge graph construction method | |
CN113312922B (en) | Improved chapter-level triple information extraction method | |
CN110457690A (en) | A kind of judgment method of patent creativeness | |
CN107895000A (en) | A kind of cross-cutting semantic information retrieval method based on convolutional neural networks | |
CN112632969B (en) | Incremental industry dictionary updating method and system | |
CN112926345A (en) | Multi-feature fusion neural machine translation error detection method based on data enhancement training | |
CN111008530A (en) | Complex semantic recognition method based on document word segmentation | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN114997288A (en) | Design resource association method | |
CN114265936A (en) | Method for realizing text mining of science and technology project | |
CN112257442A (en) | Policy document information extraction method based on corpus expansion neural network | |
CN110502759B (en) | Method for processing Chinese-Yue hybrid network neural machine translation out-of-set words fused into classification dictionary | |
CN114564912B (en) | Intelligent document format checking and correcting method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |