CN110414002B - Intelligent Chinese word segmentation method based on statistics and deep learning - Google Patents

Intelligent Chinese word segmentation method based on statistics and deep learning

Info

Publication number
CN110414002B
Authority
CN
China
Prior art keywords
word segmentation
document
word
term
paragraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910655795.2A
Other languages
Chinese (zh)
Other versions
CN110414002A (en)
Inventor
徐建国
刘梦凡
刘泳慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology
Priority to CN201910655795.2A
Publication of CN110414002A
Application granted
Publication of CN110414002B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an intelligent Chinese word segmentation method based on statistics and deep learning, which comprises four steps: data preprocessing, domain term set construction, word segmentation method selection, and word segmentation judgment. Its advantage is that it uses a segmentation model combining a statistics-based word segmentation method with deep learning: the range of application is wide, technical terms in a specialized field can be segmented accurately, the algorithm is simple, and segmentation is fast.

Description

Intelligent Chinese word segmentation method based on statistics and deep learning
Technical Field
The invention belongs to the technical field of word segmentation and relates to a technique for improving the accuracy with which technical terms are segmented in documents from a specialized field.
Background
Chinese word segmentation is the process of splitting a sequence of Chinese characters into individual words, and it is the basis of natural language processing for Chinese. Chinese information processing is a branch of natural language processing and comprises three layers: lexical analysis, syntactic analysis, and semantic analysis; Chinese word segmentation is the first step of lexical analysis. Its applications are very broad, ranging from small tasks such as part-of-speech (POS) tagging and named entity recognition (NER) to large ones such as automatic classification, automatic proofreading, search engines, speech synthesis, and machine translation. Statistics-based Chinese word segmentation methods have low accuracy and, in particular, struggle to segment the technical terms of a specialized field correctly; using only a deep-learning-based method, on the other hand, brings high algorithmic complexity and low segmentation speed.
Disclosure of Invention
The invention aims to provide an intelligent Chinese word segmentation method based on statistics and deep learning that addresses the high complexity and low segmentation speed of performing Chinese word segmentation with a bidirectional LSTM algorithm alone. Its advantage is that it uses a segmentation model combining a statistics-based word segmentation method with deep learning: the range of application is wide, technical terms in a specialized field can be segmented accurately, the algorithm is simple, and segmentation is fast.
The technical solution adopted by the invention proceeds in the following steps:
Step 1, data preprocessing;
Step 2, constructing a domain term set;
Step 3, selecting a word segmentation method;
Step 4, word segmentation judgment.
Further, in Step 1 the text document to be segmented is preprocessed: the document is split at symbols that already act as separators, such as the original punctuation marks and paragraph separators, yielding shorter sentences or character strings.
Further, in Step 2 the sub-disciplines of a given domain are numbered from 1 to n and a term set TS is established:
[Formula image not reproduced: the term set TS, made up of the discipline term sets $TS_i$, i = 1, …, n]
The m most commonly used technical terms of each sub-discipline are then collected, and the terms of sub-discipline i form the corresponding discipline term set $TS_i$.
Further, in Step 3 the sub-discipline to which the document belongs is determined from the topic of the document to be segmented, and the corresponding discipline term set $TS_i$ is extracted. $TS_i$ is traversed to count which technical terms of that discipline the document contains and how often each occurs. The total number of occurrences of technical terms in a document paragraph is
$$\sum_{j=1}^{m} num_j$$
With the term quantity threshold defined as $\Gamma = k \cdot total\_num$, the word segmentation method is chosen as follows:
$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$
Here $num_j$ is the number of times the $j$-th technical term appears in the document, so the total is the sum of the occurrence counts of the individual terms; $k$ is a proportionality coefficient and $total\_num$ is the total number of words in the document. When the total number of technical term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and a bidirectional LSTM algorithm is used for segmentation in order to improve accuracy. When the total is less than the threshold, the paragraph can be regarded as general description that uses few technical terms, so it is segmented with a statistics-based word segmentation method, namely a hidden Markov model.
Further, a word formation information entropy $\psi$ is defined in Step 4:
[Formula image not reproduced: definition of $\psi$ in terms of $p(x, y)$, $p(x)$, $p(y)$, $\lambda$ and $\varepsilon$]
Here $p(x, y)$ is the probability that Chinese character x and Chinese character y co-occur, $p(x)$ and $p(y)$ are the probabilities of occurrence of x and y respectively, $\lambda$ is a proportionality coefficient, and $\varepsilon$ is an allowable error term. For segmentations produced by the hidden Markov model method, the closeness of characters x and y is judged by computing the word formation information entropy $\psi$, which determines whether they can form a word. The larger the value of $\psi$, the more tightly the two characters bind into a word; conversely, the smaller the value, the looser the binding. Screening by word formation information entropy further improves the segmentation accuracy of the hidden Markov model. The bidirectional LSTM neural network feeds the result of an earlier prediction back in as a new feature for subsequent predictions; it has high accuracy and strong learning ability, so its segmentation result does not need to be judged again.
Drawings
FIG. 1 is the overall flow chart for accurate word segmentation of a domain document;
FIG. 2 is a flow chart of the data (text) preprocessing process;
FIG. 3 is the network structure diagram of the Bi-LSTM (bidirectional LSTM).
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
The flow of the intelligent Chinese word segmentation method based on statistics and deep learning is shown in FIG. 1, and the steps are as follows:
Step 1. Data preprocessing. FIG. 2 is a flow chart of the data (text) preprocessing process. The text document to be segmented is preprocessed: it is split at symbols that already act as separators, such as the original punctuation marks and paragraph separators, yielding shorter sentences or character strings.
Given the conventions of Chinese writing, authors usually place content that is similar or closely related logically in the same paragraph. The technical terms of a field therefore tend to recur heavily within one or a few paragraphs. Paragraphs in which technical terms appear in large numbers should be processed with a word segmentation method that has high accuracy and strong disambiguation ability; paragraphs in which technical terms are sparse (such as background introductions, the author's views, and summaries) can be handled with a statistics-based Chinese word segmentation method, which still achieves high accuracy while increasing segmentation speed and reducing algorithmic complexity.
Accurate word segmentation of a specialized-field document comprises text preprocessing, counting the occurrences of technical terms in the document with the help of the discipline term set, selecting a word segmentation model, and completing the segmentation.
Text preprocessing is the prerequisite and basis for word segmentation. The paragraph is the smallest structural unit that expresses the hierarchy of an article; content within one paragraph varies little, so a single method can be chosen to segment it. The original document is therefore first divided into paragraphs at the paragraph separators, and each paragraph serves as one data processing unit that is segmented with a single method. Each paragraph is then split further at separators such as punctuation marks, and the original separators are replaced with spaces. After preprocessing, each paragraph is a combination of shorter sentences or character strings. During word segmentation these short sentences or strings are processed one by one, which reduces the number of matching operations, improves segmentation efficiency, and lowers segmentation difficulty.
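The preprocessing described for Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the separator set `SEPARATORS`, the use of newlines as paragraph separators, and the function name `preprocess` are hypothetical choices, not details given in the patent.

```python
import re

# Clause-level separators (Chinese and ASCII punctuation). This particular
# set is an illustrative assumption; the patent does not list the symbols.
SEPARATORS = r'[，。！？；：、,.!?;:()（）"“”]'


def preprocess(document: str) -> list[list[str]]:
    """Split a document into paragraphs, then each paragraph into short strings.

    Assumes paragraphs are separated by one or more newline characters.
    """
    paragraphs = [p for p in re.split(r"\n+", document) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Replace every separator with a space, then split on whitespace,
        # which leaves the shorter sentences or character strings.
        spaced = re.sub(SEPARATORS, " ", paragraph)
        result.append(spaced.split())
    return result
```

Each inner list corresponds to one paragraph, so a later step can pick a single segmentation method per paragraph and then process its short strings one by one.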
Step 2. Domain term set construction.
The sub-discipline of the field to which the document belongs can be roughly determined from the document title. The previously constructed discipline term set of that sub-discipline is then extracted and traversed, the number of occurrences of each commonly used technical term in each paragraph is counted, and the total number of technical term occurrences in the paragraph is obtained.
The sub-disciplines of a given domain are numbered from 1 to n, and a term set TS is established:
[Formula image not reproduced: the term set TS, made up of the discipline term sets $TS_i$, i = 1, …, n]
The m most commonly used technical terms of each sub-discipline are collected (m may differ from discipline to discipline), and the terms of sub-discipline i form the corresponding discipline term set $TS_i$.
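As a rough illustration of the term set construction, the sketch below builds one term set per sub-discipline from observed term usage. The input format (a list of term occurrences per sub-discipline, taken from some reference corpus), the function name `build_term_sets`, and the use of a single `m` for all disciplines are assumptions made for brevity; the patent allows m to differ between disciplines and does not say how term frequencies are gathered.

```python
from collections import Counter


def build_term_sets(term_usage: dict[str, list[str]], m: int) -> dict[str, set[str]]:
    """Return TS as a mapping: sub-discipline -> set of its m most frequent terms.

    term_usage maps each sub-discipline to a list of term occurrences
    observed in a reference corpus (a hypothetical input format).
    """
    term_sets = {}
    for discipline, observed_terms in term_usage.items():
        # Keep only the m most frequently observed terms for this discipline.
        most_common = Counter(observed_terms).most_common(m)
        term_sets[discipline] = {term for term, _ in most_common}
    return term_sets
```

Keying the mapping by sub-discipline makes it straightforward to look up the relevant set once the document's sub-discipline has been determined.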
Step 3. Selecting a word segmentation method. The sub-discipline to which the document belongs is determined from the topic of the document to be segmented, and the corresponding discipline term set $TS_i$ is extracted. $TS_i$ is traversed to count which technical terms of that discipline the document contains and how often each occurs, and the result is represented by a matrix:
[Formula image not reproduced: matrix of the technical terms and their occurrence counts]
The total number of occurrences of technical terms in a document paragraph is
$$\sum_{j=1}^{m} num_j$$
The term quantity threshold is defined as $\Gamma = k \cdot total\_num$. The word segmentation method is selected as follows:
$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$
Here $num_j$ is the number of times the $j$-th technical term appears in the document, so the total is the sum of the occurrence counts of the individual terms; $k$ is a proportionality coefficient and $total\_num$ is the total number of words in the document. That is, when the total number of technical term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and a bidirectional LSTM algorithm is used for segmentation in order to improve accuracy. When the total is less than the threshold, the paragraph can be regarded as general description that uses few technical terms, so it is segmented with a statistics-based word segmentation method, namely a hidden Markov model. FIG. 3 shows the network structure of the bidirectional LSTM.
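The selection rule of Step 3 can be sketched as follows. Several details are assumptions rather than statements from the patent: the paragraph's character count stands in for total_num (the word count is not known before segmentation), terms are counted with plain substring matching, the equality case falls to the hidden Markov model, and the names `count_term_occurrences` and `choose_segmenter` are hypothetical.

```python
def count_term_occurrences(paragraph: str, term_set: set[str]) -> dict[str, int]:
    """Count non-overlapping occurrences of each term (num_j) in the paragraph."""
    return {term: paragraph.count(term) for term in term_set}


def choose_segmenter(paragraph: str, term_set: set[str], k: float) -> str:
    """Return "bilstm" or "hmm" according to the threshold rule.

    Assumption: len(paragraph) (a character count) approximates total_num.
    Assumption: a total equal to the threshold is routed to the HMM.
    """
    counts = count_term_occurrences(paragraph, term_set)
    total_terms = sum(counts.values())  # sum over j of num_j
    threshold = k * len(paragraph)      # Gamma = k * total_num
    return "bilstm" if total_terms > threshold else "hmm"
```

The function only returns a label; dispatching to an actual bidirectional LSTM or hidden Markov model segmenter is left to the caller. A suitable value for `k` is not given in the text and would have to be tuned.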
Step 4. Word segmentation judgment. A word formation information entropy $\psi$ is defined:
[Formula image not reproduced: definition of $\psi$ in terms of $p(x, y)$, $p(x)$, $p(y)$, $\lambda$ and $\varepsilon$]
Here $p(x, y)$ is the probability that Chinese character x and Chinese character y co-occur, $p(x)$ and $p(y)$ are the probabilities of occurrence of x and y respectively, $\lambda$ is a proportionality coefficient, and $\varepsilon$ is an allowable error term. For segmentations produced by the hidden Markov model method, the closeness of characters x and y is judged by computing the word formation information entropy $\psi$, which determines whether they can form a word. The larger the value of $\psi$, the more tightly the two characters bind into a word; conversely, the smaller the value, the looser the binding. Screening by word formation information entropy further improves the segmentation accuracy of the hidden Markov model. The bidirectional LSTM neural network feeds the result of an earlier prediction back in as a new feature for subsequent predictions; it has high accuracy and strong learning ability, so its segmentation result does not need to be judged again.
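To make the idea of scoring character cohesion concrete, the sketch below computes a pointwise-mutual-information-style score from raw text. The exact formula for ψ is contained in a formula image that is not reproduced in this text, so the form used here, λ · log(p(x, y) / (p(x) · p(y))) + ε, is an assumption chosen only because it uses the same quantities; it should not be read as the patent's definition. The function name `word_formation_score`, the adjacent-pair estimate of p(x, y), and the default parameter values are likewise hypothetical.

```python
import math
from collections import Counter


def word_formation_score(text: str, x: str, y: str,
                         lam: float = 1.0, eps: float = 0.0) -> float:
    """PMI-style cohesion score for the adjacent character pair (x, y).

    Assumed form: lam * log(p(x, y) / (p(x) * p(y))) + eps.
    Probabilities are relative frequencies estimated from `text`.
    """
    if len(text) < 2:
        return float("-inf")
    chars = Counter(text)
    pairs = Counter(zip(text, text[1:]))  # adjacent character pairs
    p_x = chars[x] / len(text)
    p_y = chars[y] / len(text)
    p_xy = pairs[(x, y)] / (len(text) - 1)
    if p_xy == 0:
        # The pair never occurs adjacently, so it gets the lowest score.
        return float("-inf")
    return lam * math.log(p_xy / (p_x * p_y)) + eps
```

Under this assumed form, a higher score means the two characters occur together more often than their individual frequencies would suggest, which matches the qualitative behaviour described for ψ. A cut-off for accepting a pair as a word is not specified in the text.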
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the invention in any way, and any simple modification, equivalent variation and modification made to the above embodiments according to the technical substance of the present invention falls within the scope of the technical solution of the present invention.

Claims (2)

1. The intelligent Chinese word segmentation method based on statistics and deep learning is characterized by comprising the following steps of:
Step 1, data preprocessing;
Step 2, constructing a domain term set;
Step 3, selecting a word segmentation method;
Step 4, word segmentation judgment;
in Step 1, the text document to be segmented is preprocessed and the document is split at the original symbols that act as separators, so as to obtain shorter sentences or character strings;
in Step 2, the sub-disciplines of a given domain are numbered from 1 to n and a term set TS is established,
[Formula image not reproduced: the term set TS, made up of the discipline term sets $TS_i$, i = 1, …, n]
the m most commonly used technical terms of each sub-discipline are collected, and the most commonly used terms of each discipline form the corresponding discipline term set $TS_i$;
in Step 3, the sub-discipline to which the document belongs is determined from the topic of the document to be segmented, the corresponding discipline term set $TS_i$ is extracted, and $TS_i$ is traversed to count which technical terms of that discipline the document contains and how often each occurs, the total number of occurrences of technical terms in a document paragraph being
$$\sum_{j=1}^{m} num_j$$
with the term quantity threshold defined as $\Gamma = k \cdot total\_num$, the word segmentation method is chosen as follows:
$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} num_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} num_j < \Gamma \end{cases}$$
wherein $num_j$ is the number of times the $j$-th technical term appears in the document, so that the total is the sum of the occurrence counts of the individual terms, $k$ is a proportionality coefficient, and $total\_num$ is the total number of words in the document; when the total number of technical term occurrences in a paragraph of the document to be segmented is greater than the threshold, the paragraph makes heavy use of the technical terms of the discipline, and a bidirectional LSTM algorithm is used for segmentation in order to improve accuracy; when the total is less than the threshold, the paragraph can be regarded as general description that uses few technical terms, so it is segmented with a statistics-based word segmentation method, namely a hidden Markov model.
2. The intelligent Chinese word segmentation method based on statistics and deep learning as recited in claim 1, wherein Step 4 defines a word formation information entropy $\psi$:
[Formula image not reproduced: definition of $\psi$ in terms of $p(x, y)$, $p(x)$, $p(y)$, $\lambda$ and $\varepsilon$]
wherein $p(x, y)$ is the probability that Chinese character x and Chinese character y co-occur, $p(x)$ and $p(y)$ are the probabilities of occurrence of x and y respectively, $\lambda$ is a proportionality coefficient, and $\varepsilon$ is an allowable error term; for segmentations produced by the hidden Markov model method, the closeness of characters x and y is judged by computing the word formation information entropy $\psi$, which determines whether they can form a word; the larger the value of $\psi$, the more tightly the two characters bind into a word, and conversely, the smaller the value, the looser the binding; screening by word formation information entropy further improves the segmentation accuracy of the hidden Markov model; the bidirectional LSTM neural network feeds the result of an earlier prediction back in as a new feature for subsequent predictions and has high accuracy and strong learning ability, so its segmentation result does not need to be judged again.
CN201910655795.2A 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning Active CN110414002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910655795.2A CN110414002B (en) 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910655795.2A CN110414002B (en) 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning

Publications (2)

Publication Number Publication Date
CN110414002A (en) 2019-11-05
CN110414002B (en) 2023-06-09

Family

ID=68360365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910655795.2A Active CN110414002B (en) 2019-07-19 2019-07-19 Intelligent Chinese word segmentation method based on statistics and deep learning

Country Status (1)

Country Link
CN (1) CN110414002B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014164739A (en) * 2013-02-28 2014-09-08 National Institute Of Information & Communication Technology Parallel translation dictionary generation device, method, and computer program for the same
CN107818130A (en) * 2017-09-15 2018-03-20 深圳市电陶思创科技有限公司 The method for building up and system of a kind of search engine
CN108427670A (en) * 2018-04-08 2018-08-21 重庆邮电大学 A kind of sentiment analysis method based on context word vector sum deep learning
CN109388806A (en) * 2018-10-26 2019-02-26 北京布本智能科技有限公司 A kind of Chinese word cutting method based on deep learning and forgetting algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Word segmentation of petroleum-domain documents based on an adaptive hidden Markov model; 宫法明, 朱朋海; 《计算机科学》 (Computer Science); 2018-06-30; Chapter 3 *

Also Published As

Publication number Publication date
CN110414002A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
CN111310471B (en) Travel named entity identification method based on BBLC model
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
WO2022141878A1 (en) End-to-end language model pretraining method and system, and device and storage medium
CN112101028B (en) Multi-feature bidirectional gating field expert entity extraction method and system
CN106294320B (en) A kind of terminology extraction method and system towards academic paper
CN116187163B (en) Construction method and system of pre-training model for patent document processing
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN110502744B (en) Text emotion recognition method and device for historical park evaluation
CN111324742A (en) Construction method of digital human knowledge map
CN111897917B (en) Rail transit industry term extraction method based on multi-modal natural language features
CN113033183B (en) Network new word discovery method and system based on statistics and similarity
CN111061882A (en) Knowledge graph construction method
CN113312922B (en) Improved chapter-level triple information extraction method
CN110457690A (en) A kind of judgment method of patent creativeness
CN107895000A (en) A kind of cross-cutting semantic information retrieval method based on convolutional neural networks
CN112632969B (en) Incremental industry dictionary updating method and system
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN111008530A (en) Complex semantic recognition method based on document word segmentation
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114997288A (en) Design resource association method
CN114265936A (en) Method for realizing text mining of science and technology project
CN112257442A (en) Policy document information extraction method based on corpus expansion neural network
CN110502759B (en) Method for processing Chinese-Yue hybrid network neural machine translation out-of-set words fused into classification dictionary
CN114564912B (en) Intelligent document format checking and correcting method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant