CN110414002B

CN110414002B - Intelligent Chinese word segmentation method based on statistics and deep learning

Info

Publication number: CN110414002B
Application number: CN201910655795.2A
Authority: CN
Inventors: 徐建国; 刘梦凡; 刘泳慧
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2023-06-09
Anticipated expiration: 2039-07-19
Also published as: CN110414002A

Abstract

The invention discloses an intelligent Chinese word segmentation method based on statistics and deep learning, comprising: data preprocessing; constructing a domain term set; selecting a word segmentation method; and word segmentation judgment. The method combines a statistics-based word segmentation method with deep learning in a single word segmentation model. Its advantages are a wide range of application, accurate segmentation of technical terms in professional fields, a simple algorithm, and high segmentation speed.

Description

Intelligent Chinese word segmentation method based on statistics and deep learning

Technical Field

The invention belongs to the technical field of word segmentation and relates to a technique for improving the segmentation accuracy of technical terms in professional-domain documents.

Background

Chinese word segmentation is the process of splitting a sequence of Chinese characters into individual words, and it is the basis of natural language processing for Chinese. Chinese information processing is a branch of natural language processing and comprises three layers: lexical analysis, syntactic analysis, and semantic analysis; Chinese word segmentation is the first step of lexical analysis. Its applications are very broad, ranging from small tasks such as part-of-speech (POS) tagging and named entity recognition (NER) to large ones such as automatic classification, automatic proofreading, search engines, speech synthesis, and machine translation. Statistics-based Chinese word segmentation methods have low accuracy, and they find it especially difficult to segment the technical terms of a professional field correctly; using only a deep-learning-based method, on the other hand, brings high algorithmic complexity and low segmentation speed.

Disclosure of Invention

The invention aims to provide an intelligent Chinese word segmentation method based on statistics and deep learning that solves the problems of high complexity and low segmentation speed that arise when only a bidirectional LSTM algorithm is used for Chinese word segmentation. The method combines a statistics-based word segmentation method with deep learning in a single word segmentation model. Its advantages are a wide range of application, accurate segmentation of technical terms in professional fields, a simple algorithm, and high segmentation speed.

The technical scheme adopted by the invention is carried out according to the following steps:

Step 1, data preprocessing;

Step 2, constructing a domain term set;

Step 3, selecting a word segmentation method;

Step 4, word segmentation judgment.

Further, in Step 1 the text document to be segmented is preprocessed: the document is split at symbols that act as separators, such as the original punctuation marks and paragraph separators, yielding shorter sentences or character strings.

Further, in Step 2 the sub-disciplines of a domain are numbered from 1 to n and a term set TS is created,

$$TS = \{TS_1, TS_2, \ldots, TS_n\}$$

The m most commonly used technical terms of each sub-discipline are counted, and the most commonly used terms of sub-discipline i form the corresponding discipline term set TS_i.

Further, in Step 3 the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline term set TS_i is extracted. TS_i is traversed to count the technical terms of that discipline contained in the document and the number of times each occurs. The total number of occurrences of technical terms in a document paragraph is

$$\sum_{j=1}^{m} \mathrm{num}_j$$

The term quantity threshold is defined as Γ = k·total_num, and the word segmentation method is chosen as follows:

$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} \mathrm{num}_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} \mathrm{num}_j < \Gamma \end{cases}$$

The total number of occurrences of technical terms is the cumulative sum of the occurrences of each term, where num_j is the number of times the j-th technical term appears in the document. In the threshold Γ = k·total_num, k is a proportionality coefficient and total_num is the total number of words in the document. When the total number of occurrences of technical terms in a paragraph of the document to be segmented exceeds the threshold, the paragraph uses a large number of technical terms of the discipline, and a bidirectional LSTM algorithm is used to segment it in order to improve accuracy. When the total number of occurrences in a paragraph is below the threshold, the paragraph can be regarded as general descriptive text that uses few technical terms, so it is segmented with a statistics-based method, namely a hidden Markov model.

Further, Step 4 defines a word-formation information entropy ψ,

where p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, respectively, λ is a proportionality coefficient, and ε is an allowable error term. For segmentation produced by the hidden Markov model, the closeness of characters x and y is judged by computing ψ, which determines whether they can form a word. The larger the value of ψ, the more tightly the two characters combine into a word; the smaller the value, the looser the combination. Screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model. The bidirectional LSTM neural network uses its earlier predictions as new features for subsequent predictions and has high accuracy and strong learning ability, so its segmentation results do not need to be judged again.

Drawings

FIG. 1 is the overall flow chart for accurate word segmentation of a professional-domain document;

FIG. 2 is the flow chart of data (text) preprocessing;

FIG. 3 is the network structure diagram of the Bi-LSTM (bidirectional LSTM).

Detailed Description

The present invention will be described in detail with reference to the following embodiments.

The flow of the intelligent Chinese word segmentation method based on statistics and deep learning is shown in FIG. 1, and the steps are as follows:

Step 1. Data preprocessing. FIG. 2 is the flow chart of data (text) preprocessing. The text document to be segmented is preprocessed: the document is split at symbols that act as separators, such as the original punctuation marks and paragraph separators, yielding shorter sentences or character strings.

Given the conventions of Chinese writing, authors usually place similar or logically closely related content in the same natural paragraph. The technical terms of a field therefore tend to recur heavily within one or a few natural paragraphs. Paragraphs in which technical terms occur in large numbers should be processed with a word segmentation method that has high accuracy and strong disambiguation ability. For paragraphs in which technical terms are not concentrated (such as background introductions, the author's views, and summaries), a statistics-based Chinese word segmentation method can be used; it still achieves high accuracy while increasing segmentation speed and reducing algorithmic complexity.

Accurate word segmentation of a professional-domain document comprises text preprocessing, counting the occurrences of technical terms in the document with the help of a discipline term set, selecting a word segmentation model, and completing the segmentation.

Text preprocessing is the premise and basis of word segmentation. The paragraph is the smallest structural unit that expresses the hierarchy of an article; content within one natural paragraph varies little, so the same method can be used to segment all of it. The original document is therefore first divided into paragraphs at paragraph separators, and each paragraph serves as one data processing unit that is segmented with a single method. Each paragraph is then split further at separators such as punctuation marks, and the original separators are replaced with spaces. After preprocessing, each paragraph is a sequence of shorter sentences or character strings. During segmentation these short sentences or strings are processed one by one, which reduces the number of matching operations, improves segmentation efficiency, and lowers segmentation difficulty.
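The two-stage split described above (paragraph separators first, then punctuation replaced by spaces) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess` and the two separator patterns are assumptions, since the text only speaks of "paragraph separators" and "punctuation marks" without listing them.

```python
import re

# ASSUMED separator sets: the source does not enumerate them.
PARAGRAPH_SEP = re.compile(r"\n+")
PUNCTUATION = re.compile(r'[，。！？；：、,.!?;:（）()“”"\s]+')


def preprocess(document: str) -> list[list[str]]:
    """Split a document into paragraphs, then each paragraph into short strings."""
    paragraphs = []
    for raw in PARAGRAPH_SEP.split(document):
        # Replace every run of separators with a single space.
        spaced = PUNCTUATION.sub(" ", raw).strip()
        if spaced:
            # Each paragraph becomes a list of shorter sentences or strings.
            paragraphs.append(spaced.split(" "))
    return paragraphs
```

Each inner list corresponds to one paragraph, so a later stage can pick one segmentation method per paragraph and then process its short strings one by one.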

Step2. Domain term set construction.

The sub-discipline to which a document belongs can be roughly judged from the document title. The previously constructed discipline term set of the relevant sub-discipline is then extracted and traversed, the number of occurrences of each commonly used technical term in each paragraph is counted, and finally the total number of occurrences of technical terms in the paragraph is obtained.

The sub-disciplines of a given domain are numbered from 1 to n, and a term set TS is established,

$$TS = \{TS_1, TS_2, \ldots, TS_n\}$$

The m most commonly used technical terms of each sub-discipline are counted (m may differ from discipline to discipline), and the most commonly used terms of sub-discipline i form the corresponding discipline term set TS_i.
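One possible way to hold the term set TS in memory is a mapping from sub-discipline number i to the set TS_i. The sketch below is an assumption-laden illustration: `build_term_set` and its `corpora` argument (a hypothetical mapping from sub-discipline number to a list of observed term occurrences) are not named in the source, and a single m is used for every sub-discipline for brevity, although the text allows m to vary.

```python
from collections import Counter


def build_term_set(corpora: dict[int, list[str]], m: int) -> dict[int, set[str]]:
    """Return TS as a mapping i -> TS_i, the m most frequent terms of sub-discipline i."""
    term_set = {}
    for i, terms in corpora.items():
        # Counter.most_common(m) yields the m terms with the highest counts.
        most_common = Counter(terms).most_common(m)
        term_set[i] = {term for term, _count in most_common}
    return term_set
```

A set per sub-discipline keeps the later traversal simple, since Step 3 only needs to iterate over the terms of one TS_i.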

Step 3. Selecting a word segmentation method. The sub-discipline to which the document belongs is judged from the topic of the document to be segmented, and the corresponding discipline term set TS_i is extracted. TS_i is traversed to count the technical terms of that discipline contained in the document and the number of times each occurs, and the results are represented as a matrix. The total number of occurrences of technical terms in a document paragraph is

$$\sum_{j=1}^{m} \mathrm{num}_j$$

The term quantity threshold is defined as Γ = k·total_num. The word segmentation method is chosen as follows:

$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} \mathrm{num}_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} \mathrm{num}_j < \Gamma \end{cases}$$

The total number of occurrences of technical terms is the cumulative sum of the occurrences of each term, where num_j is the number of times the j-th technical term appears in the document. In the threshold Γ = k·total_num, k is a proportionality coefficient and total_num is the total number of words in the document. That is, when the total number of occurrences of technical terms in a paragraph of the document to be segmented exceeds the threshold, the paragraph uses a large number of technical terms of the discipline, and a bidirectional LSTM algorithm is used to segment it in order to improve accuracy. When the total number of occurrences in a paragraph is below the threshold, the paragraph can be regarded as general description that uses few technical terms, so a statistics-based method, namely a hidden Markov model, is used to segment it. FIG. 3 shows the network structure of the bidirectional LSTM.
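The selection rule can be illustrated with a short Python sketch. The function names, the string labels "bilstm" and "hmm", and the use of substring counting to obtain num_j are assumptions made for illustration. The caller supplies `total_num`; because word counts are not available before segmentation, a character count is one plausible stand-in, which is also an assumption.

```python
def count_term_occurrences(paragraph: str, terms: set[str]) -> dict[str, int]:
    """num_j for each term: non-overlapping occurrences in the paragraph."""
    return {term: paragraph.count(term) for term in terms}


def choose_segmenter(paragraph: str, terms: set[str], k: float, total_num: int) -> str:
    """Pick "bilstm" when the summed term count exceeds Γ = k·total_num, else "hmm"."""
    total_occurrences = sum(count_term_occurrences(paragraph, terms).values())
    gamma = k * total_num
    # The source leaves the equality case open; this sketch sends it to the HMM.
    return "bilstm" if total_occurrences > gamma else "hmm"
```

For example, a paragraph with four term occurrences, total_num = 26, and k = 0.1 gives Γ = 2.6, so the paragraph would be routed to the bidirectional LSTM; a paragraph with no term occurrences would go to the hidden Markov model.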

Step 4. Word segmentation judgment. A word-formation information entropy ψ is defined,

where p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, respectively, λ is a proportionality coefficient, and ε is an allowable error term. For segmentation produced by the hidden Markov model, the closeness of characters x and y is judged by computing ψ, which determines whether they can form a word. The larger the value of ψ, the more tightly the two characters combine into a word; the smaller the value, the looser the combination. Screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model. The bidirectional LSTM neural network uses its earlier predictions as new features for subsequent predictions and has high accuracy and strong learning ability, so its segmentation results do not need to be judged again.
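The defining formula for ψ is not reproduced in this text, so the following Python sketch rests on an explicit assumption: that ψ has a pointwise-mutual-information form, ψ = λ·log2(p(x, y) / (p(x)·p(y))) + ε, which uses exactly the quantities named above. The function names, the estimation of probabilities from raw corpus counts, and the `threshold` parameter are likewise hypothetical.

```python
import math
from collections import Counter


def word_formation_entropy(x, y, corpus, lam=1.0, eps=0.0):
    """ASSUMED form: psi = lam * log2(p(x, y) / (p(x) * p(y))) + eps."""
    n_chars = len(corpus)
    if n_chars == 0:
        return float("-inf")
    chars = Counter(corpus)
    pairs = Counter(zip(corpus, corpus[1:]))  # adjacent character pairs
    n_pairs = max(n_chars - 1, 1)
    p_x = chars[x] / n_chars
    p_y = chars[y] / n_chars
    p_xy = pairs[(x, y)] / n_pairs
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")  # never seen together: weakest possible cohesion
    return lam * math.log2(p_xy / (p_x * p_y)) + eps


def is_cohesive(x, y, corpus, threshold, lam=1.0, eps=0.0):
    """Keep an HMM-proposed two-character word only if psi reaches the threshold."""
    return word_formation_entropy(x, y, corpus, lam, eps) >= threshold
```

Under this assumed form, a pair that co-occurs more often than chance yields a larger ψ, matching the statement that a larger value indicates a tighter combination of the two characters.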

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the invention in any way, and any simple modification, equivalent variation and modification made to the above embodiments according to the technical substance of the present invention falls within the scope of the technical solution of the present invention.

Claims

1. The intelligent Chinese word segmentation method based on statistics and deep learning is characterized by comprising the following steps of:

Step 1, data preprocessing;

Step 2, constructing a domain term set;

Step 3, selecting a word segmentation method;

Step 4, word segmentation judgment;

in Step 1, the text document to be segmented is preprocessed, and the document is split at the original symbols that act as separators, yielding shorter sentences or character strings;

in Step 2, the sub-disciplines of a certain domain are numbered from 1 to n and a term set TS is created,

$$TS = \{TS_1, TS_2, \ldots, TS_n\}$$

the m most commonly used technical terms of each sub-discipline are counted, and the most commonly used terms of each sub-discipline form the corresponding discipline term set TS_i;

in Step 3, the sub-discipline to which the document belongs is judged from the topic of the document to be segmented, the corresponding discipline term set TS_i is extracted, and TS_i is traversed to count the technical terms of that discipline contained in the document and the number of times each occurs, wherein the total number of occurrences of technical terms in a document paragraph is

$$\sum_{j=1}^{m} \mathrm{num}_j$$

the term quantity threshold is defined as Γ = k·total_num, and the word segmentation method is chosen as follows:

$$\text{method} = \begin{cases} \text{bidirectional LSTM}, & \sum_{j=1}^{m} \mathrm{num}_j > \Gamma \\ \text{hidden Markov model}, & \sum_{j=1}^{m} \mathrm{num}_j < \Gamma \end{cases}$$

the total number of occurrences of technical terms is the cumulative sum of the occurrences of each term, where num_j is the number of times the j-th technical term appears in the document; in the threshold Γ = k·total_num, k is a proportionality coefficient and total_num is the total number of words in the document; when the total number of occurrences of technical terms in a paragraph of the document to be segmented exceeds the threshold, the paragraph uses a large number of technical terms of the discipline, and a bidirectional LSTM algorithm is used to segment it in order to improve accuracy; when the total number of occurrences in a paragraph is below the threshold, the paragraph can be regarded as general description that uses few technical terms, so it is segmented with a statistics-based method, namely a hidden Markov model.

2. The intelligent Chinese word segmentation method based on statistics and deep learning as recited in claim 1, wherein Step 4 defines a word-formation information entropy ψ,

where p(x, y) is the probability that Chinese characters x and y co-occur, p(x) and p(y) are the probabilities that x and y occur, respectively, λ is a proportionality coefficient, and ε is an allowable error term; for segmentation produced by the hidden Markov model, the closeness of characters x and y is judged by computing ψ, which determines whether they can form a word; the larger the value of ψ, the more tightly the two characters combine into a word, and the smaller the value, the looser the combination; screening by word-formation information entropy further improves the segmentation accuracy of the hidden Markov model; the bidirectional LSTM neural network uses its earlier predictions as new features for subsequent predictions and has high accuracy and strong learning ability, so its segmentation results do not need to be judged again.