CN109388806B

CN109388806B - Chinese word segmentation method based on deep learning and forgetting algorithm

Info

Publication number: CN109388806B
Application number: CN201811258651.5A
Authority: CN
Inventors: 卢学裕; 王安; 杨大海; 杨利军
Original assignee: Beijing Botbrain Intelligent Technology Co ltd
Current assignee: Beijing Botbrain Intelligent Technology Co ltd
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2023-06-27
Anticipated expiration: 2038-10-26
Also published as: CN109388806A

Abstract

The invention discloses a Chinese word segmentation method based on deep learning and a forgetting algorithm, which comprises the following steps. Step one: sentences are scanned character by character to obtain natural language, the scanned natural language is divided into word sequences by a deep learning word segmentation method, and the word sequences are collected into a first word stock. Step two: sentences are scanned character by character to obtain natural language, the obtained natural language is divided into candidate words by a forgetting algorithm word segmentation method, and the candidate words are collected into a second word stock. Step three: the word sequences in the first word stock are fused with the candidate words in the second word stock to obtain the final word segmentation result, wherein the fusion method is as follows: consecutive single characters in the second word stock are merged into a word if they form a word in the corresponding deep learning result; and a single character in the second word stock is merged forwards or backwards into a word if it belongs to a word in the corresponding deep learning result. By fusing the deep learning word segmentation method with the forgetting algorithm word segmentation method, the method can automatically detect domain knowledge, performs unsupervised new word discovery within a domain, and improves the word segmentation effect.

Description

Chinese word segmentation method based on deep learning and forgetting algorithm

Technical Field

The invention relates to the technical field of word segmentation, in particular to a Chinese word segmentation method based on deep learning and forgetting algorithm.

Background

Chinese word segmentation refers to the segmentation of a sequence of Chinese characters into individual words. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification.

1. Word segmentation method based on character string matching

This method is also called mechanical word segmentation. The Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation can be divided into forward matching and reverse matching; by which length is preferred, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech labeling, into pure segmentation methods and integrated methods that combine segmentation with labeling. Several commonly used mechanical word segmentation methods are as follows:

1) Forward maximum matching (left to right direction);

2) Reverse maximum matching (right-to-left direction);

3) Minimum segmentation (minimizing the number of words cut in each sentence).

The above methods may also be combined with each other; for example, forward maximum matching and reverse maximum matching may be combined into a bidirectional matching method. Owing to the way Chinese characters form words, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching segments slightly more accurately than forward matching and produces fewer ambiguities. Statistics show that the error rate of pure forward maximum matching is 1/169 and that of pure reverse maximum matching is 1/245. This accuracy is still far from meeting practical requirements. Word segmentation systems in practical use take mechanical word segmentation as a preliminary segmentation step and further improve accuracy by drawing on various other kinds of language information.
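The forward and reverse maximum matching strategies described above can be sketched as follows. This is a minimal illustration, not an implementation taken from the source: the function names, the `max_len` limit, the fallback to a single character when nothing matches, and the toy dictionary are all assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumed fallback: an unmatched single character is emitted as-is.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


# Hypothetical toy dictionary for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
forward_result = forward_max_match("研究生命起源", toy_dictionary)
reverse_result = reverse_max_match("研究生命起源", toy_dictionary)
```

On this toy input the two directions disagree, which is the kind of case a bidirectional method would flag for further disambiguation.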

One method is to improve the scanning mode, called feature scanning or sign segmentation, to identify and segment some words with obvious features in the character string to be analyzed, and to use these words as break points to divide the original character string into smaller strings and then to enter mechanical word segmentation, so as to reduce the error rate of matching. The other method combines word segmentation and word class labeling, provides help for word segmentation decision by using rich word class information, and also carries out inspection and adjustment on word segmentation results in the labeling process, thereby greatly improving the segmentation accuracy.

A general model can be built for the mechanical word segmentation method; specialized academic papers exist on this topic, and it is not discussed in detail here.

2. Word segmentation method based on understanding

This word segmentation method recognizes words by having the computer simulate a person's understanding of sentences. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use the syntactic and semantic information to resolve ambiguity. Such a system generally consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the word segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences in order to judge segmentation ambiguities; that is, it simulates the process by which people understand sentences. This method requires a large amount of language knowledge and information. Because Chinese language knowledge is general and complex, it is difficult to organize the various kinds of language information into machine-readable form, and understanding-based word segmentation systems are currently still at the experimental stage.

3. Word segmentation method based on statistics

Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to constitute a word. The frequency or probability with which adjacent characters co-occur therefore reflects fairly well the credibility of a word. The frequency of each combination of adjacent co-occurring characters in the corpus can be counted and their mutual information calculated: the mutual information of two characters is defined from the adjacent co-occurrence probability of the two Chinese characters X and Y, and it expresses how tightly the characters are bound together. When the tightness exceeds a certain threshold, the character group is considered likely to constitute a word. This method only needs to count character-group frequencies in the corpus and does not need a segmentation dictionary, so it is also called dictionary-free word segmentation or statistical word extraction. However, it has limitations: it often extracts common character groups that co-occur frequently but are not words, such as "this", "one", "some", "my", and "many"; its recognition accuracy for common words is poor; and its time and space overhead is high. Statistical word segmentation systems in practical use therefore employ a basic segmentation dictionary (a dictionary of common words) for string-matching segmentation while using statistical methods to recognize new words. That is, string-frequency statistics and string matching are combined, which retains the speed and efficiency of matching-based segmentation while exploiting the ability of dictionary-free segmentation to recognize new words from context and to resolve ambiguity automatically.

1. Ambiguity identification

Ambiguity means that the same sentence may admit two or more segmentations. For example, the phrase 表面的 ("of the surface") can be divided into 表面 / 的 or 表 / 面的, because both 表面 ("surface") and 面的 ("minibus taxi") are words. This is known as cross ambiguity. Such cross ambiguity is quite common; segmentation errors involving 和服 ("kimono") are in fact caused by it. For instance, 化妆和服装 ("make-up and apparel") can be divided into 化妆 / 和 / 服装 ("make-up", "and", "apparel") or 化妆 / 和服 / 装 ("make-up", "kimono", "dress"). Lacking human knowledge with which to understand the text, a computer has difficulty knowing which segmentation is correct.

Cross ambiguity is relatively easy to handle compared with combination ambiguity, which must be judged from the whole sentence. For example, in the sentence 这个门把手坏了 ("this door handle is broken"), 把手 ("handle") is a word, but in the sentence 请把手拿开 ("please take your hand away"), 把手 is not a word; in the sentence 将军任命了一名中将 ("the general appointed a lieutenant general"), 中将 ("lieutenant general") is a word, but in the sentence 产量三年中将增长两倍 ("output will increase twofold within three years"), 中将 is no longer a word. How is a computer to recognize such cases?

Even if a computer can resolve both cross ambiguity and combination ambiguity, one further difficulty remains: true ambiguity. True ambiguity means that, given a sentence, even a person cannot tell which strings should be words and which should not. For example, 乒乓球拍卖完了 can be divided into 乒乓 / 球拍 / 卖 / 完 / 了 ("the table tennis bats are sold out") or 乒乓球 / 拍卖 / 完 / 了 ("the table tennis ball auction is over"); without context, probably nobody can tell whether 拍卖 ("auction") counts as a word here.

2. New word recognition

In technical terminology, new words are called unregistered words, i.e. words that are not yet included in the dictionary but can indeed be regarded as words. The most typical case is personal names. A person can easily understand that in the sentence 王军虎去广州了 ("Wang Junhu went to Guangzhou"), 王军虎 ("Wang Junhu") is a word because it is a person's name, but this is difficult for a computer to recognize. If "Wang Junhu" were recorded in the dictionary as a word, the same would have to be done for the enormous number of names worldwide, with new names appearing at every moment, so recording them is a huge undertaking. Even if this could be done, problems would remain: in the sentence 王军虎头虎脑的 ("Wang Jun is sturdy and simple-looking"), can 王军虎 still be counted as a word?

In addition to the name of a person, the new words include organization names, place names, product names, trademark names, abbreviations, ellipses and the like, which are difficult to process, and are just words which are frequently used by people, so that the new word recognition in a word segmentation system is very important for a search engine. At present, the recognition accuracy of new words is one of important marks for evaluating the quality of a word segmentation system. The existing word segmentation algorithm is based on a word stock, and words which do not appear in the word stock cannot be segmented.

Disclosure of Invention

Aiming at the technical problems, the invention provides a Chinese word segmentation method based on a deep learning and forgetting algorithm, which can automatically detect domain knowledge by fusing the deep learning word segmentation method and the forgetting algorithm word segmentation method, complete a new word discovery function in an unsupervised domain and improve word segmentation effect.

In order to solve the technical problems, the invention adopts the following technical scheme: a Chinese word segmentation method based on deep learning and forgetting algorithm comprises the following steps:

Step one: sentences are scanned character by character to obtain natural language, the scanned natural language is divided into word sequences by a deep learning word segmentation method, and the word sequences are collected into a first word stock;

Step two: sentences are scanned character by character to obtain natural language, the obtained natural language is divided into candidate words by a forgetting algorithm word segmentation method, and the candidate words are collected into a second word stock;

Step three: the word sequences in the first word stock are fused with the candidate words in the second word stock to obtain the final word segmentation result, wherein the fusion method is as follows:

where the first word stock and the second word stock both give a word, it is taken as a word; where the first word stock and the second word stock both give single characters, they are combined into words; consecutive single characters in the second word stock are merged into a word if they form a word in the corresponding deep learning result; and a single character in the second word stock is merged forwards or backwards into a word if it belongs to a word in the corresponding deep learning result.

The deep learning word segmentation method of step one adopts an RNN method.

Within the RNN method, the deep learning word segmentation method adopts an LSTM model.

The word segmentation method of the forgetting algorithm adopts a judgment formula as follows:

$$P(W_n W_{n+1}) < P(W_n)\,P(W_{n+1})$$

where $W_n$ is the nth character in the scanned sentence.

The forgetting curve adopted by the forgetting algorithm in the second step is a Newton cooling curve.

The beneficial effects of the invention are as follows:

the word segmentation method of the invention has the following advantages:

(1) Unsupervised learning, which can use a large amount of corpus for training;

(2) O(N) time complexity, so that large-scale word segmentation can be completed in a relatively short time;

(3) The word library is self-maintained, and the program can automatically find and add new words, adjust word frequency, clean wrong words and remove uncommon words under the condition that manual participation is not needed, so that the size of the dictionary is kept to be proper;

(4) Domain adaptation: when the field changes, the vocabulary entry and the vocabulary frequency are adaptively adjusted along with the field change;

(5) The method can support word segmentation of exclusive word banks such as names of remote artists, program names and the like.

Drawings

FIG. 1 is the forgetting curve used for the forgetting coefficient in the Chinese word segmentation method based on deep learning and forgetting algorithm of the present invention;

FIG. 2 is a logic diagram of the LSTM model in the Chinese word segmentation method based on deep learning and forgetting algorithm of the present invention.

Detailed Description

In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings which form a part hereof. The accompanying drawings illustrate, by way of example, specific embodiments in which the invention may be practiced. The illustrated embodiments are not intended to be exhaustive of all embodiments according to the invention. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.

A Chinese word segmentation method based on deep learning and forgetting algorithm comprises the following steps:

The invention adopts the combination of deep learning and forgetting algorithm, can automatically detect domain knowledge, completes the new word discovery function in the unsupervised domain, and improves the word segmentation effect.

The main steps of the forgetting algorithm are as follows:

Word segmentation can be accomplished in a single pass, in O(N) time, using the following steps:

The sentence is scanned character by character. For each character, all words ending at that character within a limited word length are looked up in the word stock; for each such word, the product of its probability and the maximum probability product at the position preceding the word is calculated; the word giving the largest value is taken, and the maximum probability product at the current position and the corresponding segmentation are cached. These steps are repeated until the whole sentence has been scanned; the result cached at the final character position is the segmentation of the whole sentence.
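The single-pass scan described above can be sketched as a small dynamic program. This is a minimal sketch under stated assumptions rather than the patented implementation: the function name `segment`, the `max_len` limit of 4, the fallback probability `unknown_prob` for characters missing from the word stock, and the toy probabilities are all hypothetical. Log-probabilities are summed instead of multiplying raw probabilities, which gives the same ordering while avoiding underflow.

```python
import math


def segment(sentence, word_prob, max_len=4, unknown_prob=1e-8):
    """Return the segmentation with the largest product of word probabilities."""
    n = len(sentence)
    best_score = [0.0] + [-math.inf] * n  # best log-probability up to each position
    back = [0] * (n + 1)                  # start index of the last word at each position
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in word_prob:
                p = word_prob[word]
            elif end - start == 1:
                p = unknown_prob          # assumed fallback for unseen single characters
            else:
                continue
            score = best_score[start] + math.log(p)
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    # Follow the cached back-pointers from the final position.
    words, end = [], n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical toy word stock for illustration only.
toy_probs = {"数据": 0.02, "挖掘": 0.01, "数据挖掘": 0.005, "算法": 0.02}
toy_result = segment("数据挖掘算法", toy_probs)
```

Because each position inspects at most `max_len` candidate words, the work grows linearly with sentence length for a fixed word-length limit.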

If two adjacent characters are unrelated, the sentence can be broken between them. The sentence is scanned character by character, and wherever two adjacent characters satisfy the following formula a break is made between them, so that the sentence is cut into several substrings and a set of candidate words is obtained. The judgment formula is:

$$P(W_n W_{n+1}) < P(W_n)\,P(W_{n+1})$$

where $W_n$ is the nth character in the scanned sentence.

The parameters required in the formula can be obtained statistically: a single traversal of the corpus yields the frequency of each single character, the co-occurrence frequency of each pair of adjacent characters, and the total frequency of all single characters.
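The counting pass and the break rule can be sketched together as follows. This is an illustrative sketch, not the source's implementation: the function names are hypothetical, and estimating both the single-character and the pair probabilities by dividing by the total single-character frequency is an assumption, since the source does not spell out the estimators. The toy counts are invented for the example.

```python
from collections import Counter


def count_corpus(corpus):
    """One pass over the corpus: character counts, adjacent-pair counts, total."""
    char_freq, pair_freq = Counter(), Counter()
    for sentence in corpus:
        char_freq.update(sentence)
        pair_freq.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return char_freq, pair_freq, sum(char_freq.values())


def candidate_words(sentence, char_freq, pair_freq, total):
    """Break between adjacent characters when P(ab) < P(a) * P(b)."""
    if not sentence:
        return []
    pieces, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        p_pair = pair_freq[a + b] / total            # assumed estimator
        p_independent = (char_freq[a] / total) * (char_freq[b] / total)
        if p_pair < p_independent:
            pieces.append(current)                   # unrelated: cut here
            current = b
        else:
            current += b                             # related: keep together
    pieces.append(current)
    return pieces


# Invented toy statistics for illustration only.
toy_chars = Counter({"的": 40, "我": 30, "们": 20, "书": 10})
toy_pairs = Counter({"我们": 18, "们的": 5, "的书": 3})
toy_candidates = candidate_words("我们的书", toy_chars, toy_pairs, 100)
```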

The forgetting curve used for the forgetting coefficient is shown in FIG. 1.
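The source states that the forgetting curve is a Newton cooling curve but does not give the formula or how it is applied to the word stock. As a hedged illustration only, the sketch below assumes cooling toward zero, i.e. exponential decay w(t) = w0 · exp(−rate · t), and assumes that an entry is reinforced by 1 each time it is observed and dropped when its decayed weight falls below a threshold. The class name, the `rate` and `threshold` parameters, and the reinforcement rule are all hypothetical.

```python
import math


def cooled(weight, elapsed, rate):
    """Newton cooling toward zero: w(t) = w0 * exp(-rate * t)."""
    return weight * math.exp(-rate * elapsed)


class ForgettingLexicon:
    """Hypothetical self-maintaining word stock with decaying weights."""

    def __init__(self, rate=0.1, threshold=0.5):
        self.rate = rate
        self.threshold = threshold
        self.entries = {}  # word -> (weight, time last seen)

    def observe(self, word, now):
        # Decay the stored weight to the present, then reinforce it.
        weight, seen = self.entries.get(word, (0.0, now))
        self.entries[word] = (cooled(weight, now - seen, self.rate) + 1.0, now)

    def forget(self, now):
        # Drop entries whose decayed weight has fallen below the threshold.
        self.entries = {
            word: (weight, seen)
            for word, (weight, seen) in self.entries.items()
            if cooled(weight, now - seen, self.rate) >= self.threshold
        }


lexicon = ForgettingLexicon(rate=0.1, threshold=0.5)
lexicon.observe("口袋妖怪", now=0)
lexicon.observe("口袋妖怪", now=1)   # seen again, so its weight is reinforced
lexicon.observe("偶然", now=0)       # seen once, then never again
lexicon.forget(now=10)
```

Under these assumptions a repeatedly observed candidate survives while a one-off candidate is eventually cleaned out, which is one way the word stock could stay at a manageable size without manual maintenance.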

The deep learning method adopts an RNN method, specifically an LSTM model.

Chinese word segmentation divides a natural language text into a word sequence. Preferably it is treated as sequence labeling, in which each character in a sentence is given one of four tags, BMES (B marks the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word).

For {京东 搜索 与 大数据 平台 数据挖掘 算法部} ("JD search and big data platform data mining algorithm department")

Labeled {BE BE S BME BE BMME BME}
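The BMES labeling can be sketched as a pair of conversion helpers. The function names are hypothetical, and the Chinese sentence used here is an assumed reconstruction of the translated example, chosen because its word lengths reproduce the tag pattern given above.

```python
def to_bmes(words):
    """Map each word to its BMES tag string (one tag per character)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B" + "M" * (len(word) - 2) + "E")
    return tags


def from_bmes(chars, tags):
    """Rebuild words from a character sequence and its per-character tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):   # a word ends at E or S
            words.append(current)
            current = ""
    if current:                 # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


example_words = ["京东", "搜索", "与", "大数据", "平台", "数据挖掘", "算法部"]
example_tags = to_bmes(example_words)
round_trip = from_bmes("".join(example_words), "".join(example_tags))
```

A tagger such as the LSTM model only has to predict one of four tags per character; the second helper then turns the predicted tag sequence back into words.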

Training is performed on labeled corpora consisting of original input sequences and output sequences, and the word segmentation sequence is finally generated. A logic diagram of the LSTM model is shown in FIG. 2, in which X is the input sequence and H is the output sequence. The basic idea is to treat word segmentation as a sequence labeling problem, labeling each character in a sentence with one of the four BMES tags. The input of the whole model is a character sequence and the output is a tag sequence, so this is a standard sequence-to-sequence problem.

The combined word segmentation method improves the segmentation effect by fusing the results of the two methods, and takes the forgetting algorithm as the main method because:

artist names, for example, play an important role in recommendation;

the forgetting algorithm is unsupervised, so corpus acquisition is inexpensive;

training corpora for deep learning algorithms are scarce, and training takes a long time.

Merging scheme:

consecutive single characters in the forgetting algorithm result are merged into a word if they form a word in the corresponding deep learning result;

a single character in the forgetting algorithm result is merged forwards or backwards into a word if it belongs to a word in the corresponding deep learning result;

merging is performed with reference to parts of speech.
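One possible reading of the first two merging rules can be sketched as follows. This is an interpretation, not the source's implementation: the function names are hypothetical, the rule that a deep learning word must align with token boundaries of the forgetting algorithm result is an assumption, the part-of-speech rule is not modelled, and the sample tokens are illustrative.

```python
def spans(tokens):
    """Return (start, end) character offsets for each token."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok)))
        pos += len(tok)
    return out


def fuse(forgetting_tokens, deep_tokens):
    """Merge single characters of the forgetting result using deep learning words."""
    text = "".join(forgetting_tokens)
    if text != "".join(deep_tokens):
        raise ValueError("both segmentations must cover the same text")
    f_spans = spans(forgetting_tokens)
    boundaries = {0} | {end for _, end in f_spans}
    merge_to = {}  # start offset -> end offset of a span to merge
    for start, end in spans(deep_tokens):
        if end - start < 2 or start not in boundaries or end not in boundaries:
            continue
        inside = [(s, e) for s, e in f_spans if start <= s and e <= end]
        # Merge only if the deep word covers several forgetting tokens,
        # at least one of which is a single character.
        if len(inside) > 1 and any(e - s == 1 for s, e in inside):
            merge_to[start] = end
    result, i = [], 0
    while i < len(f_spans):
        s, e = f_spans[i]
        if s in merge_to:
            end = merge_to[s]
            result.append(text[s:end])
            while i < len(f_spans) and f_spans[i][1] <= end:
                i += 1
        else:
            result.append(text[s:e])  # forgetting result is kept as the default
            i += 1
    return result


merged_singles = fuse(["热", "心", "乘客"], ["热心", "乘客"])
merged_neighbor = fuse(["网络", "版"], ["网络版"])
kept_word = fuse(["口袋妖怪"], ["口袋", "妖怪"])
```

In this sketch the forgetting algorithm result is the default, so a word it has already found is never split by the deep learning result; the deep learning result is consulted only to absorb leftover single characters.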

Example 1:

Natural language is obtained by scanning sentences, segmentation is then performed by the forgetting algorithm and by deep learning respectively, and the fused word segmentation result is obtained.

The following are the results of segmenting with each of the two algorithms separately.

Word segmentation result of the forgetting algorithm:

< actual shooting > < man > < subway > < leowed > < woman > < passenger > < quilt > < heat > < passenger > < torsion > < acquisition >

< monster of pocket > < network > < version > < registration > < download > < teaching > < video >

< jol > < some > < snowy > < real > < loving > < romantic > < surface > < girl > < feeling > < crying > <161105> < very > < perfect >

< laugh very popular > < Zheng Mou > < Yang Mou > < kissing > < play > < game > < talk > < love >

< ginger something > < Tight > < mah-jong > < reaction > < olympic > < laugh > < can > < combination with > < gymnastics >

The sound of the ball is the sound of the ball, the sound of the horn is the ultra-large sound of the ball, the sound of the ball is the sound of the ball, and the sound of the ball is the sound of the ball.

Word segmentation result of deep learning algorithm:

< practice > < clap > < man > < subway > < leowed woman > < passenger > < quilt > < heat-center > < passenger > < twist >

< pocket > < monster > < web version > < registration > < download > < teaching > < video >

< something joss > < snow > < night foraging > < genuine love > < romantic > < appearance > < girl > < feeling pain > < cry > <161105> < very > < perfect >

< pico > < laugh > < very > < pouring > < Zheng Mou > < Yang Mou > < kiss > < play > < game > < love >

< ginger > < certain > < Tibet > < mah-jong > < reaction > < forward Olympic > < laugh > < can > < combination with > < gymnastics > < combination >

< week sometime > < stock > < station > < left > < outer > < field > < horn > < loud sound > < self-contained > < won > < fight > < enemy >;

the results after combining by the above scheme:

< actual shooting > < man > < subway > < lewy < passenger > < quilt > < heat center > < passenger > < twist >

< pocket monster > < web version > < registration > < download > < teaching > < video >

< something joker > < snowy > < foraging > < genuine > above > < romantic > < appearance > < girl > < feeling > < cry of pain > <161105> < very > < perfect >

< laugh very popular > < Zheng Mou > < Yang Mou > < kissing > < play > < game > < love >

< week sometime > < stock > < station > < left > < outer > < field > < horn > < loud sound > < self-contained > < won > < protection > < enemy >.

Claims

1. A Chinese word segmentation method based on deep learning and forgetting algorithm is characterized by comprising the following steps:

where the first word stock and the second word stock both give a word, it is taken as a word; where the first word stock and the second word stock both give single characters, they are combined into words; consecutive single characters in the second word stock are merged into a word if they form a word in the corresponding deep learning result; and a single character in the second word stock is merged forwards or backwards into a word if it belongs to a word in the corresponding deep learning result;

$$P(W_n W_{n+1}) < P(W_n)\,P(W_{n+1})$$

where $W_n$ is the nth character in the scanned sentence

$P(W_n)$:

$P(W_n W_{n+1})$:

And in the second step, the forgetting curve adopted by the forgetting algorithm is a Newton cooling curve.

2. The Chinese word segmentation method based on deep learning and forgetting algorithm according to claim 1, wherein the deep learning word segmentation method of step one adopts an RNN method.

3. The Chinese word segmentation method based on deep learning and forgetting algorithm according to claim 1 or 2, wherein the deep learning word segmentation method of step one adopts an LSTM model within the RNN method.