CN110750986B - Neural network word segmentation system and training method based on minimum information entropy - Google Patents


Info

Publication number: CN110750986B
Application number: CN201810724646.2A
Authority: CN (China)
Other versions: CN110750986A (in Chinese)
Inventor: 张鹏
Assignee (current and original): Potevio Information Technology Co Ltd
Application filed by Potevio Information Technology Co Ltd
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the invention provides a neural network word segmentation system and a training method based on minimum information entropy. The system comprises a convolutional neural network, a two-way long-short-term memory neural network, a first word stock prediction layer and a minimum information entropy word stock prediction layer, wherein: the convolutional neural network is used for extracting the feature vector of the input text and outputting it to the two-way long-short-term memory neural network; the two-way long-short-term memory neural network is used for reading the context information of the feature vector and outputting it to the first word stock prediction layer and the minimum information entropy word stock prediction layer; the first word stock prediction layer is used for calculating and outputting the label of each word according to the first word stock; the minimum information entropy word stock prediction layer is used for calculating and outputting the label of each word according to the minimum information entropy word stock. By adding the minimum information entropy word stock prediction layer to the neural network word segmentation system, the embodiment of the invention improves the word segmentation system's ability to recognize unregistered words and thereby improves word segmentation accuracy.

Description

Neural network word segmentation system and training method based on minimum information entropy
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to a neural network word segmentation system and a training method based on minimum information entropy.
Background
With the advent of deep learning, neural network word segmentation systems have become a research hotspot. Statistics-based word segmentation methods have been adopted by various large companies. A general neural network word segmentation system framework is shown in fig. 1; referring to fig. 1, the word segmentation system comprises a CNN (Convolutional Neural Network), a BLSTM (Bidirectional Long Short-Term Memory, i.e. a two-way long-short-term memory neural network) and a prediction layer, where the BLSTM comprises a forward LSTM and a backward LSTM, and the prediction layer is a CRF (Conditional Random Field).
In a word segmentation system based on this framework, Chinese characters are input one by one; each character is converted into a multidimensional vector by a word2vector (word-to-vector) tool, and the character vectors are input, sentence by sentence, into the convolutional neural network for feature extraction. A typical convolutional neural network outputs a feature map to a fully connected layer after multiple layers of convolution and pooling operations; the fully connected layer combines the groups of features into a set of vectors, which are input into the BLSTM for learning. The BLSTM is stacked from two LSTM networks, one reading the text in the forward direction and the other in the reverse direction, so that the context information of the text can be acquired from both sides. The BLSTM outputs its result to the CRF (conditional random field) layer, which computes the output vector of the BLSTM through conditional random field theory to obtain the label of each character, such as S for a single-character word and B for the beginning of a word. After each character is labeled in this way, the word segmentation result is obtained.
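The labeling scheme just described fixes the segmentation uniquely: a word ends at every character tagged S or E. A minimal sketch of that decoding step (the example inputs are illustrative):

```python
def tags_to_words(chars, tags):
    """Recover the word segmentation from per-character labels.

    Labels follow the scheme in the text: S marks a single-character
    word, B the first character of a word, I a middle character,
    and E the last character.
    """
    words, buffer = [], []
    for char, tag in zip(chars, tags):
        buffer.append(char)
        if tag in ("S", "E"):       # a word ends at this character
            words.append("".join(buffer))
            buffer = []
    if buffer:                      # tolerate a truncated tag sequence
        words.append("".join(buffer))
    return words

# "我爱北京" ("I love Beijing") tagged S S B E → 我 / 爱 / 北京
print(tags_to_words(list("我爱北京"), list("SSBE")))  # ['我', '爱', '北京']
```

The same routine works for any tag sequence the prediction layers emit, which is why the labeling convention is all the prediction layers need to agree on.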
However, the statistics-based word segmentation method depends heavily on the corpus and can hardly segment words that do not exist in the corpus, so the accuracy of the word segmentation result is low.
Disclosure of Invention
The embodiment of the invention provides a neural network word segmentation system and a training method based on minimum information entropy, which are used for solving the problems of low word segmentation result accuracy and the like in the prior art.
In one aspect, an embodiment of the present invention provides a neural network word segmentation system based on minimum information entropy, where the system includes: the system comprises a convolutional neural network, a two-way long-short-term memory neural network, a first word stock prediction layer and a minimum information entropy word stock prediction layer, wherein:
the convolutional neural network is used for extracting the characteristic vector of the input text and outputting the characteristic vector to the two-way long-short-term memory neural network;
the two-way long-short-term memory neural network is used for receiving the feature vector, reading the context information, removing redundancy, and outputting the context information to the first word stock prediction layer and the minimum information entropy word stock prediction layer;
the first word stock prediction layer is used for receiving the feature vector output by the two-way long-short term memory neural network and calculating and outputting the label of each word of the input text according to the first word stock;
the minimum information entropy word stock prediction layer is used for receiving the feature vector output by the two-way long-short term memory neural network and calculating and outputting the label of each word of the input text according to the minimum information entropy word stock.
On the other hand, the embodiment of the invention provides a training method of a neural network word segmentation system based on minimum information entropy, which comprises the following steps:
calculating a loss function L_total of the neural network word segmentation system based on the minimum information entropy, the calculation formula being as follows:
L_total = L_C + L_C1 + L_Fab
where L_C is the loss function corresponding to the first word stock, L_C1 is the loss function corresponding to the minimum information entropy word stock, and L_Fab is the negative of the point-wise information of the word segmentation result output through the first word stock;
minimizing the loss function L_total of the neural network word segmentation system based on minimum information entropy to obtain a converged neural network word segmentation system based on minimum information entropy.
In another aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the steps of the training method of the neural network word segmentation system based on minimum information entropy as described above when the processor executes the program.
In another aspect, an embodiment of the present invention further provides a non-transitory computer readable storage medium, on which a computer program is stored, where the program when executed by a processor implements the steps of the training method of the neural network word segmentation system based on minimum information entropy as described above.
According to the embodiment of the invention, the minimum information entropy word stock prediction layer is added in the neural network word segmentation system, so that the word segmentation system has the minimum information entropy characteristic, the word segmentation with the minimum information entropy rule can be carried out on the input text, the recognition capability of the unregistered word is improved, and the word segmentation accuracy is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a conventional neural network word segmentation system;
fig. 2 is a schematic structural diagram of a neural network word segmentation system based on minimum information entropy according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a neural network word segmentation system based on minimum information entropy according to another embodiment of the present invention;
fig. 4 is a flowchart of a training method of a neural network word segmentation system based on minimum information entropy according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for constructing a minimum information entropy word stock according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 2 shows a schematic structural diagram of a neural network word segmentation system based on minimum information entropy according to an embodiment of the present invention.
As shown in fig. 2, the neural network word segmentation system based on minimum information entropy provided by the embodiment of the invention includes:
the method comprises a convolutional neural network 11, a two-way long-short term memory neural network 12, a first word stock prediction layer 13 and a minimum information entropy word stock prediction layer 14, wherein:
the convolutional neural network 11 is configured to extract a feature vector of an input text, and output the feature vector to the bidirectional long-short-term memory neural network;
the two-way long-short term memory neural network 12 is configured to receive the feature vector, read the context information and output the context information to the first word stock prediction layer and the minimum information entropy word stock prediction layer after redundancy removal;
the first word stock prediction layer 13 is configured to receive the feature vector output by the two-way long-short term memory neural network, and calculate and output a label of each word of the input text according to a first word stock;
the minimum information entropy lexicon prediction layer 14 is configured to receive the feature vector output by the two-way long-short term memory neural network, and calculate and output a label of each word of the input text according to the minimum information entropy lexicon.
The first word stock is an existing corpus, and the first word stock prediction layer performs word segmentation according to this existing corpus. The minimum information entropy word stock is a corpus containing minimum information entropy features, and the minimum information entropy word stock prediction layer is therefore able to predict words that are not registered in the existing word stock. For example, the label of each word in "I love Beijing" can be obtained through the existing word stock, whereas the labels in "Beijing roast duck" cannot be accurately identified through the existing word stock but can be accurately identified through the minimum information entropy word stock.
According to the embodiment of the invention, the minimum information entropy word stock prediction layer is added in the neural network word segmentation system, so that the word segmentation system has the minimum information entropy characteristic, the word segmentation with the minimum information entropy rule can be carried out on the input text, the recognition capability of the unregistered word is improved, and the word segmentation accuracy is further improved.
Specifically, the minimum information entropy prediction layer comprises a fully connected layer and a conditional random field layer. The fully connected layer is used for receiving the feature vector output by the two-way long-short-term memory neural network, classifying it, and outputting it to the conditional random field layer; the conditional random field layer is used for computing, according to the minimum information entropy word stock, the feature vector output by the fully connected layer, and outputting the label of each word of the input text.
The minimum information entropy prediction layer provided by the embodiment of the invention consists of a full connection layer and a conditional random field layer, and the network structure is consistent with the original word stock prediction layer.
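A conditional random field layer scores a whole tag sequence, combining per-character emission scores (here standing in for the fully connected layer's output) with tag-to-tag transition scores, and returns the best-scoring path. A minimal Viterbi sketch of that computation; all score values below are made up for illustration:

```python
def viterbi(emissions, transitions, tags):
    """Best tag path under a linear-chain CRF-style score: the sum of
    per-character emission scores and tag-to-tag transition scores.
    Transitions absent from the table get a prohibitively low score."""
    scores = dict(emissions[0])          # best score of a path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: scores[p] + transitions.get((p, tag), -1e9))
            new_scores[tag] = scores[prev] + transitions.get((prev, tag), -1e9) + emission[tag]
            pointers[tag] = prev
        scores = new_scores
        backpointers.append(pointers)
    best = max(tags, key=lambda t: scores[t])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Only valid S/B/I/E orderings are allowed (score 0); everything else is blocked
TAGS = ["S", "B", "I", "E"]
ALLOWED = {("S", "S"), ("S", "B"), ("E", "S"), ("E", "B"),
           ("B", "I"), ("B", "E"), ("I", "I"), ("I", "E")}
TRANSITIONS = {pair: 0.0 for pair in ALLOWED}

# Toy emission scores for a four-character sentence such as "I love Beijing"
emissions = [{"S": 2, "B": 1, "I": 0, "E": 0},
             {"S": 2, "B": 1, "I": 0, "E": 0},
             {"S": 0, "B": 2, "I": 0, "E": 0},
             {"S": 0, "B": 0, "I": 0, "E": 2}]
print(viterbi(emissions, TRANSITIONS, TAGS))  # ['S', 'S', 'B', 'E']
```

In the real system both emission and transition weights are learned during training; the hard-coded transition table here only illustrates how the CRF layer keeps the output consistent with the S/B/I/E scheme.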
Specifically, the minimum information entropy word stock prediction layer outputs the label S for a single-character word, B for the first character of a multi-character word, I for a middle character of a multi-character word, and E for the last character of a multi-character word.
For example, the single character "I" is given the "S" tag, and the four characters of "Beijing roast duck" are labeled "B I I E" by the minimum information entropy prediction layer.
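The labeling rule just stated is mechanical; a short sketch of applying it to an already-segmented word list:

```python
def words_to_tags(words):
    """Apply the labeling scheme above: S for a single-character word,
    otherwise B for the first character, I for each middle character,
    and E for the last character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return "".join(tags)

print(words_to_tags(["我", "爱", "北京"]))  # SSBE
print(words_to_tags(["北京烤鸭"]))          # BIIE
```

Running this rule over a segmented corpus is exactly how labeled training examples for the prediction layers are produced.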
Fig. 3 shows a schematic structural diagram of a neural network word segmentation system based on minimum information entropy according to an embodiment of the present invention.
Referring to fig. 3, the neural network word segmentation system based on minimum information entropy provided by an embodiment of the present invention includes: a CNN, a forward LSTM, a backward LSTM, an original word stock prediction layer and a minimum information entropy word stock prediction layer. Taking the input text "I love Beijing" as an example, the input text is labeled "SSBE" by both the original word stock prediction layer and the minimum information entropy word stock prediction layer.
The embodiment of the invention also provides a training method of the neural network word segmentation system based on the minimum information entropy.
Fig. 4 shows a flowchart of a training method of a neural network word segmentation system based on minimum information entropy according to an embodiment of the present invention.
As shown in fig. 4, the training method of the neural network word segmentation system based on the minimum information entropy provided by the embodiment of the invention specifically includes the following steps:
s21, calculating a loss function L of the neural network word segmentation system based on the minimum information entropy total The calculation formula is as follows:
L total =L C +L C1 +L Fab
wherein L is C For the loss function corresponding to the first word stock, L C1 A loss function corresponding to the minimum information entropy word stock, L Fab The opposite number of the point information of the word segmentation result output through the first word stock;
the embodiment of the invention modifies the loss function of the original neural network word segmentation system, wherein the modified loss function comprises the loss function corresponding to the original word stock prediction layer and the loss function corresponding to the added minimum information entropy prediction layer, and the added point information (cross entropy) is commonly used as the loss function of the neural network system.
S22, minimizing the loss function L_total of the neural network word segmentation system based on the minimum information entropy to obtain a converged neural network word segmentation system based on the minimum information entropy.
Specifically, according to the embodiment of the invention, the word segmentation system with the minimum loss function is obtained as an optimized system by adjusting each parameter in the word segmentation network.
According to the training method of the neural network word segmentation system based on the minimum information entropy, which is provided by the embodiment of the invention, the loss function of the neural network word segmentation system based on the minimum information entropy is calculated, and the optimized neural network word segmentation system is obtained by minimizing the loss function, so that the word segmentation system has the word segmentation capability through the minimum information entropy word stock.
Specifically, the loss function corresponding to the first word stock is L_C = -∑_{i,x∈C} log p(y|x; W, b);
the loss function corresponding to the minimum information entropy word stock is L_C1 = -∑_{i,x∈C1} log p(y|z; W, b);
the negative of the point-wise information is L_Fab = -∑ F_ab, with F_ab = log(p_ab / (p_a · p_b));
where p(y|x; W, b) denotes the probability of outputting the tag y given the input word vector x, the weight W and the bias b;
p(y|z; W, b) denotes the probability of outputting the tag y given the input word vector z, the weight W and the bias b;
the larger the value of F_ab, the smaller the information entropy after word a and word b are combined together, and p_ab denotes the probability that word a and word b occur in combination.
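Putting the three loss terms together, with F_ab taken in the point-wise mutual information form implied by the definitions above (an assumption of this sketch; the toy probabilities are illustrative):

```python
import math

def negative_log_likelihood(probabilities):
    # L_C and L_C1: the negative sum of log p(y|x; W, b) over labelled examples
    return -sum(math.log(p) for p in probabilities)

def pointwise_information(p_ab, p_a, p_b):
    # F_ab = log(p_ab / (p_a * p_b)); a larger value means a smaller
    # information entropy after combining word a and word b
    return math.log(p_ab / (p_a * p_b))

def total_loss(probs_first, probs_min_entropy, pmi_terms):
    # L_total = L_C + L_C1 + L_Fab, where L_Fab is the negative of the
    # summed point-wise information of the first word stock's output
    return (negative_log_likelihood(probs_first)
            + negative_log_likelihood(probs_min_entropy)
            - sum(pmi_terms))

# Perfectly confident predictions and zero point-wise information give zero loss
print(total_loss([1.0], [1.0], [0.0]))  # 0.0
```

Training then amounts to driving this quantity down over the whole corpus; the higher the point-wise information of the produced segmentation, the lower L_total.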
Specifically, the network structures of the minimum information entropy prediction layer and the original word stock prediction layer are identical, but since their prediction targets differ, their internal weights differ. Thus, by improving the output layer, the network gains the function of predicting the minimum information entropy word stock; and since that word stock is obtained based on the minimum entropy principle, the vectors and weights learned by the shared parts, namely the CNN and the BLSTM, also acquire this capability.
The purpose of minimizing the loss function is to obtain a usable, converged system. Specifically, the embodiment of the invention adopts the Adam algorithm to iteratively adjust parameters such as the weight W and the bias b, continuously optimizing and training the neural network word segmentation system to obtain a final word segmentation system capable of predicting with the minimum information entropy word stock.
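A minimal scalar sketch of the Adam updates used to adjust the parameters; the toy objective (w − 3)² merely stands in for L_total:

```python
import math

def adam_minimize(gradient, w, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Standard Adam parameter updates on a single scalar parameter."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimizing the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = adam_minimize(lambda w: 2 * (w - 3), 0.0)
```

In the real system the same update rule is applied element-wise to W and b, with the gradient of L_total obtained by backpropagation through the CRF, BLSTM and CNN layers.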
The embodiment of the invention also provides a method for constructing the minimum information entropy word stock.
Fig. 5 shows a flowchart of a method for constructing a minimum information entropy word stock according to an embodiment of the present invention.
As shown in fig. 5, the method for constructing the minimum information entropy word stock provided by the embodiment of the invention specifically includes the following steps:
s31, counting the probability p of the occurrence of k words after combination a Wherein k is greater than or equal to 2, and k is an integer;
s32, word segmentation is carried out in the corpus according to the probability of occurrence of the k words after combination, and a word segmentation result with the minimum entropy after combination of the k words is obtained;
and S33, marking the word segmentation result to obtain a minimum information entropy word stock.
The embodiment of the invention counts the probability of each combination of words occurring and performs word segmentation in the corpus according to the counted probabilities; for example, combinations whose frequency of occurrence is smaller than a certain threshold are split apart. The text is scanned and statistics are gathered to obtain the word segmentation result with minimum information entropy, and that result is then labeled; for example, the four characters of "Beijing roast duck" are labeled "B I I E" and added to the minimum information entropy word stock. By continuously adding newly labeled segmentations, the minimum information entropy word stock is obtained.
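The counting step can be sketched for the case k = 2, scoring each adjacent character pair with the point-wise information F from the text; the two-sentence corpus below is illustrative only:

```python
import math
from collections import Counter

def pair_scores(corpus):
    """Score every adjacent character pair ab with
    F = log(p_ab / (p_a * p_b)): a high score means the pair lowers
    the information entropy when kept together, making it a candidate
    entry for the minimum information entropy word stock."""
    char_counts, pair_counts = Counter(), Counter()
    n_chars = n_pairs = 0
    for sentence in corpus:
        char_counts.update(sentence)
        n_chars += len(sentence)
        for a, b in zip(sentence, sentence[1:]):
            pair_counts[a + b] += 1
            n_pairs += 1
    scores = {}
    for pair, count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = char_counts[pair[0]] / n_chars
        p_b = char_counts[pair[1]] / n_chars
        scores[pair] = math.log(p_ab / (p_a * p_b))
    return scores

scores = pair_scores(["我爱北京", "北京烤鸭"])
# "北京" occurs in both sentences, so it scores well above zero
```

In a full construction pass, pairs (or longer combinations) whose score falls below a threshold are split apart, the surviving combinations are labeled with the S/B/I/E scheme, and the labeled results are accumulated into the word stock.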
According to the method for constructing the word stock with the minimum information entropy, provided by the embodiment of the invention, the word segmentation result with the minimum information entropy is calculated, and the word segmentation result with the minimum entropy is marked, so that the word stock with the minimum information entropy can be obtained, the capability of identifying the unregistered words is realized, and the word segmentation accuracy is improved.
On the basis of the above embodiment, S32 specifically includes:
calculation ofF is made to a The maximum word segmentation result is the word segmentation result with the minimum entropy after combining k characters;
wherein p is a For the probability of occurrence of k words after combining, p i Is the probability of occurrence of a single word of the k words.
Specifically, the larger the value of F_a, the more the information entropy is reduced after combination; therefore, the embodiment of the invention obtains the word segmentation result with minimum entropy by maximizing F_a.
On the basis of the above embodiment, S33 specifically includes:
the single word of the segmented word is marked as a label S, the first word of the segmented multi-word is marked as a label B, the middle word of the segmented multi-word is marked as a label I, and the last word of the segmented multi-word is marked as a label E.
The embodiment of the invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method as shown in fig. 4 when executing the program.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
As shown in fig. 6, the electronic device provided by the embodiment of the present invention includes a memory 41, a processor 42, a bus 43, and a computer program stored on the memory 41 and executable on the processor 42. Wherein the memory 41 and the processor 42 communicate with each other via the bus 43.
The processor 42 is configured to invoke program instructions in the memory 41 to implement the method of fig. 4 when executing the program.
For example, the processor, when executing the program, implements the following method:
calculating a loss function of the neural network word segmentation system based on the minimum information entropy;
and minimizing the loss function of the neural network word segmentation system based on the minimum information entropy to obtain a converged neural network word segmentation system based on the minimum information entropy.
According to the electronic equipment provided by the embodiment of the invention, the optimized neural network word segmentation system is obtained by minimizing the loss function, so that the word segmentation system has the word segmentation capability through the minimum information entropy word stock.
Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of fig. 4.
For example, the processor, when executing the program, implements the following method:
calculating a loss function of the neural network word segmentation system based on the minimum information entropy;
and minimizing the loss function of the neural network word segmentation system based on the minimum information entropy to obtain a converged neural network word segmentation system based on the minimum information entropy.
The non-transitory computer readable storage medium provided by the embodiment of the invention obtains the optimized neural network word segmentation system by minimizing the loss function, so that the word segmentation system has the word segmentation capability through the minimum information entropy word stock.
An embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the above-described method embodiments, for example comprising:
calculating a loss function of the neural network word segmentation system based on the minimum information entropy;
and minimizing the loss function of the neural network word segmentation system based on the minimum information entropy to obtain a converged neural network word segmentation system based on the minimum information entropy.
Those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A neural network word segmentation system based on minimum information entropy, the system comprising:
the system comprises a convolutional neural network, a two-way long-short-term memory neural network, a first word stock prediction layer and a minimum information entropy word stock prediction layer, wherein:
the convolutional neural network is used for extracting the characteristic vector of the input text and outputting the characteristic vector to the two-way long-short-term memory neural network;
the two-way long-short-term memory neural network is used for receiving the feature vector, reading the context information, removing redundancy and outputting the context information to the first word stock prediction layer and the minimum information entropy word stock prediction layer;
the first word stock prediction layer is used for receiving the feature vector output by the two-way long-short term memory neural network and calculating and outputting the label of each word of the input text according to the first word stock;
the minimum information entropy word stock prediction layer is used for receiving the feature vector output by the two-way long-short-term memory neural network, calculating and outputting the label of each word of the input text according to the minimum information entropy word stock;
the method for constructing the minimum information entropy word stock comprises the following steps of:
counting the probability p_a that k words occur in combination, wherein k is greater than or equal to 2, and k is an integer;
word segmentation is carried out in the corpus according to the probability of occurrence of the k words after combination, and a word segmentation result with minimum entropy after combination of the k words is obtained;
marking the word segmentation result to obtain a minimum information entropy word stock;
the word segmentation result with the minimum entropy after combining the k words comprises the following steps:
calculating F_a = log(p_a / (p_1 · p_2 · ... · p_k)); the word segmentation result that maximizes F_a is the word segmentation result with minimum entropy after the k characters are combined;
where p_a is the probability that the k words occur in combination, and p_i is the probability of occurrence of the i-th single word among the k words.
2. The system of claim 1, wherein the minimum information entropy prediction layer comprises a fully connected layer and a conditional random field layer, the fully connected layer is used for receiving the feature vector output by the two-way long-short-term memory neural network, and outputting the feature vector to the conditional random field layer after classification; the conditional random field layer is used for calculating the feature vector output by the conditional random field according to the minimum information entropy word stock and outputting the label of each word of the input text.
3. The system of claim 1, wherein the minimum information entropy word stock prediction layer is configured to output the label S for a single-character word, the label B for the first character of a multi-character word, the label I for a middle character of a multi-character word, and the label E for the last character of a multi-character word.
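The S/B/I/E labelling scheme described above can be sketched directly; the example segmentation below is illustrative:

```python
def bies_tags(segmented_words):
    # One tag per character: S for a single-character word; B, I, E for the
    # first, middle, and last characters of a multi-character word.
    tags = []
    for word in segmented_words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags

# "神经网络分词" segmented as two two-character words and two single characters.
tags = bies_tags(["神经", "网络", "分", "词"])
# -> ["B", "E", "B", "E", "S", "S"]
```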
4. A training method for a neural network word segmentation system based on minimum information entropy according to any one of claims 1 to 3, the method comprising:
calculating a loss function L_total of the neural network word segmentation system based on minimum information entropy, the calculation formula being:
L_total = L_C + L_C1 + L_Fab
wherein L_C is the loss function corresponding to the first word stock, L_C1 is the loss function corresponding to the minimum information entropy word stock, and L_Fab is the opposite number of the pointwise mutual information of the word segmentation result output through the first word stock;
minimizing the loss function L_total of the neural network word segmentation system based on minimum information entropy to obtain a converged neural network word segmentation system based on minimum information entropy.
5. The method of claim 4, wherein:
the loss function L corresponding to the first word stock C =-∑ i,x∈C logp(y|x;W,b);
The loss function L corresponding to the minimum information entropy word stock C1 =-∑ i,x∈C1 logp(y|z;W,b);
The inverse number of the point information
Wherein p(y|x; W, b) represents the probability of outputting the label y given the input word vector x, the weight W, and the bias b;
p(y|z; W, b) represents the probability of outputting the label y given the input word vector z, the weight W, and the bias b;
The larger the value of F_ab is, the smaller the information entropy after the character a and the character b are combined; p_ab represents the probability of occurrence of the characters a and b after combination, and p_a and p_b represent the probabilities of occurrence of the character a and of the character b individually.
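Putting claims 4 and 5 together, the total loss is two negative log-likelihood terms plus the negated pointwise mutual information. A minimal numeric sketch, assuming made-up per-character tag probabilities and a single predicted two-character word (all values illustrative, not from the patent):

```python
import math

def nll(gold_tag_probs):
    # -sum(log p(y | x; W, b)) over the gold tags of one sentence.
    return -sum(math.log(p) for p in gold_tag_probs)

def neg_pmi(p_ab, p_a, p_b):
    # L_Fab contribution: -F_ab, with F_ab = log(p_ab / (p_a * p_b)).
    return -math.log(p_ab / (p_a * p_b))

L_C   = nll([0.90, 0.80, 0.95])   # first word stock head, p(y|x; W, b)
L_C1  = nll([0.85, 0.90, 0.90])   # min-entropy word stock head, p(y|z; W, b)
L_Fab = neg_pmi(p_ab=0.006, p_a=0.010, p_b=0.008)  # one output word "ab"
L_total = L_C + L_C1 + L_Fab      # the quantity minimized during training
```

A strongly associated pair (p_ab far above p_a · p_b) makes L_Fab negative, so segmentations that keep such pairs together lower L_total.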
6. The method of claim 4, wherein the labeling of the word segmentation result to obtain the minimum information entropy word stock comprises:
marking a single-character word of the segmentation result with the label S, marking the first character of a multi-character word with the label B, marking a middle character of a multi-character word with the label I, and marking the last character of a multi-character word with the label E.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when executing the program, the processor implements the steps of the training method for the neural network word segmentation system based on minimum information entropy according to any one of claims 4 to 6.
8. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training method for the neural network word segmentation system based on minimum information entropy according to any one of claims 4 to 6.
CN201810724646.2A 2018-07-04 2018-07-04 Neural network word segmentation system and training method based on minimum information entropy Active CN110750986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810724646.2A CN110750986B (en) 2018-07-04 2018-07-04 Neural network word segmentation system and training method based on minimum information entropy


Publications (2)

Publication Number Publication Date
CN110750986A CN110750986A (en) 2020-02-04
CN110750986B true CN110750986B (en) 2023-10-10

Family

ID=69274634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810724646.2A Active CN110750986B (en) 2018-07-04 2018-07-04 Neural network word segmentation system and training method based on minimum information entropy

Country Status (1)

Country Link
CN (1) CN110750986B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6505151B1 (en) * 2000-03-15 2003-01-07 Bridgewell Inc. Method for dividing sentences into phrases using entropy calculations of word combinations based on adjacent words
CN102193929A (en) * 2010-03-08 2011-09-21 阿里巴巴集团控股有限公司 Method and equipment for determining word information entropy and searching by using word information entropy
CN105630890A (en) * 2015-12-18 2016-06-01 北京中科汇联科技股份有限公司 Neologism discovery method and system based on intelligent question-answering system session history
CN105740226A (en) * 2016-01-15 2016-07-06 南京大学 Method for implementing Chinese segmentation by using tree neural network and bilateral neural network
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression
CN108038209A (en) * 2017-12-18 2018-05-15 深圳前海微众银行股份有限公司 Answer system of selection, device and computer-readable recording medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Hai Zhao et al. Integrating unsupervised and supervised word segmentation: The role of goodness measures. Information Sciences. 2011, Vol. 181. *
Zhang Huaping; Shang Jianyun. Open-domain new word discovery for social media. Journal of Chinese Information Processing (中文信息学报). 2017, Vol. 31, No. 03. *
Hu Jie; Zhang Junchi. A bidirectional recurrent network model for Chinese word segmentation. Journal of Chinese Computer Systems (小型微型计算机系统). 2017, No. 03. *
Deng Liping; Luo Zhiyong. Cross-domain Chinese word segmentation based on semi-supervised CRF. Journal of Chinese Information Processing (中文信息学报). 2017, Vol. 31, No. 04. *

Also Published As

Publication number Publication date
CN110750986A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
US11232141B2 (en) Method and device for processing an electronic document
US20190377794A1 (en) Method and apparatus for determining user intent
CN110298035B (en) Word vector definition method, device, equipment and storage medium based on artificial intelligence
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN110348012B (en) Method, device, storage medium and electronic device for determining target character
US20220245347A1 (en) Entity recognition method, apparatus, electronic device and computer readable storage medium
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN114416979A (en) Text query method, text query equipment and storage medium
CN113657274A (en) Table generation method and device, electronic equipment, storage medium and product
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN110298046B (en) Translation model training method, text translation method and related device
Nabil et al. Cufe at semeval-2016 task 4: A gated recurrent model for sentiment classification
CN110889287A (en) Method and device for named entity recognition
CN110750986B (en) Neural network word segmentation system and training method based on minimum information entropy
Jiang et al. High precision deep learning-based tabular position detection
CN113033205B (en) Method, device, equipment and storage medium for entity linking
CN115359296A (en) Image recognition method and device, electronic equipment and storage medium
US20210334647A1 (en) Method, electronic device, and computer program product for determining output of neural network
CN114882334A (en) Method for generating pre-training model, model training method and device
CN113962221A (en) Text abstract extraction method and device, terminal equipment and storage medium
CN110969016B (en) Word segmentation processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant