CN111090981A - Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long short-term memory network - Google Patents


Info

Publication number
CN111090981A
CN111090981A (application number CN201911241042.3A)
Authority
CN
China
Prior art keywords
sentence
punctuation
chinese text
marks
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911241042.3A
Other languages
Chinese (zh)
Other versions
CN111090981B (en)
Inventor
屈丹
杨绪魁
张文林
司念文
陈琦
牛铜
闫红刚
张连海
李�真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Original Assignee
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force and Zhengzhou Xinda Institute of Advanced Technology
Priority to CN201911241042.3A
Publication of CN111090981A
Application granted
Publication of CN111090981B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The invention belongs to the technical field of natural language processing and discloses a method and system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network. The method comprises the following steps: processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character; using a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model; adopting a log-likelihood loss function and improving it by adding a long-sentence penalty factor, then training on the labeled Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the model. The system comprises a corpus processing module, a network structure selection module, and a model construction and optimization module. The invention solves the problems that voice-transcribed text cannot be automatically broken into sentences and lacks punctuation marks.

Description

Method and system for building a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method and system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network.
Background
Existing methods for automatic sentence breaking and punctuation of text fall into two main lines of work. One line focuses on sentence breaking and punctuation of English text; Chinese text (such as ancient Chinese) has been partially studied, but mostly with traditional statistical machine learning models (such as conditional random fields). Such methods require manual feature design, have low accuracy, and implement only functions related to automatic sentence breaking, with little or no support for automatically adding punctuation (chenxiao, kowning, slow wave …). The other line focuses on post-processing of voice-transcribed text. For example, the invention patent with publication number CN 102231278A determines the punctuation type to add at the current position by combining the inter-sentence pause duration (with a set threshold) with the classification function of an added classifier; as a result, the sentence-breaking and punctuation functions have long delay and poor real-time performance, and the punctuation-adding model is complex.
Disclosure of Invention
Aiming at the problems that voice-transcribed text cannot be automatically broken into sentences and lacks punctuation marks, the invention provides a method and system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network comprises the following steps:
Step 1: process the Chinese text corpus, remove useless symbols, and add a designed label to each character;
Step 2: use a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
Step 3: adopt a log-likelihood loss function and improve it by adding a long-sentence penalty factor; train on the labeled Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Further, the step 1 comprises:
Step 1.1: retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
Step 1.2: replace pause marks, colons, dashes and connection marks in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book title marks and interval marks;
Step 1.3: retain the four arithmetic operators and Greek letters in the corpus;
Step 1.4: add to each input character a label indicating what immediately follows the character: non-punctuation, a comma, a period, a question mark or an exclamation mark.
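As an illustrative sketch (not part of the patent text), the normalization in steps 1.1 and 1.2 can be implemented as a character-by-character mapping; the exact symbol sets below are assumptions inferred from the description:

```python
# Illustrative sketch of the corpus normalization in steps 1.1-1.2.
# The symbol sets are assumptions inferred from the description above.
TO_COMMA = set("、：—")        # pause mark, colon, dash -> comma (connection marks likewise)
TO_PERIOD = set("；…")         # semicolon, ellipsis -> period
DROP = set("“”‘’（）《》·")    # quotation marks, brackets, book title marks, interval mark

def normalize(text: str) -> str:
    """Map corpus punctuation onto the four retained full-width symbols."""
    out = []
    for ch in text:
        if ch in TO_COMMA:
            out.append("，")
        elif ch in TO_PERIOD:
            out.append("。")
        elif ch in DROP:
            continue                # removed directly
        else:
            out.append(ch)          # retained punctuation and ordinary characters
    return "".join(out)

print(normalize("《论语》说：学而时习之；不亦说乎？"))
```

After this pass, only commas, periods, question marks and exclamation marks remain as punctuation, which matches the four label classes used later.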
Further, the step 3 comprises:
A log-likelihood loss function with an added L2 regularization term is used:

L(θ) = −(1/k) Σ_{i=1}^{k} log P(y^(i) | x^(i); θ) + (λ/2) ‖θ‖²

where x^(i) denotes the i-th sentence, 1 ≤ i ≤ N, N is the total number of sentences in the corpus, k is the number of sentences in a batch, 1 ≤ k ≤ N, P(y^(i) | x^(i); θ) is the probability of the tag sequence y^(i) corresponding to x^(i), θ is the set of model parameters, and λ is the L2 regularization parameter;
in sentence x^(i), the number of output labels in y^(i) corresponding to commas, periods, question marks and exclamation marks is counted:

num^(i) = Σ_{j=1}^{n} 1[ y_j^(i) ≠ N ]

where n is the number of tags (equal to the sentence length), j is the tag position, and the indicator counts the j-th label of the i-th sentence when it is a punctuation label;
a long-sentence penalty factor β is added to improve the loss function, penalizing the batch-average segment length; the improved loss function is:

L′(θ) = L(θ) + (β/k) Σ_{i=1}^{k} n / (num^(i) + 1)

Training on the labeled Chinese text corpus in both the forward and backward directions, with the goal of minimizing the improved log-likelihood loss function, completes the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
A system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network comprises:
a corpus processing module, used for processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character;
a network structure selection module, used for employing a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
and a model construction and optimization module, used for adopting a log-likelihood loss function improved by adding a long-sentence penalty factor, and for training on the labeled Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Further, the corpus processing module is specifically configured to:
retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
replace pause marks, colons, dashes and connection marks in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book title marks and interval marks;
retain the four arithmetic operators and Greek letters in the corpus;
add to each input character a label indicating what immediately follows the character: non-punctuation, a comma, a period, a question mark or an exclamation mark.
Further, the model construction and optimization module is specifically configured to:
A log-likelihood loss function with an added L2 regularization term is used:

L(θ) = −(1/k) Σ_{i=1}^{k} log P(y^(i) | x^(i); θ) + (λ/2) ‖θ‖²

where x^(i) denotes the i-th sentence, 1 ≤ i ≤ N, N is the total number of sentences in the corpus, k is the number of sentences in a batch, 1 ≤ k ≤ N, P(y^(i) | x^(i); θ) is the probability of the tag sequence y^(i) corresponding to x^(i), θ is the set of model parameters, and λ is the L2 regularization parameter;
in sentence x^(i), the number of output labels in y^(i) corresponding to commas, periods, question marks and exclamation marks is counted:

num^(i) = Σ_{j=1}^{n} 1[ y_j^(i) ≠ N ]

where n is the number of tags (equal to the sentence length), j is the tag position, and the indicator counts the j-th label of the i-th sentence when it is a punctuation label;
a long-sentence penalty factor β is added to improve the loss function, penalizing the batch-average segment length; the improved loss function is:

L′(θ) = L(θ) + (β/k) Σ_{i=1}^{k} n / (num^(i) + 1)

Training on the labeled Chinese text corpus in both the forward and backward directions, with the goal of minimizing the improved log-likelihood loss function, completes the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Compared with the prior art, the invention has the following beneficial effects:
the invention can solve the problems that sentences can not be automatically broken and punctuation marks are lost in the voice transcription text. Through the technical scheme and the implementation method provided by the invention, the voice recognition text can be post-processed, sentences can be automatically broken, 4 common punctuations (commas, periods, question marks and exclamation marks) can be added, and the reading experience of a user can be obviously improved.
The invention regards the automatic punctuation as a standard natural language sequence marking task, adopts the two-way LSTM network to model for the time sequence text sequence, marks each input character, designs five kinds of labels in total, respectively represents the form that the character is followed by the next character: { non-punctuation marks, commas, periods, problems, exclamation marks }, preprocessing an original text in the standard format, making training corpora, inputting the text in multiple fields of time, law, famous novels and the like, about 300M in training, inputting the text into a 2-layer bidirectional LSTM for learning, outputting a label corresponding to each character after repeated iterative optimization, and then restoring punctuation marks to obtain the punctuated text.
Drawings
FIG. 1 is a basic flowchart of the method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the network structure of the automatic sentence-breaking and punctuation generation model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
example 1
As shown in fig. 1, a method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network includes:
Step S101: process the Chinese text corpus, remove useless symbols, and add a designed label to each character;
Step S102: use a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
Step S103: adopt a log-likelihood loss function and improve it by adding a long-sentence penalty factor; train on the labeled Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Specifically, the step S101 includes:
Step S101.1: retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
Step S101.2: replace pause marks, colons, dashes and connection marks in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book title marks and interval marks; the replacement strategy is shown in Table 1:
TABLE 1 Corpus punctuation processing strategy

  Original symbol(s)                                            Processing
  comma, period, question mark, exclamation mark                retained
  pause mark, colon, dash, connection mark                      replaced with comma
  semicolon, ellipsis                                           replaced with period
  quotation marks, brackets, book title marks, interval marks   removed directly
Step S101.3: retain the four arithmetic operators and Greek letters in the corpus;
Step S101.4: add to each input character a label indicating what immediately follows the character: non-punctuation, a comma, a period, a question mark or an exclamation mark.
specifically, five kinds of tags are designed:
label 1: n, corresponding to NONE;
and 2, labeling: c, corresponding to COMMA;
and (3) labeling: p, corresponding to PERIOD;
and (4) labeling: q, corresponding to QUESTION;
and (5) labeling: e, corresponding to EXCLAMATION.
Each of the above labels indicates what immediately follows the character: non-punctuation, a comma, a period, a question mark or an exclamation mark.
The original Chinese text is preprocessed into this standard format to make the training corpus, which is input into the bidirectional long short-term memory network for learning; the label corresponding to each character is output, and punctuation is then restored.
Examples are as follows:
Original text: He points out, this time …
Input: He points out this time …
Labels: N N C N …
Specifically, the step S102 includes:
in a natural language sequence labeling task, a Recurrent Neural Network (RNN) model is often used, wherein a long short-term memory (LSTM) is used as a special type of RNN, and a memory cell and a gate mechanism are introduced into each hidden layer cell to control input and output of information streams, thereby effectively solving the problem of gradient disappearance existing in a common RNN. In contrast, LSTM is more adept at processing serialized data, such as natural language text, and can model a larger range of context information in the sequence.
In the technical scheme adopted by the invention, a bidirectional LSTM (BLSTM) network is used to model the natural language text in both the forward and backward directions, realizing the automatic sentence-breaking and punctuation functions; the specific model structure is shown in figure 2.
In the above network structure, the roles of the layers are as follows:
a) Input layer and Embedding layer: the input layer takes the unpunctuated training corpus as input and converts characters to character indices through the two mappings word2id and id2word. The index sequence follows the ordering of the dictionary obtained when the character vector table is initialized, so indices can be converted to character vectors accordingly. The Embedding layer looks up character vectors by index, converting the input characters into character vectors of uniform dimension that carry rich semantic information. In this model, the character vectors are 300-dimensional and the dictionary size is 14157.
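As an illustrative sketch of the input and Embedding layers (the toy dictionary below is a stand-in for the 14157-entry dictionary; only the 300-dimensional vector size comes from the description):

```python
import numpy as np

# Illustrative sketch: characters -> indices (word2id) -> 300-dim character vectors.
# The vocabulary here is a toy stand-in; the embedding table is random, whereas
# in the model it is learned during training.
vocab = ["<unk>", "他", "指", "出", "这", "次"]
word2id = {w: i for i, w in enumerate(vocab)}
id2word = {i: w for i, w in enumerate(vocab)}
EMB_DIM = 300
emb_table = np.random.default_rng(0).normal(size=(len(vocab), EMB_DIM)).astype(np.float32)

def embed(chars):
    """Convert characters to indices, then look up their character vectors."""
    ids = [word2id.get(c, word2id["<unk>"]) for c in chars]
    return emb_table[ids]                    # shape: (sequence length, 300)

vectors = embed(list("他指出"))
```

The resulting matrix, one 300-dimensional row per input character, is what the forward and backward LSTM layers consume.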
b) Forward and backward LSTM layers: the forward and backward LSTM hidden states are computed separately and then projected to a common output layer. A unidirectional LSTM contains a hidden layer in one direction only, computing the current hidden state h_t from the current input vector x_t and the previous hidden state h_{t−1}. The bidirectional LSTM contains a forward layer and a backward layer, which compute the forward hidden state vector and the backward hidden state vector at the current moment respectively:

h→_t = LSTM(x_t, h→_{t−1})
h←_t = LSTM(x_t, h←_{t+1})

where the hidden vectors lie in R^{m×1}, m is the hidden-unit dimension, and LSTM(·) denotes the nonlinear transformation of the LSTM network, whose main function is to encode the input character vector into the corresponding hidden state vector.
c) Output layer: the forward hidden state vector h→_t and the backward hidden state vector h←_t are linearly combined by weighted summation to obtain the BLSTM hidden vector h_t ∈ R^{m×1}:

h_t = W_1 h→_t + V_1 h←_t + b_1

where W_1 ∈ R^{m×m} and V_1 ∈ R^{m×m} are weight matrices and b_1 ∈ R^{m×1} is the corresponding bias term. The hidden layer aggregates both the forward and backward sequence information about the current element of the input sequence, providing richer context features for labeling.
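A numeric sketch of the output-layer combination above (shapes only; W1, V1 and b1 are random stand-ins for learned parameters):

```python
import numpy as np

# Numeric sketch of the BLSTM output-layer combination:
#   h_t = W1 @ h_fwd + V1 @ h_bwd + b1,  all vectors in R^{m x 1}.
m = 4                                        # hidden-unit dimension (toy value)
rng = np.random.default_rng(42)
W1 = rng.normal(size=(m, m))                 # weight for the forward hidden state
V1 = rng.normal(size=(m, m))                 # weight for the backward hidden state
b1 = np.zeros((m, 1))                        # bias term

def combine(h_fwd, h_bwd):
    """Linearly combine forward and backward hidden states into the BLSTM vector."""
    return W1 @ h_fwd + V1 @ h_bwd + b1

h_t = combine(rng.normal(size=(m, 1)), rng.normal(size=(m, 1)))
```

The combined vector h_t then feeds the per-character label prediction.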
Specifically, the step S103 includes:
given training set
Figure BDA0002306235820000078
Wherein the ith sentence
Figure BDA0002306235820000079
The corresponding tag sequence is y(i)=[y1 (i),y2 (i),...,yn (i)]. In the model training, a log-likelihood loss function is adopted, and an L2 regularization term is added, wherein the loss function is as follows:
Figure BDA00023062358200000710
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in order to improve sentence-breaking quality and reading experience and promote finer sentence-breaking, the shorter the sentence length in the result is, the better the sentence length is, the loss function is improved, and a long sentence punishment factor is added:
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark, i.e. in sentence x(i)In (1), calculating its output label y(i)The number of non-NONE tags in the list, i.e. COMMA, PERIOD, QUESTION, EXCLAMATION:
Figure BDA0002306235820000081
where n represents the number of tags, j represents the tag number,
Figure BDA0002306235820000082
representing the number of jth labels of the ith sentence;
adding the formula into a loss function, adding a long sentence penalty factor β, improving the loss function, and calculating the average sentence length loss together with the batch, wherein the improved loss function is as follows:
Figure BDA0002306235820000083
and constructing a Chinese text automatic sentence-breaking and punctuation generating model by taking the minimized and improved log-likelihood loss function as a target.
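A numeric sketch of the improved loss is shown below. Since the penalty's exact formula appears only in the patent's figures (not reproduced in this text), the batch-average segment-length form n/(num_i + 1) is an assumption consistent with the surrounding description, and the L2 term is omitted for brevity:

```python
# Numeric sketch of the loss with a long-sentence penalty. The penalty form
# (beta times the batch-average segment length n / (num_i + 1)) is an assumption;
# the patent's figure with the exact formula is not reproduced in this text.
def improved_loss(log_probs, num_punct, seq_lens, beta=0.1):
    """log_probs[i] = log P(y_i | x_i); num_punct[i] = punctuation labels in y_i."""
    k = len(log_probs)
    nll = -sum(log_probs) / k                                  # log-likelihood term
    penalty = (beta / k) * sum(n / (c + 1) for n, c in zip(seq_lens, num_punct))
    return nll + penalty

loss = improved_loss([-2.0, -3.0], num_punct=[2, 0], seq_lens=[9, 10])
```

Under this form, a sentence with no predicted punctuation (num_i = 0) contributes its full length as penalty, so the optimizer is pushed toward finer sentence breaks.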
In the training process, a mini-batch gradient descent method is adopted, with k the size of each batch. A Dropout strategy is applied to randomly remove part of the BLSTM hidden-layer units and their weights with a certain probability, preventing overfitting to the training data.
To verify the effect of the present invention, the following experiment was performed:
(1) Obtain an original Chinese text corpus (with punctuation marks): about 300 MB of Chinese text from fields such as current affairs, law and famous literary works is used for training.
(2) Normalize the punctuation marks in the original text, retaining only commas, periods, question marks and exclamation marks; other punctuation marks are mapped to one of these four categories or removed directly. Each character in the normalized text is labeled as one of {N, C, P, Q, E} (representing {NONE, COMMA, PERIOD, QUESTION, EXCLAMATION}), and the labeled text is sent to the BLSTM neural network for training.
(3) The results are tested on another randomly selected set of articles.
The punctuation generation model is trained using the LSTM network in the TensorFlow library; after training, the model is written to a .pb binary file, and the weight data and computation graph are frozen using the freeze_graph tool.
By adopting the automatic sentence-breaking and punctuation generation model, the partial experimental results on the public corpus are as follows:
Table 2. Examples of partial experimental results on the public corpus
[Table image not reproduced in this text.]
In conclusion, the invention solves the problems that voice-transcribed text cannot be automatically broken into sentences and lacks punctuation marks. Through the technical scheme and implementation method provided by the invention, speech recognition text can be post-processed: sentences are broken automatically and the 4 common punctuation marks (comma, period, question mark, exclamation mark) are added, significantly improving the user's reading experience.
The invention treats automatic punctuation as a standard natural language sequence labeling task and models the sequential text with a bidirectional LSTM network, labeling each input character. Five kinds of labels are designed, each indicating what immediately follows the character: {non-punctuation, comma, period, question mark, exclamation mark}. The original text is preprocessed into this standard format to make the training corpus — about 300 MB of text from fields such as current affairs, law and famous literary works — which is input into a 2-layer bidirectional LSTM for learning. After repeated iterative optimization, the label corresponding to each character is output, and the punctuation marks are then restored to obtain the punctuated text.
Example 2
As shown in fig. 3, a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network includes:
the corpus processing module 201, configured to process the Chinese text corpus, remove useless symbols, and add a designed label to each character;
the network structure selection module 202, configured to use a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
the model construction and optimization module 203, configured to adopt a log-likelihood loss function improved by adding a long-sentence penalty factor, and to train on the labeled Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Specifically, the corpus processing module 201 is specifically configured to:
retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
replace pause marks, colons, dashes and connection marks in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book title marks and interval marks;
retain the four arithmetic operators and Greek letters in the corpus;
add to each input character a label indicating what immediately follows the character: non-punctuation, a comma, a period, a question mark or an exclamation mark.
Specifically, the model construction and optimization module 203 is specifically configured to:
A log-likelihood loss function with an added L2 regularization term is used:

L(θ) = −(1/k) Σ_{i=1}^{k} log P(y^(i) | x^(i); θ) + (λ/2) ‖θ‖²

where x^(i) denotes the i-th sentence, 1 ≤ i ≤ N, N is the total number of sentences in the corpus, k is the number of sentences in a batch, 1 ≤ k ≤ N, P(y^(i) | x^(i); θ) is the probability of the tag sequence y^(i) corresponding to x^(i), θ is the set of model parameters, and λ is the L2 regularization parameter;
in sentence x^(i), the number of output labels in y^(i) corresponding to commas, periods, question marks and exclamation marks is counted:

num^(i) = Σ_{j=1}^{n} 1[ y_j^(i) ≠ N ]

where n is the number of tags (equal to the sentence length), j is the tag position, and the indicator counts the j-th label of the i-th sentence when it is a punctuation label;
a long-sentence penalty factor β is added to improve the loss function, penalizing the batch-average segment length; the improved loss function is:

L′(θ) = L(θ) + (β/k) Σ_{i=1}^{k} n / (num^(i) + 1)

The Chinese text automatic sentence-breaking and punctuation generation model is constructed with the goal of minimizing the improved log-likelihood loss function.
The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (6)

1. A method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network, characterized by comprising the following steps:
Step 1: processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character;
Step 2: using a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
Step 3: adopting a log-likelihood loss function and improving it by adding a long-sentence penalty factor; training on the labeled Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
2. The method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to claim 1, wherein the step 1 comprises:
Step 1.1: retaining the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
Step 1.2: replacing pause marks, colons, dashes and connection marks in the corpus with commas, replacing semicolons and ellipses with periods, and directly removing quotation marks, brackets, book title marks and interval marks;
Step 1.3: retaining the four arithmetic operators and Greek letters in the corpus;
Step 1.4: adding to each input character a label indicating what immediately follows the character: non-punctuation, a comma, a period, a question mark or an exclamation mark.
3. The method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to claim 2, wherein the step 3 comprises:
adopting a log-likelihood loss function with an added L2 regularization term:

L(θ) = −(1/k) Σ_{i=1}^{k} log P(y^(i) | x^(i); θ) + (λ/2) ‖θ‖²

where x^(i) denotes the i-th sentence, 1 ≤ i ≤ N, N is the total number of sentences in the corpus, k is the number of sentences in a batch, 1 ≤ k ≤ N, P(y^(i) | x^(i); θ) is the probability of the tag sequence y^(i) corresponding to x^(i), θ is the set of model parameters, and λ is the L2 regularization parameter;
in sentence x^(i), counting the number of output labels in y^(i) corresponding to commas, periods, question marks and exclamation marks:

num^(i) = Σ_{j=1}^{n} 1[ y_j^(i) ≠ N ]

where n is the number of tags (equal to the sentence length), j is the tag position, and the indicator counts the j-th label of the i-th sentence when it is a punctuation label;
adding a long-sentence penalty factor β to improve the loss function, penalizing the batch-average segment length, the improved loss function being:

L′(θ) = L(θ) + (β/k) Σ_{i=1}^{k} n / (num^(i) + 1)

and training on the labeled Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
4. A Chinese text automatic sentence-breaking and punctuation generation model construction system based on a bidirectional long short-term memory network, characterized by comprising:
the corpus processing module is used for processing Chinese text corpora, removing useless symbols and adding a designed label to each character;
the network structure selection module is used for adopting a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
and the model construction and optimization module is used for adopting a log-likelihood loss function, improving it by adding a long-sentence penalty factor, and training the labeled Chinese text corpus in both the forward and backward directions with the objective of minimizing the improved log-likelihood loss function, thereby completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
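The reference network structure selected by the module above can be sketched as a minimal numpy forward pass: a single bidirectional LSTM layer whose per-character outputs are projected onto the five label classes (no punctuation, comma, period, question mark, exclamation mark). The dimensions, initialization, and gate layout below are illustrative assumptions, not the patent's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID, N_LABELS = 8, 16, 5   # assumed embedding size, hidden size, label count

def lstm_pass(xs, W, U, b):
    """Run one LSTM direction over xs; return the hidden state per step."""
    h, c, out = np.zeros(HID), np.zeros(HID), []
    for x in xs:
        z = W @ x + U @ h + b                    # all four gates at once
        i, f, o, g = np.split(z, 4)
        i, f, o = (1 / (1 + np.exp(-v)) for v in (i, f, o))  # gate sigmoids
        g = np.tanh(g)
        c = f * c + i * g                        # cell state update
        h = o * np.tanh(c)
        out.append(h)
    return out

def params():
    """Randomly initialized parameters for one LSTM direction."""
    return (rng.normal(0, 0.1, (4 * HID, EMB)),
            rng.normal(0, 0.1, (4 * HID, HID)),
            np.zeros(4 * HID))

fwd, bwd = params(), params()
W_out = rng.normal(0, 0.1, (N_LABELS, 2 * HID))  # projection to label logits

def tag(char_embeddings):
    """Predict one label index per character from both directions."""
    hf = lstm_pass(char_embeddings, *fwd)
    hb = lstm_pass(char_embeddings[::-1], *bwd)[::-1]
    logits = [W_out @ np.concatenate([f, b]) for f, b in zip(hf, hb)]
    return [int(np.argmax(l)) for l in logits]

sentence = [rng.normal(size=EMB) for _ in range(6)]  # 6 dummy character vectors
preds = tag(sentence)
```

Concatenating the forward and backward hidden states gives every character access to context on both sides, which is what makes the bidirectional structure suitable for deciding whether a punctuation mark follows.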
5. The system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network as claimed in claim 4, wherein said corpus processing module is specifically configured to:
retain commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
replace pause marks, colons, dashes and connection signs in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book title marks and interval marks;
retain the four arithmetic operators and Greek letters in the corpus;
label each input character with a label indicating the punctuation that immediately follows it: no punctuation, comma, period, question mark or exclamation mark.
6. The system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network as claimed in claim 4, wherein said model construction and optimization module is specifically configured to:
a log-likelihood loss function is used, the loss function being:
$$L(\theta) = -\sum_{i=1}^{k} \log P\left(y^{(i)} \mid x^{(i)}; \theta\right) + \lambda \lVert \theta \rVert_2^2$$
wherein $x^{(i)}$ represents the ith sentence, 1 ≤ i ≤ N, N represents the total number of sentences in the corpus, k represents the number of sentences in the batch, 1 ≤ k ≤ N, $P(y^{(i)} \mid x^{(i)}; \theta)$ represents the probability of the label sequence $y^{(i)}$ corresponding to $x^{(i)}$, θ represents the parameter set of the model, and λ represents the L2 regularization parameter;
in sentence $x^{(i)}$, the number of labels in the output label sequence $y^{(i)}$ corresponding to commas, periods, question marks and exclamation marks is calculated as:
$$m_i = \sum_{j=1}^{n} c_j^{(i)}$$

wherein n represents the number of label types, j represents the label index, and $c_j^{(i)}$ represents the count of the jth label in the ith sentence;
adding a long-sentence penalty factor β to improve the loss function, the improved loss function being:

(formula given only as image FDA0002306235810000034 in the original: the loss function above with each sentence's term reweighted by the penalty factor β according to its punctuation count $m_i$)
and training the labeled Chinese text corpus in both the forward and backward directions with the objective of minimizing the improved log-likelihood loss function, thereby completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
CN201911241042.3A 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network Active CN111090981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911241042.3A CN111090981B (en) 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911241042.3A CN111090981B (en) 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Publications (2)

Publication Number Publication Date
CN111090981A true CN111090981A (en) 2020-05-01
CN111090981B CN111090981B (en) 2022-04-15

Family

ID=70394814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911241042.3A Active CN111090981B (en) 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Country Status (1)

Country Link
CN (1) CN111090981B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067514A (en) * 1998-06-23 2000-05-23 International Business Machines Corporation Method for automatically punctuating a speech utterance in a continuous speech recognition system
US20130325442A1 (en) * 2010-09-24 2013-12-05 National University Of Singapore Methods and Systems for Automated Text Correction
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
CN109887499A (en) * 2019-04-11 2019-06-14 中国石油大学(华东) A kind of voice based on Recognition with Recurrent Neural Network is made pauses in reading unpunctuated ancient writings algorithm automatically
US20190188463A1 (en) * 2017-12-15 2019-06-20 Adobe Inc. Using deep learning techniques to determine the contextual reading order in a form document
CN109918666A (en) * 2019-03-06 2019-06-21 北京工商大学 A kind of Chinese punctuation mark adding method neural network based
CN110110335A (en) * 2019-05-09 2019-08-09 南京大学 A kind of name entity recognition method based on Overlay model
CN110245332A (en) * 2019-04-22 2019-09-17 平安科技(深圳)有限公司 Chinese character code method and apparatus based on two-way length memory network model in short-term
US10431210B1 (en) * 2018-04-16 2019-10-01 International Business Machines Corporation Implementing a whole sentence recurrent neural network language model for natural language processing

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
JIANFENG GAO et al.: "Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach", Computational Linguistics *
SI Nianwen: "Research on Sentence-Level Text Processing Technology for the Military Domain", China Master's Theses Full-Text Database, Engineering Science and Technology II *
YING Wenhao et al.: "A Topic-Sensitive Extractive Multi-Document Summarization Method", Journal of Chinese Information Processing *
ZHANG He et al.: "A Cascaded-CRF-Based Method for Sentence Segmentation and Punctuation of Classical Chinese Text", Application Research of Computers *
ZHANG Hui et al.: "Research on Automatic Punctuation Prediction Based on CRF Models", Network New Media Technology *
WANG Boli et al.: "A Recurrent-Neural-Network-Based Method for Sentence Segmentation of Classical Chinese", Acta Scientiarum Naturalium Universitatis Pekinensis *
HU Jie et al.: "A Bidirectional Recurrent Network Model for Chinese Word Segmentation", Journal of Chinese Computer Systems *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723584A (en) * 2020-06-24 2020-09-29 天津大学 Punctuation prediction method based on consideration of domain information
CN111723584B (en) * 2020-06-24 2024-05-07 天津大学 Punctuation prediction method based on consideration field information
CN111951792A (en) * 2020-07-30 2020-11-17 北京先声智能科技有限公司 Punctuation marking model based on grouping convolution neural network
CN111951792B (en) * 2020-07-30 2022-12-16 北京先声智能科技有限公司 Punctuation marking model based on grouping convolution neural network
CN112001167A (en) * 2020-08-26 2020-11-27 四川云从天府人工智能科技有限公司 Punctuation mark adding method, system, equipment and medium
CN112101003A (en) * 2020-09-14 2020-12-18 深圳前海微众银行股份有限公司 Sentence text segmentation method, device and equipment and computer readable storage medium
CN112101003B (en) * 2020-09-14 2023-03-14 深圳前海微众银行股份有限公司 Sentence text segmentation method, device and equipment and computer readable storage medium
CN112906366A (en) * 2021-01-29 2021-06-04 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium
CN112906366B (en) * 2021-01-29 2023-07-07 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium
CN113542661A (en) * 2021-09-09 2021-10-22 北京鼎天宏盛科技有限公司 Video conference voice recognition method and system

Also Published As

Publication number Publication date
CN111090981B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN111090981B (en) Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network
Du et al. Explicit interaction model towards text classification
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN107291795B (en) Text classification method combining dynamic word embedding and part-of-speech tagging
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN107358948B (en) Language input relevance detection method based on attention model
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN110347836B (en) Method for classifying sentiments of Chinese-Yue-bilingual news by blending into viewpoint sentence characteristics
CN110851599B (en) Automatic scoring method for Chinese composition and teaching assistance system
CN109086269B (en) Semantic bilingual recognition method based on semantic resource word representation and collocation relationship
CN108829823A (en) A kind of file classification method
CN110909736A (en) Image description method based on long-short term memory model and target detection algorithm
Yang et al. Rits: Real-time interactive text steganography based on automatic dialogue model
CN109919175B (en) Entity multi-classification method combined with attribute information
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN111581970B (en) Text recognition method, device and storage medium for network context
CN111709242A (en) Chinese punctuation mark adding method based on named entity recognition
CN113723105A (en) Training method, device and equipment of semantic feature extraction model and storage medium
CN112287106A (en) Online comment emotion classification method based on dual-channel hybrid neural network
CN110222338A (en) A kind of mechanism name entity recognition method
CN112131367A (en) Self-auditing man-machine conversation method, system and readable storage medium
CN114756681A (en) Evaluation text fine-grained suggestion mining method based on multi-attention fusion
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant