CN111090981A - Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network - Google Patents

Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network Download PDF

Info

Publication number
CN111090981A
CN111090981A CN201911241042.3A CN201911241042A CN111090981A CN 111090981 A CN111090981 A CN 111090981A CN 201911241042 A CN201911241042 A CN 201911241042A CN 111090981 A CN111090981 A CN 111090981A
Authority
CN
China
Prior art keywords
sentence
punctuation
chinese text
marks
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911241042.3A
Other languages
Chinese (zh)
Other versions
CN111090981B (en
Inventor
屈丹
杨绪魁
张文林
司念文
陈琦
牛铜
闫红刚
张连海
李�真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Original Assignee
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force , Zhengzhou Xinda Institute of Advanced Technology filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201911241042.3A priority Critical patent/CN111090981B/en
Publication of CN111090981A publication Critical patent/CN111090981A/en
Application granted granted Critical
Publication of CN111090981B publication Critical patent/CN111090981B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The invention belongs to the technical field of natural language processing, and discloses a method and a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long-time and short-time memory network, wherein the method comprises the following steps: processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character; a bidirectional long-time and short-time memory network is used as a reference network structure of a Chinese text automatic sentence-breaking and punctuation generation model; improving the log-likelihood loss function by adding a long-sentence punishment factor by adopting the log-likelihood loss function, training the Chinese text corpus added with the labels from the positive direction and the negative direction by taking the minimized improved log-likelihood loss function as a target, and completing the construction of an automatic Chinese text sentence-breaking and punctuation generation model; the system comprises: the system comprises a corpus processing module, a network structure selection module and a model construction and optimization module. The invention solves the problems that sentences can not be automatically broken and punctuation marks are lost in the voice transcription text.

Description

Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method and a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long-time and short-time memory network.
Background
The existing text automatic sentence-breaking and punctuation method mainly comprises two aspects: on the one hand, the method focuses on studying sentence break and punctuation of english text, while chinese text (such as ancient chinese text) is partially studied, but most of the methods adopted are traditional statistical machine learning models (such as conditional random fields), such methods require artificial feature design, have low accuracy, and only implement functions related to automatic sentence break functions, and little or no functions related to automatic punctuation addition (chenxiao, kowning, slow wave. On the other hand, research is focused on the field of post-processing of voice transcription texts, for example, in the patent of the invention with publication number CN 102231278A, punctuation type added at the current position needs to be determined by combining pause position duration (setting threshold value) between sentences and by adding classification function of a classifier, so that the function delay of punctuation and punctuation is long, the real-time performance is not high, and a model for adding punctuation is complex.
Disclosure of Invention
The invention provides a method and a system for constructing a Chinese text automatic sentence-breaking and punctuation generating model based on a bidirectional long-time memory network, aiming at the problems that automatic sentence-breaking and punctuation symbols are absent in a voice transcribed text.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long-time and short-time memory network comprises the following steps:
step 1: processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character;
step 2: a bidirectional long-time and short-time memory network is used as a reference network structure of a Chinese text automatic sentence-breaking and punctuation generation model;
and step 3: and (3) improving the log-likelihood loss function by adding a long-sentence penalty factor by adopting the log-likelihood loss function, training the Chinese text corpus to which the labels are added from the positive direction and the negative direction by taking the minimized improved log-likelihood loss function as a target, and completing the construction of an automatic Chinese text sentence-breaking and punctuation generation model.
Further, the step 1 comprises:
step 1.1: reserving commas, periods, question marks and exclamation marks in the corpus, wherein all the commas, periods, question marks and exclamation marks are full-angle symbols;
step 1.2: replacing pause signs, colon signs, dash signs and connection signs in the corpus with commas, replacing semicolons and ellipses with periods, and directly removing quotation marks, brackets, book title marks and interval marks;
step 1.3: reserving four operators and Greek letters in the corpus;
step 1.4: labeling each character entered with a label indicating the form of the character immediately followed by the next character: non-punctuation marks, commas, periods, question marks, exclamation marks.
Further, the step 3 comprises:
a log-likelihood loss function is used, the loss function being:
Figure BDA0002306235820000021
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark is as follows:
Figure BDA0002306235820000022
where n represents the number of tags, j represents the tag number,
Figure BDA0002306235820000023
representing the number of jth labels of the ith sentence;
adding a long sentence penalty factor β to improve the loss function, wherein the improved loss function is as follows:
Figure BDA0002306235820000024
and training the Chinese text corpus after the labels are added from the positive direction and the negative direction by taking the minimized and improved log-likelihood loss function as a target to complete the construction of an automatic sentence-breaking and punctuation generation model of the Chinese text.
A Chinese text automatic sentence-breaking and punctuation generation model construction system based on a bidirectional long-time memory network comprises:
the corpus processing module is used for processing Chinese text corpora, removing useless symbols and adding a designed label to each character;
the network structure selection module is used for utilizing a bidirectional long-time memory network as a reference network structure of a Chinese text automatic sentence-breaking and punctuation generation model;
and the model construction and optimization module is used for improving the log-likelihood loss function by adding a long-sentence punishment factor by adopting the log-likelihood loss function, training the Chinese text corpus added with the labels from the positive direction and the negative direction by taking the minimized improved log-likelihood loss function as a target, and completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Further, the corpus processing module is specifically configured to:
reserving commas, periods, question marks and exclamation marks in the corpus, wherein all the commas, periods, question marks and exclamation marks are full-angle symbols;
replacing pause signs, colon signs, dash signs and connection signs in the corpus with commas, replacing semicolons and ellipses with periods, and directly removing quotation marks, brackets, book title marks and interval marks;
reserving four operators and Greek letters in the corpus;
labeling each character entered with a label indicating the form of the character immediately followed by the next character: non-punctuation marks, commas, periods, question marks, exclamation marks.
Further, the model construction and optimization module is specifically configured to:
a log-likelihood loss function is used, the loss function being:
Figure BDA0002306235820000031
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark is as follows:
Figure BDA0002306235820000041
where n represents the number of tags, j represents the tag number,
Figure BDA0002306235820000042
representing the number of jth labels of the ith sentence;
adding a long sentence penalty factor β to improve the loss function, wherein the improved loss function is as follows:
Figure BDA0002306235820000043
and training the Chinese text corpus after the labels are added from the positive direction and the negative direction by taking the minimized and improved log-likelihood loss function as a target to complete the construction of an automatic sentence-breaking and punctuation generation model of the Chinese text.
Compared with the prior art, the invention has the following beneficial effects:
the invention can solve the problems that sentences can not be automatically broken and punctuation marks are lost in the voice transcription text. Through the technical scheme and the implementation method provided by the invention, the voice recognition text can be post-processed, sentences can be automatically broken, 4 common punctuations (commas, periods, question marks and exclamation marks) can be added, and the reading experience of a user can be obviously improved.
The invention regards the automatic punctuation as a standard natural language sequence marking task, adopts the two-way LSTM network to model for the time sequence text sequence, marks each input character, designs five kinds of labels in total, respectively represents the form that the character is followed by the next character: { non-punctuation marks, commas, periods, problems, exclamation marks }, preprocessing an original text in the standard format, making training corpora, inputting the text in multiple fields of time, law, famous novels and the like, about 300M in training, inputting the text into a 2-layer bidirectional LSTM for learning, outputting a label corresponding to each character after repeated iterative optimization, and then restoring punctuation marks to obtain the punctuated text.
Drawings
FIG. 1 is a basic flowchart of a method for constructing a Chinese text automatic sentence break and punctuation generation model based on a bidirectional long-and-short term memory network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of an automatic sentence-breaking and punctuation generation model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a system for constructing a model of automatically breaking sentences and punctuation in a chinese text based on a bidirectional long-and-short term memory network according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
example 1
As shown in fig. 1, a method for constructing a model for automatically breaking sentences and punctuation in a chinese text based on a bidirectional long-and-short term memory network includes:
step S101: processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character;
step S102: a bidirectional long-time and short-time memory network is used as a reference network structure of a Chinese text automatic sentence-breaking and punctuation generation model;
step S103: and (3) improving the log-likelihood loss function by adding a long-sentence penalty factor by adopting the log-likelihood loss function, training the Chinese text corpus to which the labels are added from the positive direction and the negative direction by taking the minimized improved log-likelihood loss function as a target, and completing the construction of an automatic Chinese text sentence-breaking and punctuation generation model.
Specifically, the step S101 includes:
step S101.1: reserving commas, periods, question marks and exclamation marks in the corpus, wherein all the commas, periods, question marks and exclamation marks are full-angle symbols;
step S101.2: replacing pause signs, colon signs, dash signs and connection signs in the corpus with commas, replacing semicolons and ellipses with periods, and directly removing quotation marks, brackets, book title marks and interval marks; the replacement strategy is specifically shown in table 1:
TABLE 1 corpus punctuation processing strategy
Figure 1
Step S101.3: reserving four operators and Greek letters in the corpus;
step S101.4: labeling each character entered with a label indicating the form of the character immediately followed by the next character: non-punctuation, comma, period, question mark, exclamation mark:
specifically, five kinds of tags are designed:
label 1: n, corresponding to NONE;
and 2, labeling: c, corresponding to COMMA;
and (3) labeling: p, corresponding to PERIOD;
and (4) labeling: q, corresponding to QUESTION;
and (5) labeling: e, corresponding to EXCLAMATION.
In the above labels, the form of the character followed by the next character is indicated:
non-punctuation marks, commas, periods, question marks, exclamation marks.
The original Chinese text is preprocessed in the standard format, a training corpus is manufactured, the training corpus is input into a bidirectional long-time memory network for learning, a label corresponding to each character is output, and then punctuation recovery is carried out.
Examples are as follows:
original: he indicates that this time … will be
Inputting: he points out this time …
Labeling: n nc N ….
Specifically, the step S102 includes:
in a natural language sequence labeling task, a Recurrent Neural Network (RNN) model is often used, wherein a long short-term memory (LSTM) is used as a special type of RNN, and a memory cell and a gate mechanism are introduced into each hidden layer cell to control input and output of information streams, thereby effectively solving the problem of gradient disappearance existing in a common RNN. In contrast, LSTM is more adept at processing serialized data, such as natural language text, and can model a larger range of context information in the sequence.
In the technical scheme adopted by the invention, a bidirectional LSTM (binary LSTM, BLSTM) network is used for modeling the natural language text from the positive direction and the negative direction, so that the automatic sentence-breaking and punctuation functions are realized, and the specific model structure is shown in figure 2.
In the above network structure, the roles of the layers are as follows:
a) input layer and Embedding layer: the input layer adopts the training corpus without punctuation as input, and realizes the conversion from characters to character indexes by establishing two mappings of word2id and id2 word. The index sequence is ordered according to the dictionary obtained by initialization in the character vector table, and the conversion from the index to the character vector can be realized according to the sequence. The function of the Embedding layer is to index character vectors, convert input characters into character vectors with uniform dimensions, and the character vectors contain rich semantic information. In this model, the dimension of the character vector used is 300 dimensions, and the dictionary size is 14157.
b) Forward and reverse LSTM layers: forward LSTM hidden states and backward LSTM hidden states are computed separately and then projected to a common output layer. The unidirectional LSTM only comprises a hidden layer in one direction and inputs a vector x according to the current timetAnd the hidden state vector h of the previous momentt-1Calculating the hidden state h of the current timet. The bidirectional LSTM comprises a forward layer and a reverse layer, and the hidden state vectors of the forward layer at the current moment need to be calculated respectively
Figure BDA0002306235820000071
And reverse hidden state vector
Figure BDA0002306235820000072
Figure BDA0002306235820000073
Figure BDA0002306235820000074
Where m is the hidden unit dimension, the LSTM () function represents the nonlinear transformation of the LSTM network, whose main function is to encode the input character vector into the corresponding hidden state vector.
c) An output layer: adopting a weighted summation mode to carry out forward implicit state vector
Figure BDA0002306235820000075
And reverse hidden state vector
Figure BDA0002306235820000076
Linear combination is carried out to obtain the hidden vector h of the BLSTMt∈Rm×1
Figure BDA0002306235820000077
Wherein, W1∈Rm×mAnd V1∈Rm×mAs a weight matrix, b1∈Rm×1Are the corresponding offset terms. The hidden layer simultaneously aggregates the sequence information of the current element in the input sequence in the forward direction and the backward direction, and can provide richer context characteristics for the annotation.
Specifically, the step S103 includes:
given training set
Figure BDA0002306235820000078
Wherein the ith sentence
Figure BDA0002306235820000079
The corresponding tag sequence is y(i)=[y1 (i),y2 (i),...,yn (i)]. In the model training, a log-likelihood loss function is adopted, and an L2 regularization term is added, wherein the loss function is as follows:
Figure BDA00023062358200000710
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in order to improve sentence-breaking quality and reading experience and promote finer sentence-breaking, the shorter the sentence length in the result is, the better the sentence length is, the loss function is improved, and a long sentence punishment factor is added:
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark, i.e. in sentence x(i)In (1), calculating its output label y(i)The number of non-NONE tags in the list, i.e. COMMA, PERIOD, QUESTION, EXCLAMATION:
Figure BDA0002306235820000081
where n represents the number of tags, j represents the tag number,
Figure BDA0002306235820000082
representing the number of jth labels of the ith sentence;
adding the formula into a loss function, adding a long sentence penalty factor β, improving the loss function, and calculating the average sentence length loss together with the batch, wherein the improved loss function is as follows:
Figure BDA0002306235820000083
and constructing a Chinese text automatic sentence-breaking and punctuation generating model by taking the minimized and improved log-likelihood loss function as a target.
In the training process, a mini-batch gradient descent method is adopted, and k is the size of each batch. And (3) applying a Dropout strategy to randomly remove part of BLSTM hidden layer units and weights thereof with a certain probability so as to prevent the training data from being over-fitted.
To verify the effect of the present invention, the following experiment was performed:
(1) the method comprises the steps of obtaining an original Chinese text corpus (with mark symbols), and training the Chinese text in the fields of time administration, law, famous works and the like by about 300M.
(2) The punctuation marks in the original text are normalized, only comma, period, question mark and exclamation mark are reserved, and other punctuation marks are automatically classified into one of the four categories or directly removed. Each word in the normalized text is labeled as one of { N, C, P, Q, E } (representing { NONE, COMMA, PERIOD, QUESTION, EXCLAMATION }), and the set of labeling rules is sent to the BLSTM neural network for training.
(3) The results were tested in another part of the randomly drawn article.
And training the punctuation generation model by adopting an LSTM network in a tensoflow library, writing the model into an x.pb binary file after the training is finished, freezing the weight data and the calculation map by adopting a freeze _ graph.
By adopting the automatic sentence-breaking and punctuation generation model, the partial experimental results on the public corpus are as follows:
table 2 discloses examples of partial experimental results on corpora
Figure BDA0002306235820000091
In conclusion, the method and the device can solve the problems that sentences cannot be automatically broken and punctuation marks are lost in the voice transcription text and the like. Through the technical scheme and the implementation method provided by the invention, the voice recognition text can be post-processed, sentences can be automatically broken, 4 common punctuations (commas, periods, question marks and exclamation marks) can be added, and the reading experience of a user can be obviously improved.
The invention regards the automatic punctuation as a standard natural language sequence marking task, adopts the two-way LSTM network to model for the time sequence text sequence, marks each input character, designs five kinds of labels in total, respectively represents the form that the character is followed by the next character: { non-punctuation marks, commas, periods, problems, exclamation marks }, preprocessing an original text in the standard format, making training corpora, inputting the text in multiple fields of time, law, famous novels and the like, about 300M in training, inputting the text into a 2-layer bidirectional LSTM for learning, outputting a label corresponding to each character after repeated iterative optimization, and then restoring punctuation marks to obtain the punctuated text.
Example 2
As shown in fig. 3, a system for constructing a model for automatically breaking sentences and punctuation in chinese text based on a bidirectional long-and-short term memory network includes:
the corpus processing module 201 is configured to process a chinese text corpus, remove useless symbols, and add a designed tag to each character;
the model initialization module 202 is configured to use a bidirectional long-and-short-term memory network as a reference network structure of a Chinese text automatic sentence break and punctuation generation model;
the model construction and optimization module 203 is configured to adopt a log-likelihood loss function, improve the log-likelihood loss function by adding a long-sentence penalty factor, train the tagged chinese text corpus from the positive and negative directions with the minimized improved log-likelihood loss function as a target, and complete the construction of the automatic Chinese text sentence-breaking and punctuation generation model.
Specifically, the corpus processing module 201 is specifically configured to:
reserving commas, periods, question marks and exclamation marks in the corpus, wherein all the commas, periods, question marks and exclamation marks are full-angle symbols;
replacing pause signs, colon signs, dash signs and connection signs in the corpus with commas, replacing semicolons and ellipses with periods, and directly removing quotation marks, brackets, book title marks and interval marks;
reserving four operators and Greek letters in the corpus;
labeling each character entered with a label indicating the form of the character immediately followed by the next character: non-punctuation marks, commas, periods, question marks, exclamation marks.
Specifically, the model construction and optimization module 203 is specifically configured to:
a log-likelihood loss function is used, the loss function being:
Figure BDA0002306235820000111
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark is as follows:
Figure BDA0002306235820000112
where n represents the number of tags, j represents the tag number,
Figure BDA0002306235820000113
representing the number of jth labels of the ith sentence;
adding a long sentence penalty factor β to improve the loss function, wherein the improved loss function is as follows:
Figure BDA0002306235820000114
and constructing a Chinese text automatic sentence-breaking and punctuation generating model by taking the minimized and improved log-likelihood loss function as a target.
The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (6)

1. A method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long-time and short-time memory network is characterized by comprising the following steps:
step 1: processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character;
step 2: a bidirectional long-time and short-time memory network is used as a reference network structure of a Chinese text automatic sentence-breaking and punctuation generation model;
and step 3: and (3) improving the log-likelihood loss function by adding a long-sentence penalty factor by adopting the log-likelihood loss function, training the Chinese text corpus to which the labels are added from the positive direction and the negative direction by taking the minimized improved log-likelihood loss function as a target, and completing the construction of an automatic Chinese text sentence-breaking and punctuation generation model.
2. The method for constructing the model for generating the automatic Chinese text punctuation and punctuation based on the bidirectional long-and-short term memory network as claimed in claim 1, wherein said step 1 comprises:
step 1.1: reserving commas, periods, question marks and exclamation marks in the corpus, wherein all the commas, periods, question marks and exclamation marks are full-angle symbols;
step 1.2: replacing pause signs, colon signs, dash signs and connection signs in the corpus with commas, replacing semicolons and ellipses with periods, and directly removing quotation marks, brackets, book title marks and interval marks;
step 1.3: reserving four operators and Greek letters in the corpus;
step 1.4: labeling each character entered with a label indicating the form of the character immediately followed by the next character: non-punctuation marks, commas, periods, question marks, exclamation marks.
3. The method for constructing the model for generating the automatic Chinese text punctuation and punctuation based on the bidirectional long-and-short term memory network as claimed in claim 2, wherein said step 3 comprises:
a log-likelihood loss function is used, the loss function being:
Figure FDA0002306235810000011
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark is as follows:
Figure FDA0002306235810000021
where n represents the number of tags, j represents the tag number,
Figure FDA0002306235810000022
representing the number of jth labels of the ith sentence;
adding a long sentence penalty factor β to improve the loss function, wherein the improved loss function is as follows:
Figure FDA0002306235810000023
and training the Chinese text corpus after the labels are added from the positive direction and the negative direction by taking the minimized and improved log-likelihood loss function as a target to complete the construction of an automatic sentence-breaking and punctuation generation model of the Chinese text.
4. A Chinese text automatic sentence-breaking and punctuation generation model construction system based on a bidirectional long-time memory network is characterized by comprising the following steps:
the corpus processing module is used for processing Chinese text corpora, removing useless symbols and adding a designed label to each character;
the network structure selection module is used for utilizing a bidirectional long-time memory network as a reference network structure of a Chinese text automatic sentence-breaking and punctuation generation model;
and the model construction and optimization module is used for improving the log-likelihood loss function by adding a long-sentence punishment factor by adopting the log-likelihood loss function, training the Chinese text corpus added with the labels from the positive direction and the negative direction by taking the minimized improved log-likelihood loss function as a target, and completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
5. The system for constructing a model of automatic Chinese text sentence break and punctuation based on a bidirectional long-and-short term memory network as claimed in claim 4, wherein said corpus processing module is specifically configured to:
reserving commas, periods, question marks and exclamation marks in the corpus, wherein all the commas, periods, question marks and exclamation marks are full-angle symbols;
replacing pause signs, colon signs, dash signs and connection signs in the corpus with commas, replacing semicolons and ellipses with periods, and directly removing quotation marks, brackets, book title marks and interval marks;
reserving four operators and Greek letters in the corpus;
labeling each character entered with a label indicating the form of the character immediately followed by the next character: non-punctuation marks, commas, periods, question marks, exclamation marks.
6. The system for constructing a model for generating automatic Chinese text punctuation and punctuation based on a bidirectional long-and-short term memory network as claimed in claim 4, wherein said model construction and optimization module is specifically configured to:
a log-likelihood loss function is used, the loss function being:
Figure FDA0002306235810000031
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark is as follows:
Figure FDA0002306235810000032
where n represents the number of tags, j represents the tag number,
Figure FDA0002306235810000033
representing the number of jth labels of the ith sentence;
adding a long sentence penalty factor β to improve the loss function, wherein the improved loss function is as follows:
Figure FDA0002306235810000034
and training the Chinese text corpus after the labels are added from the positive direction and the negative direction by taking the minimized and improved log-likelihood loss function as a target to complete the construction of an automatic sentence-breaking and punctuation generation model of the Chinese text.
CN201911241042.3A 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network Active CN111090981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911241042.3A CN111090981B (en) 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911241042.3A CN111090981B (en) 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Publications (2)

Publication Number Publication Date
CN111090981A true CN111090981A (en) 2020-05-01
CN111090981B CN111090981B (en) 2022-04-15

Family

ID=70394814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911241042.3A Active CN111090981B (en) 2019-12-06 2019-12-06 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Country Status (1)

Country Link
CN (1) CN111090981B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723584A (en) * 2020-06-24 2020-09-29 天津大学 Punctuation prediction method based on consideration of domain information
CN111951792A (en) * 2020-07-30 2020-11-17 北京先声智能科技有限公司 Punctuation marking model based on grouping convolution neural network
CN112001167A (en) * 2020-08-26 2020-11-27 四川云从天府人工智能科技有限公司 Punctuation mark adding method, system, equipment and medium
CN112101003A (en) * 2020-09-14 2020-12-18 深圳前海微众银行股份有限公司 Sentence text segmentation method, device and equipment and computer readable storage medium
CN112906366A (en) * 2021-01-29 2021-06-04 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium
CN113542661A (en) * 2021-09-09 2021-10-22 北京鼎天宏盛科技有限公司 Video conference voice recognition method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067514A (en) * 1998-06-23 2000-05-23 International Business Machines Corporation Method for automatically punctuating a speech utterance in a continuous speech recognition system
US20130325442A1 (en) * 2010-09-24 2013-12-05 National University Of Singapore Methods and Systems for Automated Text Correction
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
CN109887499A (en) * 2019-04-11 2019-06-14 中国石油大学(华东) A kind of voice based on Recognition with Recurrent Neural Network is made pauses in reading unpunctuated ancient writings algorithm automatically
US20190188463A1 (en) * 2017-12-15 2019-06-20 Adobe Inc. Using deep learning techniques to determine the contextual reading order in a form document
CN109918666A (en) * 2019-03-06 2019-06-21 北京工商大学 A kind of Chinese punctuation mark adding method neural network based
CN110110335A (en) * 2019-05-09 2019-08-09 南京大学 A kind of name entity recognition method based on Overlay model
CN110245332A (en) * 2019-04-22 2019-09-17 平安科技(深圳)有限公司 Chinese character code method and apparatus based on two-way length memory network model in short-term
US10431210B1 (en) * 2018-04-16 2019-10-01 International Business Machines Corporation Implementing a whole sentence recurrent neural network language model for natural language processing

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067514A (en) * 1998-06-23 2000-05-23 International Business Machines Corporation Method for automatically punctuating a speech utterance in a continuous speech recognition system
US20130325442A1 (en) * 2010-09-24 2013-12-05 National University Of Singapore Methods and Systems for Automated Text Correction
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
US20190188463A1 (en) * 2017-12-15 2019-06-20 Adobe Inc. Using deep learning techniques to determine the contextual reading order in a form document
US10431210B1 (en) * 2018-04-16 2019-10-01 International Business Machines Corporation Implementing a whole sentence recurrent neural network language model for natural language processing
CN109918666A (en) * 2019-03-06 2019-06-21 北京工商大学 A kind of Chinese punctuation mark adding method neural network based
CN109887499A (en) * 2019-04-11 2019-06-14 中国石油大学(华东) A kind of voice based on Recognition with Recurrent Neural Network is made pauses in reading unpunctuated ancient writings algorithm automatically
CN110245332A (en) * 2019-04-22 2019-09-17 平安科技(深圳)有限公司 Chinese character code method and apparatus based on two-way length memory network model in short-term
CN110110335A (en) * 2019-05-09 2019-08-09 南京大学 A kind of name entity recognition method based on Overlay model

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
JIANFENG GAO等: "Chinese Word Segmentation and Named Entity Recognition:A Pragmatic Approach", 《COMPUTATIONAL LINGUISTICS》 *
司念文: "面向军事领域的句子级文本处理技术研究", 《中国优秀博硕士学位论文全文数据库(硕士)工程科技Ⅱ辑》 *
应文豪等: "一种话题敏感的抽取式多文档摘要方法", 《中文信息学报》 *
张合等: "一种基于层叠CRF的古文断句与句读标记方法", 《计算机应用研究》 *
张慧等: "CRF模型的自动标点预测方法研究", 《网络新媒体技术》 *
王博立等: "一种基于循环神经网络的古文断句方法", 《北京大学学报(自然科学版)》 *
胡婕等: "双向循环网络中文分词模型", 《小型微型计算机系统》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723584A (en) * 2020-06-24 2020-09-29 天津大学 Punctuation prediction method based on consideration of domain information
CN111723584B (en) * 2020-06-24 2024-05-07 天津大学 Punctuation prediction method based on consideration field information
CN111951792A (en) * 2020-07-30 2020-11-17 北京先声智能科技有限公司 Punctuation marking model based on grouping convolution neural network
CN111951792B (en) * 2020-07-30 2022-12-16 北京先声智能科技有限公司 Punctuation marking model based on grouping convolution neural network
CN112001167A (en) * 2020-08-26 2020-11-27 四川云从天府人工智能科技有限公司 Punctuation mark adding method, system, equipment and medium
CN112101003A (en) * 2020-09-14 2020-12-18 深圳前海微众银行股份有限公司 Sentence text segmentation method, device and equipment and computer readable storage medium
CN112101003B (en) * 2020-09-14 2023-03-14 深圳前海微众银行股份有限公司 Sentence text segmentation method, device and equipment and computer readable storage medium
CN112906366A (en) * 2021-01-29 2021-06-04 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium
CN112906366B (en) * 2021-01-29 2023-07-07 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium
CN113542661A (en) * 2021-09-09 2021-10-22 北京鼎天宏盛科技有限公司 Video conference voice recognition method and system

Also Published As

Publication number Publication date
CN111090981B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN111090981B (en) Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network
Du et al. Explicit interaction model towards text classification
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN107291795B (en) Text classification method combining dynamic word embedding and part-of-speech tagging
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN107358948B (en) Language input relevance detection method based on attention model
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN110347836B (en) Method for classifying sentiments of Chinese-Yue-bilingual news by blending into viewpoint sentence characteristics
CN110851599B (en) Automatic scoring method for Chinese composition and teaching assistance system
CN109086269B (en) Semantic bilingual recognition method based on semantic resource word representation and collocation relationship
CN108829823A (en) A kind of file classification method
CN110909736A (en) Image description method based on long-short term memory model and target detection algorithm
Yang et al. Rits: Real-time interactive text steganography based on automatic dialogue model
CN109919175B (en) Entity multi-classification method combined with attribute information
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN111581970B (en) Text recognition method, device and storage medium for network context
CN111709242A (en) Chinese punctuation mark adding method based on named entity recognition
CN113723105A (en) Training method, device and equipment of semantic feature extraction model and storage medium
CN112287106A (en) Online comment emotion classification method based on dual-channel hybrid neural network
CN110222338A (en) A kind of mechanism name entity recognition method
CN112131367A (en) Self-auditing man-machine conversation method, system and readable storage medium
CN114756681A (en) Evaluation text fine-grained suggestion mining method based on multi-attention fusion
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant