CN111090981A - Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network - Google Patents
- Publication number
- CN111090981A (application number CN201911241042.3A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- punctuation
- chinese text
- marks
- loss function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention belongs to the technical field of natural language processing, and discloses a method and a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network. The method comprises the following steps: process the Chinese text corpus, remove useless symbols, and add a designed tag to each character; use a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model; adopt a log-likelihood loss function improved by adding a long-sentence penalty factor, and train on the tagged Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved loss function, completing construction of the model. The system comprises a corpus processing module, a network structure selection module, and a model construction and optimization module. The invention solves the problems that speech-transcribed text cannot be automatically broken into sentences and lacks punctuation marks.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method and a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network.
Background
Existing methods for automatic sentence breaking and punctuation of text fall into two main lines of work. On the one hand, research has focused on sentence breaking and punctuation of English text; Chinese text (such as ancient Chinese) has been partially studied, but mostly with traditional statistical machine-learning models (such as conditional random fields). Such methods require manual feature design, have low accuracy, and implement only functions related to automatic sentence breaking, with little or no support for automatically adding punctuation marks. On the other hand, research has focused on post-processing of speech-transcribed text. For example, the invention patent with publication number CN 102231278 A determines the punctuation type to add at the current position by combining the duration of pauses between sentences (against a set threshold) with the classification function of an added classifier; as a result, punctuation is added with a long delay, real-time performance is poor, and the punctuation model is complex.
Disclosure of Invention
Aiming at the problems that speech-transcribed text cannot be automatically broken into sentences and lacks punctuation marks, the invention provides a method and a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long-time and short-time memory network comprises the following steps:
step 1: processing the Chinese text corpus, removing useless symbols, and adding a designed label to each character;
step 2: a bidirectional long-time and short-time memory network is used as a reference network structure of a Chinese text automatic sentence-breaking and punctuation generation model;
and step 3: and (3) improving the log-likelihood loss function by adding a long-sentence penalty factor by adopting the log-likelihood loss function, training the Chinese text corpus to which the labels are added from the positive direction and the negative direction by taking the minimized improved log-likelihood loss function as a target, and completing the construction of an automatic Chinese text sentence-breaking and punctuation generation model.
Further, the step 1 comprises:
Step 1.1: retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
Step 1.2: replace enumeration commas, colons, dashes and hyphens in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book-title marks and interpuncts;
Step 1.3: retain the four arithmetic operators and Greek letters in the corpus;
Step 1.4: label each input character with a tag indicating what immediately follows it: no punctuation, a comma, a period, a question mark, or an exclamation mark.
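As an illustrative sketch of steps 1.1–1.4 (not the patent's actual code; the replacement tables and helper names are assumptions), the normalization and tagging can be written as:

```python
# Sketch of the corpus preprocessing: full-width punctuation is normalized to
# the four kept marks, listed symbols are removed, and each remaining character
# is paired with a tag describing what immediately follows it.

REPLACE = {"、": "，", "：": "，", "——": "，", "－": "，",  # -> comma
           "；": "。", "……": "。"}                          # -> period
REMOVE = set("“”‘’（）《》·")                               # dropped outright
KEEP = {"，": "C", "。": "P", "？": "Q", "！": "E"}          # kept marks -> tags

def normalize(text: str) -> str:
    for old, new in REPLACE.items():
        text = text.replace(old, new)
    return "".join(ch for ch in text if ch not in REMOVE)

def tag(text: str):
    """Return (character, tag) pairs; tag N means no punctuation follows."""
    pairs = []
    for ch in normalize(text):
        if ch in KEEP:
            if pairs:                      # attach mark to preceding character
                pairs[-1] = (pairs[-1][0], KEEP[ch])
        else:
            pairs.append((ch, "N"))
    return pairs

pairs = tag("他指出，这次实验成功了！")
print(pairs)
```

Each non-punctuation character receives exactly one of the five tags, which is the format fed to the network in later steps.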
Further, the step 3 comprises:
A log-likelihood loss function with an L2 regularization term is adopted:

L(θ) = −Σ_{i=1}^{k} log P(y^{(i)} | x^{(i)}; θ) + λ‖θ‖²

wherein x^{(i)} denotes the i-th sentence, 1 ≤ i ≤ N, N denotes the total number of sentences in the corpus, k denotes the number of sentences in a batch, 1 ≤ k ≤ N, P(y^{(i)} | x^{(i)}; θ) denotes the probability of the tag sequence y^{(i)} corresponding to x^{(i)}, θ denotes the set of model parameters, and λ denotes the L2 regularization parameter;

for sentence x^{(i)}, the number of comma, period, question-mark and exclamation-mark tags in the output tag sequence y^{(i)} is counted:

num^{(i)} = Σ_{j=1}^{n} c_j^{(i)}

where n denotes the number of tags in the sentence, j denotes the tag position, and c_j^{(i)} ∈ {0, 1} indicates whether the j-th tag of the i-th sentence is a punctuation tag;

a long-sentence penalty factor β is added to improve the loss function; the improved loss function is:

L′(θ) = L(θ) + β · (1/k) Σ_{i=1}^{k} n^{(i)} / (num^{(i)} + 1)

with the goal of minimizing the improved log-likelihood loss function, the tagged Chinese text corpus is trained in both the forward and backward directions to complete construction of the Chinese text automatic sentence-breaking and punctuation generation model.
A system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network comprises:
a corpus processing module, used to process the Chinese text corpus, remove useless symbols, and add a designed tag to each character;
a network structure selection module, used to employ a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
and a model construction and optimization module, used to adopt a log-likelihood loss function improved by adding a long-sentence penalty factor, and to train on the tagged Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Further, the corpus processing module is specifically configured to:
retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
replace enumeration commas, colons, dashes and hyphens in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book-title marks and interpuncts;
retain the four arithmetic operators and Greek letters in the corpus;
label each input character with a tag indicating what immediately follows it: no punctuation, a comma, a period, a question mark, or an exclamation mark.
Further, the model construction and optimization module is specifically configured to:
A log-likelihood loss function with an L2 regularization term is adopted:

L(θ) = −Σ_{i=1}^{k} log P(y^{(i)} | x^{(i)}; θ) + λ‖θ‖²

wherein x^{(i)} denotes the i-th sentence, 1 ≤ i ≤ N, N denotes the total number of sentences in the corpus, k denotes the number of sentences in a batch, 1 ≤ k ≤ N, P(y^{(i)} | x^{(i)}; θ) denotes the probability of the tag sequence y^{(i)} corresponding to x^{(i)}, θ denotes the set of model parameters, and λ denotes the L2 regularization parameter;

for sentence x^{(i)}, the number of comma, period, question-mark and exclamation-mark tags in the output tag sequence y^{(i)} is counted:

num^{(i)} = Σ_{j=1}^{n} c_j^{(i)}

where n denotes the number of tags in the sentence, j denotes the tag position, and c_j^{(i)} ∈ {0, 1} indicates whether the j-th tag of the i-th sentence is a punctuation tag;

a long-sentence penalty factor β is added to improve the loss function; the improved loss function is:

L′(θ) = L(θ) + β · (1/k) Σ_{i=1}^{k} n^{(i)} / (num^{(i)} + 1)

with the goal of minimizing the improved log-likelihood loss function, the tagged Chinese text corpus is trained in both the forward and backward directions to complete construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Compared with the prior art, the invention has the following beneficial effects:
the invention can solve the problems that sentences can not be automatically broken and punctuation marks are lost in the voice transcription text. Through the technical scheme and the implementation method provided by the invention, the voice recognition text can be post-processed, sentences can be automatically broken, 4 common punctuations (commas, periods, question marks and exclamation marks) can be added, and the reading experience of a user can be obviously improved.
The invention treats automatic punctuation as a standard natural-language sequence labeling task. A bidirectional LSTM network models the sequential text, and each input character is labeled with one of five designed tags indicating what immediately follows the character: {no punctuation, comma, period, question mark, exclamation mark}. The original text is preprocessed into this standard format to produce a training corpus of about 300 MB of text from multiple fields such as current affairs, law, and famous novels. The corpus is fed into a 2-layer bidirectional LSTM for learning; after repeated iterative optimization, the tag corresponding to each character is output, and punctuation marks are then restored to obtain the punctuated text.
Drawings
FIG. 1 is a basic flowchart of a method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the network structure of the automatic sentence-breaking and punctuation generation model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
example 1
As shown in FIG. 1, a method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network comprises:
Step S101: process the Chinese text corpus, remove useless symbols, and add a designed tag to each character;
Step S102: use a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
Step S103: adopt a log-likelihood loss function and improve it by adding a long-sentence penalty factor; with the goal of minimizing the improved log-likelihood loss function, train on the tagged Chinese text corpus in both the forward and backward directions to complete construction of the Chinese text automatic sentence-breaking and punctuation generation model.
Specifically, the step S101 includes:
Step S101.1: retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
Step S101.2: replace enumeration commas, colons, dashes and hyphens in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book-title marks and interpuncts; the replacement strategy is shown in Table 1:
TABLE 1 Corpus punctuation processing strategy
Step S101.3: retain the four arithmetic operators and Greek letters in the corpus;
Step S101.4: label each input character with a tag indicating what immediately follows it: no punctuation, a comma, a period, a question mark, or an exclamation mark.
Specifically, five tags are designed:
Tag 1: N, corresponding to NONE;
Tag 2: C, corresponding to COMMA;
Tag 3: P, corresponding to PERIOD;
Tag 4: Q, corresponding to QUESTION;
Tag 5: E, corresponding to EXCLAMATION.
Each of the above tags indicates what immediately follows the character: no punctuation, a comma, a period, a question mark, or an exclamation mark.
The original Chinese text is preprocessed into this standard format to produce a training corpus, which is fed into the bidirectional long short-term memory network for learning; the tag corresponding to each character is output, and punctuation is then restored.
An example:
Original: He pointed out, this time …
Input (punctuation removed): He pointed out this time …
Tags: N N C N …
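The inverse step, restoring punctuation from predicted tags, can be sketched as follows (a minimal illustration; the mapping-table and function names are assumptions):

```python
# Sketch: recover punctuated text from per-character tags
# (the inverse of the labeling step above).
TAG2PUNCT = {"N": "", "C": "，", "P": "。", "Q": "？", "E": "！"}

def restore(chars, tags):
    """Re-insert each tagged punctuation mark after its character."""
    return "".join(c + TAG2PUNCT[t] for c, t in zip(chars, tags))

text = restore(list("他指出这次"), ["N", "N", "C", "N", "N"])
print(text)  # 他指出，这次
```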
Specifically, the step S102 includes:
in a natural language sequence labeling task, a Recurrent Neural Network (RNN) model is often used, wherein a long short-term memory (LSTM) is used as a special type of RNN, and a memory cell and a gate mechanism are introduced into each hidden layer cell to control input and output of information streams, thereby effectively solving the problem of gradient disappearance existing in a common RNN. In contrast, LSTM is more adept at processing serialized data, such as natural language text, and can model a larger range of context information in the sequence.
In the technical scheme adopted by the invention, a bidirectional LSTM (BLSTM) network models the natural-language text in both the forward and backward directions to realize the automatic sentence-breaking and punctuation functions; the specific model structure is shown in FIG. 2.
In the above network structure, the roles of the layers are as follows:
a) Input layer and Embedding layer: the input layer takes the training corpus with punctuation removed as input, and converts characters to character indices by building two mappings, word2id and id2word. The index sequence is ordered according to the dictionary obtained when the character-vector table is initialized, so indices can in turn be converted to character vectors. The Embedding layer looks up character vectors by index, converting the input characters into character vectors of uniform dimension that carry rich semantic information. In this model, the character-vector dimension is 300 and the dictionary size is 14157.
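A minimal sketch of the word2id/id2word mappings and the embedding lookup described above (random vectors stand in for trained embeddings; reserving index 0 for unknown characters is an assumption for illustration):

```python
import random

# Characters -> indices -> dense vectors, mirroring the Input + Embedding layers.
random.seed(0)
DIM = 300  # character-vector dimension used by the model

chars = sorted(set("他指出这次实验成功了"))
word2id = {c: i + 1 for i, c in enumerate(chars)}   # 0 reserved for unknowns
id2word = {i: c for c, i in word2id.items()}
table = [[random.uniform(-0.1, 0.1) for _ in range(DIM)]
         for _ in range(len(word2id) + 1)]          # character-vector table

def embed(sentence):
    """Map each character to its index, then to its vector."""
    ids = [word2id.get(c, 0) for c in sentence]
    return ids, [table[i] for i in ids]

ids, vecs = embed("他指出")
print(ids, len(vecs[0]))
```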
b) Forward and backward LSTM layers: the forward LSTM hidden states and the backward LSTM hidden states are computed separately and then projected to a common output layer. A unidirectional LSTM contains a hidden layer in one direction only and computes the current hidden state h_t from the current input vector x_t and the previous hidden state h_{t−1}. The bidirectional LSTM contains a forward layer and a backward layer, and the forward and backward hidden state vectors at the current time are computed separately:

h_t^f = LSTM(x_t, h_{t−1}^f),  h_t^b = LSTM(x_t, h_{t+1}^b),  h_t^f, h_t^b ∈ R^{m×1}

where m is the hidden-unit dimension and the LSTM(·) function denotes the nonlinear transformation of the LSTM network, whose main role is to encode the input character vector into the corresponding hidden state vector.
c) Output layer: the forward hidden state vector h_t^f and the backward hidden state vector h_t^b are linearly combined by weighted summation to obtain the BLSTM hidden vector h_t ∈ R^{m×1}:

h_t = W_1 h_t^f + V_1 h_t^b + b_1

where W_1 ∈ R^{m×m} and V_1 ∈ R^{m×m} are weight matrices and b_1 ∈ R^{m×1} is the corresponding bias term. The hidden layer aggregates the forward and backward sequence information of the current element in the input sequence, providing richer context features for tagging.
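The weighted summation of forward and backward states with W1, V1 and b1 can be sketched in plain Python (illustrative dimension m = 3; identity matrices stand in for learned weights):

```python
# Sketch of the output-layer combination: h_t = W1·h_fwd + V1·h_bwd + b1.
m = 3

def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def combine(h_fwd, h_bwd, W1, V1, b1):
    """Weighted summation of forward and backward hidden states."""
    return [a + b + c for a, b, c in zip(matvec(W1, h_fwd), matvec(V1, h_bwd), b1)]

I = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
h_t = combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], I, I, [0.0, 0.0, 0.0])
print(h_t)  # [1.5, 2.5, 3.5]
```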
Specifically, the step S103 includes:
given training setWherein the ith sentenceThe corresponding tag sequence is y(i)=[y1 (i),y2 (i),...,yn (i)]. In the model training, a log-likelihood loss function is adopted, and an L2 regularization term is added, wherein the loss function is as follows:
wherein x(i)Representing the ith sentence, 1 ≦ i ≦ N, N representing the total number of sentences in the corpus, k representing the number of sentences in the batch, 1 ≦ k ≦ N, P (y)(i)|x(i)(ii) a Theta) represents x(i)Corresponding tag sequence y(i)θ represents a hyper-parametric set of models, λ represents an L2 regularization parameter;
in order to improve sentence-breaking quality and reading experience and promote finer sentence-breaking, the shorter the sentence length in the result is, the better the sentence length is, the loss function is improved, and a long sentence punishment factor is added:
in sentence x(i)In (1), calculating output label y(i)The number of labels corresponding to medium comma, period, question mark and exclamation mark, i.e. in sentence x(i)In (1), calculating its output label y(i)The number of non-NONE tags in the list, i.e. COMMA, PERIOD, QUESTION, EXCLAMATION:
where n represents the number of tags, j represents the tag number,representing the number of jth labels of the ith sentence;
adding the formula into a loss function, adding a long sentence penalty factor β, improving the loss function, and calculating the average sentence length loss together with the batch, wherein the improved loss function is as follows:
and constructing a Chinese text automatic sentence-breaking and punctuation generating model by taking the minimized and improved log-likelihood loss function as a target.
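As a sketch, the improved loss can be computed as below. The exact penalty form (batch-average of n/(num+1), i.e. average segment length between punctuation marks) is an assumption reconstructed from the description, and the β and λ values are illustrative, not the patent's:

```python
import math

def improved_loss(log_probs, tag_seqs, beta=0.1, lam=0.01, theta_norm_sq=1.0):
    """Negative log-likelihood + L2 term + long-sentence penalty (assumed form)."""
    k = len(log_probs)
    nll = -sum(log_probs)                       # -sum_i log P(y_i | x_i; theta)
    penalty = sum(len(tags) / (sum(t != "N" for t in tags) + 1)
                  for tags in tag_seqs) / k     # batch-average segment length
    return nll + lam * theta_norm_sq + beta * penalty

# Two sentences: one with 2 punctuation tags, one with none.
loss = improved_loss([math.log(0.5), math.log(0.25)],
                     [["N", "C", "N", "P"], ["N", "N", "N", "N"]])
print(loss)
```

Note how the second sentence, with no punctuation tags, contributes a larger penalty (4/1 versus 4/3), so minimizing the loss favours outputs with more sentence breaks.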
During training, a mini-batch gradient descent method is adopted, with k the size of each batch. A dropout strategy randomly removes a fraction of the BLSTM hidden-layer units and their weights with a certain probability to prevent overfitting to the training data.
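The mini-batch iteration and dropout masking can be sketched as follows (batch size, dropout rate, and the inverted-dropout scaling convention are illustrative assumptions):

```python
import random

random.seed(42)

def batches(data, k):
    """Yield consecutive mini-batches of size k (last one may be smaller)."""
    for start in range(0, len(data), k):
        yield data[start:start + k]

def dropout(h, rate=0.5):
    """Inverted dropout: zero units with probability `rate`, scale survivors
    by 1/(1-rate) so the expected activation is unchanged."""
    return [0.0 if random.random() < rate else v / (1 - rate) for v in h]

data = list(range(10))
sizes = [len(b) for b in batches(data, 4)]
print(sizes)  # [4, 4, 2]
h = dropout([1.0] * 8)
print(h)
```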
To verify the effect of the present invention, the following experiment was performed:
(1) Obtain an original Chinese text corpus (with punctuation marks): about 300 MB of Chinese text from fields such as current affairs, law and famous literary works is used for training.
(2) Normalize the punctuation marks in the original text: only commas, periods, question marks and exclamation marks are retained, and every other punctuation mark is mapped to one of these four categories or removed. Each character of the normalized text is labeled with one of {N, C, P, Q, E} (representing {NONE, COMMA, PERIOD, QUESTION, EXCLAMATION}), and the text labeled under this rule set is sent to the BLSTM neural network for training.
(3) The results were tested on a separate, randomly drawn set of articles.
The punctuation generation model is trained with an LSTM network from the TensorFlow library. After training, the model is written to an x.pb binary file, and the weight data and computation graph are frozen with the freeze_graph.py tool.
By adopting the automatic sentence-breaking and punctuation generation model, the partial experimental results on the public corpus are as follows:
table 2 discloses examples of partial experimental results on corpora
In conclusion, the invention solves the problems that speech-transcribed text cannot be automatically broken into sentences and lacks punctuation marks. Through the technical scheme and implementation method provided by the invention, speech-recognition text can be post-processed: sentences are broken automatically and the four common punctuation marks (comma, period, question mark, exclamation mark) are added, which significantly improves the user's reading experience.
The invention treats automatic punctuation as a standard natural-language sequence labeling task. A bidirectional LSTM network models the sequential text, and each input character is labeled with one of five designed tags indicating what immediately follows the character: {no punctuation, comma, period, question mark, exclamation mark}. The original text is preprocessed into this standard format to produce a training corpus of about 300 MB of text from multiple fields such as current affairs, law, and famous novels. The corpus is fed into a 2-layer bidirectional LSTM for learning; after repeated iterative optimization, the tag corresponding to each character is output, and punctuation marks are then restored to obtain the punctuated text.
Example 2
As shown in FIG. 3, a system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network comprises:
a corpus processing module 201, configured to process the Chinese text corpus, remove useless symbols, and add a designed tag to each character;
a network structure selection module 202, configured to use a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
and a model construction and optimization module 203, configured to adopt a log-likelihood loss function improved by adding a long-sentence penalty factor, and to train on the tagged Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing construction of the Chinese text automatic sentence-breaking and punctuation generation model.
The corpus processing module 201 is specifically configured to:
retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
replace enumeration commas, colons, dashes and hyphens in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book-title marks and interpuncts;
retain the four arithmetic operators and Greek letters in the corpus;
label each input character with a tag indicating what immediately follows it: no punctuation, a comma, a period, a question mark, or an exclamation mark.
The model construction and optimization module 203 is specifically configured to:
adopt a log-likelihood loss function with an L2 regularization term:

L(θ) = −Σ_{i=1}^{k} log P(y^{(i)} | x^{(i)}; θ) + λ‖θ‖²

wherein x^{(i)} denotes the i-th sentence, 1 ≤ i ≤ N, N denotes the total number of sentences in the corpus, k denotes the number of sentences in a batch, 1 ≤ k ≤ N, P(y^{(i)} | x^{(i)}; θ) denotes the probability of the tag sequence y^{(i)} corresponding to x^{(i)}, θ denotes the set of model parameters, and λ denotes the L2 regularization parameter;

for sentence x^{(i)}, count the number of comma, period, question-mark and exclamation-mark tags in the output tag sequence y^{(i)}:

num^{(i)} = Σ_{j=1}^{n} c_j^{(i)}

where n denotes the number of tags in the sentence, j denotes the tag position, and c_j^{(i)} ∈ {0, 1} indicates whether the j-th tag of the i-th sentence is a punctuation tag;

add a long-sentence penalty factor β to improve the loss function; the improved loss function is:

L′(θ) = L(θ) + β · (1/k) Σ_{i=1}^{k} n^{(i)} / (num^{(i)} + 1)

The Chinese text automatic sentence-breaking and punctuation generation model is constructed with the goal of minimizing the improved log-likelihood loss function.
The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.
Claims (6)
1. A method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network, characterized by comprising the following steps:
Step 1: process the Chinese text corpus, remove useless symbols, and add a designed tag to each character;
Step 2: use a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
Step 3: adopt a log-likelihood loss function and improve it by adding a long-sentence penalty factor; with the goal of minimizing the improved log-likelihood loss function, train on the tagged Chinese text corpus in both the forward and backward directions to complete construction of the Chinese text automatic sentence-breaking and punctuation generation model.
2. The method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to claim 1, characterized in that step 1 comprises:
Step 1.1: retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
Step 1.2: replace enumeration commas, colons, dashes and hyphens in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book-title marks and interpuncts;
Step 1.3: retain the four arithmetic operators and Greek letters in the corpus;
Step 1.4: label each input character with a tag indicating what immediately follows it: no punctuation, a comma, a period, a question mark, or an exclamation mark.
3. The method for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to claim 2, characterized in that step 3 comprises:
adopt a log-likelihood loss function with an L2 regularization term:

L(θ) = −Σ_{i=1}^{k} log P(y^{(i)} | x^{(i)}; θ) + λ‖θ‖²

wherein x^{(i)} denotes the i-th sentence, 1 ≤ i ≤ N, N denotes the total number of sentences in the corpus, k denotes the number of sentences in a batch, 1 ≤ k ≤ N, P(y^{(i)} | x^{(i)}; θ) denotes the probability of the tag sequence y^{(i)} corresponding to x^{(i)}, θ denotes the set of model parameters, and λ denotes the L2 regularization parameter;

for sentence x^{(i)}, count the number of comma, period, question-mark and exclamation-mark tags in the output tag sequence y^{(i)}:

num^{(i)} = Σ_{j=1}^{n} c_j^{(i)}

where n denotes the number of tags in the sentence, j denotes the tag position, and c_j^{(i)} ∈ {0, 1} indicates whether the j-th tag of the i-th sentence is a punctuation tag;

add a long-sentence penalty factor β to improve the loss function; the improved loss function is:

L′(θ) = L(θ) + β · (1/k) Σ_{i=1}^{k} n^{(i)} / (num^{(i)} + 1)

With the goal of minimizing the improved log-likelihood loss function, train on the tagged Chinese text corpus in both the forward and backward directions to complete construction of the Chinese text automatic sentence-breaking and punctuation generation model.
4. A system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network, characterized by comprising:
a corpus processing module, used to process the Chinese text corpus, remove useless symbols, and add a designed tag to each character;
a network structure selection module, used to employ a bidirectional long short-term memory network as the reference network structure of the Chinese text automatic sentence-breaking and punctuation generation model;
and a model construction and optimization module, used to adopt a log-likelihood loss function improved by adding a long-sentence penalty factor, and to train on the tagged Chinese text corpus in both the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing construction of the Chinese text automatic sentence-breaking and punctuation generation model.
5. The system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to claim 4, characterized in that the corpus processing module is specifically configured to:
retain the commas, periods, question marks and exclamation marks in the corpus, all of which are full-width symbols;
replace enumeration commas, colons, dashes and hyphens in the corpus with commas, replace semicolons and ellipses with periods, and directly remove quotation marks, brackets, book-title marks and interpuncts;
retain the four arithmetic operators and Greek letters in the corpus;
label each input character with a tag indicating what immediately follows it: no punctuation, a comma, a period, a question mark, or an exclamation mark.
6. The system for constructing a Chinese text automatic sentence-breaking and punctuation generation model based on a bidirectional long short-term memory network according to claim 4, characterized in that the model construction and optimization module is specifically configured to:
adopt a log-likelihood loss function, the loss function being:

L(θ) = −∑_{i=1}^{k} log P(y^(i) | x^(i); θ) + λ‖θ‖₂²

wherein x^(i) represents the i-th sentence, 1 ≤ i ≤ N, N represents the total number of sentences in the corpus, k represents the number of sentences in a batch, 1 ≤ k ≤ N, P(y^(i) | x^(i); θ) represents the probability of the tag sequence y^(i) corresponding to x^(i), θ represents the set of model parameters, and λ represents the L2 regularization parameter;
in sentence x^(i), count the number of tags in the output tag sequence y^(i) corresponding to commas, periods, question marks and exclamation marks:

n^(i) = ∑_j n_j^(i)

where n represents the number of such tags, j represents the tag index, and n_j^(i) represents the count of the j-th tag in the i-th sentence;
add a long-sentence penalty factor β to the loss function to obtain the improved loss function;
and train the labelled Chinese text corpus in the forward and backward directions with the goal of minimizing the improved log-likelihood loss function, completing the construction of the Chinese text automatic sentence-breaking and punctuation generation model.
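The improved loss in claim 6 can be sketched numerically. The base term follows the negative log-likelihood with L2 regularization described above; the exact form of the long-sentence penalty is not reproduced in the text, so the penalty below (characters per predicted punctuation mark, scaled by β) is an illustrative assumption, not the patented formula.

```python
# Hedged sketch of the improved loss: base negative log-likelihood + L2 term,
# plus an assumed penalty that grows when a sentence contains many characters
# but few predicted punctuation marks (i.e., overly long unbroken sentences).
def improved_log_likelihood_loss(log_probs, sent_lens, punct_counts,
                                 theta_sq_norm, lam=0.01, beta=0.1):
    """log_probs[i]    : log P(y^(i) | x^(i); theta) for sentence i
    sent_lens[i]    : number of characters in sentence i
    punct_counts[i] : n^(i), punctuation tags predicted in sentence i
    theta_sq_norm   : squared L2 norm of the model parameters
    """
    # base loss: negative log-likelihood over the batch, plus L2 regularization
    base = -sum(log_probs) + lam * theta_sq_norm
    # assumed penalty: many characters per punctuation mark costs more
    penalty = beta * sum(n_chars / (n_punct + 1)
                         for n_chars, n_punct in zip(sent_lens, punct_counts))
    return base + penalty

loss = improved_log_likelihood_loss(
    log_probs=[-2.0, -3.0], sent_lens=[10, 20],
    punct_counts=[1, 0], theta_sq_norm=4.0)
# loss == 5.04 (base) + 2.5 (penalty) == 7.54
```

Minimizing this quantity pushes the model both to predict the correct tag sequence and to break long stretches of text with punctuation.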
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911241042.3A CN111090981B (en) | 2019-12-06 | 2019-12-06 | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111090981A true CN111090981A (en) | 2020-05-01 |
CN111090981B CN111090981B (en) | 2022-04-15 |
Family
ID=70394814
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911241042.3A Active CN111090981B (en) | 2019-12-06 | 2019-12-06 | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111090981B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6067514A (en) * | 1998-06-23 | 2000-05-23 | International Business Machines Corporation | Method for automatically punctuating a speech utterance in a continuous speech recognition system |
US20130325442A1 (en) * | 2010-09-24 | 2013-12-05 | National University Of Singapore | Methods and Systems for Automated Text Correction |
US20190043486A1 (en) * | 2017-08-04 | 2019-02-07 | EMR.AI Inc. | Method to aid transcribing a dictated to written structured report |
CN109887499A (en) * | 2019-04-11 | 2019-06-14 | China University of Petroleum (East China) | An automatic speech sentence-breaking algorithm based on a recurrent neural network |
US20190188463A1 (en) * | 2017-12-15 | 2019-06-20 | Adobe Inc. | Using deep learning techniques to determine the contextual reading order in a form document |
CN109918666A (en) * | 2019-03-06 | 2019-06-21 | Beijing Technology and Business University | A neural-network-based Chinese punctuation mark insertion method |
CN110110335A (en) * | 2019-05-09 | 2019-08-09 | Nanjing University | A named entity recognition method based on a stacked model |
CN110245332A (en) * | 2019-04-22 | 2019-09-17 | Ping An Technology (Shenzhen) Co., Ltd. | Chinese character encoding method and apparatus based on a bidirectional long short-term memory network model |
US10431210B1 (en) * | 2018-04-16 | 2019-10-01 | International Business Machines Corporation | Implementing a whole sentence recurrent neural network language model for natural language processing |
Non-Patent Citations (7)
Title |
---|
JIANFENG GAO et al.: "Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach", Computational Linguistics * |
SI NIANWEN: "Research on Sentence-Level Text Processing Technology for the Military Domain", China Master's Theses Full-Text Database, Engineering Science and Technology II * |
YING WENHAO et al.: "A Topic-Sensitive Extractive Multi-Document Summarization Method", Journal of Chinese Information Processing * |
ZHANG HE et al.: "A Cascaded-CRF-Based Method for Sentence Segmentation and Punctuation Marking of Classical Chinese", Application Research of Computers * |
ZHANG HUI et al.: "Research on Automatic Punctuation Prediction Methods with CRF Models", Network New Media Technology * |
WANG BOLI et al.: "A Recurrent-Neural-Network-Based Sentence Segmentation Method for Classical Chinese", Acta Scientiarum Naturalium Universitatis Pekinensis * |
HU JIE et al.: "A Bidirectional Recurrent Network Model for Chinese Word Segmentation", Journal of Chinese Computer Systems * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723584A (en) * | 2020-06-24 | 2020-09-29 | 天津大学 | Punctuation prediction method based on consideration of domain information |
CN111723584B (en) * | 2020-06-24 | 2024-05-07 | 天津大学 | Punctuation prediction method based on consideration field information |
CN111951792A (en) * | 2020-07-30 | 2020-11-17 | 北京先声智能科技有限公司 | Punctuation marking model based on grouping convolution neural network |
CN111951792B (en) * | 2020-07-30 | 2022-12-16 | 北京先声智能科技有限公司 | Punctuation marking model based on grouping convolution neural network |
CN112001167A (en) * | 2020-08-26 | 2020-11-27 | 四川云从天府人工智能科技有限公司 | Punctuation mark adding method, system, equipment and medium |
CN112101003A (en) * | 2020-09-14 | 2020-12-18 | 深圳前海微众银行股份有限公司 | Sentence text segmentation method, device and equipment and computer readable storage medium |
CN112101003B (en) * | 2020-09-14 | 2023-03-14 | 深圳前海微众银行股份有限公司 | Sentence text segmentation method, device and equipment and computer readable storage medium |
CN112906366A (en) * | 2021-01-29 | 2021-06-04 | 深圳力维智联技术有限公司 | ALBERT-based model construction method, device, system and medium |
CN112906366B (en) * | 2021-01-29 | 2023-07-07 | 深圳力维智联技术有限公司 | ALBERT-based model construction method, device, system and medium |
CN113542661A (en) * | 2021-09-09 | 2021-10-22 | 北京鼎天宏盛科技有限公司 | Video conference voice recognition method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111090981B (en) | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network | |
Du et al. | Explicit interaction model towards text classification | |
CN108460013B (en) | Sequence labeling model and method based on fine-grained word representation model | |
CN107291795B (en) | Text classification method combining dynamic word embedding and part-of-speech tagging | |
CN107729309B (en) | Deep learning-based Chinese semantic analysis method and device | |
CN107358948B (en) | Language input relevance detection method based on attention model | |
CN109858041B (en) | Named entity recognition method combining semi-supervised learning with user-defined dictionary | |
CN112733541A (en) | Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism | |
CN111985239B (en) | Entity identification method, entity identification device, electronic equipment and storage medium | |
CN110347836B (en) | Method for classifying sentiments of Chinese-Yue-bilingual news by blending into viewpoint sentence characteristics | |
CN110851599B (en) | Automatic scoring method for Chinese composition and teaching assistance system | |
CN109086269B (en) | Semantic bilingual recognition method based on semantic resource word representation and collocation relationship | |
CN108829823A (en) | A kind of file classification method | |
CN110909736A (en) | Image description method based on long-short term memory model and target detection algorithm | |
Yang et al. | Rits: Real-time interactive text steganography based on automatic dialogue model | |
CN109919175B (en) | Entity multi-classification method combined with attribute information | |
Xing et al. | A convolutional neural network for aspect-level sentiment classification | |
CN111581970B (en) | Text recognition method, device and storage medium for network context | |
CN111709242A (en) | Chinese punctuation mark adding method based on named entity recognition | |
CN113723105A (en) | Training method, device and equipment of semantic feature extraction model and storage medium | |
CN112287106A (en) | Online comment emotion classification method based on dual-channel hybrid neural network | |
CN110222338A (en) | A kind of mechanism name entity recognition method | |
CN112131367A (en) | Self-auditing man-machine conversation method, system and readable storage medium | |
CN114756681A (en) | Evaluation text fine-grained suggestion mining method based on multi-attention fusion | |
CN112905736A (en) | Unsupervised text emotion analysis method based on quantum theory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||