CN111027291B - Method and device for adding mark symbols in text and method and device for training model, and electronic equipment - Google Patents
- Publication number: CN111027291B
- Application number: CN201911182421.XA
- Authority
- CN
- China
- Prior art keywords
- text
- word
- training
- punctuation
- added
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The embodiments of the invention provide a method and device for adding punctuation marks in text, a method and device for training the model, and an electronic device. The method comprises: performing word segmentation and part-of-speech recognition on the text to be added, normalizing it, and determining the character/word vectors; concatenating the part-of-speech information, word-segmentation boundary information, and character/word vectors to obtain feature vectors; inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate punctuated text sequences, which form a candidate text sequence set; filtering from the candidate set the text sequences that do not meet the conditions; and, among the remaining sequences, outputting the one with the highest joint probability that conforms to the punctuation specification, then reversing the normalization on the output sequence. This handles well the problem of adding several punctuation marks after a character and improves the accuracy of punctuation addition.
Description
Technical Field
The embodiments of the invention relate to language processing technology, and in particular to a method and device for adding punctuation marks in text, a method and device for training the model, and an electronic device.
Background
With the rapid development of society and high technology, natural language processing applications such as smart-home control, automatic question answering, and voice assistants are receiving more and more attention. However, because spoken dialogue carries no punctuation, sentence boundaries and regular language structures cannot be distinguished, which makes punctuation prediction an extremely important natural language processing task. In an intelligent telephone customer-service scenario, for example, speech recognition converts a user's utterance into raw text without punctuation or sentence breaks; such text cannot be used directly, so punctuation marks must be predicted and added.
In the related art, different technical solutions have been developed for automatically adding punctuation in this scenario, currently falling into two main categories: the first, based on speech information, judges punctuation positions from the duration of speech pauses; the second, based on the text sequence information, judges punctuation insertion positions from the text itself. The former is limited: because it judges only by pause duration, it cannot handle well long text spoken quickly, or utterances containing pauses mid-sentence. The latter, text-sequence-based method can judge insertion positions from contextual features. In some scenarios, however, several punctuation marks must be added after one character, and current methods do not address this need.
Disclosure of Invention
The embodiments of the invention provide a method and device for adding punctuation marks in text, a model training method and device, and an electronic device, which handle well the problem of adding several punctuation marks after one character and improve the accuracy of punctuation addition.
In a first aspect, an embodiment of the present invention provides a method for adding punctuation marks in text, including:
acquiring a text to be added that contains no punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and normalizing the set words in the processed text to be added;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
concatenating the part-of-speech information, word-segmentation boundary information, and character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate punctuated text sequences, which form a candidate text sequence set;
filtering from the candidate text sequence set the text sequences whose characters/words are inconsistent with the text to be added, and the text sequences that do not conform to the punctuation specification;
and outputting, from the remaining text sequences in the candidate text sequence set, the text sequence with the highest joint probability that conforms to the punctuation specification, and reversing the normalization of the output text sequence.
In a second aspect, the embodiment of the present invention further provides a training method for adding punctuation marks to a text, including:
removing punctuation marks from a plurality of original texts with the punctuation marks to obtain training texts;
performing word segmentation and part-of-speech recognition on the training texts, and normalizing the set words in the processed training texts;
acquiring a word/word vector corresponding to each character in the processed training text through a pre-training word/word vector model;
splicing part-of-speech information, word segmentation boundary information and word/word vectors corresponding to each character in the training text to obtain feature vectors;
inputting the feature vector into a seq2seq model, and training the seq2seq model to obtain a trained seq2seq model.
In a third aspect, an embodiment of the present invention further provides a punctuation mark adding device in a text, including:
the processing module is used for acquiring a text to be added without punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added;
the character/word vector obtaining module is used for obtaining, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
the splicing module is used for splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the text to be added to obtain a feature vector;
the candidate text sequence set obtaining module is used for inputting the feature vector into the trained seq2seq model to obtain a plurality of candidate text sequences added with punctuation, and forming a candidate text sequence set;
the filtering module is used for filtering the text sequences which are inconsistent with the characters/words in the text to be added in the candidate text sequence set, and filtering the text sequences which are not consistent with punctuation standards in the candidate text sequence set;
and the output/restoration module is used for outputting the text sequence which has the highest joint probability and accords with the punctuation mark specification from the rest text sequences in the candidate text sequence set, and carrying out normalization restoration operation on the output text sequence.
In a fourth aspect, an embodiment of the present invention provides a training device for adding punctuation marks to a text, including:
the training text obtaining module is used for removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
the processing module is used for carrying out word segmentation processing and part-of-speech recognition on the training text and carrying out normalization processing on the processed set words of the training text;
the word/word vector obtaining module is used for obtaining the word/word vector corresponding to each character in the processed training text through a pre-trained word/word vector model;
the splicing module is used for splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the training text to obtain a feature vector;
the training module is used for inputting the feature vector into the seq2seq model, and training the seq2seq model to obtain a trained seq2seq model.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
storage means for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement a method for adding punctuation marks in text provided by the embodiment of the present invention, or a method for training a model for adding punctuation marks in text provided by the embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a computer program is stored, where the program when executed by a processor implements a method for adding punctuation marks in text provided by the embodiment of the present invention, or a method for training a model for adding punctuation marks in text provided by the embodiment of the present invention.
According to the technical scheme provided by the embodiments of the invention, word segmentation, part-of-speech recognition, and normalization are performed on the punctuation-free text to be added; the character/word vector corresponding to each character is obtained and concatenated with the part-of-speech information and word-segmentation boundary information to form feature vectors; the feature vectors are input into a trained seq2seq model to obtain a plurality of candidate punctuated text sequences forming a candidate set; candidates that do not meet the conditions are filtered out; the sequence with the highest joint probability that conforms to the punctuation specification is output, and the normalization of the output sequence is reversed. In this way the problem of adding several punctuation marks after a character is handled well and the accuracy of punctuation addition improves.
Drawings
FIG. 1a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 1b is a schematic diagram of the structure of the seq2seq model provided by an embodiment of the present invention;
FIG. 1c is a flow chart of a method for training a seq2seq model provided by an embodiment of the present invention;
FIG. 2a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 2b is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 3 is a block diagram of a device for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 4 is a block diagram of a training device for adding punctuation marks in text according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention. The method may be performed by a device for adding punctuation marks in text, which may be implemented in software and/or hardware and configured in an electronic device such as a terminal or a server. The method is applicable to scenes where punctuation marks are added to text that lacks them, and optionally to scenes where several punctuation marks are added after a word in the text.
As shown in fig. 1a, the technical solution provided by the embodiment of the present invention includes:
s110: obtaining a text to be added without punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added.
In the embodiment of the invention, the text to be added contains no punctuation marks. It may be text converted from speech, such as a transcribed customer-service dialogue or a voice-assistant utterance, or any other text without punctuation marks.
In the embodiment of the invention, the set words may be words such as numbers, and normalization can be applied to them. Part-of-speech recognition analyses the attribute of each character/word in the text to be added, and the part-of-speech information obtained can be tagged.
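As a rough illustration of the number normalization described above (the placeholder token `<NUM>` is an assumption; the patent only speaks of replacing set words with a set identifier):

```python
import re

def normalize(text):
    # Record each digit run, then replace it with a placeholder token so
    # the model sees a uniform symbol; the originals are kept so the
    # output can be restored afterwards.
    originals = re.findall(r"\d+", text)
    return re.sub(r"\d+", "<NUM>", text), originals

norm, nums = normalize("order 123 shipped on 20200101")
# norm == "order <NUM> shipped on <NUM>", nums == ["123", "20200101"]
```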
S120: and acquiring the word/word vector corresponding to each character in the processed text to be added through a pre-trained word/word vector model.
In the embodiment of the invention, the normalized text to be added can be input into a pre-trained character/word vector model to obtain the character/word vector corresponding to each of its characters. Any pre-trained character/word vector model from the related art may be used.
S130: and splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the text to be added to obtain a feature vector.
In the embodiment of the invention, the concatenation order of the part-of-speech information, word-segmentation boundary information, and character/word vectors is not limited. The part-of-speech information indicates whether each character belongs to a noun, a verb, or another part of speech. The word-segmentation boundary information indicates, for example, whether a character is the first or the last character of a word.
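The concatenation step might be sketched as follows, with an illustrative three-tag part-of-speech set and B/M/E/S boundary tags (both tag sets are assumptions, not taken from the patent):

```python
POS_TAGS = ["noun", "verb", "other"]      # illustrative part-of-speech set
BOUNDARY_TAGS = ["B", "M", "E", "S"]      # first / middle / last / single character of a word

def one_hot(value, vocab):
    return [1.0 if value == v else 0.0 for v in vocab]

def build_feature(char_vec, pos, boundary):
    # Concatenate the character vector, part-of-speech one-hot, and
    # word-boundary one-hot into one feature vector; the patent leaves
    # the concatenation order open.
    return list(char_vec) + one_hot(pos, POS_TAGS) + one_hot(boundary, BOUNDARY_TAGS)

f = build_feature([0.2, -0.1], "noun", "B")   # 2 + 3 + 4 = 9 dimensions
```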
S140: and inputting the feature vector into a trained seq2seq model to obtain a plurality of candidate text sequences added with punctuation, and forming a candidate text sequence set.
In an embodiment of the present invention, the feature vectors are input into a trained sequence-to-sequence (seq2seq) model.
The seq2seq model may be a seq2seq model based on a bidirectional long short-term memory network and an attention mechanism; adopting such a model helps extract semantic information and improves the accuracy of punctuation addition. The model may include an embedding layer, an encoder, an attention layer, and a decoder; see fig. 1b for the specific structure. The encoder comprises a bidirectional long short-term memory unit (bidirectional LSTM). Because punctuation segments semantic groups, the insertion position must be judged comprehensively from both the preceding and the following context; the encoder therefore processes the input vectors in forward and in reverse order, producing two independent hidden states that capture past and future information respectively, and then combines the two hidden states as its final output.
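The forward/backward encoding idea can be illustrated with a toy recurrence standing in for the LSTM cells (the real encoder uses gated LSTM units; the damped accumulator here is purely illustrative):

```python
def run_direction(seq, step):
    # One encoding direction: scan the sequence, recording the hidden
    # state at every position.
    h, states = 0.0, []
    for x in seq:
        h = step(h, x)
        states.append(h)
    return states

def bidirectional_encode(seq):
    # Toy "cell": a damped accumulator stands in for the gated LSTM unit.
    step = lambda h, x: 0.5 * h + x
    forward = run_direction(seq, step)
    backward = list(reversed(run_direction(list(reversed(seq)), step)))
    # Combine the two independent hidden states per position as the output.
    return list(zip(forward, backward))

enc = bidirectional_encode([1.0, 2.0, 3.0])
# enc == [(1.0, 2.75), (2.5, 3.5), (4.25, 3.0)]
```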
The attention layer adopts a standard attention implementation. Given the forward and backward hidden states produced by the encoding layer, the context vector c output by the attention layer is obtained from the attention weights and the two hidden states; the weights reflect the relation between the current hidden state and the context.
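A minimal sketch of such a standard attention step, assuming dot-product scoring (the patent does not specify the scoring function):

```python
import math

def attention_context(query, hidden_states):
    # Score each encoder hidden state against the current decoder state
    # (dot product), softmax the scores into attention weights, and
    # return the weighted sum of the hidden states as context vector c.
    scores = [sum(q * h for q, h in zip(query, hs)) for hs in hidden_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * hs[i] for w, hs in zip(weights, hidden_states)) for i in range(dim)]

ctx = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```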
The decoder comprises a bidirectional long short-term memory unit. Based on the output of the attention layer (a context vector) and the vector output by the embedding layer, it generates, through a softmax function, a probability distribution over characters at each time step, selects characters according to their probabilities, and forms at least one punctuated text sequence. Specifically, the decoding layer selects N characters at each time step based on probability, where N is a positive integer, and builds at least one punctuated text sequence from the characters selected at each step. At each time step there is at least one predicted character, each with a certain probability; characters whose probability exceeds a set threshold can be chosen as predictions, and the characters selected at the successive steps are combined into at least one punctuated text sequence, yielding at least one candidate and forming the candidate text sequence set.
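The per-timestep selection of N characters and their combination into candidate sequences can be sketched as a simple beam-style expansion (the top-N selection rule and the toy probabilities are assumptions):

```python
def expand_candidates(step_probs, keep=2):
    # step_probs: per time step, a dict mapping candidate character -> probability.
    # Keep the `keep` most probable characters at each step and extend every
    # partial sequence with each of them, multiplying probabilities so the
    # joint score of each candidate is tracked.
    beams = [("", 1.0)]
    for probs in step_probs:
        top = sorted(probs.items(), key=lambda kv: -kv[1])[:keep]
        beams = [(seq + ch, p * q) for seq, p in beams for ch, q in top]
    return beams

steps = [{"A": 0.9, "B": 0.1}, {",": 0.6, "C": 0.4}]
cands = expand_candidates(steps)   # 2 x 2 = 4 candidate sequences
```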
S150: and filtering the text sequences which are inconsistent with the characters/words in the text to be added in the candidate text sequence set, and filtering the text sequences which do not accord with punctuation standards in the candidate text sequence set.
In the embodiment of the invention, a text sequence in the candidate set may be inconsistent with the characters and/or words of the text to be added. For example, if one candidate is 'ABC,XX' while the text to be added is 'ABBXX', the characters of 'ABC,XX' do not match those of the text to be added, so 'ABC,XX' is filtered out of the candidate set.
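This consistency filter amounts to stripping the punctuation from a candidate and comparing the rest with the input; a sketch (the punctuation set is illustrative):

```python
PUNCT = set("，。！？、；：“”‘’《》（）,.!?;:()'\"")

def strip_punct(seq):
    return "".join(ch for ch in seq if ch not in PUNCT)

def consistent_with_source(candidate, source):
    # A candidate survives only if deleting every punctuation mark from it
    # reproduces the original punctuation-free input exactly.
    return strip_punct(candidate) == source
```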
In the embodiment of the invention, text sequences in the candidate set may violate the punctuation specification, for example through consecutive commas, unmatched left and right parentheses, unmatched left and right title marks or quotation marks, or misused ellipses; text sequences whose punctuation violates the specification are filtered out.
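Two of the listed checks, consecutive commas and unmatched parentheses, might look like this (the remaining checks would be analogous; this is a sketch, not the patent's implementation):

```python
def meets_punctuation_spec(seq):
    # Rule 1: no two commas in a row (Chinese or ASCII).
    if ",," in seq or "，，" in seq:
        return False
    # Rule 2: parentheses must balance and never close before opening.
    depth = 0
    for ch in seq:
        if ch in "(（":
            depth += 1
        elif ch in ")）":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```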
S160: and outputting the text sequence which has the highest joint probability and accords with the punctuation mark specification from the rest text sequences in the candidate text sequence set, and carrying out normalized restoration operation on the output text sequence.
In the embodiment of the invention, among the remaining text sequences of the candidate set, each character of a sequence has a probability predicted by the seq2seq model, and the sequence with the highest joint probability is selected for output. The joint probability of a sequence is determined from its per-character probabilities, for example as their sum, their average, or the sum of each character's probability multiplied by a corresponding weight.
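A sketch using a sum-of-log-probabilities score (one common variant; the sums, averages, or weighted sums the paragraph lists would work the same way):

```python
import math

def joint_log_prob(char_probs):
    # Joint score of a sequence as the sum of log character probabilities.
    return sum(math.log(p) for p in char_probs)

def best_sequence(candidates):
    # candidates: list of (sequence, per-character probabilities).
    return max(candidates, key=lambda c: joint_log_prob(c[1]))[0]

best = best_sequence([("A,B", [0.9, 0.8, 0.9]), ("A.B", [0.9, 0.1, 0.9])])
# best == "A,B"
```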
In the embodiment of the invention, the restore operation reverses the normalization of the output text sequence, specifically restoring words such as numbers in the output text. By filtering the text sequences output by the seq2seq model, unreasonable results and violations of the punctuation specification are removed, the text sequence can be predicted accurately, and the accuracy of punctuation addition improves.
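Assuming numbers were replaced by a placeholder token during normalization (the `<NUM>` token is an assumption), the restore operation can be sketched as:

```python
def restore(text, originals, placeholder="<NUM>"):
    # Re-insert the remembered values left to right, one per placeholder,
    # inverting the earlier normalization.
    for value in originals:
        text = text.replace(placeholder, value, 1)
    return text

restored = restore("order <NUM> shipped, on <NUM>.", ["123", "20200101"])
# restored == "order 123 shipped, on 20200101."
```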
The specific procedure of the punctuation-adding method is shown in fig. 1c. It should be noted that the method provided by the embodiment of the present invention may run as an online flow.
In the related art, both the approach that judges punctuation insertion positions from the text information and the approach that judges them from contextual features over the text sequence can be realized by sequence labelling, but sequence labelling assigns at most one label per character, whereas some scenes require several punctuation marks after one character. For example, a sentence ending with a quoted title such as 《出师表》 needs a closing title mark, a closing quotation mark, and a period after its final character. The method provided by the embodiment of the invention processes the punctuation-free text to be added into feature vectors, feeds them into the trained seq2seq model to obtain candidate punctuated text sequences, and filters and screens them to obtain the final output; because whole sequences are generated rather than per-character labels, the problem of adding several punctuation marks after a character is handled well and the accuracy of punctuation addition improves.
According to the technical scheme provided by the embodiments of the invention, word segmentation, part-of-speech recognition, and normalization are performed on the punctuation-free text to be added; the character/word vector corresponding to each character is obtained and concatenated with the part-of-speech information and word-segmentation boundary information to form feature vectors; the feature vectors are input into a trained seq2seq model to obtain a plurality of candidate punctuated text sequences forming a candidate set; candidates that do not meet the conditions are filtered out; the sequence with the highest joint probability that conforms to the punctuation specification is output, and the normalization of the output sequence is reversed. In this way the problem of adding several punctuation marks after a character is handled well and the accuracy of punctuation addition improves.
Fig. 2a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention, which adds the process of training the seq2seq model; this training process may run as an offline flow.
As shown in fig. 2a, the technical solution provided by the embodiment of the present invention includes:
S210: removing punctuation marks from a plurality of original texts with the punctuation marks to obtain training texts; and performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the processed set words of the training text.
The normalization of the training text may consist of normalizing the numbers in it, for example replacing them with a set identifier, which increases the processing speed of the model.
Word segmentation and part-of-speech recognition of the training text proceed in the same way as in the embodiment above.
S220: and acquiring the word/word vector corresponding to each character in the processed training text through a pre-trained word/word vector model.
S230: and splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the training text to obtain a feature vector.
In the embodiment of the invention, the semantic information of the current input is obtained through the character vectors and/or word vectors together with the part-of-speech features. Introducing word-segmentation and part-of-speech features also supplies the boundary information of the current input character, so knowledge of word boundaries can be learned during training, avoiding the situation at the prediction stage where a punctuation mark splits the content of a segmented word.
S240: inputting the feature vector into a seq2seq model, and training the seq2seq model to obtain a trained seq2seq model.
The seq2seq model may be a seq2seq model based on a bidirectional long short-term memory network and an attention mechanism; adopting such a model helps extract semantic information and improves the accuracy of punctuation addition. The model may include an embedding layer, an encoder, an attention layer, and a decoder; see fig. 1b for the specific structure. The encoder comprises a bidirectional long short-term memory unit (bidirectional LSTM). Because punctuation segments semantic groups, the insertion position must be judged comprehensively from both the preceding and the following context; the encoder therefore processes the input vectors in forward and in reverse order, producing two independent hidden states that capture past and future information respectively, and then combines the two hidden states as its final output.
The attention layer adopts a standard attention implementation. Given the forward and backward hidden states produced by the encoding layer, the context vector c output by the attention layer is obtained from the attention weights and the two hidden states; the weights reflect the relation between the current hidden state and the context.
The decoder comprises a bidirectional long short-term memory unit. Based on the output of the attention layer (a context vector) and the vector output by the embedding layer, it generates, through a softmax function, a probability distribution over characters at each time step, selects characters according to their probabilities, and forms at least one punctuated text sequence. Specifically, the decoding layer selects N characters at each time step based on probability, where N is a positive integer, and builds at least one punctuated text sequence from the characters selected at each step. At each time step there is at least one predicted character, each with a certain probability; characters whose probability exceeds a set threshold can be chosen as predictions, and the characters selected at the successive steps are combined into at least one punctuated text sequence. The text sequences output by the decoding layer are matched against the original texts and the seq2seq model is adjusted accordingly, completing the training and yielding the trained seq2seq model; the specific training process is shown in fig. 2b. A model trained in this way fully draws on the contextual character and word vectors and fully considers the transition relations between punctuation marks, so a better prediction effect can be obtained.
S250: obtaining a text to be added that contains no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added.
S260: obtaining the character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model.
S270: splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the text to be added to obtain feature vectors.
S280: inputting the feature vectors into the trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set.
S290: filtering out text sequences in the candidate text sequence set whose characters/words are inconsistent with the text to be added, and filtering out text sequences in the candidate text sequence set that do not conform to the punctuation specification.
S291: outputting, from the remaining text sequences in the candidate text sequence set, the text sequence that has the highest joint probability and conforms to the punctuation specification, and performing a normalization-restoration operation on the output text sequence.
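The filtering and selection of S290-S291 can be sketched as follows. The punctuation inventory, the ill-formedness rule, and all names here are illustrative assumptions — the patent does not fix a concrete punctuation specification:

```python
import re

PUNCT = "，。！？；："  # assumed punctuation inventory

def strip_punct(seq):
    """Remove punctuation so a candidate can be compared with the source text."""
    return "".join(ch for ch in seq if ch not in PUNCT)

def violates_spec(seq):
    """Toy specification check: no leading punctuation, no two marks in a row."""
    return bool(re.search(r"^[%s]|[%s]{2}" % (PUNCT, PUNCT), seq))

def pick_output(candidates, source_text):
    """candidates: list of (punctuated_sequence, joint_probability).
    Drop candidates whose characters disagree with the source text or that
    break the specification, then return the highest joint-probability survivor."""
    kept = [(s, p) for s, p in candidates
            if strip_punct(s) == source_text and not violates_spec(s)]
    return max(kept, key=lambda sp: sp[1])[0] if kept else None

cands = [("你好，今天。", 0.5), ("你好，，今天。", 0.6), ("你们，今天。", 0.7)]
print(pick_output(cands, "你好今天"))  # the consistent, well-formed candidate
```

Note that the second candidate is dropped for consecutive punctuation despite having a higher joint probability, and the third is dropped because its characters disagree with the source text.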
Fig. 3 is a flowchart of a method for training a punctuation-mark addition model in text. The method may be performed by a punctuation-mark addition model training device in text, and the device may be implemented in software and/or hardware.
As shown in fig. 3, the technical solution provided by the embodiment of the present invention includes:
S310: removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text.
S320: performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the set words of the processed training text.
S330: obtaining the character/word vector corresponding to each character in the processed training text through a pre-trained character/word vector model.
S340: splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors.
S350: inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
In this embodiment, for the descriptions of S310-S350, refer to the descriptions of S210-S250 in the embodiments above.
According to the embodiment of the present invention, the semantic information of the current input can be obtained through the character vector and/or word vector together with the part-of-speech feature. By introducing features such as word segmentation and part of speech, the boundary information of the current input character is also obtained, so that knowledge of word boundaries can be learned during training, which avoids punctuation marks splitting segmented words in the prediction stage.
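One plausible way to splice the three feature sources into a per-character feature vector is shown below; the tag inventories, the BMES boundary coding, and the toy embeddings are assumptions for illustration:

```python
import numpy as np

POS_TAGS = ["n", "v", "d", "u"]   # assumed part-of-speech inventory
BOUNDARY = ["B", "M", "E", "S"]   # begin/middle/end of a word, or single-char word

def one_hot(value, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def build_features(chars, char_vectors, pos_tags, boundaries):
    """Per character, concatenate: its pre-trained character vector,
    a one-hot part-of-speech code, and a one-hot word-boundary code."""
    return np.stack([
        np.concatenate([char_vectors[c],
                        one_hot(t, POS_TAGS),
                        one_hot(b, BOUNDARY)])
        for c, t, b in zip(chars, pos_tags, boundaries)
    ])

dim = 4  # toy embedding size
vecs = {c: np.full(dim, i, dtype=float) for i, c in enumerate("今天好")}
feats = build_features("今天好", vecs, ["n", "n", "v"], ["B", "E", "S"])
print(feats.shape)  # one row per character: embedding + POS + boundary dims
```

The boundary code is what lets the model learn not to place punctuation inside a segmented word: a mark after a "B" or "M" character would split a word.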
Fig. 3 is a structural block diagram of a punctuation-mark adding device in text according to an embodiment of the present invention. As shown in Fig. 3, the device includes: a processing module 310, a character/word vector obtaining module 320, a splicing module 330, a candidate text sequence set obtaining module 340, a filtering module 350, and an output/restoration module 360.
The processing module 310 is configured to obtain a text to be added that contains no punctuation marks, perform word segmentation processing and part-of-speech recognition on the text to be added, and perform normalization processing on the set words in the processed text to be added;
the character/word vector obtaining module 320 is configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
the splicing module 330 is configured to splice the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
the candidate text sequence set obtaining module 340 is configured to input the feature vectors into the trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
the filtering module 350 is configured to filter out text sequences in the candidate text sequence set whose characters/words are inconsistent with the text to be added, and to filter out text sequences in the candidate text sequence set that do not conform to the punctuation specification;
the output/restoration module 360 is configured to output, from the remaining text sequences in the candidate text sequence set, the text sequence that has the highest joint probability and conforms to the punctuation specification, and to perform a normalization-restoration operation on the output text sequence.
Optionally, the seq2seq model is a seq2seq model based on a bidirectional long short-term memory model and an attention mechanism.
Optionally, the seq2seq model includes an embedding layer, an encoder, an attention layer, and a decoder;
the encoder comprises a bidirectional long short-term memory unit, and the attention layer comprises a long short-term memory unit;
the decoder comprises a bidirectional long short-term memory unit and is used to generate a probability distribution over characters at each time step through a softmax function, based on the output result of the attention layer and the output result of the embedding layer, to select characters at each time step according to their probabilities, and to form at least one punctuation-added text sequence.
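The context vector the attention layer hands to the decoder can be sketched with a simple dot-product scoring scheme; the patent does not specify the scoring function, so this is one common choice rather than the claimed design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, decoder_state):
    """Weight each encoder time-step state by its dot-product similarity
    to the current decoder state; the weighted sum is the context vector."""
    scores = encoder_states @ decoder_state  # one score per encoder step
    weights = softmax(scores)                # attention distribution
    return weights @ encoder_states          # context vector

T, H = 5, 6
rng = np.random.default_rng(0)
enc = rng.normal(size=(T, H))  # stand-ins for BiLSTM encoder outputs
dec = rng.normal(size=H)       # stand-in for the current decoder hidden state
ctx = attention_context(enc, dec)
print(ctx.shape)
```

The context vector has the same dimensionality as a single encoder state; it is this vector, concatenated with the embedding-layer output, that the decoder consumes at each step.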
Optionally, the device further comprises a training module configured to:
remove punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text; perform word segmentation processing and part-of-speech recognition on the training text, and perform normalization processing on the set words of the processed training text;
obtain the character/word vector corresponding to each character in the processed training text through a pre-trained character/word vector model;
splice the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors;
input the feature vectors into a seq2seq model and train the seq2seq model to obtain a trained seq2seq model.
Optionally, the set words include numbers.
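Normalizing set words such as numbers, and restoring them after punctuation has been added (the normalization-restoration operation performed by the output/restoration module), can be sketched as follows; the placeholder token is a hypothetical choice:

```python
import re

NUM_TOKEN = "<num>"  # hypothetical placeholder for digit runs

def normalize_numbers(text):
    """Replace each run of digits with a placeholder and remember the
    originals so they can be restored after punctuation is added."""
    numbers = re.findall(r"\d+", text)
    return re.sub(r"\d+", NUM_TOKEN, text), numbers

def restore_numbers(text, numbers):
    """Put the remembered digit runs back, in order of appearance."""
    it = iter(numbers)
    return re.sub(re.escape(NUM_TOKEN), lambda _: next(it), text)

norm, nums = normalize_numbers("北京2019年11月27日")
restored = restore_numbers(norm + "。", nums)  # pretend the model appended "。"
print(norm)
print(restored)
```

Collapsing arbitrary digit strings into a single token shrinks the model's vocabulary and lets it generalize over numbers it never saw in training.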
The device can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method.
Fig. 4 is a structural block diagram of a training device for a punctuation-mark addition model in text according to an embodiment of the present invention. As shown in Fig. 4, the device includes: a training text obtaining module 410, a processing module 420, a character/word vector obtaining module 430, a splicing module 440, and a training module 450.
The training text obtaining module 410 is configured to remove punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
the processing module 420 is configured to perform word segmentation processing and part-of-speech recognition on the training text, and to perform normalization processing on the set words of the processed training text;
the character/word vector obtaining module 430 is configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
the splicing module 440 is configured to splice the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors;
the training module 450 is configured to input the feature vectors into a seq2seq model and train the seq2seq model to obtain a trained seq2seq model.
The device can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 5, where the device includes:
one or more processors 510, one processor 510 being illustrated in fig. 5;
a memory 520;
the apparatus may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 in the apparatus may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 5.
The memory 520, as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for adding punctuation marks in text in the embodiment of the present invention (e.g., the processing module 310, the character/word vector obtaining module 320, the splicing module 330, the candidate text sequence set obtaining module 340, the filtering module 350, and the output/restoration module 360 shown in Fig. 3), or the program instructions/modules corresponding to the method for training a punctuation-mark addition model in text in the embodiment of the present invention (e.g., the training text obtaining module 410, the processing module 420, the character/word vector obtaining module 430, the splicing module 440, and the training module 450 shown in Fig. 4).
The processor 510 executes the various functional applications and data processing of the computer device by running the software programs, instructions, and modules stored in the memory 520, i.e., implements the method for adding punctuation marks in text of the above method embodiments:
obtaining a text to be added that contains no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added;
obtaining the character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model;
splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into the trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
filtering out text sequences in the candidate text sequence set whose characters/words are inconsistent with the text to be added, and filtering out text sequences in the candidate text sequence set that do not conform to the punctuation specification;
outputting, from the remaining text sequences in the candidate text sequence set, the text sequence that has the highest joint probability and conforms to the punctuation specification, and performing a normalization-restoration operation on the output text sequence.
Alternatively, the processor implements the method for training a punctuation-mark addition model in text provided by the embodiment of the present invention, which comprises:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the set words of the processed training text;
obtaining the character/word vector corresponding to each character in the processed training text through a pre-trained character/word vector model;
splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
Memory 520 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 520 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 520 may optionally include memory located remotely from processor 510, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 530 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output 540 may include a display device such as a display screen.
An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for adding punctuation marks in text as follows:
obtaining a text to be added that contains no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added;
obtaining the character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model;
splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into the trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
filtering out text sequences in the candidate text sequence set whose characters/words are inconsistent with the text to be added, and filtering out text sequences in the candidate text sequence set that do not conform to the punctuation specification;
outputting, from the remaining text sequences in the candidate text sequence set, the text sequence that has the highest joint probability and conforms to the punctuation specification, and performing a normalization-restoration operation on the output text sequence.
Alternatively, the program implements the method for training a punctuation-mark addition model in text provided by the embodiment of the present invention, which comprises:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the set words of the processed training text;
obtaining the character/word vector corresponding to each character in the processed training text through a pre-trained character/word vector model;
splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, it is not limited to them, and it may be embodied in many other equivalent forms without departing from its spirit or scope, which is set forth in the following claims.
Claims (10)
1. A method for adding punctuation marks in text, comprising:
obtaining a text to be added that contains no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added;
obtaining the character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model;
splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
filtering out text sequences in the candidate text sequence set whose characters/words are inconsistent with the text to be added, and filtering out text sequences in the candidate text sequence set that do not conform to the punctuation specification;
outputting, from the remaining text sequences in the candidate text sequence set, the text sequence that has the highest joint probability and conforms to the punctuation specification, and performing a normalization-restoration operation on the output text sequence.
2. The method of claim 1, wherein the seq2seq model is a seq2seq model based on a bidirectional long short-term memory model and an attention mechanism.
3. The method according to claim 1 or 2, wherein the seq2seq model comprises an embedding layer, an encoder, an attention layer, and a decoder;
the encoder comprises a bidirectional long short-term memory unit, and the attention layer comprises a long short-term memory unit;
the decoder comprises a bidirectional long short-term memory unit and is used to generate a probability distribution over characters at each time step through a softmax function, based on the output result of the attention layer and the output result of the embedding layer, to select characters at each time step according to their probabilities, and to form at least one punctuation-added text sequence.
4. The method of claim 1, further comprising:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text; performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the set words of the processed training text;
obtaining the character/word vector corresponding to each character in the processed training text through a pre-trained character/word vector model;
splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
5. The method of claim 1, wherein the set words include numbers.
6. A method for training a punctuation-mark addition model in text, comprising:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the set words of the processed training text;
obtaining the character/word vector corresponding to each character in the processed training text through a pre-trained character/word vector model;
splicing the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
7. A device for adding punctuation marks in text, comprising:
a processing module, configured to obtain a text to be added that contains no punctuation marks, perform word segmentation processing and part-of-speech recognition on the text to be added, and perform normalization processing on the set words in the processed text to be added;
a character/word vector obtaining module, configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
a splicing module, configured to splice the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
a candidate text sequence set obtaining module, configured to input the feature vectors into the trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
a filtering module, configured to filter out text sequences in the candidate text sequence set whose characters/words are inconsistent with the text to be added, and to filter out text sequences in the candidate text sequence set that do not conform to the punctuation specification;
an output/restoration module, configured to output, from the remaining text sequences in the candidate text sequence set, the text sequence that has the highest joint probability and conforms to the punctuation specification, and to perform a normalization-restoration operation on the output text sequence.
8. A training device for a punctuation-mark addition model in text, comprising:
a training text obtaining module, configured to remove punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
a processing module, configured to perform word segmentation processing and part-of-speech recognition on the training text, and to perform normalization processing on the set words of the processed training text;
a character/word vector obtaining module, configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
a splicing module, configured to splice the part-of-speech information, the word-segmentation boundary information, and the character/word vector corresponding to each character in the training text to obtain feature vectors;
a training module, configured to input the feature vectors into a seq2seq model and train the seq2seq model to obtain a trained seq2seq model.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for adding punctuation marks in text according to any one of claims 1-5, or the method for training a punctuation-mark addition model in text according to claim 6.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for adding punctuation marks in text according to any one of claims 1-5, or the method for training a punctuation-mark addition model in text according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911182421.XA CN111027291B (en) | 2019-11-27 | 2019-11-27 | Method and device for adding mark symbols in text and method and device for training model, and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111027291A CN111027291A (en) | 2020-04-17 |
CN111027291B true CN111027291B (en) | 2024-03-26 |
Family
ID=70207202
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111027291B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950256A (en) * | 2020-06-23 | 2020-11-17 | 北京百度网讯科技有限公司 | Sentence break processing method and device, electronic equipment and computer storage medium |
CN111753524A (en) * | 2020-07-01 | 2020-10-09 | 携程计算机技术(上海)有限公司 | Text sentence break position identification method and system, electronic device and storage medium |
CN112001167B (en) * | 2020-08-26 | 2021-04-23 | 四川云从天府人工智能科技有限公司 | Punctuation mark adding method, system, equipment and medium |
CN112464642A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Method, device, medium and electronic equipment for adding punctuation to text |
CN113609850A (en) * | 2021-07-02 | 2021-11-05 | 北京达佳互联信息技术有限公司 | Word segmentation processing method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107767870A (en) * | 2017-09-29 | 2018-03-06 | 百度在线网络技术(北京)有限公司 | Adding method, device and the computer equipment of punctuation mark |
CN108932226A (en) * | 2018-05-29 | 2018-12-04 | 华东师范大学 | A kind of pair of method without punctuate text addition punctuation mark |
CN109918666A (en) * | 2019-03-06 | 2019-06-21 | 北京工商大学 | A kind of Chinese punctuation mark adding method neural network based |
WO2019174422A1 (en) * | 2018-03-16 | 2019-09-19 | 北京国双科技有限公司 | Method for analyzing entity association relationship, and related apparatus |
Non-Patent Citations (1)
Title |
---|
Ren Zhihui; Xu Haoyu; Feng Songlin; Zhou Han; Shi Jun. Chinese word segmentation via sequence labeling based on LSTM networks. Application Research of Computers, 2017, Vol. 34, No. 5 (full text). *
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Country or region after: China; Address after: Room 501, 502, 503, No. 66 Boxia Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203; Applicant after: Daguan Data Co.,Ltd.; Address before: Room 301, 303 and 304, Block B, 112 Liangxiu Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203; Applicant before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co.,Ltd.; Country or region before: China |
| GR01 | Patent grant | |