CN111027291A - Method and device for adding punctuation marks in text and training model and electronic equipment - Google Patents

Method and device for adding punctuation marks in text and training model and electronic equipment

Info

Publication number
CN111027291A
CN111027291A (application CN201911182421.XA)
Authority
CN
China
Prior art keywords
text
word
training
character
added
Prior art date
Legal status
Granted
Application number
CN201911182421.XA
Other languages
Chinese (zh)
Other versions
CN111027291B (en)
Inventor
张健
陈运文
房悦竹
赵朗琦
刘书龙
纪达麒
Current Assignee
Datagrand Tech Inc
Original Assignee
Datagrand Tech Inc
Priority date
Filing date
Publication date
Application filed by Datagrand Tech Inc
Priority to CN201911182421.XA
Publication of CN111027291A
Application granted
Publication of CN111027291B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a method and a device for adding punctuation marks in text and for training a model, and an electronic device, wherein the method comprises the following steps: performing word segmentation processing and part-of-speech recognition on the text to be added, performing normalization processing, and determining character/word vectors; concatenating the part-of-speech information, the word segmentation boundary information and the character/word vectors to obtain feature vectors; inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, which form a candidate text sequence set; filtering out the text sequences in the candidate text sequence set that do not meet the conditions; and outputting, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restoring the normalization on the output text sequence. This can well solve the problem of adding a plurality of punctuation marks after a character and improves the accuracy of punctuation mark addition.

Description

Method and device for adding punctuation marks in text and training model and electronic equipment
Technical Field
The embodiment of the invention relates to language processing technology, and in particular to a method and a device for adding punctuation marks in text and for training a model, and an electronic device.
Background
With the rapid development of society and of high-tech industry, natural language processing applications such as smart home control, automatic question answering and voice assistants are receiving more and more attention. Because spoken dialog carries no punctuation, sentence boundaries and canonical language structures cannot be distinguished, which makes punctuation prediction an extremely important natural language processing task. In an intelligent telephone service scenario, the user's speech is converted by speech recognition into raw text without punctuation marks; such raw text cannot be used directly, so punctuation prediction is performed on it in order to add punctuation marks.
In the related art, different technical solutions have been developed for automatically adding punctuation marks; at present they mainly fall into two categories: methods based on speech information, which locate punctuation marks from the duration of speech pauses, and methods based on text sequence information, which determine insertion positions from the text itself. The former is limited because it can judge only by pause duration, so it handles poorly both fast speech over long text and speech that pauses in mid-sentence. The latter, text-sequence-based methods can judge the insertion position of a punctuation mark from context features. However, in some scenarios multiple punctuation marks need to be added after a single character, a need that current methods do not yet address.
Disclosure of Invention
The embodiment of the invention provides a method and a device for adding punctuation marks in text and for training a model, and an electronic device, which can well solve the problem of adding a plurality of punctuation marks after a character and improve the accuracy of punctuation mark addition.
In a first aspect, an embodiment of the present invention provides a method for adding punctuation marks in text, including:
acquiring a text to be added, which has no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added;
acquiring a character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set;
filtering out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filtering out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification;
and outputting, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restoring the normalization on the output text sequence.
In a second aspect, an embodiment of the present invention further provides a method for training a punctuation mark adding model for text, including:
removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
performing word segmentation processing and part-of-speech recognition on the training texts, and performing normalization processing on the set words of the processed training texts;
acquiring a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and inputting the feature vectors into a seq2seq model, and training the seq2seq model to obtain the trained seq2seq model.
In a third aspect, an embodiment of the present invention further provides a device for adding punctuation marks in text, including:
a processing module, configured to acquire a text to be added, which has no punctuation marks, perform word segmentation processing and part-of-speech recognition on the text to be added, and perform normalization processing on set words in the processed text to be added;
a character/word vector obtaining module, configured to obtain, through a pre-trained character/word vector model, a character/word vector corresponding to each character in the processed text to be added;
a concatenation module, configured to concatenate the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
a candidate text sequence set obtaining module, configured to input the feature vectors into a trained seq2seq model, obtain a plurality of candidate text sequences with punctuation marks added, and form a candidate text sequence set;
a filtering module, configured to filter out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filter out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification;
and an output/restoration module, configured to output, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restore the normalization on the output text sequence.
In a fourth aspect, an embodiment of the present invention provides a training apparatus for a punctuation mark adding model in text, including:
a training text obtaining module, configured to remove punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
a processing module, configured to perform word segmentation processing and part-of-speech recognition on the training texts, and perform normalization processing on the set words of the processed training texts;
a character/word vector obtaining module, configured to obtain, through a pre-trained character/word vector model, a character/word vector corresponding to each character in the processed training texts;
a concatenation module, configured to concatenate the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and a training module, configured to input the feature vectors into a seq2seq model, and train the seq2seq model to obtain the trained seq2seq model.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the method for adding punctuation marks in text provided by the embodiment of the present invention, or the method for training a punctuation mark adding model for text provided by the embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for adding punctuation marks in text, or the method for training a punctuation mark adding model for text, according to an embodiment of the present invention.
According to the technical scheme provided by the embodiment of the invention, word segmentation processing, part-of-speech recognition and normalization processing are performed on the text to be added, which has no punctuation marks; a character/word vector corresponding to each character in the text to be added is obtained; the character/word vector, the part-of-speech information and the word segmentation boundary information are concatenated to obtain a feature vector; the feature vector is input into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set; the candidate texts that do not meet the conditions are filtered out; the text sequence with the highest joint probability that meets the punctuation mark specification is output; and the normalization is restored on the output text sequence. In this way, the problem of adding a plurality of punctuation marks after a character can be well solved, and the accuracy of punctuation mark addition is improved.
Drawings
Fig. 1a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 1b is a schematic structural diagram of a seq2seq model provided by an embodiment of the present invention;
FIG. 1c is a flowchart of a method for seq2seq model training according to an embodiment of the present invention;
fig. 2a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
fig. 2b is a flowchart of training a seq2seq model according to an embodiment of the present invention;
fig. 3 is a block diagram of a device for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 4 is a block diagram of a training apparatus for a punctuation mark adding model in text according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention. The method may be executed by a device for adding punctuation marks in text; the device may be implemented by software and/or hardware and may be configured in an electronic device such as a terminal or a server. The method may be applied in scenarios where punctuation marks are added to text without punctuation, and optionally in scenarios where multiple punctuation marks are added after a character in the text.
As shown in fig. 1a, the technical solution provided by the embodiment of the present invention includes:
S110: acquiring a text to be added, which has no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added.
In the embodiment of the present invention, the text to be added has no punctuation marks. The text to be added may be text converted from speech, for example text transcribed from a customer service dialog or produced by a voice assistant, or any other text without punctuation marks.
In the embodiment of the present invention, the set words may be tokens such as numbers, and such tokens can be normalized. Part-of-speech recognition analyzes the attribute of each character/word in the text to be added, and the part-of-speech information obtained from the recognition can be labeled.
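As an illustration of this normalization and of the later restoration step (S160), the following Python sketch replaces digit runs with a placeholder before prediction and puts them back afterwards. The NUM_TOKEN placeholder and the digit-only rule are assumptions for the example, not the patent's exact normalization scheme.

```python
import re

NUM_TOKEN = "<num>"  # hypothetical placeholder for normalized numbers

def normalize(text):
    """Replace each run of digits with a placeholder and remember the originals."""
    originals = re.findall(r"\d+", text)
    return re.sub(r"\d+", NUM_TOKEN, text), originals

def restore(text, originals):
    """Put the remembered values back in order (the restoration of S160)."""
    for value in originals:
        text = text.replace(NUM_TOKEN, value, 1)
    return text

normalized, saved = normalize("order 12345 shipped on 20191127")
print(normalized)                  # order <num> shipped on <num>
print(restore(normalized, saved))  # order 12345 shipped on 20191127
```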
S120: acquiring a character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model.
In the embodiment of the invention, the normalized text to be added can be input into the pre-trained character/word vector model to obtain the character/word vector corresponding to each character in the text to be added. Any pre-trained character/word vector model from the related art may be used.
S130: concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors.
In the embodiment of the present invention, the order in which the part-of-speech information, the word segmentation boundary information and the character/word vector are concatenated is not limited. The part-of-speech information records whether each character corresponds to a noun, a verb or another part of speech. The word segmentation boundary information records, for example, whether a character is the first or the last character of a word.
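A minimal sketch of the concatenation described in S130, assuming a 128-dimensional character/word vector and small illustrative tag inventories; the real tag sets, dimensions and concatenation order are not fixed by the description:

```python
import numpy as np

# Illustrative inventories; the actual tag sets are assumptions.
POS_TAGS = ["noun", "verb", "other"]
BOUNDARY = ["B", "I", "E", "S"]  # begin / inside / end / single-character word

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def feature_vector(char_vec, pos_tag, boundary_tag):
    """Concatenate the character/word vector, a POS one-hot and a boundary one-hot."""
    return np.concatenate([
        char_vec,
        one_hot(POS_TAGS.index(pos_tag), len(POS_TAGS)),
        one_hot(BOUNDARY.index(boundary_tag), len(BOUNDARY)),
    ])

char_vec = np.random.rand(128).astype(np.float32)   # stand-in for a pre-trained vector
print(feature_vector(char_vec, "noun", "B").shape)  # (135,) = 128 + 3 + 4
```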
S140: inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set.
In an embodiment of the present invention, the feature vectors are input into a trained sequence-to-sequence (seq2seq) model.
The seq2seq model can be based on a bidirectional long short-term memory (LSTM) model and an attention mechanism; such a model can exploit the contextual semantic information and improve the accuracy of punctuation addition. The seq2seq model may include an embedding layer, an encoder, an attention layer and a decoder; for the specific structure of the model, reference may be made to fig. 1b. The encoder includes a bidirectional long short-term memory unit (bidirectional LSTM). Because punctuation marks segment semantic groups, the insertion position of a punctuation mark must be judged comprehensively from both the preceding and the following semantic information. The encoder therefore uses a bidirectional LSTM that maintains two independent hidden states over the input vectors, one in forward order and one in reverse order, to capture past and future information respectively, and then combines the two hidden states as the final output.
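A minimal PyTorch sketch of such a bidirectional encoder; the feature dimension (135, matching the concatenation sketch above) and the hidden size are assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional LSTM encoder: the forward and backward passes each keep
    their own hidden state, and the two are concatenated as the final output."""
    def __init__(self, feat_dim=135, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, feats):          # feats: (batch, seq_len, feat_dim)
        outputs, _ = self.lstm(feats)  # outputs: (batch, seq_len, 2*hidden)
        return outputs                 # past and future context per position

enc = Encoder()
print(enc(torch.randn(2, 10, 135)).shape)  # torch.Size([2, 10, 512])
```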
The attention layer adopts a standard attention implementation. The encoding layer produces forward and backward hidden states, and the context vector c output by the attention layer is obtained by weighting the two hidden-state sequences output by the encoding layer with attention weights; the weights reflect the relation between the current hidden state and the context.
The decoder includes a bidirectional long short-term memory unit, which, based on the output result of the attention layer and the vectors output by the embedding layer, generates a probability distribution over characters at each time step through a softmax function, selects characters at each time step according to their probabilities, and forms at least one punctuation-added text sequence. The output result of the attention layer may be the context vector. Specifically, N characters (N being a positive integer) are selected by the decoding layer at each time step according to probability. At each time step there is at least one predicted character, each with a certain probability; the characters whose probability is greater than a set probability can be selected as predicted characters. The characters selected at the respective time steps are then combined into at least one punctuation-added text sequence, so that at least one candidate text sequence is obtained and the candidate text sequence set is formed.
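The following PyTorch sketch shows one decoding step with standard attention and a softmax over the character vocabulary, keeping the top-N candidates per step. For simplicity it uses a unidirectional LSTM cell, whereas the description states the decoder also uses a bidirectional LSTM unit; the vocabulary size, dimensions and N=3 are assumptions:

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """One decoding step: attention weights over the encoder states give a
    context vector; the context and the embedded previous output feed an LSTM
    cell, and a softmax yields this step's character distribution."""
    def __init__(self, vocab=5000, emb=128, enc_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.attn_score = nn.Linear(enc_dim + hidden, 1)
        self.cell = nn.LSTMCell(emb + enc_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, prev_token, state, enc_outputs):
        h, c = state
        # score each encoder position against the current decoder hidden state
        scores = self.attn_score(torch.cat(
            [enc_outputs, h.unsqueeze(1).expand(-1, enc_outputs.size(1), -1)], -1
        )).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)
        context = (weights.unsqueeze(-1) * enc_outputs).sum(1)  # context vector c
        h, c = self.cell(torch.cat([self.embed(prev_token), context], -1), (h, c))
        probs = torch.softmax(self.out(h), dim=-1)  # character distribution
        topv, topi = probs.topk(3)                  # keep top-N per step (N=3 here)
        return (h, c), topv, topi
```

Repeating such a step and branching on the top-N characters kept at each time step yields the multiple candidate text sequences that form the candidate set.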
S150: filtering out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filtering out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification.
In the embodiment of the present invention, a text sequence in the candidate text sequence set may be inconsistent with the characters and/or words in the text to be added. For example, suppose one text sequence in the candidate set is "ABC,XX" while the text to be added is "ABBXX"; since "ABC,XX" is inconsistent with the characters/words of the text to be added, "ABC,XX" is filtered out of the candidate set.
In the embodiment of the present invention, the punctuation in a text sequence in the candidate text sequence set may fail to meet the punctuation mark specification, for example through consecutive commas, mismatched left and right parentheses, mismatched left and right title marks, or misused ellipses; the text sequences whose punctuation marks do not meet the punctuation mark specification are filtered out.
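A sketch of both filters, reusing the ABC/ABB example above; the punctuation inventory and the specific specification checks are illustrative assumptions:

```python
import re

def strip_punct(seq):
    # hypothetical: remove CJK and ASCII punctuation to recover the raw characters
    return re.sub(r'[，。！？；：、“”《》（）…,.!?;:()"]', "", seq)

def char_consistent(candidate, source):
    """Filter rule 1: the candidate must contain exactly the source characters."""
    return strip_punct(candidate) == source

def punct_valid(candidate):
    """Filter rule 2: a few illustrative punctuation-specification checks."""
    if re.search(r"[，,][，,]", candidate):              # consecutive commas
        return False
    if candidate.count("（") != candidate.count("）"):   # unbalanced parentheses
        return False
    if candidate.count("《") != candidate.count("》"):   # unbalanced title marks
        return False
    return True

candidates = ["ABC,XX", "ABB,XX", "ABB,,XX"]
source = "ABBXX"
kept = [c for c in candidates if char_consistent(c, source) and punct_valid(c)]
print(kept)  # ['ABB,XX']
```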
S160: outputting, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restoring the normalization on the output text sequence.
In the embodiment of the invention, among the remaining text sequences of the candidate text sequence set, each character of a text sequence has a probability predicted by the seq2seq model, and the text sequence with the highest joint probability is selected for output. The joint probability of a text sequence is determined from the probabilities of its characters; it may be, for example, the sum of the probabilities of the characters in the text sequence, the average of those probabilities, or the sum of each character's probability multiplied by a corresponding weight.
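For example, scoring candidates by the sum of per-character log-probabilities (a numerically stable variant of the aggregations listed above) could look like this; the probability values are made up for illustration:

```python
import math

def joint_log_prob(char_probs):
    """Score a candidate by the sum of per-character log-probabilities."""
    return sum(math.log(p) for p in char_probs)

# each remaining candidate carries the per-character probabilities from seq2seq
remaining = {
    "ABB，XX。": [0.9, 0.8, 0.85, 0.7, 0.9, 0.88, 0.75],
    "ABB。XX。": [0.9, 0.8, 0.85, 0.4, 0.9, 0.88, 0.75],
}
best = max(remaining, key=lambda s: joint_log_prob(remaining[s]))
print(best)  # ABB，XX。
```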
In the embodiment of the present invention, restoring the normalization on the output text sequence may specifically mean restoring tokens such as numbers in the output text. By filtering the text sequences output by the seq2seq model, unreasonable candidates and candidates violating the punctuation mark specification are removed, so the text sequence can be predicted accurately and the accuracy of punctuation addition is improved.
For the specific flow of the method for adding punctuation marks in text, reference may be made to fig. 1c. It should be noted that the method provided in the embodiment of the present invention may run as an online process.
In the related art, methods based on text sequence information judge the insertion position of a punctuation mark from context features, which can be done by sequence labeling; in some scenarios, however, multiple punctuation marks must be added after a single character. For example, in the sentence 老师安排学生默写《出师表》。 (the teacher asks the students to write "Chu Shi Biao" from memory), both the closing title mark 》 and the period 。 must be inserted after the last character 表 of the text. The method provided by the embodiment of the invention processes the text to be added, which has no punctuation marks, into feature vectors, inputs the feature vectors into a trained seq2seq model to obtain candidate text sequences with punctuation marks added, and obtains the final text sequence by filtering and screening. Because the seq2seq model takes a sequence as input and outputs a sequence, which is then filtered and screened, the problem of adding a plurality of punctuation marks after a character can be well solved, and the accuracy of punctuation addition is improved.
According to the technical scheme provided by the embodiment of the invention, word segmentation processing, part-of-speech recognition and normalization processing are performed on the text to be added, which has no punctuation marks; a character/word vector corresponding to each character in the text to be added is obtained; the character/word vector, the part-of-speech information and the word segmentation boundary information are concatenated to obtain a feature vector; the feature vector is input into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set; the candidate texts that do not meet the conditions are filtered out; the text sequence with the highest joint probability that meets the punctuation mark specification is output; and the normalization is restored on the output text sequence. In this way, the problem of adding a plurality of punctuation marks after a character can be well solved, and the accuracy of punctuation mark addition is improved.
Fig. 2a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention. On the basis of the above embodiment, this embodiment adds the process of training the seq2seq model; the training of the seq2seq model may be an offline process.
As shown in fig. 2a, the technical solution provided by the embodiment of the present invention includes:
S210: removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts; performing word segmentation processing and part-of-speech recognition on the training texts, and performing normalization processing on the set words of the processed training texts.
The normalization processing on the training texts may be normalization of numbers and similar tokens, for example replacing them with set marks, in order to increase the processing speed of the model.
The word segmentation processing and the part-of-speech recognition of the training texts are the same as in the above embodiment.
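As an illustration of S210, a training pair can be built by stripping the punctuation from an original text and keeping the original as the target; the punctuation inventory below is an assumption:

```python
import re

PUNCT = r'[，。！？；：、“”《》（）]'  # illustrative Chinese punctuation set

def make_training_pair(original):
    """Strip punctuation to get the model input; the original stays the target."""
    source = re.sub(PUNCT, "", original)
    return source, original

src, tgt = make_training_pair("今天天气很好，我们去公园吧。")
print(src)  # 今天天气很好我们去公园吧
print(tgt)  # 今天天气很好，我们去公园吧。
```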
S220: acquiring a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model.
S230: concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors.
In the embodiment of the invention, the semantic information of the current input can be obtained through the character vectors and/or word vectors together with the part-of-speech features. By introducing word segmentation and part-of-speech features, the boundary information of the current input character is also obtained, so knowledge of word boundaries can be learned during training, and the situation where a punctuation mark splits the content of a segmented word at prediction time is avoided.
S240: inputting the feature vectors into a seq2seq model, and training the seq2seq model to obtain the trained seq2seq model.
The seq2seq model can be based on a bidirectional long short-term memory (LSTM) model and an attention mechanism; such a model can exploit the contextual semantic information and improve the accuracy of punctuation addition. The seq2seq model may include an embedding layer, an encoder, an attention layer and a decoder; for the specific structure of the model, reference may be made to fig. 1b. The encoder includes a bidirectional long short-term memory unit (bidirectional LSTM). Because punctuation marks segment semantic groups, the insertion position of a punctuation mark must be judged comprehensively from both the preceding and the following semantic information. The encoder therefore uses a bidirectional LSTM that maintains two independent hidden states over the input vectors, one in forward order and one in reverse order, to capture past and future information respectively, and then combines the two hidden states as the final output.
The attention layer adopts a standard attention implementation. The encoding layer produces forward and backward hidden states, and the context vector c output by the attention layer is obtained by weighting the two hidden-state sequences output by the encoding layer with attention weights; the weights reflect the relation between the current hidden state and the context.
The decoder includes a bidirectional long short-term memory unit, which, based on the output result of the attention layer and the vectors output by the embedding layer, generates a probability distribution over characters at each time step through a softmax function, selects characters at each time step according to their probabilities, and forms at least one punctuation-added text sequence. The output result of the attention layer may be the context vector. Specifically, N characters (N being a positive integer) are selected by the decoding layer at each time step according to probability; at each time step there is at least one predicted character, each with a certain probability, and the characters whose probability is greater than a set probability can be selected as predicted characters. The characters selected at the respective time steps are combined into at least one punctuation-added text sequence. The text sequence output by the decoding layer is matched against the original text and the seq2seq model is adjusted accordingly; this completes the training of the seq2seq model and yields the trained seq2seq model. For the process of training the seq2seq model, reference may be made to fig. 2b. A seq2seq model trained in this way fully uses the contextual character vectors and word vectors for its judgments and fully considers the state transition relations between punctuation marks, and can achieve a better prediction effect.
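A sketch of one training epoch under stated assumptions: the seq2seq wrapper is hypothetical, it is assumed to return per-step character logits of shape (batch, target_length, vocab) under teacher forcing, and index 0 is assumed to be the padding id:

```python
import torch
import torch.nn as nn

def train_epoch(seq2seq, loader, optimizer):
    """One pass over the training pairs: match the decoder output against the
    original (punctuated) text and adjust the model by gradient descent."""
    criterion = nn.CrossEntropyLoss(ignore_index=0)  # 0 assumed to be padding
    seq2seq.train()
    for feats, target_ids in loader:          # feature vectors, original char ids
        optimizer.zero_grad()
        logits = seq2seq(feats, target_ids)   # teacher forcing on the original text
        loss = criterion(logits.transpose(1, 2), target_ids)
        loss.backward()
        optimizer.step()
    return seq2seq
```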
S250: acquiring a text to be added, which has no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added.
S260: acquiring a character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model.
S270: concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors.
S280: inputting the feature vectors into the trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set.
S290: filtering out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filtering out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification.
S291: outputting, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restoring the normalization on the output text sequence.
An embodiment of the present invention further provides a method for training a punctuation mark adding model for text (for the training flow, reference may be made to fig. 2b). The method may be performed by a training device for a punctuation mark adding model in text, and the device may be implemented by software and/or hardware.
The technical solution provided by this embodiment of the present invention includes:
S310: removing punctuation marks from a plurality of original texts with punctuation marks to obtain the training texts.
S320: performing word segmentation processing and part-of-speech recognition on the training texts, and performing normalization processing on the set words of the processed training texts.
S330: acquiring a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model.
S340: concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors.
S350: inputting the feature vectors into a seq2seq model, and training the seq2seq model to obtain the trained seq2seq model.
In this embodiment, for the descriptions of S310 to S350, reference may be made to the descriptions of S210 to S240 in the above embodiment.
The embodiment of the invention can obtain the semantic information of the current input through the character vectors and/or word vectors together with the part-of-speech features. By introducing word segmentation and part-of-speech features, the boundary information of the current input character is also obtained, so knowledge of word boundaries can be learned during training, and the situation where a punctuation mark splits the content of a segmented word at prediction time is avoided.
Fig. 3 is a block diagram of the structure of a device for adding punctuation marks in text according to an embodiment of the present invention. As shown in fig. 3, the device includes: a processing module 310, a character/word vector obtaining module 320, a concatenation module 330, a candidate text sequence set obtaining module 340, a filtering module 350, and an output/restoration module 360.
The processing module 310 is configured to acquire a text to be added, which has no punctuation marks, perform word segmentation processing and part-of-speech recognition on the text to be added, and perform normalization processing on set words in the processed text to be added;
the character/word vector obtaining module 320 is configured to obtain, through a pre-trained character/word vector model, a character/word vector corresponding to each character in the processed text to be added;
the concatenation module 330 is configured to concatenate the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
the candidate text sequence set obtaining module 340 is configured to input the feature vectors into the trained seq2seq model, obtain a plurality of candidate text sequences with punctuation marks added, and form a candidate text sequence set;
the filtering module 350 is configured to filter out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filter out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification;
and the output/restoration module 360 is configured to output, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restore the normalization on the output text sequence.
Optionally, the seq2seq model is a seq2seq model based on a bidirectional long short-term memory model and an attention mechanism.
Optionally, the seq2seq model includes an embedding layer, an encoder, an attention layer and a decoder;
the encoder includes a bidirectional long short-term memory unit; the attention layer includes a long short-term memory unit;
the decoder includes a bidirectional long short-term memory unit, which is used to generate a probability distribution over characters at each time step through a softmax function based on the output result of the attention layer and the output result of the embedding layer, to select characters at each time step according to their probabilities, and to form at least one punctuation-added text sequence.
Optionally, the apparatus further includes a training module, configured to:
remove punctuation marks from a plurality of original texts with punctuation marks to obtain training texts; perform word segmentation processing and part-of-speech recognition on the training texts, and perform normalization processing on the set words of the processed training texts;
acquire a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model;
concatenate the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and input the feature vectors into a seq2seq model, and train the seq2seq model to obtain the trained seq2seq model.
Optionally, the set words include numbers.
The device can execute the method provided by any embodiment of the invention, and has functional modules and beneficial effects corresponding to the executed method.
Fig. 4 is a block diagram of the structure of a training apparatus for a punctuation mark adding model in text according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes: a training text obtaining module 410, a processing module 420, a character/word vector obtaining module 430, a concatenation module 440, and a training module 450.
The training text obtaining module 410 is configured to remove punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
the processing module 420 is configured to perform word segmentation processing and part-of-speech recognition on the training texts, and perform normalization processing on the set words of the processed training texts;
the character/word vector obtaining module 430 is configured to obtain, through a pre-trained character/word vector model, a character/word vector corresponding to each character in the processed training texts;
the concatenation module 440 is configured to concatenate the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and the training module 450 is configured to input the feature vectors into a seq2seq model, and train the seq2seq model to obtain the trained seq2seq model.
The device can execute the method provided by any embodiment of the invention, and has functional modules and beneficial effects corresponding to the executed method.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 5, the electronic device includes:
one or more processors 510, one processor 510 being illustrated in FIG. 5;
a memory 520;
the apparatus may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530 and the output device 540 of the apparatus may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.
The memory 520, as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for adding punctuation marks in text in the embodiment of the present invention (for example, the processing module 310, the character/word vector obtaining module 320, the concatenation module 330, the candidate text sequence set obtaining module 340, the filtering module 350 and the output/restoration module 360 shown in fig. 3), or the program instructions/modules corresponding to the method for training a punctuation mark adding model in text in the embodiment of the present invention (for example, the training text obtaining module 410, the processing module 420, the character/word vector obtaining module 430, the concatenation module 440 and the training module 450 shown in fig. 4).
The processor 510 executes the various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 520, thereby implementing the method for adding punctuation marks in text of the above method embodiment, namely:
acquiring a text to be added, which has no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added;
acquiring a character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set;
filtering out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filtering out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification;
and outputting, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restoring the normalization on the output text sequence.
Or, implementing the method for training the punctuation mark adding model in the text provided by the embodiment of the invention, namely:
removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
performing word segmentation processing and part-of-speech recognition on the training texts, and performing normalization processing on the set words of the processed training texts;
acquiring a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and inputting the feature vectors into a seq2seq model, and training the seq2seq model to obtain the trained seq2seq model.
The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 520 may optionally include memory located remotely from processor 510, which may be connected to a terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 540 may include a display device such as a display screen.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for adding punctuation marks in text, namely:
acquiring a text to be added, which has no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added;
acquiring a character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set;
filtering out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filtering out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification;
and outputting, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restoring the normalization on the output text sequence.
Or, implementing the method for training the punctuation mark adding model in the text provided by the embodiment of the invention, namely:
removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
performing word segmentation processing and part-of-speech recognition on the training texts, and performing normalization processing on the set words of the processed training texts;
acquiring a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and inputting the feature vectors into a seq2seq model, and training the seq2seq model to obtain the trained seq2seq model.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for adding punctuation marks in text, characterized by comprising:
acquiring a text to be added, which has no punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added;
acquiring a character/word vector corresponding to each character in the processed text to be added through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate text sequences with punctuation marks added, forming a candidate text sequence set;
filtering out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filtering out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification;
and outputting, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restoring the normalization on the output text sequence.
2. The method of claim 1, wherein the seq2seq model is a seq2seq model based on a bidirectional long short-term memory model and an attention mechanism.
3. The method of claim 1 or 2, wherein the seq2seq model comprises an embedding layer, an encoder, an attention layer and a decoder;
the encoder comprises a bidirectional long short-term memory unit; the attention layer comprises a long short-term memory unit;
the decoder comprises a bidirectional long short-term memory unit, which is used to generate a probability distribution over characters at each time step through a softmax function based on the output result of the attention layer and the output result of the embedding layer, to select characters at each time step according to their probabilities, and to form at least one punctuation-added text sequence.
4. The method of claim 1, further comprising:
removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts; performing word segmentation processing and part-of-speech recognition on the training texts, and performing normalization processing on the set words of the processed training texts;
acquiring a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and inputting the feature vectors into a seq2seq model, and training the seq2seq model to obtain the trained seq2seq model.
5. The method of claim 1, wherein the set words comprise numbers.
6. A method for training a punctuation mark adding model for text, characterized by comprising:
removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
performing word segmentation processing and part-of-speech recognition on the training texts, and performing normalization processing on the set words of the processed training texts;
acquiring a character/word vector corresponding to each character in the processed training texts through a pre-trained character/word vector model;
concatenating the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training texts to obtain feature vectors;
and inputting the feature vectors into a seq2seq model, and training the seq2seq model to obtain the trained seq2seq model.
7. A device for adding punctuation marks in text, characterized by comprising:
a processing module, configured to acquire a text to be added, which has no punctuation marks, perform word segmentation processing and part-of-speech recognition on the text to be added, and perform normalization processing on set words in the processed text to be added;
a character/word vector obtaining module, configured to obtain, through a pre-trained character/word vector model, a character/word vector corresponding to each character in the processed text to be added;
a concatenation module, configured to concatenate the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the text to be added to obtain feature vectors;
a candidate text sequence set obtaining module, configured to input the feature vectors into a trained seq2seq model, obtain a plurality of candidate text sequences with punctuation marks added, and form a candidate text sequence set;
a filtering module, configured to filter out the text sequences in the candidate text sequence set that are inconsistent with the characters/words in the text to be added, and filter out the text sequences in the candidate text sequence set that do not meet the punctuation mark specification;
and an output/restoration module, configured to output, from the remaining text sequences of the candidate text sequence set, the text sequence that has the highest joint probability and meets the punctuation mark specification, and restore the normalization on the output text sequence.
8. A device for training a punctuation mark adding model in a text is characterized by comprising:
the training text obtaining module is used for removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
the processing module is used for performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on set words in the processed training text;
the character/word vector obtaining module is used for obtaining a character/word vector corresponding to each character in the processed training text through a pre-trained character/word vector model;
the splicing module is used for splicing the part-of-speech information, the word segmentation boundary information and the character/word vector corresponding to each character in the training text to obtain a feature vector;
and the training module is used for inputting the feature vector into a seq2seq model and training the seq2seq model to obtain the trained seq2seq model.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for adding punctuation marks in text according to any one of claims 1-5, or the method for training a punctuation mark adding model according to claim 6.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for adding punctuation marks in text as claimed in any one of claims 1 to 5, or the method for training a punctuation mark adding model as claimed in claim 6.
CN201911182421.XA 2019-11-27 2019-11-27 Method and device for adding punctuation marks in text and method and device for training model, and electronic equipment Active CN111027291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911182421.XA CN111027291B (en) 2019-11-27 2019-11-27 Method and device for adding punctuation marks in text and method and device for training model, and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911182421.XA CN111027291B (en) 2019-11-27 2019-11-27 Method and device for adding punctuation marks in text and method and device for training model, and electronic equipment

Publications (2)

Publication Number Publication Date
CN111027291A true CN111027291A (en) 2020-04-17
CN111027291B CN111027291B (en) 2024-03-26

Family

ID=70207202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911182421.XA Active CN111027291B (en) 2019-11-27 2019-11-27 Method and device for adding punctuation marks in text and method and device for training model, and electronic equipment

Country Status (1)

Country Link
CN (1) CN111027291B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767870A (en) * 2017-09-29 2018-03-06 百度在线网络技术(北京)有限公司 Adding method, device and the computer equipment of punctuation mark
WO2019174422A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Method for analyzing entity association relationship, and related apparatus
CN108932226A (en) * 2018-05-29 2018-12-04 华东师范大学 A kind of pair of method without punctuate text addition punctuation mark
CN109918666A (en) * 2019-03-06 2019-06-21 北京工商大学 A kind of Chinese punctuation mark adding method neural network based

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
任智慧; 徐浩煜; 封松林; 周晗; 施俊: "A sequence-labeling Chinese word segmentation method based on LSTM networks" (基于LSTM网络的序列标注中文分词法) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950256A (en) * 2020-06-23 2020-11-17 北京百度网讯科技有限公司 Sentence break processing method and device, electronic equipment and computer storage medium
CN111753524A (en) * 2020-07-01 2020-10-09 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN112001167A (en) * 2020-08-26 2020-11-27 四川云从天府人工智能科技有限公司 Punctuation mark adding method, system, equipment and medium
CN112199927A (en) * 2020-10-19 2021-01-08 古联(北京)数字传媒科技有限公司 Ancient book punctuation filling method and device
WO2021213155A1 (en) * 2020-11-25 2021-10-28 平安科技(深圳)有限公司 Method, apparatus, medium, and electronic device for adding punctuation to text
CN113609850A (en) * 2021-07-02 2021-11-05 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium
CN113609850B (en) * 2021-07-02 2024-05-17 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111027291B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN111027291B (en) Method and device for adding punctuation marks in text and method and device for training model, and electronic equipment
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
CN108922564B (en) Emotion recognition method and device, computer equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113469298B (en) Model training method and resource recommendation method
CN116884391B (en) Multimode fusion audio generation method and device based on diffusion model
CN112883967B (en) Image character recognition method, device, medium and electronic equipment
CN115587598A (en) Multi-turn dialogue rewriting method, equipment and medium
CN116434752A (en) Speech recognition error correction method and device
CN113160820B (en) Speech recognition method, training method, device and equipment of speech recognition model
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN112036122B (en) Text recognition method, electronic device and computer readable medium
CN113032538A (en) Topic transfer method based on knowledge graph, controller and storage medium
CN117093864A (en) Text generation model training method and device
CN112836476B (en) Summary generation method, device, equipment and medium
CN115527520A (en) Anomaly detection method, device, electronic equipment and computer readable storage medium
CN115496734A (en) Quality evaluation method of video content, network training method and device
CN112002325B (en) Multi-language voice interaction method and device
CN115346520A (en) Method, apparatus, electronic device and medium for speech recognition
CN114297409A (en) Model training method, information extraction method and device, electronic device and medium
CN114358019A (en) Method and system for training intention prediction model
CN114239601A (en) Statement processing method and device and electronic equipment
US20240127812A1 (en) Method and system for auto-correction of an ongoing speech command
CN113392645B (en) Prosodic phrase boundary prediction method and device, electronic equipment and storage medium
CN115563962A (en) Method, device and related equipment for detecting wrongly written text characters in parallel voice data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 501, 502, 503, No. 66 Boxia Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203
Applicant after: Daguan Data Co.,Ltd.
Country or region after: China
Address before: Room 301, 303 and 304, block B, 112 liangxiu Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203
Applicant before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co.,Ltd.
Country or region before: China
GR01 Patent grant