CN111027291B - Method and device for adding punctuation marks in text, method and device for training a model, and electronic device - Google Patents


Info

Publication number
CN111027291B
CN111027291B (application CN201911182421.XA)
Authority
CN
China
Prior art keywords
text
word
training
punctuation
added
Prior art date
Legal status
Active
Application number
CN201911182421.XA
Other languages
Chinese (zh)
Other versions
CN111027291A (en)
Inventor
张健
陈运文
房悦竹
赵朗琦
刘书龙
纪达麒
Current Assignee
Daguan Data Co ltd
Original Assignee
Daguan Data Co ltd
Priority date
Filing date
Publication date
Application filed by Daguan Data Co ltd
Priority to CN201911182421.XA
Publication of CN111027291A
Application granted
Publication of CN111027291B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The embodiment of the invention provides a method and device for adding punctuation marks in text, a method and device for training the model, and an electronic device. The method comprises the following steps: performing word segmentation and part-of-speech recognition on the text to be added, applying normalization, and determining character/word vectors; concatenating part-of-speech information, word-segmentation boundary information and the character/word vectors to obtain feature vectors; inputting the feature vectors into a trained seq2seq model to obtain a plurality of candidate punctuated text sequences, which form a candidate text sequence set; filtering out the text sequences in the set that do not satisfy the conditions; and, among the remaining sequences, outputting the one with the highest joint probability that conforms to punctuation conventions, then reversing the normalization on the output. This solves the problem of adding multiple punctuation marks after a single character and improves the accuracy of punctuation insertion.

Description

Method and device for adding punctuation marks in text, method and device for training a model, and electronic device
Technical Field
The embodiment of the invention relates to language processing technology, and in particular to a method and device for adding punctuation marks in text, a method and device for training a model, and an electronic device.
Background
With the rapid development of society and high technology, natural language processing applications such as smart-home control, automatic question answering and voice assistants are receiving increasing attention. Because spoken dialogue carries no punctuation, sentence boundaries and canonical language structure cannot be distinguished, which makes punctuation prediction an important natural language processing task. In an intelligent phone customer-service scenario, the user's speech is converted by speech recognition into raw text without punctuation or sentence breaks; such text cannot be used directly, so punctuation marks must be predicted and added.
In the related art, two main classes of solutions have been developed for automatically adding punctuation in this scenario: the first judges punctuation positions from speech pause durations based on acoustic information; the second judges insertion positions from the text sequence itself. The former is limited because it relies solely on pause duration, so it handles poorly long utterances spoken quickly or utterances with mid-sentence pauses. The latter, text-based approach can judge insertion positions from contextual features. In some scenarios, however, multiple punctuation marks must be added after a single character, and current methods do not address this need.
Disclosure of Invention
The embodiment of the invention provides a method and device for adding punctuation marks in text, a model training method and device, and an electronic device, which can solve the problem of adding multiple punctuation marks after a single character and improve the accuracy of punctuation insertion.
In a first aspect, an embodiment of the present invention provides a method for adding punctuation marks in text, including:
acquiring a text to be added without punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added;
acquiring a word/word vector corresponding to each character in the processed text to be added through a pre-trained word/word vector model;
splicing part-of-speech information, word segmentation boundary information and word/word vectors corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vector into a trained seq2seq model to obtain a plurality of candidate text sequences added with punctuation, and forming a candidate text sequence set;
filtering text sequences which are inconsistent with characters/words in the text to be added in the candidate text sequence set, and filtering text sequences which do not accord with punctuation standards in the candidate text sequence set;
And outputting the text sequence which has the highest joint probability and accords with the punctuation mark specification from the rest text sequences in the candidate text sequence set, and carrying out normalized restoration operation on the output text sequence.
In a second aspect, the embodiment of the present invention further provides a training method for adding punctuation marks to a text, including:
removing punctuation marks from a plurality of original texts with the punctuation marks to obtain training texts;
performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the processed set words of the training text;
acquiring a word/word vector corresponding to each character in the processed training text through a pre-training word/word vector model;
splicing part-of-speech information, word segmentation boundary information and word/word vectors corresponding to each character in the training text to obtain feature vectors;
inputting the feature vector into a seq2seq model, and training the seq2seq model to obtain a trained seq2seq model.
In a third aspect, an embodiment of the present invention further provides a punctuation mark adding device in a text, including:
the processing module is used for acquiring a text to be added without punctuation marks, performing word segmentation processing and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added;
The word/word vector obtaining module is used for obtaining the word/word vector corresponding to each character in the processed text to be added through a pre-trained word/word vector model;
the splicing module is used for splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the text to be added to obtain a feature vector;
the candidate text sequence set obtaining module is used for inputting the feature vector into the trained seq2seq model to obtain a plurality of candidate text sequences added with punctuation, and forming a candidate text sequence set;
the filtering module is used for filtering the text sequences which are inconsistent with the characters/words in the text to be added in the candidate text sequence set, and filtering the text sequences which are not consistent with punctuation standards in the candidate text sequence set;
and the output/restoration module is used for outputting the text sequence which has the highest joint probability and accords with the punctuation mark specification from the rest text sequences in the candidate text sequence set, and carrying out normalization restoration operation on the output text sequence.
In a fourth aspect, an embodiment of the present invention provides a training device for adding punctuation marks to a text, including:
The training text obtaining module is used for removing punctuation marks from a plurality of original texts with punctuation marks to obtain training texts;
the processing module is used for carrying out word segmentation processing and part-of-speech recognition on the training text and carrying out normalization processing on the processed set words of the training text;
the word/word vector obtaining module is used for obtaining the word/word vector corresponding to each character in the processed training text through a pre-trained word/word vector model;
the splicing module is used for splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the training text to obtain a feature vector;
the training module is used for inputting the feature vector into the seq2seq model, and training the seq2seq model to obtain a trained seq2seq model.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
storage means for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement a method for adding punctuation marks in text provided by the embodiment of the present invention, or a method for training a model for adding punctuation marks in text provided by the embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a computer program is stored, where the program when executed by a processor implements a method for adding punctuation marks in text provided by the embodiment of the present invention, or a method for training a model for adding punctuation marks in text provided by the embodiment of the present invention.
According to the technical scheme provided by the embodiment of the invention, word segmentation, part-of-speech recognition and normalization are performed on the text to be added, which contains no punctuation marks; a character/word vector is obtained for each character; the character/word vectors are concatenated with part-of-speech information and word-segmentation boundary information to obtain feature vectors; the feature vectors are input into a trained seq2seq model to obtain a plurality of candidate punctuated text sequences, forming a candidate text sequence set; candidates that do not satisfy the conditions are filtered out; among the remaining sequences, the one with the highest joint probability that conforms to punctuation conventions is output, and the normalization is then reversed on the output. This solves the problem of adding multiple punctuation marks after a character and improves the accuracy of punctuation insertion.
Drawings
FIG. 1a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 1b is a schematic diagram of the structure of the seq2seq model provided by an embodiment of the present invention;
FIG. 1c is a flow chart of a method for training a seq2seq model provided by an embodiment of the present invention;
FIG. 2a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 2b is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 3 is a block diagram of a device for adding punctuation marks in text according to an embodiment of the present invention;
FIG. 4 is a block diagram of a training device for adding punctuation marks in text according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1a is a flowchart of a method for adding punctuation marks in text. The method may be performed by a punctuation-adding device, which may be implemented in software and/or hardware and configured in an electronic device such as a terminal or a server. The method applies to scenarios where punctuation marks must be added to text without them, and optionally to scenarios where several punctuation marks must be added after a single character.
As shown in fig. 1a, the technical solution provided by the embodiment of the present invention includes:
s110: obtaining a text to be added without punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added.
In the embodiment of the invention, the text to be added contains no punctuation marks. It may be text converted from speech, for example a transcribed customer-service dialogue or a voice-assistant utterance, or it may be any other text without punctuation marks.
In the embodiment of the invention, the set words may be tokens such as numbers, and normalization can be applied to them. Part-of-speech recognition analyzes the attribute of each character/word in the text to be added, and the part-of-speech information obtained can be tagged.
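As a minimal sketch of this normalization (the single-character placeholder and the helper names are illustrative assumptions, not the patent's implementation), digit runs can be replaced and later restored as follows:

```python
import re

NUM_TOKEN = "N"  # single-character placeholder keeps character positions aligned

def normalize_numbers(text):
    """Replace each run of digits with a placeholder and record the originals
    so the model output can be restored afterwards."""
    originals = re.findall(r"\d+", text)
    return re.sub(r"\d+", NUM_TOKEN, text), originals

def restore_numbers(text, originals):
    """Inverse of normalize_numbers: put the recorded digit runs back in order.
    Assumes the placeholder character does not otherwise occur in the text."""
    it = iter(originals)
    return re.sub(NUM_TOKEN, lambda m: next(it), text)

normalized, nums = normalize_numbers("train 123 leaves at 0930")
restored = restore_numbers(normalized, nums)
```

The restoration step corresponds to the normalization-reversing operation applied to the final output in S160.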
S120: and acquiring the word/word vector corresponding to each character in the processed text to be added through a pre-trained word/word vector model.
In the embodiment of the invention, the normalized text to be added can be input into a pre-trained character/word vector model to obtain the character/word vector corresponding to each character in the text. Any pre-trained character/word vector model from the related art may be used here.
S130: and splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the text to be added to obtain a feature vector.
In the embodiment of the invention, the concatenation order of the part-of-speech information, the word segmentation boundary information and the character/word vectors is not limited. The part-of-speech information indicates, for each character, the noun, verb or other part of speech it belongs to. The word segmentation boundary information indicates, for example, whether a character is the first or the last character of a word.
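A sketch of the concatenation, assuming a 4-dimensional character vector, a one-hot part-of-speech tag and a one-hot B/M/E/S boundary tag (the dimensions and tag sets are illustrative, not fixed by the patent):

```python
POS_TAGS = ["noun", "verb", "other"]   # illustrative part-of-speech inventory
BOUNDARIES = ["B", "M", "E", "S"]      # begin / middle / end of word, or single

def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]

def build_feature(char_vector, pos, boundary):
    """Concatenate the character/word vector with one-hot part-of-speech and
    word-boundary encodings into a single feature vector."""
    return list(char_vector) + one_hot(pos, POS_TAGS) + one_hot(boundary, BOUNDARIES)

feature = build_feature([0.1, -0.2, 0.3, 0.0], "noun", "B")  # length 4 + 3 + 4 = 11
```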
S140: and inputting the feature vector into a trained seq2seq model to obtain a plurality of candidate text sequences added with punctuation, and forming a candidate text sequence set.
In an embodiment of the present invention, the feature vectors are input into a trained sequence-to-sequence (seq2seq) model.
The seq2seq model may be based on a bidirectional long short-term memory (LSTM) model and an attention mechanism; using such a model extracts the semantic information around the insertion point and improves the accuracy of punctuation insertion. The model may include an embedding layer, an encoder, an attention layer and a decoder; its specific structure is shown in fig. 1b. The encoder contains a bidirectional LSTM unit. Because punctuation segments semantic groups, its insertion position must be judged from both the preceding and the following context; the encoder therefore processes the input vector in forward and reverse order, producing two independent hidden states that capture past and future information respectively, and then combines the two hidden states as its final output.
The attention layer uses a standard attention implementation. The encoding layer produces forward and backward hidden states; the context vector c output by the attention layer is obtained from the attention weights and these two hidden states, and the weights reflect the relation between the current hidden state and the context.
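The context vector can be illustrated with dot-product attention over the concatenated forward/backward encoder states; this is only a sketch, since the patent does not fix the attention scoring function:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """decoder_state: shape (2h,); encoder_states: shape (T, 2h), each row the
    concatenated forward/backward hidden state at one input position.
    Returns the context vector c: the attention-weighted sum of encoder states."""
    scores = encoder_states @ decoder_state       # alignment scores, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax -> attention weights
    return weights @ encoder_states               # context vector, shape (2h,)

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy encoder states, T=3, 2h=2
c = attention_context(np.array([1.0, 0.0]), H)
```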
The decoder also contains a bidirectional long short-term memory unit. Based on the output of the attention layer (which may be the context vector) and the vector output by the embedding layer, it generates a probability distribution over characters at each time step through a softmax function, selects characters according to their probabilities, and forms at least one punctuated text sequence. Specifically, the decoding layer selects N characters at each time step according to their probabilities, where N is a positive integer. At each step there is at least one predicted character, each with a certain probability; the characters whose probability exceeds a set threshold can be selected as predictions. Combining the characters selected at each step yields at least one punctuated text sequence, giving at least one candidate text sequence and thus forming the candidate text sequence set.
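The per-step selection and combination can be sketched as follows; the probability threshold is an assumed hyperparameter, and the per-step distributions are toy inputs:

```python
from itertools import product

def candidate_sequences(step_distributions, threshold=0.2):
    """step_distributions: one dict per time step mapping candidate characters
    (including punctuation) to probabilities.  Keep every character whose
    probability exceeds the threshold, then combine one choice per step into
    the set of candidate text sequences."""
    per_step = [[ch for ch, p in dist.items() if p > threshold]
                for dist in step_distributions]
    return {"".join(choice) for choice in product(*per_step)}

steps = [{"a": 0.9}, {",": 0.5, ".": 0.4, "!": 0.1}, {"b": 0.95}]
candidates = candidate_sequences(steps)  # {"a,b", "a.b"}
```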
S150: and filtering the text sequences which are inconsistent with the characters/words in the text to be added in the candidate text sequence set, and filtering the text sequences which do not accord with punctuation standards in the candidate text sequence set.
In the embodiment of the invention, a text sequence in the candidate set may be inconsistent with the characters and/or words of the text to be added. For example, if one candidate sequence is "ABC, XX" while the text to be added is "ABBXX", the candidate "ABC, XX" does not match the characters/words of the source text and is filtered out of the candidate set.
In the embodiment of the invention, a text sequence in the candidate set may contain punctuation that violates punctuation conventions, for example consecutive commas, unmatched left and right brackets, unmatched quotation or title marks, or misused ellipses; such text sequences are filtered out.
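Two of the conventions mentioned above, consecutive commas and unmatched brackets, can be checked as in the sketch below; a full filter would also cover quotation marks, title marks and ellipses:

```python
def conforms_to_spec(sequence):
    """Return False for sequences with consecutive commas or with
    unbalanced round brackets; True otherwise."""
    if ",," in sequence:
        return False
    depth = 0
    for ch in sequence:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing bracket with no opener
                return False
    return depth == 0              # every opener must be closed

ok = conforms_to_spec("ab,(cd)")   # True
bad = conforms_to_spec("ab,,cd")   # False
```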
S160: and outputting the text sequence which has the highest joint probability and accords with the punctuation mark specification from the rest text sequences in the candidate text sequence set, and carrying out normalized restoration operation on the output text sequence.
In the embodiment of the invention, each character of every remaining text sequence in the candidate set has a probability predicted by the seq2seq model, and the sequence with the highest joint probability is selected for output. The joint probability is a sequence-level score determined from the per-character probabilities, for example their sum, their average, or the sum of each character's probability multiplied by a corresponding weight.
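Taking the sum of log-probabilities (equivalent to ranking by the product of per-character probabilities, one plausible instance of the combination rules mentioned above), the selection can be sketched as:

```python
import math

def joint_log_prob(char_probs):
    """Sequence score: sum of per-character log-probabilities."""
    return sum(math.log(p) for p in char_probs)

def best_sequence(candidates):
    """candidates: list of (sequence, per-character probabilities) pairs.
    Returns the sequence with the highest joint probability."""
    return max(candidates, key=lambda c: joint_log_prob(c[1]))[0]

best = best_sequence([("ab,c", [0.9, 0.8, 0.6, 0.9]),
                      ("ab.c", [0.9, 0.8, 0.7, 0.9])])  # "ab.c"
```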
In the embodiment of the invention, the restoration operation reverses the normalization of the output text sequence, for example restoring the numbers in the output text. Filtering the sequences output by the seq2seq model removes unreasonable candidates and candidates that violate punctuation conventions, so the text sequence can be predicted accurately and the accuracy of punctuation insertion improves.
The specific procedure for adding punctuation marks may be seen in fig. 1c. It should be noted that the method provided in this embodiment may run as an online process.
In the related art, both the method that judges punctuation insertion positions from text information and the method that judges insertion positions from contextual features of the text sequence can be implemented by sequence labeling, but in some scenarios several punctuation marks must be added after one character. For example, consider a sentence whose final words are a quoted title (in the original example, students are asked to finish writing a classical text from memory): the last character is followed by several punctuation marks, namely a closing title mark and a period. The method provided by the embodiment of the invention processes the text to be added, which has no punctuation, into feature vectors, inputs the feature vectors into a trained seq2seq model to obtain candidate punctuated text sequences, and filters and screens them to obtain the final output; this solves the problem of adding several punctuation marks after a character and improves the accuracy of punctuation insertion.
According to this technical scheme, the unpunctuated text to be added undergoes word segmentation, part-of-speech recognition and normalization; a character/word vector is obtained for each character and concatenated with part-of-speech and word-segmentation boundary information into feature vectors; the feature vectors pass through a trained seq2seq model to yield a candidate set of punctuated text sequences; candidates that do not satisfy the conditions are filtered out; the remaining sequence with the highest joint probability that conforms to punctuation conventions is output, and the normalization is reversed on it. This solves the problem of adding multiple punctuation marks after a character and improves the accuracy of punctuation insertion.
Fig. 2a is a flowchart of a method for adding punctuation marks in text according to an embodiment of the present invention; compared with the previous embodiment, a process for training the seq2seq model is added. The training of the seq2seq model may run as an offline process.
As shown in fig. 2a, the technical solution provided by the embodiment of the present invention includes:
S210: removing punctuation marks from a plurality of original texts with the punctuation marks to obtain training texts; and performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on the processed set words of the training text.
The normalization of the training text may normalize the numbers in it, for example replacing them with a set identifier, to increase the processing speed of the model.
The word segmentation and part-of-speech recognition applied to the training text are the same as the corresponding processing described in the embodiment above.
S220: and acquiring the word/word vector corresponding to each character in the processed training text through a pre-trained word/word vector model.
S230: and splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the training text to obtain a feature vector.
In the embodiment of the invention, the semantic information of the current input can be obtained through the character vectors and/or word vectors together with the part-of-speech features. Introducing word segmentation and part-of-speech features also provides boundary information for the current input character, so knowledge of word boundaries can be learned during training, which prevents the prediction stage from inserting punctuation that splits a segmented word.
S240: inputting the feature vector into a seq2seq model, and training the seq2seq model to obtain a trained seq2seq model.
The seq2seq model may be based on a bidirectional long short-term memory (LSTM) model and an attention mechanism; using such a model extracts the semantic information around the insertion point and improves the accuracy of punctuation insertion. The model may include an embedding layer, an encoder, an attention layer and a decoder; its specific structure is shown in fig. 1b. The encoder contains a bidirectional LSTM unit. Because punctuation segments semantic groups, its insertion position must be judged from both the preceding and the following context; the encoder therefore processes the input vector in forward and reverse order, producing two independent hidden states that capture past and future information respectively, and then combines the two hidden states as its final output.
The attention layer uses a standard attention implementation. The encoding layer produces forward and backward hidden states; the context vector c output by the attention layer is obtained from the attention weights and these two hidden states, and the weights reflect the relation between the current hidden state and the context.
The decoder also contains a bidirectional long short-term memory unit. Based on the output of the attention layer (which may be the context vector) and the vector output by the embedding layer, it generates a probability distribution over characters at each time step through a softmax function, selects characters according to their probabilities, and forms at least one punctuated text sequence. Specifically, the decoding layer selects N characters at each time step according to their probabilities, where N is a positive integer; the characters whose probability exceeds a set threshold can be selected as predictions, and combining the characters selected at each step yields at least one punctuated text sequence. The text sequences output by the decoding layer are then matched against the original text and the seq2seq model is adjusted accordingly; completing this training yields the trained seq2seq model. The specific training process can be seen in fig. 2b. A model trained in this way fully exploits the contextual character and word vectors and the state-transition relations between punctuation marks, and can achieve a better prediction result.
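Matching the decoder output against the original punctuated text and adjusting the model corresponds to minimizing a per-step cross-entropy loss; a sketch assuming the decoder emits softmax distributions over the character vocabulary (the toy vocabulary size and values are illustrative):

```python
import numpy as np

def cross_entropy(pred_dists, target_ids):
    """pred_dists: shape (T, V), each row a softmax distribution over the
    character vocabulary; target_ids: the T gold character indices taken
    from the original, punctuated training text.  Returns the mean negative
    log-likelihood that a training step would minimize."""
    picked = pred_dists[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked).mean())

pred = np.array([[0.7, 0.2, 0.1],    # step 1: gold character has index 0
                 [0.1, 0.8, 0.1]])   # step 2: gold character has index 1
loss = cross_entropy(pred, [0, 1])
```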
S250: obtaining a text to be added without punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and performing normalization processing on the set words in the processed text to be added.
S260: and acquiring the word/word vector corresponding to each character in the processed text to be added through a pre-trained word/word vector model.
S270: and splicing the part-of-speech information, the word segmentation boundary information and the word/word vector corresponding to each character in the text to be added to obtain a feature vector.
S280: and inputting the feature vector into a trained seq2seq model to obtain a plurality of candidate text sequences added with punctuation, and forming a candidate text sequence set.
S290: and filtering the text sequences which are inconsistent with the characters/words in the text to be added in the candidate text sequence set, and filtering the text sequences which do not accord with punctuation standards in the candidate text sequence set.
S291: and outputting the text sequence which has the highest joint probability and accords with the punctuation mark specification from the rest text sequences in the candidate text sequence set, and carrying out normalized restoration operation on the output text sequence.
Fig. 3 is a flowchart of a method for training a model for adding punctuation marks in text. The method may be performed by a training device for such a model, and the device may be implemented in software and/or hardware.
As shown in fig. 3, the technical solution provided by the embodiment of the present invention includes:
S310: Removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text.
S320: Performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on set words in the processed training text.
S330: Acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text.
S340: Splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors.
S350: Inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
In this embodiment, for the descriptions of S310-S350, reference may be made to the descriptions of S210-S250 in the above embodiments.
According to the embodiment of the invention, semantic information about the current input can be obtained through the character vector and/or word vector together with the part-of-speech feature. By introducing features such as word segmentation and part of speech, boundary information of the current input character is also obtained, so that knowledge of word boundaries can be learned during training, and the situation in which a punctuation mark splits the content of a segmented word is avoided in the prediction stage.
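The per-character feature splicing described above can be sketched as a simple concatenation of the character embedding with one-hot part-of-speech and word-boundary features. The tag sets, dimensions, and names below are toy assumptions, not the patent's actual feature encoding:

```python
import numpy as np

POS_TAGS = ["n", "v", "adj", "num"]   # assumed part-of-speech tag set
BOUNDARY = ["B", "M", "E", "S"]       # position of the character in its word

def one_hot(value, table):
    """One-hot encode a categorical feature against a fixed table."""
    vec = np.zeros(len(table))
    vec[table.index(value)] = 1.0
    return vec

def char_feature(char_vec, pos_tag, boundary_tag):
    """Splice the character vector with one-hot part-of-speech and
    word-segmentation-boundary features into one feature vector."""
    return np.concatenate([char_vec,
                           one_hot(pos_tag, POS_TAGS),
                           one_hot(boundary_tag, BOUNDARY)])

emb = np.array([0.1, -0.2, 0.3])      # toy 3-dimensional character vector
feat = char_feature(emb, "n", "B")    # word-initial character of a noun
print(feat.shape[0])  # 3 + 4 + 4 = 11
```

The "B" boundary tag here is what lets the model learn not to insert a punctuation mark between "B" and the following "M"/"E" characters of the same word.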
Fig. 3 is a structural block diagram of a punctuation mark adding device in text according to an embodiment of the present invention. As shown in fig. 3, the device includes: a processing module 310, a character/word vector obtaining module 320, a splicing module 330, a candidate text sequence set obtaining module 340, a filtering module 350, and an output/restoration module 360.
The processing module 310 is configured to acquire a text to be added that contains no punctuation marks, perform word segmentation processing and part-of-speech recognition on the text to be added, and perform normalization processing on set words in the processed text to be added;
the character/word vector obtaining module 320 is configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
the splicing module 330 is configured to splice the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the text to be added to obtain feature vectors;
the candidate text sequence set obtaining module 340 is configured to input the feature vectors into a trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
the filtering module 350 is configured to filter out, from the candidate text sequence set, text sequences whose characters/words are inconsistent with the text to be added, and to filter out text sequences that do not conform to punctuation standards;
the output/restoration module 360 is configured to output, from the text sequences remaining in the candidate text sequence set, the text sequence with the highest joint probability that conforms to the punctuation mark specification, and to perform a normalization restoration operation on the output text sequence.
Optionally, the seq2seq model is a seq2seq model based on a bidirectional long short-term memory model and an attention mechanism.
Optionally, the seq2seq model includes an embedding layer, an encoder, an attention layer, and a decoder;
the encoder comprises a bidirectional long short-term memory unit; the attention layer comprises a long short-term memory unit;
the decoder comprises a bidirectional long short-term memory unit, and is used to generate a probability distribution over characters at each time step through a softmax function based on the output of the attention layer and the output of the embedding layer, to select characters at each time step according to their probabilities, and to form at least one punctuation-added text sequence.
Optionally, the device further comprises a training module configured to:
remove punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text; perform word segmentation processing and part-of-speech recognition on the training text, and perform normalization processing on set words in the processed training text;
acquire, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
splice the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors;
input the feature vectors into a seq2seq model and train the seq2seq model to obtain a trained seq2seq model.
Optionally, the set words include numbers.
The device can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method.
Fig. 4 is a structural block diagram of a punctuation mark addition model training device in text according to an embodiment of the present invention. As shown in fig. 4, the device includes: a training text obtaining module 410, a processing module 420, a character/word vector obtaining module 430, a splicing module 440, and a training module 450.
The training text obtaining module 410 is configured to remove punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
the processing module 420 is configured to perform word segmentation processing and part-of-speech recognition on the training text, and to perform normalization processing on set words in the processed training text;
the character/word vector obtaining module 430 is configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
the splicing module 440 is configured to splice the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors;
the training module 450 is configured to input the feature vectors into a seq2seq model and to train the seq2seq model to obtain a trained seq2seq model.
The device can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the device includes:
one or more processors 510, one processor 510 being illustrated in fig. 5;
a memory 520;
the apparatus may further include: an input device 530 and an output device 540.
The processor 510, memory 520, input device 530 and output device 540 in the apparatus may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 5.
The memory 520, as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for adding punctuation marks in text in the embodiment of the present invention (e.g., the processing module 310, the character/word vector obtaining module 320, the splicing module 330, the candidate text sequence set obtaining module 340, the filtering module 350, and the output/restoration module 360 shown in fig. 3), or the program instructions/modules corresponding to the punctuation mark addition model training method in the embodiment of the present invention (e.g., the training text obtaining module 410, the processing module 420, the character/word vector obtaining module 430, the splicing module 440, and the training module 450 shown in fig. 4).
The processor 510 executes various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 520, thereby implementing the method for adding punctuation marks in text of the above method embodiments, namely:
acquiring a text to be added that contains no punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
filtering out, from the candidate text sequence set, text sequences whose characters/words are inconsistent with the text to be added, and filtering out text sequences that do not conform to punctuation standards;
and outputting, from the text sequences remaining in the candidate text sequence set, the text sequence with the highest joint probability that conforms to the punctuation mark specification, and performing a normalization restoration operation on the output text sequence.
Alternatively, the processor 510 implements the method for training a punctuation mark addition model in text provided by the embodiment of the present invention, comprising:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on set words in the processed training text;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
Memory 520 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 520 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 520 may optionally include memory located remotely from processor 510, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 530 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output 540 may include a display device such as a display screen.
The embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements a method for adding punctuation marks in text, as follows:
acquiring a text to be added that contains no punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
filtering out, from the candidate text sequence set, text sequences whose characters/words are inconsistent with the text to be added, and filtering out text sequences that do not conform to punctuation standards;
and outputting, from the text sequences remaining in the candidate text sequence set, the text sequence with the highest joint probability that conforms to the punctuation mark specification, and performing a normalization restoration operation on the output text sequence.
Alternatively, when executed by a processor, the program implements the method for training a punctuation mark addition model in text provided by the embodiment of the present invention, comprising:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on set words in the processed training text;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. A method for adding punctuation marks in a text, comprising:
acquiring a text to be added that contains no punctuation marks, performing word segmentation and part-of-speech recognition on the text to be added, and performing normalization processing on set words in the processed text to be added;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the text to be added to obtain feature vectors;
inputting the feature vectors into a trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
filtering out, from the candidate text sequence set, text sequences whose characters/words are inconsistent with the text to be added, and filtering out text sequences that do not conform to punctuation standards;
and outputting, from the text sequences remaining in the candidate text sequence set, the text sequence with the highest joint probability that conforms to the punctuation mark specification, and performing a normalization restoration operation on the output text sequence.
2. The method of claim 1, wherein the seq2seq model is a seq2seq model based on a bidirectional long short-term memory model and an attention mechanism.
3. The method according to claim 1 or 2, wherein the seq2seq model comprises an embedding layer, an encoder, an attention layer and a decoder;
the encoder comprises a bidirectional long short-term memory unit; the attention layer comprises a long short-term memory unit;
the decoder comprises a bidirectional long short-term memory unit, and is used to generate a probability distribution over characters at each time step through a softmax function based on the output of the attention layer and the output of the embedding layer, to select characters at each time step according to their probabilities, and to form at least one punctuation-added text sequence.
4. The method as recited in claim 1, further comprising:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text; performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on set words in the processed training text;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
5. The method of claim 1, wherein the set words comprise numbers.
6. A method for training a punctuation mark addition model in text, comprising:
removing punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
performing word segmentation processing and part-of-speech recognition on the training text, and performing normalization processing on set words in the processed training text;
acquiring, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
splicing the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors;
inputting the feature vectors into a seq2seq model and training the seq2seq model to obtain a trained seq2seq model.
7. A punctuation mark adding apparatus in a text, comprising:
a processing module, configured to acquire a text to be added that contains no punctuation marks, perform word segmentation processing and part-of-speech recognition on the text to be added, and perform normalization processing on set words in the processed text to be added;
a character/word vector obtaining module, configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed text to be added;
a splicing module, configured to splice the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the text to be added to obtain feature vectors;
a candidate text sequence set obtaining module, configured to input the feature vectors into a trained seq2seq model to obtain a plurality of punctuation-added candidate text sequences, which form a candidate text sequence set;
a filtering module, configured to filter out, from the candidate text sequence set, text sequences whose characters/words are inconsistent with the text to be added, and to filter out text sequences that do not conform to punctuation standards;
and an output/restoration module, configured to output, from the text sequences remaining in the candidate text sequence set, the text sequence with the highest joint probability that conforms to the punctuation mark specification, and to perform a normalization restoration operation on the output text sequence.
8. A punctuation mark addition model training device in a text, comprising:
a training text obtaining module, configured to remove punctuation marks from a plurality of original texts carrying punctuation marks to obtain a training text;
a processing module, configured to perform word segmentation processing and part-of-speech recognition on the training text, and to perform normalization processing on set words in the processed training text;
a character/word vector obtaining module, configured to obtain, through a pre-trained character/word vector model, the character/word vector corresponding to each character in the processed training text;
a splicing module, configured to splice the part-of-speech information, word segmentation boundary information and character/word vector corresponding to each character in the training text to obtain feature vectors;
and a training module, configured to input the feature vectors into a seq2seq model and to train the seq2seq model to obtain a trained seq2seq model.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for adding punctuation marks in text according to any one of claims 1-5, or the method for training a punctuation mark addition model in text according to claim 6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for adding punctuation marks in text according to any one of claims 1-5, or the method for training a punctuation mark addition model in text according to claim 6.
CN201911182421.XA 2019-11-27 2019-11-27 Method and device for adding mark symbols in text and method and device for training model, and electronic equipment Active CN111027291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911182421.XA CN111027291B (en) 2019-11-27 2019-11-27 Method and device for adding mark symbols in text and method and device for training model, and electronic equipment


Publications (2)

Publication Number Publication Date
CN111027291A CN111027291A (en) 2020-04-17
CN111027291B true CN111027291B (en) 2024-03-26

Family

ID=70207202


Country Status (1)

Country Link
CN (1) CN111027291B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950256A (en) * 2020-06-23 2020-11-17 北京百度网讯科技有限公司 Sentence break processing method and device, electronic equipment and computer storage medium
CN111753524A (en) * 2020-07-01 2020-10-09 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN112001167B (en) * 2020-08-26 2021-04-23 四川云从天府人工智能科技有限公司 Punctuation mark adding method, system, equipment and medium
CN112464642A (en) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 Method, device, medium and electronic equipment for adding punctuation to text
CN113609850A (en) * 2021-07-02 2021-11-05 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN107767870A (en) * 2017-09-29 2018-03-06 百度在线网络技术(北京)有限公司 Adding method, device and the computer equipment of punctuation mark
CN108932226A (en) * 2018-05-29 2018-12-04 华东师范大学 A kind of pair of method without punctuate text addition punctuation mark
CN109918666A (en) * 2019-03-06 2019-06-21 北京工商大学 A kind of Chinese punctuation mark adding method neural network based
WO2019174422A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Method for analyzing entity association relationship, and related apparatus


Non-Patent Citations (1)

Title
Ren Zhihui; Xu Haoyu; Feng Songlin; Zhou Han; Shi Jun. A Chinese word segmentation method based on sequence labeling with LSTM networks. Application Research of Computers. 2017, Vol. 34 (No. 5), full text. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 501, 502, 503, No. 66 Boxia Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203 (country or region after: China)
Applicant after: Daguan Data Co.,Ltd.
Address before: Room 301, 303 and 304, block B, 112 liangxiu Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203 (country or region before: China)
Applicant before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co.,Ltd.
GR01 Patent grant