EP3852013A1 - Method, apparatus, and storage medium for predicting punctuation in text - Google Patents
Method, apparatus, and storage medium for predicting punctuation in text
- Publication number
- EP3852013A1 (application number EP20215758.2A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- character
- text
- punctuation
- prediction result
- predicted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present disclosure relates to a field of data processing technologies, specifically, to a field of text tagging technologies, and more particularly, to a method, an apparatus, and a storage medium for predicting a punctuation in a text.
- the current method for predicting a punctuation of a text uses a sequence tagging model or a language model to determine whether the punctuation follows each character in the text and a type of the punctuation.
- the sequence tagging model has poor adaptability to new words and hot words; and the language model has poor generalization ability, such that a slight difference in the text may result in a different result. Consequently, the efficiency of predicting the punctuation is low.
- the present disclosure provides a method for predicting a punctuation in a text, including: obtaining a text to be predicted; inputting the text to be predicted into a preset sequence tagging model to obtain at least one prediction result and a first score corresponding to each of the at least one prediction result of each character in the text to be predicted, in which each of the at least one prediction result represents whether a punctuation follows the corresponding character and a type of the punctuation; generating a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; inputting the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result; determining a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result; and performing punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- inputting the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result and the first score corresponding to each of the at least one prediction result of each character in the text to be predicted includes: inputting the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and performing reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- generating the text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result includes: for each character in the text to be predicted, determining whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generating a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generating a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result.
- in response to the corresponding prediction result representing that no punctuation follows the character, the content is empty; in response to the corresponding prediction result representing that a punctuation follows the character, the content is a type of the punctuation; likewise, in response to the punctuation existence situation representing that no punctuation follows the character, the content is empty; and in response to the punctuation existence situation representing that a punctuation follows the character, the content is a type of the punctuation.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted, in which in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation; in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- determining the punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result includes: for each of the at least one prediction result of the character, obtaining the first score and the second score corresponding to the corresponding prediction result; performing a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determining the punctuation existence situation of the character based on a prediction result with a smallest total score.
- performing the punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain the punctuated text corresponding to the text to be predicted includes: for each character in the text to be predicted, determining whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, adding the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtaining the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
- the present disclosure provides an apparatus for predicting a punctuation, including: an obtaining module, configured to obtain a text to be predicted; an input module, configured to input the text to be predicted into a preset sequence tagging model to obtain at least one prediction result and a first score corresponding to each of the at least one prediction result of each character in the text to be predicted, each of the at least one prediction result representing whether a punctuation follows the corresponding character and a type of the punctuation; a first determination module, configured to generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result, and input the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result; a second determination module, configured to determine a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result; and a punctuation processing module, configured to perform punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- the input module is configured to: input the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and perform reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- the first determination module is configured to: determine whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generate a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by a punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result.
- in response to the corresponding prediction result representing that no punctuation follows the character, the content is empty; in response to the corresponding prediction result representing that a punctuation follows the character, the content is a type of the punctuation; likewise, in response to the punctuation existence situation representing that no punctuation follows the character, the content is empty; and in response to the punctuation existence situation representing that a punctuation follows the character, the content is a type of the punctuation.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by a punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted, in which in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation; in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- the second determination module is configured to: for each of the at least one prediction result of the character, obtain the first score and the second score corresponding to the corresponding prediction result; perform a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determine the punctuation existence situation of the character based on a prediction result with a smallest total score.
- the punctuation processing module is configured to: for each character in the text to be predicted, determine whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, add the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtain the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
- the present disclosure provides a computer-readable storage medium having a computer instruction stored thereon.
- the computer instruction is configured to make a computer implement the above method.
- FIG. 1 is a schematic diagram according to Embodiment 1 of the present disclosure. It should be noted that the method for predicting the punctuation according to embodiments of the present disclosure is executed by the apparatus for predicting the punctuation.
- the apparatus may be implemented by software and/or hardware, and configured in a terminal device or a server, which is not limited in embodiments of the present disclosure.
- the method for predicting the punctuation may include the following.
- at step 101, a text to be predicted is obtained.
- the text to be predicted may be an unpunctuated text.
- the unpunctuated text may be, for example, a text obtained after a speech recognition system recognizes speech, or an unpunctuated text obtained in the process of speech transcription.
- at step 102, the text to be predicted is inputted into a preset sequence tagging model to obtain a first punctuation prediction result of each character in the text to be predicted.
- the first punctuation prediction result includes at least one prediction result and a first score corresponding to each of the at least one prediction result.
- Each of the at least one prediction result represents whether the punctuation follows the corresponding character and a type of the punctuation.
- the process of performing step 102 by the apparatus for predicting the punctuation may include inputting the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and for each of the at least one prediction result of each character, performing reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- the sequence tagging model may be, for example, a recurrent neural network (RNN) model, a bidirectional-long short-term memory (Bi-LSTM) model, etc.
- taking the Bi-LSTM model as an example, the model includes four parts: an input window, a word vector, a Bi-LSTM layer, and a softmax inference layer.
- an output of the softmax inference layer may be at least one prediction result corresponding to each character in the text and a prediction probability of each of the at least one prediction result.
- the reciprocal operation and the logarithmic operation are performed on the prediction probability of each prediction result to obtain the first score corresponding to that prediction result, such that a higher prediction probability yields a lower score.
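The mapping from prediction probability to first score can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the natural logarithm, the function name `first_score`, and the example probabilities are assumptions, since the text specifies only "reciprocal operation and logarithmic operation".

```python
import math


def first_score(probability: float) -> float:
    """Reciprocal, then logarithm: log(1 / p), i.e. the negative log-probability.

    The logarithm base is an assumption; the source does not state one.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log(1.0 / probability)


# Hypothetical softmax output for one character: label -> prediction probability.
# The empty string stands for "no punctuation follows the character".
probabilities = {"": 0.70, ",": 0.25, ".": 0.05}
scores = {label: first_score(p) for label, p in probabilities.items()}
```

Under this mapping the most probable prediction result receives the smallest score, which matches the later selection of the prediction result with the smallest total score.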
- the preset sequence tagging model is obtained after training an initial sequence tagging model using training data.
- the training data includes samples in an amount greater than a preset amount, and each sample includes an unpunctuated text and a corresponding sequence of punctuations.
- the preset amount may be, for example, 5GB.
- at step 103, a text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result, and the text to be inputted is inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result.
- the input of the language model is a piece of text, and the output of the language model is the perplexity of that piece of text; the perplexity is used as the second score.
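To make the notion of perplexity concrete, the sketch below computes it with a toy add-one-smoothed unigram model. The class `UnigramLM` and its training corpus are hypothetical stand-ins; the source does not say which kind of language model is used, only that the model outputs a perplexity for the input text.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model (an assumption, not the preset language model)."""

    def __init__(self, corpus_tokens):
        self.counts = Counter(corpus_tokens)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def prob(self, token):
        return (self.counts[token] + 1) / (self.total + self.vocab_size)

    def perplexity(self, tokens):
        """exp of the mean negative log-probability of the tokens."""
        if not tokens:
            raise ValueError("need at least one token")
        log_sum = sum(math.log(self.prob(t)) for t in tokens)
        return math.exp(-log_sum / len(tokens))


lm = UnigramLM("a a a b".split())
```

A lower perplexity means the model finds the text more plausible, so a lower second score favors the corresponding prediction result.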
- the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result.
- for example, the text to be predicted is "I feel good today", and the corresponding character is "good". If one prediction result is that no punctuation follows the character "good", the text to be inputted may be "I feel good"; if one prediction result is that a comma follows the character "good", the text to be inputted may be "I feel good,"; and if one prediction result is that a period follows the character "good", the text to be inputted may be "I feel good.".
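The "I feel good" example can be reproduced with a short sketch. It assumes that each English word plays the role of a "character" and that the empty string denotes "no punctuation"; the function name `candidate_texts` is hypothetical.

```python
def candidate_texts(tokens, index, labels):
    """Return {label: text to be inputted} for the token at `index`.

    Each text ends at that token, followed by the punctuation the label represents.
    """
    prefix = " ".join(tokens[: index + 1])
    return {label: prefix + label for label in labels}


tokens = "I feel good today".split()
texts = candidate_texts(tokens, 2, ["", ",", "."])
```

Each of the resulting texts would then be scored by the language model to obtain the second score of the corresponding prediction result.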
- the preset language model is obtained after training an initial language model using training data.
- the training data includes samples in an amount greater than a preset amount, and each sample includes an unpunctuated text and a corresponding punctuated text.
- the preset amount may be, for example, 1TB.
- at step 104, a punctuation existence situation of the corresponding character is determined based on the first score and the second score corresponding to each of the at least one prediction result.
- the process of performing step 104 by the apparatus for predicting the punctuation may include, for each of the at least one prediction result of the character, obtaining the first score and the second score corresponding to the corresponding prediction result, performing a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result, and determining the punctuation existence situation of the character based on a prediction result with a smallest total score.
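The weighted-sum selection can be sketched as follows. The weights `w_first` and `w_second`, the function name, and the example scores are all assumptions; the source states only that a weighted sum is taken and that the prediction result with the smallest total score is kept.

```python
from typing import Dict, Tuple


def pick_prediction(
    first_scores: Dict[str, float],
    second_scores: Dict[str, float],
    w_first: float = 0.5,   # hypothetical weight for the sequence tagging score
    w_second: float = 0.5,  # hypothetical weight for the language model score
) -> Tuple[str, Dict[str, float]]:
    """Return the label with the smallest weighted total, plus all totals."""
    totals = {
        label: w_first * first_scores[label] + w_second * second_scores[label]
        for label in first_scores
    }
    best = min(totals, key=totals.get)
    return best, totals


# Hypothetical scores for one character; "" means no punctuation.
first = {"": 0.36, ",": 1.39, ".": 3.00}
second = {"": 5.0, ",": 2.0, ".": 4.0}
best, totals = pick_prediction(first, second)
```

In this example the sequence tagging model alone would prefer no punctuation, but the language model score shifts the decision to a comma.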
- a merging algorithm for merging scores of the sequence tagging model and the language model may be the beam-search algorithm.
- the algorithm is able to merge a score of the sequence tagging model and a score of the language model for each of the at least one prediction result of each character in the text to be predicted, and to make a selection from the at least one prediction result, thereby reducing the amount of subsequent calculations.
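One way the merging could be organized as a beam search is sketched below. This is an illustrative generalization under stated assumptions: the source does not give the beam width, the weights, or how scores accumulate across characters. Here totals are summed along each hypothesis, `lm_score` is a caller-supplied stand-in for the language model, and `beam_width=1` reduces to keeping the single smallest-total prediction result per character.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def beam_search(
    tokens: Sequence[str],
    tagger_scores: Sequence[Dict[str, float]],
    lm_score: Callable[[str], float],
    beam_width: int = 1,
    w_tagger: float = 0.5,
    w_lm: float = 0.5,
) -> List[str]:
    """Return one punctuation label ("" for none) per token, minimizing the merged score."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for i in range(len(tokens)):
        expanded: List[Tuple[float, List[str]]] = []
        for total, labels in beams:
            for label, first in tagger_scores[i].items():
                new_labels = labels + [label]
                # Text up to and including token i, with the punctuation chosen so far.
                text = " ".join(t + m for t, m in zip(tokens, new_labels))
                second = lm_score(text)
                expanded.append((total + w_tagger * first + w_lm * second, new_labels))
        expanded.sort(key=lambda hyp: hyp[0])
        beams = expanded[:beam_width]  # prune, so fewer candidates are scored later
    return beams[0][1]


# Hypothetical inputs: two tokens, tagger scores per token, and a toy language model.
tokens = ["Hi", "there"]
tagger_scores = [{"": 0.1, ",": 2.0}, {"": 3.0, ".": 0.2}]
toy_lm = lambda text: 0.0 if text.startswith("Hi,") else 10.0
labels = beam_search(tokens, tagger_scores, toy_lm, beam_width=2)
```

Pruning to `beam_width` hypotheses after each character is what limits the amount of subsequent calculation.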
- at step 105, punctuation processing is performed on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- the process of performing step 105 by the apparatus for predicting the punctuation may include, for each character in the text to be predicted, determining whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, adding the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtaining the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
- the schematic diagram of the method for predicting the punctuation may be as illustrated in FIG. 2.
- for example, the text to be predicted is "Hi I am in Baidu Technology Park", and the punctuated text corresponding to the text to be predicted is "Hi, I am in Baidu Technology Park."
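The punctuation processing step, applied to the Baidu Technology Park example, might look like the sketch below. It assumes words stand in for "characters" and that each punctuation existence situation is encoded as a string, empty when no punctuation follows; the function name `apply_punctuation` is hypothetical.

```python
def apply_punctuation(tokens, situations):
    """Append each token's punctuation (or nothing) and join the tokens with spaces."""
    if len(tokens) != len(situations):
        raise ValueError("one punctuation existence situation is needed per token")
    return " ".join(token + mark for token, mark in zip(tokens, situations))


tokens = "Hi I am in Baidu Technology Park".split()
# Assumed outcome of the preceding steps: a comma after "Hi", a period after "Park".
situations = [",", "", "", "", "", "", "."]
punctuated = apply_punctuation(tokens, situations)
```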
- both the sequence tagging model and the language model are adopted, so that advantages of the sequence tagging model and the language model may be combined, and the efficiency of predicting the punctuation may be improved.
- the text to be predicted is obtained.
- the text to be predicted is inputted into the preset sequence tagging model to obtain the first punctuation prediction result of each character in the text to be predicted.
- the first punctuation prediction result includes at least one prediction result and the first score corresponding to each of the at least one prediction result.
- Each of the at least one prediction result represents whether the punctuation follows the corresponding character and the type of the punctuation.
- the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result.
- the text to be inputted is inputted into the preset language model to obtain the second score corresponding to each of the at least one prediction result.
- the punctuation existence situation of the corresponding character is determined based on the first score and the second score corresponding to each of the at least one prediction result.
- the punctuation processing is performed on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain the punctuated text corresponding to the text to be predicted. Consequently, whether the punctuation follows each character in the text and the type of the punctuation are determined based on the sequence tagging model and the language model. In this manner, advantages of the sequence tagging model and the language model are combined, such that the efficiency of predicting the punctuation is improved.
- FIG. 3 is a schematic diagram according to Embodiment 2 of the present disclosure. As illustrated in FIG. 3, based on the embodiment illustrated in FIG. 1, step 103 may include the following steps.
- the first character refers to the character with which the text to be predicted starts, that is, the character that appears first in the text to be predicted.
- a text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result.
- the text to be inputted corresponding to each of the at least one prediction result of the corresponding first character may include the first character and the content represented by the corresponding prediction result.
- the score of each of the at least one prediction result of the corresponding character may be determined based on the one or more characters following the corresponding character, and then, the text to be inputted corresponding to each of the at least one prediction result may be generated based on the one or more characters following the corresponding character.
- the text to be inputted corresponding to each of the at least one prediction result of the corresponding first character may include the first character, the content represented by the corresponding prediction result, and one or more characters of a preset number following the corresponding first character in the text to be predicted.
- a text to be inputted corresponding to each of the at least one prediction result is generated based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character in the text to be predicted, the non-first character, and the content represented by the corresponding prediction result.
- when the corresponding character is a non-first character in the text to be predicted and the punctuation existence situation of some character before the non-first character is that a punctuation follows that character, the text to be inputted is generated based on the punctuation existence situation of each character previous to the non-first character, such that the accuracy of the text to be inputted and the efficiency of predicting the punctuation are improved.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result.
- in response to the corresponding prediction result representing that no punctuation follows the character, the content is empty; and in response to the corresponding prediction result representing that a punctuation follows the character, the content is a type of the punctuation.
- when the punctuation existence situation of the character before the non-first character represents that no punctuation follows that character, the content represented by the punctuation existence situation is empty, and thus no punctuation is inserted after the character before the non-first character; when the punctuation existence situation represents that a punctuation follows the character before the non-first character, the content represented by the punctuation existence situation is a type of the punctuation, and then the punctuation is inserted after the character before the non-first character.
- the score of each of the at least one prediction result of the corresponding character may be determined based on the one or more characters following the corresponding character, and then, the text to be inputted corresponding to each of the at least one prediction result may be generated based on the one or more characters following the corresponding character.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted.
- FIG. 4 is a schematic diagram of determining a punctuation existence situation of a character.
- the punctuation existence situation of the character "addition" is that a comma follows the character "addition".
- scores of prediction results of the character "I" are determined based on the character "addition" and the comma following the character "addition", and then a choice is made accordingly.
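The construction of the text to be inputted for a non-first character, including already-decided punctuation and an optional lookahead, can be sketched as follows. The tokens "In" and "agree", the function name `build_lm_input`, and the `lookahead` parameter name are hypothetical; only "addition", the comma after it, and "I" come from the FIG. 4 description.

```python
def build_lm_input(tokens, index, decided, label, lookahead=0):
    """Build the text to be inputted for the token at `index` under one prediction result.

    decided: punctuation ("" for none) already determined for tokens[:index].
    label: punctuation the prediction result represents for tokens[index].
    lookahead: preset number of following tokens to append.
    """
    if len(decided) != index:
        raise ValueError("need one decided situation per preceding token")
    before = [t + m for t, m in zip(tokens[:index], decided)]
    current = tokens[index] + label
    after = list(tokens[index + 1 : index + 1 + lookahead])
    return " ".join(before + [current] + after)


tokens = ["In", "addition", "I", "agree"]
decided = ["", ","]  # a comma was already determined to follow "addition"
no_mark = build_lm_input(tokens, 2, decided, "", lookahead=1)
with_comma = build_lm_input(tokens, 2, decided, ",", lookahead=1)
```

With `index=0` and an empty `decided` list, the same function covers the first-character case, where no preceding punctuation exists.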
- the text to be predicted is obtained.
- the text to be predicted is inputted into the preset sequence tagging model to obtain the first punctuation prediction result of each character in the text to be predicted.
- the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result.
- the text to be inputted corresponding to each of the at least one prediction result is generated based on the punctuation existence situation of each of the at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result. Consequently, whether the punctuation follows each character in the text and the type of the punctuation are determined based on the sequence tagging model and the language model. In this manner, advantages of the sequence tagging model and the language model are combined, such that the efficiency of predicting the punctuation is improved.
- embodiments of the present disclosure also provide an apparatus for predicting a punctuation.
- FIG. 5 is a schematic diagram according to Embodiment 3 of the present disclosure.
- an apparatus for predicting a punctuation 100 includes an obtaining module 110, an input module 120, a first determination module 130, a second determination module 140, and a punctuation processing module 150.
- the obtaining module 110 is configured to obtain a text to be predicted.
- the input module 120 is configured to input the text to be predicted into a preset sequence tagging model to obtain a first punctuation prediction result of each character in the text to be predicted.
- the first punctuation prediction result includes at least one prediction result and a first score corresponding to each of the at least one prediction result.
- Each of the at least one prediction result represents whether the punctuation follows the corresponding character and a type of the punctuation.
- the first determination module 130 is configured to, for each character in the text to be predicted and each of the at least one prediction result of the corresponding character, generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result, and input the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result.
- the second determination module 140 is configured to determine a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result.
- the punctuation processing module 150 is configured to perform punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- the input module 120 is configured to: input the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and for each of the at least one prediction result of each character, perform reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- the first determination module 130 is configured to: for each character in the text to be predicted, determine whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generate a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result.
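- in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation.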
- the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted.
- the second determination module 140 is configured to: for each of the at least one prediction result of the character, obtain the first score and the second score corresponding to the corresponding prediction result; perform a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determine the punctuation existence situation of the character based on a prediction result with a smallest total score.
- the punctuation processing module 150 is configured to: for each character in the text to be predicted, determine whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, add the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtain the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
- the text to be predicted is obtained.
- the text to be predicted is inputted into the preset sequence tagging model to obtain the first punctuation prediction result of each character in the text to be predicted.
- the first punctuation prediction result includes at least one prediction result and the first score corresponding to each of the at least one prediction result.
- Each of the at least one prediction result represents whether the punctuation follows the corresponding character and the type of the punctuation.
- the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result.
- the text to be inputted is inputted into the preset language model to obtain the second score corresponding to each of the at least one prediction result.
- the punctuation existence situation of the corresponding character is determined based on the first score and the second score corresponding to each of the at least one prediction result.
- the punctuation processing is performed on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain the punctuated text corresponding to the text to be predicted. Consequently, whether the punctuation follows each character in the text and the type of the punctuation are determined based on the sequence tagging model and the language model. In this manner, advantages of the sequence tagging model and the language model are combined, such that the efficiency of predicting the punctuation is improved.
- an electronic device and a readable storage medium are provided.
- FIG. 6 is a block diagram of an electronic device for implementing a method for predicting a punctuation according to embodiments of the present disclosure.
- the electronic device is intended to represent various forms of digital computers, such as a laptop computer, a desktop computer, a workstation, a personal digital assistant, a server, a blade server, a mainframe computer and other suitable computers.
- the electronic device may also represent various forms of mobile devices, such as a personal digital processor, a cellular phone, a smart phone, a wearable device and other similar computing devices.
- Components shown herein, their connections and relationships as well as their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or required herein.
- the electronic device includes: one or more processors 301, a memory 302, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
- the components are interconnected by different buses and may be mounted on a common motherboard or otherwise installed as required.
- the processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to the interface).
- multiple processors and/or multiple buses may be used with multiple memories.
- multiple electronic devices may be connected, each providing some of the necessary operations (for example, as a server array, a group of blade servers, or a multiprocessor system).
- One processor 301 is taken as an example in FIG. 6 .
- the memory 302 is a non-transitory computer-readable storage medium according to the embodiments of the present disclosure.
- the memory stores instructions executable by at least one processor, so that the at least one processor executes the method for predicting the punctuation provided by the present disclosure.
- the non-transitory computer-readable storage medium according to the present disclosure stores computer instructions, which are configured to make the computer execute the method for predicting the punctuation provided by the present disclosure.
- the memory 302 may be configured to store non-transitory software programs, non-transitory computer executable programs and modules, such as program instructions/modules (for example, the obtaining module 110, the input module 120, the first determination module 130, the second determination module 140, and the punctuation processing module 150 illustrated in FIG. 5 ) corresponding to the method for predicting the punctuation according to embodiments of the present disclosure.
- the processor 301 executes various functional applications and performs data processing of the server by running non-transitory software programs, instructions and modules stored in the memory 302, that is, the method for predicting the punctuation according to the foregoing method embodiments is implemented.
- the memory 302 may include a storage program area and a storage data area, where the storage program area may store an operating system and applications required for at least one function; and the storage data area may store data created according to the use of the electronic device, and the like.
- the memory 302 may include a high-speed random-access memory, and may further include a non-transitory memory, such as at least one magnetic disk memory, a flash memory device, or other non-transitory solid-state memories.
- the memory 302 may optionally include memories remotely disposed with respect to the processor 301, and these remote memories may be connected to the electronic device, which is configured to implement the method for predicting the punctuation, through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
- the electronic device configured to implement the method for predicting the punctuation may further include an input device 303 and an output device 304.
- the processor 301, the memory 302, the input device 303 and the output device 304 may be connected through a bus or in other manners.
- FIG. 6 is illustrated by establishing the connection through a bus.
- the input device 303 may receive input numeric or character information, and generate key signal inputs related to user settings and function control of the electronic device configured to implement the method for predicting the punctuation, such as a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, trackballs, joysticks and other input devices.
- the output device 304 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and so on.
- the display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display and a plasma display. In some embodiments, the display device may be a touch screen.
- Various implementations of systems and technologies described herein may be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpreted on a programmable system including at least one programmable processor.
- the programmable processor may be a dedicated or general-purpose programmable processor that may receive data and instructions from a storage system, at least one input device and at least one output device, and transmit the data and instructions to the storage system, the at least one input device and the at least one output device.
- the systems and technologies described herein may be implemented on a computer having: a display device (for example, a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and a pointing device (such as a mouse or trackball) through which the user may provide input to the computer.
- Other kinds of devices may also be used to provide interactions with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback or haptic feedback); and input from the user may be received in any form (including acoustic input, voice input or tactile input).
- the systems and technologies described herein may be implemented in a computing system that includes back-end components (for example, as a data server), a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser, through which the user may interact with the implementation of the systems and technologies described herein), or a computing system including any combination of the back-end components, the middleware components or the front-end components.
- the components of the system may be interconnected by digital data communication (e.g., a communication network) in any form or medium. Examples of the communication network include: a local area network (LAN), a wide area network (WAN), and the Internet.
- Computer systems may include a client and a server.
- the client and server are generally remote from each other and typically interact through the communication network.
- a client-server relationship is generated by computer programs running on respective computers and having a client-server relationship with each other.
Abstract
Description
- The present disclosure relates to a field of data processing technologies, specifically, to a field of text tagging technologies, and more particularly, to a method, an apparatus, and a storage medium for predicting a punctuation in a text.
- The current method for predicting a punctuation of a text uses a sequence tagging model or a language model to determine whether the punctuation follows each character in the text and a type of the punctuation. However, the sequence tagging model has poor adaptability to new words and hot words; and the language model has poor generalization ability, such that a slight difference in the text may result in a different result. Consequently, the efficiency of predicting the punctuation is low.
- The present disclosure provides a method for predicting a punctuation in a text, including: obtaining a text to be predicted; inputting the text to be predicted into a preset sequence tagging model to obtain at least one prediction result and a first score corresponding to each of the at least one prediction result of each character in the text to be predicted, in which each of the at least one prediction result represents whether a punctuation follows the corresponding character and a type of the punctuation; generating a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; inputting the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result; determining a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result; and performing punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- Alternatively, inputting the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result and the first score corresponding to each of the at least one prediction result of each character in the text to be predicted includes: inputting the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and performing reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- Alternatively, generating the text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result includes: for each character in the text to be predicted, determining whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generating a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generating a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- Alternatively, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result. In response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation; and in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- Alternatively, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted, in which in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation; in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
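The composition of the text to be inputted for a non-first character can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the function name `build_lm_input`, the representation of a punctuation type as the symbol itself (with an empty string meaning "no punctuation"), the space separator, and the example sentence are all assumptions made here for illustration.

```python
from typing import List


def build_lm_input(
    tokens: List[str],
    decided: List[str],
    index: int,
    candidate: str,
    lookahead: int = 0,
    sep: str = " ",
) -> str:
    """Build the language-model input for one candidate prediction result.

    tokens    -- characters (or words) of the text to be predicted
    decided   -- punctuation already decided for tokens[:index]; "" means none
    index     -- position of the character currently being evaluated
    candidate -- content of the prediction result under test; "" means none
    lookahead -- preset number of following characters to append (assumed name)
    sep       -- separator between tokens (assumption; "" would suit Chinese)
    """
    if len(decided) < index:
        raise ValueError("every preceding character needs a decided situation")
    # Preceding characters, each followed by its already decided punctuation.
    parts = [tok + mark for tok, mark in zip(tokens[:index], decided[:index])]
    # The current character followed by the candidate content.
    parts.append(tokens[index] + candidate)
    # Optionally, a preset number of characters after the current one.
    parts.extend(tokens[index + 1 : index + 1 + lookahead])
    return sep.join(parts)


# Hypothetical example: a comma was already decided after "addition".
tokens = ["In", "addition", "I", "agree"]
decided = ["", ","]
no_punct = build_lm_input(tokens, decided, index=2, candidate="")
with_period = build_lm_input(tokens, decided, index=2, candidate=".", lookahead=1)
```

For the first character, `decided` is empty and the result contains only that character and the candidate content, plus any look-ahead characters.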
- Alternatively, determining the punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result includes: for each of the at least one prediction result of the character, obtaining the first score and the second score corresponding to the corresponding prediction result; performing a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determining the punctuation existence situation of the character based on a prediction result with a smallest total score.
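The weighted sum and the selection of the smallest total score might look like the sketch below. The weights, the function name `choose_prediction`, and all score values are hypothetical; the disclosure states only that a weighted sum is computed and that the prediction result with the smallest total score is used.

```python
from typing import Dict, Tuple


def choose_prediction(
    first_scores: Dict[str, float],
    second_scores: Dict[str, float],
    weight_first: float = 0.5,
    weight_second: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Combine both scores per prediction result and pick the smallest total.

    Keys are prediction results ("" means no punctuation). The default
    weights are assumptions; the source does not give concrete values.
    """
    if first_scores.keys() != second_scores.keys():
        raise ValueError("both score tables must cover the same prediction results")
    totals = {
        label: weight_first * first_scores[label] + weight_second * second_scores[label]
        for label in first_scores
    }
    # Smaller is better, so the minimum total score wins.
    best = min(totals, key=totals.get)
    return best, totals


# Hypothetical scores for one character with three prediction results.
first = {"": 1.6, ",": 2.3, ".": 0.4}    # from the sequence tagging model
second = {"": 3.0, ",": 2.0, ".": 2.5}   # from the language model
best, totals = choose_prediction(first, second)
```

Choosing the minimum is consistent with scores that shrink as a prediction result becomes more likely.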
- Alternatively, performing the punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain the punctuated text corresponding to the text to be predicted includes: for each character in the text to be predicted, determining whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, adding the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtaining the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
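As a rough illustration of the punctuation processing step, the sketch below walks over the characters once and appends a symbol wherever the punctuation existence situation names a type. The type names, the `PUNCTUATION_SYMBOLS` table, the use of `None` for "no punctuation", and the space separator are assumptions made for this example.

```python
from typing import List, Optional

# Hypothetical mapping from punctuation type to the symbol that is inserted.
PUNCTUATION_SYMBOLS = {"comma": ",", "period": ".", "question": "?"}


def punctuate(tokens: List[str], situations: List[Optional[str]], sep: str = " ") -> str:
    """Insert punctuation after each character according to its situation.

    situations[i] is None when no punctuation follows tokens[i], otherwise
    the type of the punctuation that follows it.
    """
    if len(tokens) != len(situations):
        raise ValueError("one punctuation existence situation per character is required")
    pieces = []
    for token, kind in zip(tokens, situations):
        # Empty content when no punctuation follows the character.
        mark = "" if kind is None else PUNCTUATION_SYMBOLS[kind]
        pieces.append(token + mark)
    return sep.join(pieces)


punctuated = punctuate(["I", "feel", "good", "today"], [None, None, None, "period"])
```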
- The present disclosure provides an apparatus for predicting a punctuation, including: an obtaining module, configured to obtain a text to be predicted; an input module, configured to input the text to be predicted into a preset sequence tagging model to obtain at least one prediction result and a first score corresponding to each of the at least one prediction result of each character in the text to be predicted, each of the at least one prediction result representing whether a punctuation follows the corresponding character and a type of the punctuation; a first determination module, configured to generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result, and input the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result; a second determination module, configured to determine a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result; and a punctuation processing module, configured to perform punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- Alternatively, the input module is configured to: input the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and perform reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- Alternatively, the first determination module is configured to: determine whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generate a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- Alternatively, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by a punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result. In response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation; and in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- Alternatively, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes: the at least one character before the non-first character, content represented by a punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted, in which in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation; in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- Alternatively, the second determination module is configured to: for each of the at least one prediction result of the character, obtain the first score and the second score corresponding to the corresponding prediction result; perform a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determine the punctuation existence situation of the character based on a prediction result with a smallest total score.
- Alternatively, the punctuation processing module is configured to: for each character in the text to be predicted, determine whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, add the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtain the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
- The present disclosure provides a computer-readable storage medium having a computer instruction stored thereon. The computer instruction is configured to make a computer implement the above method.
- Other effects of the above-mentioned implementations will be described below in combination with specific embodiments.
- The accompanying drawings are used for a better understanding of the solution, and do not constitute a limitation to the present disclosure.
- FIG. 1 is a schematic diagram according to Embodiment 1 of the present disclosure.
- FIG. 2 is a schematic diagram of a method for predicting a punctuation.
- FIG. 3 is a schematic diagram according to Embodiment 2 of the present disclosure.
- FIG. 4 is a schematic diagram of determining a punctuation existence situation of a character.
- FIG. 5 is a schematic diagram according to Embodiment 3 of the present disclosure.
- FIG. 6 is a block diagram of an electronic device for implementing a method for predicting a punctuation according to embodiments of the present disclosure.
- Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Also, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
- A method and apparatus for predicting a punctuation according to embodiments of the present disclosure will be described below in combination with the drawings.
-
FIG. 1 is a schematic diagram according to Embodiment 1 of the present disclosure. It should be noted that an executive subject of the method for predicting the punctuation according to embodiments of the present disclosure is the apparatus for predicting the punctuation. The apparatus may be implemented by software and/or hardware, and configured in a terminal device or a server, which is not limited in embodiments of the present disclosure. - As illustrated in
FIG. 1 , the method for predicting the punctuation may include the following. - At
block 101, a text to be predicted is obtained. - In the present disclosure, the text to be predicted may be an unpunctuated text. The unpunctuated text, for example, may be a text obtained after a speech recognition system recognizes speech, or an unpunctuated text obtained in the process of speech transcription.
- At
block 102, the text to be predicted is inputted into a preset sequence tagging model to obtain a first punctuation prediction result of each character in the text to be predicted. The first punctuation prediction result includes at least one prediction result and a first score corresponding to each of the at least one prediction result. Each of the at least one prediction result represents whether the punctuation follows the corresponding character and a type of the punctuation. - In the present disclosure, the process of performing
step 102 by the apparatus for predicting the punctuation may include inputting the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and for each of the at least one prediction result of each character, performing a reciprocal operation and a logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result. - In the present disclosure, the sequence tagging model may be, for example, a recurrent neural network (RNN) model, a bidirectional long short-term memory (Bi-LSTM) model, etc. Take the Bi-LSTM model as an example. The model includes four parts: an input window, a word vector, a Bi-LSTM layer and a softmax inference layer. When an input is a text, an output of the softmax inference layer may be at least one prediction result corresponding to each character in the text and a prediction probability of each of the at least one prediction result. Take a character "good" in a text "I feel good today" as an example. There may be, for example, three prediction results corresponding to the character "good". One is that no punctuation follows the character "good", one is that a comma follows the character "good", and one is that a period follows the character "good".
- In the present disclosure, for each of the at least one prediction result of each character, the reciprocal operation and the logarithmic operation are performed on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result. As a result, the higher the prediction probability, the lower the first score, so that a lower first score indicates a more likely prediction result.
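The conversion from prediction probability to first score can be sketched as follows. This is a minimal illustration, assuming the natural logarithm (the text does not name a base); the function name `first_score`, the labels and the probability values are hypothetical.

```python
import math


def first_score(probability: float) -> float:
    """Reciprocal operation followed by logarithmic operation: log(1 / p).

    The natural logarithm is an assumption; the source does not fix a base.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log(1.0 / probability)


# Hypothetical softmax output for the character "good" in "I feel good today".
probabilities = {"none": 0.2, "comma": 0.1, "period": 0.7}
first_scores = {label: first_score(p) for label, p in probabilities.items()}

# The most probable prediction result receives the lowest first score.
best_label = min(first_scores, key=first_scores.get)
```

Because log(1 / p) equals -log(p), the score is zero for a probability of 1 and grows as the probability shrinks, which matches the "lower is better" convention used for the language model score.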
- In the present disclosure, the preset sequence tagging model is obtained after training an initial sequence tagging model using training data. The training data includes samples in an amount greater than a preset amount, and each sample includes an unpunctuated text and a corresponding sequence of punctuations. The preset amount may be, for example, 5GB.
- At
block 103, for each character in the text to be predicted and each of the at least one prediction result of the corresponding character, a text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result. The text to be inputted is inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result. - In the present disclosure, an input of the language model is a piece of text, and an output of the language model is a perplexity of the piece of text. The lower the perplexity, the more probable the piece of text is under the language model, so a lower perplexity is better. In the present disclosure, the perplexity is used as the second score.
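The text does not state how the language model computes perplexity. As a hedged illustration, the sketch below uses the standard definition, the exponential of the negative mean log-probability per token; the function name and the per-token probabilities are hypothetical and stand in for whatever the preset language model would return.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities assigned by a language model.
likely_text = [math.log(0.5), math.log(0.5)]
unlikely_text = [math.log(0.1), math.log(0.1)]

# The more probable text has the lower perplexity, hence the better second score.
second_score_likely = perplexity(likely_text)
second_score_unlikely = perplexity(unlikely_text)
```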
- In the present disclosure, since the input of the language model is a piece of text, for each of the at least one prediction result of the corresponding character, the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result. For example, the text to be predicted is "I feel good today", and the corresponding character is "good". If one prediction result is that no punctuation follows the character "good", the text to be inputted may be "I feel good"; if one prediction result is that a comma follows the character "good", the text to be inputted may be "I feel good,"; and if one prediction result is that a period follows the character "good", the text to be inputted may be "I feel good.".
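The generation of one text to be inputted per prediction result can be sketched as below. The sketch treats whitespace-separated words as "characters", as the English example does; the label names and the function `candidate_texts` are illustrative assumptions, not names from the disclosure.

```python
from typing import Dict, List

# Hypothetical label set: each prediction result maps to the mark it represents.
PUNCTUATION: Dict[str, str] = {"none": "", "comma": ",", "period": "."}


def candidate_texts(words: List[str], index: int) -> Dict[str, str]:
    """Build one language-model input per prediction result for words[index]."""
    prefix = " ".join(words[: index + 1])
    return {label: prefix + mark for label, mark in PUNCTUATION.items()}


words = "I feel good today".split()
candidates = candidate_texts(words, words.index("good"))
```

Each candidate would then be scored by the language model, giving one second score per prediction result.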
- In the present disclosure, the preset language model is obtained after training an initial language model using training data. The training data includes samples in an amount greater than a preset amount, and each sample includes an unpunctuated text and a corresponding punctuated text. The preset amount may be, for example, 1TB.
- At
block 104, a punctuation existence situation of the corresponding character is determined based on the first score and the second score corresponding to each of the at least one prediction result. - In the present disclosure, the process of performing
step 104 by the apparatus for predicting the punctuation may include, for each of the at least one prediction result of the character, obtaining the first score and the second score corresponding to the corresponding prediction result, performing a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result, and determining the punctuation existence situation of the character based on a prediction result with a smallest total score. -
- In the present disclosure, a merging algorithm for merging scores of the sequence tagging model and the language model may be the beam-search algorithm. The algorithm is able to merge a score of the sequence tagging model and a score of the language model for each of the at least one prediction result of each character in the text to be predicted, and to make a selection from the at least one prediction result, thereby reducing the amount of subsequent calculations.
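The weighted-sum merging and the selection of the prediction result with the smallest total score can be sketched as follows. The weight value is an assumption, since the text does not give the weights, and the scores are made-up numbers. Note that this sketch makes a simple per-character choice; it is not an implementation of beam search.

```python
from typing import Dict


def total_score(first: float, second: float, weight: float = 0.5) -> float:
    """Weighted sum of the sequence tagging score and the language model score.

    The weight of 0.5 is a placeholder; the source does not specify weights.
    """
    return weight * first + (1.0 - weight) * second


def pick_prediction(
    first_scores: Dict[str, float],
    second_scores: Dict[str, float],
    weight: float = 0.5,
) -> str:
    """Return the prediction result whose total score is the smallest."""
    totals = {
        label: total_score(first_scores[label], second_scores[label], weight)
        for label in first_scores
    }
    return min(totals, key=totals.get)


# Hypothetical scores for the three prediction results of one character.
first = {"none": 1.6, "comma": 2.3, "period": 0.4}
second = {"none": 50.0, "comma": 80.0, "period": 30.0}
chosen = pick_prediction(first, second)
```

In practice the two scores live on different scales (a log-probability versus a perplexity), so the weight would need tuning on held-out data.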
- At
block 105, punctuation processing is performed on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted. - In the present disclosure, the process of performing
step 105 by the apparatus for predicting the punctuation may include, for each character in the text to be predicted, determining whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, adding the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtaining the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed. - In the present disclosure, the schematic diagram of the method for predicting the punctuation may be as illustrated in
FIG. 2. In FIG. 2, the text to be predicted is "Hi I am in Baidu Technology Park", and the punctuated text corresponding to the text to be predicted is "Hi, I am in Baidu Technology Park.". In the process of predicting punctuations of the text to be predicted, both the sequence tagging model and the language model are adopted, so that advantages of the sequence tagging model and the language model may be combined, and the efficiency of predicting the punctuation may be improved. - With the method for predicting the punctuation, the text to be predicted is obtained. The text to be predicted is inputted into the preset sequence tagging model to obtain the first punctuation prediction result of each character in the text to be predicted. The first punctuation prediction result includes at least one prediction result and the first score corresponding to each of the at least one prediction result. Each of the at least one prediction result represents whether the punctuation follows the corresponding character and the type of the punctuation. For each character in the text to be predicted and each of the at least one prediction result of the corresponding character, the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result. The text to be inputted is inputted into the preset language model to obtain the second score corresponding to each of the at least one prediction result. The punctuation existence situation of the corresponding character is determined based on the first score and the second score corresponding to each of the at least one prediction result. The punctuation processing is performed on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain the punctuated text corresponding to the text to be predicted. 
Consequently, whether the punctuation follows each character in the text and the type of the punctuation are determined based on the sequence tagging model and the language model. In this manner, advantages of the sequence tagging model and the language model are combined, such that the efficiency of predicting the punctuation is improved.
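The punctuation processing of block 105 can be sketched as a single pass over the characters. The sketch again treats whitespace-separated words as characters and represents each punctuation existence situation as a string (empty when no punctuation follows); the function name and this representation are assumptions made for illustration.

```python
from typing import List


def apply_punctuation(words: List[str], decisions: List[str]) -> str:
    """Append the decided mark (or nothing) after each word, then join."""
    if len(words) != len(decisions):
        raise ValueError("one decision is required per word")
    return " ".join(word + mark for word, mark in zip(words, decisions))


words = "Hi I am in Baidu Technology Park".split()
# Decisions matching the example of FIG. 2: comma after "Hi", period after "Park".
decisions = [",", "", "", "", "", "", "."]
punctuated = apply_punctuation(words, decisions)
```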
-
FIG. 3 is a schematic diagram according to Embodiment 2 of the present disclosure. As illustrated in FIG. 3, based on the embodiment illustrated in FIG. 1, step 103 may include the following steps. - At
block 1031, for each character in the text to be predicted, it is determined whether the corresponding character is a first character in the text to be predicted. - In the present disclosure, the first character refers to a character that the text to be predicted starts with. In other words, the first character refers to the first one that appears in the beginning of the text to be predicted.
- At
block 1032, in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, a text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result. - In the present disclosure, when the corresponding character is the first character in the text to be predicted, the text to be inputted corresponding to each of the at least one prediction result of the corresponding first character may include the first character and the content represented by the corresponding prediction result.
- Furthermore, since one or more characters following the corresponding character have a relatively important influence on a score of each of the at least one prediction result of the corresponding character, the score of each of the at least one prediction result of the corresponding character may be determined based on the one or more characters following the corresponding character, and then, the text to be inputted corresponding to each of the at least one prediction result may be generated based on the one or more characters following the corresponding character. Therefore, in the present disclosure, when the corresponding character is the first character in the text to be predicted, the text to be inputted corresponding to each of the at least one prediction result of the corresponding first character may include the first character, the content represented by the corresponding prediction result, and one or more characters of a preset number following the corresponding first character in the text to be predicted.
- At
block 1033, in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, a text to be inputted corresponding to each of the at least one prediction result is generated based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result. - In the present disclosure, when the corresponding character is a non-first character in the text to be predicted, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character in the text to be predicted, the non-first character, and the content represented by the corresponding prediction result.
- Further, when the corresponding character is the non-first character in the text to be predicted, a punctuation may already have been determined to follow some character before the non-first character. In that case, the text to be inputted is generated based on the punctuation existence situation of each character previous to the non-first character, such that the accuracy of the text to be inputted and the efficiency of predicting the punctuation are improved. Therefore, when the corresponding character is the non-first character in the text to be predicted, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result.
- In response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation.
- When a punctuation existence situation of a character before the non-first character represents that no punctuation follows the character before the non-first character, the content represented by the punctuation existence situation of the character before the non-first character is empty, and thus no punctuation is inserted after the character before the non-first character. When the punctuation existence situation of the character before the non-first character represents that a punctuation follows the character before the non-first character, the content represented by the punctuation existence situation of the character before the non-first character is a type of the punctuation, and then the punctuation is inserted after the character before the non-first character.
- Furthermore, since one or more characters following the corresponding character have a relatively important influence on the score of each of the at least one prediction result of the corresponding character, the score of each of the at least one prediction result of the corresponding character may be determined based on the one or more characters following the corresponding character, and then, the text to be inputted corresponding to each of the at least one prediction result may be generated based on the one or more characters following the corresponding character. Therefore, in the present disclosure, when the corresponding character is the non-first character in the text to be predicted, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted.
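The composition of the text to be inputted for a first or non-first character can be sketched as below: already-decided punctuation on the left, the candidate mark on the current character, and a preset number of following characters on the right. The function name, the parameter `lookahead` and its default of 1 are hypothetical; words again stand in for characters.

```python
from typing import List


def build_lm_input(
    words: List[str],
    index: int,
    decided: List[str],
    candidate_mark: str,
    lookahead: int = 1,
) -> str:
    """Compose the language-model input for one prediction result of words[index].

    decided holds the marks already chosen for words[0:index] ("" means none);
    for the first character it is simply empty.
    """
    if len(decided) != index:
        raise ValueError("exactly one decision per preceding word is required")
    left = [word + mark for word, mark in zip(words[:index], decided)]
    current = words[index] + candidate_mark
    right = words[index + 1 : index + 1 + lookahead]
    return " ".join(left + [current] + right)


words = "Hi I am in Baidu Technology Park".split()
# A comma was already decided after "Hi"; now score the candidates for "I".
no_mark = build_lm_input(words, 1, [","], "")
with_comma = build_lm_input(words, 1, [","], ",")
```

Here `no_mark` is "Hi, I am" and `with_comma` is "Hi, I, am"; the language model would be expected to prefer the former, which is the kind of left-to-right dependency the paragraph above describes.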
- In the present disclosure,
FIG. 4 is a schematic diagram of determining a punctuation existence situation of a character. In FIG. 4, for a character "addition", it is determined that the punctuation existence situation of the character "addition" is that a comma follows the character "addition". In the subsequent processing of the character "I", scores of prediction results of the character "I" are determined based on the character "addition" and the comma following the character "addition", and then a choice is made accordingly. - With the method for predicting the punctuation, the text to be predicted is obtained. The text to be predicted is inputted into the preset sequence tagging model to obtain the first punctuation prediction result of each character in the text to be predicted. Regarding each character in the text to be predicted, in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result. In response to the corresponding character being the non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, the text to be inputted corresponding to each of the at least one prediction result is generated based on the punctuation existence situation of each of the at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result. Consequently, whether the punctuation follows each character in the text and the type of the punctuation are determined based on the sequence tagging model and the language model. 
In this manner, advantages of the sequence tagging model and the language model are combined, such that the efficiency of predicting the punctuation is improved.
- To implement the above embodiments, embodiments of the present disclosure also provide an apparatus for predicting a punctuation.
-
FIG. 5 is a schematic diagram according to Embodiment 3 of the present disclosure. As illustrated in FIG. 5, an apparatus 100 for predicting a punctuation includes an obtaining module 110, an input module 120, a first determination module 130, a second determination module 140, and a punctuation processing module 150. - The obtaining
module 110 is configured to obtain a text to be predicted. - The
input module 120 is configured to input the text to be predicted into a preset sequence tagging model to obtain a first punctuation prediction result of each character in the text to be predicted. The first punctuation prediction result includes at least one prediction result and a first score corresponding to each of the at least one prediction result. Each of the at least one prediction result represents whether the punctuation follows the corresponding character and a type of the punctuation. - The
first determination module 130 is configured to, for each character in the text to be predicted and each of the at least one prediction result of the corresponding character, generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result, and input the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result. - The
second determination module 140 is configured to determine a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result. - The
punctuation processing module 150 is configured to perform punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted. - In some embodiments of the present disclosure, the
input module 120 is configured to: input the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and for each of the at least one prediction result of each character, perform reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result. - In some embodiments of the present disclosure, the
first determination module 130 is configured to: for each character in the text to be predicted, determine whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generate a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result. - In some embodiments of the present disclosure, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result. In response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation.
- In some embodiments of the present disclosure, the text to be inputted corresponding to each of the at least one prediction result of the non-first character includes the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted.
- In some embodiments of the present disclosure, the
second determination module 140 is configured to: for each of the at least one prediction result of the character, obtain the first score and the second score corresponding to the corresponding prediction result; perform a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determine the punctuation existence situation of the character based on a prediction result with a smallest total score. - In some embodiments of the present disclosure, the
punctuation processing module 150 is configured to: for each character in the text to be predicted, determine whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, add the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtain the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed. - It should be noted that the foregoing explanation of the method for predicting the punctuation is also applicable to the apparatus for predicting the punctuation according to embodiments of the present disclosure, and will not be repeated here.
- With the method for predicting the punctuation, the text to be predicted is obtained. The text to be predicted is inputted into the preset sequence tagging model to obtain the first punctuation prediction result of each character in the text to be predicted. The first punctuation prediction result includes at least one prediction result and the first score corresponding to each of the at least one prediction result. Each of the at least one prediction result represents whether the punctuation follows the corresponding character and the type of the punctuation. For each character in the text to be predicted and each of the at least one prediction result of the corresponding character, the text to be inputted corresponding to each of the at least one prediction result is generated based on the text to be predicted and the corresponding prediction result. The text to be inputted is inputted into the preset language model to obtain the second score corresponding to each of the at least one prediction result. The punctuation existence situation of the corresponding character is determined based on the first score and the second score corresponding to each of the at least one prediction result. The punctuation processing is performed on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain the punctuated text corresponding to the text to be predicted. Consequently, whether the punctuation follows each character in the text and the type of the punctuation are determined based on the sequence tagging model and the language model. In this manner, advantages of the sequence tagging model and the language model are combined, such that the efficiency of predicting the punctuation is improved.
- According to embodiments of the present disclosure, an electronic device and a readable storage medium are provided.
-
FIG. 6 is a block diagram of an electronic device for implementing a method for predicting a punctuation according to embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as a laptop computer, a desktop computer, a workbench, a personal digital assistant, a server, a blade server, a mainframe computer and other suitable computers. The electronic device may also represent various forms of mobile devices, such as a personal digital processor, a cellular phone, a smart phone, a wearable device and other similar computing devices. Components shown herein, their connections and relationships as well as their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or required herein. - As shown in
FIG. 6, the electronic device includes: one or more processors 301, a memory 302, and interfaces for connecting various components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or otherwise installed as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to the interface). In other embodiments, when necessary, multiple processors and/or multiple buses may be used with multiple memories. Similarly, multiple electronic devices may be connected, each providing some of the necessary operations (for example, as a server array, a group of blade servers, or a multiprocessor system). One processor 301 is taken as an example in FIG. 6. - The
memory 302 is a non-transitory computer-readable storage medium according to the embodiments of the present disclosure. The memory stores instructions executable by at least one processor, so that the at least one processor executes the method for predicting the punctuation provided by the present disclosure. The non-transitory computer-readable storage medium according to the present disclosure stores computer instructions, which are configured to make the computer execute the method for predicting the punctuation provided by the present disclosure. - As a non-transitory computer-readable storage medium, the
memory 302 may be configured to store non-transitory software programs, non-transitory computer executable programs and modules, such as program instructions/modules (for example, the obtaining module 110, the input module 120, the first determination module 130, the second determination module 140, and the punctuation processing module 150 illustrated in FIG. 5) corresponding to the method for predicting the punctuation according to embodiments of the present disclosure. The processor 301 executes various functional applications and performs data processing of the server by running non-transitory software programs, instructions and modules stored in the memory 302, that is, the method for predicting the punctuation according to the foregoing method embodiments is implemented. - The
memory 302 may include a storage program area and a storage data area, where the storage program area may store an operating system and applications required for at least one function; and the storage data area may store data created according to the use of the electronic device, and the like. In addition, the memory 302 may include a high-speed random-access memory, and may further include a non-transitory memory, such as at least one magnetic disk memory, a flash memory device, or other non-transitory solid-state memories. In some embodiments, the memory 302 may optionally include memories remotely disposed with respect to the processor 301, and these remote memories may be connected to the electronic device, which is configured to implement the method for predicting the punctuation, through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof. - The electronic device configured to implement the method for predicting the punctuation may further include an
input device 303 and an output device 304. The processor 301, the memory 302, the input device 303 and the output device 304 may be connected through a bus or in other manners. In FIG. 6, the connection through a bus is taken as an example. - The
input device 303 may receive input numeric or character information, and generate key signal inputs related to user settings and function control of the electronic device configured to implement the method for predicting the punctuation. Examples of the input device include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick and other input devices. The output device 304 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and so on. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display and a plasma display. In some embodiments, the display device may be a touch screen. - Various implementations of systems and technologies described herein may be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a dedicated or general-purpose programmable processor that may receive data and instructions from a storage system, at least one input device and at least one output device, and transmit the data and instructions to the storage system, the at least one input device and the at least one output device.
- These computer programs (also known as programs, software, software applications, or code) include machine instructions of a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device and/or apparatus configured to provide machine instructions and/or data to a programmable processor (for example, a magnetic disk, an optical disk, a memory and a programmable logic device (PLD)), and include machine-readable media that receive machine instructions as machine-readable signals. The term "machine-readable signals" refers to any signal used to provide machine instructions and/or data to a programmable processor.
- In order to provide interactions with the user, the systems and technologies described herein may be implemented on a computer having: a display device (for example, a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and a pointing device (such as a mouse or trackball) through which the user may provide input to the computer. Other kinds of devices may also be used to provide interactions with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback or haptic feedback); and input from the user may be received in any form (including acoustic input, voice input or tactile input).
- The systems and technologies described herein may be implemented in a computing system that includes back-end components (for example, as a data server), a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser, through which the user may interact with the implementation of the systems and technologies described herein), or a computing system including any combination of the back-end components, the middleware components or the front-end components. The components of the system may be interconnected by digital data communication (e.g., a communication network) in any form or medium. Examples of the communication network include: a local area network (LAN), a wide area network (WAN), and the Internet.
- Computer systems may include a client and a server. The client and server are generally remote from each other and typically interact through the communication network. A client-server relationship is generated by computer programs running on respective computers and having a client-server relationship with each other.
- It should be understood that various forms of processes shown above may be reordered, added or deleted. For example, the blocks described in the present disclosure may be executed in parallel, sequentially, or in different orders. As long as the desired results of the technical solution disclosed in the present disclosure may be achieved, there is no limitation herein.
- The foregoing specific implementations do not constitute a limit on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement and improvement made within the spirit and principle of the present disclosure shall be included in the protection scope of the present disclosure.
Claims (15)
- A method for predicting a punctuation in a text, comprising: obtaining a text to be predicted (101); inputting (102) the text to be predicted into a preset sequence tagging model to obtain at least one prediction result and a first score corresponding to each of the at least one prediction result of each character in the text to be predicted, each of the at least one prediction result representing whether a punctuation follows the corresponding character and a type of the punctuation; generating (103) a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; inputting (103) the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result; determining (104) a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result; and performing (105) punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- The method of claim 1, wherein inputting (102) the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result and the first score corresponding to each of the at least one prediction result of each character in the text to be predicted, comprises: inputting the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and performing reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- The method of claim 1 or 2, wherein generating (103) the text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result comprises: for each character in the text to be predicted, determining (1031) whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generating (1032) a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generating (1033) a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- The method of claim 3, wherein the text to be inputted corresponding to each of the at least one prediction result of the non-first character comprises: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result;
wherein in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation;
wherein in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- The method of claim 3, wherein the text to be inputted corresponding to each of the at least one prediction result of the non-first character comprises: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted;
wherein in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation;
wherein in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- The method of any one of claims 1 to 5, wherein determining (104) the punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result comprises: for each of the at least one prediction result of the character, obtaining the first score and the second score corresponding to the corresponding prediction result; performing a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determining the punctuation existence situation of the character based on a prediction result with a smallest total score.
- The method of any one of claims 1 to 6, wherein performing (105) the punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain the punctuated text corresponding to the text to be predicted comprises: for each character in the text to be predicted, determining whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, adding the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtaining the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
- An apparatus for predicting a punctuation in a text, comprising: an obtaining module (110), configured to obtain a text to be predicted; an input module (120), configured to input the text to be predicted into a preset sequence tagging model to obtain at least one prediction result and a first score corresponding to each of the at least one prediction result of each character in the text to be predicted, each of the at least one prediction result representing whether a punctuation follows the corresponding character and a type of the punctuation; a first determination module (130), configured to generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result, and input the text to be inputted into a preset language model to obtain a second score corresponding to each of the at least one prediction result; a second determination module (140), configured to determine a punctuation existence situation of the corresponding character based on the first score and the second score corresponding to each of the at least one prediction result; and a punctuation processing module (150), configured to perform punctuation processing on the text to be predicted based on the punctuation existence situation of each character in the text to be predicted to obtain a punctuated text corresponding to the text to be predicted.
- The apparatus of claim 8, wherein the input module (120) is configured to: input the text to be predicted into the preset sequence tagging model to obtain the at least one prediction result corresponding to each character in the text to be predicted and a prediction probability of each of the at least one prediction result; and perform reciprocal operation and logarithmic operation on the prediction probability of the corresponding prediction result to obtain the first score corresponding to the corresponding prediction result.
- The apparatus of claim 8 or 9, wherein the first determination module (130) is configured to: determine whether the corresponding character is a first character in the text to be predicted; in response to the corresponding character being the first character in the text to be predicted, for each of the at least one prediction result of the corresponding first character, generate a text to be inputted corresponding to each of the at least one prediction result based on the text to be predicted and the corresponding prediction result; and in response to the corresponding character being a non-first character in the text to be predicted, for each of the at least one prediction result of the corresponding non-first character, generate a text to be inputted corresponding to each of the at least one prediction result based on a punctuation existence situation of each of the at least one character before the non-first character in the text to be predicted, the text to be predicted and the corresponding prediction result.
- The apparatus of claim 10, wherein the text to be inputted corresponding to each of the at least one prediction result of the non-first character comprises: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, and content represented by the corresponding prediction result;
wherein in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation;
wherein in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- The apparatus of claim 10, wherein the text to be inputted corresponding to each of the at least one prediction result of the non-first character comprises: the at least one character before the non-first character, content represented by the punctuation existence situation of each of the at least one character before the non-first character, the non-first character, content represented by the corresponding prediction result, and a preset number of characters after the non-first character in the text to be predicted;
wherein in response to that the corresponding prediction result represents that no punctuation follows the character, the content is empty; and in response to that the corresponding prediction result represents that a punctuation follows the character, the content is a type of the punctuation;
wherein in response to that the punctuation existence situation represents that no punctuation follows the character, the content is empty; and in response to that the punctuation existence situation represents that a punctuation follows the character, the content is a type of the punctuation.
- The apparatus of any one of claims 8 to 12, wherein the second determination module (140) is configured to: for each of the at least one prediction result of the character, obtain the first score and the second score corresponding to the corresponding prediction result; perform a weighted sum calculation on the first score and the second score to obtain a total score corresponding to the corresponding prediction result; and determine the punctuation existence situation of the character based on a prediction result with a smallest total score.
- The apparatus of any one of claims 8 to 13, wherein the punctuation processing module (150) is configured to: for each character in the text to be predicted, determine whether the punctuation follows the corresponding character based on the punctuation existence situation of the corresponding character; in response to that the punctuation follows the corresponding character, add the punctuation to follow the character in the text to be predicted based on a type of the punctuation; and obtain the punctuated text corresponding to the text to be predicted after all characters in the text to be predicted are processed.
- A computer-readable storage medium having a computer instruction stored thereon, wherein the computer instruction is configured to make a computer implement the method of any one of claims 1 to 7.
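For illustration only, the scoring and selection steps recited in the method claims may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `first_score`, `punctuate`, `toy_lm_score` and `TAG_PROBS`, the weights `w_tag` and `w_lm`, the use of the natural logarithm, and the toy data are all hypothetical. The sequence tagging model is replaced by a fixed table of probabilities, and the language model is replaced by a stand-in function whose score is assumed to be lower for more plausible text, which is consistent with selecting the prediction result with the smallest total score.

```python
import math
from typing import Callable, Dict, List, Optional

# A prediction result for one character: None means "no punctuation follows",
# otherwise the string is the punctuation type (for example "," or ".").
Prediction = Optional[str]


def first_score(probability: float) -> float:
    """Reciprocal, then logarithm, of the tagging probability: log(1/p).

    The natural logarithm is an assumption; the claims do not fix a base.
    """
    return math.log(1.0 / probability)


def punctuate(
    text: str,
    tag_probs: List[Dict[Prediction, float]],
    lm_score: Callable[[str], float],
    w_tag: float = 1.0,
    w_lm: float = 1.0,
    lookahead: int = 0,
) -> str:
    """Choose, character by character, the prediction with the smallest total score."""
    prefix = ""  # characters processed so far, with their chosen punctuation
    for i, ch in enumerate(text):
        best_total: Optional[float] = None
        best_pred: Prediction = None
        for pred, prob in tag_probs[i].items():
            # Text to be inputted: earlier characters and their punctuation,
            # the current character, the candidate punctuation (empty if None),
            # and optionally a preset number of following characters.
            candidate = prefix + ch + (pred or "") + text[i + 1 : i + 1 + lookahead]
            # Weighted sum of the first score and the second score.
            total = w_tag * first_score(prob) + w_lm * lm_score(candidate)
            if best_total is None or total < best_total:
                best_total, best_pred = total, pred
        prefix += ch + (best_pred or "")
    return prefix


def toy_lm_score(candidate: str) -> float:
    """Hypothetical stand-in for a language model; lower means more plausible."""
    reference = "ok,go."
    return 0.0 if reference.startswith(candidate) else 5.0


# Hypothetical tagging probabilities for the four characters of "okgo".
# On its own, the tagger would prefer no punctuation after "k".
TAG_PROBS: List[Dict[Prediction, float]] = [
    {None: 0.9, ",": 0.1},
    {None: 0.6, ",": 0.4},
    {None: 0.9, ",": 0.1},
    {None: 0.3, ".": 0.7},
]

print(punctuate("okgo", TAG_PROBS, toy_lm_score, lookahead=1))
```

In this toy run, the second score overrides the tagger after the character "k", because the candidate text without a comma is penalized by the stand-in language model once one following character is included in the text to be inputted.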
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010046714.1A CN111241810B (en) | 2020-01-16 | 2020-01-16 | Punctuation prediction method and punctuation prediction device |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3852013A1 (en) | 2021-07-21 |
Family
ID=70866149
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20215758.2A Withdrawn EP3852013A1 (en) | 2020-01-16 | 2020-12-18 | Method, apparatus, and storage medium for predicting punctuation in text |
Country Status (5)
Country | Link |
---|---|
US (1) | US11216615B2 (en) |
EP (1) | EP3852013A1 (en) |
JP (1) | JP7133002B2 (en) |
KR (1) | KR102630243B1 (en) |
CN (1) | CN111241810B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111414731B (en) * | 2020-02-28 | 2023-08-11 | 北京小米松果电子有限公司 | Text labeling method and device |
CN112685996B (en) * | 2020-12-23 | 2024-03-22 | 北京有竹居网络技术有限公司 | Text punctuation prediction method and device, readable medium and electronic equipment |
CN113378541B (en) * | 2021-05-21 | 2023-07-07 | 标贝(北京)科技有限公司 | Text punctuation prediction method, device, system and storage medium |
CN114528850B (en) * | 2022-02-16 | 2023-08-04 | 马上消费金融股份有限公司 | Punctuation prediction model training method, punctuation adding method and punctuation adding device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150095025A1 (en) * | 2008-09-25 | 2015-04-02 | Multimodal Technologies, Llc | Decoding-Time Prediction of Non-Verbalized Tokens |
CN108845682A (en) * | 2018-06-28 | 2018-11-20 | 北京金山安全软件有限公司 | Input prediction method and device |
US20190103091A1 (en) * | 2017-09-29 | 2019-04-04 | Baidu Online Network Technology (Beijing) Co., Ltd . | Method and apparatus for training text normalization model, method and apparatus for text normalization |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3232289B2 (en) * | 1999-08-30 | 2001-11-26 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Symbol insertion device and method |
CN104484322A (en) * | 2010-09-24 | 2015-04-01 | 新加坡国立大学 | Methods and systems for automated text correction |
JP5611270B2 (en) * | 2012-05-08 | 2014-10-22 | ヤフー株式会社 | Word dividing device and word dividing method |
CN104143331B (en) * | 2013-05-24 | 2015-12-09 | 腾讯科技(深圳)有限公司 | A kind of method and system adding punctuate |
CN106803422B (en) * | 2015-11-26 | 2020-05-12 | 中国科学院声学研究所 | Language model reestimation method based on long-time and short-time memory network |
EP4312147A3 (en) * | 2016-06-08 | 2024-03-27 | Google LLC | Scalable dynamic class language modeling |
CN108628813B (en) * | 2017-03-17 | 2022-09-23 | 北京搜狗科技发展有限公司 | Processing method and device for processing |
US10867595B2 (en) * | 2017-05-19 | 2020-12-15 | Baidu Usa Llc | Cold fusing sequence-to-sequence models with language models |
CN107767870B (en) * | 2017-09-29 | 2021-03-23 | 百度在线网络技术(北京)有限公司 | Punctuation mark adding method and device and computer equipment |
CN109255115B (en) * | 2018-10-19 | 2023-04-07 | 科大讯飞股份有限公司 | Text punctuation adjustment method and device |
CN109558576B (en) * | 2018-11-05 | 2023-05-23 | 中山大学 | Punctuation mark prediction method based on self-attention mechanism |
CN109858038B (en) * | 2019-03-01 | 2023-04-18 | 科大讯飞股份有限公司 | Text punctuation determination method and device |
CN110413987B (en) * | 2019-06-14 | 2023-05-30 | 平安科技(深圳)有限公司 | Punctuation mark prediction method based on multiple prediction models and related equipment |
CN110298042A (en) * | 2019-06-26 | 2019-10-01 | 四川长虹电器股份有限公司 | Based on Bilstm-crf and knowledge mapping video display entity recognition method |
CN110516253B (en) * | 2019-08-30 | 2023-08-25 | 思必驰科技股份有限公司 | Chinese spoken language semantic understanding method and system |
CN110688822A (en) * | 2019-09-27 | 2020-01-14 | 上海智臻智能网络科技股份有限公司 | Punctuation mark adding method, punctuation mark adding device and punctuation mark adding medium |
2020
- 2020-01-16 CN: CN202010046714.1A (CN111241810B), Active
- 2020-09-29 US: US17/036,561 (US11216615B2), Active
- 2020-12-18 EP: EP20215758.2A (EP3852013A1), Withdrawn
- 2020-12-24 JP: JP2020215550A (JP7133002B2), Active
2021
- 2021-01-14 KR: KR1020210005164A (KR102630243B1), IP Right Grant
Non-Patent Citations (1)
Title |
---|
WILLIAM GALE ET AL: "Experiments in Character-Level Neural Network Models for Punctuation", INTERSPEECH 2017, 1 January 2017 (2017-01-01), ISCA, pages 2794 - 2798, XP055762527, DOI: 10.21437/Interspeech.2017-1710 * |
Also Published As
Publication number | Publication date |
---|---|
KR102630243B1 (en) | 2024-01-25 |
CN111241810B (en) | 2023-08-01 |
JP2021114284A (en) | 2021-08-05 |
CN111241810A (en) | 2020-06-05 |
KR20210092692A (en) | 2021-07-26 |
US11216615B2 (en) | 2022-01-04 |
US20210224480A1 (en) | 2021-07-22 |
JP7133002B2 (en) | 2022-09-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11625539B2 (en) | Extracting trigger words and arguments from text to obtain an event extraction result | |
EP3923160A1 (en) | Method, apparatus, device and storage medium for training model | |
EP3822851A2 (en) | Method and apparatus for recognizing table, computer program product, computer-readable storage medium | |
EP3852013A1 (en) | Method, apparatus, and storage medium for predicting punctuation in text | |
US11403468B2 (en) | Method and apparatus for generating vector representation of text, and related computer device | |
CN111414482B (en) | Event argument extraction method and device and electronic equipment | |
EP3916612A1 (en) | Method and apparatus for training language model based on various word vectors, device, medium and computer program product | |
EP3852000A1 (en) | Method and apparatus for processing semantic description of text entity, device and storage medium | |
US11573992B2 (en) | Method, electronic device, and storage medium for generating relationship of events | |
US20210200813A1 (en) | Human-machine interaction method, electronic device, and storage medium | |
EP3916613A1 (en) | Method and apparatus for obtaining word vectors based on language model, device and storage medium | |
EP3885963A1 (en) | Method and apparatus for determining causality, electronic device and storage medium | |
US20210383233A1 (en) | Method, electronic device, and storage medium for distilling model | |
CN112926306B (en) | Text error correction method, device, equipment and storage medium | |
US11775766B2 (en) | Method and apparatus for improving model based on pre-trained semantic model | |
US11182648B2 (en) | End-to-end model training method and apparatus, and non-transitory computer-readable medium | |
EP3992774A1 (en) | Method and device for implementing dot product operation, electronic device, and storage medium | |
EP3855341A1 (en) | Language generation method and apparatus, electronic device and storage medium | |
EP3855339A1 (en) | Method and apparatus for generating text based on semantic representation | |
US11893977B2 (en) | Method for recognizing Chinese-English mixed speech, electronic device, and storage medium | |
CN115688796B (en) | Training method and device for pre-training model in natural language processing field | |
CN115437547A (en) | Method, device and equipment for inputting characters based on keyboard and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED |
AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
17P | Request for examination filed | Effective date: 20220114 |
RBV | Designated contracting states (corrected) | Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN |
18W | Application withdrawn | Effective date: 20240308 |