CN112580310A - Missing character/word completion method and electronic equipment - Google Patents

Missing character/word completion method and electronic equipment

Info

Publication number
CN112580310A
CN112580310A
Authority
CN
China
Prior art keywords
missing
word
sentence
words
language model
Prior art date
Legal status
Granted
Application number
CN202011582902.2A
Other languages
Chinese (zh)
Other versions
CN112580310B (en)
Inventor
王宝鑫
伍大勇
车万翔
王士进
胡国平
刘挺
Current Assignee
Zhongke Xunfei Internet Beijing Information Technology Co ltd
Hebei Xunfei Institute Of Artificial Intelligence
iFlytek Co Ltd
Original Assignee
Zhongke Xunfei Internet Beijing Information Technology Co ltd
Hebei Xunfei Institute Of Artificial Intelligence
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhongke Xunfei Internet Beijing Information Technology Co ltd, Hebei Xunfei Institute Of Artificial Intelligence, iFlytek Co Ltd filed Critical Zhongke Xunfei Internet Beijing Information Technology Co ltd
Priority to CN202011582902.2A priority Critical patent/CN112580310B/en
Publication of CN112580310A publication Critical patent/CN112580310A/en
Application granted granted Critical
Publication of CN112580310B publication Critical patent/CN112580310B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application provides a completion method for missing characters/words. The method identifies a missing position in a missing sentence using a language model, where the language model is obtained by pre-training with pseudo data as input and the missing sentence represents a sentence with a component-missing error; generates, with the language model, a plurality of candidate characters/words missing at the missing position; and ranks the candidate characters/words to determine the character/word missing at the missing position. The application also provides a corresponding electronic device. With this method, missing characters and words in text can be corrected and completed more quickly and accurately.

Description

Missing character/word completion method and electronic equipment
Technical Field
The disclosed embodiments of the present application relate to the field of information processing technologies, and more particularly, to a method for completing missing characters/words and an electronic device.
Background
Language beginners often omit characters or words when writing, and everyday writing also frequently contains missing characters or words caused by carelessness. Missing characters/words are a common type of grammatical error, and because correcting them requires supplying the missing content from the semantics of the original sentence, they are usually harder to correct than other error types such as wrong characters or word-order errors.
Disclosure of Invention
To address these problems, embodiments of the present application provide a method for completing missing characters/words and an electronic device.
According to a first aspect of the application, a completion method for missing characters/words is disclosed, which comprises: identifying a missing position in a missing sentence by using a language model, wherein the language model is obtained by pre-training with pseudo data as input, and the missing sentence represents a sentence with a component-missing error; generating, by using the language model, a plurality of candidate characters/words missing at the missing position; and ranking the candidate characters/words to determine the character/word missing at the missing position.
According to a second aspect of the present application, an electronic device is disclosed, comprising a processor and a memory, the memory storing instructions that, when executed, cause the processor to perform the method of completing missing characters/words as described in the first aspect.
The beneficial effects of this application are: the missing position in a missing sentence is identified by the language model, a plurality of candidate characters/words missing at the missing position are generated, and the candidate characters/words are ranked, so that the character/word missing at the missing position is determined and the problem of missing characters/words in text is corrected and completed quickly and accurately; moreover, the language model is obtained by pre-training with pseudo data as input, which alleviates the shortage of training data.
Drawings
The present application will be further described with reference to the accompanying drawings and embodiments, in which:
FIG. 1 is a flow chart of a completion method of an embodiment of the present application;
FIG. 2 is a block diagram of a basic model of a language model using a BERT model according to an embodiment of the present application;
FIG. 3 is a partial flowchart of ranking candidate characters/words in a completion method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 5 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
At present, missing-character/word completion methods generally adopt a Seq2Seq generation model: a sentence containing a missing-component error is input, and a completed correct sentence is generated automatically after training on labeled data. However, decoding with such a generative model is slow: a complete sentence must be generated from scratch, the prediction time grows with sentence length, and long sentences are therefore predicted slowly. In addition, because the generative model directly produces a correct sentence, it cannot conveniently report the position and type of the error, so error localization is not intuitive and the error position and the correction can only be found by manual comparison. For example, for the wrong sentence "a person has outweighed hunger, just trying to get a better, healthier thing for the next generation.", the generative approach directly gives the correct expression: "people struggle with hunger and can only strive to do better and healthier things for the next generation.", without giving an explicit error position or correction. Moreover, the generative approach demands a large amount of training data, while the currently available annotated corpora are insufficient and real data are costly to obtain.
Therefore, the application provides a completion method for missing characters/words that uses a language model to identify the missing position and determine the missing content, thereby quickly and accurately correcting and completing missing characters/words in text; the language model is obtained by pre-training with pseudo data as input, which alleviates the shortage of training data.
In order to make those skilled in the art better understand the technical solutions of the present application, the following detailed description is made with reference to the accompanying drawings and specific embodiments.
Please refer to FIG. 1, which is a flowchart illustrating a method for completing missing characters/words according to an embodiment of the present application. The method comprises the following steps:
step 110: and identifying the missing position in the missing sentence by using a language model.
The language model is obtained by pre-training with the pseudo data as input.
In some examples, the language model may be a BERT (Bidirectional Encoder Representations from Transformers) model. Specifically, the Transformer encoder structure corresponding to the BERT model is used as the base model, as shown in FIG. 2, which includes an embedding layer and an encoder layer: the input text is passed through input embedding, i.e., a linear transformation, to obtain word embeddings (word vectors), the word embeddings are added to the position vectors (position embeddings) produced by the positional encoding, and the sum is used as the input of the encoder layer. Those skilled in the art will understand that the encoder layer employs an attention mechanism and comprises multiple encoding steps, as shown in FIG. 2; for brevity, these are not described in detail here.
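For illustration only, the base model just described (word embeddings plus position embeddings feeding a stack of attention-based encoder layers, topped by a per-position vocabulary predictor) can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed sizes and names, not the patent's actual implementation:

```python
# Minimal PyTorch sketch of the base model in FIG. 2: token embeddings are added to
# position embeddings and fed into Transformer encoder layers; a linear head predicts
# a vocabulary distribution at every position. All sizes/names here are assumptions.
import torch
import torch.nn as nn

class MaskedLMEncoder(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, layers=12, heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # input embedding (word vectors)
        self.pos_emb = nn.Embedding(max_len, hidden)      # positional encoding
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.lm_head = nn.Linear(hidden, vocab_size)      # per-position vocabulary logits

    def forward(self, token_ids):                         # token_ids: [batch, seq_len]
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)  # word + position embedding
        return self.lm_head(self.encoder(x))              # attention-based encoder layers
```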
The pseudo data is data obtained by applying random operations to preset data; for example, a correct sentence is randomly modified, and the randomly modified sentence obtained in this way is the pseudo data.
In the process of obtaining the BERT model, the pseudo data is input into the model and pre-training is performed until the model's performance on the validation set converges, i.e., techniques such as early stopping are used to determine when the model performs best on the validation set, thereby obtaining the pre-trained model, namely the BERT model.
A missing sentence represents a sentence that contains a component-missing error. Component-missing errors mainly comprise two types: missing characters and missing words. The missing-character type means that a single character is missing from the sentence; for example, the sentence "people fight hunger" is missing one character between "fight" and "have". The missing-word type means that a whole word is missing from the sentence; for example, the sentence "people are hungry" is missing the word "overcome" between "people" and "have".
Step 120: generate, by using the language model, a plurality of candidate characters/words missing at the missing position.
To generate the plurality of candidate characters/words missing at the missing position with the language model, the missing sentence and the missing position are input into the language model for prediction, so as to obtain the candidate characters/words missing at the missing position.
Step 130: the plurality of candidate words/phrases are ranked to determine the missing word/phrase missing at the missing location.
The obtained candidate characters/words are ranked, and the candidate ranked first (or last, depending on the direction of the ordering) is determined to be the character/word missing at the missing position.
In this embodiment, a language model is used to identify the missing position in a missing sentence and to generate a plurality of candidate characters/words missing at that position, and the candidates are ranked so that the character/word missing at the missing position is determined; in this way the problem of missing characters/words in text is corrected and completed quickly and accurately, and because the language model is obtained by pre-training with pseudo data as input, the shortage of training data is alleviated.
As described above, the language model is used to identify the missing position in the missing sentence, yielding the identified missing sentence. In some embodiments, the identified missing sentence is characterized by a list that includes at least one content tuple and an ending tuple; each content tuple represents one character/word in the missing sentence, and the ending tuple represents the end of the missing sentence.
For example, the identified missing sentence 1, characterized by list 1, is of the form:
list 1: [ content tuple 1, content tuple 2, content tuple 3, content tuple 4, content tuple 5, end tuple ],
where content tuple 1, content tuple 2, content tuple 3, content tuple 4 and content tuple 5 each represent one character/word in missing sentence 1, and the ending tuple represents the end of missing sentence 1.
Each content tuple comprises a character/word and a first label, where the first label represents whether content is missing before that character/word and, if so, the number of missing characters/words. The ending tuple comprises an end character and a second label, where the second label represents whether content is missing before the end character and the number of missing characters/words. In other words, each content tuple includes a first label describing the missing content before the corresponding character/word, and the ending tuple includes a second label describing the missing content before the end character.
Continuing with missing sentence 1 as an example, content tuple 1 is (character/word 1, first label), content tuple 2 is (character/word 2, first label), ..., and the ending tuple is (end character, second label), where the first element of each content tuple is the corresponding character/word.
The value of the first label may be set to O or M, where O indicates that nothing is missing before the corresponding character/word and M indicates that something is missing before it. To further indicate how much is missing, the value of the first label may be set to Mx, where x is the number of missing characters/words, taking values 1, 2, 3, ..., with the maximum value of x determined by the actual missing sentence. For example, the label value M2 indicates that content is missing before the corresponding character/word and that 2 characters are missing.
The value of the second label may likewise be set to O or M, where O indicates that nothing is missing before the end character and M indicates that something is missing before it. To further indicate how much is missing, the value of the second label may be set to Mx, where x is the number of missing characters/words, taking values 1, 2, 3, ..., with the maximum value of x determined by the actual missing sentence. For example, the label value M2 indicates that content is missing before the end character and that 2 characters are missing.
In an example, a sequence labeling method may be employed to predict the tag value of the first tag or the second tag of each tuple.
The end symbol may be implemented using a special symbol or string of characters, which may be <EOS> in one example.
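As one illustration of how the first/second labels mentioned above could be predicted by sequence labeling, the following sketch uses a BERT token-classification head. The checkpoint name, the small label set {O, M1, M2, M3}, and the use of the tokenizer's [SEP] token to stand in for the <EOS> position are assumptions, and the classification head would first have to be fine-tuned (e.g., on the pseudo data described later):

```python
# Hedged sketch of step 110 cast as sequence labeling: predict an O / Mx label for every
# token (and for the end position). Checkpoint and label set are assumptions.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

LABELS = ["O", "M1", "M2", "M3"]        # "Mx" = x characters/words missing before this token
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=len(LABELS))
model.eval()

def label_missing_positions(sentence):
    # The [SEP] token appended by the tokenizer plays the role of the <EOS> tuple here.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                      # [1, seq_len, num_labels]
    predictions = logits.argmax(-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
    return list(zip(tokens, (LABELS[i] for i in predictions)))
```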
Continuing with the above example of missing sentence 1, list 1 is shown below:
List 1: [(character/word 1, label value 1), (character/word 2, label value 2), (character/word 3, label value 3), (character/word 4, label value 4), (character/word 5, label value 5), (<EOS>, label value 6)].
Taking an actual sentence as an example, the sentence "people are hungry" from above is expressed as follows after being identified by the language model:
List: [("people", "O"), ("have", "M2"), ("hunger", "O"), ("<EOS>", "O")] (the tokens are English glosses of the characters of the original Chinese sentence).
From this list it can be seen that in this sentence there is missing content before "have", that 2 characters are missing (i.e., "overcome"), and that nothing is missing before the other characters or the end character.
In other embodiments, each content tuple further includes a third label characterizing the part of speech of the corresponding character/word, and the ending tuple further includes a fourth label characterizing the part of speech of the end character.
Continuing with missing sentence 1 as an example, content tuple 1 is (character/word 1, third label, first label), content tuple 2 is (character/word 2, third label, first label), ..., and the ending tuple is (end character, fourth label, second label), where the first element of each content tuple is the corresponding character/word.
The values of the third label and the fourth label may be set according to the role of the corresponding character/word or end character in the whole sentence: for example, the value n indicates a noun, u indicates a particle (auxiliary word), v indicates a verb, and None indicates that there is no part of speech; this is not limited here.
Taking an actual sentence as an example, the sentence "people are hungry" from above is expressed as follows after being identified by the language model:
List: [("people", "n", "O"), ("have", "u", "M2"), ("hunger", "n", "O"), ("<EOS>", "None", "O")].
From this list it can be seen that there is missing content before the particle "have" in this sentence, that 2 characters are missing (i.e., "overcome"), and that nothing is missing before the other characters or the end character.
As described above, the language model is used to generate a plurality of candidate characters/words missing at the missing position: the missing sentence and the missing position are input into the language model for prediction, so that the candidate characters/words missing at that position can be obtained. Specifically, in some embodiments, at least one placeholder is first filled in at the missing position, where the number of placeholders corresponds to the number of missing characters/words, i.e., one placeholder per missing character/word. The filled missing sentence is then input into the language model, and the prediction information corresponding to each placeholder is predicted, thereby obtaining a plurality of candidate characters/words.
The placeholder may be a special symbol or string; in some examples, the placeholder may be [MASK].
The following description again uses the actual sentence "people are hungry". As described above, in this sentence there is missing content before "have", 2 characters are missing (i.e., "overcome"), and nothing is missing before the other characters or the end character; that is, the missing position is before "have". First, 2 placeholders, e.g., [MASK], are filled in at the missing position; after filling, the sentence is as follows:
people [MASK] [MASK] have hunger
Then, the filled sentence is input into a language model, such as the BERT model, and the prediction information corresponding to the two [MASK] placeholders in the sentence is predicted, so as to obtain a plurality of candidate characters/words, such as "overcome", "defeat", etc.
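One hedged way to realize this masked prediction is with an off-the-shelf masked language model; the checkpoint name below and the simple top-k readout per [MASK] are assumptions made for illustration, not the patent's exact procedure:

```python
# Hedged sketch of step 120: fill one [MASK] per missing character and read off the
# top-k candidates at each masked position. Checkpoint name is an assumption.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def candidates_at_missing_position(masked_sentence, top_k=5):
    # masked_sentence already contains one [MASK] per missing character, e.g. the
    # filled running example "people [MASK][MASK] have hunger" in its original language.
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                      # [1, seq_len, vocab_size]
    mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[:, 1]
    results = []
    for pos in mask_positions:
        top = logits[0, pos].softmax(dim=-1).topk(top_k)
        results.append([(tokenizer.decode([idx]), prob)      # one (candidate, probability) list
                        for idx, prob in zip(top.indices.tolist(), top.values.tolist())])
    return results
```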
In some embodiments, the at least one placeholder is two placeholders; e.g., in the actual sentence above, the missing position before "have" is filled with 2 placeholders.
In this case, the prediction information corresponding to each placeholder includes a word matrix and a probability vector, where the entries of the word matrix are the candidate characters/words for that placeholder and the values of the probability vector are the predicted probabilities of those candidates.
The following description again uses the actual sentence "people are hungry", with the filled sentence as described above.
After prediction, the prediction information corresponding to the first placeholder [MASK] in this sentence includes a word matrix W1 that may be [win, war, uniform, gram, ...] (English glosses of the candidate characters) and a probability vector x1 of (0.1, 0.4, 0.02, 0.47, 0.01); that is, the first placeholder [MASK] is predicted to be "win" with probability 0.1, "war" with probability 0.4, "uniform" with probability 0.02, "gram" with probability 0.47, and some other character with probability 0.01.
The prediction information corresponding to the second placeholder [MASK] in this sentence includes a word matrix W2 that may be [uniform, war, win, gram, ...] and a probability vector x2 of (0.4, 0.02, 0.5, 0.07, 0.01); that is, the second placeholder [MASK] is predicted to be "uniform" with probability 0.4, "war" with probability 0.02, "win" with probability 0.5, "gram" with probability 0.07, and some other character with probability 0.01.
In embodiments where the at least one placeholder is two placeholders, the language model is used as follows to generate the plurality of candidate characters/words missing at the missing position. Specifically, in some embodiments, each placeholder is first replaced with a predictor, where the predictor is the product of the word matrix and the probability vector in the prediction information corresponding to that placeholder. The replaced missing sentence is then input into the language model to obtain the language model's output, and the language model's output together with the replaced missing sentence is input into a recurrent neural network, which predicts the characters missing at the missing position, thereby obtaining a plurality of candidate characters/words.
The predictor may be a special symbol or string; in some examples, the predictor may be [SOFT], e.g., [SOFT1] representing the first predictor and [SOFT2] representing the second. The predictor characterizes the product of the word matrix and the probability vector in the prediction information corresponding to its placeholder, i.e., a probability-weighted combination of the candidates.
The output of the language model and the replaced missing sentence may be input to a Long Short-Term Memory (LSTM) model. By storing the value of the hidden layer at each time step for use at the next one, the LSTM model ensures that each step retains the information of the previous step. The input at each time step of the LSTM model is the predictor together with the word representation predicted at the corresponding position, where that word representation is the output obtained by passing the replaced missing sentence through the language model. At the positions of the two predictors the LSTM acts as a generation model, so a plurality of candidate characters/words can be obtained by adopting a beam search method.
In this embodiment, the two placeholders are replaced by predictors and the replaced missing sentence is input to the language model to obtain the language model's output; the language model's output and the replaced missing sentence are then input to the recurrent neural network. This avoids the multi-modality problem that may occur when the language model predicts the two placeholders directly and independently.
The following description continues with the sentence "people are hungry" as an example; the sentence is filled as described above, and after prediction the prediction information corresponding to each placeholder is as described above. First, the first placeholder [MASK] is replaced by the predictor [SOFT1] and the second placeholder [MASK] by the predictor [SOFT2], where [SOFT1] = W1 * x1 and [SOFT2] = W2 * x2; the replaced sentence is as follows:
people [SOFT1] [SOFT2] have hunger
The replaced sentence is then input into a language model, e.g., the BERT model, yielding the output of the language model, i.e., a representation of the word predicted at each position; the output of the language model and the replaced sentence are input into a recurrent neural network, e.g., an LSTM model. In the LSTM model, for example, the input at the first time step is the predictor [SOFT1] together with the preceding context ("people"), and the output is "war"; the predictor [SOFT2] and "war" are then fed to the second position, and conditioned on "war" from the previous step, the character at [SOFT2] is predicted to be "win", i.e., the content missing at the two predictor positions is predicted to be "war" + "win". By analogy, with different inputs other completions of the missing position (i.e., the two predictors) can also be predicted, resulting in multiple candidate characters/words such as "defeat", "overcome", etc. This avoids the situation in which the BERT model, predicting the two positions directly and independently (which, given the probabilities above, would pick "gram" and "win", an incoherent combination), produces the wrong characters; that is, the multi-modality problem that may occur when the two placeholders are predicted directly together is solved.
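The soft-predictor plus LSTM step can be sketched as below. This is only an illustrative reading of the description, under several assumptions: the predictor is formed as the probability-weighted mixture of the candidates' embeddings (the product of the word matrix and the probability vector), the contextual vectors come from running the language model over the sentence with the predictors spliced in, and the decoder here picks greedily at each step whereas the text uses beam search to keep several candidate completions:

```python
# Hedged sketch of the predictor-replacement step: a soft embedding per placeholder,
# then a small LSTM that decodes the missing characters left to right. Greedy decoding
# stands in for the beam search mentioned in the text; all names are illustrative.
import torch
import torch.nn as nn

def soft_embed(embedding_layer, candidate_ids, candidate_probs):
    # [SOFT] = word matrix x probability vector: a probability-weighted mixture
    # of the candidate characters' embedding vectors.
    vectors = embedding_layer(candidate_ids)                 # [k, hidden]
    return (candidate_probs.unsqueeze(-1) * vectors).sum(0)  # [hidden]

class MissingSpanDecoder(nn.Module):
    def __init__(self, hidden=768, vocab_size=21128):
        super().__init__()
        self.lstm = nn.LSTMCell(2 * hidden, hidden)          # input: [soft vector ; LM output vector]
        self.out = nn.Linear(hidden, vocab_size)

    def decode(self, soft_vectors, context_vectors):
        # One (soft vector, language-model output vector) pair per predictor position.
        h = torch.zeros(1, context_vectors[0].size(-1))
        c = torch.zeros_like(h)
        predicted_ids = []
        for soft, ctx in zip(soft_vectors, context_vectors):
            h, c = self.lstm(torch.cat([soft, ctx]).unsqueeze(0), (h, c))
            predicted_ids.append(self.out(h).argmax(-1).item())  # greedy; beam search in practice
        return predicted_ids
```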
As described above, in step 130 the plurality of candidate characters/words are ranked to determine the character/word missing at the missing position. In some embodiments, the ranking may be based on a score for each candidate: a scoring operation is performed on the predicted candidates, which are then ranked by their scores. Specifically, as shown in FIG. 3, step 130 may include:
step S131: and filling each candidate character/word into the missing position in the missing sentence to form a complete sentence.
Continuing with the sentence "people are hungry" from above: as mentioned, after passing through the language model and the recurrent neural network, a plurality of candidate characters/words are predicted, such as "overcome", "defeat", and the like. Filling these candidates into the sentence "people are hungry" yields complete sentence 1, "people have overcome hunger", complete sentence 2, "people have defeated hunger", and so on.
Step S132: using a language model, a first score for each complete sentence is determined.
The first score is characterized by the sum of the probability values of each character in the complete sentence, where the probability value of each character is obtained by the pre-trained language model predicting on the sentence with the candidate character/word filled in.
Using a language model, such as the BERT language model, a first score is determined for each complete sentence; it is characterized by the sum of the probability values of each character in the complete sentence, and is specifically calculated as follows:
score_BERT = log(P(w_1)) + ... + log(P(w_n))
where P(w_n) represents the probability value of the corresponding character, e.g., P(w_1) represents the probability of the first character in the complete sentence, and log(P(w_n)) is the logarithm of that probability. It can be seen that the first score is the sum of the log-probability values of every character in the complete sentence, i.e., it is characterized by the sum of the probability values of each character in the complete sentence.
The probability value of each character is obtained by the pre-trained language model, such as the BERT model, predicting on the sentence with the candidate characters/words filled in.
Taking the sentence "people are hungry" from above as an example: as described above, the sentence is processed to obtain complete sentence 1, complete sentence 2, etc. For example, the first score of complete sentence 1 is score_BERT = log(P(w_1)) + log(P(w_2)) + log(P(w_3)) + log(P(w_4)) + log(P(w_5)) + log(P(w_6)) + log(P(w_7)), where P(w_1) ... P(w_7) are respectively the probabilities of the seven characters of complete sentence 1 ("people have overcome hunger"). Similarly, the first score of complete sentence 2 and so on can be obtained.
Step S133: using another language model, a second score for each complete sentence is determined.
The other language model is a model trained at the 5-gram character/word level, and the second score is a probability value obtained by applying this other language model to the complete sentence corresponding to the missing sentence.
The other language model is trained with 5-grams at the character/word level as the training object, where the 5-gram level means that 5 consecutive characters/words of a sentence are treated as one unit during training; for example, the language model may be an N-gram model. A second score is determined for each complete sentence: specifically, the model scores the complete sentence corresponding to the missing sentence to obtain the second score, which is calculated as follows:
score_ngram = log(P(w_1)) + log(P(w_2 | w_1)) + ... + log(P(w_n | w_{n-4}, w_{n-3}, w_{n-2}, w_{n-1}))
where, because the 5-gram level is the training object, each term conditions on at most the 4 preceding characters: P(w_1) is the probability value of the 1st character, P(w_2 | w_1) is the conditional probability value of the 2nd character given the 1st, and P(w_n | w_{n-4}, w_{n-3}, w_{n-2}, w_{n-1}) is the conditional probability value of a character given the 4 characters before it (e.g., of the 5th character given the 1st through 4th). These probability values are obtained by training the N-gram model.
Taking the sentence "people are hungry" from above as an example: as described above, the above operations yield complete sentence 1, complete sentence 2, and so on, and, e.g., the second score of complete sentence 1 can be obtained with the above formula.
Step S134: obtain the score corresponding to each candidate character/word from the first score and the second score, and rank the candidates according to their scores.
The score corresponding to each candidate character/word is obtained from the first score and the second score, e.g., by performing an operation on them such as addition, averaging, or weighted averaging; the candidates are then ranked according to the resulting scores.
In one example, the first score and the second score are combined by a weighted average; for example, the first score has weight α and the second score has weight (1 - α), where α is a hyperparameter between 0 and 1. Specifically, the score of a complete sentence is calculated as follows:
score = α * score_BERT + (1 - α) * score_ngram
in some examples, a may be 0.5.
Through these steps, a score can be obtained for each complete sentence, and the candidate character/word of the complete sentence with the highest score is used as the character/word that completes the missing sentence. For example, in the example of the sentence "people are hungry" above, the score of complete sentence 1 is found to be larger than that of complete sentence 2, so "overcome" is the word required to complete the missing sentence "people are hungry".
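As a compact illustration of this scoring and ranking step, the sketch below combines a per-character log-probability score from the pre-trained language model with a 5-gram log-probability score using the weight α; the two probability functions are placeholders for whatever BERT-style and N-gram models are actually used, so they are assumptions, not a prescribed API:

```python
# Hedged sketch of steps S131-S134: fill in each candidate, score the completed
# sentence with score = alpha * score_BERT + (1 - alpha) * score_ngram, and keep
# the best. lm_char_prob and ngram_cond_prob are assumed helper functions.
import math

def sentence_score(sentence, lm_char_prob, ngram_cond_prob, alpha=0.5):
    # lm_char_prob(sentence, i)    -> P(w_i) from the pre-trained language model
    # ngram_cond_prob(sentence, i) -> P(w_i | up to 4 preceding characters) from the 5-gram model
    score_bert = sum(math.log(lm_char_prob(sentence, i)) for i in range(len(sentence)))
    score_ngram = sum(math.log(ngram_cond_prob(sentence, i)) for i in range(len(sentence)))
    return alpha * score_bert + (1 - alpha) * score_ngram

def best_completion(sentence_template, candidates, lm_char_prob, ngram_cond_prob):
    # sentence_template: the missing sentence with a "{}" slot at the missing position.
    scored = {cand: sentence_score(sentence_template.format(cand), lm_char_prob, ngram_cond_prob)
              for cand in candidates}
    return max(scored, key=scored.get), scored               # highest-scoring candidate wins
```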
As described above, the language model is a model obtained by pre-training with pseudo data as input. The following description takes the BERT model as an example.
In some embodiments, generating the pseudo data comprises: generating first random numbers uniformly distributed over a preset interval and using them to randomly modify each correct sentence in a preset set, so as to obtain the pseudo data of each correct sentence. The preset set is pre-collected correct text comprising a plurality of correct sentences; in some examples, the preset set may be a set built from Wikipedia text and news text.
The pseudo data of each correct sentence is characterized by a preset list comprising at least one 2-tuple and/or at least one 3-tuple: each 2-tuple contains a character/word and a first label value indicating that nothing is missing before it, and each 3-tuple contains a character/word, a second label value indicating that content is missing before it together with the number of missing characters/words, and the characters/words that are missing before it.
The first label value and the second label value are set in the same way as the label values of the first label in the identified missing sentence of the above embodiment; see the detailed description of that embodiment, which for brevity is not repeated here.
The correct sentence "people have overcome hunger" is used as an example; its pseudo data may be "people are hungry", "people have wared hunger" (the whole word or one of its characters deleted), and so on.
The pseudo data "people are hungry" can be characterized by a preset list, as follows:
Preset list 1: [("people", "O"), ("have", "M2", "overcome"), ("hunger", "O"), ("<EOS>", "O")]
Where ("human", "O"), ("hunger", "O"), ("< EOS >", "O") are tuples, respectively, including the corresponding word and a first tag value "O" characterizing the word not missing before it, ("up", "M2", "win") are triples, including the "word," a second tag value "M2" characterizing the "number of deletions and deletions before the word, and" the word missing before the word "win".
In this embodiment, first random numbers uniformly distributed over the preset interval are generated to randomly modify each correct sentence in the preset set, so as to obtain the pseudo data of each correct sentence; the generated pseudo data is then input into a language model such as the BERT model for pre-training. This greatly speeds up convergence, keeps the number of parameters that must be learned again small, and alleviates the problem of insufficient training data.
Further, in some embodiments, each correct sentence in the preset set may be randomly modified as follows: when the first random number is smaller than a first preset value, corresponding missing pseudo data is generated for the correct sentence; when the first random number is greater than or equal to the first preset value, the correct sentence is kept unchanged.
The missing pseudo data is generated by a character-missing or word-missing operation on the correct sentence; for example, "people are hungry" and "people fight hunger" are both missing pseudo data of the correct sentence "people have overcome hunger".
The first random number is a random number r1 uniformly distributed between 0 and 1. If the first preset value is 0.05, then when r1 < 0.05, missing pseudo data is generated and the correct sentence is randomly modified, i.e., a character-missing operation or a word-missing operation is executed; when r1 >= 0.05, the correct sentence is kept unchanged.
Further, in some embodiments, generating the corresponding missing pseudo data for the correct sentence comprises: randomly generating a second random number uniformly distributed over the preset interval, and executing, according to it, either the character-missing operation or the word-missing operation on the correct sentence.
If the second random number is smaller than a second preset value, the character-missing operation is executed on the correct sentence; if the second random number is greater than or equal to the second preset value, the word-missing operation is executed on the correct sentence.
The second random number is a probability r2 randomly drawn from a uniform distribution on (0, 1). Assuming the second preset value is 0.7, a character-missing operation is performed when r2 < 0.7, and a word-missing operation is performed when r2 >= 0.7.
The character-missing operation proceeds as follows: for each character in the correct sentence, a first probability value (i.e., a first random number) uniformly distributed over the preset interval (i.e., 0-1) is randomly generated, and the corresponding character is deleted from the sentence if its first probability value is smaller than the first preset value, e.g., 0.05.
Take the correct sentence "people have overcome hunger" as an example: the first probability values randomly generated for its seven characters are 0.2, 0.5, 0.1, 0.8, 0.33, 0.6 and 0.02 respectively, and the first preset value is 0.05. Since the probability of the last character (the second character of "hunger") is 0.02 < 0.05, that character is deleted, and the corresponding missing pseudo data (the sentence with the final character of "hunger" removed) is generated.
The word-missing operation proceeds as follows: for each word in the correct sentence, a second probability value (i.e., a first random number) uniformly distributed over the preset interval (i.e., 0-1) is randomly generated, and the corresponding word is deleted from the sentence if its second probability value is smaller than the first preset value, e.g., 0.05.
Take the correct sentence "people have overcome hunger" as an example: for the words "people", "overcome", "have" and "hunger", the randomly generated second probability values are 0.2, 0.01, 0.5 and 0.3 respectively, and the first preset value is 0.05. Since the probability of the word "overcome" is 0.01 < 0.05, "overcome" is deleted, and the corresponding missing pseudo data "people are hungry" is generated.
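For illustration, the whole pseudo-data generation procedure above can be sketched as follows, using the thresholds from the running example (0.05 for deciding whether to corrupt a sentence and for each per-token deletion, 0.7 for choosing between character-level and word-level deletion); the word segmenter is a placeholder, and the mapping of the 0.7 threshold to character-level deletion follows the reading above:

```python
# Hedged sketch of pseudo-data generation: with a small probability, corrupt a correct
# sentence by randomly deleting characters or whole words; otherwise keep it unchanged.
# segment_words is an assumed word-segmentation helper; thresholds are the example values.
import random

def make_pseudo_sentence(correct_sentence, segment_words,
                         corrupt_threshold=0.05, char_op_threshold=0.7, delete_threshold=0.05):
    if random.random() >= corrupt_threshold:                  # first random number r1
        return correct_sentence                               # keep the correct sentence unchanged
    if random.random() < char_op_threshold:                   # second random number r2: character-missing
        kept = [ch for ch in correct_sentence if random.random() >= delete_threshold]
    else:                                                     # word-missing operation
        kept = [w for w in segment_words(correct_sentence) if random.random() >= delete_threshold]
    return "".join(kept)
```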
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a memory 210 and a processor 220, the memory 210 and the processor 220 being interconnected.
Memory 210 may include read-only memory and/or random access memory, etc., and provides instructions and data to processor 220. A portion of memory 210 may also include non-volatile random access memory (NVRAM). The memory 210 stores instructions that, when executed by the processor 220, implement the completion method provided by any of the above embodiments of the present application, or any non-conflicting combination thereof.
The processor 220 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated hardware logic circuits in the processor 220 or by instructions in the form of software. The processor 220 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or execute the various methods, steps, and logic blocks disclosed in the embodiments of the present application.
The present application also provides an embodiment of a non-volatile storage medium. As shown in FIG. 5, the non-volatile storage medium 500 stores instructions 501 executable by a processor, and the instructions 501 are used to execute the method of the above embodiments. In particular, the storage medium 500 may be embodied as the memory 210 shown in FIG. 4 or as a part of the memory 210.
It will be apparent to those skilled in the art that many modifications and variations can be made in the devices and methods while maintaining the teachings of the present application. Accordingly, the above disclosure should be considered limited only by the scope of the following claims.

Claims (11)

1. A completion method for missing characters/words is characterized by comprising the following steps:
identifying a missing position in a missing sentence by using a language model, wherein the language model is a model obtained by pre-training by taking pseudo data as input, and the missing sentence represents a sentence with component missing errors;
generating a plurality of candidate characters/words missing at the missing position by using the language model; and
ranking the plurality of candidate words/terms to determine the missing word/term missing at the missing location.
2. The completion method according to claim 1, wherein said missing sentence after identification is characterized by a list, said list comprising at least one content tuple and an ending tuple, each said content tuple characterizing a character/word in said missing sentence, said ending tuple characterizing the end of said missing sentence;
wherein each of the content tuples comprises:
each character/word; and
the first label is used for representing whether each character/word is missing or not and the number of the missing characters/words;
the end tuple includes:
an end symbol; and
and the second label is used for representing whether the ending character is missing or not and the number of the missing characters/words.
3. The completion method as claimed in claim 2, wherein each of said content tuples further comprises a third tag characterizing a part of speech of said each word/phrase;
the end tuple further comprises a fourth label characterizing a part of speech of the end character.
4. The completion method of claim 2, wherein said generating a plurality of candidate words/terms missing at said missing location using said language model comprises:
populating at least one placeholder at the missing location, wherein a number of the placeholders corresponds to a number of missing words/phrases;
and inputting the filled missing sentence into the language model, and predicting the prediction information corresponding to the at least one placeholder so as to obtain a plurality of candidate words.
5. The completion method of claim 4, wherein the at least one placeholder is two placeholders;
the prediction information corresponding to each placeholder comprises a word matrix and a probability vector, wherein words in the word matrix represent words corresponding to the placeholder, and a probability value in the probability vector represents the prediction probability of the words corresponding to the placeholder;
the generating the plurality of candidate words/words missing at the missing position with the language model comprises:
replacing each placeholder with a predictor, wherein the predictor characterizes a product of the word matrix and the probability vector in the prediction information corresponding to each placeholder;
and inputting the replaced missing sentence into the language model to obtain the output of the language model, inputting the output of the language model and the replaced missing sentence into a recurrent neural network, and predicting the missing word at the missing position to obtain a plurality of candidate words.
6. The completion method of claim 1, wherein said ranking said plurality of candidate words comprises:
filling each candidate character/word to the missing position in the missing sentence to form a complete sentence;
determining a first score of each complete sentence by using the language model, wherein the first score is characterized by the sum of probability values of each word in the complete sentence, and the probability value of each word in the complete sentence is obtained by predicting the filled candidate word/word by using the language model after pre-training;
determining a second score of each complete sentence by using another language model, wherein the another language model is a model taking the word level of a quintuple as a training object, and the second score is a probability value obtained by training the missing sentence and the corresponding complete sentence by using the another language model; and
and obtaining the score corresponding to each candidate word/word according to the first score and the second score, and sequencing according to the score corresponding to each candidate word/word.
7. The completion method according to any one of claims 1 to 6, wherein the generation of the pseudo data includes:
generating first random numbers uniformly distributed in a preset interval to randomly modify each correct statement in a preset set so as to obtain the pseudo data of each correct statement, wherein the preset set is a pre-collected correct text comprising a plurality of correct statements;
the pseudo data of each correct sentence is characterized by a preset list, the preset list comprises at least one binary group and/or at least one triple, each binary group comprises a word and a first label value for representing the word which is not missed before, and each triple comprises a word, a second label value for representing the word which is missed before and the number of missed words, and the missed words/words which are missed before the word.
8. The completion method of claim 7, wherein said randomly modifying each correct statement in the predetermined set comprises:
when the first random number is smaller than a first preset value, generating corresponding missing pseudo data for the correct statement;
when the first random number is larger than or equal to the first preset value, keeping the correct statement unchanged.
9. The completion method of claim 8, wherein said generating corresponding missing pseudo data for said correct statement comprises:
randomly generating second random numbers uniformly distributed in the preset interval so as to execute a character missing operation or a word missing operation on the correct sentence;
if the second random number is smaller than a second preset value, the character missing operation is executed on the correct statement, and if the second random number is larger than or equal to the second preset value, the word missing operation is executed on the correct statement.
10. An electronic device comprising a processor and a memory, the memory storing instructions that, when executed, cause the processor to perform the method of completing a missing character/word as claimed in any one of claims 1-9.
11. A non-transitory computer storage medium having stored thereon instructions that, when executed, cause a processor to perform the method of completing a missing character/word of any of claims 1-9.
CN202011582902.2A 2020-12-28 2020-12-28 Missing character/word completion method and electronic equipment Active CN112580310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011582902.2A CN112580310B (en) 2020-12-28 2020-12-28 Missing character/word completion method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011582902.2A CN112580310B (en) 2020-12-28 2020-12-28 Missing character/word completion method and electronic equipment

Publications (2)

Publication Number Publication Date
CN112580310A true CN112580310A (en) 2021-03-30
CN112580310B CN112580310B (en) 2023-04-18

Family

ID=75140266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011582902.2A Active CN112580310B (en) 2020-12-28 2020-12-28 Missing character/word completion method and electronic equipment

Country Status (1)

Country Link
CN (1) CN112580310B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434632A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Text completion method, device, equipment and storage medium based on language model
CN117056859A (en) * 2023-08-15 2023-11-14 丁杨 Method for complementing missing characters in cultural relics
US11880650B1 (en) * 2020-10-26 2024-01-23 Ironclad, Inc. Smart detection of and templates for contract edits in a workflow
CN117056859B (en) * 2023-08-15 2024-05-10 丁杨 Method for complementing missing characters in cultural relics

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169366A (en) * 2016-03-08 2017-09-15 环达电脑(上海)有限公司 The guard method of smart machine personal data information safety
CN108124477A (en) * 2015-02-02 2018-06-05 微软技术授权有限责任公司 Segmenter is improved based on pseudo- data to handle natural language
CN108334487A (en) * 2017-07-14 2018-07-27 腾讯科技(深圳)有限公司 Lack semantics information complementing method, device, computer equipment and storage medium
CN108960409A (en) * 2018-06-13 2018-12-07 南昌黑鲨科技有限公司 Labeled data generation method, equipment and computer readable storage medium
CN109726389A (en) * 2018-11-13 2019-05-07 北京邮电大学 A kind of Chinese missing pronoun complementing method based on common sense and reasoning
CN110413658A (en) * 2019-07-23 2019-11-05 中经柏诚科技(北京)有限责任公司 A kind of chain of evidence construction method based on the fact the correlation rule
CN111639489A (en) * 2020-05-15 2020-09-08 民生科技有限责任公司 Chinese text error correction system, method, device and computer readable storage medium
CN111708882A (en) * 2020-05-29 2020-09-25 西安理工大学 Transformer-based Chinese text information missing completion method
CN111738018A (en) * 2020-06-24 2020-10-02 深圳前海微众银行股份有限公司 Intention understanding method, device, equipment and storage medium



Also Published As

Publication number Publication date
CN112580310B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110232923B (en) Voice control instruction generation method and device and electronic equipment
CN112580310B (en) Missing character/word completion method and electronic equipment
US11232263B2 (en) Generating summary content using supervised sentential extractive summarization
JP2006031295A (en) Apparatus and method for estimating word boundary probability, apparatus and method for constructing probabilistic language model, apparatus and method for kana-kanji conversion, and method for constructing unknown word model
CN111046652A (en) Text error correction method, text error correction device, storage medium, and electronic apparatus
CN109408813B (en) Text correction method and device
JP2008216341A (en) Error-trend learning speech recognition device and computer program
Ali et al. Genetic approach for Arabic part of speech tagging
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
US20240028893A1 (en) Generating neural network outputs using insertion commands
CN112101032A (en) Named entity identification and error correction method based on self-distillation
US20230205994A1 (en) Performing machine learning tasks using instruction-tuned neural networks
CN114144774A (en) Question-answering system
CN114239589A (en) Robustness evaluation method and device of semantic understanding model and computer equipment
WO2021159803A1 (en) Text summary generation method and apparatus, and computer device and readable storage medium
CN114091448A (en) Text countermeasure sample generation method, system, computer device and storage medium
CN113177405A (en) Method, device and equipment for correcting data errors based on BERT and storage medium
CN112527967A (en) Text matching method, device, terminal and storage medium
WO2013191662A1 (en) Method for correcting grammatical errors of an input sentence
CN113486169B (en) Synonymous statement generation method, device, equipment and storage medium based on BERT model
US20230196016A1 (en) Automated text amendment based on additional domain text and control text
US8977538B2 (en) Constructing and analyzing a word graph
CN112380845B (en) Sentence noise design method, equipment and computer storage medium
CN112016281B (en) Method and device for generating wrong medical text and storage medium
CN116579327B (en) Text error correction model training method, text error correction method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 065000 608-609, Xinya R & D building, No.106, No.1 Road, Langfang Economic and Technological Development Zone, Hebei Province

Applicant after: Hebei Xunfei Institute of Artificial Intelligence

Applicant after: iFLYTEK (Beijing) Co.,Ltd.

Applicant after: IFLYTEK Co.,Ltd.

Address before: 065000 608-609, Xinya R & D building, No.106, No.1 Road, Langfang Economic and Technological Development Zone, Hebei Province

Applicant before: Hebei Xunfei Institute of Artificial Intelligence

Applicant before: Zhongke Xunfei Internet (Beijing) Information Technology Co.,Ltd.

Applicant before: IFLYTEK Co.,Ltd.

GR01 Patent grant
GR01 Patent grant