WO2021164310A1 - Text error correction method, apparatus, terminal device, and computer storage medium - Google Patents

Text error correction method, apparatus, terminal device, and computer storage medium

Info

Publication number
WO2021164310A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
word vector
word
input
error correction
Prior art date
Application number
PCT/CN2020/125219
Other languages
English (en)
French (fr)
Inventor
姚林霞
孟函可
祝官文
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2021164310A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/194 - Calculation of difference between files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • This application belongs to the field of artificial intelligence technology, and in particular relates to a text error correction method, device, terminal equipment, and computer storage medium.
  • the encoder-decoder model is usually used to implement text processing functions such as text error correction, text translation, document extraction, and question and answer systems.
  • an encoder and a decoder are provided.
  • the user can input the text to be corrected into the encoder of the encoder-decoder model.
  • the encoder converts the text input by the user into a semantic vector, and then the encoder transmits the semantic vector to the decoder of the encoder-decoder model; the semantic vector is decoded by the decoder, and the error-corrected text is obtained and output to the user.
  • the embodiments of the present application provide a text error correction method, device, terminal device, and computer storage medium, which can solve the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
  • the first aspect of the embodiments of the present application provides a text error correction method, including:
  • the terminal device performs word vector conversion on the input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes the input word vector corresponding to each word in the input text;
  • the terminal device inputs the word vector sequence into an encoder of an encoder-decoder model to obtain a semantic vector
  • the terminal device inputs the word vector sequence into an error correction judgment model to obtain an error correction label corresponding to each input word vector;
  • the terminal device inputs the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into the decoder of the encoder-decoder model to obtain error-corrected text.
  • the terminal device first inputs the word vector sequence into the error correction judgment model for error correction judgment, and obtains the error correction label corresponding to each word in the input text.
  • the error correction tag is used to indicate whether each word in the input text needs to be corrected.
  • the decoder can perform targeted decoding according to the error correction tags of each word in the input text, and regulate the decoding process, thereby reducing the misjudgment of the decoder and improving the accuracy of text error correction.
  • the terminal device inputting the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into the decoder of the encoder-decoder model to obtain the error-corrected text includes:
  • the terminal device sequentially inputs the input word vectors in the word vector sequence into the decoder of the encoder-decoder model
  • after each input word vector is input to the decoder, the terminal device calculates, according to the input word vector and the second hidden layer vector corresponding to the input word vector, the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector, where the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
  • if the error correction label corresponding to the input word vector is the first label, the terminal device controls the decoder to use the word corresponding to the input word vector as the decoded word corresponding to the input word vector, wherein the error correction label includes a first label and a second label;
  • if the error correction label corresponding to the input word vector is the second label, the terminal device constructs a first vector according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector;
  • the terminal device calculates the similarity between the first vector and the second vector corresponding to each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary, and determines the decoded word corresponding to the input word vector according to the first similarity;
  • the terminal device determines the error-corrected text according to the decoded word corresponding to each input word vector.
  • Since the decoder uses similarity comparison for decoding, the computational complexity of decoding, the loss of system performance, and the processing time can all be reduced.
  • the terminal device determining the decoded word corresponding to the input word vector according to the first similarity includes:
  • the terminal device uses the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • the terminal device can directly use the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector, thereby reducing the complexity of the decoding calculation.
  • the input word vector includes a pinyin word vector and a character shape word vector
  • the terminal device determining the decoded word corresponding to the input word vector according to the first similarity includes:
  • the terminal device performs similarity calculation on the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain the pinyin similarity corresponding to each word in the preset dictionary;
  • the terminal device performs similarity calculation on the character shape word vector in the input word vector and the character shape word vector corresponding to each word in the preset dictionary to obtain the character shape similarity corresponding to each word in the preset dictionary;
  • the terminal device calculates the edit distance between the word corresponding to the input word vector and each word in the preset dictionary to obtain the edit distance corresponding to each word in the preset dictionary;
  • the terminal device respectively performs a weighted summation of the first similarity, pinyin similarity, character shape similarity, and edit distance corresponding to each word in the preset dictionary to obtain the target similarity corresponding to each word in the preset dictionary;
  • the terminal device uses the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • When the terminal device needs to improve the accuracy of decoding, it can combine the first similarity with domain knowledge such as pinyin similarity, character shape similarity, and edit distance for a comprehensive evaluation to obtain the target similarity, as sketched below.
  • The terminal device then uses the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector, so as to improve the accuracy of the decoder's decoding.
  • the error correction judgment model includes a bidirectional encoding representation model and a binary classifier;
  • the terminal device inputting the word vector sequence into the error correction judgment model to obtain the error correction label corresponding to each input word vector includes:
  • the terminal device sequentially inputs each input word vector in the word vector sequence into the bidirectional encoding representation model to obtain a first output value corresponding to each input word vector;
  • the terminal device respectively inputs the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
  • The bidirectional encoding representation model has the advantages of high accuracy, convenient use, and fast tuning speed.
  • Using the bidirectional encoding representation model and the binary classifier can reduce the difficulty of constructing and training the error correction judgment model.
  • the second aspect of the embodiments of the present application provides a text error correction device, including:
  • the embedding module is used to perform word vector conversion on the input text to obtain a word vector sequence corresponding to the input text, wherein the word vector sequence includes the input word vector corresponding to each word in the input text;
  • the semantic module is used to input the word vector sequence into the encoder of the encoder-decoder model to obtain the semantic vector;
  • the label module is used to input the word vector sequence into the error correction judgment model to obtain the error correction label corresponding to each input word vector;
  • the error correction module is used to input the word vector sequence, the semantic vector and the error correction label corresponding to each input word vector into the decoder of the encoder-decoder model to obtain the error-corrected text.
  • the error correction module includes:
  • the vector input sub-module is used to input the input word vectors in the word vector sequence into the decoder of the encoder-decoder model in turn;
  • the hidden update sub-module is used to, after each input word vector is input to the decoder, calculate the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector according to the input word vector and the second hidden layer vector corresponding to the input word vector, where the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
  • the first output sub-module is configured to, if the error correction label corresponding to the input word vector is the first label, control the decoder to use the word corresponding to the input word vector as the decoded word corresponding to the input word vector, wherein the error correction label includes a first label and a second label;
  • the first vector sub-module is configured to, if the error correction label corresponding to the input word vector is the second label, according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the The second hidden layer vector corresponding to the input word vector constructs the first vector;
  • the first calculation sub-module is configured to perform similarity calculation between the first vector and the second vector corresponding to each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary;
  • the second output sub-module is configured to determine the decoded word corresponding to the input word vector according to the first similarity
  • the text integration sub-module is used to determine the error-corrected text according to the decoded word corresponding to each input word vector.
  • the second output submodule is specifically configured to use the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • the input word vector includes a pinyin word vector and a character shape word vector
  • the second output submodule includes:
  • the second calculation submodule is used to calculate the similarity between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain the pinyin similarity corresponding to each word in the preset dictionary;
  • the third calculation submodule is used to calculate the similarity between the character shape word vector in the input word vector and the character shape word vector corresponding to each word in the preset dictionary to obtain the character shape similarity corresponding to each word in the preset dictionary;
  • the fourth calculation submodule is used to calculate the edit distance between the word corresponding to the input word vector and each word in the preset dictionary to obtain the edit distance corresponding to each word in the preset dictionary;
  • the target calculation sub-module is used to respectively perform a weighted summation of the first similarity, pinyin similarity, character shape similarity, and edit distance corresponding to each word in the preset dictionary to obtain the target similarity corresponding to each word in the preset dictionary;
  • the target output submodule is configured to use the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • the error correction judgment model includes a bidirectional encoding representation model and a binary classifier;
  • the label module includes:
  • a pre-error correction sub-module configured to input each input word vector in the word vector sequence into the bidirectional encoding representation model in turn to obtain the first output value corresponding to each input word vector;
  • the label classification sub-module is configured to input the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
  • The third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the terminal device implements the steps of the above method.
  • the fourth aspect of the embodiments of the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the terminal device realizes the steps of the above-mentioned method.
  • the fifth aspect of the embodiments of the present application provides a computer program product, which when the computer program product runs on a terminal device, enables the terminal device to implement the steps of the above-mentioned method.
  • The embodiment of the present application provides a text error correction method. Before the decoder in the encoder-decoder model decodes, the error correction judgment model is used to classify each input word vector to obtain the error correction label of each input word vector. The above-mentioned error correction label is used to indicate whether the corresponding word needs to be corrected.
  • After obtaining the error correction label corresponding to each input word vector in the input text, the terminal device inputs the error correction label corresponding to each input word vector into the above decoder, so that the decoder can perform targeted decoding according to the error correction label corresponding to each input word vector and regulate the decoding process, thereby reducing the misjudgment of the decoder, improving the accuracy of text error correction, and solving the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
  • FIG. 1 is a schematic flowchart of a text error correction method provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a text error correction system provided by an embodiment of the present application.
  • FIG. 3 is a working schematic diagram of a word vector embedding model provided by an embodiment of the present application.
  • FIG. 4 is a working schematic diagram of an encoder provided by an embodiment of the present application.
  • FIG. 5 is a working schematic diagram of an error correction judgment model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a preset dictionary provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a text error correction device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
  • Depending on the context, the term "if" can be construed as "when", "once", "in response to determining", or "in response to detecting".
  • Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" can be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
  • The text error correction method provided in the embodiments of this application can be applied to mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, laptop computers, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), and other terminal devices.
  • The terminal device may also be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication functions, a computing device or other processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a television set-top box (STB), customer premise equipment (CPE), and/or other equipment used for communicating on a wireless system, as well as a mobile terminal in a next-generation communication system, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved public land mobile network (PLMN).
  • the encoder-decoder model is usually used to implement text processing functions such as text error correction, text translation, document extraction, and question and answer systems.
  • an encoder and a decoder are provided.
  • the user can input the text to be corrected into the encoder of the encoder-decoder model.
  • the encoder converts the text input by the user into a semantic vector, and then the encoder transmits the semantic vector to the decoder of the encoder-decoder model; the decoder decodes the semantic vector and the input text to obtain the error-corrected text and output it to the user.
  • the embodiment of the present application provides a text error correction method.
  • Before the decoder in the encoder-decoder model decodes, the error correction judgment model needs to be used to classify each input word vector to obtain the error correction label of each input word vector.
  • the above-mentioned error correction label is used to indicate whether the corresponding word needs to be corrected.
  • After obtaining the error correction label corresponding to each input word vector in the input text, the terminal device inputs the error correction label corresponding to each input word vector into the above decoder, so that the decoder can perform targeted decoding according to the error correction label corresponding to each input word vector and regulate the decoding process, thereby reducing the misjudgment of the decoder, improving the accuracy of text error correction, and solving the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
  • S101: The terminal device performs word vector conversion on the input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes the input word vector corresponding to each word in the input text.
  • the text error correction system may include a word vector embedding model 201, an encoder-decoder model 202, and an error correction decision model 203.
  • the word vector embedding model 201 is used to perform word vector conversion on the input text. This process may also be referred to as word vector embedding (embedding), which converts the input text from a natural language into an input word vector of a first preset length.
  • The type of the word vector embedding model 201 can be selected according to the actual situation. For example, assuming that the input text is Chinese text, the terminal device can select any one or a combination of the pinyin word vector model, the character shape word vector model, and the n-gram language model as the word vector embedding model 201.
  • the language model is a model used to calculate the probability of a sentence.
  • Language models are widely used in the fields of machine translation, Chinese word segmentation and grammatical analysis.
  • At present, the most commonly used language model is the n-gram language model.
  • The n in the n-gram language model is a preset value; the model assumes that a word is most strongly correlated with the n-1 words preceding it.
  • When n is 1, the n-gram language model assumes that each word in the sentence is unrelated to the preceding words and that the words are independent of each other.
  • When n is 2, it means that a word is only related to the one word before it.
  • When n is 3, it means that a word is only related to the two words before it.
  • The word vector embedding model 201 may be a single model, or it may be a combination of multiple models.
  • For example, assume that the terminal device selects a combination of the pinyin word vector model, the character shape word vector model, and the n-gram language model as the word vector embedding model 201.
  • the input text can be input into the pinyin word vector model, the character word vector model, and the n-gram language model respectively.
  • the pinyin word vector, character shape word vector and context word vector corresponding to each word in the input text are obtained.
  • the terminal device respectively splices the pinyin word vector, the character shape word vector, and the context word vector corresponding to each word to obtain the input word vector of each word in the input text.
  • the knowledge and domain characteristics of different fields can be fully utilized to improve the accuracy of text error correction.
  • The terminal device can input the input text into the word vector embedding model 201 for embedding processing to obtain the word vector sequence corresponding to the input text.
  • the word vector sequence includes the input word vector of each word in the input text.
  • The "word" mentioned in the above description can be defined according to the language type of the input text and the content pre-configured by the user.
  • For example, in a Chinese text, a single character can be a word, or a combination of multiple characters can be a word; in an English text, a word can be a single word, or a phrase composed of multiple words can be a word. This embodiment does not limit the definition of a word.
  • S102: The terminal device inputs the word vector sequence into the encoder 2021 of the encoder-decoder model 202 to obtain a semantic vector.
  • the encoder-decoder model 202 is a model applied to sequence-to-sequence (Seq2Seq) problems, and can be applied to text processing fields such as text translation, document extraction, and question and answer systems.
  • In different application fields, the input and output of the encoder-decoder model 202 represent different meanings.
  • For example, in text translation, the input of the encoder-decoder model 202 is the text to be translated, and the output of the encoder-decoder model 202 is the translated text.
  • In a question and answer system, the input is the question, and the output of the encoder-decoder model 202 is the answer.
  • an encoder 2021 and a decoder 2022 are provided in the encoder-decoder model 202.
  • the encoder 2021 is used to convert the input sequence into a fixed-length vector
  • the decoder 2022 is used to convert the fixed-length vector generated by the encoder 2021 into an output sequence.
  • the type of encoder 2021 and the type of decoder 2022 can be selected according to actual conditions.
  • For example, the type of encoder 2021 and the type of decoder 2022 may be a Recurrent Neural Network (RNN) model, a Long Short-Term Memory (LSTM) model, a Gated Recurrent Unit (GRU) model, a Text Convolutional Neural Network (TextCNN) model, a transformer model, or another model.
  • the type of the encoder 2021 may be consistent with the type of the decoder 2022, or the type of the encoder 2021 may also be inconsistent with the type of the decoder 2022.
  • the type of encoder 2021 and the type of decoder 2022 may both be LSTM models.
  • the type of the encoder 2021 may be an LSTM model
  • the type of the decoder 2022 may be a transformer model.
  • the terminal device may sequentially input the input word vectors in the word vector sequence into the encoder 2021 of the encoder-decoder model 202 to obtain the semantic vector.
  • When the terminal device sequentially inputs the input word vectors in the word vector sequence into the encoder 2021, each time an input word vector is input, the encoder 2021 updates the current first hidden layer vector according to that input word vector to obtain a new first hidden layer vector.
  • the aforementioned first hidden layer vector is the hidden layer vector of the encoder 2021.
  • After the terminal device inputs the last input word vector into the encoder 2021, the encoder 2021 updates the first hidden layer vector for the last time according to the last input word vector to obtain the semantic vector, as sketched below.
  • S103: The terminal device inputs the word vector sequence into the error correction judgment model 203 to obtain the error correction label corresponding to each input word vector.
  • In the current encoder-decoder model 202, the terminal device then inputs the semantic vector output by the encoder 2021 and the above-mentioned word vector sequence into the decoder 2022; the decoder 2022 performs a decoding operation according to the semantic vector and the word vector sequence and outputs the error-corrected text.
  • In this case, the decoding process is uncontrollable, that is, every word in the input text, regardless of whether it is a correct word, may be error-corrected. Therefore, in the current encoder-decoder model 202, the decoding process is uncontrollable and misjudgments are prone to occur: some words that do not need to be corrected may be corrected, and some words that need to be corrected may not be corrected.
  • the word vector sequence is first input to the error correction determination model 203 for error correction determination, and the error correction tags corresponding to each word in the input text are obtained.
  • The error correction judgment model 203 is used to identify whether each word in the input text is correct, so as to determine which words in the input text need to be corrected and which words do not, and to obtain the error correction label corresponding to each word in the input text.
  • the error correction label is used to indicate whether each word in the input text needs to be corrected.
  • the error correction label may include a first label and a second label.
  • the first label indicates that no error correction is required, and the second label indicates that error correction is required.
  • the form of the error correction label can be set according to the actual situation. For example, in some embodiments, 0 may be used to represent the first label, and 1 may be used to represent the second label.
  • the structure of the error correction judgment model 203 can be set according to actual conditions.
  • For example, a combination of a Bidirectional Encoder Representations from Transformers (Bert) model and a binary classifier may be used as the error correction decision model 203.
  • the terminal device may sequentially input the input word vectors in the word vector sequence into the Bert model to obtain the first output value corresponding to each input word vector. Then, the terminal device respectively inputs the first output value corresponding to each input word vector into the binary classifier for classification, and obtains the error correction label corresponding to each input word vector in the input text. Among them, the first output value of the Bert model has a one-to-one correspondence with the input word vector.
  • the Bert model has the advantages of high accuracy, convenient use, and fast adjustment speed.
  • the Bert model and the binary classifier are used to construct the error correction judgment model 203, the construction difficulty and training difficulty of the error correction judgment model 203 can be reduced.
  • the terminal device can select the RNN model, the LSTM model, the GRU model, the TextCNN model, the transformer model, and other models to construct the error correction judgment model 203.
  • the specific structure of the error correction judgment model 203 can be set according to actual conditions.
  • S104: The terminal device inputs the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into the decoder 2022 of the encoder-decoder model 202 to obtain the error-corrected text.
  • After obtaining the error correction label of each word in the input text, the terminal device can input the word vector sequence, the semantic vector, and the error correction label of each word in the input text into the decoder 2022 of the encoder-decoder model 202, to be decoded by the decoder 2022.
  • the decoder 2022 can perform targeted decoding according to the error correction tags of each word in the input text, and regulate the decoding process, thereby reducing the misjudgment of the decoder and improving the accuracy of text error correction.
  • the decoder 2022 may use a similarity comparison method to perform decoding, which reduces the computational complexity of decoding, reduces the loss of system performance, and reduces the processing time.
  • the terminal device may use the aforementioned semantic vector as the initial value of the second hidden layer vector of the decoder 2022.
  • the semantic vector is the second hidden layer vector corresponding to the first input word vector.
  • the second hidden layer vector is the hidden layer vector of the decoder 2022.
  • the terminal device sequentially inputs the input word vector in the word vector sequence into the decoder 2022, and calculates the attention vector corresponding to the input word vector and the next input according to the above input word vector and the second hidden layer vector corresponding to the input word vector The second hidden layer vector corresponding to the word vector.
  • If the error correction label corresponding to the input word vector is the first label, the terminal device controls the decoder 2022 to use the word corresponding to the input word vector as the decoded word corresponding to that input word vector; the decoded word is the output value of the decoder.
  • If the error correction label corresponding to the input word vector is the second label, the terminal device constructs the first vector from the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector.
  • the construction method of the first vector can be set according to the actual situation.
  • the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector may be directly spliced into the first vector.
  • the terminal device calculates the similarity between the first vector and the second vector corresponding to each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary, and determines the corresponding input word vector according to the first similarity. Decode the word.
  • the terminal device may directly control the decoder 2022 to determine the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • the terminal device may also perform a comprehensive comparison with knowledge in other domains to obtain the target similarity, and determine the decoded word corresponding to the input word vector according to the target similarity.
  • the above-mentioned other domain knowledge may include one or more of domain knowledge such as pinyin similarity, font similarity, and edit distance.
  • the input word vector includes a pinyin word vector and a character shape word vector.
  • the terminal device can perform similarity calculation according to the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain the pinyin similarity corresponding to each word in the preset dictionary.
  • Similarly, the terminal device can perform a similarity calculation on the character shape word vector in the input word vector and the character shape word vector corresponding to each word in the preset dictionary to obtain the character shape similarity corresponding to each word in the preset dictionary.
  • the terminal device calculates the edit distance (Edit Distance) between the word corresponding to the input word vector and each word in the preset dictionary to obtain the edit distance corresponding to each word in the preset dictionary.
  • Edit distance refers to the minimum number of editing operations required to convert one character string into the other.
  • the application range of edit distance is very wide, especially on similarity issues, such as text error correction and plagiarism recognition.
  • Editing in edit distance includes three operations: insert, delete and replace.
  • a dynamic programming algorithm is usually used to calculate the edit distance between two strings.
  • the terminal device performs a weighted summation of the first similarity, pinyin similarity, font similarity, and editing distance corresponding to each word in the preset dictionary to obtain the target similarity corresponding to each word in the preset dictionary.
  • The first weight value corresponding to the first similarity, the second weight value corresponding to the pinyin similarity, the third weight value corresponding to the character shape similarity, and the fourth weight value corresponding to the edit distance are all preset values.
  • When calculating similarity, a suitable similarity algorithm can be selected according to the actual situation. For example, the cosine distance can be used to calculate the similarity; alternatively, the Euclidean distance can be used; or another similarity algorithm can be chosen.
  • the terminal device may determine the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • the terminal device After the terminal device obtains the decoded word corresponding to each input word vector, it determines the error-corrected text according to the decoded word corresponding to each input word vector.
  • the text error correction system includes an encoder-decoder model, an error correction decision model and a word vector embedding model.
  • In the encoder-decoder model, both the encoder and the decoder are LSTM models.
  • The error correction judgment model is a Bert model followed by a binary classifier.
  • the word vector embedding model includes a pinyin word vector model, a character shape word vector model, and an n-gram language model.
  • the text error correction system is trained using training corpus.
  • the training corpus may include the collected labeled text corpus after Automatic Speech Recognition (ASR) recognition, public data in newspapers, general entity vocabulary, and other public corpus, such as Sighan Bakeoff corpus.
  • Before training, the training corpus can be preprocessed so that the quality of the training corpus can be improved.
  • the training corpus can be converted into training word vectors through the word vector embedding model, and the encoder, decoder, and error correction judgment module can be trained using the training word vectors.
  • During training, the first loss value corresponding to the encoder, the second loss value corresponding to the decoder, and the third loss value corresponding to the error correction judgment module are calculated; the encoder is iteratively updated according to the first loss value, the decoder is iteratively updated according to the second loss value, and the error correction decision model is iteratively updated according to the third loss value.
  • Training suspension conditions can be set according to actual conditions.
  • For example, the training suspension condition can be that the number of training iterations reaches a preset number; or it can be that the first loss value, the second loss value, and the third loss value are all less than a preset loss threshold; or it can be another condition.
  • the trained text error correction system can be used for text error correction.
  • As shown in FIG. 3, the word vector sequence includes the input word vector corresponding to each word in the input text (that is, word vector A1 to word vector A6 in FIG. 3), and each input word vector is obtained by concatenating the pinyin word vector, the character shape word vector, and the context word vector.
  • the terminal device inputs the word vector A1 to the word vector A6 into the encoder one by one. Every time an input word vector is input to the encoder, the encoder updates the first hidden layer vector of the encoder according to the input word vector.
  • the initial value of the encoder's first hidden layer vector is h0.
  • After the terminal device inputs the word vector A1 into the encoder, the encoder updates h0 to h1 according to the word vector A1; after the terminal device inputs the word vector A2 into the encoder, the encoder updates h1 to h2 according to the word vector A2; and so on, until the encoder updates h5 to h6 according to the word vector A6.
  • the terminal device uses h6 as a semantic vector, and controls the encoder to input h6 into the decoder as the initial value s1 of the second hidden layer vector of the decoder.
  • The terminal device inputs the word vector A1 to the word vector A6 into the Bert model in the error correction judgment model one by one, and the Bert model outputs the first output values corresponding to the word vector A1 to the word vector A6; the terminal device then inputs each first output value into the binary classifier to obtain the error correction labels corresponding to the word vector A1 to the word vector A6.
  • the error correction tag includes a first tag and a second tag, the value of the first tag is 0, and the value of the second tag is 1.
  • The error correction labels corresponding to word vector A1, word vector A5, and word vector A6 are 0, indicating that the words corresponding to word vector A1, word vector A5, and word vector A6 do not need error correction; the error correction labels corresponding to word vector A2, word vector A3, and word vector A4 are 1, indicating that the words corresponding to word vector A2, word vector A3, and word vector A4 need to be corrected.
  • the terminal device inputs the word vector A1 to the word vector A6 and the error correction label corresponding to each input word vector into the decoder to obtain the decoded word corresponding to each input word vector.
  • the specific process is as follows:
  • First, the terminal device inputs the word vector A1 and the error correction label corresponding to the word vector A1 into the decoder, and the decoder updates s1 to s2 according to the word vector A1. Since the error correction label corresponding to the word vector A1 is 0, the decoder outputs the word "listen" corresponding to the word vector A1.
  • Then, the terminal device inputs the word vector A2 and the error correction label corresponding to the word vector A2 into the decoder, and the decoder updates s2 to s3 according to the word vector A2. Since the error correction label corresponding to the word vector A2 is 1, the decoder calculates the attention vector b1 corresponding to the word vector A2 according to s2 and h1 to h6, and then constructs the first vector c1 corresponding to the word vector A2 according to the error correction label corresponding to the word vector A2, s2, and b1.
  • The terminal device calculates the similarity between the first vector c1 and the second vector of each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary. At this time, the first similarity between the first vector c1 and the second vector d2 is the highest, and the decoder outputs the word "Xue" corresponding to the second vector d2.
  • Next, the terminal device inputs the word vector A3 and the error correction label corresponding to the word vector A3 into the decoder, and the decoder updates s3 to s4 according to the word vector A3. Since the error correction label corresponding to the word vector A3 is 1, the decoder calculates the attention vector b2 corresponding to the word vector A3 according to s3 and h1 to h6, and then constructs the first vector c2 corresponding to the word vector A3 according to the error correction label corresponding to the word vector A3, s3, and b2.
  • The terminal device calculates the similarity between the first vector c2 and the second vector of each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary, and the decoder outputs the word "Zhi", which has the highest first similarity in the preset dictionary.
  • Next, the terminal device inputs the word vector A4 and the error correction label corresponding to the word vector A4 into the decoder, and the decoder updates s4 to s5 according to the word vector A4. Since the error correction label corresponding to the word vector A4 is 1, the decoder calculates the attention vector b3 corresponding to the word vector A4 according to s4 and h1 to h6, and then constructs the first vector c3 corresponding to the word vector A4 according to the error correction label corresponding to the word vector A4, s4, and b3.
  • The terminal device calculates the similarity between the first vector c3 and the second vector of each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary, and the decoder outputs the word "Qian", which has the highest first similarity in the preset dictionary.
  • Then, the terminal device inputs the word vector A5 and the error correction label corresponding to the word vector A5 into the decoder, and the decoder updates s5 to s6 according to the word vector A5. Since the error correction label corresponding to the word vector A5 is 0, the decoder outputs the word "de" (的) corresponding to the word vector A5.
  • Finally, the terminal device inputs the word vector A6 and the error correction label corresponding to the word vector A6 into the decoder, and the decoder updates s6 to s7 according to the word vector A6. Since the error correction label corresponding to the word vector A6 is 0, the decoder outputs the word "song" corresponding to the word vector A6.
  • After the terminal device obtains the decoded words "listen", "Xue", "Zhi", "Qian", "de", and "song" corresponding to word vector A1 to word vector A6, it arranges the decoded words in order to obtain the error-corrected text "Listen to Xue Zhiqian's song".
  • the embodiments of the present application provide a text error correction method.
  • Before the decoder in the encoder-decoder model decodes, the error correction decision model is used to classify each input word vector to obtain the error correction label of each input word vector. The above-mentioned error correction label is used to indicate whether the corresponding word needs to be corrected.
  • After obtaining the error correction label corresponding to each input word vector in the input text, the terminal device inputs the error correction label corresponding to each input word vector into the above decoder, so that the decoder can perform targeted decoding according to the error correction label corresponding to each input word vector and regulate the decoding process, thereby reducing the misjudgment of the decoder, improving the accuracy of text error correction, and solving the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
  • Through the above steps, the corrected text can be obtained.
  • The corrected text can be widely used in various downstream tasks, such as word segmentation tasks, part-of-speech tagging tasks, entity recognition tasks, intent classification tasks, slot filling tasks, dialogue management tasks, and text generation tasks.
  • FIG. 7 An embodiment of the present application provides a text error correction device. For ease of description, only the parts related to the present application are shown. As shown in FIG. 7, the text error correction device includes:
  • the embedding module 701 is configured to perform word vector conversion on the input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes the input word vector corresponding to each word in the input text;
  • the semantic module 702 is used to input the word vector sequence into the encoder of the encoder-decoder model to obtain the semantic vector;
  • the label module 703 is configured to input the word vector sequence into the error correction judgment model to obtain the error correction label corresponding to each input word vector;
  • the error correction module 704 is configured to input the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into the decoder of the encoder-decoder model to obtain error-corrected text .
  • the error correction module 704 includes:
  • the vector input sub-module is used to input the input word vectors in the word vector sequence into the decoder of the encoder-decoder model in turn;
  • the hidden update sub-module is used to, after each input word vector is input to the decoder, calculate the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector according to the input word vector and the second hidden layer vector corresponding to the input word vector, where the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
  • the first output sub-module is configured to, if the error correction label corresponding to the input word vector is the first label, control the decoder to use the word corresponding to the input word vector as the decoded word corresponding to the input word vector, wherein the error correction label includes a first label and a second label;
  • the first vector sub-module is configured to, if the error correction label corresponding to the input word vector is the second label, according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the The second hidden layer vector corresponding to the input word vector constructs the first vector;
  • the first calculation sub-module is configured to perform similarity calculation between the first vector and the second vector corresponding to each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary;
  • the second output sub-module is configured to determine the decoded word corresponding to the input word vector according to the first similarity
  • the text integration sub-module is used to determine the error-corrected text according to the decoded word corresponding to each input word vector.
  • the second output submodule is specifically configured to use the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • the input word vector includes a pinyin word vector and a character shape word vector
  • the second output submodule includes:
  • the second calculation submodule is used to calculate the similarity between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain the pinyin similarity corresponding to each word in the preset dictionary;
  • the third calculation submodule is used to calculate the similarity between the character shape word vector in the input word vector and the character shape word vector corresponding to each word in the preset dictionary to obtain the character shape similarity corresponding to each word in the preset dictionary;
  • the fourth calculation submodule is used to calculate the edit distance between the word corresponding to the input word vector and each word in the preset dictionary to obtain the edit distance corresponding to each word in the preset dictionary;
  • the target calculation sub-module is used to respectively perform a weighted summation of the first similarity, pinyin similarity, character shape similarity, and edit distance corresponding to each word in the preset dictionary to obtain the target similarity corresponding to each word in the preset dictionary;
  • the target output submodule is configured to use the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  • the error correction judgment model includes a bidirectional encoding representation model and a binary classifier;
  • the label module 703 includes:
  • a pre-error correction sub-module configured to input each input word vector in the word vector sequence into the bidirectional encoding representation model in turn to obtain the first output value corresponding to each input word vector;
  • the label classification sub-module is configured to input the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
  • an embodiment of the present application also provides a terminal device.
  • As shown in FIG. 8, the terminal device 8 includes a processor 80, a memory 81, and a computer program 82 stored in the memory 81 and runnable on the processor 80.
  • When the processor 80 executes the computer program 82, it implements the steps in the embodiment of the text error correction method described above, such as steps S101 to S104 shown in FIG. 1.
  • Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the foregoing device embodiments, for example the functions of modules 701 to 704 shown in FIG. 7, are realized.
  • the computer program 82 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete This application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8.
  • For example, the computer program 82 can be divided into an embedding module, a semantic module, a label module, and an error correction module. The specific functions of each module are as follows:
  • the embedding module is used to perform word vector conversion on the input text to obtain a word vector sequence corresponding to the input text, wherein the word vector sequence includes the input word vector corresponding to each word in the input text;
  • the semantic module is used to input the word vector sequence into the encoder of the encoder-decoder model to obtain the semantic vector;
  • the label module is used to input the word vector sequence into the error correction judgment model to obtain the error correction label corresponding to each input word vector;
  • the error correction module is used to input the word vector sequence, the semantic vector and the error correction label corresponding to each input word vector into the decoder of the encoder-decoder model to obtain the error-corrected text.
  • the terminal device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device may include, but is not limited to, a processor 80 and a memory 81.
  • FIG. 8 is only an example of the terminal device 8 and does not constitute a limitation on the terminal device 8; the terminal device may include more or fewer components than shown in the figure, combine certain components, or have different components.
  • the terminal device may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8.
  • The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 8. Further, the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • the disclosed device/terminal device and method may be implemented in other ways.
  • the device/terminal device embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, all or part of the processes in the methods of the above embodiments of this application can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, etc.
  • the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

A text error correction method, apparatus, terminal device, and computer storage medium, applicable to the field of artificial intelligence technology. Before the decoder in the encoder-decoder model of the method performs decoding, an error correction judgment model is first used to classify each input word vector by label, obtaining the error correction label of each input word vector; the error correction label indicates whether the corresponding word needs to be corrected. After obtaining the error correction label corresponding to each input word vector in the input text, the terminal device inputs these labels into the decoder, so that the decoder can decode in a targeted manner according to the error correction label of each input word vector and regulate the decoding process, thereby reducing misjudgments by the decoder, improving the accuracy of text error correction, and solving the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.

Description

Text error correction method, apparatus, terminal device, and computer storage medium
This application claims priority to the Chinese patent application No. 202010110410.7, entitled "Text Error Correction Method, Apparatus, Terminal Device, and Computer Storage Medium", filed with the State Intellectual Property Office on February 21, 2020, the entire content of which is incorporated herein by reference.
Technical Field
This application belongs to the field of artificial intelligence technology, and in particular relates to a text error correction method, apparatus, terminal device, and computer storage medium.
Background
In the current text processing field, encoder-decoder models are commonly used to implement text processing functions such as text error correction, text translation, document extraction, and question answering systems.
An encoder-decoder model contains an encoder and a decoder. When performing text error correction, a user can input the text to be corrected into the encoder of the encoder-decoder model; the encoder converts the text input by the user into a semantic vector and then passes the semantic vector to the decoder of the encoder-decoder model, which decodes the semantic vector to obtain the corrected text and outputs it to the user.
However, in the current encoder-decoder model, the decoding process is uncontrollable and prone to misjudgment: some correct words may be misjudged as wrong words and corrected, while some wrong words may be misjudged as correct words and left uncorrected.
Summary
The embodiments of this application provide a text error correction method, apparatus, terminal device, and computer storage medium, which can solve the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
A first aspect of the embodiments of this application provides a text error correction method, including:
a terminal device performs word vector conversion on input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes an input word vector corresponding to each word in the input text;
the terminal device inputs the word vector sequence into an encoder of an encoder-decoder model to obtain a semantic vector;
the terminal device inputs the word vector sequence into an error correction judgment model to obtain an error correction label corresponding to each input word vector;
the terminal device inputs the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into a decoder of the encoder-decoder model to obtain error-corrected text.
It should be noted that, before the decoder decodes, the terminal device first inputs the word vector sequence into the error correction judgment model for error correction judgment, obtaining the error correction label corresponding to each word in the input text. The error correction label indicates whether each word in the input text needs to be corrected.
During decoding, the decoder can decode in a targeted manner according to the error correction label of each word in the input text and regulate the decoding process, thereby reducing misjudgments by the decoder and improving the accuracy of text error correction.
In a possible implementation of the first aspect, inputting, by the terminal device, the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into the decoder of the encoder-decoder model to obtain the error-corrected text includes:
the terminal device sequentially inputs the input word vectors in the word vector sequence into the decoder of the encoder-decoder model;
each time an input word vector is input into the decoder, the terminal device calculates, according to the input word vector and the second hidden layer vector corresponding to the input word vector, the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector, where the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
if the error correction label corresponding to the input word vector is a first label, the terminal device controls the decoder to take the word corresponding to the input word vector as the decoded word corresponding to the input word vector, where the error correction labels include the first label and a second label;
if the error correction label corresponding to the input word vector is the second label, the terminal device constructs a first vector according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector;
the terminal device performs similarity calculation between the first vector and the second vector corresponding to each word in a preset dictionary to obtain a first similarity corresponding to each word in the preset dictionary;
the terminal device determines the decoded word corresponding to the input word vector according to the first similarity;
the terminal device determines the error-corrected text according to the decoded words corresponding to the input word vectors.
It should be noted that when the decoder decodes by similarity comparison, the computational complexity of decoding can be reduced, lowering the drain on system performance and shortening the processing time.
In a possible implementation of the first aspect, determining, by the terminal device, the decoded word corresponding to the input word vector according to the first similarity includes:
the terminal device takes the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
It should be noted that the terminal device can directly take the word with the highest first similarity in the preset dictionary as the decoded word corresponding to that input word vector, reducing the complexity of the decoding calculation.
In another possible implementation of the first aspect, the input word vector includes a pinyin word vector and a glyph word vector;
correspondingly, determining, by the terminal device, the decoded word corresponding to the input word vector according to the first similarity includes:
the terminal device performs similarity calculation between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain a pinyin similarity corresponding to each word in the preset dictionary;
the terminal device performs similarity calculation between the glyph word vector in the input word vector and the glyph word vector corresponding to each word in the preset dictionary to obtain a glyph similarity corresponding to each word in the preset dictionary;
the terminal device performs edit distance calculation between the word corresponding to the input word vector and each word in the preset dictionary to obtain an edit distance corresponding to each word in the preset dictionary;
the terminal device performs weighted summation of the first similarity, the pinyin similarity, the glyph similarity, and the edit distance corresponding to each word in the preset dictionary to obtain a target similarity corresponding to each word in the preset dictionary;
the terminal device takes the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
It should be noted that when the terminal device needs to improve decoding accuracy, it can combine domain knowledge such as the first similarity, pinyin similarity, glyph similarity, and edit distance for a comprehensive evaluation to obtain the target similarity.
The terminal device then takes the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector, thereby improving the decoding accuracy of the decoder.
In a possible implementation of the first aspect, the error correction judgment model includes a bidirectional encoder representation model and a binary classifier;
correspondingly, inputting, by the terminal device, the word vector sequence into the error correction judgment model to obtain the error correction label corresponding to each input word vector includes:
the terminal device sequentially inputs each input word vector in the word vector sequence into the error correction judgment model to obtain a first output value corresponding to each input word vector;
the terminal device separately inputs the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
It should be noted that the bidirectional encoder representation model has advantages such as high accuracy, ease of use, and fast tuning; using a bidirectional encoder representation model and a binary classifier can reduce the difficulty of constructing and training the error correction judgment model.
A second aspect of the embodiments of this application provides a text error correction apparatus, including:
an embedding module, configured to perform word vector conversion on input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes an input word vector corresponding to each word in the input text;
a semantic module, configured to input the word vector sequence into an encoder of an encoder-decoder model to obtain a semantic vector;
a label module, configured to input the word vector sequence into an error correction judgment model to obtain an error correction label corresponding to each input word vector;
an error correction module, configured to input the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into a decoder of the encoder-decoder model to obtain error-corrected text.
In a possible implementation of the second aspect, the error correction module includes:
a vector input submodule, configured to sequentially input the input word vectors in the word vector sequence into the decoder of the encoder-decoder model;
a hidden update submodule, configured to, each time an input word vector is input into the decoder, calculate, according to the input word vector and the second hidden layer vector corresponding to the input word vector, the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector, where the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
a first output submodule, configured to, if the error correction label corresponding to the input word vector is a first label, control the decoder to take the word corresponding to the input word vector as the decoded word corresponding to the input word vector, where the error correction labels include the first label and a second label;
a first vector submodule, configured to, if the error correction label corresponding to the input word vector is the second label, construct a first vector according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector;
a first calculation submodule, configured to perform similarity calculation between the first vector and the second vector corresponding to each word in a preset dictionary to obtain a first similarity corresponding to each word in the preset dictionary;
a second output submodule, configured to determine the decoded word corresponding to the input word vector according to the first similarity;
a text integration submodule, configured to determine the error-corrected text according to the decoded words corresponding to the input word vectors.
In a possible implementation of the second aspect, the second output submodule is specifically configured to take the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
In another possible implementation of the second aspect, the input word vector includes a pinyin word vector and a glyph word vector;
correspondingly, the second output submodule includes:
a second calculation submodule, configured to perform similarity calculation between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain a pinyin similarity corresponding to each word in the preset dictionary;
a third calculation submodule, configured to perform similarity calculation between the glyph word vector in the input word vector and the glyph word vector corresponding to each word in the preset dictionary to obtain a glyph similarity corresponding to each word in the preset dictionary;
a fourth calculation submodule, configured to perform edit distance calculation between the word corresponding to the input word vector and each word in the preset dictionary to obtain an edit distance corresponding to each word in the preset dictionary;
a target calculation submodule, configured to perform weighted summation of the first similarity, the pinyin similarity, the glyph similarity, and the edit distance corresponding to each word in the preset dictionary to obtain a target similarity corresponding to each word in the preset dictionary;
a target output submodule, configured to take the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
In a possible implementation of the second aspect, the error correction judgment model includes a bidirectional encoder representation model and a binary classifier;
correspondingly, the label module includes:
a pre-error-correction submodule, configured to sequentially input each input word vector in the word vector sequence into the bidirectional encoder representation model to obtain a first output value corresponding to each input word vector;
a label classification submodule, configured to separately input the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
A third aspect of the embodiments of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where when the processor executes the computer program, the terminal device implements the steps of the above method.
A fourth aspect of the embodiments of this application provides a computer-readable storage medium storing a computer program, where when the computer program is executed by a processor, a terminal device implements the steps of the above method.
A fifth aspect of the embodiments of this application provides a computer program product which, when run on a terminal device, causes the terminal device to implement the steps of the above method.
Compared with the prior art, the embodiments of this application have the following beneficial effects:
The embodiments of this application provide a text error correction method. Before the decoder in the encoder-decoder model performs decoding, an error correction judgment model is first used to classify each input word vector by label, obtaining the error correction label of each input word vector. The error correction label indicates whether the corresponding word needs to be corrected. After obtaining the error correction label corresponding to each input word vector in the input text, the terminal device inputs these labels into the decoder, so that the decoder can decode in a targeted manner according to the error correction label of each input word vector and regulate the decoding process, thereby reducing misjudgments by the decoder, improving the accuracy of text error correction, and solving the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the following briefly introduces the drawings needed in the description of the embodiments or the prior art. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a text error correction method provided by an embodiment of this application;
FIG. 2 is a schematic structural diagram of a text error correction system provided by an embodiment of this application;
FIG. 3 is a working diagram of the word vector embedding model provided by an embodiment of this application;
FIG. 4 is a working diagram of the encoder provided by an embodiment of this application;
FIG. 5 is a working diagram of the error correction judgment model provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the preset dictionary provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a text error correction apparatus provided by an embodiment of this application;
FIG. 8 is a schematic diagram of the terminal device provided by an embodiment of this application.
Detailed Description
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary details do not obscure the description of this application.
It should be understood that when used in the specification and the appended claims of this application, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the term "and/or" used in the specification and the appended claims of this application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in the specification and the appended claims of this application, the term "if" may be interpreted, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as meaning "once determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In addition, in the description of the specification and the appended claims of this application, the terms "first", "second", "third", etc. are only used to distinguish descriptions and should not be understood as indicating or implying relative importance.
References in the specification of this application to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of this application. Thus, the statements "in one embodiment", "in some embodiments", "in some other embodiments", "in further embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless otherwise specifically emphasized. The terms "comprising", "including", "having", and their variants all mean "including but not limited to" unless otherwise specifically emphasized.
The text error correction method provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of this application do not place any restriction on the specific type of the terminal device.
For example, the terminal device may be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite radio device, a wireless modem card, a television set-top box (STB), customer premise equipment (CPE), and/or another device for communicating over a wireless system, as well as a next-generation communication system, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN).
In the current text processing field, encoder-decoder models are commonly used to implement text processing functions such as text error correction, text translation, document extraction, and question answering systems.
An encoder-decoder model contains an encoder and a decoder. When performing text error correction, a user can input the text to be corrected into the encoder of the encoder-decoder model; the encoder converts the text input by the user into a semantic vector and then passes the semantic vector to the decoder of the encoder-decoder model, and the decoder decodes according to the semantic vector and the input text to obtain the corrected text and outputs it to the user.
However, in the current encoder-decoder model, the decoding process is uncontrollable and prone to misjudgment: some correct words may be misjudged as wrong words and corrected, while some wrong words may be misjudged as correct words and left uncorrected.
In view of this, the embodiments of this application provide a text error correction method. Before the decoder in the encoder-decoder model performs decoding, an error correction judgment model is first used to classify each input word vector by label, obtaining the error correction label of each input word vector. The error correction label indicates whether the corresponding word needs to be corrected. After obtaining the error correction label corresponding to each input word vector in the input text, the terminal device inputs these labels into the decoder, so that the decoder can decode in a targeted manner according to the error correction label of each input word vector and regulate the decoding process, thereby reducing misjudgments by the decoder, improving the accuracy of text error correction, and solving the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
Next, the content of the text error correction method provided by this embodiment is described from the perspective of the terminal device. Referring to the flowchart of the text error correction method shown in FIG. 1, the method includes:
S101. The terminal device performs word vector conversion on the input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes the input word vector corresponding to each word in the input text.
Referring to the system diagram of the text error correction system shown in FIG. 2, the text error correction system may include a word vector embedding model 201, an encoder-decoder model 202, and an error correction judgment model 203.
The word vector embedding model 201 is used to perform word vector conversion on the input text; this process may also be called word vector embedding processing, which converts the input text from natural language into input word vectors of a first preset length.
The type of the word vector embedding model 201 can be selected according to the actual situation. For example, when the input text is Chinese text, the terminal device can select any one, or a combination of several, of a pinyin word vector model, a glyph word vector model, an n-gram language model, and other models as the word vector embedding model 201.
A language model is a model for calculating the probability of a sentence. Language models are widely used in fields such as machine translation, Chinese word segmentation, and syntactic analysis. The language model mainly used at present is the n-gram language model, in which n is a preset value and a word is most strongly correlated with only the n-1 words before it. When n is 1, the model treats each word in a sentence as independent of the preceding words; when n is 2, a word is related only to the one word before it; and when n is 3, a word is related only to the two words before it.
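For illustration only, the following minimal Python sketch shows how a bigram (n=2) language model assigns conditional probabilities; the toy corpus, character-level tokenization, and add-one smoothing are assumptions of this sketch, not part of the embodiment.

```python
from collections import Counter

# Tiny hypothetical character-level corpus (each inner list is one sentence).
corpus = [["听", "歌"], ["听", "音", "乐"], ["听", "歌", "曲"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) under a bigram model with add-one smoothing."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

print(bigram_prob("听", "歌"))  # probability of "歌" following "听": 0.375 here
```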
In addition, the word vector embedding model 201 may be a single model, or it may be a combination of multiple models.
For example, when the terminal device selects the combination of the pinyin word vector model, the glyph word vector model, and the n-gram language model as the word vector model, the input text can be separately input into the pinyin word vector model, the glyph word vector model, and the n-gram language model to obtain the pinyin word vector, the glyph word vector, and the context word vector corresponding to each word in the input text. The terminal device then concatenates the pinyin word vector, the glyph word vector, and the context word vector of each word to obtain the input word vector of each word in the input text (see the sketch after this passage).
When the input word vector is obtained by combining multiple kinds of word vectors, knowledge and features from different domains can be fully utilized, improving the accuracy of text error correction.
After the word vector embedding model 201 is determined, the terminal device can input the input text into the word vector embedding model 201 for embedding processing to obtain the word vector sequence corresponding to the input text. The word vector sequence includes the input word vector of each word in the input text.
It should be understood that the word (token) mentioned above can be defined according to the language type of the input text and content preconfigured by the user. For example, in Chinese text a single character may be treated as one word, or a combination of several characters may be treated as one word; in English text a single word may be treated as one word, or a phrase composed of several words may be treated as one word. This embodiment does not restrict how words are defined.
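As a minimal illustration of the concatenation described above, the sketch below builds an input word vector from three embedding tables standing in for the pinyin, glyph, and n-gram context models; the module names and dimensions are assumptions of this sketch, not the actual models of the embodiment.

```python
import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    """Concatenate pinyin, glyph, and context vectors into one input word vector."""
    def __init__(self, vocab_size, pinyin_dim=32, glyph_dim=32, ctx_dim=64):
        super().__init__()
        self.pinyin = nn.Embedding(vocab_size, pinyin_dim)   # stand-in for a pinyin word vector model
        self.glyph = nn.Embedding(vocab_size, glyph_dim)     # stand-in for a glyph word vector model
        self.context = nn.Embedding(vocab_size, ctx_dim)     # stand-in for an n-gram context model

    def forward(self, token_ids):
        # One input word vector per token, of a fixed (first preset) length.
        return torch.cat(
            [self.pinyin(token_ids), self.glyph(token_ids), self.context(token_ids)],
            dim=-1,
        )

emb = ConcatEmbedding(vocab_size=5000)
word_vectors = emb(torch.tensor([[3, 17, 256]]))  # word vector sequence, shape (1, 3, 128)
```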
S102. The terminal device inputs the word vector sequence into the encoder 2021 of the encoder-decoder model 202 to obtain a semantic vector.
The encoder-decoder model 202 is a model for sequence-to-sequence (Seq2Seq) problems and can be applied in text processing fields such as text translation, document extraction, and question answering systems. In different text processing fields, the input and output of the encoder-decoder model 202 represent different meanings. For example, in text translation the input of the encoder-decoder model 202 is the text to be translated and the output is the translated text; in question answering systems the input is a question and the output is an answer.
The encoder-decoder model 202 contains an encoder 2021 and a decoder 2022. The encoder 2021 converts the input sequence into a fixed-length vector, and the decoder 2022 converts the fixed-length vector generated by the encoder 2021 back into an output sequence.
The types of the encoder 2021 and the decoder 2022 can be selected according to the actual situation.
For example, the encoder 2021 and the decoder 2022 may each be any one of a recurrent neural network (RNN) model, a long short-term memory (LSTM) model, a gated recurrent unit (GRU) model, a text convolutional neural network (TextCNN) model, a transformer model, and the like.
Moreover, the type of the encoder 2021 may be the same as the type of the decoder 2022, or the two may differ.
For example, the encoder 2021 and the decoder 2022 may both be LSTM models; alternatively, the encoder 2021 may be an LSTM model while the decoder 2022 is a transformer model.
After obtaining the word vector sequence of the input text, the terminal device can sequentially input the input word vectors in the word vector sequence into the encoder 2021 of the encoder-decoder model 202 to obtain the semantic vector.
As the terminal device sequentially inputs the input word vectors into the encoder 2021, each time an input word vector is input, the encoder 2021 updates the current first hidden layer vector according to that input word vector to obtain a new first hidden layer vector; the first hidden layer vector is the hidden layer vector of the encoder 2021. After the terminal device inputs the last input word vector into the encoder 2021, the encoder 2021 updates the first hidden layer vector one final time according to the last input word vector to obtain the semantic vector.
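The encoder pass just described can be illustrated with the following minimal PyTorch sketch, in which an LSTM stands in for the encoder 2021 and its final hidden state is taken as the semantic vector; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

word_vectors = torch.randn(1, 6, 128)   # e.g. six input word vectors A1..A6
h0 = torch.zeros(1, 1, 256)             # initial first hidden layer vector
c0 = torch.zeros(1, 1, 256)

# The LSTM consumes the input word vectors one by one, updating the first
# hidden layer vector at every step; h_n is the state after the last update.
outputs, (h_n, c_n) = encoder(word_vectors, (h0, c0))
semantic_vector = h_n                   # taken as the semantic vector
```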
S103. The terminal device inputs the word vector sequence into the error correction judgment model 203 to obtain the error correction label corresponding to each input word vector.
In the existing text error correction scheme based on the encoder-decoder model 202, after the encoder 2021 encodes the word vector sequence into a semantic vector, the terminal device inputs the semantic vector output by the encoder 2021 and the word vector sequence into the decoder 2022, which performs the decoding operation according to the semantic vector and the word vector sequence and outputs the corrected text.
However, in this decoding scheme the decoding process is uncontrollable: every word in the input text, whether correct or not, may be corrected. Therefore, in the current encoder-decoder model 202 the uncontrollable decoding process is prone to misjudgment; some words that do not need correction may be corrected, and some words that do need correction may be left uncorrected.
To address this, in the error correction method of this embodiment, before the decoder 2022 decodes, the word vector sequence is first input into the error correction judgment model 203 for error correction judgment, obtaining the error correction label corresponding to each word in the input text.
The error correction judgment model 203 identifies whether each word in the input text is a correct word, thereby determining which words in the input text need to be corrected and which do not, and obtaining the error correction label corresponding to each word in the input text.
The error correction label indicates whether each word in the input text needs to be corrected. The error correction labels may include a first label and a second label, where the first label means no correction is needed and the second label means correction is needed. The form of the error correction label can be set according to the actual situation; for example, in some embodiments 0 may represent the first label and 1 the second label.
The structure of the error correction judgment model 203 can be set according to the actual situation. In some embodiments, the combination of a Bidirectional Encoder Representations from Transformers (Bert) model and a binary classifier may be used as the error correction judgment model 203.
The terminal device can sequentially input the input word vectors in the word vector sequence into the Bert model to obtain the first output value corresponding to each input word vector. The terminal device then separately inputs each first output value into the binary classifier for classification, obtaining the error correction label corresponding to each input word vector in the input text. The first output values of the Bert model correspond one-to-one with the input word vectors.
The Bert model has advantages such as high accuracy, ease of use, and fast tuning. When the Bert model and a binary classifier are used to construct the error correction judgment model 203, the difficulty of constructing and training the error correction judgment model 203 can be reduced (see the sketch below).
In other embodiments, other models may also be chosen as the error correction judgment model 203. For example, the terminal device may use an RNN model, an LSTM model, a GRU model, a TextCNN model, a transformer model, or the like to construct the error correction judgment model 203; the specific structure of the error correction judgment model 203 can be set according to the actual situation.
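For illustration, the following sketch mirrors the structure of the error correction judgment model 203 described above: a bidirectional encoder produces one first output value per token, followed by a binary classifier. The encoder here is a small Transformer placeholder, not an actual pretrained Bert model, and all dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ErrorJudgmentModel(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Placeholder bidirectional encoder standing in for a Bert model.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Binary classifier: 0 = first label (keep), 1 = second label (correct).
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, word_vectors):
        first_output = self.encoder(word_vectors)   # one first output value per input word vector
        logits = self.classifier(first_output)
        return logits.argmax(dim=-1)                # error correction label per token

model = ErrorJudgmentModel()
labels = model(torch.randn(1, 6, 128))  # per-token 0/1 labels; values depend on the (untrained) weights
```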
S104. The terminal device inputs the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into the decoder 2022 of the encoder-decoder model 202 to obtain the error-corrected text.
After obtaining the error correction label of each word in the input text, the terminal device can input the word vector sequence, the semantic vector, and the error correction labels of the words in the input text into the decoder 2022 of the encoder-decoder model 202, and the decoder 2022 performs the decoding.
During decoding, the decoder 2022 can decode in a targeted manner according to the error correction label of each word in the input text and regulate the decoding process, thereby reducing misjudgments by the decoder and improving the accuracy of text error correction.
In addition, decoders usually decode with a softmax function, which consumes considerable computing performance and takes a long time. Therefore, in some possible implementations the decoder 2022 can decode by similarity comparison, reducing the computational complexity of decoding, the drain on system performance, and the processing time.
During decoding, the terminal device can use the semantic vector as the initial value of the second hidden layer vector of the decoder 2022; at this point the semantic vector is the second hidden layer vector corresponding to the first input word vector. The second hidden layer vector is the hidden layer vector of the decoder 2022.
The terminal device sequentially inputs the input word vectors in the word vector sequence into the decoder 2022 and calculates, according to the input word vector and the second hidden layer vector corresponding to that input word vector, the attention vector corresponding to that input word vector and the second hidden layer vector corresponding to the next input word vector.
If the error correction label corresponding to the input word vector is the first label, indicating that the word corresponding to the input word vector does not need correction, the terminal device controls the decoder 2022 to take the word corresponding to the input word vector as the decoded word corresponding to that input word vector; the decoded word is the output value of the decoder.
If the error correction label corresponding to the input word vector is the second label, indicating that the word corresponding to the input word vector needs correction, the terminal device constructs a first vector according to the error correction label, the attention vector, and the second hidden layer vector corresponding to that input word vector. The way the first vector is constructed can be set according to the actual situation; for example, in some embodiments the error correction label, the attention vector, and the second hidden layer vector corresponding to the input word vector can be directly concatenated into the first vector.
The terminal device performs similarity calculation between the first vector and the second vector corresponding to each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary, and determines the decoded word corresponding to the input word vector according to the first similarity.
In some possible implementations, the terminal device can directly control the decoder 2022 to determine the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
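One decoding step of the similarity-based scheme described above might look like the following sketch, in which the first vector is built by simple concatenation and the first similarity is computed as cosine similarity; all names, shapes, and the random dictionary vectors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def decode_step(label, attention_vec, hidden_vec, dict_vecs, dict_words, token):
    if label == 0:
        # First label: the word is judged correct, so output it unchanged.
        return token
    # Second label: concatenate label, attention vector, and hidden layer
    # vector into the first vector (one possible construction noted above).
    first_vec = torch.cat([torch.tensor([float(label)]), attention_vec, hidden_vec])
    # First similarity against each dictionary word's second vector.
    sims = F.cosine_similarity(first_vec.unsqueeze(0), dict_vecs, dim=-1)
    return dict_words[int(sims.argmax())]           # word with the highest first similarity

dict_words = ["雪", "薛", "天"]
dict_vecs = torch.randn(3, 1 + 4 + 4)               # second vectors, shape (m, 1 + att + hid)
word = decode_step(1, torch.randn(4), torch.randn(4), dict_vecs, dict_words, "学")
```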
In other possible implementations, to improve decoding accuracy, the terminal device can also perform a comprehensive comparison combined with knowledge from other domains to obtain a target similarity, and determine the decoded word corresponding to the input word vector according to the target similarity. The knowledge from other domains may include one or more of pinyin similarity, glyph similarity, edit distance, and the like.
For example, taking Chinese text as an example, assume the input word vector includes a pinyin word vector and a glyph word vector.
The terminal device can perform similarity calculation between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain the pinyin similarity corresponding to each word in the preset dictionary.
The terminal device can perform similarity calculation between the glyph word vector and the glyph word vector corresponding to each word in the preset dictionary to obtain the glyph similarity corresponding to each word in the preset dictionary.
The terminal device performs edit distance calculation between the word corresponding to the input word vector and each word in the preset dictionary to obtain the edit distance corresponding to each word in the preset dictionary.
The edit distance between two strings is the minimum number of editing operations required to transform one string into the other. Edit distance has a very wide range of applications, especially in similarity problems such as text error correction and plagiarism detection.
The editing operations counted in the edit distance are of three kinds: insertion, deletion, and substitution. The edit distance between two strings is usually computed with a dynamic programming algorithm.
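The edit distance mentioned above can be computed with the classic dynamic programming algorithm; the sketch below is the standard Levenshtein formulation over insertion, deletion, and substitution, not code taken from the embodiment.

```python
def edit_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                                  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(edit_distance("学习", "薛之谦"))  # 3
```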
The terminal device performs weighted summation of the first similarity, the pinyin similarity, the glyph similarity, and the edit distance corresponding to each word in the preset dictionary to obtain the target similarity corresponding to each word in the preset dictionary. The first weight value corresponding to the first similarity, the second weight value corresponding to the pinyin similarity, the third weight value corresponding to the glyph similarity, and the fourth weight value corresponding to the edit distance are all preset values.
It can be understood that, in the similarity calculations mentioned above, an appropriate similarity algorithm can be selected according to the actual situation; for example, similarity may be computed as cosine distance, as Euclidean distance, or with other similarity algorithms.
After the target similarity is calculated, the terminal device can determine the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
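As a sketch of the weighted summation described above, the function below combines the four terms with preset weights. The weight values are illustrative assumptions, and the edit distance is negated so that a smaller distance raises the target similarity; this sign convention is an assumption of the sketch, since the embodiment only states that the four terms are weighted and summed.

```python
def target_similarity(first_sim, pinyin_sim, glyph_sim, edit_dist,
                      w1=0.4, w2=0.25, w3=0.25, w4=0.1):
    # Preset weights w1..w4 for first, pinyin, glyph similarity and edit distance.
    return w1 * first_sim + w2 * pinyin_sim + w3 * glyph_sim - w4 * edit_dist

# Hypothetical per-candidate scores: (first_sim, pinyin_sim, glyph_sim, edit_dist).
candidates = {"薛": (0.92, 0.95, 0.60, 1), "雪": (0.88, 0.95, 0.40, 1)}
best = max(candidates, key=lambda w: target_similarity(*candidates[w]))
print(best)  # dictionary word with the highest target similarity: "薛" here
```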
After obtaining the decoded word corresponding to each input word vector, the terminal device determines the error-corrected text according to the decoded words corresponding to the input word vectors.
The text error correction method of this embodiment is described below in combination with a specific application scenario.
Assume the text error correction system includes an encoder-decoder model, an error correction judgment model, and a word vector embedding model. In the encoder-decoder model, the encoder and the decoder are both LSTM models; the error correction judgment model is a Bert model followed by a classifier; and the word vector embedding model includes a pinyin word vector model, a glyph word vector model, and an n-gram language model.
Before performing text error correction, the text error correction system is trained with a training corpus. The training corpus may include collected labeled text recognized by automatic speech recognition (ASR), public data from newspapers, general entity vocabulary, and other public corpora, such as the Sighan Bakeoff corpus.
In the process of acquiring the training corpus, the corpus can be preprocessed, including:
1.1. collecting and annotating the corpus;
1.2. converting traditional Chinese characters in the collected corpus into simplified Chinese characters;
1.3. converting Chinese characters into pinyin;
1.4. corpus quality screening and corpus analysis statistics.
The above preprocessing improves the quality of the training corpus. Afterwards, the training corpus can be converted into training word vectors by the word vector embedding model, and the training word vectors are used to train the encoder, the decoder, and the error correction judgment module.
In each training round, the first loss value corresponding to the encoder, the second loss value corresponding to the decoder, and the third loss value corresponding to the error correction judgment module are calculated; the encoder is iteratively updated according to the first loss value, the decoder is iteratively updated according to the second loss value, and the error correction judgment model is iteratively updated according to the third loss value.
Training is repeated until a preset training stop condition is reached. The training stop condition can be set according to the actual situation; for example, it may be that the number of training iterations reaches a preset number, or that the first, second, and third loss values are all below a preset loss threshold, or some other condition.
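Purely as an illustration of updating the three components from their respective losses, the sketch below uses trivial stand-in modules and placeholder objectives; the real system would compute the first, second, and third loss values from the encoder, the decoder, and the error correction judgment model with appropriate supervised objectives.

```python
import torch
import torch.nn as nn

# Stand-in modules for the encoder, decoder, and error correction judgment model.
encoder, decoder, judge = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2)
opts = {m: torch.optim.Adam(m.parameters()) for m in (encoder, decoder, judge)}

x = torch.randn(4, 8)                       # hypothetical batch of training word vectors
losses = {
    encoder: encoder(x).pow(2).mean(),      # first loss value (placeholder objective)
    decoder: decoder(x).pow(2).mean(),      # second loss value (placeholder objective)
    judge: judge(x).pow(2).mean(),          # third loss value (placeholder objective)
}
for module, loss in losses.items():         # each component is updated from its own loss
    opts[module].zero_grad()
    loss.backward()
    opts[module].step()
```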
After training is complete, the trained text error correction system can be used for text error correction.
As shown in FIG. 3, assume the input text is "听学习前的歌", with each character treated as one word. The input text is fed into the word vector embedding model to obtain the word vector sequence corresponding to the text to be corrected. The word vector sequence includes the input word vector corresponding to each word (i.e., word vectors A1 to A6 in FIG. 3), and each input word vector is obtained by concatenating a pinyin word vector, a glyph word vector, and a context word vector.
The terminal device inputs word vectors A1 to A6 into the encoder one by one. Each time an input word vector is input into the encoder, the encoder updates its first hidden layer vector according to that input word vector.
As shown in FIG. 4, the initial value of the encoder's first hidden layer vector is h0. After the terminal device inputs word vector A1 into the encoder, the encoder updates h0 to h1 according to A1; after word vector A2 is input, the encoder updates h1 to h2; and so on, until word vector A6 is input and the encoder updates h5 to h6. The terminal device takes h6 as the semantic vector and controls the encoder to pass h6 to the decoder as the initial value s1 of the decoder's second hidden layer vector.
As shown in FIG. 5, the terminal device inputs word vectors A1 to A6 one by one into the Bert model of the error correction judgment model. The Bert model outputs the first output values corresponding to word vectors A1 to A6, and each first output value is input into the binary classifier to obtain the error correction labels corresponding to word vectors A1 to A6. The error correction labels include the first label, whose value is 0, and the second label, whose value is 1.
Here, the error correction labels corresponding to word vectors A1, A5, and A6 are 0, indicating that the words corresponding to A1, A5, and A6 do not need correction; the error correction labels corresponding to word vectors A2, A3, and A4 are 1, indicating that the words corresponding to A2, A3, and A4 need correction.
The terminal device then inputs word vectors A1 to A6 and the error correction label corresponding to each input word vector into the decoder to obtain the decoded word corresponding to each input word vector. The specific process is as follows:
2.1. The terminal device inputs word vector A1 and its error correction label into the decoder, and the decoder updates s1 to s2 according to A1. Since the error correction label corresponding to A1 is 0, the decoder outputs the word corresponding to A1, "听".
2.2. The terminal device inputs word vector A2 and its error correction label into the decoder, and the decoder updates s2 to s3 according to A2. Since the error correction label corresponding to A2 is 1, the decoder calculates the attention vector b1 corresponding to A2 from s2 and h1 to h6, and then constructs the first vector c1 corresponding to A2 from A2's error correction label, s2, and b1.
As shown in FIG. 6, assume there are m words in the preset dictionary, where m is a preset positive integer; each word then has a corresponding second vector. For example, "雪" corresponds to second vector d1, "薛" to d2, "天" to d3, "晶" to d4, and "劈" to dm.
The terminal device performs similarity calculation between the first vector c1 and the second vector of each word in the preset dictionary to obtain the first similarity corresponding to each word in the preset dictionary. Here, the first similarity between c1 and d2 is the highest, so the decoder outputs the word corresponding to d2, "薛".
2.3. The terminal device inputs word vector A3 and its error correction label into the decoder, and the decoder updates s3 to s4 according to A3. Since the error correction label corresponding to A3 is 1, the decoder calculates the attention vector b2 corresponding to A3 from s3 and h1 to h6, and then constructs the first vector c2 corresponding to A3 from A3's error correction label, s3, and b2.
The terminal device performs similarity calculation between the first vector c2 and the second vector of each word in the preset dictionary to obtain the first similarity corresponding to each word, and the decoder outputs the word with the highest first similarity in the preset dictionary, "之".
2.4. The terminal device inputs word vector A4 and its error correction label into the decoder, and the decoder updates s4 to s5 according to A4. Since the error correction label corresponding to A4 is 1, the decoder calculates the attention vector b3 corresponding to A4 from s4 and h1 to h6, and then constructs the first vector c3 corresponding to A4 from A4's error correction label, s4, and b3.
The terminal device performs similarity calculation between the first vector c3 and the second vector of each word in the preset dictionary to obtain the first similarity corresponding to each word, and the decoder outputs the word with the highest first similarity in the preset dictionary, "谦".
2.5. The terminal device inputs word vector A5 and its error correction label into the decoder, and the decoder updates s5 to s6 according to A5. Since the error correction label corresponding to A5 is 0, the decoder outputs the word corresponding to A5, "的".
2.6. The terminal device inputs word vector A6 and its error correction label into the decoder, and the decoder updates s6 to s7 according to A6. Since the error correction label corresponding to A6 is 0, the decoder outputs the word corresponding to A6, "歌".
After obtaining the decoded words "听", "薛", "之", "谦", "的", "歌" corresponding to word vectors A1 to A6, the terminal device arranges the decoded words in order to obtain the corrected text "听薛之谦的歌" (i.e., "listen to Xue Zhiqian's songs", where the near-homophonous "学习前" in the input has been corrected to the name "薛之谦").
In summary, the embodiments of this application provide a text error correction method. Before the decoder in the encoder-decoder model performs decoding, an error correction judgment model is first used to classify each input word vector by label, obtaining the error correction label of each input word vector. The error correction label indicates whether the corresponding word needs to be corrected. After obtaining the error correction label corresponding to each input word vector in the input text, the terminal device inputs these labels into the decoder, so that the decoder can decode in a targeted manner according to the error correction label of each input word vector and regulate the decoding process, thereby reducing misjudgments by the decoder, improving the accuracy of text error correction, and solving the problem that the decoding process of the current encoder-decoder model is uncontrollable and prone to misjudgment.
After the input text is corrected with the above text error correction method, corrected text is obtained. The corrected text can be widely used in various downstream tasks, such as word segmentation, part-of-speech tagging, entity recognition, intent classification, slot filling, dialogue management, and text generation.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
Referring to FIG. 7, an embodiment of this application provides a text error correction apparatus. For ease of description, only the parts related to this application are shown. As shown in FIG. 7, the text error correction apparatus includes:
an embedding module 701, configured to perform word vector conversion on input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes an input word vector corresponding to each word in the input text;
a semantic module 702, configured to input the word vector sequence into an encoder of an encoder-decoder model to obtain a semantic vector;
a label module 703, configured to input the word vector sequence into an error correction judgment model to obtain an error correction label corresponding to each input word vector; and
an error correction module 704, configured to input the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into a decoder of the encoder-decoder model to obtain error-corrected text.
Further, the error correction module 704 includes:
a vector input submodule, configured to sequentially input the input word vectors in the word vector sequence into the decoder of the encoder-decoder model;
a hidden update submodule, configured to, each time an input word vector is input into the decoder, calculate, according to the input word vector and the second hidden layer vector corresponding to the input word vector, the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector, where the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
a first output submodule, configured to, if the error correction label corresponding to the input word vector is a first label, control the decoder to take the word corresponding to the input word vector as the decoded word corresponding to the input word vector, where the error correction labels include the first label and a second label;
a first vector submodule, configured to, if the error correction label corresponding to the input word vector is the second label, construct a first vector according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector;
a first calculation submodule, configured to perform similarity calculation between the first vector and the second vector corresponding to each word in a preset dictionary to obtain a first similarity corresponding to each word in the preset dictionary;
a second output submodule, configured to determine the decoded word corresponding to the input word vector according to the first similarity; and
a text integration submodule, configured to determine the error-corrected text according to the decoded words corresponding to the input word vectors.
Further, the second output submodule is specifically configured to take the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
Further, the input word vector includes a pinyin word vector and a glyph word vector;
correspondingly, the second output submodule includes:
a second calculation submodule, configured to perform similarity calculation between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain a pinyin similarity corresponding to each word in the preset dictionary;
a third calculation submodule, configured to perform similarity calculation between the glyph word vector in the input word vector and the glyph word vector corresponding to each word in the preset dictionary to obtain a glyph similarity corresponding to each word in the preset dictionary;
a fourth calculation submodule, configured to perform edit distance calculation between the word corresponding to the input word vector and each word in the preset dictionary to obtain an edit distance corresponding to each word in the preset dictionary;
a target calculation submodule, configured to perform weighted summation of the first similarity, the pinyin similarity, the glyph similarity, and the edit distance corresponding to each word in the preset dictionary to obtain a target similarity corresponding to each word in the preset dictionary; and
a target output submodule, configured to take the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
Further, the error correction judgment model includes a bidirectional encoder representation model and a binary classifier;
correspondingly, the label module 703 includes:
a pre-error-correction submodule, configured to sequentially input each input word vector in the word vector sequence into the bidirectional encoder representation model to obtain a first output value corresponding to each input word vector; and
a label classification submodule, configured to separately input the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
It should be noted that, since the information exchange and execution processes between the above apparatuses/units are based on the same concept as the method embodiments of this application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.
Referring to FIG. 8, an embodiment of this application further provides a terminal device. The terminal device 8 includes a processor 80, a memory 81, and a computer program 82 stored in the memory 81 and runnable on the processor 80. When the processor 80 executes the computer program 82, the steps in the above text error correction method embodiment are implemented, for example, steps S101 to S104 shown in FIG. 1. Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above apparatus embodiments are realized, for example, the functions of modules 701 to 704 shown in FIG. 7.
Exemplarily, the computer program 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8. For example, the computer program 82 may be divided into an embedding module, a semantic module, a label module, and an error correction module, with the specific functions of each module as follows:
the embedding module, configured to perform word vector conversion on input text to obtain a word vector sequence corresponding to the input text, where the word vector sequence includes an input word vector corresponding to each word in the input text;
the semantic module, configured to input the word vector sequence into an encoder of an encoder-decoder model to obtain a semantic vector;
the label module, configured to input the word vector sequence into an error correction judgment model to obtain an error correction label corresponding to each input word vector; and
the error correction module, configured to input the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into a decoder of the encoder-decoder model to obtain error-corrected text.
The terminal device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 80 and the memory 81. Those skilled in the art can understand that FIG. 8 is only an example of the terminal device 8 and does not constitute a limitation on the terminal device 8; it may include more or fewer components than shown, or a combination of certain components, or different components; for example, the terminal device may also include input and output devices, network access devices, buses, and so on.
The processor 80 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used to store the computer program and other programs and data required by the terminal device. The memory 81 may also be used to temporarily store data that has been output or will be output.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of this application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, each embodiment is described with its own emphasis. For parts not detailed or described in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Professionals may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
In the embodiments provided in this application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are merely illustrative; for example, the division of the modules or units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the various embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application may also be completed by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the steps of the above method embodiments can be implemented. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or substitute equivalents for some of the technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included in the protection scope of this application.

Claims (12)

  1. A text error correction method, comprising:
    a terminal device performing word vector conversion on input text to obtain a word vector sequence corresponding to the input text, wherein the word vector sequence comprises an input word vector corresponding to each word in the input text;
    the terminal device inputting the word vector sequence into an encoder of an encoder-decoder model to obtain a semantic vector;
    the terminal device inputting the word vector sequence into an error correction judgment model to obtain an error correction label corresponding to each input word vector; and
    the terminal device inputting the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into a decoder of the encoder-decoder model to obtain error-corrected text.
  2. The text error correction method according to claim 1, wherein inputting, by the terminal device, the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into the decoder of the encoder-decoder model to obtain the error-corrected text comprises:
    the terminal device sequentially inputting the input word vectors in the word vector sequence into the decoder of the encoder-decoder model;
    each time an input word vector is input into the decoder, the terminal device calculating, according to the input word vector and the second hidden layer vector corresponding to the input word vector, the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector, wherein the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
    if the error correction label corresponding to the input word vector is a first label, the terminal device controlling the decoder to take the word corresponding to the input word vector as the decoded word corresponding to the input word vector, wherein the error correction labels comprise the first label and a second label;
    if the error correction label corresponding to the input word vector is the second label, the terminal device constructing a first vector according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector;
    the terminal device performing similarity calculation between the first vector and the second vector corresponding to each word in a preset dictionary to obtain a first similarity corresponding to each word in the preset dictionary;
    the terminal device determining the decoded word corresponding to the input word vector according to the first similarity; and
    the terminal device determining the error-corrected text according to the decoded words corresponding to the input word vectors.
  3. The text error correction method according to claim 2, wherein determining, by the terminal device, the decoded word corresponding to the input word vector according to the first similarity comprises:
    the terminal device taking the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  4. The text error correction method according to claim 2, wherein the input word vector comprises a pinyin word vector and a glyph word vector;
    correspondingly, determining, by the terminal device, the decoded word corresponding to the input word vector according to the first similarity comprises:
    the terminal device performing similarity calculation between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain a pinyin similarity corresponding to each word in the preset dictionary;
    the terminal device performing similarity calculation between the glyph word vector in the input word vector and the glyph word vector corresponding to each word in the preset dictionary to obtain a glyph similarity corresponding to each word in the preset dictionary;
    the terminal device performing edit distance calculation between the word corresponding to the input word vector and each word in the preset dictionary to obtain an edit distance corresponding to each word in the preset dictionary;
    the terminal device performing weighted summation of the first similarity, the pinyin similarity, the glyph similarity, and the edit distance corresponding to each word in the preset dictionary to obtain a target similarity corresponding to each word in the preset dictionary; and
    the terminal device taking the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  5. The text error correction method according to claim 1, wherein the error correction judgment model comprises a bidirectional encoder representation model and a binary classifier;
    correspondingly, inputting, by the terminal device, the word vector sequence into the error correction judgment model to obtain the error correction label corresponding to each input word vector comprises:
    the terminal device sequentially inputting each input word vector in the word vector sequence into the error correction judgment model to obtain a first output value corresponding to each input word vector; and
    the terminal device separately inputting the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
  6. A text error correction apparatus, comprising:
    an embedding module, configured to perform word vector conversion on input text to obtain a word vector sequence corresponding to the input text, wherein the word vector sequence comprises an input word vector corresponding to each word in the input text;
    a semantic module, configured to input the word vector sequence into an encoder of an encoder-decoder model to obtain a semantic vector;
    a label module, configured to input the word vector sequence into an error correction judgment model to obtain an error correction label corresponding to each input word vector; and
    an error correction module, configured to input the word vector sequence, the semantic vector, and the error correction label corresponding to each input word vector into a decoder of the encoder-decoder model to obtain error-corrected text.
  7. The text error correction apparatus according to claim 6, wherein the error correction module comprises:
    a vector input submodule, configured to sequentially input the input word vectors in the word vector sequence into the decoder of the encoder-decoder model;
    a hidden update submodule, configured to, each time an input word vector is input into the decoder, calculate, according to the input word vector and the second hidden layer vector corresponding to the input word vector, the attention vector corresponding to the input word vector and the second hidden layer vector corresponding to the next input word vector, wherein the second hidden layer vector is the hidden layer vector of the decoder, and the semantic vector is the second hidden layer vector corresponding to the first input word vector;
    a first output submodule, configured to, if the error correction label corresponding to the input word vector is a first label, control the decoder to take the word corresponding to the input word vector as the decoded word corresponding to the input word vector, wherein the error correction labels comprise the first label and a second label;
    a first vector submodule, configured to, if the error correction label corresponding to the input word vector is the second label, construct a first vector according to the error correction label corresponding to the input word vector, the attention vector corresponding to the input word vector, and the second hidden layer vector corresponding to the input word vector;
    a first calculation submodule, configured to perform similarity calculation between the first vector and the second vector corresponding to each word in a preset dictionary to obtain a first similarity corresponding to each word in the preset dictionary;
    a second output submodule, configured to determine the decoded word corresponding to the input word vector according to the first similarity; and
    a text integration submodule, configured to determine the error-corrected text according to the decoded words corresponding to the input word vectors.
  8. The text error correction apparatus according to claim 7, wherein the second output submodule is specifically configured to take the word with the highest first similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  9. The text error correction apparatus according to claim 7, wherein the input word vector comprises a pinyin word vector and a glyph word vector;
    correspondingly, the second output submodule comprises:
    a second calculation submodule, configured to perform similarity calculation between the pinyin word vector in the input word vector and the pinyin word vector corresponding to each word in the preset dictionary to obtain a pinyin similarity corresponding to each word in the preset dictionary;
    a third calculation submodule, configured to perform similarity calculation between the glyph word vector in the input word vector and the glyph word vector corresponding to each word in the preset dictionary to obtain a glyph similarity corresponding to each word in the preset dictionary;
    a fourth calculation submodule, configured to perform edit distance calculation between the word corresponding to the input word vector and each word in the preset dictionary to obtain an edit distance corresponding to each word in the preset dictionary;
    a target calculation submodule, configured to perform weighted summation of the first similarity, the pinyin similarity, the glyph similarity, and the edit distance corresponding to each word in the preset dictionary to obtain a target similarity corresponding to each word in the preset dictionary; and
    a target output submodule, configured to take the word with the highest target similarity in the preset dictionary as the decoded word corresponding to the input word vector.
  10. The text error correction apparatus according to claim 6, wherein the error correction judgment model comprises a bidirectional encoder representation model and a binary classifier;
    correspondingly, the label module comprises:
    a pre-error-correction submodule, configured to sequentially input each input word vector in the word vector sequence into the bidirectional encoder representation model to obtain a first output value corresponding to each input word vector; and
    a label classification submodule, configured to separately input the first output value corresponding to each input word vector into the binary classifier to obtain the error correction label corresponding to each input word vector.
  11. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein when the processor executes the computer program, the terminal device implements the steps of the method according to any one of claims 1 to 5.
  12. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, a terminal device implements the steps of the method according to any one of claims 1 to 5.
PCT/CN2020/125219 2020-02-21 2020-10-30 Text error correction method, apparatus, terminal device, and computer storage medium WO2021164310A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010110410.7 2020-02-21
CN202010110410.7A CN113297833A (zh) Text error correction method, apparatus, terminal device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2021164310A1 true WO2021164310A1 (zh) 2021-08-26

Family

ID=77318559

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125219 WO2021164310A1 (zh) Text error correction method, apparatus, terminal device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN113297833A (zh)
WO (1) WO2021164310A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515934A (zh) * 2021-04-28 2021-10-19 新东方教育科技集团有限公司 文本纠错方法、装置、存储介质及电子设备
CN114550185A (zh) * 2022-04-19 2022-05-27 腾讯科技(深圳)有限公司 一种文档生成的方法、相关装置、设备以及存储介质
CN114564942A (zh) * 2021-09-06 2022-05-31 北京数美时代科技有限公司 一种用于监管领域的文本纠错方法、存储介质和装置
CN115906815A (zh) * 2023-03-08 2023-04-04 北京语言大学 一种用于修改一种或多种类型错误句子的纠错方法及装置
CN116011682A (zh) * 2023-02-22 2023-04-25 合肥本源量子计算科技有限责任公司 一种气象数据预测方法、装置、存储介质及电子装置
CN116991874A (zh) * 2023-09-26 2023-11-03 海信集团控股股份有限公司 一种文本纠错、基于大模型的sql语句生成方法及设备
CN117151084A (zh) * 2023-10-31 2023-12-01 山东齐鲁壹点传媒有限公司 一种中文拼写、语法纠错方法、存储介质及设备

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807973B (zh) * 2021-09-16 2023-07-25 平安科技(深圳)有限公司 文本纠错方法、装置、电子设备及计算机可读存储介质
CN114611494B (zh) * 2022-03-17 2024-02-02 平安科技(深圳)有限公司 文本纠错方法、装置、设备及存储介质
CN115879458A (zh) * 2022-04-08 2023-03-31 北京中关村科金技术有限公司 一种语料扩充方法、装置及存储介质
CN114548080B (zh) * 2022-04-24 2022-07-15 长沙市智为信息技术有限公司 一种基于分词增强的中文错字校正方法及系统
CN114781377B (zh) * 2022-06-20 2022-09-09 联通(广东)产业互联网有限公司 非对齐文本的纠错模型、训练及纠错方法
CN115858776B (zh) * 2022-10-31 2023-06-23 北京数美时代科技有限公司 一种变体文本分类识别方法、系统、存储介质和电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463928A (zh) * 2017-07-28 2017-12-12 顺丰科技有限公司 基于ocr和双向lstm的文字序列纠错算法、系统及其设备
CN107992211A (zh) * 2017-12-08 2018-05-04 中山大学 一种基于cnn-lstm的汉字拼写错别字改正方法
US20180349327A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing)Co., Ltd. Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN110489760A (zh) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 基于深度神经网络文本自动校对方法及装置
CN110502754A (zh) * 2019-08-26 2019-11-26 腾讯科技(深圳)有限公司 文本处理方法和装置
CN110765772A (zh) * 2019-10-12 2020-02-07 北京工商大学 拼音作为特征的中文语音识别后的文本神经网络纠错模型

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180349327A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing)Co., Ltd. Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN107463928A (zh) * 2017-07-28 2017-12-12 顺丰科技有限公司 基于ocr和双向lstm的文字序列纠错算法、系统及其设备
CN107992211A (zh) * 2017-12-08 2018-05-04 中山大学 一种基于cnn-lstm的汉字拼写错别字改正方法
CN110502754A (zh) * 2019-08-26 2019-11-26 腾讯科技(深圳)有限公司 文本处理方法和装置
CN110489760A (zh) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 基于深度神经网络文本自动校对方法及装置
CN110765772A (zh) * 2019-10-12 2020-02-07 北京工商大学 拼音作为特征的中文语音识别后的文本神经网络纠错模型

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515934A (zh) * 2021-04-28 2021-10-19 新东方教育科技集团有限公司 文本纠错方法、装置、存储介质及电子设备
CN114564942A (zh) * 2021-09-06 2022-05-31 北京数美时代科技有限公司 一种用于监管领域的文本纠错方法、存储介质和装置
CN114550185A (zh) * 2022-04-19 2022-05-27 腾讯科技(深圳)有限公司 一种文档生成的方法、相关装置、设备以及存储介质
CN116011682A (zh) * 2023-02-22 2023-04-25 合肥本源量子计算科技有限责任公司 一种气象数据预测方法、装置、存储介质及电子装置
CN115906815A (zh) * 2023-03-08 2023-04-04 北京语言大学 一种用于修改一种或多种类型错误句子的纠错方法及装置
CN115906815B (zh) * 2023-03-08 2023-06-27 北京语言大学 一种用于修改一种或多种类型错误句子的纠错方法及装置
CN116991874A (zh) * 2023-09-26 2023-11-03 海信集团控股股份有限公司 一种文本纠错、基于大模型的sql语句生成方法及设备
CN116991874B (zh) * 2023-09-26 2024-03-01 海信集团控股股份有限公司 一种文本纠错、基于大模型的sql语句生成方法及设备
CN117151084A (zh) * 2023-10-31 2023-12-01 山东齐鲁壹点传媒有限公司 一种中文拼写、语法纠错方法、存储介质及设备
CN117151084B (zh) * 2023-10-31 2024-02-23 山东齐鲁壹点传媒有限公司 一种中文拼写、语法纠错方法、存储介质及设备

Also Published As

Publication number Publication date
CN113297833A (zh) 2021-08-24

Similar Documents

Publication Publication Date Title
WO2021164310A1 (zh) 文本纠错方法、装置、终端设备及计算机存储介质
CN111309915B (zh) 联合学习的自然语言训练方法、系统、设备及存储介质
CN111639175B (zh) 一种自监督的对话文本摘要方法及系统
CN108549646B (zh) 一种基于胶囊的神经网络机器翻译系统、信息数据处理终端
CN110196978A (zh) 一种关注关联词的实体关系抽取方法
WO2022142041A1 (zh) 意图识别模型的训练方法、装置、计算机设备和存储介质
CN113283244B (zh) 一种基于预训练模型的招投标数据命名实体识别方法
CN112528637B (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
CN109918681B (zh) 一种基于汉字-拼音的融合问题语义匹配方法
CN113158656B (zh) 讽刺内容识别方法、装置、电子设备以及存储介质
CN113743101B (zh) 文本纠错方法、装置、电子设备和计算机存储介质
CN112560506B (zh) 文本语义解析方法、装置、终端设备及存储介质
CN114676255A (zh) 文本处理方法、装置、设备、存储介质及计算机程序产品
CN114218945A (zh) 实体识别方法、装置、服务器及存储介质
CN113673228A (zh) 文本纠错方法、装置、计算机存储介质及计算机程序产品
CN113779277A (zh) 用于生成文本的方法和装置
CN115017890A (zh) 基于字音字形相似的文本纠错方法和装置
CN115565177A (zh) 文字识别模型训练、文字识别方法、装置、设备及介质
CN116416480A (zh) 一种基于多模板提示学习的视觉分类方法和装置
CN116956835A (zh) 一种基于预训练语言模型的文书生成方法
WO2021159803A1 (zh) 文本摘要生成方法、装置、计算机设备及可读存储介质
CN113449081A (zh) 文本特征的提取方法、装置、计算机设备及存储介质
CN111462734A (zh) 语义槽填充模型训练方法及系统
CN115906854A (zh) 一种基于多级对抗的跨语言命名实体识别模型训练方法
WO2022078348A1 (zh) 邮件内容提取方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20920332

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20920332

Country of ref document: EP

Kind code of ref document: A1