WO2022134356A1 - Intelligent sentence error correction method and apparatus, computer device, and storage medium - Google Patents

Intelligent sentence error correction method and apparatus, computer device, and storage medium

Info

Publication number: WO2022134356A1
Authority: WO — WIPO (PCT)
Application number: PCT/CN2021/083955
Other languages: English (en), French (fr)
Prior art keywords: word, sentence, mask, words, preset
Inventors: 邓悦, 郑立颖, 徐亮
Original Assignee: 平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022134356A1

Classifications

    • G06F40/205 — Handling natural language data; Natural language analysis; Parsing
    • G06F40/232 — Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/04 — Neural networks; Architecture, e.g. interconnection topology
    • G06N3/08 — Neural networks; Learning methods
Definitions

  • The present application relates to the technical field of detection models, and in particular to an intelligent sentence error correction method and apparatus, a computer device, and a storage medium.
  • The embodiments of the present application provide an intelligent sentence error correction method and apparatus, a computer device, and a storage medium, to solve the problem of low error correction accuracy for sentences containing errors.
  • An intelligent sentence error correction method includes:
  • inputting the erroneous sentence into a preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, obtaining the dependency probability between each word and its corresponding subsidiary words, and associating each word with its corresponding dependency probability;
  • a subsidiary word is at least one word located immediately before or after a given word;
  • after masking the words to be corrected, obtaining a masked sentence to be corrected, inputting the masked sentence to be corrected into a preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement word corresponding to each word to be corrected;
  • replacing each word to be corrected with its corresponding predicted replacement word, and recording the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
  • An intelligent sentence error correction apparatus includes:
  • an error correction instruction receiving module, configured to receive a sentence error correction instruction containing an erroneous sentence;
  • a dependency probability determination module, configured to input the erroneous sentence into the preset dependency detection model, predict the dependency relationships between the words in the erroneous sentence, obtain the dependency probability between each word and its corresponding subsidiary words, and associate each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
  • a to-be-corrected word determination module, configured to compare the dependency probability associated with each word against a preset dependency threshold, and record the words whose associated dependency probability is smaller than the preset dependency threshold as words to be corrected;
  • an error correction prediction module, configured to mask the words to be corrected to obtain a masked sentence to be corrected, input the masked sentence to be corrected into the preset language model, perform error correction prediction on the words to be corrected, and obtain the predicted replacement words corresponding to the words to be corrected;
  • a word replacement module, configured to replace each word to be corrected with its corresponding predicted replacement word, and record the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above method when executing the computer-readable instructions.
  • One or more readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above method.
  • The present application reduces the error correction workload of the preset language model and improves the efficiency of intelligent sentence error correction.
  • FIG. 1 is a schematic diagram of an application environment of the intelligent sentence error correction method in an embodiment of the present application;
  • FIG. 2 is a flowchart of the intelligent sentence error correction method in an embodiment of the present application;
  • FIG. 3 is a flowchart of step S20 of the intelligent sentence error correction method in an embodiment of the present application;
  • FIG. 4 is a flowchart of step S201 of the intelligent sentence error correction method in an embodiment of the present application;
  • FIG. 5 is a flowchart of step S40 of the intelligent sentence error correction method in an embodiment of the present application;
  • FIG. 6 is a schematic block diagram of the intelligent sentence error correction apparatus in an embodiment of the present application;
  • FIG. 7 is a schematic block diagram of the dependency probability determination module of the intelligent sentence error correction apparatus in an embodiment of the present application;
  • FIG. 8 is a schematic block diagram of the word information processing unit of the intelligent sentence error correction apparatus in an embodiment of the present application;
  • FIG. 9 is a schematic block diagram of the error correction prediction module of the intelligent sentence error correction apparatus in an embodiment of the present application;
  • FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • The intelligent sentence error correction method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1.
  • The method is applied in an intelligent sentence error correction system that includes a client and a server, as shown in FIG. 1, which communicate over a network to solve the problem of low error correction accuracy for erroneous sentences.
  • The client refers to the program that corresponds to the server and provides local services to the user.
  • Clients can be installed on, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • The server can be implemented as an independent server or as a server cluster composed of multiple servers.
  • An intelligent sentence error correction method is provided and, by way of example, described as applied to the server in FIG. 1; it includes the following steps:
  • The sentence error correction instruction can be sent actively by the user, or triggered automatically when the user types an erroneous sentence into the system.
  • Erroneous sentences are sentences containing typos or grammatical errors.
  • S20: Input the erroneous sentence into a preset dependency detection model, predict the dependency relationships between the words in the erroneous sentence, obtain the dependency probability between each word and its corresponding subsidiary words, and associate each word with its corresponding dependency probability; a subsidiary word is at least one word located immediately before or after a given word.
  • The preset dependency detection model is used to determine the dependency probability between each word and its corresponding subsidiary words; the dependency probability is the probability that a word depends on the presence of its subsidiary words in the erroneous sentence.
  • A subsidiary word can be the single adjacent word before or after a word, the two adjacent words before or after a word, or, for example, the one adjacent word before a word together with the two adjacent words after it. For example, if a sentence is "I ate a lot at noon today", then for the word "noon" the corresponding subsidiary words can be "today" and "ate", or "I", "today", and "ate".
  • Step S20 includes:
  • S201: Perform word information processing on the erroneous sentence to obtain a forward hidden-layer vector and a reverse hidden-layer vector corresponding to the erroneous sentence; the forward hidden-layer vector is determined by ordering the words of the erroneous sentence in forward order, and the reverse hidden-layer vector is determined by ordering the words of the erroneous sentence in reverse order.
  • Word information processing refers to operations including word segmentation, word embedding, and context information extraction on the erroneous sentence.
  • Step S201 includes:
  • S2011: After performing word segmentation on the erroneous sentence, determine the word sequence corresponding to the erroneous sentence.
  • For example, the erroneous sentence can be segmented with the jieba ("stuttering") segmentation method to obtain each word of the erroneous sentence, and, after each word is replaced with its corresponding code from a preset word dictionary, the word sequence corresponding to the erroneous sentence is determined. The preset word dictionary is obtained by pre-encoding each word in other sample documents; in step S2011, the code corresponding to each word of the erroneous sentence can therefore be looked up in the preset word dictionary and substituted, yielding the word sequence.
  • S2012: Perform word embedding on the word sequence by a word embedding method to obtain the word vector sequence corresponding to the word sequence.
  • Word embedding refers to replacing each word in the word sequence with a word vector that the preset dependency detection model can recognize and train on.
  • Specifically, a word embedding method (such as the skip-gram model of Word2Vec, the CBOW model of Word2Vec, or GloVe word vectors) performs word embedding on each word in the word sequence, that is, replaces each word with a word vector recognizable and trainable by the preset dependency detection model, yielding the word vector sequence corresponding to the word sequence.
  • S2013: Through the information extraction module of the preset dependency detection model, perform forward context information extraction on the word vector sequence to obtain the forward hidden-layer vector, and simultaneously perform reverse context information extraction on the word vector sequence to obtain the reverse hidden-layer vector.
  • Context information extraction includes extraction of the preceding context and extraction of the following context.
  • The information extraction module is a BiLSTM (Bi-directional Long Short-Term Memory) module.
  • The word vector sequence is arranged in the forward order of the words, so the information extraction module of the preset dependency detection model can extract preceding-context information over the word vector sequence in that forward order — that is, the second word vector of the sequence is determined from the first word vector, and so on — yielding the forward hidden-layer vector.
  • Likewise, the module extracts following-context information over the word vector sequence: starting from the last word vector of the sequence, the second-to-last word vector is determined from the last one, and so on, yielding the reverse hidden-layer vector.
  • S202: Input the forward hidden-layer vector into the first linear transformation layer module of the preset dependency detection model to determine the first word features corresponding to the forward hidden-layer vector; simultaneously, input the reverse hidden-layer vector into the second linear transformation layer module of the preset dependency detection model to determine the second word features corresponding to the reverse hidden-layer vector.
  • Specifically, the first linear transformation layer module performs a linear mapping on the forward hidden-layer vector to determine the first word features, and the second linear transformation layer module performs a linear mapping on the reverse hidden-layer vector to determine the second word features.
  • The first word features represent the subsidiary words and the attached words: the attached words are the words of the erroneous sentence, and a subsidiary word is an adjacent word located before or after an attached word. Since the forward hidden-layer vector is determined by the forward order of the words, for the first word features the earlier word in forward order serves as the subsidiary word and the remaining adjacent words serve as attached words.
  • Let the word feature corresponding to the n-th vector of the forward hidden-layer vector (i.e., the n-th word of the erroneous sentence in forward order) be denoted $\overrightarrow{h_n}$, and the word feature corresponding to the n-th vector of the reverse hidden-layer vector (i.e., the n-th word in reverse order) be denoted $\overleftarrow{h_n}$. For the first word of the erroneous sentence in forward order, its feature in the forward hidden-layer vector is $\overrightarrow{h_1}$, and its corresponding feature in the reverse hidden-layer vector is $\overleftarrow{h_N}$, i.e., the last word feature.
  • The first word features are $h_n^{arc\text{-}head} = W_{arc\text{-}head}\,\overrightarrow{h_n} + b_{arc\text{-}head}$ and the second word features are $h_n^{arc\text{-}dep} = W_{arc\text{-}dep}\,\overleftarrow{h_n} + b_{arc\text{-}dep}$, where $W_{arc\text{-}head}$ and $b_{arc\text{-}head}$ are trainable parameters of the first linear transformation layer module, and $W_{arc\text{-}dep}$ and $b_{arc\text{-}dep}$ are trainable parameters of the second linear transformation layer module.
  • S203: According to the first word features and the second word features, determine the dependency probability between each word and its corresponding subsidiary words through a biaffine attention mechanism.
  • Specifically, after the first word features and the second word features have been determined, all first word features are stacked, and the dependency probability between each word and its corresponding subsidiary words is determined from the second word features and the stacked first word features through the biaffine attention mechanism.
  • The biaffine attention mechanism improves the detection of erroneous words in erroneous sentences: it relates words to one another well and combines context information effectively, improving both the accuracy and the efficiency of erroneous-word detection.
  • The preset dependency threshold may be determined manually from historical sample data; for example, it may be 0.8 or 0.9. It can be adjusted to the accuracy requirements of the specific application scenario: for a scenario that demands higher accuracy in identifying words to correct, the preset dependency threshold can be set to a higher value, such as 0.95.
  • After the dependency relationships between the words of the erroneous sentence have been predicted and each word has been associated with its dependency probability, the dependency probability associated with each word is compared with the preset dependency threshold.
  • When the dependency probability associated with a word is smaller than the preset dependency threshold, it indicates that the word's dependency on the other words in the erroneous sentence is weak — that is, the word may be erroneous — so the word associated with a dependency probability smaller than the preset dependency threshold is recorded as a word to be corrected.
  • Conversely, when the dependency probability associated with a word is greater than or equal to the preset dependency threshold, it indicates that the word has a strong dependency relationship with the other words in the erroneous sentence, that is, the word is not erroneous.
  • Masking refers to covering a word to be corrected with a special character so that error correction prediction can be performed on the word at that position.
  • Step S40 includes the following steps:
  • S401: Convert and encode the masked sentence through the encoding module of the preset language model to obtain the encoding vector corresponding to the masked sentence; the encoding vector includes the error correction encoding vector corresponding to each word to be corrected.
  • The encoding module is essentially a vector conversion layer of the preset language model, used to convert the masked sentence to be corrected into an encoding vector that the model can recognize.
  • Specifically, after the words whose dependency probability is smaller than the preset dependency threshold have been recorded as words to be corrected, the encoding module of the preset language model converts and encodes the masked sentence to obtain the corresponding encoding vector.
  • S402: Perform linear mapping on the encoding vector, and determine the matching score of the error correction encoding vector against each candidate replacement word in a preset character dictionary.
  • The preset character dictionary may be obtained by word segmentation, encoding, and similar processing of historical sample data; it stores the encoding vectors corresponding to multiple words.
  • Specifically, after the masked sentence has been converted and encoded, the encoding vector is linearly mapped and the error correction encoding vector is matched against each encoding vector in the preset character dictionary — for example, by computing the Euclidean distance between the error correction encoding vector and each dictionary encoding vector — and the matching score of the error correction encoding vector against each candidate replacement word is then determined from that distance.
  • S403: Normalize the matching scores to obtain the matching probability corresponding to each matching score.
  • S404: Record the candidate replacement word with the largest matching probability that is also greater than or equal to a preset matching threshold as the predicted replacement word.
  • The preset matching threshold may be, for example, 0.9 or 0.95.
  • If the largest matching probability is below the preset matching threshold, the word with the largest matching probability is sent to a preset receiver — which may be the sender of the sentence error correction instruction — to indicate that no predicted replacement word matching the word at that position has been found.
  • After the masked sentence to be corrected has been input into the preset language model, error correction prediction has been performed on the words to be corrected, and the predicted replacement words have been obtained, each word to be corrected is replaced by its corresponding predicted replacement word, and the replaced masked sentence is recorded as the correct sentence corresponding to the erroneous sentence, completing the entire intelligent sentence error correction process.
  • Before step S40, the method further includes training the preset language model:
  • A correct sample sentence is a sentence with no typos and no syntactic errors; correct sample sentences can be extracted from documents in various application scenarios.
  • Masking the correct sample sentences includes the following steps:
  • selecting first mask words from the correct sample sentence, replacing each first mask word with a preset replacement character, and recording the replaced first mask words as sample prediction words;
  • selecting second mask words from among the sample prediction words, and replacing each selected second mask word with a preset homophone.
  • Preferably, the selection probability for the first mask words can be 12% (an experimental value), i.e., each word in the correct sample sentence is selected as a first mask word with probability 12%. Further, to improve the model's ability to correct erroneous sentences, second mask words are selected from the sample prediction words and replaced with preset homophones, thereby creating homophone-substitution error scenarios (for example, in intelligent multi-turn dialogue, robots often make homophone errors). The first mask words other than the second mask words are not further replaced; only their positions are recorded.
  • In this way, the training covers both the homophone-replacement error correction process and the mask-based error correction process.
  • The sample masked sentence is converted and encoded to obtain the sample encoding vector corresponding to it; the sample encoding vector is linearly mapped to determine the matching score of each first and second mask word against each candidate replacement word in the preset character dictionary; each matching score is normalized to obtain the corresponding matching probability; and the candidate replacement word with the largest matching probability that is also greater than or equal to the preset matching threshold is recorded as the sample prediction word corresponding to the first or second mask word.
  • The sample masked sentence is input into a preset training model containing initial parameters, mask prediction is performed on the sample mask words, and the sample prediction words corresponding to the sample mask words are obtained.
  • The sample mask words are replaced with the sample prediction words, and the replaced sample masked sentence is recorded as the sample prediction sentence; the prediction loss value of the preset training model is then determined from the sample prediction sentence and its corresponding correct sample sentence, for example by a cross-entropy loss function.
  • The convergence condition can be that the prediction loss value is smaller than a set threshold, i.e., training stops when the prediction loss value drops below the set threshold; it can also be that after 10,000 iterations the prediction loss value is very small and no longer decreases, i.e., training stops at that point, and the converged preset training model is recorded as the preset language model.
  • When the prediction loss value has not reached the preset convergence condition, the initial parameters of the model are adjusted again according to the prediction loss value, so that the prediction loss value corresponding to each correct sample sentence eventually reaches the preset convergence condition.
  • In this way, the outputs of the preset training model move ever closer to the accurate results and the recognition accuracy becomes higher and higher; once the prediction loss values of all correct sample sentences reach the preset convergence condition, the converged preset training model is recorded as the preset language model.
  • An intelligent sentence error correction apparatus is provided that corresponds one-to-one with the intelligent sentence error correction method of the above embodiments.
  • The apparatus includes an error correction instruction receiving module 10, a dependency probability determination module 20, a to-be-corrected word determination module 30, an error correction prediction module 40, and a word replacement module 50.
  • The functional modules are described in detail as follows:
  • the error correction instruction receiving module 10 is configured to receive a sentence error correction instruction containing an erroneous sentence;
  • the dependency probability determination module 20 is configured to input the erroneous sentence into the preset dependency detection model, predict the dependency relationships between the words in the erroneous sentence, obtain the dependency probability between each word and its corresponding subsidiary words, and associate each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
  • the to-be-corrected word determination module 30 is configured to compare the dependency probability associated with each word with the preset dependency threshold, and record the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
  • the error correction prediction module 40 is configured to mask the words to be corrected to obtain the masked sentence to be corrected, input the masked sentence to be corrected into the preset language model, perform error correction prediction on the words to be corrected, and obtain the predicted replacement words corresponding to the words to be corrected;
  • the word replacement module 50 is configured to replace each word to be corrected with its corresponding predicted replacement word, and record the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
  • The dependency probability determination module 20 includes:
  • a word information processing unit 201, configured to perform word information processing on the erroneous sentence to obtain the forward hidden-layer vector and the reverse hidden-layer vector corresponding to the erroneous sentence, where the forward hidden-layer vector is determined by ordering the words of the erroneous sentence in forward order and the reverse hidden-layer vector is determined by ordering them in reverse order;
  • a linear mapping unit 202, configured to input the forward hidden-layer vector into the first linear transformation layer module of the preset dependency detection model to determine the first word features corresponding to the forward hidden-layer vector, and simultaneously input the reverse hidden-layer vector into the second linear transformation layer module of the preset dependency detection model to determine the second word features corresponding to the reverse hidden-layer vector;
  • a dependency probability determination unit 203, configured to determine, from the first word features and the second word features, the dependency probability between each word and its corresponding subsidiary words through the biaffine attention mechanism.
  • The word information processing unit 201 includes:
  • a word sequence determination subunit 2011, configured to determine the word sequence corresponding to the erroneous sentence after word segmentation;
  • a word embedding subunit 2012, configured to perform word embedding on the word sequence by a word embedding method to obtain the corresponding word vector sequence;
  • an information extraction subunit 2013, configured to perform forward context information extraction on the word vector sequence through the information extraction module of the preset dependency detection model to obtain the forward hidden-layer vector, and simultaneously perform reverse context information extraction on the word vector sequence to obtain the reverse hidden-layer vector.
  • The error correction prediction module 40 includes:
  • a conversion encoding unit 401, configured to convert and encode the masked sentence through the encoding module of the preset language model to obtain the corresponding encoding vector;
  • a matching score determination unit 402, configured to perform linear mapping on the encoding vector and determine the matching score of the error correction encoding vector against each candidate replacement word in the preset character dictionary;
  • a matching probability determination unit 403, configured to normalize the matching scores to obtain the matching probability corresponding to each matching score;
  • a predicted replacement word determination unit 404, configured to record the candidate replacement word with the largest matching probability that is also greater than or equal to the preset matching threshold as the predicted replacement word.
  • The intelligent sentence error correction apparatus further includes:
  • a sample sentence set obtaining module, configured to obtain a correct sample sentence set containing at least one correct sample sentence;
  • a mask processing module, configured to mask the correct sample sentences to obtain sample masked sentences, each containing at least one sample mask word;
  • a mask prediction module, configured to input the sample masked sentence into the preset training model containing initial parameters, perform mask prediction on the sample mask words, and obtain the corresponding sample prediction words;
  • a sample prediction sentence recording module, configured to record the replaced sample masked sentence as the sample prediction sentence after the sample mask words have been replaced with the sample prediction words;
  • a prediction loss value determination module, configured to determine the prediction loss value of the preset training model from the sample prediction sentence and its corresponding correct sample sentence;
  • a parameter update module, configured to update and iterate the initial parameters of the preset training model when the prediction loss value has not reached the preset convergence condition, until the prediction loss value reaches the preset convergence condition, and to record the converged preset training model as the preset language model.
  • The mask processing module includes:
  • a first mask word selection unit, configured to select first mask words from the correct sample sentence, replace each first mask word with the preset replacement character, and record the replaced first mask words as the sample prediction words;
  • a second mask word selection unit, configured to select second mask words from the sample prediction words and replace each selected second mask word with a preset homophone.
  • Each module of the above intelligent sentence error correction apparatus can be implemented in whole or in part by software, hardware, or combinations thereof.
  • The above modules can be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
  • A computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 10.
  • The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a readable storage medium and an internal memory.
  • The readable storage medium stores an operating system, computer-readable instructions, and a database.
  • The internal memory provides an environment for running the operating system and the computer-readable instructions in the readable storage medium.
  • The database of the computer device stores the data used for intelligent sentence error correction in the above embodiments.
  • The network interface of the computer device communicates with external terminals through a network connection.
  • The computer-readable instructions, when executed by the processor, implement the intelligent sentence error correction method.
  • The readable storage media provided by this embodiment include non-volatile readable storage media and volatile readable storage media.
  • A computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the intelligent sentence error correction method described above when executing the computer-readable instructions.
  • One or more readable storage media storing computer-readable instructions are provided, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the intelligent sentence error correction method described above.


Abstract

An intelligent sentence error correction method and apparatus, a computer device, and a storage medium. The method inputs an erroneous sentence into a preset dependency detection model, predicts the dependency relationships between the words in the erroneous sentence, obtains the dependency probability associated with each word, and associates each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word (S20); compares the dependency probability associated with each word with a preset dependency threshold, and records the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected (S30); after masking the words to be corrected, obtains a masked sentence to be corrected, inputs the masked sentence to be corrected into a preset language model, performs error correction prediction on the words to be corrected, and obtains the predicted replacement words corresponding to the words to be corrected (S40); and replaces each word to be corrected with its corresponding predicted replacement word and records the replaced masked sentence as the correct sentence corresponding to the erroneous sentence (S50). The method improves the accuracy and efficiency of intelligent sentence error correction.

Description

INTELLIGENT SENTENCE ERROR CORRECTION METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
This application claims priority to the Chinese patent application filed with the China Patent Office on December 25, 2020, with application number 202011564979.7 and invention title "Intelligent sentence error correction method and apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the technical field of detection models, and in particular to an intelligent sentence error correction method and apparatus, a computer device, and a storage medium.
BACKGROUND
In the course of taking meeting minutes or drafting official documents, single typos or runs of consecutive typos often occur. To improve the quality of the written text, intelligently recognizing and correcting the typos in the text is very necessary.
The inventors realized that, in the prior art, typos in text are usually recognized as follows: the erroneous sentence in the text is fed into an encoder, and a decoder decodes each word of the text one by one and outputs the correct sentence. This method has the following shortcoming: the decoder's output at each step depends on its output at the previous step, so when the previous step's output is wrong, the error propagates onward, making the error correction accuracy for erroneous sentences low.
SUMMARY
Embodiments of the present application provide an intelligent sentence error correction method and apparatus, a computer device, and a storage medium, to solve the problem of low error correction accuracy for sentences containing errors.
An intelligent sentence error correction method includes:
receiving a sentence error correction instruction containing an erroneous sentence;
inputting the erroneous sentence into a preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, obtaining the dependency probability between each word and its corresponding subsidiary words, and associating each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
comparing the dependency probability associated with each word with a preset dependency threshold, and recording the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
after masking the words to be corrected, obtaining a masked sentence to be corrected, inputting the masked sentence to be corrected into a preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement words corresponding to the words to be corrected;
replacing each word to be corrected with its corresponding predicted replacement word, and recording the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
An intelligent sentence error correction apparatus includes:
an error correction instruction receiving module, configured to receive a sentence error correction instruction containing an erroneous sentence;
a dependency probability determination module, configured to input the erroneous sentence into a preset dependency detection model, predict the dependency relationships between the words in the erroneous sentence, obtain the dependency probability between each word and its corresponding subsidiary words, and associate each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
a to-be-corrected word determination module, configured to compare the dependency probability associated with each word with a preset dependency threshold, and record the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
an error correction prediction module, configured to mask the words to be corrected to obtain a masked sentence to be corrected, input the masked sentence to be corrected into a preset language model, perform error correction prediction on the words to be corrected, and obtain the predicted replacement words corresponding to the words to be corrected;
a word replacement module, configured to replace each word to be corrected with its corresponding predicted replacement word, and record the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
receiving a sentence error correction instruction containing an erroneous sentence;
inputting the erroneous sentence into a preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, obtaining the dependency probability between each word and its corresponding subsidiary words, and associating each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
comparing the dependency probability associated with each word with a preset dependency threshold, and recording the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
after masking the words to be corrected, obtaining a masked sentence to be corrected, inputting the masked sentence to be corrected into a preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement words corresponding to the words to be corrected;
replacing each word to be corrected with its corresponding predicted replacement word, and recording the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
One or more readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a sentence error correction instruction containing an erroneous sentence;
inputting the erroneous sentence into a preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, obtaining the dependency probability between each word and its corresponding subsidiary words, and associating each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
comparing the dependency probability associated with each word with a preset dependency threshold, and recording the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
after masking the words to be corrected, obtaining a masked sentence to be corrected, inputting the masked sentence to be corrected into a preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement words corresponding to the words to be corrected;
replacing each word to be corrected with its corresponding predicted replacement word, and recording the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
The present application reduces the error correction workload of the preset language model and improves the efficiency of intelligent sentence error correction.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of an intelligent sentence error correction method in an embodiment of the present application;
FIG. 2 is a flowchart of an intelligent sentence error correction method in an embodiment of the present application;
FIG. 3 is a flowchart of step S20 of the intelligent sentence error correction method in an embodiment of the present application;
FIG. 4 is a flowchart of step S201 of the intelligent sentence error correction method in an embodiment of the present application;
FIG. 5 is a flowchart of step S40 of the intelligent sentence error correction method in an embodiment of the present application;
FIG. 6 is a schematic block diagram of an intelligent sentence error correction apparatus in an embodiment of the present application;
FIG. 7 is a schematic block diagram of the dependency probability determination module of the intelligent sentence error correction apparatus in an embodiment of the present application;
FIG. 8 is a schematic block diagram of the word information processing unit of the intelligent sentence error correction apparatus in an embodiment of the present application;
FIG. 9 is a schematic block diagram of the error correction prediction module of the intelligent sentence error correction apparatus in an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The intelligent sentence error correction method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in an intelligent sentence error correction system that includes a client and a server as shown in FIG. 1, which communicate over a network, to solve the problem of low error correction accuracy for erroneous sentences. The client refers to the program that corresponds to the server and provides local services to the user. Clients can be installed on, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, an intelligent sentence error correction method is provided and, by way of example, described as applied to the server in FIG. 1; it includes the following steps:
S10: Receive a sentence error correction instruction containing an erroneous sentence.
The sentence error correction instruction can be sent actively by the user, or triggered automatically when the user types an erroneous sentence into the system. An erroneous sentence is a sentence containing typos or syntactic errors.
S20: Input the erroneous sentence into a preset dependency detection model, predict the dependency relationships between the words in the erroneous sentence, obtain the dependency probability between each word and its corresponding subsidiary words, and associate each word with its corresponding dependency probability; a subsidiary word is at least one word located immediately before or after a given word.
The preset dependency detection model is used to determine the dependency probability between each word and its corresponding subsidiary words; the dependency probability is the probability that a word depends on the presence of its subsidiary words in the erroneous sentence. Understandably, a subsidiary word can be the single adjacent word before or after a word, the two adjacent words before or after a word, or the one adjacent word before a word together with the two adjacent words after it. For example, if a sentence is "我今天中午吃了很多东西" ("I ate a lot at noon today"), then for the word "中午" ("noon") the corresponding subsidiary words can be "今天" ("today") and "吃" ("ate"), or "我" ("I"), "今天", and "吃".
In one embodiment, as shown in FIG. 3, step S20 includes:
S201: Perform word information processing on the erroneous sentence to obtain a forward hidden-layer vector and a reverse hidden-layer vector corresponding to the erroneous sentence; the forward hidden-layer vector is determined by ordering the words of the erroneous sentence in forward order, and the reverse hidden-layer vector is determined by ordering the words of the erroneous sentence in reverse order.
Understandably, word information processing refers to operations including word segmentation, word embedding, and context information extraction on the erroneous sentence.
In a specific embodiment, as shown in FIG. 4, step S201 includes:
S2011: After performing word segmentation on the erroneous sentence, determine the word sequence corresponding to the erroneous sentence.
For example, the erroneous sentence can be segmented with the jieba (结巴) segmentation method to obtain each word of the erroneous sentence, and, after each word is replaced with its corresponding code from a preset word dictionary, the word sequence corresponding to the erroneous sentence is determined. The preset word dictionary is obtained by pre-encoding each word in other sample documents; accordingly, in step S2011, the code corresponding to each word of the erroneous sentence can be looked up in the preset word dictionary and substituted, yielding the word sequence.
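A minimal sketch of step S2011, using the jieba segmenter named above and a tiny hypothetical preset word dictionary (a real dictionary would be built by pre-encoding sample documents):

```python
import jieba  # the "结巴" segmentation library named in the description

# Hypothetical preset word dictionary: word -> pre-assigned integer code.
word_dict = {"我": 1, "今天": 2, "中午": 3, "吃": 4, "了": 5, "很多": 6, "东西": 7}
UNK = 0  # assumed code for words missing from the preset dictionary

def sentence_to_word_sequence(sentence: str) -> list[int]:
    """Segment the sentence, then replace each word with its dictionary code."""
    words = jieba.lcut(sentence)            # e.g. ["我", "今天", "中午", ...]
    return [word_dict.get(w, UNK) for w in words]

print(sentence_to_word_sequence("我今天中午吃了很多东西"))
```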
S2012: Perform word embedding on the word sequence by a word embedding method to obtain the word vector sequence corresponding to the word sequence.
Word embedding refers to replacing each word in the word sequence with a word vector that the preset dependency detection model can recognize and train on.
Specifically, after the word sequence corresponding to the erroneous sentence has been determined, a word embedding method (for example, the skip-gram model of Word2Vec, the CBOW model of Word2Vec, or GloVe word vectors) performs word embedding on each word in the word sequence, that is, replaces each word with a word vector recognizable and trainable by the preset dependency detection model, yielding the word vector sequence corresponding to the word sequence.
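As an illustration of step S2012, the sketch below uses the skip-gram variant of Word2Vec from the gensim library, one of the embedding methods the paragraph names; the toy corpus and vector size are assumptions, not values from the patent:

```python
from gensim.models import Word2Vec

# Toy corpus of pre-segmented sentences; a real model would be trained on a
# large corpus of sample documents.
corpus = [["我", "今天", "中午", "吃", "了", "很多", "东西"]]
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=1)  # sg=1: skip-gram

# One 100-dimensional word vector per word of the word sequence.
word_vector_sequence = [model.wv[w] for w in corpus[0]]
```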
S2013: Through the information extraction module of the preset dependency detection model, perform forward context information extraction on the word vector sequence to obtain the forward hidden-layer vector, and simultaneously perform reverse context information extraction on the word vector sequence to obtain the reverse hidden-layer vector.
Context information extraction includes extraction of the preceding context and extraction of the following context, and the information extraction module is a BiLSTM (Bi-directional Long Short-Term Memory) module. Understandably, after the word embedding of step S2012, the word vector sequence is arranged in the forward order of the words, so the information extraction module of the preset dependency detection model can extract preceding-context information over the word vector sequence in that forward order — that is, the second word vector of the sequence is determined from the first, and so on — yielding the forward hidden-layer vector. Likewise, the information extraction module extracts following-context information over the word vector sequence: starting from the last word vector of the sequence, the second-to-last word vector is determined from the last, and so on, yielding the reverse hidden-layer vector.
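A minimal PyTorch sketch of step S2013: a single bidirectional LSTM whose two directions yield the forward and reverse hidden-layer vectors; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 100, 128, 7
word_vectors = torch.randn(1, seq_len, embed_dim)   # (batch, words, embedding)

# BiLSTM information extraction module: one pass in each direction.
bilstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
outputs, _ = bilstm(word_vectors)                   # (1, seq_len, 2 * hidden_dim)

forward_hidden = outputs[..., :hidden_dim]  # preceding-context vector per word
reverse_hidden = outputs[..., hidden_dim:]  # following-context vector per word
```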
S202: Input the forward hidden-layer vector into the first linear transformation layer module of the preset dependency detection model to determine the first word features corresponding to the forward hidden-layer vector; simultaneously, input the reverse hidden-layer vector into the second linear transformation layer module of the preset dependency detection model to determine the second word features corresponding to the reverse hidden-layer vector.
Specifically, after the forward and reverse hidden-layer vectors have been obtained, the first linear transformation layer module performs a linear mapping on the forward hidden-layer vector to determine the first word features, and the second linear transformation layer module performs a linear mapping on the reverse hidden-layer vector to determine the second word features. The first word features represent the subsidiary words and the attached words: the attached words are the words of the erroneous sentence, and a subsidiary word is at least one word located immediately before or after an attached word. Understandably, since the forward hidden-layer vector is determined by the forward order of the words, for the first word features the earlier word in forward order serves as the subsidiary word and the remaining adjacent words serve as attached words.
Further, let the word feature corresponding to the n-th vector of the forward hidden-layer vector (that is, the n-th word of the erroneous sentence in forward order) be denoted $\overrightarrow{h_n}$, and let the word feature corresponding to the n-th vector of the reverse hidden-layer vector (that is, the n-th word of the erroneous sentence in reverse order) be denoted $\overleftarrow{h_n}$. Understandably, for the first word of the erroneous sentence in forward order, its corresponding word feature in the forward hidden-layer vector is $\overrightarrow{h_1}$, while its corresponding feature in the reverse hidden-layer vector is $\overleftarrow{h_N}$, that is, the last word feature.
From $\overrightarrow{h_n}$ and $\overleftarrow{h_n}$, the first word features are determined as
$$h_n^{arc\text{-}head} = W_{arc\text{-}head}\,\overrightarrow{h_n} + b_{arc\text{-}head}$$
and the second word features as
$$h_n^{arc\text{-}dep} = W_{arc\text{-}dep}\,\overleftarrow{h_n} + b_{arc\text{-}dep},$$
where $W_{arc\text{-}head}$ and $b_{arc\text{-}head}$ are trainable parameters of the first linear transformation layer module, and $W_{arc\text{-}dep}$ and $b_{arc\text{-}dep}$ are trainable parameters of the second linear transformation layer module.
S203: According to the first word features and the second word features, determine the dependency probability between each word and its corresponding subsidiary words through a biaffine attention mechanism.
Specifically, after the first word features and the second word features have been determined, all first word features are stacked, and then, according to the second word features and the stacked first word features, the dependency probability between each word and its corresponding subsidiary words is determined through the biaffine attention mechanism.
Further, the dependency probability between each word and its corresponding subsidiary words can be determined by the following formula:
$$s_n = H^{arc\text{-}head}\,W\,h_n^{arc\text{-}dep} + H^{head}\,U$$
where $s_n$ is the dependency probability between the n-th word of the erroneous sentence and its corresponding subsidiary words; $H^{arc\text{-}head}$ is obtained by stacking the first word features; $h_n^{arc\text{-}dep}$ is the word feature corresponding to the n-th vector of the reverse hidden-layer vector; $H^{head}$ is the forward hidden-layer vector; and $W$ and $U$ are training parameters of the biaffine attention mechanism.
In this embodiment, the biaffine attention mechanism improves the detection of erroneous words in erroneous sentences: it captures the associations between words well and combines context information better, improving both the accuracy and the efficiency of erroneous-word detection.
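The following sketch illustrates steps S202 and S203 together, in the standard biaffine-attention form (Dozat and Manning, 2017) that the formulas above follow; the feature dimensions, the random stand-in inputs, and the softmax used to turn scores into probabilities are assumptions:

```python
import torch
import torch.nn as nn

hidden_dim, feat_dim, seq_len = 128, 64, 7
forward_hidden = torch.randn(seq_len, hidden_dim)   # from the BiLSTM, per word
reverse_hidden = torch.randn(seq_len, hidden_dim)

mlp_head = nn.Linear(hidden_dim, feat_dim)  # first linear transformation layer
mlp_dep = nn.Linear(hidden_dim, feat_dim)   # second linear transformation layer

H_head = mlp_head(forward_hidden)           # stacked first word features
h_dep = mlp_dep(reverse_hidden)             # second word features

W = torch.randn(feat_dim, feat_dim)         # biaffine weight (trainable in a real model)
U = torch.randn(feat_dim)                   # biaffine bias term (trainable in a real model)

# Biaffine score of every candidate subsidiary word (row) for every word (column).
scores = H_head @ W @ h_dep.T + (H_head @ U).unsqueeze(1)  # (seq_len, seq_len)
dep_prob = scores.softmax(dim=0)            # dependency probabilities per word
```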
S30: Compare the dependency probability associated with each word with a preset dependency threshold, and record the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected.
The preset dependency threshold may be determined manually from historical sample data; for example, it may be 0.8 or 0.9. It can be adjusted to the accuracy requirements of the specific application scenario: for a scenario that demands higher accuracy in identifying words to correct, the preset dependency threshold can be set to a higher value, such as 0.95.
Understandably, after each word has been associated with its dependency probability in step S20, the dependency probability associated with each word is compared with the preset dependency threshold. When the dependency probability associated with a word is smaller than the preset dependency threshold, it indicates that the word's dependency on the other words of the erroneous sentence is weak — the word may be erroneous — so the word associated with a dependency probability smaller than the preset dependency threshold is recorded as a word to be corrected.
Further, when the dependency probability associated with a word is greater than or equal to the preset dependency threshold, it indicates that the word has a strong dependency relationship with the other words of the erroneous sentence, that is, the word is not erroneous.
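A minimal sketch of step S30, with illustrative dependency probabilities standing in for the output of step S20 and a threshold of 0.8 (one of the example values above); the typo "中牛" is a made-up stand-in for an erroneous word:

```python
DEP_THRESHOLD = 0.8   # preset dependency threshold, e.g. 0.8, 0.9, or 0.95

# Illustrative dependency probabilities associated with each word by step S20.
word_probs = {"我": 0.97, "今天": 0.95, "中牛": 0.41, "吃": 0.93}

# Words whose dependency probability falls below the threshold are to be corrected.
words_to_correct = [w for w, p in word_probs.items() if p < DEP_THRESHOLD]
print(words_to_correct)   # -> ["中牛"]; these positions get masked in step S40
```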
S40: After masking the words to be corrected, obtain a masked sentence to be corrected, input the masked sentence to be corrected into a preset language model, perform error correction prediction on the words to be corrected, and obtain the predicted replacement words corresponding to the words to be corrected.
Understandably, masking refers to covering the word to be corrected with a special character so that error correction prediction can be performed on the word at that position.
In one embodiment, as shown in FIG. 5, step S40 includes the following steps:
S401: Convert and encode the masked sentence through the encoding module of the preset language model to obtain the encoding vector corresponding to the masked sentence; the encoding vector includes the error correction encoding vector corresponding to each word to be corrected.
The encoding module is essentially a vector conversion layer of the preset language model, used to convert the masked sentence to be corrected into an encoding vector that the model can recognize.
Specifically, after the words associated with a dependency probability smaller than the preset dependency threshold have been recorded as words to be corrected, the encoding module of the preset language model converts and encodes the masked sentence to obtain the corresponding encoding vector.
S402: Perform linear mapping on the encoding vector, and determine the matching score of the error correction encoding vector against each candidate replacement word in a preset character dictionary.
Understandably, the preset character dictionary may be obtained by word segmentation, encoding, and similar processing of historical sample data; it stores the encoding vectors corresponding to multiple words.
Specifically, after the encoding vector corresponding to the masked sentence has been obtained, the encoding vector is linearly mapped, and the error correction encoding vector is matched against each encoding vector in the preset character dictionary — for example, by computing the Euclidean distance between the error correction encoding vector and each encoding vector in the dictionary — and the matching score of the error correction encoding vector against each candidate replacement word is then determined from that Euclidean distance.
S403: Normalize the matching scores to obtain the matching probability corresponding to each matching score.
S404: Record the candidate replacement word with the largest matching probability that is also greater than or equal to a preset matching threshold as the predicted replacement word.
The preset matching threshold may be, for example, 0.9 or 0.95.
Specifically, after the matching scores have been determined, they are normalized to obtain the matching probabilities, and the candidate replacement word with the largest matching probability that is also greater than the preset matching threshold is recorded as the predicted replacement word.
Further, if the largest matching probability is smaller than the preset matching threshold, the word corresponding to that largest matching probability is sent to a preset receiver, which may be the sender of the sentence error correction instruction, to indicate that no predicted replacement word matching the word at that position has been found.
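A sketch of steps S402 to S404 under the Euclidean-distance matching described above: the negated distance serves as the matching score, normalization is done with a softmax, and the best candidate is accepted only if it clears the preset matching threshold. The random dictionary vectors and sizes are assumptions:

```python
import torch

corr_vec = torch.randn(256)          # error correction encoding vector at the mask
dict_vecs = torch.randn(5000, 256)   # preset character dictionary, one vector per word
MATCH_THRESHOLD = 0.9                # preset matching threshold, e.g. 0.9 or 0.95

# Matching score: negated Euclidean distance, so closer vectors score higher.
scores = -torch.cdist(corr_vec.unsqueeze(0), dict_vecs).squeeze(0)  # (5000,)
probs = scores.softmax(dim=0)        # normalized matching probabilities

best = int(probs.argmax())
if probs[best] >= MATCH_THRESHOLD:
    predicted_replacement = best     # dictionary index of the predicted word
else:
    predicted_replacement = None     # no match: notify the instruction's sender
```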
S50: Replace each word to be corrected with its corresponding predicted replacement word, and record the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
Specifically, after the predicted replacement words corresponding to the words to be corrected have been obtained in step S40, each word to be corrected is replaced by its corresponding predicted replacement word, and the replaced masked sentence is recorded as the correct sentence corresponding to the erroneous sentence, completing the entire intelligent sentence error correction process.
In this embodiment, the detect-then-correct approach uses the dependency probability to judge which words of the erroneous sentence may be wrong, improving the judgment accuracy; the possibly-wrong words are masked and then corrected and replaced by the preset language model, which reduces the error correction workload of the preset language model and improves the efficiency of intelligent sentence error correction.
In one embodiment, before step S40 the method further includes:
(1) Obtain a correct sample sentence set; the correct sample sentence set contains at least one correct sample sentence.
Understandably, a correct sample sentence is a sentence with no typos and no syntactic errors; correct sample sentences can be extracted from documents in various application scenarios.
(2) Mask the correct sample sentences to obtain sample masked sentences; each sample masked sentence contains at least one sample mask word.
Specifically, this step includes the following steps:
selecting first mask words from the correct sample sentence, replacing each first mask word with a preset replacement character, and recording the replaced first mask words as the sample prediction words;
selecting second mask words from the sample prediction words, and replacing each selected second mask word with a preset homophone.
Specifically, after the correct sample sentence set has been obtained, first mask words are selected from the correct sample sentences, replaced with the preset replacement character, and recorded as the sample prediction words. Preferably, the selection probability for the first mask words can be 12% (an experimental value), i.e., each word of a correct sample sentence is selected as a first mask word with probability 12%. Further, to improve the model's ability to correct erroneous sentences, second mask words are selected from the sample prediction words and replaced with preset homophones, thereby creating homophone-substitution error scenarios (for example, in intelligent multi-turn dialogue, robots often make homophone errors). The first mask words other than the second mask words are not further replaced; only their positions are recorded. In this way the training covers both the homophone-replacement error correction process and the mask-based error correction process.
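A sketch of this two-stage masking: each word becomes a first mask word with 12% probability, and a subset of those (the fraction used here is an assumption) becomes second mask words replaced by homophones from a tiny stand-in homophone table:

```python
import random

MASK = "[MASK]"                                   # preset replacement character
homophones = {"在": "再", "做": "作", "她": "他"}  # illustrative homophone table

def make_training_sample(words: list[str], p_first=0.12, p_second=0.5):
    """Mask a correct sample sentence; return it with the gold word per position."""
    masked, gold = [], {}
    for i, w in enumerate(words):
        if random.random() < p_first:             # selected as a first mask word
            gold[i] = w                           # record position and gold word
            if w in homophones and random.random() < p_second:
                masked.append(homophones[w])      # second mask word: homophone error
            else:
                masked.append(MASK)               # ordinary mask replacement
        else:
            masked.append(w)
    return masked, gold
```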
(3) Input the sample masked sentence into a preset training model containing initial parameters, perform mask prediction on the sample mask words, and obtain the sample prediction words corresponding to the sample mask words.
Specifically, after the sample masked sentence has been obtained, it is converted and encoded to obtain the corresponding sample encoding vector; the sample encoding vector is linearly mapped to determine the matching score of each first and second mask word against each candidate replacement word in the preset character dictionary; the matching scores are normalized to obtain the corresponding matching probabilities; and the candidate replacement word with the largest matching probability that is also greater than or equal to the preset matching threshold is recorded as the sample prediction word corresponding to the first or second mask word.
(4) After replacing the sample mask words with the sample prediction words, record the replaced sample masked sentence as the sample prediction sentence.
(5) Determine the prediction loss value of the preset training model from the sample prediction sentence and its corresponding correct sample sentence.
Specifically, after the sample prediction words corresponding to the sample mask words have been obtained, the sample mask words are replaced with the sample prediction words and the replaced sample masked sentence is recorded as the sample prediction sentence; the prediction loss value of the preset training model is then determined from the sample prediction sentence and its corresponding correct sample sentence, and it can be determined by a cross-entropy loss function.
(6) When the prediction loss value has not reached a preset convergence condition, update and iterate the initial parameters of the preset training model until the prediction loss value reaches the preset convergence condition, and record the converged preset training model as the preset language model.
Understandably, the convergence condition can be that the prediction loss value is smaller than a set threshold — training stops when the prediction loss value drops below the set threshold. It can also be that after 10,000 iterations the prediction loss value is very small and no longer decreases — training stops at that point, and the converged preset training model is recorded as the preset language model.
Further, after the prediction loss value has been determined from the sample prediction sentence and its corresponding correct sample sentence, when the prediction loss value has not reached the preset convergence condition, the initial parameters of the preset training model are adjusted according to the prediction loss value, and the sample masked sentence corresponding to that correct sample sentence is fed into the adjusted model again. Once the prediction loss value corresponding to that correct sample sentence reaches the preset convergence condition, another correct sample sentence is selected from the correct sample sentence set and steps (1) to (5) are performed to obtain its prediction loss value; when that prediction loss value has not reached the preset convergence condition, the initial parameters are adjusted again according to it, until the prediction loss value corresponding to that correct sample sentence reaches the preset convergence condition.
In this way, after the preset training model has been trained on all correct sample sentences in the correct sample sentence set, its outputs move ever closer to the accurate results and the recognition accuracy grows higher and higher; when the prediction loss values corresponding to all correct sample sentences have reached the preset convergence condition, the converged preset training model is recorded as the preset language model.
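A self-contained sketch of this training loop, with a toy stand-in for the preset training model, cross-entropy as the prediction loss, and a loss-below-threshold convergence test; every size and hyperparameter here is an assumption:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 5000, 128
# Toy stand-in for the preset training model (a real one would be a masked LM).
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))

masked_ids = torch.randint(0, vocab_size, (7,))  # encoded sample masked sentence
gold_ids = torch.randint(0, vocab_size, (7,))    # codes of the correct sample sentence

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
CONVERGE_AT = 0.01                               # preset convergence threshold

for step in range(10_000):
    logits = model(masked_ids)                   # (7, vocab_size) predictions
    loss = criterion(logits, gold_ids)           # prediction loss value
    if loss.item() < CONVERGE_AT:                # preset convergence condition met
        break                                    # converged model -> preset language model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```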
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, an intelligent sentence error correction apparatus is provided that corresponds one-to-one with the intelligent sentence error correction method of the above embodiments. As shown in FIG. 6, the apparatus includes an error correction instruction receiving module 10, a dependency probability determination module 20, a to-be-corrected word determination module 30, an error correction prediction module 40, and a word replacement module 50. The functional modules are described in detail as follows:
the error correction instruction receiving module 10 is configured to receive a sentence error correction instruction containing an erroneous sentence;
the dependency probability determination module 20 is configured to input the erroneous sentence into the preset dependency detection model, predict the dependency relationships between the words in the erroneous sentence, obtain the dependency probability between each word and its corresponding subsidiary words, and associate each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
the to-be-corrected word determination module 30 is configured to compare the dependency probability associated with each word with the preset dependency threshold, and record the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
the error correction prediction module 40 is configured to mask the words to be corrected to obtain the masked sentence to be corrected, input the masked sentence to be corrected into the preset language model, perform error correction prediction on the words to be corrected, and obtain the predicted replacement words corresponding to the words to be corrected;
the word replacement module 50 is configured to replace each word to be corrected with its corresponding predicted replacement word, and record the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
Preferably, as shown in FIG. 7, the dependency probability determination module 20 includes:
a word information processing unit 201, configured to perform word information processing on the erroneous sentence to obtain the forward hidden-layer vector and the reverse hidden-layer vector corresponding to the erroneous sentence, where the forward hidden-layer vector is determined by ordering the words of the erroneous sentence in forward order and the reverse hidden-layer vector is determined by ordering the words of the erroneous sentence in reverse order;
a linear mapping unit 202, configured to input the forward hidden-layer vector into the first linear transformation layer module of the preset dependency detection model to determine the first word features corresponding to the forward hidden-layer vector, and simultaneously input the reverse hidden-layer vector into the second linear transformation layer module of the preset dependency detection model to determine the second word features corresponding to the reverse hidden-layer vector;
a dependency probability determination unit 203, configured to determine, from the first word features and the second word features, the dependency probability between each word and its corresponding subsidiary words through the biaffine attention mechanism.
Preferably, as shown in FIG. 8, the word information processing unit 201 includes:
a word sequence determination subunit 2011, configured to determine the word sequence corresponding to the erroneous sentence after performing word segmentation on the erroneous sentence;
a word embedding subunit 2012, configured to perform word embedding on the word sequence by a word embedding method to obtain the word vector sequence corresponding to the word sequence;
an information extraction subunit 2013, configured to perform forward context information extraction on the word vector sequence through the information extraction module of the preset dependency detection model to obtain the forward hidden-layer vector, and simultaneously perform reverse context information extraction on the word vector sequence to obtain the reverse hidden-layer vector.
Preferably, as shown in FIG. 9, the error correction prediction module 40 includes:
a conversion encoding unit 401, configured to convert and encode the masked sentence through the encoding module of the preset language model to obtain the encoding vector corresponding to the masked sentence, where the encoding vector includes the error correction encoding vector corresponding to each word to be corrected;
a matching score determination unit 402, configured to perform linear mapping on the encoding vector and determine the matching score of the error correction encoding vector against each candidate replacement word in the preset character dictionary;
a matching probability determination unit 403, configured to normalize the matching scores to obtain the matching probability corresponding to each matching score;
a predicted replacement word determination unit 404, configured to record the candidate replacement word with the largest matching probability that is also greater than or equal to the preset matching threshold as the predicted replacement word.
Preferably, the intelligent sentence error correction apparatus further includes:
a sample sentence set obtaining module, configured to obtain a correct sample sentence set containing at least one correct sample sentence;
a mask processing module, configured to mask the correct sample sentences to obtain sample masked sentences, each containing at least one sample mask word;
a mask prediction module, configured to input the sample masked sentence into the preset training model containing initial parameters, perform mask prediction on the sample mask words, and obtain the sample prediction words corresponding to the sample mask words;
a sample prediction sentence recording module, configured to record the replaced sample masked sentence as the sample prediction sentence after the sample mask words have been replaced with the sample prediction words;
a prediction loss value determination module, configured to determine the prediction loss value of the preset training model from the sample prediction sentence and its corresponding correct sample sentence;
a parameter update module, configured to update and iterate the initial parameters of the preset training model when the prediction loss value has not reached the preset convergence condition, until the prediction loss value reaches the preset convergence condition, and to record the converged preset training model as the preset language model.
The mask processing module includes:
a first mask word selection unit, configured to select first mask words from the correct sample sentence, replace each first mask word with the preset replacement character, and record the replaced first mask words as the sample prediction words;
a second mask word selection unit, configured to select second mask words from the sample prediction words and replace each selected second mask word with a preset homophone.
For specific limitations of the intelligent sentence error correction apparatus, reference may be made to the limitations of the intelligent sentence error correction method above, which are not repeated here. Each module of the above apparatus can be implemented in whole or in part by software, hardware, or combinations thereof. The above modules can be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the readable storage medium. The database of the computer device stores the data used for intelligent sentence error correction in the above embodiments. The network interface of the computer device communicates with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement an intelligent sentence error correction method. The readable storage media provided by this embodiment include non-volatile readable storage media and volatile readable storage media.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
receiving a sentence error correction instruction containing an erroneous sentence;
inputting the erroneous sentence into a preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, obtaining the dependency probability between each word and its corresponding subsidiary words, and associating each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
comparing the dependency probability associated with each word with a preset dependency threshold, and recording the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
after masking the words to be corrected, obtaining a masked sentence to be corrected, inputting the masked sentence to be corrected into a preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement words corresponding to the words to be corrected;
replacing each word to be corrected with its corresponding predicted replacement word, and recording the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a sentence error correction instruction containing an erroneous sentence;
inputting the erroneous sentence into a preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, obtaining the dependency probability between each word and its corresponding subsidiary words, and associating each word with its corresponding dependency probability, where a subsidiary word is at least one word located immediately before or after a given word;
comparing the dependency probability associated with each word with a preset dependency threshold, and recording the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
after masking the words to be corrected, obtaining a masked sentence to be corrected, inputting the masked sentence to be corrected into a preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement words corresponding to the words to be corrected;
replacing each word to be corrected with its corresponding predicted replacement word, and recording the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
Those of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be completed by instructing the relevant hardware through computer-readable instructions, which can be stored in a non-volatile computer-readable storage medium or a volatile computer-readable storage medium; when executed, the computer-readable instructions can include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent substitutions for some of the technical features therein; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the protection scope of the present application.

Claims (20)

  1. An intelligent sentence error correction method, comprising:
    receiving a sentence error correction instruction containing an erroneous sentence;
    inputting the erroneous sentence into a preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, obtaining the dependency probability between each word and its corresponding subsidiary words, and associating each word with its corresponding dependency probability; wherein a subsidiary word is at least one word located immediately before or after a given word;
    comparing the dependency probability associated with each word with a preset dependency threshold, and recording the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
    after masking the words to be corrected, obtaining a masked sentence to be corrected, inputting the masked sentence to be corrected into a preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement words corresponding to the words to be corrected;
    replacing each word to be corrected with its corresponding predicted replacement word, and recording the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
  2. The intelligent sentence error correction method according to claim 1, wherein inputting the erroneous sentence into the preset dependency detection model, predicting the dependency relationships between the words in the erroneous sentence, and obtaining the dependency probability between each word and its corresponding subsidiary words comprises:
    performing word information processing on the erroneous sentence to obtain a forward hidden-layer vector and a reverse hidden-layer vector corresponding to the erroneous sentence; the forward hidden-layer vector being determined by ordering the words of the erroneous sentence in forward order, and the reverse hidden-layer vector being determined by ordering the words of the erroneous sentence in reverse order;
    inputting the forward hidden-layer vector into a first linear transformation layer module of the preset dependency detection model to determine first word features corresponding to the forward hidden-layer vector, and simultaneously inputting the reverse hidden-layer vector into a second linear transformation layer module of the preset dependency detection model to determine second word features corresponding to the reverse hidden-layer vector;
    determining, according to the first word features and the second word features, the dependency probability between each word and its corresponding subsidiary words through a biaffine attention mechanism.
  3. The intelligent sentence error correction method according to claim 2, wherein performing word information processing on the erroneous sentence to obtain the forward hidden-layer vector and the reverse hidden-layer vector corresponding to the erroneous sentence comprises:
    after performing word segmentation on the erroneous sentence, determining the word sequence corresponding to the erroneous sentence;
    performing word embedding on the word sequence by a word embedding method to obtain the word vector sequence corresponding to the word sequence;
    performing forward context information extraction on the word vector sequence through an information extraction module of the preset dependency detection model to obtain the forward hidden-layer vector, and simultaneously performing reverse context information extraction on the word vector sequence to obtain the reverse hidden-layer vector.
  4. The intelligent sentence error correction method according to claim 1, wherein inputting the masked sentence into the preset language model, performing error correction prediction on the words to be corrected, and obtaining the predicted replacement words corresponding to the words to be corrected comprises:
    converting and encoding the masked sentence through an encoding module of the preset language model to obtain an encoding vector corresponding to the masked sentence; the encoding vector including the error correction encoding vector corresponding to each word to be corrected;
    performing linear mapping on the encoding vector, and determining the matching score of the error correction encoding vector against each candidate replacement word in a preset character dictionary;
    normalizing the matching scores to obtain the matching probability corresponding to each matching score;
    recording the candidate replacement word with the largest matching probability that is also greater than or equal to a preset matching threshold as the predicted replacement word.
  5. The intelligent sentence error correction method according to claim 1, further comprising, before inputting the masked sentence into the preset language model:
    obtaining a correct sample sentence set; the correct sample sentence set containing at least one correct sample sentence;
    masking the correct sample sentence to obtain a sample masked sentence; the sample masked sentence containing at least one sample mask word;
    inputting the sample masked sentence into a preset training model containing initial parameters, performing mask prediction on the sample mask words, and obtaining the sample prediction words corresponding to the sample mask words;
    after replacing the sample mask words with the sample prediction words, recording the replaced sample masked sentence as a sample prediction sentence;
    determining the prediction loss value of the preset training model according to the sample prediction sentence and its corresponding correct sample sentence;
    when the prediction loss value has not reached a preset convergence condition, updating and iterating the initial parameters of the preset training model until the prediction loss value reaches the preset convergence condition, and recording the converged preset training model as the preset language model.
  6. The intelligent sentence error correction method according to claim 5, wherein masking the correct sample sentence to obtain the sample masked sentence comprises:
    selecting first mask words from the correct sample sentence, replacing each first mask word with a preset replacement character, and recording the replaced first mask words as the sample prediction words;
    selecting second mask words from the sample prediction words, and replacing each selected second mask word with a preset homophone.
  7. An intelligent sentence error correction apparatus, comprising:
    an error correction instruction receiving module, configured to receive a sentence error correction instruction containing an erroneous sentence;
    a dependency probability determination module, configured to input the erroneous sentence into a preset dependency detection model, predict the dependency relationships between the words in the erroneous sentence, obtain the dependency probability between each word and its corresponding subsidiary words, and associate each word with its corresponding dependency probability; wherein a subsidiary word is at least one word located immediately before or after a given word;
    a to-be-corrected word determination module, configured to compare the dependency probability associated with each word with a preset dependency threshold, and record the words associated with a dependency probability smaller than the preset dependency threshold as words to be corrected;
    an error correction prediction module, configured to mask the words to be corrected to obtain a masked sentence to be corrected, input the masked sentence to be corrected into a preset language model, perform error correction prediction on the words to be corrected, and obtain the predicted replacement words corresponding to the words to be corrected;
    a word replacement module, configured to replace each word to be corrected with its corresponding predicted replacement word, and record the replaced masked sentence as the correct sentence corresponding to the erroneous sentence.
  8. The intelligent sentence error correction apparatus according to claim 7, wherein the dependency probability determination module comprises:
    a word information processing unit, configured to perform word information processing on the erroneous sentence to obtain a forward hidden-layer vector and a reverse hidden-layer vector corresponding to the erroneous sentence; the forward hidden-layer vector being determined by ordering the words of the erroneous sentence in forward order, and the reverse hidden-layer vector being determined by ordering the words of the erroneous sentence in reverse order;
    a linear mapping unit, configured to input the forward hidden-layer vector into a first linear transformation layer module of the preset dependency detection model, perform feature mapping on the forward hidden-layer vector and the reverse hidden-layer vector to obtain the first word features corresponding to the forward hidden-layer vector, and simultaneously input the reverse hidden-layer vector into a second linear transformation layer module of the preset dependency detection model to obtain the second word features corresponding to the reverse hidden-layer vector;
    a dependency probability determination unit, configured to determine, according to the first word features and the second word features, the dependency probability between each word and its corresponding subsidiary words through a biaffine attention mechanism.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    receiving a sentence error correction instruction containing a wrong sentence;
    inputting the wrong sentence into a preset dependency relationship detection model, predicting dependency relationships between words in the wrong sentence to obtain a dependency probability between each of the words and its corresponding subsidiary word, and associating each of the words with its corresponding dependency probability, wherein a subsidiary word refers to at least one word that is located before or after a word and adjacent to it;
    comparing the dependency probability associated with each of the words against a preset dependency threshold, and recording a word associated with a dependency probability smaller than the preset dependency threshold as a to-be-corrected word;
    after performing mask processing on the to-be-corrected word, obtaining a to-be-corrected mask sentence, inputting the to-be-corrected mask sentence into a preset language model, performing error correction prediction on the to-be-corrected word, and obtaining a predicted replacement word corresponding to the to-be-corrected word;
    replacing the to-be-corrected word with its corresponding predicted replacement word, and recording the to-be-corrected mask sentence after the replacement as the correct sentence corresponding to the wrong sentence.
  10. The computer device according to claim 9, wherein inputting the wrong sentence into the preset dependency relationship detection model and predicting the dependency relationships between the words in the wrong sentence to obtain the dependency probability between each of the words and its corresponding subsidiary word comprises:
    performing word information processing on the wrong sentence to obtain a forward hidden layer vector and a backward hidden layer vector corresponding to the wrong sentence, wherein the forward hidden layer vector is determined by sorting the words in the wrong sentence in forward order, and the backward hidden layer vector is determined by sorting the words in the wrong sentence in backward order;
    inputting the forward hidden layer vector into a first linear transformation layer module in the preset dependency relationship detection model to determine a first word feature corresponding to the forward hidden layer vector, and inputting the backward hidden layer vector into a second linear transformation layer module in the preset dependency relationship detection model to determine a second word feature corresponding to the backward hidden layer vector;
    determining, according to the first word feature and the second word feature, the dependency probability between each of the words and its corresponding subsidiary word through a biaffine attention mechanism.
  11. The computer device according to claim 10, wherein performing the word information processing on the wrong sentence to obtain the forward hidden layer vector and the backward hidden layer vector corresponding to the wrong sentence comprises:
    after performing word segmentation on the wrong sentence, determining a word sequence corresponding to the wrong sentence;
    performing word embedding on the word sequence through a word embedding method to obtain a word vector sequence corresponding to the word sequence;
    performing, through an information extraction module in the preset dependency relationship detection model, forward context information extraction on the word vector sequence to obtain the forward hidden layer vector, and performing backward context information extraction on the word vector sequence to obtain the backward hidden layer vector.
  12. The computer device according to claim 9, wherein inputting the to-be-corrected mask sentence into the preset language model, performing the error correction prediction on the to-be-corrected word, and obtaining the predicted replacement word corresponding to the to-be-corrected word comprises:
    performing, through an encoding module in the preset language model, transform encoding on the to-be-corrected mask sentence to obtain an encoding vector corresponding to the to-be-corrected mask sentence, wherein the encoding vector includes an error correction encoding vector corresponding to the to-be-corrected word;
    performing linear mapping on the encoding vector to determine a matching score of the error correction encoding vector for each candidate replacement word in a preset character dictionary;
    normalizing each of the matching scores to obtain a matching probability corresponding to each of the matching scores;
    recording, as the predicted replacement word, the candidate replacement word whose matching probability is the largest and is greater than or equal to a preset matching threshold.
  13. The computer device according to claim 9, wherein before inputting the to-be-corrected mask sentence into the preset language model, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining a correct sample sentence set, wherein the correct sample sentence set contains at least one correct sample sentence;
    performing mask processing on the correct sample sentence to obtain a sample mask sentence, wherein the sample mask sentence contains at least one sample mask word;
    inputting the sample mask sentence into a preset training model containing initial parameters, and performing mask prediction on the sample mask word to obtain a sample predicted word corresponding to the sample mask word;
    after replacing the sample mask word with the sample predicted word, recording the sample mask sentence after the replacement as a sample predicted sentence;
    determining a prediction loss value of the preset training model according to the sample predicted sentence and its corresponding correct sample sentence;
    when the prediction loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the preset training model until the prediction loss value reaches the preset convergence condition, and recording the converged preset training model as the preset language model.
  14. The computer device according to claim 13, wherein performing the mask processing on the correct sample sentence to obtain the sample mask sentence comprises:
    selecting a first mask word from the correct sample sentence, replacing the first mask word with a preset replacement character, and recording the replaced first mask word as the sample mask word;
    selecting a second mask word from the sample mask words, and replacing the selected second mask word with a preset homophone word.
  15. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving a sentence error correction instruction containing a wrong sentence;
    inputting the wrong sentence into a preset dependency relationship detection model, predicting dependency relationships between words in the wrong sentence to obtain a dependency probability between each of the words and its corresponding subsidiary word, and associating each of the words with its corresponding dependency probability, wherein a subsidiary word refers to at least one word that is located before or after a word and adjacent to it;
    comparing the dependency probability associated with each of the words against a preset dependency threshold, and recording a word associated with a dependency probability smaller than the preset dependency threshold as a to-be-corrected word;
    after performing mask processing on the to-be-corrected word, obtaining a to-be-corrected mask sentence, inputting the to-be-corrected mask sentence into a preset language model, performing error correction prediction on the to-be-corrected word, and obtaining a predicted replacement word corresponding to the to-be-corrected word;
    replacing the to-be-corrected word with its corresponding predicted replacement word, and recording the to-be-corrected mask sentence after the replacement as the correct sentence corresponding to the wrong sentence.
  16. The readable storage medium according to claim 15, wherein inputting the wrong sentence into the preset dependency relationship detection model and predicting the dependency relationships between the words in the wrong sentence to obtain the dependency probability between each of the words and its corresponding subsidiary word comprises:
    performing word information processing on the wrong sentence to obtain a forward hidden layer vector and a backward hidden layer vector corresponding to the wrong sentence, wherein the forward hidden layer vector is determined by sorting the words in the wrong sentence in forward order, and the backward hidden layer vector is determined by sorting the words in the wrong sentence in backward order;
    inputting the forward hidden layer vector into a first linear transformation layer module in the preset dependency relationship detection model to determine a first word feature corresponding to the forward hidden layer vector, and inputting the backward hidden layer vector into a second linear transformation layer module in the preset dependency relationship detection model to determine a second word feature corresponding to the backward hidden layer vector;
    determining, according to the first word feature and the second word feature, the dependency probability between each of the words and its corresponding subsidiary word through a biaffine attention mechanism.
  17. The readable storage medium according to claim 16, wherein performing the word information processing on the wrong sentence to obtain the forward hidden layer vector and the backward hidden layer vector corresponding to the wrong sentence comprises:
    after performing word segmentation on the wrong sentence, determining a word sequence corresponding to the wrong sentence;
    performing word embedding on the word sequence through a word embedding method to obtain a word vector sequence corresponding to the word sequence;
    performing, through an information extraction module in the preset dependency relationship detection model, forward context information extraction on the word vector sequence to obtain the forward hidden layer vector, and performing backward context information extraction on the word vector sequence to obtain the backward hidden layer vector.
  18. The readable storage medium according to claim 15, wherein inputting the to-be-corrected mask sentence into the preset language model, performing the error correction prediction on the to-be-corrected word, and obtaining the predicted replacement word corresponding to the to-be-corrected word comprises:
    performing, through an encoding module in the preset language model, transform encoding on the to-be-corrected mask sentence to obtain an encoding vector corresponding to the to-be-corrected mask sentence, wherein the encoding vector includes an error correction encoding vector corresponding to the to-be-corrected word;
    performing linear mapping on the encoding vector to determine a matching score of the error correction encoding vector for each candidate replacement word in a preset character dictionary;
    normalizing each of the matching scores to obtain a matching probability corresponding to each of the matching scores;
    recording, as the predicted replacement word, the candidate replacement word whose matching probability is the largest and is greater than or equal to a preset matching threshold.
  19. The readable storage medium according to claim 15, wherein before inputting the to-be-corrected mask sentence into the preset language model, the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:
    obtaining a correct sample sentence set, wherein the correct sample sentence set contains at least one correct sample sentence;
    performing mask processing on the correct sample sentence to obtain a sample mask sentence, wherein the sample mask sentence contains at least one sample mask word;
    inputting the sample mask sentence into a preset training model containing initial parameters, and performing mask prediction on the sample mask word to obtain a sample predicted word corresponding to the sample mask word;
    after replacing the sample mask word with the sample predicted word, recording the sample mask sentence after the replacement as a sample predicted sentence;
    determining a prediction loss value of the preset training model according to the sample predicted sentence and its corresponding correct sample sentence;
    when the prediction loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the preset training model until the prediction loss value reaches the preset convergence condition, and recording the converged preset training model as the preset language model.
  20. The readable storage medium according to claim 19, wherein performing the mask processing on the correct sample sentence to obtain the sample mask sentence comprises:
    selecting a first mask word from the correct sample sentence, replacing the first mask word with a preset replacement character, and recording the replaced first mask word as the sample mask word;
    selecting a second mask word from the sample mask words, and replacing the selected second mask word with a preset homophone word.
PCT/CN2021/083955 2020-12-25 2021-03-30 Sentence intelligent error correction method, device, computer equipment and storage medium WO2022134356A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011564979.7A CN112668313A (zh) 2020-12-25 2020-12-25 Sentence intelligent error correction method, device, computer equipment and storage medium
CN202011564979.7 2020-12-25

Publications (1)

Publication Number Publication Date
WO2022134356A1 (zh)

Family

ID=75409362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083955 WO2022134356A1 (zh) 2020-12-25 2021-03-30 句子智能纠错方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN112668313A (zh)
WO (1) WO2022134356A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177405A (zh) 2021-05-28 2021-07-27 中国平安人寿保险股份有限公司 BERT-based data error correction method, apparatus, device and storage medium
CN113591475B (zh) 2021-08-03 2023-07-21 美的集团(上海)有限公司 Method, apparatus and electronic device for unsupervised interpretable word segmentation
CN113705203A (zh) 2021-09-02 2021-11-26 上海极链网络科技有限公司 Text error correction method, apparatus, electronic device and computer-readable storage medium
CN115358217A (zh) 2022-09-02 2022-11-18 美的集团(上海)有限公司 Word and sentence error correction method, apparatus, readable storage medium and computer program product
CN115935957B (zh) 2022-12-29 2023-10-13 广东南方网络信息科技有限公司 Sentence grammar error correction method and system based on syntactic analysis
CN116662579B (zh) 2023-08-02 2024-01-26 腾讯科技(深圳)有限公司 Data processing method, apparatus, computer and storage medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
CN111324214A (zh) 2018-12-17 2020-06-23 北京搜狗科技发展有限公司 Sentence error correction method and apparatus
CN111144391A (zh) 2019-12-23 2020-05-12 北京爱医生智慧医疗科技有限公司 OCR recognition result error correction method and apparatus
CN111613214A (zh) 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving speech recognition capability

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAO YANAN, ET AL.: "The Research on the Automatic Proofreading Method of Word Errors in OCR Recognized Text", JISUANJI-FANGZHEN = COMPUTER SIMULATION, ZHONGGUO HANGTIAN GONGYE ZONGGONGSI, CN, vol. 37, no. 9, 30 September 2020 (2020-09-30), pages 333-337, XP055946417, ISSN: 1006-9348 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169331A (zh) 2022-07-19 2022-10-11 哈尔滨工业大学 Chinese spelling error correction method incorporating word information
CN115169331B (zh) 2022-07-19 2023-05-12 哈尔滨工业大学 Chinese spelling error correction method incorporating word information
CN114997147A (zh) 2022-08-04 2022-09-02 深圳依时货拉拉科技有限公司 POI address error correction method, apparatus, storage medium and device based on hybrid mask
CN114997147B (zh) 2022-08-04 2022-11-04 深圳依时货拉拉科技有限公司 POI address error correction method, apparatus, storage medium and device based on hybrid mask
CN115879421A (zh) 2023-02-16 2023-03-31 之江实验室 Sentence ordering method and apparatus for enhancing BART pre-training tasks
CN115879421B (zh) 2023-02-16 2024-01-09 之江实验室 Sentence ordering method and apparatus for enhancing BART pre-training tasks
CN116306600A (zh) 2023-05-25 2023-06-23 山东齐鲁壹点传媒有限公司 MacBERT-based Chinese text error correction method
CN116306600B (zh) 2023-05-25 2023-08-11 山东齐鲁壹点传媒有限公司 MacBERT-based Chinese text error correction method

Also Published As

Publication number Publication date
CN112668313A (zh) 2021-04-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 21908368
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 21908368
Country of ref document: EP
Kind code of ref document: A1