WO2022126897A1 - Text error correction method, apparatus, device and storage medium - Google Patents

Text error correction method, apparatus, device and storage medium

Info

Publication number
WO2022126897A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
error correction
error
preset
model
Prior art date
Application number
PCT/CN2021/082587
Other languages
English (en)
French (fr)
Inventor
邓悦
郑立颖
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022126897A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a text error correction method, apparatus, device, and readable storage medium.
  • the current text error correction modeling method mainly relies on an attention-based sequence-to-sequence encoder-decoder framework, which takes the original erroneous sentence as input; after encoding by the encoder, the decoder decodes the corrected sentence one character at a time.
  • each decoding step of the sequence-to-sequence model depends on the decoder output of the previous step.
  • this one-by-one decoding creates a time-series dependence, which costs running speed, and the encoding and decoding processes are difficult to parallelize, resulting in slow online running speed.
  • the main purpose of this application is to provide a text error correction method, apparatus, device, and readable storage medium, aiming to solve the existing technical problem of slow error correction speed when performing text error correction tasks.
  • the application provides a text error correction method
  • the text error correction method comprises the following steps:
  • the preset text error correction model is trained by a preset annotation editing operation sequence;
  • the preset annotation editing operation sequence is used for converting the preset error text into correct text corresponding to the preset error text;
  • the present application also provides a text error correction device; the text error correction device includes a memory, a processor, and a text error correction program stored on the memory and runnable on the processor; when the text error correction program is executed by the processor, the following steps of the text error correction method are implemented:
  • the preset text error correction model is trained by a preset annotation editing operation sequence;
  • the preset annotation editing operation sequence is used for converting the preset error text into correct text corresponding to the preset error text;
  • the present application also provides a computer-readable storage medium, where a text error correction program is stored on the computer-readable storage medium, and when the text error correction program is executed by a processor, the following steps of the text error correction method are implemented:
  • the preset text error correction model is trained by a preset annotation editing operation sequence;
  • the preset annotation editing operation sequence is used for converting the preset error text into correct text corresponding to the preset error text;
  • the present application also provides a text error correction apparatus, the text error correction apparatus includes:
  • the acquisition module is used to acquire the text to be corrected;
  • a generation module is used to input the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence;
  • the preset text error correction model is obtained by training with a preset annotation editing operation sequence;
  • the preset annotation editing operation sequence is used to convert the preset error text into correct text corresponding to the preset error text;
  • an error correction module configured to perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
  • the text to be corrected is obtained; the text to be corrected is input into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is obtained by training with a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert the preset error text into correct text corresponding to the preset error text; based on the error correction editing operation sequence, error correction is performed on the text to be corrected to obtain the corrected text.
  • this improves the text error correction process: the error correction editing operation sequence is generated first, and the error text is then converted directly into the correct text according to that sequence, instead of generating part of the sequence while simultaneously converting part of the error text according to that partial sequence.
  • this avoids the time-series dependence caused by the interleaving of encoder encoding and decoder decoding; that is, the text error correction problem is converted into a sequence generation problem, and the text to be corrected is finally corrected with the generated sequence, so that generating the error correction editing operation sequence and converting the error text into correct text can proceed in parallel, thereby improving the error correction speed of the text error correction process.
  • Fig. 1 is the schematic flow chart of the first embodiment of the text error correction method of the present application
  • FIG. 2 is a schematic diagram of an implementation process of a multi-head attention mechanism in a bidirectional pre-trained language model in an embodiment of the present application
  • FIG. 3 is a schematic flowchart of the second embodiment of the text error correction method of the present application.
  • Fig. 4 is the functional module schematic diagram of the preferred embodiment of the text error correction device of the present application.
  • FIG. 5 is a schematic structural diagram of a hardware operating environment involved in the solution of the embodiment of the present application.
  • FIG. 1 is a schematic flowchart of the first embodiment of the text error correction method of the present application.
  • the embodiments of the present application provide embodiments of the text error correction method. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that given here.
  • the text error correction method can be applied to mobile terminals, which include but are not limited to mobile phones and personal computers. For ease of description, the execution body of each step of the text error correction method is omitted below.
  • the text error correction method includes:
  • Step S110: acquire the text to be corrected.
  • the text to be corrected that needs to be corrected is obtained.
  • the task of correcting the text to be corrected is a text error correction task.
  • the text error correction task needs to correct part of the characters of the text to be corrected (that is, in the vast majority of cases the wrong sentence and the correct sentence differ only at specific positions); for example, when news practitioners edit press releases quickly for timeliness, editing errors such as wrong, extra, and missing characters are common.
  • therefore, the text error correction task only needs to modify specific positions of the text rather than regenerate the text. It can be understood that the text error correction task is a text conversion task.
  • to handle this, the present embodiment adopts the idea of edit distance (edit distance is a quantitative measure of the degree of difference between two character strings, such as English words, measured as the minimum number of edits required to turn one string into the other): converting text E into text F (where text E and text F differ) requires a series of edits, each of which adds a character, deletes a character, or replaces a character at some position of text E. For example, if text E is "今天太阳非常大" and text F is "今天太非常大", converting text E into text F requires deleting the character "阳" after "太" in text E.
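  • To make the edit-distance idea concrete, the following is a minimal sketch of the standard Levenshtein dynamic program (illustrative only; the patent itself contains no code, and the function name is ours):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or replacements needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a[i-1]
                           dp[i][j - 1] + 1,          # insert b[j-1]
                           dp[i - 1][j - 1] + cost)   # keep or replace
    return dp[m][n]

print(edit_distance("今天太阳非常大", "今天太非常大"))  # 1 (delete "阳")
```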
  • Step S120: input the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is trained by a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert the preset error text into correct text corresponding to the preset error text.
  • the above-mentioned text to be corrected is input into a preset text error correction model to generate an error correction editing operation sequence;
  • the preset text error correction model is obtained by training a preset annotation editing operation sequence;
  • the preset annotation editing operation sequence is used to convert the preset error text into the correct text corresponding to the preset error text.
  • the preset annotation editing operation sequence can be obtained by manually annotating the preset error text; that is, the preset error text is corrected manually, and the editing operations corresponding to the error correction process are organized into the preset annotation editing operation sequence.
  • the error correction editing operation sequence includes at least one editing operation, and each editing operation is one of the following: keep the current character (C), delete the current character (D), or insert a character or character string after the current character (A(w)), where "w" is a character or character string.
  • for example, if text X is "今太阳的真的非常大" and text Y is "今天太阳真的非常大", the process of converting text X into text Y can be: keep the character "今", insert the character "天" after "今", keep "太", keep "阳", delete "的", keep "真", keep "的", keep "非", keep "常", and keep "大".
  • obtaining the error correction editing operation sequence from the text to be corrected is realized by a preset sequence-to-editing-operation algorithm.
  • the preset sequence-to-editing-operation algorithm may be the seq2edit algorithm, and its specific implementation process is as follows:
  • a series of editing operations (for example C and A) can convert the error text into correct text, so an editing operation sequence is generated from the individual editing operations; for example, if the error text is "我来字上海" and the correct text is "我来自上海", one only needs to delete "字" and then change it to "自", so the generated editing operation sequence is "CCDACC".
  • this embodiment optimizes the editing operation sequence and proposes a new editing operation that replaces the current character with a character or character string (R(w)); it can be understood that the "replace" editing operation can substitute for the combination of the "delete" and "insert" editing operations, that is, the optimized editing operation sequence is "CCRCC". The optimized editing operation sequence is thus simplified, which improves the efficiency of the preset text error correction model in generating editing operation sequences.
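  • As an illustration of how such a sequence might be replayed over a text, the sketch below encodes the operations C, D, R(w), and A(w) as small tuples, one per character of the input; the tuple encoding is an assumption made for illustration, since the patent only names the operations:

```python
# Apply one edit operation per character of the input text.
# ("C",) keep, ("D",) delete, ("R", w) replace with w,
# ("A", w) keep the character and insert w after it.
def apply_edits(text: str, ops) -> str:
    assert len(text) == len(ops)   # one operation per input character
    out = []
    for ch, op in zip(text, ops):
        kind = op[0]
        if kind == "C":            # keep the current character
            out.append(ch)
        elif kind == "D":          # delete the current character
            pass
        elif kind == "R":          # replace the current character
            out.append(op[1])
        elif kind == "A":          # keep it, then insert after it
            out.append(ch)
            out.append(op[1])
    return "".join(out)

# "我来字上海" -> "我来自上海" with the optimized sequence C C R("自") C C:
print(apply_edits("我来字上海", [("C",), ("C",), ("R", "自"), ("C",), ("C",)]))
```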
  • the above-mentioned acquisition of the preset text error correction model includes:
  • Step a: obtain the training data set and the model to be trained.
  • a training data set and a to-be-trained model are acquired, so as to train the to-be-trained model through the training data set.
  • the above-mentioned training data set includes one or more training samples and standard detection results corresponding to each of the training samples, and the above-mentioned acquisition of the training data set includes:
  • Step a11: obtain training samples;
  • Step a12: annotate the training samples to obtain standard detection results.
  • the training data set includes one or more training samples and standard detection results corresponding to each of the training samples. Specifically, a training sample is obtained, and then the training sample is marked to obtain a standard detection result.
  • the training sample is error text;
  • the annotation process consists of determining the editing operations required to convert the error text into correct text and determining the editing operation sequence corresponding to those editing operations;
  • that editing operation sequence is the standard detection result (see the sketch after this list for one way such labels can be produced).
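  • For illustration, a label sequence of this kind can also be derived automatically from an (error text, correct text) pair; the sketch below uses difflib alignment to emit C/D/R/A labels. The patent describes manual annotation, so this is only one plausible way to produce such labels:

```python
import difflib

# Derive a per-character edit-operation label sequence from an
# (error, correct) pair via sequence alignment (illustrative only).
def label_edits(error: str, correct: str):
    ops = []
    sm = difflib.SequenceMatcher(a=error, b=correct)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops += [("C",)] * (i2 - i1)
        elif tag == "delete":
            ops += [("D",)] * (i2 - i1)
        elif tag == "replace":
            ops.append(("R", correct[j1:j2]))      # replace the first source char
            ops += [("D",)] * (i2 - i1 - 1)        # delete any extra source chars
        elif tag == "insert":
            ops.append(("A", correct[j1:j2]))      # attach the insert to the gap
    return ops

print(label_edits("我来字上海", "我来自上海"))
# [('C',), ('C',), ('R', '自'), ('C',), ('C',)]
```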
  • Obtaining the model to be trained above includes:
  • Step a21: obtain a bidirectional pre-trained language model.
  • a bidirectional pretrained language model is obtained.
  • before error text is input into the bidirectional pre-trained language model, its character sequence must be converted into initial character vectors; for example, the error text X = (x_1, x_2, ..., x_n) corresponds to the initial character vector D = [d_1, d_2, ..., d_n]. To encode the position of each character in the error text, a position vector P = [p_1, p_2, ..., p_n] represents each character's absolute position, where n is the number of characters in the preset vocabulary (which contains at least all characters appearing in error texts). Adding D and P gives the target character vector H = [h_1, h_2, ..., h_n].
  • for example, if D is [2,3,4,5,6]
  • and P is [0,1,0,3,2,0,4,5],
  • then H is [(0,0),(2,1),(0,0),(4,3),(3,2),(0,0),(5,4),(6,5)].
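  • A minimal sketch of this input encoding, summing a character embedding and a position embedding to obtain H = D + P (the vocabulary size, dimensions, and character ids below are made-up illustration values, not taken from the patent):

```python
import numpy as np

vocab_size, max_len, dim = 21128, 512, 768          # illustrative sizes
rng = np.random.default_rng(0)
char_emb = rng.normal(size=(vocab_size, dim))       # initial character vectors d
pos_emb = rng.normal(size=(max_len, dim))           # absolute position vectors p

def encode_input(char_ids):
    d = char_emb[char_ids]                          # initial character vector D
    p = pos_emb[np.arange(len(char_ids))]           # position vector P
    return d + p                                    # target character vector H

h = encode_input([101, 2769, 3341, 2099, 677])      # hypothetical character ids
print(h.shape)                                      # (5, 768)
```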
  • compared with a pre-trained language model that corrects errors using only the preceding context, the bidirectional pre-trained language model uses the context on both sides of a character in the error text when correcting, which improves the accuracy of its output.
  • Step a22: adaptively adjust the bidirectional pre-trained language model to obtain the model to be trained.
  • the to-be-trained model is obtained by performing a preset adjustment on the bidirectional pre-trained language model, and the preset adjustment is an adjustment to suit the usage requirements, that is, the bidirectional pre-trained language model is adaptively adjusted, including adjusting the input of the model, adjusting the loss function, etc.
  • Step a23: add a self-attention mechanism to the bidirectional pre-trained language model.
  • a self-attention mechanism is added to the bidirectional pre-trained language model so that the target character vector H = [h_1, h_2, ..., h_n] is further encoded, which improves the accuracy of the model's output.
  • through the self-attention mechanism, the weight of each character of the error text relative to the other characters is output.
  • the formula used in the encoding process is: Attention(Q, K, V) = softmax(QK^T / √D_k) · V, where Q, K, and V all refer to the target character vector H = [h_1, h_2, ..., h_n], and D_k is the vector dimension of the target character vector.
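  • The following sketch implements the scaled dot-product self-attention formula above with NumPy (shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(D_k)) V, with Q = K = V = H.
def self_attention(H):
    d_k = H.shape[-1]                      # vector dimension D_k
    scores = H @ H.T / np.sqrt(d_k)        # weight of each character vs. the others
    return softmax(scores) @ H

H = np.random.default_rng(1).normal(size=(6, 768))  # 6 characters, illustrative dim
print(self_attention(H).shape)                      # (6, 768)
```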
  • Step a24: add a multi-head self-attention mechanism to the bidirectional pre-trained language model.
  • a multi-head self-attention mechanism is added to the bidirectional pre-trained language model. It should be noted that, to extract the multiple semantics in the error text and thereby make the output of the bidirectional pre-trained language model more accurate, the self-attention mechanism is a multi-head attention mechanism whose per-head formula is head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), where W_i^Q, W_i^K, and W_i^V are parameters updated during training; the heads are then concatenated:
  • MultiHead = concat(head_1, head_2, ..., head_o);
  • the concatenation result is then passed through a fully connected layer to mix the outputs of the multi-head attention mechanism, after which the output of the bidirectional pre-trained language model is obtained.
  • the elements in the "circle” represent the nodes in the next layer of the network in the bidirectional pre-training language model, and the elements in the "square” represent the The nodes in the previous layer of the network in the bidirectional pre-trained language model, where the arrows represent the attention information of the above-mentioned multi-head attention mechanism, for example, in the calculation element the attention score of , it needs to point to the node in the previous layer of the network through the arrow and calculate.
  • Step b: perform iterative training on the model to be trained based on the training data set to obtain an updated model to be trained, and determine whether the updated model to be trained satisfies a preset iteration end condition.
  • the model to be trained is iteratively trained based on the above training data set to obtain an updated model to be trained, and it is determined whether the updated model to be trained satisfies a preset iteration end condition.
  • the preset iteration end condition may be the convergence of the loss function.
  • specifically, when training the model to be trained, the sequence R = (r_1, r_2, ..., r_n) and the sequence A = (a_1, a_2, ..., a_n) are concatenated with the error text X = (x_1, x_2, ..., x_n) to obtain the target input sequence Input = (r_1, ..., r_n, x_1, ..., x_n, a_1, ..., a_n).
  • sequence R is the operation sequence for replacing the current character with a character or character string, and sequence A is the operation sequence for inserting a character or character string after the current character, with r_i = [M, p_i] and a_i = [M, (p_i + p_{i+1})/2],
  • where M is the character vector of the error text X corresponding to the mask character [MASK]; sequences R and A are related to the position vector P and unrelated to the character vector D or the target character vector H.
  • the target input sequence Input emphasizes position information. It can be understood that the target input sequence Input already contains the content of the error text X; to avoid repeating that content, sequences R and A are related to the absolute position of each character of the error text X in the error text X and unrelated to the content of the error text X.
  • the resulting target output sequence is (w_11, w_12, ..., w_1n, e_1, e_2, ..., e_n, w_21, w_22, ..., w_2n), where (w_11, w_12, ..., w_1n) are the characters to be replaced and (w_21, w_22, ..., w_2n) are the characters to be inserted.
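  • A toy sketch of this input construction follows; representing M as the literal string "[MASK]" and appending a final insertion slot at p_n + 1 are simplifications beyond what the text fixes:

```python
# Build Input = (r_1..r_n, x_1..x_n, a_1..a_n): each r_i pairs [MASK] with
# position p_i (a replacement slot), each a_i pairs [MASK] with the midpoint
# (p_i + p_{i+1}) / 2 (an insertion slot between adjacent characters).
MASK = "[MASK]"

def build_input(chars, positions):
    r = [(MASK, p) for p in positions]                       # replacement slots
    nxt = positions[1:] + [positions[-1] + 1]                # assumed final slot
    a = [(MASK, (p1 + p2) / 2) for p1, p2 in zip(positions, nxt)]  # insertion slots
    x = list(zip(chars, positions))                          # the error text itself
    return r + x + a

print(build_input(list("我来字上海"), [1, 2, 3, 4, 5]))
```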
  • from this, the probability P(e_i|x) of the editing operation e_i corresponding to each character of the error text X can be computed, and the corresponding cross-entropy loss function is L(e, x) = -∑_i log(P(e_i|x)); the relevant parameters of the model to be trained are updated by minimizing this cross-entropy loss function, so as to obtain the preset text error correction model.
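  • A minimal sketch of this objective; the four-way operation inventory and the random logits are illustrative stand-ins for the model's actual label set and scores:

```python
import numpy as np

# Cross-entropy over per-character edit-operation labels:
# L(e, x) = -sum_i log P(e_i | x).
OPS = ["C", "D", "R", "A"]                          # illustrative label set

def cross_entropy_loss(logits, gold_ops):
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over operations
    idx = [OPS.index(op) for op in gold_ops]
    return -np.log(probs[np.arange(len(idx)), idx]).sum()

logits = np.random.default_rng(3).normal(size=(5, len(OPS)))   # per-character scores
print(cross_entropy_loss(logits, ["C", "C", "R", "C", "C"]))   # labels for "我来字上海"
```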
  • Step c: if the updated model to be trained satisfies the preset iteration end condition, use the updated model to be trained as the preset text error correction model;
  • Step d: if the updated model to be trained does not satisfy the iteration end condition, continue iterative training and updating of the updated model to be trained until it satisfies the iteration end condition.
  • specifically, if the updated model to be trained satisfies the preset iteration end condition, that is, model training is complete, the updated model is used as the preset text error correction model; if it does not satisfy the iteration end condition, that is, the model has not finished training, iterative training and updating continue until the updated model satisfies the iteration end condition.
  • the text to be corrected is obtained; the text to be corrected is input into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is obtained by training with a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert the preset error text into correct text corresponding to the preset error text; based on the error correction editing operation sequence, error correction is performed on the text to be corrected to obtain the corrected text.
  • this improves the text error correction process: the error correction editing operation sequence is generated first, and the error text is then converted directly into the correct text according to that sequence, instead of generating part of the sequence while simultaneously converting part of the error text according to that partial sequence.
  • this avoids the time-series dependence caused by the interleaving of encoder encoding and decoder decoding; that is, the text error correction problem is converted into a sequence generation problem, and the text to be corrected is finally corrected with the generated sequence, so that generating the error correction editing operation sequence and converting the error text into correct text can proceed in parallel, thereby improving the error correction speed of the text error correction process.
  • referring to FIG. 3, based on the first embodiment of the text error correction method of the present application, a second embodiment is proposed, in which performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text includes:
  • Step S131: perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the initial corrected text.
  • specifically, the error correction editing operations are performed on the text to be corrected based on the error correction editing operation sequence to complete the correction and obtain the initial corrected text.
  • there may still be a gap between the initial corrected text and the correct text; that is, the initial corrected text is not necessarily the correct text.
  • for example, the initial corrected text may need one or more further editing operations before it becomes the correct text; understandably, the accuracy of the preset text error correction model generally cannot reach 100%.
  • Step S132: input the initial corrected text into the preset text error correction model for iterative error correction, obtain the updated corrected text, and determine whether the updated corrected text satisfies a preset iteration end requirement.
  • to reduce the gap between the initial corrected text and the correct text, this embodiment proposes inputting the initial corrected text into the preset text error correction model for iterative error correction, obtaining the updated corrected text, and determining whether the updated corrected text meets the preset iteration end requirement.
  • the preset iteration end requirement may be that the accuracy of the updated corrected text no longer requires another iterative update, or that the number of iterative updates reaches a preset threshold; the preset threshold can be set according to the specific situation and is not specifically limited in this embodiment.
  • Step S133: if the updated corrected text meets the preset iteration end requirement, use the updated corrected text as the target corrected text;
  • Step S134: if the updated corrected text does not meet the preset iteration end requirement, continue iterative error correction and updating of the updated corrected text until the updated corrected text satisfies the preset iteration end requirement.
  • specifically, if the updated corrected text meets the preset iteration end requirement, it is used as the target corrected text; if it does not, iterative error correction and updating continue until the updated corrected text meets the preset iteration end requirement, at which point the iteration stops and the updated corrected text is used as the target corrected text.
  • by feeding the corrected text back into the preset text error correction model for another round of correction, the model improves on increasingly "correct" corrected text each time, which improves the accuracy of the output of the preset text error correction model and further addresses the error propagation problem of prior-art text error correction processes.
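  • A sketch of this iterative loop; model_predict_ops stands in for the preset text error correction model, apply_edits is the helper sketched earlier, and stopping when the text no longer changes is one reading of the iteration end requirement:

```python
# Feed the corrected text back into the model until it stops changing
# or a preset number of rounds is reached.
def iterative_correct(text, model_predict_ops, apply_edits, max_rounds=3):
    for _ in range(max_rounds):              # preset threshold on iterations
        ops = model_predict_ops(text)
        corrected = apply_edits(text, ops)
        if corrected == text:                # no further edits: stop iterating
            return corrected
        text = corrected
    return text
```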
  • the present application also provides a text error correction apparatus; as shown in FIG. 4, the text error correction apparatus includes:
  • the first obtaining module 10 is used to obtain the text to be corrected;
  • the generating module 20 is configured to input the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence;
  • the preset text error correction model is obtained by training a preset annotation editing operation sequence;
  • the preset annotation editing operation sequence is used to convert the preset error text into correct text corresponding to the preset error text;
  • the error correction module 30 is configured to perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain a target text after error correction.
  • the text error correction apparatus also includes:
  • a second acquisition module, used to acquire the training data set and the model to be trained;
  • an iterative training module, used to perform iterative training on the model to be trained based on the training data set to obtain an updated model to be trained;
  • a determination module, configured to determine whether the updated model to be trained satisfies the preset iteration end condition; if it does, the updated model to be trained is used as the preset text error correction model; if it does not, iterative training and updating of the updated model to be trained continue until the iteration end condition is satisfied.
  • the first acquisition module 10 includes:
  • a first obtaining unit, used for obtaining a bidirectional pre-trained language model;
  • the adjustment unit is used for adaptively adjusting the bidirectional pre-training language model to obtain the model to be trained.
  • the first acquisition module 10 also includes:
  • an adding unit, used for adding a self-attention mechanism to the bidirectional pre-trained language model.
  • the adding unit includes:
  • an adding subunit, used for adding a multi-head self-attention mechanism to the bidirectional pre-trained language model.
  • the first acquisition module 10 also includes:
  • a second acquisition unit, used for acquiring training samples;
  • the labeling unit is used to label the training samples to obtain standard detection results.
  • the error correction module 30 includes:
  • An error correction unit configured to perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the text after initial error correction;
  • an iterative error correction unit, configured to input the initial corrected text into the preset text error correction model for iterative error correction to obtain the updated corrected text;
  • a determination unit, configured to determine whether the updated corrected text meets the preset iteration end requirement; if it does, the updated corrected text is used as the target corrected text; if it does not, iterative error correction and updating of the updated corrected text continue until the updated corrected text meets the preset iteration end requirement.
  • FIG. 5 is a schematic structural diagram of a hardware operating environment involved in the solution of the embodiment of the present application.
  • FIG. 5 can be a schematic structural diagram of the hardware operating environment of the text error correction device.
  • the text error correction device may include: a processor 1001 , such as a CPU, a memory 1005 , a user interface 1003 , a network interface 1004 , and a communication bus 1002 .
  • the communication bus 1002 is used to realize the connection communication between these components.
  • the user interface 1003 may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (eg, a WI-FI interface).
  • the memory 1005 may be high-speed RAM memory, or may be non-volatile memory, such as disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001 .
  • the text error correction device may further include an RF (Radio Frequency, radio frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like.
  • the structure of the text error correction device shown in FIG. 5 does not constitute a limitation on the text error correction device; it may include more or fewer components than shown, or combine some components, or use a different component arrangement.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module and a text error correction program.
  • the operating system is a program that manages and controls the hardware and software resources of the text error correction device, and supports the running of the text error correction program and other software or programs.
  • the user interface 1003 is mainly used to connect to a terminal and perform data communication with the terminal, for example, to obtain the error text sent by the terminal;
  • the network interface 1004 is mainly used to connect to a backend server and perform data communication with the backend server;
  • the processor 1001 can be used to call the text error correction program stored in the memory 1005, and execute the steps of the text error correction method as described above.
  • an embodiment of the present application also provides a computer-readable storage medium
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium
  • the computer-readable storage medium may also be a volatile computer-readable storage medium
  • the computer-readable storage medium stores instructions of a text error correction program, and when the text error correction program is executed by the processor, implements the steps of the above text error correction method.
  • the method of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A text error correction method, apparatus, device, and readable storage medium, relating to the technical field of artificial intelligence. The method includes: acquiring text to be corrected (S110); inputting the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence, and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text (S120); and performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text (S130). The method avoids the time-series dependence caused by the interleaving of encoder encoding and decoder decoding by converting the text error correction problem into a sequence generation problem, so that generating the error correction editing operation sequence and converting the error text into correct text can proceed in parallel, improving the error correction speed of the text error correction process.

Description

Text error correction method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 18, 2020, with application number 202011515647.X and the invention title "Text error correction method, apparatus, device, and readable storage medium", the entire contents of which are incorporated in this application by reference.
Technical Field
This application relates to the technical field of artificial intelligence, and in particular to a text error correction method, apparatus, device, and readable storage medium.
Background
In the course of drafting official documents or editing articles, extra characters, wrong characters, and missing characters occur frequently, and submitting a document free of typos often requires time-consuming manual proofreading, which reduces office efficiency to some extent. To solve this problem, automated and intelligent text error correction is essential.
The inventors have realized that current text error correction modeling relies mainly on an attention-based sequence-to-sequence encoder-decoder framework, which takes the original erroneous sentence as input, encodes it with the encoder, and then uses the decoder to decode the corrected sentence one character at a time. However, each decoding step of a sequence-to-sequence model depends on the decoder output of the previous step; this one-by-one decoding creates a time-series dependence and costs running speed, and the encoding and decoding processes are difficult to parallelize, leading to slow online running speed.
It can thus be seen that current text error correction tasks suffer from slow error correction speed.
Summary
The main purpose of this application is to provide a text error correction method, apparatus, device, and readable storage medium, aiming to solve the existing technical problem of slow error correction speed when performing text error correction tasks.
To achieve the above purpose, this application provides a text error correction method, which includes the following steps:
acquiring the text to be corrected;
inputting the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence, and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
In addition, to achieve the above purpose, this application also provides a text error correction device. The text error correction device includes a memory, a processor, and a text error correction program stored on the memory and runnable on the processor; when executed by the processor, the text error correction program implements the following steps of the text error correction method:
acquiring the text to be corrected;
inputting the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence, and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium on which a text error correction program is stored; when executed by a processor, the text error correction program implements the following steps of the text error correction method:
acquiring the text to be corrected;
inputting the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence, and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
In addition, to achieve the above purpose, this application also provides a text error correction apparatus, which includes:
an acquisition module, used to acquire the text to be corrected;
a generation module, used to input the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence, and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
an error correction module, used to perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
This application acquires the text to be corrected; inputs the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text; and performs error correction on the text to be corrected based on the error correction editing operation sequence to obtain the corrected text. This improves the text error correction process: the error correction editing operation sequence is generated first, and the error text is then converted directly into the correct text according to that sequence, instead of generating part of the sequence while simultaneously converting part of the error text according to that partial sequence. This avoids the time-series dependence caused by the interleaving of encoder encoding and decoder decoding; that is, the text error correction problem is converted into a sequence generation problem, and the text to be corrected is finally corrected with the generated sequence, so that generating the error correction editing operation sequence and converting the error text into correct text can proceed in parallel, thereby improving the error correction speed of the text error correction process.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the first embodiment of the text error correction method of this application;
FIG. 2 is a schematic diagram of the implementation of the multi-head attention mechanism in the bidirectional pre-trained language model in an embodiment of this application;
FIG. 3 is a schematic flowchart of the second embodiment of the text error correction method of this application;
FIG. 4 is a schematic diagram of the functional modules of a preferred embodiment of the text error correction apparatus of this application;
FIG. 5 is a schematic structural diagram of the hardware operating environment involved in the solutions of the embodiments of this application.
The realization of the purpose, functional features, and advantages of this application will be further explained with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.
This application provides a text error correction method. Referring to FIG. 1, FIG. 1 is a schematic flowchart of the first embodiment of the text error correction method of this application.
The embodiments of this application provide embodiments of the text error correction method. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that given here. The text error correction method can be applied to mobile terminals, including but not limited to mobile phones and personal computers; for ease of description, the execution body of each step of the text error correction method is omitted below. The text error correction method includes:
Step S110: acquire the text to be corrected.
Specifically, the text to be corrected that needs error correction is acquired.
It should be noted that the task of correcting the text to be corrected is a text error correction task. A text error correction task needs to correct part of the characters of the text to be corrected (that is, in the vast majority of cases the wrong sentence and the correct sentence differ only at specific positions). For example, when news practitioners edit press releases, they generally edit quickly for timeliness reasons, so editing errors such as wrong, extra, and missing characters are common. Therefore, the text error correction task only needs to modify specific positions of the text rather than regenerate the text. It can be understood that the text error correction task is a text conversion task.
For the above text error correction task, this embodiment adopts the idea of edit distance (edit distance is a quantitative measure of the degree of difference between two character strings, such as English words, measured as the minimum number of edits required to turn one string into the other). That is, converting text E into text F (where text E and text F differ) requires a series of edits (each of which at least adds a character, deletes a character, or replaces a character at some position of text E). For example, if text E is "今天太阳非常大" and text F is "今天太非常大", converting text E into text F requires deleting the character "阳" after "太" in text E.
Step S120: input the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is trained by a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text.
Specifically, the text to be corrected is input into a preset text error correction model to generate an error correction editing operation sequence. The preset text error correction model is trained by a preset annotation editing operation sequence, which is used to convert preset error text into the correct text corresponding to it. It should be noted that the preset annotation editing operation sequence can be obtained by manually annotating the preset error text; that is, the preset error text is corrected manually, and the editing operations corresponding to the correction process are organized into the preset annotation editing operation sequence.
It should be noted that the error correction editing operation sequence includes at least one editing operation, and each editing operation is one of the following: keep the current character (C), delete the current character (D), or insert a character or character string after the current character (A(w)), where "w" is a character or character string. For example, if text X is "今太阳的真的非常大" and text Y is "今天太阳真的非常大", the process of converting text X into text Y can be: keep "今", insert "天" after "今", keep "太", keep "阳", delete "的", keep "真", keep "的", keep "非", keep "常", and keep "大".
It should be noted that obtaining the error correction editing operation sequence from the text to be corrected is realized by a preset sequence-to-editing-operation algorithm, which may be the seq2edit algorithm. Its specific implementation process is as follows:
A series of editing operations (for example C and A) can convert error text into correct text, so an editing operation sequence is generated from the individual editing operations. For example, if the error text is "我来字上海" and the correct text is "我来自上海", one only needs to delete "字" and then change it to "自", so the generated editing operation sequence is "CCDACC". This embodiment optimizes the editing operation sequence and proposes a new editing operation that replaces the current character with a character or character string (R(w)). It can be understood that the "replace" editing operation can substitute for the combination of the "delete" and "insert" editing operations; that is, the optimized editing operation sequence is "CCRCC". The optimized editing operation sequence is simplified, which improves the efficiency of the preset text error correction model in generating editing operation sequences.
Further, acquiring the preset text error correction model includes:
Step a: acquire a training data set and a model to be trained.
Specifically, a training data set and a model to be trained are acquired, so that the model to be trained can be trained with the training data set.
The training data set includes one or more training samples and the standard detection result corresponding to each training sample; acquiring the training data set includes:
Step a11: acquire training samples;
Step a12: annotate the training samples to obtain standard detection results.
Specifically, the training data set includes one or more training samples and the standard detection result corresponding to each training sample. Training samples are acquired and then annotated to obtain the standard detection results.
Specifically, a training sample is error text, and the annotation process consists of determining the editing operations required to convert the error text into correct text and determining the editing operation sequence corresponding to those editing operations; that editing operation sequence is the standard detection result.
Acquiring the model to be trained includes:
Step a21: acquire a bidirectional pre-trained language model.
Specifically, a bidirectional pre-trained language model is acquired. It should be noted that, before error text is input into the bidirectional pre-trained language model, its character sequence must be converted into initial character vectors; for example, the error text X = (x_1, x_2, ..., x_n) corresponds to the initial character vector D = [d_1, d_2, ..., d_n]. In addition, to encode the position of each character in the error text, a position vector P = [p_1, p_2, ..., p_n] is used to represent the absolute position of each character, where n is the number of characters in the preset vocabulary (which contains at least all characters appearing in error texts); it should be noted that the position vector P can represent the position of any character in the error text. For example, if the error text is "我来字上海", the character "我" is at position 1 in the error text, and "我" is at position 32 in the preset vocabulary, then in the position vector P the entry for "我" is p_32 = 1. Finally, adding the initial character vector D = [d_1, d_2, ..., d_n] and the position vector P = [p_1, p_2, ..., p_n] gives the target character vector H = [h_1, h_2, ..., h_n].
For example, if D is [2,3,4,5,6] and P is [0,1,0,3,2,0,4,5],
then H is [(0,0),(2,1),(0,0),(4,3),(3,2),(0,0),(5,4),(6,5)].
As for "bidirectional": compared with a pre-trained language model that corrects errors based only on the preceding context, the bidirectional pre-trained language model uses the context on both sides of a character in the error text when correcting errors, which improves the accuracy of its output.
Step a22: adaptively adjust the bidirectional pre-trained language model to obtain the model to be trained.
Specifically, the model to be trained is obtained by applying preset adjustments to the bidirectional pre-trained language model; the preset adjustments adapt the model to the usage requirements, that is, the bidirectional pre-trained language model is adaptively adjusted, including adjusting the model's input, adjusting the loss function, and so on.
After acquiring the bidirectional pre-trained language model, the method includes:
Step a23: add a self-attention mechanism to the bidirectional pre-trained language model.
Specifically, a self-attention mechanism is added to the bidirectional pre-trained language model. It should be noted that, to improve the accuracy of the output of the bidirectional pre-trained language model, the self-attention mechanism is used to further encode the target character vector H = [h_1, h_2, ..., h_n]. Specifically, the self-attention mechanism outputs the weight of each character of the error text relative to the other characters. The formula used in the encoding process is:
Attention(Q, K, V) = softmax(QK^T / √D_k) · V;
where Q, K, and V all refer to the target character vector H = [h_1, h_2, ..., h_n], and D_k refers to the vector dimension of the target character vector.
Adding a self-attention mechanism to the bidirectional pre-trained language model includes:
Step a24: add a multi-head self-attention mechanism to the bidirectional pre-trained language model.
Specifically, a multi-head self-attention mechanism is added to the bidirectional pre-trained language model. It should be noted that, to extract the multiple semantics in the error text and thereby make the output of the bidirectional pre-trained language model more accurate through these multiple semantics, the self-attention mechanism is a multi-head attention mechanism, whose per-head formula is:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V);
where Q, K, and V all refer to the target character vector H = [h_1, h_2, ..., h_n], and W_i^Q, W_i^K, and W_i^V are parameters that need to be updated during the training of the bidirectional pre-trained language model.
Afterwards, the heads of the multi-head self-attention output (for example head_1 and head_2) are concatenated to obtain the text feature representation of the error text; the formula corresponding to the concatenation process is:
MultiHead = concat(head_1, head_2, ..., head_o);
After the concatenation result is obtained, it is passed through a fully connected layer to mix the outputs of the multi-head attention mechanism, after which the output of the bidirectional pre-trained language model is obtained.
The specific implementation of the multi-head attention mechanism in the bidirectional pre-trained language model is shown in FIG. 2: the elements in the "circles" represent nodes in the next layer of the network, the elements in the "squares" represent nodes in the previous layer, and the arrows represent the attention information of the multi-head attention mechanism; for example, when the attention score of an element is computed, the arrows point to the nodes in the previous layer that participate in the computation.
Step b: iteratively train the model to be trained based on the training data set to obtain an updated model to be trained, and determine whether the updated model to be trained satisfies a preset iteration end condition.
Specifically, the model to be trained is iteratively trained based on the training data set to obtain an updated model to be trained, and it is determined whether the updated model satisfies the preset iteration end condition. It should be noted that the preset iteration end condition may be convergence of the loss function.
Specifically, when training the model to be trained, the sequence R = (r_1, r_2, ..., r_n) and the sequence A = (a_1, a_2, ..., a_n) are concatenated with the error text X = (x_1, x_2, ..., x_n) to obtain the target input sequence Input = (r_1, r_2, ..., r_n, x_1, x_2, ..., x_n, a_1, a_2, ..., a_n). Sequence R is the operation sequence for replacing the current character with a character or character string, and sequence A is the operation sequence for inserting a character or character string after the current character, with r_i = [M, p_i] and a_i = [M, (p_i + p_{i+1})/2], where M is the character vector of the error text X corresponding to the mask character [MASK]. It can be understood that sequences R and A are related to the position vector P and are unrelated to the character vector D or the target character vector H.
It should be noted that the target input sequence Input emphasizes position information. It can be understood that the target input sequence Input already contains the content of the error text X; to avoid repeating that content, sequences R and A are related to the absolute position of each character of the error text X in the error text X and are unrelated to the content of the error text X.
The resulting target output sequence is (w_11, w_12, ..., w_1n, e_1, e_2, ..., e_n, w_21, w_22, ..., w_2n), where (w_11, w_12, ..., w_1n) are the characters to be replaced and (w_21, w_22, ..., w_2n) are the characters to be inserted.
From this, the probability P(e_i|x) of the editing operation e_i corresponding to each character of the error text X can be computed, and the corresponding cross-entropy loss function is:
L(e, x) = -∑_i log(P(e_i|x));
The relevant parameters of the model to be trained are then updated by minimizing this cross-entropy loss function, so as to obtain the preset text error correction model.
Step c: if the updated model to be trained satisfies the preset iteration end condition, use the updated model to be trained as the preset text error correction model;
Step d: if the updated model to be trained does not satisfy the iteration end condition, continue iterative training and updating of the updated model to be trained until it satisfies the iteration end condition.
Specifically, if the updated model to be trained satisfies the preset iteration end condition, that is, model training is complete, the updated model is used as the preset text error correction model; if it does not satisfy the iteration end condition, that is, the model has not finished training, iterative training and updating continue until the updated model satisfies the iteration end condition.
This embodiment acquires the text to be corrected; inputs the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text; and performs error correction on the text to be corrected based on the error correction editing operation sequence to obtain the corrected text. This improves the text error correction process: the error correction editing operation sequence is generated first, and the error text is then converted directly into the correct text according to that sequence, instead of generating part of the sequence while simultaneously converting part of the error text according to that partial sequence. This avoids the time-series dependence caused by the interleaving of encoder encoding and decoder decoding; that is, the text error correction problem is converted into a sequence generation problem, and the text to be corrected is finally corrected with the generated sequence, so that generating the error correction editing operation sequence and converting error text into correct text can proceed in parallel, thereby improving the error correction speed of the text error correction process.
Referring to FIG. 3, based on the first embodiment of the text error correction method of this application, a second embodiment is proposed, in which performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text includes:
Step S131: perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the initial corrected text.
Specifically, the error correction editing operations are performed on the text to be corrected based on the error correction editing operation sequence to complete the correction and obtain the initial corrected text. There may still be a gap between the initial corrected text and the correct text; that is, the initial corrected text is not necessarily the correct text. For example, the initial corrected text may need one or more further editing operations before it becomes the correct text. It is understandable that the accuracy of the preset text error correction model generally cannot reach 100%.
Step S132: input the initial corrected text into the preset text error correction model for iterative error correction, obtain the updated corrected text, and determine whether the updated corrected text satisfies a preset iteration end requirement.
Specifically, to reduce the gap between the initial corrected text and the correct text, this embodiment proposes inputting the initial corrected text into the preset text error correction model for iterative error correction, obtaining the updated corrected text, and determining whether the updated corrected text satisfies the preset iteration end requirement. It should be noted that the preset iteration end requirement may be that the accuracy of the updated corrected text no longer requires another iterative update, or that the number of iterative updates reaches a preset threshold; the preset threshold can be set according to the specific situation and is not specifically limited in this embodiment.
Step S133: if the updated corrected text satisfies the preset iteration end requirement, use the updated corrected text as the target corrected text;
Step S134: if the updated corrected text does not satisfy the preset iteration end requirement, continue iterative error correction and updating of the updated corrected text until it satisfies the preset iteration end requirement.
Specifically, if the updated corrected text satisfies the preset iteration end requirement, it is used as the target corrected text; if it does not, iterative error correction and updating continue until the updated corrected text satisfies the preset iteration end requirement, at which point the iteration stops and the updated corrected text is used as the target corrected text.
By feeding the corrected text back into the preset text error correction model for further correction, this embodiment lets the model improve on increasingly "correct" corrected text each time, which improves the accuracy of the output of the preset text error correction model and further solves the error propagation problem in existing text error correction processes.
In addition, this application also provides a text error correction apparatus. As shown in FIG. 4, the text error correction apparatus includes:
a first acquisition module 10, used to acquire the text to be corrected;
a generation module 20, used to input the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence, where the preset text error correction model is trained by a preset annotation editing operation sequence, and the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
an error correction module 30, used to perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
Further, the text error correction apparatus also includes:
a second acquisition module, used to acquire a training data set and a model to be trained;
an iterative training module, used to iteratively train the model to be trained based on the training data set to obtain an updated model to be trained;
a determination module, used to determine whether the updated model to be trained satisfies a preset iteration end condition; if the updated model to be trained satisfies the preset iteration end condition, the updated model to be trained is used as the preset text error correction model; if the updated model to be trained does not satisfy the iteration end condition, iterative training and updating of the updated model to be trained continue until it satisfies the iteration end condition.
Further, the first acquisition module 10 includes:
a first acquisition unit, used to acquire a bidirectional pre-trained language model;
an adjustment unit, used to adaptively adjust the bidirectional pre-trained language model to obtain the model to be trained.
Further, the first acquisition module 10 also includes:
an adding unit, used to add a self-attention mechanism to the bidirectional pre-trained language model.
Further, the adding unit includes:
an adding subunit, used to add a multi-head self-attention mechanism to the bidirectional pre-trained language model.
Further, the first acquisition module 10 also includes:
a second acquisition unit, used to acquire training samples;
an annotation unit, used to annotate the training samples to obtain standard detection results.
Further, the error correction module 30 includes:
an error correction unit, used to perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the initial corrected text;
an iterative error correction unit, used to input the initial corrected text into the preset text error correction model for iterative error correction to obtain the updated corrected text;
a determination unit, used to determine whether the updated corrected text satisfies a preset iteration end requirement; if the updated corrected text satisfies the preset iteration end requirement, the updated corrected text is used as the target corrected text; if the updated corrected text does not satisfy the preset iteration end requirement, iterative error correction and updating of the updated corrected text continue until it satisfies the preset iteration end requirement.
The specific implementation of the text error correction apparatus of this application is basically the same as the embodiments of the text error correction method described above and will not be repeated here.
In addition, this application also provides a text error correction device. As shown in FIG. 5, FIG. 5 is a schematic structural diagram of the hardware operating environment involved in the solutions of the embodiments of this application.
It should be noted that FIG. 5 can be regarded as a schematic structural diagram of the hardware operating environment of the text error correction device.
As shown in FIG. 5, the text error correction device may include: a processor 1001 such as a CPU, a memory 1005, a user interface 1003, a network interface 1004, and a communication bus 1002. The communication bus 1002 is used to realize connection and communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a WI-FI interface). The memory 1005 may be high-speed RAM memory or stable non-volatile memory, such as disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
Optionally, the text error correction device may also include an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and so on.
Those skilled in the art can understand that the structure of the text error correction device shown in FIG. 5 does not constitute a limitation on the text error correction device; it may include more or fewer components than shown, or combine some components, or use a different component arrangement.
As shown in FIG. 5, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a text error correction program. The operating system is a program that manages and controls the hardware and software resources of the text error correction device and supports the running of the text error correction program and other software or programs.
In the text error correction device shown in FIG. 5, the user interface 1003 is mainly used to connect to a terminal and perform data communication with the terminal, for example, to obtain the error text sent by the terminal; the network interface 1004 is mainly used to connect to a backend server and perform data communication with the backend server; the processor 1001 can be used to call the text error correction program stored in the memory 1005 and execute the steps of the text error correction method described above.
The specific implementation of the text error correction device of this application is basically the same as the embodiments of the text error correction method described above and will not be repeated here.
In addition, an embodiment of this application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions of a text error correction program are stored on the computer-readable storage medium, and when the text error correction program is executed by a processor, the steps of the text error correction method described above are implemented.
The specific implementation of the computer-readable storage medium of this application is basically the same as the embodiments of the text error correction method described above and will not be repeated here.
It should be noted that, as used herein, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or apparatus that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of this application.
The above are only preferred embodiments of this application and do not therefore limit the patent scope of this application. Any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A text error correction method, wherein the text error correction method includes the following steps:
    acquiring the text to be corrected;
    inputting the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is trained by a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
    performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
  2. The method according to claim 1, wherein acquiring the preset text error correction model includes:
    acquiring a training data set and a model to be trained;
    iteratively training the model to be trained based on the training data set to obtain an updated model to be trained, and determining whether the updated model to be trained satisfies a preset iteration end condition;
    if the updated model to be trained satisfies the preset iteration end condition, using the updated model to be trained as the preset text error correction model;
    if the updated model to be trained does not satisfy the iteration end condition, continuing iterative training and updating of the updated model to be trained until the updated model to be trained satisfies the iteration end condition.
  3. The method according to claim 2, wherein acquiring the model to be trained includes:
    acquiring a bidirectional pre-trained language model;
    adaptively adjusting the bidirectional pre-trained language model to obtain the model to be trained.
  4. The method according to claim 3, wherein after acquiring the bidirectional pre-trained language model, the method includes:
    adding a self-attention mechanism to the bidirectional pre-trained language model.
  5. The method according to claim 4, wherein adding a self-attention mechanism to the bidirectional pre-trained language model includes:
    adding a multi-head self-attention mechanism to the bidirectional pre-trained language model.
  6. The method according to claim 2, wherein the training data set includes one or more training samples and the standard detection result corresponding to each training sample, and acquiring the training data set includes:
    acquiring training samples;
    annotating the training samples to obtain standard detection results.
  7. The method according to any one of claims 1-6, wherein performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text includes:
    performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the initial corrected text;
    inputting the initial corrected text into the preset text error correction model for iterative error correction, obtaining the updated corrected text, and determining whether the updated corrected text satisfies a preset iteration end requirement;
    if the updated corrected text satisfies the preset iteration end requirement, using the updated corrected text as the target corrected text;
    if the updated corrected text does not satisfy the preset iteration end requirement, continuing iterative error correction and updating of the updated corrected text until the updated corrected text satisfies the preset iteration end requirement.
  8. A text error correction device, wherein the text error correction device includes a memory, a processor, and a text error correction program stored on the memory and runnable on the processor, and when the text error correction program is executed by the processor, the following steps of the text error correction method are implemented:
    acquiring the text to be corrected;
    inputting the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is trained by a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
    performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
  9. The text error correction device according to claim 8, wherein when the text error correction program is executed by the processor to implement the step of acquiring the preset text error correction model, the step includes:
    acquiring a training data set and a model to be trained;
    iteratively training the model to be trained based on the training data set to obtain an updated model to be trained, and determining whether the updated model to be trained satisfies a preset iteration end condition;
    if the updated model to be trained satisfies the preset iteration end condition, using the updated model to be trained as the preset text error correction model;
    if the updated model to be trained does not satisfy the iteration end condition, continuing iterative training and updating of the updated model to be trained until the updated model to be trained satisfies the iteration end condition.
  10. The text error correction device according to claim 9, wherein when the text error correction program is executed by the processor to implement the step of acquiring the model to be trained, the step includes:
    acquiring a bidirectional pre-trained language model;
    adaptively adjusting the bidirectional pre-trained language model to obtain the model to be trained.
  11. The text error correction device according to claim 10, wherein after the text error correction program is executed by the processor to implement the step of acquiring the bidirectional pre-trained language model, the method further includes:
    adding a self-attention mechanism to the bidirectional pre-trained language model.
  12. The text error correction device according to claim 8, wherein the training data set includes one or more training samples and the standard detection result corresponding to each training sample, and when the text error correction program is executed by the processor to implement the step of acquiring the training data set, the step includes:
    acquiring training samples;
    annotating the training samples to obtain standard detection results.
  13. The text error correction device according to any one of claims 8-12, wherein when the text error correction program is executed by the processor to implement the step of performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text, the step includes:
    performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the initial corrected text;
    inputting the initial corrected text into the preset text error correction model for iterative error correction, obtaining the updated corrected text, and determining whether the updated corrected text satisfies a preset iteration end requirement;
    if the updated corrected text satisfies the preset iteration end requirement, using the updated corrected text as the target corrected text;
    if the updated corrected text does not satisfy the preset iteration end requirement, continuing iterative error correction and updating of the updated corrected text until the updated corrected text satisfies the preset iteration end requirement.
  14. A computer-readable storage medium, wherein a text error correction program is stored on the computer-readable storage medium, and when the text error correction program is executed by a processor, the following steps of the text error correction method are implemented:
    acquiring the text to be corrected;
    inputting the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is trained by a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
    performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
  15. The computer-readable storage medium according to claim 14, wherein when the text error correction program is executed by a processor to implement the step of acquiring the preset text error correction model, the step includes:
    acquiring a training data set and a model to be trained;
    iteratively training the model to be trained based on the training data set to obtain an updated model to be trained, and determining whether the updated model to be trained satisfies a preset iteration end condition;
    if the updated model to be trained satisfies the preset iteration end condition, using the updated model to be trained as the preset text error correction model;
    if the updated model to be trained does not satisfy the iteration end condition, continuing iterative training and updating of the updated model to be trained until the updated model to be trained satisfies the iteration end condition.
  16. The computer-readable storage medium according to claim 15, wherein when the text error correction program is executed by a processor to implement the step of acquiring the model to be trained, the step includes:
    acquiring a bidirectional pre-trained language model;
    adaptively adjusting the bidirectional pre-trained language model to obtain the model to be trained.
  17. The computer-readable storage medium according to claim 16, wherein after the text error correction program is executed by a processor to implement the step of acquiring the bidirectional pre-trained language model, the method further includes:
    adding a self-attention mechanism to the bidirectional pre-trained language model.
  18. The computer-readable storage medium according to claim 14, wherein the training data set includes one or more training samples and the standard detection result corresponding to each training sample, and when the text error correction program is executed by a processor to implement the step of acquiring the training data set, the step includes:
    acquiring training samples;
    annotating the training samples to obtain standard detection results.
  19. The computer-readable storage medium according to any one of claims 14-18, wherein when the text error correction program is executed by a processor to implement the step of performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text, the step includes:
    performing error correction on the text to be corrected based on the error correction editing operation sequence to obtain the initial corrected text;
    inputting the initial corrected text into the preset text error correction model for iterative error correction, obtaining the updated corrected text, and determining whether the updated corrected text satisfies a preset iteration end requirement;
    if the updated corrected text satisfies the preset iteration end requirement, using the updated corrected text as the target corrected text;
    if the updated corrected text does not satisfy the preset iteration end requirement, continuing iterative error correction and updating of the updated corrected text until the updated corrected text satisfies the preset iteration end requirement.
  20. A text error correction apparatus, wherein the text error correction apparatus includes:
    an acquisition module, used to acquire the text to be corrected;
    a generation module, used to input the text to be corrected into a preset text error correction model to generate an error correction editing operation sequence; the preset text error correction model is trained by a preset annotation editing operation sequence; the preset annotation editing operation sequence is used to convert preset error text into correct text corresponding to the preset error text;
    an error correction module, used to perform error correction on the text to be corrected based on the error correction editing operation sequence to obtain the target corrected text.
PCT/CN2021/082587 2020-12-18 2021-03-24 Text error correction method, apparatus, device and storage medium WO2022126897A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011515647.X 2020-12-18
CN202011515647.XA CN112632912A (zh) 2020-12-18 Text error correction method, apparatus, device and readable storage medium

Publications (1)

Publication Number Publication Date
WO2022126897A1 true WO2022126897A1 (zh) 2022-06-23

Family

ID=75318034

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/082587 Text error correction method, apparatus, device and storage medium 2020-12-18 2021-03-24

Country Status (2)

Country Link
CN (1) CN112632912A (zh)
WO (1) WO2022126897A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115906815A (zh) * 2023-03-08 2023-04-04 北京语言大学 Error correction method and device for revising one or more types of erroneous sentences
CN116127953A (zh) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrastive learning
CN116822498A (zh) * 2023-08-30 2023-09-29 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, apparatus, device and medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064975A (zh) * 2021-04-14 2021-07-02 深圳市诺金系统集成有限公司 Human resource data processing system and method based on AI deep learning
CN113515931B (zh) * 2021-07-27 2023-07-21 中国平安人寿保险股份有限公司 Text error correction method, apparatus, computer device and storage medium
CN114581926B (zh) * 2022-04-11 2024-06-21 深圳市星桐科技有限公司 Multi-line text recognition method, apparatus, device and medium
CN114462356B (zh) * 2022-04-11 2022-07-08 苏州浪潮智能科技有限公司 Text error correction method and apparatus, electronic device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120647A1 (en) * 2000-09-27 2002-08-29 Ibm Corporation Application data error correction support
CN110162767A (zh) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 Method and apparatus for text error correction
CN111191441A (zh) * 2020-01-06 2020-05-22 广东博智林机器人有限公司 Text error correction method, apparatus and storage medium
CN111950292A (zh) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Training method for text error correction model, and text error correction processing method and apparatus
CN112016310A (zh) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188327B (zh) * 2019-05-30 2021-05-14 北京百度网讯科技有限公司 Method and apparatus for text de-colloquialization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120647A1 (en) * 2000-09-27 2002-08-29 Ibm Corporation Application data error correction support
CN110162767A (zh) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 Method and apparatus for text error correction
CN111191441A (zh) * 2020-01-06 2020-05-22 广东博智林机器人有限公司 Text error correction method, apparatus and storage medium
CN111950292A (zh) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Training method for text error correction model, and text error correction processing method and apparatus
CN112016310A (zh) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115906815A (zh) * 2023-03-08 2023-04-04 北京语言大学 Error correction method and device for revising one or more types of erroneous sentences
CN115906815B (zh) * 2023-03-08 2023-06-27 北京语言大学 Error correction method and device for revising one or more types of erroneous sentences
CN116127953A (zh) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrastive learning
CN116822498A (zh) * 2023-08-30 2023-09-29 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, apparatus, device and medium
CN116822498B (zh) * 2023-08-30 2023-12-01 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, apparatus, device and medium

Also Published As

Publication number Publication date
CN112632912A (zh) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2022126897A1 (zh) Text error correction method, apparatus, device and storage medium
WO2021135910A1 (zh) Information extraction method based on machine reading comprehension, and related device
WO2021189851A1 (zh) Text error correction method, system and device, and readable storage medium
CN111402861B (zh) Speech recognition method, apparatus, device and storage medium
WO2020228175A1 (zh) Polyphone prediction method, apparatus and device, and computer-readable storage medium
CN111739514B (zh) Speech recognition method, apparatus, device and medium
KR101496885B1 (ko) System and method for spacing words in a sentence
CN113673228B (zh) Text error correction method and apparatus, computer storage medium and computer program product
WO2023201975A1 (zh) Difference description statement generation method, apparatus, device and medium
CN111209740A (zh) Text model training method, text error correction method, electronic device and storage medium
CN113051894B (zh) Text error correction method and apparatus
US20220013126A1 (en) Alphanumeric sequence biasing for automatic speech recognition
CN111554295B (zh) Text error correction method, related device and readable storage medium
CN112560452A (zh) Method and system for automatically generating error correction corpora
WO2022141844A1 (zh) Text error correction method, apparatus and device, and readable storage medium
JP2023002730A (ja) Text error correction and text error correction model generation method, apparatus, device and medium
CN116434752A (zh) Speech recognition error correction method and apparatus
CN113822044B (zh) Grammar error correction data generation method and apparatus, computer device and storage medium
KR20210125449A (ko) Method for incrementing industry text, related apparatus, and computer program stored on a medium
WO2023138361A1 (zh) Image processing method and apparatus, readable storage medium, and electronic device
CN110929514A (zh) Text proofreading method and apparatus, computer-readable storage medium and electronic device
WO2022242535A1 (zh) Translation method, translation apparatus, translation device and storage medium
CN113609157B (zh) Language conversion model training and language conversion method, apparatus, device and medium
CN115828937A (zh) Front-end multilingual translation method, apparatus, device and storage medium
CN111832288B (zh) Text correction method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21904837

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21904837

Country of ref document: EP

Kind code of ref document: A1