CN115169344A - Method, device, equipment and medium for correcting Chinese text errors


Info

Publication number
CN115169344A
Authority
CN
China
Prior art keywords
text
corrected
error correction
sequence
generate
Prior art date
Legal status
Pending
Application number
CN202210851133.4A
Other languages
Chinese (zh)
Inventor
宋红梅
徐洁馨
李金龙
Current Assignee
China Merchants Bank Co Ltd
Original Assignee
China Merchants Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by China Merchants Bank Co Ltd filed Critical China Merchants Bank Co Ltd
Priority to CN202210851133.4A priority Critical patent/CN115169344A/en
Publication of CN115169344A publication Critical patent/CN115169344A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing and discloses a Chinese text error correction method, apparatus, device, and medium. The method comprises: obtaining a text to be corrected, correcting it through a sequence tag model, and determining the suggested correction text corresponding to the text to be corrected; post-processing the suggested correction text according to the text to be corrected to generate the corrected text; and comparing the differences between the text to be corrected and the corrected text to generate a text error report prompt. The comprehensiveness and accuracy of Chinese text error correction are thereby improved, as is the efficiency of text error correction.

Description

Method, device, equipment and medium for correcting Chinese text errors
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a medium for correcting Chinese text errors.
Background
Text error correction is an important technology for intelligent checking and automatic correction of natural-language sentences; it improves language correctness while reducing the cost of manual proofreading. As a relatively fundamental module of natural language processing, a text error correction module has wide application scenarios, such as prompting corrections when a user enters a query into a search engine, intelligently checking the wording of articles on network platforms, and assisting the intelligent conversion of automatic speech recognition output into text.
At present, the existing Chinese text error correction methods are generally divided into two categories, one is text error correction based on rules, and the other is text error correction based on a deep learning model.
Rule-based text error correction is largely divided into two steps: the first is error detection and the second is error correction. The error detection part first segments the text into words, for example with the jieba Chinese word segmenter; however, the text may contain wrongly written characters, which can cause word segmentation errors.
The other approach is text error correction based on a deep learning model. Most such methods are end-to-end: a piece of text is input into the deep learning model, which directly computes and outputs the corrected text content. A deep learning model requires a large number of erroneous and correct sentence pairs for model training beforehand.
However, existing Chinese text error correction methods cannot effectively cover and identify the numerous and varied error types of Chinese text, such as disordered text, missing or redundant words, and other syntactic and semantic errors; the texts they can process are short; and their error correction prompts are ambiguously located, so a machine cannot quickly find the error position and modify the text according to the correction suggestion.
Disclosure of Invention
The invention mainly aims to provide a Chinese text error correction method, apparatus, device, and medium, aiming at improving the comprehensiveness and accuracy of error correction for chapter-level long Chinese text and improving the efficiency of text error correction.
In order to achieve the above object, the present invention provides a method for correcting Chinese text errors, comprising the following steps:
acquiring a text to be corrected, correcting it through a sequence tag model, and determining a suggested error correction text;
based on the text to be corrected, post-processing the suggested error correction text to generate a text after error correction;
and comparing the difference between the text to be corrected and the text after error correction to generate a text error reporting prompt.
Preferably, before the step of obtaining the text to be corrected, correcting it through the sequence tag model, and determining the suggested error correction text, the Chinese text error correction method further includes:
the method comprises the steps of obtaining sample texts and target texts corresponding to the sample texts, performing word segmentation operation on the sample texts and the target texts according to a preset word element table and a preset word segmentation algorithm, and generating sample word segmentation lists corresponding to the sample texts and target word segmentation lists corresponding to the target texts;
classifying and labeling the sample word segmentation list according to the target word segmentation list to generate sequence-labeled training samples, wherein the classification labels comprise a wrong-character label, an extra-character label, and a missing-character label;
acquiring the maximum sequence length in the sequence-labeled training samples, filling each sequence in the sequence-labeled training samples according to the maximum sequence length to generate an original-text word segmentation sequence, and inputting the original-text word segmentation sequence into an initial model; wherein the sequence length of the original-text word segmentation sequence is the same as the maximum sequence length, and the sequence-labeled training samples include the original-text word segmentation positions;
simultaneously encoding and embedding the original-text word segmentation sequence and the original-text word segmentation positions to generate the corresponding embedded features;
obtaining, based on the embedded features, the sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples;
and performing iterative training with the sample operation label tensor and the sample probability label tensor according to a preset cross-entropy loss function to obtain the sequence label model.
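The following is a minimal Python sketch (an illustration, not part of the patent; the alignment via difflib and all names are assumptions) of how such sequence labels can be derived by aligning a sample token list against its target token list:

    # Derive per-token labels ($KEEP / $DELETE / $APPEND_x / $REPLACE_x)
    # by aligning the sample tokens against the target tokens.
    from difflib import SequenceMatcher

    def label_tokens(sample, target):
        labels = [[] for _ in range(len(sample) + 1)]  # slot 0 is $START
        for op, i1, i2, j1, j2 in SequenceMatcher(a=sample, b=target).get_opcodes():
            if op == "equal":
                for i in range(i1, i2):
                    labels[i + 1].append("$KEEP")
            elif op == "delete":
                for i in range(i1, i2):
                    labels[i + 1].append("$DELETE")
            elif op == "insert":
                # new tokens attach after the preceding source token
                labels[i1].extend(f"$APPEND_{t}" for t in target[j1:j2])
            elif op == "replace":
                for k, i in enumerate(range(i1, i2)):
                    j = j1 + k
                    labels[i + 1].append(f"$REPLACE_{target[j]}" if j < j2 else "$DELETE")
                labels[i2].extend(f"$APPEND_{t}" for t in target[j1 + (i2 - i1):j2])
        return [lab or ["$KEEP"] for lab in labels]

    sample, target = list("abca"), list("axc")
    for tok, labs in zip(["$START"] + sample, label_tokens(sample, target)):
        print(tok, "SEPL|||SEPR", " SEPL_SEPR ".join(labs))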
Preferably, the step of obtaining the sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples based on the embedded features includes:
step F1, inputting the embedded features into a model layer of the initial model, and performing context semantic analysis on the embedded features through an attention mechanism group of the model layer to obtain fused embedded features;
step F2, carrying out linear full-connection weighting processing on the fusion embedded features through a linear full-connection group of the model layer to obtain intermediate features;
step F3, processing the intermediate features through the addition group and the standardization group of the model layer to obtain an initial operation label tensor and an initial probability label tensor;
and step F4, repeatedly executing steps F1 to F3 until the preset number of iterations is reached or the loss function value no longer decreases, and outputting the sample operation label tensor and the sample probability label tensor.
Preferably, the step of obtaining the text to be corrected and correcting the error through the sequence tag model to determine the suggested corrected text includes:
coarsely segmenting the text to be corrected according to a preset regular expression to generate a segmented text;
if the text length of the segmented text is smaller than the preset text length, extracting the text content of the segmented text to generate a short text list;
segmenting words in the short text list according to a preset word element list and a preset word segmentation algorithm to generate a preprocessed text;
correcting the error of the preprocessed text through the sequence label model, and outputting an operation label tensor and a probability label tensor corresponding to the preprocessed text;
generating a specific operation list according to the operation label tensor, the probability label tensor, and a preset error tolerance probability; wherein the specific operations include retention, deletion, addition, and replacement;
and performing error correction processing on the preprocessed text according to the specific operation list, and determining a corresponding suggested error correction text.
Preferably, after the step of coarsely segmenting the text to be corrected according to a preset regular expression to generate segmented texts, the Chinese text error correction method further includes:
if the text length of the segmented text is larger than the preset text length, the segmented text is subjected to hard segmentation to generate a sub-segmented text, and text content extraction and subsequent word segmentation steps are performed on the sub-segmented text.
Preferably, the step of segmenting the short text list according to a preset word element table and a preset word segmentation algorithm to generate a preprocessed text includes:
performing text cleaning on the short text list to generate a cleaned short text list;
traversing and character segmenting the cleaned short text list to generate a segmented list;
and performing word segmentation on the segmentation list according to a preset word element table and a preset word segmentation algorithm to generate a preprocessed text.
Preferably, the step of post-processing the suggested error correction text based on the text to be corrected to generate the corrected text includes:
comparing the differences between the text to be corrected and the suggested error correction text to determine the correct segmented texts;
and restoring the cutting positions and symbols of the correct segmented texts according to the text to be corrected to generate the corrected text.
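A minimal Python sketch of this post-processing (illustrative; the data structures and the example correction are assumptions): the corrected short segments are spliced back in original order, while the separators recorded during coarse segmentation pass through unchanged, preserving the cut positions and symbols of the text to be corrected.

    def restore(pieces, corrections):
        # pieces: [(kind, content, start)] from coarse segmentation;
        # corrections maps a segment's start offset to its corrected content
        out = []
        for kind, content, start in pieces:
            out.append(corrections.get(start, content) if kind == "text" else content)
        return "".join(out)

    pieces = [("text", "举头望名月", 0), ("sep", "，", 5),
              ("text", "低头思故乡", 6), ("sep", "。", 11)]
    print(restore(pieces, {0: "举头望明月"}))  # 举头望明月，低头思故乡。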
In addition, to achieve the above object, the present invention also provides a Chinese text error correction apparatus, including:
an acquisition module, used for acquiring a text to be corrected, correcting it through the sequence tag model, and determining the suggested error correction text corresponding to the text to be corrected;
the post-processing module is used for performing post-processing on the suggested error correction text based on the text to be corrected to generate a text after error correction;
and the comparison module is used for carrying out difference comparison on the text to be corrected and the corrected text to generate a text error report prompt.
In addition, to achieve the above object, the present invention also provides a device, which is a Chinese text error correction device, including: a memory, a processor, and a Chinese text error correction program stored on the memory and executable on the processor, the Chinese text error correction program, when executed by the processor, implementing the steps of the Chinese text error correction method described above.
In addition, to achieve the above object, the present invention also provides a medium, which is a computer-readable storage medium having a Chinese text error correction program stored thereon, the Chinese text error correction program, when executed by a processor, implementing the steps of the Chinese text error correction method described above.
The invention provides a Chinese text error correction method, apparatus, device, and medium. The Chinese text error correction method comprises the following steps: acquiring a text to be corrected, correcting it through a sequence tag model, and determining the suggested error correction text; post-processing the suggested error correction text based on the text to be corrected to generate the corrected text; and comparing the differences between the text to be corrected and the corrected text to generate a text error report prompt. By determining the suggested error correction text corresponding to the text to be corrected, post-processing it according to the text to be corrected, and comparing the differences between the text to be corrected and the corrected text to generate the text error report prompt, the comprehensiveness and accuracy of Chinese text error correction are improved, as is the efficiency of text error correction.
Drawings
Fig. 1 is a schematic structural diagram of the device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a first embodiment of the Chinese text error correction method of the present invention;
FIG. 3 is a first schematic diagram of word element labeling in the first embodiment of the Chinese text error correction method of the present invention;
FIG. 4 is a second schematic diagram of word element labeling in the first embodiment of the Chinese text error correction method of the present invention;
FIG. 5 is a schematic flowchart of a second embodiment of the Chinese text error correction method of the present invention;
FIG. 6 is a schematic diagram of the training flow of the sequence label model in the second embodiment of the Chinese text error correction method of the present invention;
FIG. 7 is a schematic diagram of the tag table in the second embodiment of the Chinese text error correction method of the present invention;
FIG. 8 is a schematic flowchart of a third embodiment of the Chinese text error correction method of the present invention;
FIG. 9 is a schematic flowchart of a fourth embodiment of the Chinese text error correction method of the present invention;
FIG. 10 is a schematic flowchart of a fifth embodiment of the Chinese text error correction method of the present invention;
FIG. 11 is a schematic flowchart of a sixth embodiment of the Chinese text error correction method of the present invention;
FIG. 12 is a functional block diagram of a first embodiment of the Chinese text error correction apparatus of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The device of the embodiment of the invention can be a mobile terminal or a server device.
As shown in fig. 1, the device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002, where the communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a Display and an input unit such as a Keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the structure of the device shown in fig. 1 does not limit the device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in fig. 1, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a Chinese text error correction program.
The operating system is a program that manages and controls the Chinese text error correction device and its software resources, and supports the operation of the network communication module, the user interface module, the Chinese text error correction program, and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the Chinese text error correction device shown in fig. 1, the device calls the Chinese text error correction program stored in the memory 1005 through the processor 1001 and performs the operations in the various embodiments of the Chinese text error correction method described below.
Based on the above hardware structure, the embodiments of the Chinese text error correction method of the present invention are provided.
Referring to fig. 2, fig. 2 is a schematic flowchart of the Chinese text error correction method according to the first embodiment of the present invention, where the Chinese text error correction method includes:
s10, acquiring a text to be corrected, correcting the text through a sequence tag model, and determining a suggested corrected text;
step S20, post-processing the suggested error correction text based on the text to be corrected to generate a text after error correction;
and S30, comparing the difference between the text to be corrected and the text after error correction to generate a text error reporting prompt.
In the embodiment, a text to be corrected is obtained, and the text to be corrected is corrected through a sequence tag model, so that a suggested correction text corresponding to the text to be corrected is determined; post-processing the suggested correction text according to the text to be corrected to generate the corrected text; comparing the difference between the text to be corrected and the text after error correction to generate a text error reporting prompt; therefore, the comprehensiveness and the accuracy of Chinese text error correction are improved, and the efficiency of text error correction is improved.
The respective steps will be described in detail below:
and S10, acquiring a text to be corrected, correcting the text through the sequence label model, and determining a suggested corrected text.
In this embodiment, the Chinese text error correction method is applied through the sequence tag model and mainly performs error correction on long, chapter-level Chinese text, providing character-level error correction prompts, that is, prompts for wrong characters, extra characters, missing characters, and the like. An error position in the long chapter-level text can thus be located quickly and accurately, and the text can be modified according to the relevant prompts, that is, the specific positions of erroneous characters, the erroneous characters themselves, and the corresponding suggested corrected characters are prompted, which solves the problems of long processing time and fuzzy localization and correction.
The text to be corrected may be acquired from different channels: from a database in the system, or from the clients of different users, where the different users include business personnel, customers, and third-party institution personnel. This embodiment does not limit the channel from which the text to be corrected is acquired.
The text to be corrected is input into the sequence tag model, which performs error correction operations such as single-character spelling check, erroneous-character tagging, and erroneous-character position locating, corrects the erroneous characters in the text to be corrected, and generates the corresponding suggested error correction text.
For example, the text to be corrected: the family becomes a family with love.
And the suggested error correction text, in which each original word element is connected to its edit tag by the separator SEPL|||SEPR (the Chinese word elements are rendered here by their English glosses): $START SEPL|||SEPR $KEEP good SEPL|||SEPR $KEEP in SEPL|||SEPR $APPEND_knot SEPL_SEPR $APPEND_office has SEPL|||SEPR $KEEP kernel SEPL|||SEPR $REPLACE_person final SEPL|||SEPR $KEEP becomes SEPL|||SEPR $KEEP home SEPL|||SEPR $DELETE love SEPL|||SEPR $APPEND_genus.
Wherein a word element of the text to be corrected may contain the "##" prefix or the "[UNK]" special mark. When processing the labeled sequence, "$REPLACE" denotes word element replacement, replacing the word or character; "$APPEND" denotes word element addition, adding a word or character after the current word or character; and "$DELETE" denotes word element deletion, deleting the word or character.
Referring to fig. 3, fig. 3 is a first explanatory diagram of word element labeling. In fig. 3, "SEPL_SEPR" connects multiple added word elements when more than one word element is appended after the same original word element, "SEPL|||SEPR" connects an original word element to its modification tag, "$START" indicates the start of the sequence, and "$KEEP" indicates that the word element, i.e., the word or character, remains unchanged.
Referring to fig. 4, fig. 4 is a second explanatory diagram of the word element labels, and fig. 4 shows that "kernel" is modified to "person", "home" is deleted, and "genus" is added after "love".
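What follows is a minimal Python sketch (an illustration, not part of the patent; tokens and tags are assumptions) of how edit tags of the kind shown in Figs. 3 and 4 rewrite a word element sequence:

    def apply_tags(tokens, tags):
        # tags[0] belongs to $START; tags[i + 1] belongs to tokens[i]
        out = []
        for tok, tok_tags in zip(["$START"] + tokens, tags):
            emit = [] if tok == "$START" else [tok]
            for tag in tok_tags:
                if tag == "$KEEP":
                    continue                         # keep the word element
                if tag == "$DELETE":
                    emit = []                        # drop the word element
                elif tag.startswith("$REPLACE_"):
                    emit = [tag[len("$REPLACE_"):]]  # substitute it
                elif tag.startswith("$APPEND_"):
                    emit.append(tag[len("$APPEND_"):])  # insert after it
            out.extend(emit)
        return "".join(out)

    # a kept, x appended after b, c deleted
    print(apply_tags(list("abc"), [["$KEEP"], ["$KEEP"], ["$APPEND_x"], ["$DELETE"]]))  # abx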
The sequence tag model performs the above tag labeling process on all erroneous-correct sentence pairs in the text to be corrected.
and step S20, post-processing the suggested error correction text based on the text to be corrected to generate the text after error correction.
In an embodiment, the suggested error correction text is compared with the text to be corrected to determine their differences, and the suggested error correction text is restored according to these differences to generate the corrected text.
For example, the text to be corrected is: the family becomes a family with love.
The suggested error correction text is: $START SEPL|||SEPR $KEEP good SEPL|||SEPR $KEEP in SEPL|||SEPR $APPEND_knot SEPL_SEPR $APPEND_office has SEPL|||SEPR $KEEP kernel SEPL|||SEPR $REPLACE_person final SEPL|||SEPR $KEEP becomes SEPL|||SEPR $KEEP home SEPL|||SEPR $DELETE love SEPL|||SEPR $APPEND_genus.
The text after error correction is: after ending, the lover becomes a new family.
And S30, comparing the difference between the text to be corrected and the text after error correction to generate a text error reporting prompt.
In this embodiment, positioning is performed according to the characters of the text to be corrected, and the corrected text is compared with the text to be corrected to obtain character-level text error report prompts, which include wrong-character prompts, extra-character prompts, and missing-character prompts.
The wrong-character prompt is: [wrong-character start position, wrong-character end position, wrong character or word, suggested corrected character or word];
the extra-character prompt is: [extra-character start position, extra-character end position, character or word to be removed, ''], wherein the start position is the same as the end position;
the missing-character prompt is: [missing-character start position, missing-character end position, '', character or word to be added], wherein the start position is the same as the end position.
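A minimal Python sketch of this character-level difference comparison (illustrative; difflib stands in for the patent's own comparison):

    from difflib import SequenceMatcher

    def error_report(src, dst):
        report = []
        for op, i1, i2, j1, j2 in SequenceMatcher(a=src, b=dst).get_opcodes():
            if op == "replace":   # wrong character(s)
                report.append([i1, i2, src[i1:i2], dst[j1:j2]])
            elif op == "delete":  # extra character(s): start == end position
                report.append([i1, i1, src[i1:i2], ""])
            elif op == "insert":  # missing character(s): start == end position
                report.append([i1, i1, "", dst[j1:j2]])
        return report

    print(error_report("abXde", "abcde"))  # [[2, 3, 'X', 'c']]
    print(error_report("abcd", "abd"))     # [[2, 2, 'c', '']]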
In this embodiment, the text to be corrected is obtained and corrected through the sequence tag model to determine the suggested error correction text corresponding to it; the suggested error correction text is post-processed according to the text to be corrected to generate the corrected text; and the differences between the text to be corrected and the corrected text are compared to generate the text error report prompt; therefore, the comprehensiveness and accuracy of Chinese text error correction are improved, as is the efficiency of text error correction.
Further, based on the first embodiment of the Chinese text error correction method, a second embodiment of the Chinese text error correction method is provided.
The second embodiment of the Chinese text error correction method differs from the first embodiment in that, in this embodiment, before step S10 of obtaining the text to be corrected, correcting it through the sequence tag model, and determining the suggested error correction text, and referring to fig. 5, the Chinese text error correction method further includes:
step A10, obtaining sample texts and target texts corresponding to the sample texts, and performing word segmentation operation on the sample texts and the target texts according to a preset word element table and a preset word segmentation algorithm to generate sample word segmentation lists corresponding to the sample texts and target word segmentation lists corresponding to the target texts;
step A20, classifying and labeling the sample word segmentation list according to the target word segmentation list to generate sequence-labeled training samples, wherein the classification labels comprise wrong-character labels, extra-character labels, and missing-character labels;
step A30, acquiring the maximum sequence length in the sequence-labeled training samples, filling each sequence in the sequence-labeled training samples according to the maximum sequence length to generate an original-text word segmentation sequence, and inputting the original-text word segmentation sequence into an initial model; wherein the sequence length of the original-text word segmentation sequence is the same as the maximum sequence length, and the sequence-labeled training samples include the original-text word segmentation positions;
step A40, simultaneously encoding and embedding the original-text word segmentation sequence and the original-text word segmentation positions to generate the corresponding embedded features;
step A50, obtaining, based on the embedded features, the sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples;
and step A60, performing iterative training with the sample operation label tensor and the sample probability label tensor according to the preset cross-entropy loss function to obtain the sequence label model.
In this embodiment, the sample texts and the target text corresponding to each sample text are obtained and segmented according to the preset word element table and the preset word segmentation algorithm to generate the corresponding sample word segmentation lists and target word segmentation lists; the sample word segmentation lists are labeled according to the target word segmentation lists to generate sequence-labeled training samples; the sequence-labeled training samples are padded to generate the corresponding original-text word segmentation sequences, which are input into the initial model, and iterative training is performed on the original-text word segmentation sequences and the original-text word segmentation positions through the initial model to obtain the sequence label model; this improves the error correction accuracy of the trained sequence label model.
The respective steps will be described in detail below:
step A10, obtaining sample texts and target texts corresponding to the sample texts, and performing word segmentation operation on the sample texts and the target texts according to a preset word element table and a preset word segmentation algorithm to generate sample word segmentation lists corresponding to the sample texts and target word segmentation lists corresponding to the target texts.
In this embodiment, the sample texts and the target texts corresponding to the sample texts are obtained from different channels, and the channels for obtaining the sample texts and the target texts corresponding to the sample texts are not limited in this embodiment. Wherein, the sample text and the target text are preferably short texts.
It should be noted that, in order to ensure the accuracy of the sequence label model, the training samples are sufficient, and the number of sample texts and target texts in the training set is not limited in this embodiment; in practical application, the more the number of sample texts and target texts in a training set is, the more accurate the error correction result output by the sequence label model is.
And segmenting the sample texts and the target texts according to a preset word element table and a preset word segmentation algorithm, and segmenting each sample text and each target text into a corresponding sample word segmentation list and a corresponding target word segmentation list.
Wherein the preset word element table is preferably a basic word element table of size 21128; each word element in the preset word element table has a corresponding ID; for example, the word element "good" corresponds to ID 1962.
The preset word segmentation algorithm is preferably a WordPiece algorithm.
For example, sample text 1 is: the wine with the question can answer the question again.
Target text 1 is: there will be an answer to a question.
Sample word segmentation list 1 is: there are | questions | wines | meetings | and | answers | cases.
Target word segmentation list 1 is: with | question | will | have | answer | case.
Step A20, classifying and labeling the sample word segmentation list according to the target word segmentation list to generate sequence-labeled training samples, wherein the classification labels comprise a wrong-character label, an extra-character label, and a missing-character label.
In this embodiment, the sample word segmentation list is classified and labeled according to the target word segmentation list, labeling the characters that differ between the sample word segmentation list and the target word segmentation list; the classification labels comprise a wrong-character label, an extra-character label, and a missing-character label.
Step A30, acquiring the maximum sequence length in the sequence-labeled training samples, filling each sequence in the sequence-labeled training samples according to the maximum sequence length to generate an original-text word segmentation sequence, and inputting the original-text word segmentation sequence into an initial model; wherein the sequence length of the original-text word segmentation sequence is the same as the maximum sequence length, and the sequence-labeled training samples include the original-text word segmentation positions.
In this embodiment, referring to fig. 6, which is a schematic diagram of the training flow of the sequence label model, the maximum sequence length among the sequence-labeled training samples is obtained, and each sequence in the sequence-labeled training samples is padded according to the maximum sequence length so that each padded sequence has the same length as the maximum sequence length; the padded sequences constitute the original-text word segmentation sequence, which is input into the initial model.
For example, the maximum sequence length is 10, and one sequence among the sequence-labeled training samples is: 中国版画博物馆 ("Chinese woodcut museum"), whose sequence length is 7, i.e., less than the maximum sequence length of 10.
The padded sequence is: 中 | 国 | 版 | 画 | 博 | 物 | 馆 | [PAD] | [PAD] | [PAD]. Wherein "[PAD]" has ID = 0 in the preset word element table, and a sequence whose length is less than the maximum sequence length is padded with "[PAD]".
In addition, three special characters indicating the start of the sequence, "[CLS]" (ID = 101), "$" (ID = 109), and "[UNK]" (ID = 100), are added at the beginning of each sequence. The expression of the original-text word segmentation sequence is [batch size, maximum sequence length in the batch + 3].
The sequence-labeled training samples include the original-text word segmentation positions, where an original-text word segmentation position is the position of each word element in the original text.
The expression of the original-text word segmentation positions is [batch size, maximum sequence length in the batch]; by default, "$START" is the original-text start marker and is also counted in the sequence length. The batch size is the number of sequences in the original-text word segmentation sequence.
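A minimal Python sketch of this padding step (illustrative; the IDs quoted in the comments are the ones the description gives for its preset word element table):

    SPECIALS = ["[CLS]", "$", "[UNK]"]  # IDs 101, 109, 100 per the description

    def pad_to(tokens, max_len):
        # pad with "[PAD]" (ID 0) up to the batch maximum, then prepend the
        # three sequence-start specials, giving length max_len + 3
        return SPECIALS + tokens + ["[PAD]"] * (max_len - len(tokens))

    print(" | ".join(pad_to(list("中国版画博物馆"), 10)))
    # [CLS] | $ | [UNK] | 中 | 国 | 版 | 画 | 博 | 物 | 馆 | [PAD] | [PAD] | [PAD]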
And step A40, simultaneously encoding and embedding the original-text word segmentation sequence and the original-text word segmentation positions to generate the corresponding embedded features.
In this embodiment, referring to fig. 6, which is a schematic diagram of the training flow of the sequence label model, the original-text word segmentation sequence and the original-text word segmentation positions are input into the embedding layer of the initial model and encoded simultaneously, yielding the word element tensor corresponding to the original-text word segmentation sequence and the position tensor corresponding to the original-text word segmentation positions. The encoding is preferably one-hot encoding. The expression of the word element tensor is [batch size, maximum sequence length in the batch, word element table size], and the expression of the position tensor is [batch size, maximum sequence length in the batch, preset text sequence length]. The batch size is the number of sequences in the original-text word segmentation sequence, the word element table size is the number of word elements in the preset word element table, and the preset text sequence length is the preset longest sequence length max_len of each segmented text, which can be set according to actual conditions; in this embodiment, the preset text sequence length is preferably 128 or 512.
The word element tensor and the position tensor are then embedded to obtain the corresponding embedded features; specifically, each word element is converted into a feature representation, and the expression of the embedded features is: [batch size, maximum sequence length in the batch, 768].
And step A50, obtaining, based on the embedded features, the sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples.
In this embodiment, referring to fig. 6, which is a schematic diagram of the training flow of the sequence label model, the embedded features are input into the model layer of the initial model, where the model layer is preferably a BERT-based model layer, and the embedded features sequentially pass through self-attention, linear full connection (Linear), addition (Add), and normalization (Normalize) operations to obtain the sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples.
The operation label tensor represents the true label of each character; its size is [batch size, maximum sequence length in the batch, 16502], where 16502 is the length of the tag table, in which the ID corresponding to each label operation is recorded.
Referring to FIG. 7, FIG. 7 is a schematic diagram of a sample tag table, where "$KEEP" has ID 0 in the tag table, "$DELETE" has ID 1, "$APPEND_!" has ID 2, "$REPLACE_!" has ID 3, and the next "$APPEND_" entry has ID 4.
The probability label tensor represents the true mark of whether each character is erroneous; its size is [batch size, maximum sequence length in the batch, 4], where 4 represents the classes {"CORRECT": 0, "INCORRECT": 1, "@@UNKNOWN@@": 2, "@@PADDING@@": 3}, i.e., correct, erroneous, unknown, and padding.
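A minimal PyTorch sketch of the two output heads these tensors imply (the framework is an assumption; the patent names none, and the head structure is illustrative):

    import torch
    import torch.nn as nn

    hidden, num_tags, num_marks = 768, 16502, 4
    features = torch.randn(2, 12, hidden)  # [batch, max seq len in batch, 768]

    tag_head = nn.Linear(hidden, num_tags)    # operation labels over the tag table
    mark_head = nn.Linear(hidden, num_marks)  # CORRECT / INCORRECT / UNKNOWN / PADDING

    tag_logits = tag_head(features)           # [2, 12, 16502]
    mark_logits = mark_head(features)         # [2, 12, 4]
    print(tag_logits.shape, mark_logits.shape)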
And A60, performing iterative training on the sample operation label tensor and the sample probability label tensor according to a preset cross entropy loss function to obtain a sequence label model.
In this embodiment, the cross-entropy loss of the initial model is calculated from the sample operation label tensor and the sample probability label tensor; the preset loss function is preferably the cross-entropy loss function. Iterative training is performed with the sample operation label tensor and the sample probability label tensor according to the cross-entropy loss function until the cross-entropy loss no longer decreases after iteration, finally obtaining the sequence label model.
A mask is used for auxiliary calculation, where the mask is a 0/1 marker for the presence of characters.
For example, in a batch of data, if the length of sequence 1 is 10 and the length of sequence 2 is 15, the mask length is 15; mask1 marks the presence of characters in sequence 1, with its first 10 entries all 1 and its last 5 entries all 0; mask2 marks the presence of characters in sequence 2, with all 15 entries being 1.
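A minimal Python sketch of this mask construction (illustrative; the patent prescribes no code):

    def build_masks(lengths):
        max_len = max(lengths)
        return [[1] * n + [0] * (max_len - n) for n in lengths]

    mask1, mask2 = build_masks([10, 15])
    print(mask1)  # first 10 entries are 1, last 5 entries are 0
    print(mask2)  # all 15 entries are 1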
In the embodiment, sample texts and target texts corresponding to the sample texts are obtained, and the sample texts and the target texts are segmented according to a preset word element table and a preset word segmentation algorithm to generate corresponding sample word segmentation lists and target word segmentation lists; labeling the sample word segmentation list according to the target word segmentation list to generate a training sample with sequence labeling; filling the training samples marked by the sequence to generate a corresponding original text word segmentation sequence, inputting the original text word segmentation sequence into an initial model, and performing iterative training on the original text word segmentation sequence and the original text word segmentation position through the initial model to obtain a sequence label model; therefore, the error correction accuracy of the trained sequence label model is improved.
Further, based on the first and second embodiments of the method for correcting the text in the Chinese language, a third embodiment of the method for correcting the text in the Chinese language is provided.
The third embodiment of the Chinese text error correction method differs from the first and second embodiments in that this embodiment is a refinement of step A50, obtaining the sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples based on the embedded features; referring to fig. 8, the step specifically includes:
step F1, inputting the embedded features into a model layer of the initial model, and performing context semantic analysis on the embedded features through an attention mechanism group of the model layer to obtain fused embedded features;
step F2, carrying out linear full-connection weighting processing on the fusion embedded features through a linear full-connection group of the model layer to obtain intermediate features;
step F3, processing the intermediate features through the addition group and the standardization group of the model layer to obtain an initial operation label tensor and an initial probability label tensor;
and step F4, repeatedly executing steps F1 to F3 until the preset number of iterations is reached or the loss function value no longer decreases, and outputting the sample operation label tensor and the sample probability label tensor.
In this embodiment, the embedded features are input into a model layer of an initial model, and the embedded features are subjected to repeated calculation processing of preset iteration times sequentially through a self-attention mechanism group, a linear full-connection group, an addition group and a standardization group of the model layer to obtain a sample operation label tensor and a sample probability label tensor corresponding to a training sample of sequence labeling; thereby improving the stability of error correction of the sequence label model.
The respective steps will be described in detail below:
step F1, inputting the embedded features into a model layer of the initial model, and performing context semantic analysis on the embedded features through an attention mechanism group of the model layer to obtain fusion embedded features;
step F2, carrying out linear full-connection weighting processing on the fusion embedded features through a linear full-connection group of the model layer to obtain intermediate features;
step F3, processing the intermediate features through the addition group and the standardization group of the model layer to obtain an initial operation label tensor and an initial probability label tensor;
and step F4, repeatedly executing steps F1 to F3 until the preset number of iterations is reached or the loss function value no longer decreases, and outputting the sample operation label tensor and the sample probability label tensor.
In this embodiment, referring to fig. 6, which is a schematic diagram of the training flow of the sequence label model, the embedded features are input into the model layer of the initial model, where the model layer is preferably a BERT-based model layer comprising a self-attention group, a linear full connection (Linear) group, an addition (Add) group, and a normalization (Normalize) group.
The embedded features are input into the model layer of the initial model and sequentially undergo the self-attention, linear full connection, addition, and normalization operations, with the calculation repeated until the preset number of iterations is reached or the loss function value no longer decreases, to obtain the output of the initial model, which comprises the operation label tensor and the probability label tensor; the preset number of iterations is preferably 12. The sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples are thereby obtained, improving the stability of error correction detection.
The training process of the sequence label model specifically includes: in the self-attention group of the model layer, context semantic analysis is performed on the embedded features to obtain the fusion embedded features; the self-attention mechanism learns the context semantic information of the sequence-labeled training samples, yielding fusion embedded features that fuse this context semantic information.
In the linear full connection group of the model layer, linear fully connected weighting is applied to the fusion embedded features to obtain the intermediate features.
In the addition group and the normalization group of the model layer, the intermediate features are processed to obtain the initial operation label tensor and the initial probability label tensor corresponding to the sequence-labeled training samples.
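A minimal PyTorch sketch of one such model-layer pass stacked 12 times (the framework, module shapes, and residual placement are assumptions; the patent only names the operation groups):

    import torch
    import torch.nn as nn

    class ModelBlock(nn.Module):
        def __init__(self, hidden=768, heads=12):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.linear = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                        nn.Linear(4 * hidden, hidden))
            self.norm1 = nn.LayerNorm(hidden)
            self.norm2 = nn.LayerNorm(hidden)

        def forward(self, x):
            fused, _ = self.attn(x, x, x)       # step F1: context semantic analysis
            x = self.norm1(x + fused)           # step F3: addition + normalization
            x = self.norm2(x + self.linear(x))  # step F2 weighting, then add + normalize
            return x

    blocks = nn.Sequential(*[ModelBlock() for _ in range(12)])  # preset iterations
    x = torch.randn(2, 12, 768)                 # embedded features
    print(blocks(x).shape)                      # torch.Size([2, 12, 768])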
In this embodiment, the embedded features are input into a model layer of an initial model, repeated calculation processing of preset iteration times is performed on the embedded features sequentially through a self-attention mechanism group, a linear full-connection group, an addition group and a standardization group of the model layer, and a sample operation label tensor and a sample probability label tensor corresponding to a training sample of sequence labeling are obtained; thereby improving the stability of error correction of the sequence label model.
Further, based on the first, second and third embodiments of the method for correcting the text in the Chinese language, a fourth embodiment of the method for correcting the text in the Chinese language is provided.
The fourth embodiment of the Chinese text error correction method differs from the first, second, and third embodiments in that this embodiment is a refinement of step S10, acquiring the text to be corrected, correcting it through the sequence tag model, and determining the suggested error correction text; referring to fig. 9, the step specifically includes:
step S11, coarsely segmenting the text to be corrected according to a preset regular expression to generate segmented texts;
step S12, if the text length of the segmented text is smaller than the preset text length, extracting the text content of the segmented text to generate a short text list;
step S13, segmenting the short text list according to a preset word element table and a preset word segmentation algorithm to generate a preprocessed text;
step S14, correcting the error of the preprocessed text through the sequence label model, and outputting an operation label tensor and a probability label tensor corresponding to the preprocessed text;
step S15, generating a specific operation list according to the operation label tensor, the probability label tensor and a preset error tolerance probability; wherein the specific operations comprise retention, deletion, addition and replacement;
and S16, performing error correction processing on the preprocessed text according to the specific operation list, and determining a corresponding suggested error correction text.
In this embodiment, a text to be corrected is preprocessed to generate a preprocessed text corresponding to the text to be corrected; carrying out error correction processing on the preprocessed text through a sequence tag model, and determining a suggested error correction text corresponding to the preprocessed text; therefore, the accuracy and the efficiency of error correction of the Chinese text are improved through the sequence label model.
The respective steps will be described in detail below:
and S11, carrying out rough segmentation on the text to be corrected according to a preset regular expression to generate a segmented text.
In this embodiment, the text to be corrected is coarsely segmented at specific punctuation marks (such as "，" and ",") and spaces; the preset regular expression is preferably composed of these specific punctuation marks and spaces.
Wherein the expression of the segmented texts is: [(segmented text, segmented text start position), (separation symbol, separation symbol start position)].
For example, the text to be corrected: 床前明月光，疑是地上霜。举头望明月，低头思故乡。
The segmented texts are: "床前明月光", "疑是地上霜", "举头望明月", "低头思故乡".
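A minimal Python sketch of this coarse segmentation (illustrative; the character class of the regular expression is an assumption beyond the punctuation named above):

    import re

    SEP = re.compile(r"[，。；！？,.;!?\s]")

    def coarse_split(text):
        pieces, last = [], 0
        for m in SEP.finditer(text):
            if m.start() > last:
                pieces.append(("text", text[last:m.start()], last))
            pieces.append(("sep", m.group(), m.start()))
            last = m.end()
        if last < len(text):
            pieces.append(("text", text[last:], last))
        return pieces

    for p in coarse_split("床前明月光，疑是地上霜。举头望明月，低头思故乡。"):
        print(p)  # (kind, content, start position) tuples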
Further, in an embodiment, after step S11, the method for correcting chinese text further includes:
and D10, if the text length of the segmented text is larger than the preset text length, performing hard segmentation on the segmented text to generate a sub-segmented text, and performing text content extraction and subsequent word segmentation on the sub-segmented text.
In this embodiment, each text among the segmented texts is compared with the preset text length; if the text length of a segmented text is greater than or equal to the preset text length, the segmented text is hard-segmented, that is, cut into sub-segmented texts each containing at most the preset text length of characters, and "|" is used as a separation symbol for place-holding marking, so as to facilitate subsequent splicing of the corrected long text and provide accurate character-level error-report positioning information.
The preset text length is max_len (e.g., 128 or 512 or another value) characters per preset segment of the segmented text.
Pure text content extraction, word segmentation, and subsequent processing are then performed on the sub-segmented texts.
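A minimal Python sketch of this hard segmentation (illustrative; the patent records "|" at each cut as the place-holding separator):

    def hard_split(segment, max_len=128):
        # cut into consecutive sub-segments of at most max_len characters
        return [segment[i:i + max_len] for i in range(0, len(segment), max_len)]

    print([len(s) for s in hard_split("长" * 300)])  # [128, 128, 44]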
And S12, if the text length of the segmented text is smaller than the preset text length, extracting the text content of the segmented text to generate a short text list.
In this embodiment, each text among the segmented texts is compared with the preset text length; when a segmented text is shorter than the preset text length, its plain text content is extracted, that is, only the plain segmented text is collected in this step, and the separation symbols and the related position information are temporarily ignored, so as to obtain a short text list such as [segmented text 1, segmented text 2, segmented text 3, ..., segmented text n]; the preset text length is the preset length max_len (e.g., 128 or 512) of each segmented text segment.
And S13, segmenting words of the short text list according to a preset word element list and a preset word segmentation algorithm to generate a preprocessed text.
In this embodiment, word segmentation is performed on the short text list according to the preset word element table and the preset word segmentation algorithm, that is, each input short text is segmented into sub-strings whose granularity can be words or characters, and each short text yields a preprocessed text such as [word element 1, word element 2, word element 3, ..., word element n].
Wherein the preset word element table is a basic word element table of size 21128; each word element in the preset word element table has a corresponding ID; for example, the word element "good" corresponds to ID 1962. The preset word segmentation algorithm is preferably the WordPiece algorithm.
For example, short text list 1 is: "你好吗" ("how are you").
In the word element table, the word element "你" ("you") has ID 280, the word element "好" ("good") has ID 1962, and the word element "吗" has ID 400.
The preprocessed text is: [280, 1962, 400].
And S14, correcting the error of the preprocessed text through the sequence label model, and outputting an operation label tensor and a probability label tensor corresponding to the preprocessed text.
In this embodiment, the error correction processing is performed on the preprocessed text according to the sequence label model, and an operation label tensor and a probability label tensor corresponding to the preprocessed text are generated.
The operation label tensor represents the true label of each character; its expression is [batch size, maximum sequence length in the batch, 16502], where 16502 is the length of the tag table, in which the ID corresponding to each label operation is recorded.
The probability label tensor represents the true mark of whether each character is erroneous; its expression is [batch size, maximum sequence length in the batch, 4], where 4 represents the classes {"CORRECT": 0, "INCORRECT": 1, "@@UNKNOWN@@": 2, "@@PADDING@@": 3}, i.e., correct, erroneous, unknown, and padding.
Step S15, generating a specific operation list according to the operation label tensor, the probability label tensor and a preset error tolerance probability; wherein the specific operations include retention, deletion, addition and replacement.
In this embodiment, the probability label tensor is compared with the preset error tolerance probability to generate the specific operation list. Specifically, if the probability label value is less than or equal to the preset error tolerance probability, the error of the corresponding text is tolerated: the text is considered error-free and is not corrected. If the probability label value is greater than the preset error tolerance probability, the corresponding text is judged to be erroneous and is corrected. This reduces the false-correction rate. The preset error tolerance probability is a preset probability threshold for tolerating errors.
For example, if the probability label value p1 of a word element in a text and the preset error tolerance probability p2 satisfy p1 < p2, the error of the text is tolerated, and the text is considered error-free and is not corrected.
The specific operations include retention, deletion, addition, and replacement, with the corresponding tags as follows: "$KEEP" indicates keeping the word element, "$REPLACE" indicates word element replacement, "$APPEND" indicates word element addition, and "$DELETE" indicates word element deletion.
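A minimal Python sketch of steps S15-S16's thresholding (illustrative; plain lists stand in for the tensors, and the tag table entries are assumptions):

    def build_operations(tag_ids, incorrect_prob, tag_table, tolerance=0.5):
        # keep a token's predicted operation only when its error probability
        # exceeds the preset error tolerance; otherwise tolerate it ($KEEP)
        ops = []
        for tag_id, p_err in zip(tag_ids, incorrect_prob):
            ops.append("$KEEP" if p_err <= tolerance else tag_table[tag_id])
        return ops

    tag_table = {0: "$KEEP", 1: "$DELETE", 3: "$REPLACE_!"}
    print(build_operations([0, 3, 1], [0.1, 0.9, 0.2], tag_table, tolerance=0.5))
    # ['$KEEP', '$REPLACE_!', '$KEEP']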
And S16, performing error correction processing on the preprocessed text according to the specific operation list, and determining a corresponding suggested error correction text.
In this embodiment, the preprocessed text is modified according to the specific operation list, so as to obtain a corresponding suggested error correction text, that is, obtain the final output of the sequence tag model.
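The exact encoding of the 16502-entry label table is not given, but a minimal decoder in the spirit of the $REPLACE/$APPEND/$DELETE tags above (with a hypothetical $KEEP tag for retention, and the replacement or appended lemma carried inside the tag) might look like:

```python
def apply_operations(lemmas, ops):
    """Apply per-lemma edit tags and return the corrected lemma sequence."""
    out = []
    for lemma, op in zip(lemmas, ops):
        if op == "$KEEP":                      # retention: copy unchanged
            out.append(lemma)
        elif op == "$DELETE":                  # deletion: drop the lemma
            continue
        elif op.startswith("$REPLACE_"):       # replacement: substitute lemma
            out.append(op[len("$REPLACE_"):])
        elif op.startswith("$APPEND_"):        # addition: keep lemma, append new one
            out.extend([lemma, op[len("$APPEND_"):]])
    return out

print(apply_operations(["你", "号", "吗"], ["$KEEP", "$REPLACE_好", "$KEEP"]))
# ['你', '好', '吗']
```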
In this embodiment, the text to be corrected is preprocessed to generate a corresponding preprocessed text; error correction is then performed on the preprocessed text through the sequence tag model to determine the corresponding suggested error correction text. The sequence tag model thereby improves both the accuracy and the efficiency of Chinese text error correction.
Further, based on the first through fourth embodiments of the Chinese text error correction method of the present invention, a fifth embodiment is proposed.
The fifth embodiment differs from the first through fourth embodiments in that it refines the step of, in step S13, segmenting the short text list according to a preset lemma table and a preset word segmentation algorithm to generate a preprocessed text. Referring to fig. 10, this step specifically includes:
Step B10, performing text cleaning on the short text list to generate a cleaned short text list;
Step B20, traversing and character-segmenting the cleaned short text list to generate a segmentation list;
Step B30, segmenting the segmentation list according to a preset lemma table and a preset word segmentation algorithm to generate a preprocessed text.
In this embodiment, the short text list is text-cleaned to generate a cleaned short text list; the cleaned short text list is traversed and character-segmented to generate a segmentation list; and the segmentation list is segmented according to the preset lemma table and the preset word segmentation algorithm to generate a preprocessed text, thereby improving the efficiency of Chinese text error correction.
The respective steps will be described in detail below:
Step B10, performing text cleaning on the short text list to generate a cleaned short text list.
In this embodiment, the format of the short text list is unified by converting it to Unicode, and text cleaning is performed on it (for example, removing spaces) to generate the cleaned short text list.
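A minimal sketch of this cleaning step (NFC normalization is our assumption; the text only says the list is unified to Unicode and cleaned of spaces):

```python
import unicodedata

def clean_text(text):
    """Normalize to Unicode NFC, then drop whitespace and control characters."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] != "C"
    )

print(clean_text(" 你 好\t吗 "))  # 你好吗
```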
Step B20, traversing and character-segmenting the cleaned short text list to generate a segmentation list.
In this embodiment, each character in the cleaned short text list is traversed and the Chinese characters are segmented, yielding a string in which spaces separate the Chinese characters; this string is then split character by character to obtain the segmentation list. If English characters need to be converted to lower case, the list is traversed and they are converted.
It should be noted that, when checking whether each token of the cleaned short text list is in the preset lemma table, if a token is not wholly in the table, a "##" prefix is added to each continuation piece. For example, "unaffable" is split into ["un", "##aff", "##able"]: "un", "aff" and "able" are all in the preset lemma table, but "unaffable" is not in the basic lemma table, so the "##" prefix marks each subsequent piece as a continuation rather than a standalone lemma in the original text.
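For illustration, a compact greedy longest-match WordPiece sketch that reproduces the "##" prefixing above (the tiny vocabulary is illustrative, not the preset lemma table):

```python
def wordpiece(token, vocab):
    """Greedy longest-match segmentation with '##' continuation prefixes."""
    pieces, start = [], 0
    while start < len(token):
        end, match = len(token), None
        while start < end:                      # try the longest substring first
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece            # continuation of a split token
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                    # no sub-piece matched at all
        pieces.append(match)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```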
Step B30, segmenting the segmentation list according to the preset lemma table and the preset word segmentation algorithm to generate a preprocessed text.
In this embodiment, the segmentation list is segmented according to the preset lemma table and the preset word segmentation algorithm; that is, it is segmented into sub-strings whose granularity can be words or characters, so that each segmentation list yields a preprocessed text of the form "[lemma 1, lemma 2, lemma 3, …, lemma n]".
The preset lemma table is the basic lemma table of size 21128, in which each lemma has a corresponding ID; the preset word segmentation algorithm is preferably the WordPiece algorithm.
It is noted that several "[unused]" placeholder rows of the table can be repurposed as symbols such as the curly quotation marks "“" and "”", the ellipsis "…", and the dash "—", since these symbols are absent from the basic lemma table but may appear in text.
Lemmas that do not appear in the table are denoted by "[UNK]".
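A small sketch of repurposing "[unused]" rows and of the "[UNK]" fallback (the slot indices and IDs below are assumptions):

```python
lemma_table = {"[unused1]": 1, "[unused2]": 2, "[unused3]": 3, "[UNK]": 100}

# Reassign placeholder rows to symbols that occur in text but not in the table.
for slot, symbol in [("[unused1]", "“"), ("[unused2]", "”"), ("[unused3]", "…")]:
    lemma_table[symbol] = lemma_table.pop(slot)  # the symbol inherits the slot's ID

print(lemma_table.get("…", lemma_table["[UNK]"]))  # 3 (now a known lemma)
print(lemma_table.get("☃", lemma_table["[UNK]"]))  # 100 ([UNK] fallback)
```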
In this embodiment, the short text list is text-cleaned to generate a cleaned short text list; the cleaned short text list is traversed and character-segmented to generate a segmentation list; and the segmentation list is segmented according to the preset lemma table and the preset word segmentation algorithm to generate a preprocessed text, thereby improving the efficiency of Chinese text error correction.
Further, based on the first through fifth embodiments of the Chinese text error correction method of the present invention, a sixth embodiment is proposed.
The sixth embodiment differs from the first through fifth embodiments in that it refines the step of, in step S20, post-processing the suggested error correction text based on the text to be corrected to generate the error-corrected text. Referring to fig. 11, this step specifically includes:
Step S21, comparing the differences between the text to be corrected and the suggested error correction text to determine a correct segmentation text;
Step S22, restoring the cutting positions and symbols of the correct segmentation text according to the text to be corrected, to generate the error-corrected text.
In this embodiment, the text to be corrected is compared with the suggested error correction text, and the suggested error correction text is restored according to the differences, thereby determining the corresponding correct segmentation text; the cutting positions and symbols are then restored from the text to be corrected, generating the error-corrected text. This improves the accuracy of splicing and restoring the Chinese text.
The respective steps will be described in detail below:
and S21, comparing the difference between the text to be corrected and the suggested error correction text, and determining a correct segmentation text.
In this embodiment, the text to be corrected is compared with the suggested error correction text to obtain the difference between the text to be corrected and the suggested error correction text; and correcting the suggested error correction text according to the difference to determine the corresponding correct segmentation text.
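One way to realize this character-level comparison is a standard diff; difflib is our choice here, as the patent does not name a diff algorithm:

```python
import difflib

to_correct = "你号吗"   # original text (with a homophone error)
suggested = "你好吗"    # suggested error correction text

matcher = difflib.SequenceMatcher(None, to_correct, suggested)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, to_correct[i1:i2], "->", suggested[j1:j2])
# equal 你 -> 你
# replace 号 -> 好
# equal 吗 -> 吗
```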
Step S22, restoring the cutting positions and symbols of the correct segmentation text according to the text to be corrected, to generate the error-corrected text.
In this embodiment, the correct segmentation text is restored according to the cutting-position information and character-replacement information from the text to be corrected, generating the error-corrected text corresponding to the correct segmentation text.
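A hedged sketch of this restoration, assuming the cutting step recorded each segment's trailing separator (this bookkeeping format is our assumption, not the patent's):

```python
def restore(corrected_pieces, separators):
    """Interleave corrected segments with the separators cut from the original."""
    return "".join(piece + sep for piece, sep in zip(corrected_pieces, separators))

pieces = ["今天天气很好", "我们去公园吧"]  # corrected segments
separators = ["，", "。"]                  # symbols recorded at the cut positions
print(restore(pieces, separators))        # 今天天气很好，我们去公园吧。
```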
In this embodiment, the differences between the text to be corrected and the suggested error correction text are compared, and the suggested error correction text is restored according to those differences to determine the corresponding correct segmentation text; the cutting positions and symbols are then restored to generate the error-corrected text, improving the accuracy of splicing and restoring the Chinese text.
The invention also provides a Chinese text error correction apparatus. Referring to fig. 12, the Chinese text error correction apparatus of the present invention includes:
the acquiring module 10, configured to acquire a text to be corrected, perform error correction through the sequence tag model, and determine a suggested error correction text;
a post-processing module 20, configured to perform post-processing on the suggested error correction text based on the text to be error corrected, and generate an error-corrected text;
a comparison module 30, configured to compare the difference between the text to be error-corrected and the text after error correction, and generate a text error-reporting prompt.
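For orientation, a structural sketch of these three modules (interfaces only; the apparatus fixes the responsibilities, not the implementations):

```python
class ChineseTextCorrector:
    """Skeleton mirroring the acquiring, post-processing and comparison modules."""

    def acquire(self, text):
        """Module 10: run the sequence tag model, return the suggested error correction text."""
        raise NotImplementedError

    def post_process(self, text, suggestion):
        """Module 20: restore cutting positions and symbols, return the error-corrected text."""
        raise NotImplementedError

    def compare(self, text, corrected):
        """Module 30: diff original and corrected text, return the error-reporting prompt."""
        raise NotImplementedError
```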
Furthermore, the present invention also provides a medium, namely a computer-readable storage medium on which a Chinese text error correction program is stored; when executed by a processor, the Chinese text error correction program implements the steps of the Chinese text error correction method described above.
For the method implemented when the Chinese text error correction program running on the processor is executed, reference may be made to the embodiments of the Chinese text error correction method of the present invention, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for causing a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A Chinese text error correction method is characterized by comprising the following steps:
acquiring a text to be corrected, correcting the text through a sequence tag model, and determining a suggested corrected text;
based on the text to be corrected, post-processing the suggested error correction text to generate an error-corrected text;
and comparing the difference between the text to be corrected and the text after error correction to generate a text error reporting prompt.
2. The Chinese text error correction method according to claim 1, wherein, prior to the steps of acquiring the text to be corrected, performing error correction through the sequence tag model, and determining the suggested error correction text, the Chinese text error correction method further comprises:
obtaining sample texts and target texts corresponding to the sample texts, and performing word segmentation on the sample texts and the target texts according to a preset lemma table and a preset word segmentation algorithm to generate sample word segmentation lists corresponding to the sample texts and target word segmentation lists corresponding to the target texts;
classifying and labeling the sample word segmentation lists according to the target word segmentation lists to generate sequence-labeled training samples, wherein the classification labels comprise a wrong-character label, an extra-character label and a missing-character label;
acquiring the maximum sequence length in the sequence-labeled training samples, filling each sequence in the sequence-labeled training samples according to the maximum sequence length to generate an original text word segmentation sequence, and inputting the original text word segmentation sequence into an initial model; wherein the length of the original text word segmentation sequence is the same as the maximum sequence length, and the sequence-labeled training samples include the original text word segmentation positions;
encoding and embedding the original text word segmentation sequence and the original text word segmentation positions simultaneously to generate embedded features corresponding to the original text word segmentation sequence and the original text word segmentation positions;
acquiring, based on the embedded features, a sample operation label tensor and a sample probability label tensor corresponding to the sequence-labeled training samples;
and performing iterative training on the sample operation label tensor and the sample probability label tensor according to a preset cross-entropy loss function to obtain the sequence tag model.
3. The Chinese text error correction method according to claim 2, wherein the step of acquiring, based on the embedded features, the sample operation label tensor and the sample probability label tensor corresponding to the sequence-labeled training samples comprises:
step F1, inputting the embedded features into a model layer of the initial model, and performing context semantic analysis on the embedded features through an attention mechanism group of the model layer to obtain fused embedded features;
step F2, performing linear fully connected weighting on the fused embedded features through a linear fully connected group of the model layer to obtain intermediate features;
step F3, processing the intermediate features through an addition group and a normalization group of the model layer to obtain an initial operation label tensor and an initial probability label tensor;
and step F4, repeating steps F1 to F3 until a preset number of iterations is reached or the loss function value no longer decreases, and outputting the sample operation label tensor and the sample probability label tensor.
4. The Chinese text error correction method according to claim 1, wherein the step of acquiring the text to be corrected, performing error correction through the sequence tag model, and determining the suggested error correction text comprises:
coarsely segmenting the text to be corrected according to a preset regular expression to generate a segmented text;
if the text length of the segmented text is smaller than a preset text length, extracting the text content of the segmented text to generate a short text list;
segmenting the short text list according to a preset lemma table and a preset word segmentation algorithm to generate a preprocessed text;
performing error correction on the preprocessed text through the sequence tag model, and outputting an operation label tensor and a probability label tensor corresponding to the preprocessed text;
generating a specific operation list according to the operation label tensor, the probability label tensor and a preset error tolerance probability; wherein the specific operations include retention, deletion, addition and replacement;
and performing error correction processing on the preprocessed text according to the specific operation list, and determining the corresponding suggested error correction text.
5. The Chinese text error correction method according to claim 4, wherein after the step of coarsely segmenting the text to be corrected according to the preset regular expression to generate the segmented text, the Chinese text error correction method further comprises:
if the text length of the segmented text is larger than the preset text length, the segmented text is subjected to hard segmentation to generate a sub-segmented text, and text content extraction and subsequent word segmentation steps are performed on the sub-segmented text.
6. The Chinese text error correction method according to claim 4, wherein the step of segmenting the short text list according to the preset lemma table and the preset word segmentation algorithm to generate the preprocessed text comprises:
performing text cleaning on the short text list to generate a cleaned short text list;
traversing and character segmenting the cleaned short text list to generate a segmented list;
and performing word segmentation on the segmentation list according to the preset lemma table and the preset word segmentation algorithm to generate the preprocessed text.
7. The Chinese text error correction method according to claim 1, wherein the step of post-processing the suggested error correction text based on the text to be corrected to generate the error-corrected text comprises:
comparing the difference between the text to be corrected and the suggested error correction text to determine a correct segmentation text;
and restoring the cutting position and the symbol of the correct segmentation text according to the text to be corrected to generate the text after error correction.
8. A Chinese text error correction apparatus, comprising:
the acquisition module is used for acquiring a text to be corrected, correcting the error through the sequence tag model and determining a suggested error correction text;
the post-processing module is used for performing post-processing on the suggested error correction text based on the text to be corrected to generate an error-corrected text;
and the comparison module is used for comparing the difference between the text to be corrected and the text after error correction to generate a text error reporting prompt.
9. An apparatus, the apparatus being a Chinese text error correction device, the Chinese text error correction device comprising: a memory, a processor, and a Chinese text error correction program stored on the memory and executable on the processor, wherein the Chinese text error correction program, when executed by the processor, implements the steps of the Chinese text error correction method according to any one of claims 1 to 7.
10. A medium, the medium being a computer-readable storage medium, wherein a Chinese text error correction program is stored on the computer-readable storage medium, and the Chinese text error correction program, when executed by a processor, implements the steps of the Chinese text error correction method according to any one of claims 1 to 7.
CN202210851133.4A 2022-07-20 2022-07-20 Method, device, equipment and medium for correcting Chinese text errors Pending CN115169344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210851133.4A CN115169344A (en) 2022-07-20 2022-07-20 Method, device, equipment and medium for correcting Chinese text errors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210851133.4A CN115169344A (en) 2022-07-20 2022-07-20 Method, device, equipment and medium for correcting Chinese text errors

Publications (1)

Publication Number Publication Date
CN115169344A true CN115169344A (en) 2022-10-11

Family

ID=83494925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210851133.4A Pending CN115169344A (en) 2022-07-20 2022-07-20 Method, device, equipment and medium for correcting Chinese text errors

Country Status (1)

Country Link
CN (1) CN115169344A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination