CN113901797A - Text error correction method, device, equipment and storage medium - Google Patents
- Publication number: CN113901797A (application CN202111210864.2A)
- Authority: CN (China)
- Prior art keywords: text, error correction, data set, target, word
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—Physics; G06—Computing, calculating or counting; G06F—Electric digital data processing
- G06F40/242—Dictionaries (G06F40/20 Natural language analysis; G06F40/237 Lexical tools)
- G06F18/214—Generating training patterns; bootstrap methods, e.g. bagging or boosting (G06F18/21 Design or setup of recognition systems or techniques)
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate (G06F18/24 Classification techniques)
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/44—Statistical methods, e.g. probability models (G06F40/40 Processing or translation of natural language; G06F40/42 Data-driven translation)
Abstract
The invention relates to the technical field of artificial intelligence and provides a text error correction method, device, equipment and storage medium for improving the accuracy and efficiency of text error correction. The text error correction method comprises the following steps: acquiring a target confusion dictionary of the original text data set, and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set; acquiring a retraining text data set, and training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model; and acquiring a text to be processed, and sequentially performing wrongly written character probability calculation for each position and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain the error-corrected text.
Description
Technical Field
The invention relates to the field of intelligent decision making in artificial intelligence, and in particular to a text error correction method, device, equipment and storage medium.
Background
With the development of natural language processing technology, many application scenarios, such as speech recognition, retrieval, intention recognition and dialog systems, involve error correction processing (i.e., the process of correcting errors in a text) to help downstream processes accurately perform parsing, semantic extraction, entity recognition and the like on the text. Text error correction is therefore a crucial branch of natural language processing technology.
At present, text error correction generally depends on a manually constructed dictionary of wrongly written characters for error matching and correction. Such text error correction has low intelligence, and its operation speed is low when consistent precision must be achieved. Moreover, owing to the limitations of such dictionaries, some rare proper nouns or emerging nouns are not recorded in them, so the accuracy and efficiency of text error correction are not high.
Disclosure of Invention
The invention provides a text error correction method, a text error correction device, text error correction equipment and a storage medium, which are used for improving the accuracy and the efficiency of text error correction.
The invention provides a text error correction method in a first aspect, which comprises the following steps:
acquiring a target confusion dictionary of an original text data set, and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set;
acquiring a retraining text data set, and training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model;
and acquiring a text to be processed, and sequentially carrying out position wrongly written word probability calculation and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text.
Optionally, in a first implementation manner of the first aspect of the present invention, the target text error correction model includes a correction network based on a bert model and a detection network based on the bert model, and the acquiring a text to be processed and sequentially performing wrongly written character probability calculation for each position and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text includes:
acquiring a text to be processed, and performing embedded vector conversion on the text to be processed through the target text error correction model to obtain a text vector sequence;
calculating the probability of wrongly written characters at each position in the text vector sequence through the detection network to obtain the probability of wrongly written characters;
performing mask-embedded vector conversion on the wrongly written character probabilities to obtain an error probability vector;
and performing word probability calculation and probability classification of each position on the text to be processed through the correction network based on the error probability vector and the target confusion dictionary to obtain the corrected text.
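The mask-embedded vector conversion in the steps above can be sketched as a soft-mask interpolation: each position's embedding is blended toward a [MASK] embedding in proportion to its wrongly-written-character probability. The patent gives no code, so the sketch below uses illustrative toy dimensions and values, not the model's actual embeddings.

```python
# Soft-mask interpolation: e'_i = p_i * e_mask + (1 - p_i) * e_i,
# where p_i is the wrongly-written-character probability at position i.
# Dimensions and values are illustrative toy numbers.

def soft_mask(embeddings, probs, mask_embedding):
    """Blend each position's embedding with the [MASK] embedding."""
    masked = []
    for emb, p in zip(embeddings, probs):
        masked.append([p * m + (1.0 - p) * e for e, m in zip(emb, mask_embedding)])
    return masked

embeddings = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-d embeddings for 2 positions
probs = [0.0, 1.0]                       # position 2 is almost surely wrong
mask_embedding = [0.5, 0.5]

out = soft_mask(embeddings, probs, mask_embedding)
# position 1 keeps its embedding; position 2 becomes the mask embedding
```

A position the detection network is certain about is passed through unchanged, while a position flagged as erroneous is effectively masked before the correction network sees it.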
Optionally, in a second implementation manner of the first aspect of the present invention, the performing, by the correction network, word probability calculation and probability classification of each position on the to-be-processed text based on the error probability vector and the target confusion dictionary to obtain an error-corrected text includes:
calculating the word probability of each position of the text to be processed based on the error probability vector and the target confusion dictionary through a bert model of the correction network to obtain a word probability set of each position, wherein the correction network comprises the bert model and a normalized exponential function;
and acquiring, through the normalized exponential function and based on the word probability set of each position, the word corresponding to each position in the text to be processed from the target confusion dictionary to obtain an error-corrected text, wherein the correction result is a two-dimensional array of dimensions sentence length by dictionary length.
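The normalized exponential function (softmax) step above can be sketched as follows; the candidate characters and logit scores are hypothetical examples, not values from the patent.

```python
import math

def softmax(scores):
    """Normalized exponential function over a list of scores."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def correct_position(scores, candidates):
    """Pick the confusion-dictionary candidate with the highest probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return candidates[best], probs

candidates = ["的", "地", "得"]   # toy confusion-dictionary entries for one position
scores = [0.2, 2.5, -1.0]         # toy logits from the correction network
word, probs = correct_position(scores, candidates)
```

Applying this per position yields a sentence-length by dictionary-length probability array, from which the highest-probability word at each position forms the corrected text.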
Optionally, in a third implementation manner of the first aspect of the present invention, the acquiring a retraining text data set and training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, wherein the target text error correction model includes a correction network based on a bert model, includes:
constructing a detection network and a correction network to obtain an initial text error correction model, wherein the detection network comprises a bert model and a full connection layer with a first preset dimension, and the correction network comprises the bert model and a classification layer with a second preset dimension;
training the initial text error correction model through the wrongly written text data set and a preset loss function to obtain a candidate text error correction model;
acquiring a retraining text data set, wherein the retraining text data set comprises rare proper nouns or emerging nouns and wrongly written characters corresponding to the rare proper nouns or the emerging nouns;
and retraining the candidate text error correction model through the retraining text data set to obtain a target text error correction model.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the training the initial text error correction model through the wrongly written text data set and a preset loss function to obtain a candidate text error correction model includes:
performing text error correction processing on the wrongly written text data set through the initial text error correction model to obtain an error correction result;
respectively calculating loss values of the detection network and the correction network based on the error correction result through a preset loss function to obtain a detection loss value and a correction loss value;
calculating a linear weighted sum of the detection loss value and the correction loss value to obtain a target loss value;
and adjusting the initial text error correction model according to the target loss value to obtain a candidate text error correction model.
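The linear weighted sum of the two loss values can be sketched directly; the weight `lam` is illustrative, as the patent does not state a specific coefficient.

```python
def target_loss(detection_loss, correction_loss, lam=0.8):
    """Linear weighted sum of the detection and correction loss values.
    lam is an illustrative weight, not a value given in the patent."""
    return lam * detection_loss + (1.0 - lam) * correction_loss

loss = target_loss(0.5, 1.0, lam=0.8)   # 0.8 * 0.5 + 0.2 * 1.0
```

The single scalar target loss lets one backward pass adjust both networks jointly instead of training them separately.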
Optionally, in a fifth implementation manner of the first aspect of the present invention, the obtaining a target confusion dictionary of an original text data set, and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set includes:
acquiring an original text data set, and performing word segmentation processing on each sentence in the original text data set to obtain a text word segmentation set;
carrying out word frequency statistics on each token in the text word segmentation set to obtain word frequencies;
performing dictionary index storage on the text word segmentation set after word frequency statistics to obtain a target confusion dictionary;
and performing word replacement on the text segmented word set through the target confusion dictionary and/or the word frequency to obtain a wrongly written text data set.
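The dictionary-construction steps above (word frequency statistics followed by pinyin-indexed storage) can be sketched as follows. The `PINYIN` lookup table is a hypothetical stand-in for a real grapheme-to-pinyin component, and the corpus is a toy example.

```python
from collections import Counter, defaultdict

# Hypothetical grapheme-to-pinyin lookup; a real system would use a
# full pinyin dictionary or conversion library.
PINYIN = {"他": "ta", "她": "ta", "它": "ta", "在": "zai", "再": "zai", "好": "hao"}

def build_confusion_dictionary(segmented_sentences):
    """Count token frequencies, then index each token under its pinyin key
    so that same-pinyin (easily confused) tokens share a dictionary index."""
    freq = Counter(tok for sent in segmented_sentences for tok in sent)
    dictionary = defaultdict(list)
    for token, count in freq.items():
        key = PINYIN.get(token, token)   # fall back to the token itself
        dictionary[key].append((token, count))
    return dict(dictionary), freq

sentences = [["他", "在"], ["她", "再"], ["他", "好"]]
confusion, freq = build_confusion_dictionary(sentences)
```

Each dictionary index thus groups the confusable characters together with their occurrence frequencies, which the replacement step then draws on.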
Optionally, in a sixth implementation manner of the first aspect of the present invention, the performing, by using the target confusion dictionary and/or the word frequency, word replacement on the text segmented word set to obtain a wrongly written text data set includes:
selecting words under the same dictionary index or words under similar pinyin indexes in the target confusion dictionary according to the word frequency to obtain a target word set;
replacing words at corresponding positions in the text word segmentation sets through the target word sets to obtain wrongly written text data sets;
or, randomly selecting words in the target confusion dictionary to obtain random words;
and replacing the words at the corresponding positions in the text word segmentation set through the random words to obtain a wrongly written text data set.
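Both replacement strategies above (same-index replacement and random replacement) can be sketched as follows; the confusion dictionary and tokens are toy examples, and the deterministic candidate pick is a simplification of a frequency-weighted choice.

```python
import random

def replace_with_confusable(tokens, position, confusion, pinyin):
    """Same-index strategy: replace the token at `position` with another
    token under the same pinyin index, if one exists."""
    target = tokens[position]
    key = pinyin.get(target, target)
    candidates = [t for t, _ in confusion.get(key, []) if t != target]
    if candidates:
        tokens = list(tokens)
        tokens[position] = candidates[0]   # deterministic pick for this sketch
    return tokens

def replace_with_random(tokens, position, all_tokens, rng):
    """Random strategy: swap in any token from the confusion dictionary."""
    tokens = list(tokens)
    tokens[position] = rng.choice(all_tokens)
    return tokens

confusion = {"ta": [("他", 2), ("她", 1)]}
pinyin = {"他": "ta", "她": "ta"}
noisy = replace_with_confusable(["他", "好"], 0, confusion, pinyin)
```

Running either function over the segmented corpus yields the wrongly written text data set paired with the clean originals for training.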
A second aspect of the present invention provides a text correction apparatus, including:
the replacing module is used for acquiring a target confusion dictionary of the original text data set and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set;
the training module is used for acquiring a retraining text data set, training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set, and obtaining a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model;
and the calculation correction module is used for acquiring a text to be processed, and sequentially performing position wrongly written character probability calculation and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text.
Optionally, in a first implementation manner of the second aspect of the present invention, the target text error correction model includes a correction network based on a bert model and a detection network based on the bert model, and the calculation correction module includes:
the first conversion unit is used for acquiring a text to be processed, and performing embedded vector conversion on the text to be processed through the target text error correction model to obtain a text vector sequence;
the calculation unit is used for calculating the probability of wrongly written characters at each position in the text vector sequence through the detection network to obtain the probability of wrongly written characters;
the second conversion unit is used for performing mask-embedded vector conversion on the wrongly written character probabilities to obtain an error probability vector;
and the classification unit is used for performing word probability calculation and probability classification of each position on the text to be processed based on the error probability vector and the target confusion dictionary through the correction network to obtain the corrected text.
Optionally, in a second implementation manner of the second aspect of the present invention, the classification unit is specifically configured to:
calculating the word probability of each position of the text to be processed based on the error probability vector and the target confusion dictionary through a bert model of the correction network to obtain a word probability set of each position, wherein the correction network comprises the bert model and a normalized exponential function;
and acquiring, through the normalized exponential function and based on the word probability set of each position, the word corresponding to each position in the text to be processed from the target confusion dictionary to obtain an error-corrected text, wherein the correction result is a two-dimensional array of dimensions sentence length by dictionary length.
Optionally, in a third implementation manner of the second aspect of the present invention, the training module includes:
the system comprises a construction unit, a correction unit and a processing unit, wherein the construction unit is used for constructing a detection network and a correction network to obtain an initial text error correction model, the detection network comprises a bert model and a full connection layer with a first preset dimension, and the correction network comprises a bert model and a classification layer with a second preset dimension;
the training unit is used for training the initial text error correction model through the wrongly written text data set and a preset loss function to obtain a candidate text error correction model;
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a retraining text data set, and the retraining text data set comprises rare proper nouns or emerging nouns and wrongly written characters corresponding to the rare proper nouns or the emerging nouns;
and the retraining unit is used for retraining the candidate text error correction model through the retraining text data set to obtain a target text error correction model.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the training unit is specifically configured to:
performing text error correction processing on the wrongly written text data set through the initial text error correction model to obtain an error correction result;
respectively calculating loss values of the detection network and the correction network based on the error correction result through a preset loss function to obtain a detection loss value and a correction loss value;
calculating a linear weighted sum of the detection loss value and the correction loss value to obtain a target loss value;
and adjusting the initial text error correction model according to the target loss value to obtain a candidate text error correction model.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the replacing module includes:
the word segmentation unit is used for acquiring an original text data set and performing word segmentation processing on each sentence in the original text data set to obtain a text word segmentation set;
the statistical unit is used for carrying out word frequency statistics on each token in the text word segmentation set to obtain word frequencies;
the storage unit is used for performing dictionary index storage on the text word segmentation set after word frequency statistics to obtain a target confusion dictionary;
and the replacing unit is used for performing word replacement on the text word segmentation set through the target confusion dictionary and/or the word frequency to obtain a wrongly written text data set.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the replacing unit is specifically configured to:
selecting words under the same dictionary index or words under similar pinyin indexes in the target confusion dictionary according to the word frequency to obtain a target word set;
replacing words at corresponding positions in the text word segmentation sets through the target word sets to obtain wrongly written text data sets;
or, randomly selecting words in the target confusion dictionary to obtain random words;
and replacing the words at the corresponding positions in the text word segmentation set through the random words to obtain a wrongly written text data set.
A third aspect of the present invention provides a text correction apparatus comprising: a memory and at least one processor, the memory having stored therein a computer program; the at least one processor invokes the computer program in the memory to cause the text correction apparatus to perform the text correction method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute the above-described text error correction method.
In the technical scheme provided by the invention, a target confusion dictionary of an original text data set is acquired, and word replacement is performed on the original text data set through the target confusion dictionary to obtain a wrongly written text data set. A retraining text data set is acquired, and a preset initial text error correction model is trained through the wrongly written text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model. A text to be processed is acquired, and wrongly written character probability calculation for each position and dictionary word correction are sequentially performed on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text. In the embodiment of the invention, the initial text error correction model is trained on a larger-scale, higher-quality data set (the wrongly written text data set), which improves the accuracy and generalization capability of the target text error correction model. Through the retraining text data set, the problem that some rare proper nouns or emerging nouns cannot be recorded in a manually constructed dictionary, which lowers the accuracy and efficiency of text error correction, can be solved, improving the intelligence and generalization capability of the target text error correction model. Because the correction network requires no correction dictionary and is trained on a large-scale data set (the wrongly written text data set), the method achieves a higher operation speed, higher efficiency and stronger intelligence at consistent precision. Furthermore, performing text error correction on the text to be processed through the target text error correction model improves the accuracy and efficiency of text error correction.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a text error correction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a text error correction method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a text correction device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a text correction device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a text correction device in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a text error correction method, a text error correction device, text error correction equipment and a storage medium, and improves the accuracy and efficiency of text error correction.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a text error correction method in an embodiment of the present invention includes:
101. and acquiring a target confusion dictionary of the original text data set, and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set.
It is to be understood that the execution subject of the present invention may be a text error correction apparatus, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
After obtaining authorization from each platform, website and/or application, the server crawls or downloads a large amount of text data from their databases, and performs security detection, data cleaning, data conversion, abnormal value detection and null value filling on the crawled or downloaded text data to obtain an original text data set. A preset Chinese word segmentation algorithm is called to perform Chinese word segmentation on each original text in the original text data set to obtain an initial word segmentation set. Context-based semantic analysis and part-of-speech detection are then performed on each initial word in the initial word segmentation set: initial words that conform to the semantics of the original text and show no part-of-speech errors are determined to be target segmented words, yielding a target word segmentation set, while initial words that do not conform to the semantics of the original text or show part-of-speech errors are re-segmented, thereby improving the accuracy of the target word segmentation set.
The occurrence frequency of each character or word in the target word segmentation set is counted, and characters or words with the same pinyin or that are easily confused are stored under the same dictionary index, the dictionary index being the Chinese pinyin. A target confusion dictionary containing each character or word and its occurrence frequency is thereby constructed. Based on the occurrence frequency of each character or word in the target confusion dictionary, the character or word at the corresponding position in each original text in the original text data set is replaced with the corresponding character or word from the target confusion dictionary, thereby obtaining the wrongly written text data set.
The language of the original text data set in this embodiment is Chinese, but the language of the original text data set in the technical scheme provided by the present invention is not limited to Chinese; it may also be text in a language other than Chinese, or a mixture of Chinese and text in another language.
The characters or words in the original text data set are replaced by the characters or words in the target confusion dictionary to generate a large-scale high-quality data set, so that the accuracy and generalization capability of the target text error correction model are improved.
102. And acquiring a retraining text data set, and training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model.
The initial text error correction model comprises, in addition to the correction network based on the bert model, a detection network based on the bert model; that is, the target text error correction model likewise comprises a detection network based on the bert model. The server constructs the detection network and the correction network based on the bert model to obtain the initial text error correction model, wherein the bert model is a bert-base-chinese model. The detection network comprises, in addition to the bert model, a full connection layer with a first preset dimension, and the correction network comprises, in addition to the bert model, a classification layer with a second preset dimension.
The server inputs the wrongly written text data set into the preset initial text error correction model, which processes the data to produce a text correction result. A target loss value of the initial text error correction model is then calculated from the text correction result through the preset loss function, and the initial text error correction model is adjusted (its parameters or its framework) through the target loss value to obtain a candidate text error correction model. After obtaining authorization, the server crawls a retraining text data set at a preset time and in a preset quantity, the retraining text data set indicating texts that contain rare proper nouns or emerging nouns. The candidate text error correction model is retrained through the retraining text data set to obtain the target text error correction model.
Because the correction network based on the bert model outputs the correct text directly, no correction dictionary is needed, which improves the efficiency and intelligence of the target text error correction model while keeping the correction precision unchanged. Retraining on texts containing rare proper nouns or emerging nouns (i.e., the retraining text data set) improves the accuracy and efficiency of text error correction as well as the intelligence and generalization capability of the target text error correction model.
103. And acquiring a text to be processed, and sequentially carrying out position wrongly written word probability calculation and dictionary word correction on the text to be processed through a target text error correction model and a target confusion dictionary to obtain the text after error correction.
The target text error correction model comprises the detection network, the correction network, and an embedding layer, which is a bert_embedding layer. The input of the detection network is the sequence obtained by applying bert_embedding to the text to be processed, and the input of the correction network is the vector obtained by applying bert_embedding to the output of the detection network.
The text to be processed may be stored in a blockchain; the server extracts it from a preset database corresponding to the blockchain or receives it from a target terminal. The server embeds the text to be processed (i.e., embedding) to obtain the input sequence of the detection network; calculates, through the detection network, the probability that each position holds a wrongly written character (i.e., position wrongly written character probability calculation) to obtain the wrongly written character probabilities; performs custom embedding processing on those probabilities to obtain a wrongly written character probability vector; and, through the correction network, retrieves the corresponding words from the target confusion dictionary based on that vector, thereby obtaining the correct text, namely the error-corrected text.
The text to be processed is corrected through the target text correction model with high efficiency, high intelligence and high generalization capability, so that the accuracy of the corrected text and the correction efficiency are improved.
In the embodiment of the invention, the initial text error correction model is trained on a larger-scale, higher-quality data set (the wrongly written text data set), which improves the accuracy and generalization capability of the target text error correction model. Retraining on the retraining text data set addresses the low accuracy and efficiency that result when rare proper nouns or emerging nouns are absent from a wrong-character dictionary, improving the intelligence and generalization capability of the target text error correction model. Because the correction network requires no correction dictionary and is trained on a large-scale data set (the wrongly written text data set), higher operation speed, higher efficiency, and stronger intelligence are achieved at the same precision. Performing text error correction on the text to be processed through the target text error correction model therefore improves both the accuracy and the efficiency of text error correction. In addition, whereas prior-art methods for improving the generalization capability of such algorithms rely on a domain text error detection model and the construction of a fine-tuned model database, with relatively complex steps, the present algorithm retrains the model automatically every day: the steps are simple and the generalization capability of the model is markedly improved.
Referring to fig. 2, another embodiment of the text error correction method according to the embodiment of the present invention includes:
201. and acquiring a target confusion dictionary of the original text data set, and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set.
Specifically, the server acquires an original text data set, and performs word segmentation processing on each sentence in the original text data set to obtain a text word segmentation set; carrying out word frequency statistics on each text participle in the text participle set to obtain word frequency; performing dictionary index storage on the text word segmentation set after word frequency statistics to obtain a target confusion dictionary; and performing word replacement on the text word segmentation set through the target confusion dictionary and/or the word frequency to obtain a wrongly written text data set.
The original text data set may be stored in a blockchain. After the server obtains the original text data set from the blockchain (through legal and compliant channels), it performs security detection, data cleaning, data conversion, abnormal value detection, and null value filling on it to obtain a preprocessed original text data set. The server then calls a preset word segmentation component (such as the jieba component) to perform Chinese word segmentation on each sentence in the preprocessed original text data set, obtaining the text word segmentation set. Further, the preset word segmentation component may segment each sentence based on a preset word segmentation library, where the library comprises historically stored text sentences and the words corresponding to each sentence. The specific execution process is as follows: the preset word segmentation library is searched with each sentence in the preprocessed original text data set to obtain the corresponding word segmentation information, and the preset word segmentation component performs Chinese word segmentation on each sentence according to that information, obtaining the text word segmentation set.
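The dictionary-backed segmentation step can be illustrated with a simple forward-maximum-matching sketch. Real components such as jieba use far richer statistics, so this is only a toy stand-in, and `lexicon` is a hypothetical word library:

```python
def segment(sentence, lexicon, max_len=4):
    """Greedy forward maximum matching against a word library:
    at each position, take the longest lexicon entry, else one character."""
    out, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in lexicon:
                out.append(piece)
                i += length
                break
    return out
```

Characters not covered by the library fall through to single-character segments, which is the simplest possible fallback policy.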
The server calls a preset Chinese word frequency statistical tool to count the occurrence frequency of each word or character (each text word segment) in the text word segmentation set within the original text data set, obtaining the word frequency. According to a preset type, the word-frequency-annotated text word segmentation set is identified and screened to obtain target words, where the target words comprise characters (or words) with the same pinyin and/or that are easily confused. The target words are stored under the same dictionary index, where the dictionary index is the Chinese pinyin.
And performing word replacement on the text word segmentation set through the target confusion dictionary and/or the word frequency to obtain a wrongly written text data set. Specifically, the server selects words under the same dictionary index or words under similar pinyin indexes in the target confusion dictionary according to the word frequency to obtain a target word set; replacing words at corresponding positions in the text participle set through the target word set to obtain a wrongly written text data set; or, randomly selecting words in the target confusion dictionary to obtain random words; and replacing the words at the corresponding positions in the text word segmentation set by the random words to obtain a wrongly written text data set.
The server calculates the occurrence frequency of each text participle in the text participle set under the same index of the target confusion dictionary to obtain a first target probability, and extracts a word or a word with the word frequency corresponding to the first target probability from the words under the same index of the target confusion dictionary to obtain a target word set; or the server calculates the occurrence frequency of each text participle in the text participle set under the similar pinyin index of the target confusion dictionary to obtain a second target probability, and extracts a word or a word with the word frequency corresponding to the second target probability from the words under the similar pinyin index of the target confusion dictionary to obtain a target word set; and replacing the words at the corresponding positions in the text participle set through the target word set to obtain a wrongly written text data set. Or the server randomly selects words from the target confusion dictionary to obtain random words; and replacing the words at the corresponding positions in the text word segmentation set by the random words to obtain a wrongly written text data set.
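A hedged sketch of the frequency-weighted replacement described above: a token is swapped, with some probability, for a same-index confusable sampled in proportion to its recorded frequency. The `rate` parameter and the tiny dictionaries below are illustrative assumptions, not values from the patent:

```python
import random

def corrupt(tokens, confusion, pinyin_of, rate=0.3, seed=0):
    """Replace some tokens with same-pinyin confusables from `confusion`,
    sampling candidates in proportion to their word frequency."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        cands = confusion.get(pinyin_of.get(tok), {})
        others = {w: f for w, f in cands.items() if w != tok}
        if others and rng.random() < rate:
            words, freqs = zip(*others.items())
            out.append(rng.choices(words, weights=freqs)[0])
        else:
            out.append(tok)  # keep the original token
    return out
```

The random-word variant in the text corresponds to sampling uniformly from the whole dictionary instead of from the same-pinyin index.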
The initial text error correction model is trained by a larger-scale and higher-quality data set (namely a wrongly written text data set), so that the accuracy and generalization capability of the target text error correction model can be remarkably improved.
202. And acquiring a retraining text data set, and training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model and a detection network based on the bert model.
Specifically, the server constructs a detection network and a correction network to obtain an initial text error correction model, wherein the detection network comprises a bert model and a full connection layer with a first preset dimension, and the correction network comprises the bert model and a classification layer with a second preset dimension; training the initial text error correction model through the wrongly written text data set and a preset loss function to obtain a candidate text error correction model; acquiring a retraining text data set, wherein the retraining text data set comprises rare proper nouns or emerging nouns and wrongly written characters corresponding to the rare proper nouns or the emerging nouns; and retraining the candidate text error correction model by retraining the text data set to obtain the target text error correction model.
In addition to the correction network based on the bert model, the initial text error correction model comprises a detection network based on the bert model; that is, the target text error correction model likewise comprises both networks. The server constructs the detection network and the correction network to obtain the initial text error correction model, where the detection network comprises a bert model and a fully connected layer of a first preset dimension, and the correction network comprises a bert model and a classification layer of a second preset dimension. Detection network: a fully connected layer of the first preset dimension is appended to the existing bert model; in the embodiment of the invention the first preset dimension is 768 × 512, and the activation function is an S-shaped function, namely sigmoid. Correction network: a normalized exponential function layer (a softmax layer) of the second preset dimension is appended to the existing bert model, with a residual connection to the embeddings of the original text (i.e., the original text data set); each softmax output is a vector of size vocab_size × 1, where vocab_size is the length of the dictionary. The first and second preset dimensions may be the same or different; in the embodiment of the invention the second preset dimension is 512.
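The two heads can be sketched in miniature, with toy dimensions instead of the 768 × 512 layer and plain Python instead of a deep-learning framework; the weight values are arbitrary placeholders, not learned parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def detection_head(hidden_states, w, b):
    """Fully connected layer + sigmoid: one error probability per position."""
    return [sigmoid(sum(h * wi for h, wi in zip(hs, w)) + b)
            for hs in hidden_states]

def correction_head(hidden_states, orig_embeds, w_out):
    """Residual-add the original-text embeddings, project to the vocabulary,
    and apply softmax: one vocab_size-long distribution per position."""
    dists = []
    for hs, oe in zip(hidden_states, orig_embeds):
        resid = [a + b for a, b in zip(hs, oe)]  # residual connection
        logits = [sum(r * w for r, w in zip(resid, col)) for col in w_out]
        dists.append(softmax(logits))
    return dists
```

In the patent's configuration the hidden states would come from bert-base-chinese, with a 768-dimensional input to the detection layer and a vocab_size-wide output from the correction layer.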
And the server trains the initial text error correction model through the wrongly written text data set and the preset loss function to obtain a candidate text error correction model. Specifically, the server performs text error correction processing on the wrongly written text data set through an initial text error correction model to obtain an error correction result; respectively calculating loss values of the detection network and the correction network based on an error correction result through a preset loss function to obtain a detection loss value and a correction loss value; calculating a linear weighted sum of the detection loss value and the correction loss value to obtain a target loss value; and adjusting the initial text error correction model through the target loss value to obtain a candidate text error correction model.
The text error correction processing can comprise the calculation of the probability of wrongly written characters of the detection network and the word correction of a target confusion dictionary based on the probability of wrongly written characters of the correction network, and the preset loss function comprises a loss function of the detection network, a loss function of the correction network and a total loss function. And the server inputs the wrongly written text data set into the initial text error correction model, and carries out wrongly written word probability calculation and word correction of a target confusion dictionary based on wrongly written word probability on the wrongly written text data set through the initial text error correction model to obtain an error correction result.
The loss value of the detection network is calculated from the error correction result through the loss function of the detection network, obtaining the detection loss value, where the loss function of the detection network is:

Loss_d = -Σ_{i=1}^{n} log P_d(g_i | X)

where Loss_d represents the detection loss value, i represents the position of the character, n represents the total number of characters in each sentence of the original text data set, P_d(g_i | X) represents the probability that the position of the i-th character of each sentence in the error correction result is a wrongly written character, g_i represents the i-th character of each sentence of the original text data set, and X represents each sentence of the original text data set. The loss value of the correction network is calculated from the error correction result through the loss function of the correction network, obtaining the correction loss value, where the loss function of the correction network is:

Loss_c = -Σ_{i=1}^{n} log P_c(y_i | X)

where Loss_c represents the correction loss value, i represents the position of the character, n represents the total number of characters in each sentence of the original text data set, P_c(y_i | X) represents the probability that the position of the i-th character of each sentence in the error correction result is any character in the target confusion dictionary, y_i represents the i-th character in the target confusion dictionary, and X represents each sentence of the original text data set.
A linear weighted sum of the detection loss value and the correction loss value is calculated through the total loss function to obtain the target loss value, where the total loss function is:

Loss = μ · Loss_c + (1 - μ) · Loss_d

where Loss denotes the target loss value, Loss_d the detection loss value, Loss_c the correction loss value, and μ a hyperparameter with value range [0, 1]. The initial text error correction model is then adjusted (its parameters or its framework) through the target loss value to obtain the candidate text error correction model.
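Under the definitions above, the three loss terms reduce to a few lines; the example values used in the usage note are illustrative only:

```python
import math

def detection_loss(p_err, is_err):
    """-Σ log P_d(g_i | X): binary cross-entropy over positions, where
    p_err[i] is the predicted wrong-character probability and is_err[i]
    is 1 if position i really holds a wrong character."""
    return -sum(math.log(p if y else 1.0 - p)
                for p, y in zip(p_err, is_err))

def correction_loss(char_prob_sets, gold_ids):
    """-Σ log P_c(y_i | X): negative log-likelihood of the gold character
    under each position's distribution over the confusion dictionary."""
    return -sum(math.log(dist[g])
                for dist, g in zip(char_prob_sets, gold_ids))

def total_loss(loss_c, loss_d, mu=0.8):
    """Loss = mu * Loss_c + (1 - mu) * Loss_d, with mu in [0, 1]."""
    return mu * loss_c + (1.0 - mu) * loss_d
```

For example, with mu = 0.5 the two networks contribute equally; the patent leaves the choice of μ as a hyperparameter.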
After obtaining authorization, the server crawls texts containing rare proper nouns or emerging nouns at a preset time and in a preset quantity, preprocesses them to obtain preprocessed texts, screens the preprocessed texts to obtain screened texts, and processes the screened texts through the execution process of step 201 (target confusion dictionary construction and word replacement) to obtain the retraining text data set, which indicates text data on which the candidate text error correction model is prone to error among rare proper nouns or emerging nouns. The candidate text error correction model is retrained through the retraining text data set and the preset loss function to obtain the target text error correction model, so that the target text error correction model acquires a self-learning capability, masters new error correction rules, and gains in accuracy and generalization capability.
In experiments, the target text error correction model achieved an accuracy (the proportion of characters predicted correctly by the correction algorithm among all characters) of 0.9849, a precision (the proportion of characters flagged by the detection algorithm that are truly wrongly written) of 0.9803, and a recall (the proportion of all wrongly written characters that the detection algorithm flags) of 0.9775. The target text error correction model of the embodiment of the invention realizes self-learning and automatic path finding; its steps are simple and its generalization capability is markedly improved. The detection algorithm understands the text, finds the positions of wrong characters, masks them, and feeds the result to the correction network, which must understand the context of each masked position in order to predict the correct character and complete the error correction. The main implementation difficulty lies in the high precision that an error correction algorithm requires; through daily self-learning, the precision of the algorithm reaches above 0.98, which is among the best in the industry.
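The reported figures can be interpreted with a small metric sketch. Note that the accuracy formula here assumes, for simplicity, that the corrector is right whenever the detector is; that is an assumption of this sketch, not a statement from the patent:

```python
def detection_metrics(pred_err_pos, gold_err_pos, n_chars):
    """Precision/recall over flagged positions, accuracy over all characters.
    Assumes (for this sketch only) that every correctly flagged position is
    also corrected to the right character."""
    pred, gold = set(pred_err_pos), set(gold_err_pos)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    wrong = len(pred ^ gold)  # false positives + false negatives
    accuracy = (n_chars - wrong) / n_chars
    return accuracy, precision, recall
```

With these definitions, a precision of 0.98 means only two in a hundred flagged characters are false alarms.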
The detection network avoids a complex wrong-character dictionary matching method, and training on the large-scale data set (the wrongly written text data set) achieves higher operation speed, higher efficiency, and stronger intelligence at the same precision. The correction network needs no correction dictionary and outputs the correct text directly; training on the large-scale data set likewise achieves higher efficiency and stronger intelligence at the same precision. Retraining on the retraining text data set addresses the low accuracy and efficiency that result when rare proper nouns or emerging nouns are absent from the wrong-character dictionary, markedly improving the intelligence and generalization capability of the target text error correction model. In addition, whereas prior-art methods for improving the generalization capability of such algorithms rely on a domain text error detection model and the construction of a fine-tuned model database, with relatively complex steps, the present algorithm retrains the model automatically every day: the steps are simple and the generalization capability of the model is markedly improved.
203. And acquiring a text to be processed, and performing embedded vector conversion on the text to be processed through a target text error correction model to obtain a text vector sequence.
The target text error correction model comprises the detection network, the correction network, and two embedding layers, each a bert_embedding layer: a first embedding layer connected in front of the detection network and a second embedding layer connected in front of the correction network. The first embedding layer embeds the text to be processed, and the second embedding layer performs custom embedding on the output of the detection network.
The server extracts the text to be processed from the preset database or receives it from the target terminal. Through the first embedding layer, the text to be processed undergoes bert_embedding conversion to obtain the text vector sequence x_i = (x_1, x_2, x_3, x_4, …). Converting the text to be processed into embedded vectors facilitates the subsequent wrongly written character probability calculation by the detection network.
Furthermore, the number of the texts to be processed can be one or more, when the number of the texts to be processed is more than one, the texts to be processed can be written into preset multiple threads, and asynchronous parallel processing is performed through the written multiple threads, so that the running processing efficiency is improved.
204. And calculating the probability of wrongly written characters at each position in the text vector sequence through a detection network to obtain the probability of wrongly written characters.
The detection network can understand the text to be processed and find the positions of wrongly written characters. The server inputs the text vector sequence x_i = (x_1, x_2, x_3, x_4, …) into the detection network; through the bert model of the detection network and the fully connected layer of the first preset dimension, wrongly written character probability calculation is performed on the sequence to obtain the wrongly written character probabilities p_i = (p_1, p_2, p_3, p_4, …), where each p_i indicates the probability that the corresponding position of the text vector sequence holds a wrong character.
The probability calculation of wrongly written characters is carried out on each position in the text vector sequence through the detection network, so that the complex wrongly written character matching process is avoided, and the operation speed is improved under the condition of consistent precision.
205. And carrying out mask embedded vector conversion on the wrongly-written character probability to obtain a wrongly-written character probability vector.
Through the second embedding layer, the server performs custom embedding vector processing on the wrongly written character probabilities p_i = (p_1, p_2, p_3, p_4, …) to obtain the wrongly written character probability vector x_i' = (x_1', x_2', x_3', x_4', …). Specifically, the second embedding layer performs mask processing (i.e., mask) to obtain the mask embedding, and then combines the wrongly written character probabilities p_i, the mask embedding, and the text vector sequence x_i = (x_1, x_2, x_3, x_4, …) through a custom embedding formula to obtain the wrongly written character probability vector x_i'. The custom embedding formula is as follows:

x_i' = p_i · x_mask + (1 - p_i) · x_i

where x_i' represents the wrongly written character probability vector, p_i the wrongly written character probability, x_mask the mask embedding, and x_i the text vector sequence.
By performing mask embedded vector conversion on the wrongly-written characters, probability calculation and probability classification of a subsequent correction network are facilitated, and a foundation is laid for improving the processing efficiency and accuracy of the correction network.
206. And performing per-position character probability calculation and probability classification on the text to be processed through the correction network, based on the wrongly written character probability vector and the target confusion dictionary, to obtain the error-corrected text.
Specifically, the server calculates the word probability of each position of the text to be processed based on the error probability vector and the target confusion dictionary through a bert model of a correction network to obtain a word probability set of each position, wherein the correction network comprises the bert model and a normalization index function; and acquiring words corresponding to each position in the text to be processed from the target confusion dictionary based on the word probability set of each position through a normalized exponential function to obtain an error-corrected text, wherein the error-corrected text is a two-dimensional vector of the sentence length and the dictionary length.
The correction network can understand the context of a masked position from which a wrongly written character has been removed, thereby predicting the correct character and completing the error correction. The server inputs the wrongly written character probability vector x_i' = (x_1', x_2', x_3', x_4', …) into the bert model of the correction network; based on x_i', the bert model calculates, for each position of the text to be processed, the probability of every character in the target confusion dictionary, obtaining a character probability set for each position, where the character probability is the probability that the position in the original text (i.e., the text to be processed) is output as a given character in the target confusion dictionary. The normalized exponential function softmax then selects the index corresponding to the maximum character probability in each set, and the character or word found in the target confusion dictionary at that index is the corrected character. A two-dimensional vector of seq_length × vocab_size is output, where seq_length is the length of the sentence and vocab_size is the length of the dictionary, yielding the error-corrected text.
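The final softmax selection amounts to an argmax over each position's character probability set. A minimal sketch, with a hypothetical three-entry dictionary:

```python
def decode(char_prob_sets, index_to_word):
    """Pick the maximum-probability dictionary entry at each position and
    join the results into the corrected text."""
    out = []
    for dist in char_prob_sets:
        idx = max(range(len(dist)), key=dist.__getitem__)
        out.append(index_to_word[idx])
    return "".join(out)

vocab = ["的", "地", "得"]       # toy stand-in for the confusion dictionary
probs = [[0.1, 0.7, 0.2],       # one vocab_size-long distribution
         [0.8, 0.1, 0.1]]       # per position (seq_length rows)
```

The probs table is exactly the seq_length × vocab_size output the text describes, reduced to toy dimensions.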
Because the correction network performs per-position character probability calculation and probability classification on the text to be processed based on the wrongly written character probability vector and the target confusion dictionary, the correct text (i.e., the error-corrected text) is output directly without a correction dictionary, which improves the efficiency, accuracy, and intelligence of error correction for the text to be processed.
In the embodiment of the invention, the initial text error correction model is trained on a larger-scale, higher-quality data set (the wrongly written text data set), improving the accuracy and generalization capability of the target text error correction model. Retraining on the retraining text data set addresses the low accuracy and efficiency that result when rare proper nouns or emerging nouns are absent from a wrong-character dictionary, improving the intelligence and generalization capability of the target text error correction model. Training on the large-scale data set (the wrongly written text data set) achieves higher operation speed, higher efficiency, and stronger intelligence at the same precision, and performing text error correction on the text to be processed through the target text error correction model improves both the accuracy and the efficiency of text error correction. In addition, whereas prior-art methods for improving the generalization capability of such algorithms rely on a domain text error detection model and the construction of a fine-tuned model database, with relatively complex steps, the present algorithm retrains the model automatically every day: the steps are simple and the generalization capability of the model is markedly improved.
In the above description of the text error correction method in the embodiment of the present invention, referring to fig. 3, a text error correction device in the embodiment of the present invention is described below, and an embodiment of the text error correction device in the embodiment of the present invention includes:
the replacing module 301 is configured to obtain a target confusion dictionary of the original text data set, and perform word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set;
the training module 302 is configured to obtain a retraining text data set, train a preset initial text error correction model through the wrongly written text data set, a preset loss function, and the retraining text data set, and obtain a target text error correction model, where the target text error correction model includes a correction network based on a bert model;
and the calculation and correction module 303 is configured to obtain a text to be processed, and sequentially perform position-wrongly-written-word probability calculation and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text.
The function implementation of each module in the text error correction device corresponds to each step in the text error correction method embodiment, and the function and implementation process thereof are not described in detail herein.
In the embodiment of the invention, the initial text error correction model is trained on a larger-scale, higher-quality data set (the wrongly written text data set), which improves the accuracy and generalization capability of the target text error correction model. Retraining on the retraining text data set addresses the low accuracy and efficiency that result when rare proper nouns or emerging nouns are absent from the wrong-character dictionary, improving the intelligence and generalization capability of the target text error correction model. Because the correction network requires no correction dictionary and is trained on a large-scale data set (the wrongly written text data set), higher operation speed, higher efficiency, and stronger intelligence are achieved at the same precision. Performing text error correction on the text to be processed through the target text error correction model therefore improves both the accuracy and the efficiency of text error correction. In addition, whereas prior-art methods for improving the generalization capability of such algorithms rely on a domain text error detection model and the construction of a fine-tuned model database, with relatively complex steps, the present algorithm retrains the model automatically every day: the steps are simple and the generalization capability of the model is markedly improved.
Referring to fig. 4, another embodiment of the text error correction apparatus according to the embodiment of the present invention includes:
the replacing module 301 is configured to obtain a target confusion dictionary of the original text data set, and perform word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set;
the training module 302 is configured to obtain a retraining text data set, train a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set, and obtain a target text error correction model, where the target text error correction model includes a correction network based on a bert model and a detection network based on the bert model;
the calculation correction module 303 is configured to obtain a text to be processed, and sequentially perform position-wrongly-written character probability calculation and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text;
wherein, the calculation and correction module 303 specifically includes:
the first conversion unit 3031 is configured to obtain a text to be processed, and perform embedded vector conversion on the text to be processed through a target text error correction model to obtain a text vector sequence;
a calculating unit 3032, configured to perform probability calculation of a wrongly written character at each position in the text vector sequence through a detection network, to obtain a probability of wrongly written characters;
a second conversion unit 3033, configured to perform mask embedded vector conversion on the wrongly written word probability to obtain a wrongly written word probability vector;
and the classification unit 3034 is configured to perform word probability calculation and probability classification at each position on the text to be processed through the correction network based on the wrongly written word probability vector and the target confusion dictionary to obtain the error-corrected text.
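The detect-then-correct flow implemented by units 3031-3034 follows the soft-masking idea: the detection network's per-position error probability decides how strongly each token embedding is blended toward the [MASK] embedding before the correction network scores dictionary words. The sketch below is a minimal toy illustration with random linear layers standing in for the bert-based networks; the dimensions and variable names are assumptions, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4-token sentence, 8-dim embeddings, 6-word confusion dictionary.
seq_len, emb_dim, dict_size = 4, 8, 6
token_emb = rng.normal(size=(seq_len, emb_dim))   # embedding step (unit 3031)
mask_emb = rng.normal(size=(emb_dim,))            # embedding of the [MASK] token

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Detection network (unit 3032): per-position wrongly-written-word probability.
w_det = rng.normal(size=(emb_dim,))
p_err = sigmoid(token_emb @ w_det)                # shape (seq_len,)

# Mask embedded vector conversion (unit 3033): blend each token embedding
# with the [MASK] embedding in proportion to its error probability.
soft_masked = p_err[:, None] * mask_emb + (1.0 - p_err[:, None]) * token_emb

# Correction network (unit 3034): score every confusion-dictionary word at
# every position, then classify with a normalized exponential function.
w_cor = rng.normal(size=(emb_dim, dict_size))
probs = softmax(soft_masked @ w_cor)              # shape (seq_len, dict_size)
corrected_ids = probs.argmax(axis=1)              # chosen dictionary index per position

print(probs.shape)        # (4, 6): sentence length x dictionary length
```

Note how the output has one row per sentence position and one column per dictionary word, matching the sentence-length-by-dictionary-length output shape described for the error-corrected text.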
Optionally, the classification unit 3034 may further specifically be configured to:
calculating the word probability of each position of the text to be processed based on the wrongly written word probability vector and the target confusion dictionary through the bert model of the correction network to obtain a word probability set of each position, wherein the correction network comprises the bert model and a normalized exponential function;
and acquiring the word corresponding to each position in the text to be processed from the target confusion dictionary based on the word probability set of each position through the normalized exponential function to obtain the error-corrected text, wherein the error-corrected text is represented as a two-dimensional matrix whose dimensions are the sentence length and the dictionary length.
Optionally, the training module 302 includes:
the construction unit 3021 is configured to construct a detection network and a correction network to obtain an initial text error correction model, where the detection network includes a bert model and a full connection layer with a first preset dimension, and the correction network includes a bert model and a classification layer with a second preset dimension;
the training unit 3022 is configured to train the initial text error correction model through the wrongly written text data set and the preset loss function to obtain a candidate text error correction model;
an obtaining unit 3023, configured to obtain a retraining text data set, where the retraining text data set includes a rare proper noun or an emerging noun, and a wrongly written word corresponding to the rare proper noun or the emerging noun;
and the retraining unit 3024 is configured to retrain the candidate text error correction model by retraining the text data set to obtain a target text error correction model.
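The two-stage schedule described by units 3022 and 3024 (first fit on the large synthesized wrongly written data set, then continue fitting on the smaller retraining set of rare or emerging nouns) can be sketched as follows. A plain logistic-regression loop stands in for the bert-based model; the optimizer, learning rates and epoch counts are placeholders, not values from the patent:

```python
import numpy as np

def train(weights, data, labels, lr=0.1, epochs=50):
    # Gradient descent on binary cross-entropy; only the two-phase
    # schedule matters here, not the model itself.
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(data @ weights)))
        weights = weights - lr * data.T @ (p - labels) / len(labels)
    return weights

rng = np.random.default_rng(1)
w = np.zeros(3)

# Phase 1: large wrongly written text data set -> candidate model.
big_x = rng.normal(size=(200, 3))
big_y = (big_x[:, 0] > 0).astype(float)
w = train(w, big_x, big_y)

# Phase 2: small retraining set (rare proper nouns / emerging nouns)
# -> target model. A smaller step fine-tunes without discarding phase 1.
rare_x = rng.normal(size=(20, 3))
rare_y = (rare_x[:, 0] > 0).astype(float)
w = train(w, rare_x, rare_y, lr=0.02)

acc = (((1.0 / (1.0 + np.exp(-(big_x @ w)))) > 0.5) == big_y).mean()
print(acc)
```

The point of the second, lower-rate phase is that the candidate model keeps what it learned from the bulk data while absorbing the rare-noun corrections.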
Optionally, the training unit 3022 may be further specifically configured to:
performing text error correction processing on the wrongly written text data set through an initial text error correction model to obtain an error correction result;
respectively calculating loss values of the detection network and the correction network based on an error correction result through a preset loss function to obtain a detection loss value and a correction loss value;
calculating a linear weighted sum of the detection loss value and the correction loss value to obtain a target loss value;
and adjusting the initial text error correction model through the target loss value to obtain a candidate text error correction model.
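The loss computation above combines a detection loss and a correction loss into a single target loss via a linear weighted sum. A minimal sketch, assuming binary cross-entropy for the detection network and categorical cross-entropy for the correction network (the weight value 0.8 is an assumption; the patent only states that a linear weighted sum is used):

```python
import numpy as np

def detection_loss(p, y):
    # Binary cross-entropy over per-position error probabilities.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def correction_loss(probs, target_ids):
    # Cross-entropy over per-position word distributions.
    picked = np.clip(probs[np.arange(len(target_ids)), target_ids], 1e-7, None)
    return float(-np.log(picked).mean())

def target_loss(det_p, det_y, cor_probs, cor_y, weight=0.8):
    # Linear weighted sum of the two losses -> target loss value.
    l_det = detection_loss(det_p, det_y)
    l_cor = correction_loss(cor_probs, cor_y)
    return weight * l_cor + (1 - weight) * l_det, l_det, l_cor

# Toy example: 3 positions, 4-word confusion dictionary.
det_p = np.array([0.9, 0.1, 0.2])       # predicted error probabilities
det_y = np.array([1.0, 0.0, 0.0])       # position 0 really is wrong
cor_probs = np.array([[0.70, 0.10, 0.10, 0.10],
                      [0.10, 0.80, 0.05, 0.05],
                      [0.25, 0.25, 0.25, 0.25]])
cor_y = np.array([0, 1, 2])             # correct dictionary index per position
total, l_det, l_cor = target_loss(det_p, det_y, cor_probs, cor_y)
print(total)
```

Adjusting the model against this single scalar trains the detection and correction networks jointly rather than in isolation.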
Optionally, the replacing module 301 includes:
the word segmentation unit 3011 is configured to obtain an original text data set, and perform word segmentation processing on each sentence in the original text data set to obtain a text word segmentation set;
a statistics unit 3012, configured to perform word frequency statistics on each text participle in the text word segmentation set to obtain a word frequency;
a saving unit 3013, configured to perform dictionary index saving on the text segmentation sets after the word frequency statistics is performed, so as to obtain a target confusion dictionary;
and the replacing unit 3014 is configured to perform word replacement on the text segmentation sets through the target confusion dictionary and/or the word frequency to obtain a wrongly written text data set.
Optionally, the replacing unit 3014 may be further specifically configured to:
selecting words under the same dictionary index or similar pinyin indexes in the target confusion dictionary according to the word frequency to obtain a target word set;
replacing words at corresponding positions in the text participle set through the target word set to obtain a wrongly written text data set;
or, randomly selecting words in the target confusion dictionary to obtain random words;
and replacing the words at the corresponding positions in the text word segmentation set by the random words to obtain a wrongly written text data set.
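The replacing unit's corruption step, which swaps words in the segmented text for confusable words drawn from the confusion dictionary, can be sketched as follows. The miniature dictionary, the replacement rate, and the English example tokens are hypothetical illustrations; in the patent the candidates would come from the same dictionary index or a similar pinyin index:

```python
import random

# Hypothetical miniature confusion dictionary: each entry maps a word to
# confusable candidates (e.g. homophones sharing a similar index).
confusion_dict = {
    "weather": ["whether", "wether"],
    "their": ["there", "they're"],
}

def corrupt(tokens, confusion_dict, rate=0.3, rng=None):
    """Replace some tokens with confusable words, producing the
    (corrupted, original) pair that forms the wrongly written text data set."""
    rng = rng or random.Random(0)
    corrupted = []
    for tok in tokens:
        candidates = confusion_dict.get(tok)
        if candidates and rng.random() < rate:
            corrupted.append(rng.choice(candidates))
        else:
            corrupted.append(tok)
    return corrupted

original = ["check", "the", "weather", "before", "their", "trip"]
noisy = corrupt(original, confusion_dict, rate=1.0)
print(noisy)
```

Pairing each corrupted sentence with its clean original gives supervised training data for both the detection network (which positions changed) and the correction network (what the original word was).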
The functional implementation of each module and each unit in the text error correction device corresponds to each step in the text error correction method embodiment, and the functions and implementation processes are not described in detail herein.
In the embodiment of the invention, the initial text error correction model is trained on a larger-scale, higher-quality data set (the wrongly written text data set), which improves the accuracy and generalization capability of the target text error correction model. Retraining on the retraining text data set solves the low accuracy and efficiency of text error correction that arise when rare proper nouns or emerging nouns are not recorded in the wrongly written word dictionary, further improving the intelligence and generalization capability of the target text error correction model. Training on the large-scale data set (the wrongly written text data set) gives a higher operation speed, higher efficiency and stronger intelligence at the same precision. Furthermore, text error correction is performed on the text to be processed through the target text error correction model, which improves the accuracy and efficiency of text error correction. In addition, the prior-art method for improving the generalization capability of the algorithm uses a domain text error detection model and constructs a fine-tuning model database, which involves relatively complex steps; the present algorithm realizes daily automatic training of the model, so the steps are simple and the generalization capability of the model is obviously improved.
The text error correction device in the embodiment of the present invention is described in detail in terms of the modular functional entity in fig. 3 and 4 above, and the text error correction device in the embodiment of the present invention is described in detail in terms of the hardware processing below.
Fig. 5 is a schematic structural diagram of a text error correction apparatus according to an embodiment of the present invention. The text error correction apparatus 500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), and each module may include a series of computer program operations for the text error correction apparatus 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of computer program operations in the storage medium 530 on the text error correction apparatus 500.
The text error correction apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be understood by those skilled in the art that the configuration of the text correction device shown in FIG. 5 does not constitute a limitation of the text correction device, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The present invention also provides a text error correction apparatus, comprising: a memory and at least one processor, the memory having a computer program stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the computer program in the memory to cause the text error correction apparatus to perform the steps in the text error correction method described above.

The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, having a computer program stored thereon which, when run on a computer, causes the computer to perform the steps of the text error correction method.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, each data block containing information on a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several computer programs to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A text error correction method, characterized in that the text error correction method comprises:
acquiring a target confusion dictionary of an original text data set, and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set;
acquiring a retraining text data set, and training a preset initial text error correction model through the wrongly written or mispronounced character text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model;
and acquiring a text to be processed, and sequentially carrying out position wrongly written word probability calculation and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text.
2. The text error correction method according to claim 1, wherein the target text error correction model includes a correction network based on a bert model and a detection network based on the bert model, and the obtaining of the text to be processed sequentially performs position-wrongly-written-word probability calculation and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain the text after error correction comprises:
acquiring a text to be processed, and performing embedded vector conversion on the text to be processed through the target text error correction model to obtain a text vector sequence;
calculating the probability of wrongly written characters at each position in the text vector sequence through the detection network to obtain the probability of wrongly written characters;
performing mask embedded vector conversion on the wrongly-written character probability to obtain a wrongly-written character probability vector;
and performing word probability calculation and probability classification of each position on the text to be processed through the correction network based on the wrongly-written character probability vector and the target confusion dictionary to obtain the error-corrected text.
3. The method according to claim 2, wherein the performing, by the correction network, word probability calculation and probability classification for each position of the text to be processed based on the wrongly-written character probability vector and the target confusion dictionary to obtain the error-corrected text comprises:
calculating the word probability of each position of the text to be processed based on the wrongly-written character probability vector and the target confusion dictionary through the bert model of the correction network to obtain a word probability set of each position, wherein the correction network comprises the bert model and a normalized exponential function;
and acquiring the word corresponding to each position in the text to be processed from the target confusion dictionary based on the word probability set of each position through the normalized exponential function to obtain the error-corrected text, wherein the error-corrected text is represented as a two-dimensional matrix whose dimensions are the sentence length and the dictionary length.
4. The method of claim 1, wherein the acquiring a retraining text data set and training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set to obtain a target text error correction model, the target text error correction model comprising a bert model-based correction network, comprises:
constructing a detection network and a correction network to obtain an initial text error correction model, wherein the detection network comprises a bert model and a full connection layer with a first preset dimension, and the correction network comprises the bert model and a classification layer with a second preset dimension;
training the initial text error correction model through the wrongly written text data set and a preset loss function to obtain a candidate text error correction model;
acquiring a retraining text data set, wherein the retraining text data set comprises rare proper nouns or emerging nouns and wrongly written characters corresponding to the rare proper nouns or the emerging nouns;
and retraining the candidate text error correction model through the retraining text data set to obtain a target text error correction model.
5. The method of claim 4, wherein the training the initial text correction model through the wrongly written text data set and a predetermined loss function to obtain a candidate text correction model comprises:
performing text error correction processing on the wrongly written text data set through the initial text error correction model to obtain an error correction result;
respectively calculating loss values of the detection network and the correction network based on the error correction result through a preset loss function to obtain a detection loss value and a correction loss value;
calculating a linear weighted sum of the detection loss value and the correction loss value to obtain a target loss value;
and adjusting the initial text error correction model according to the target loss value to obtain a candidate text error correction model.
6. The method for correcting the text errors according to any one of claims 1 to 5, wherein the obtaining a target confusion dictionary of an original text data set and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set comprises:
acquiring an original text data set, and performing word segmentation processing on each sentence in the original text data set to obtain a text word segmentation set;
carrying out word frequency statistics on each text participle in the text participle set to obtain word frequency;
performing dictionary index storage on the text word segmentation set after word frequency statistics to obtain a target confusion dictionary;
and performing word replacement on the text segmented word set through the target confusion dictionary and/or the word frequency to obtain a wrongly written text data set.
7. The method of claim 6, wherein the performing word replacement on the text segmented word set through the target confusion dictionary and/or the word frequency to obtain a wrongly written text data set comprises:
selecting words under the same dictionary index or words under similar pinyin indexes in the target confusion dictionary according to the word frequency to obtain a target word set;
replacing words at corresponding positions in the text word segmentation sets through the target word sets to obtain wrongly written text data sets;
or, randomly selecting words in the target confusion dictionary to obtain random words;
and replacing the words at the corresponding positions in the text word segmentation set through the random words to obtain a wrongly written text data set.
8. A text correction apparatus, characterized in that the text correction apparatus comprises:
the replacing module is used for acquiring a target confusion dictionary of the original text data set and performing word replacement on the original text data set through the target confusion dictionary to obtain a wrongly written text data set;
the training module is used for acquiring a retraining text data set, training a preset initial text error correction model through the wrongly written text data set, a preset loss function and the retraining text data set, and obtaining a target text error correction model, wherein the target text error correction model comprises a correction network based on a bert model;
and the calculation correction module is used for acquiring a text to be processed, and sequentially performing position wrongly written character probability calculation and dictionary word correction on the text to be processed through the target text error correction model and the target confusion dictionary to obtain an error-corrected text.
9. A text correction apparatus, characterized in that the text correction apparatus comprises: a memory and at least one processor, the memory having stored therein a computer program;
the at least one processor invokes the computer program in the memory to cause the text correction apparatus to perform the text correction method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a text correction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111210864.2A CN113901797B (en) | 2021-10-18 | 2021-10-18 | Text error correction method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111210864.2A CN113901797B (en) | 2021-10-18 | 2021-10-18 | Text error correction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113901797A true CN113901797A (en) | 2022-01-07 |
CN113901797B CN113901797B (en) | 2024-07-16 |
Family
ID=79192513
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111210864.2A Active CN113901797B (en) | 2021-10-18 | 2021-10-18 | Text error correction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901797B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114372441A (en) * | 2022-03-23 | 2022-04-19 | 中电云数智科技有限公司 | Automatic error correction method and device for Chinese text |
CN114492396A (en) * | 2022-02-17 | 2022-05-13 | 重庆长安汽车股份有限公司 | Text error correction method for automobile proper nouns and readable storage medium |
CN114742040A (en) * | 2022-06-09 | 2022-07-12 | 北京沃丰时代数据科技有限公司 | Text error correction method, text error correction device and electronic equipment |
CN114896965A (en) * | 2022-05-17 | 2022-08-12 | 马上消费金融股份有限公司 | Text correction model training method and device and text correction method and device |
CN114997148A (en) * | 2022-08-08 | 2022-09-02 | 湖南工商大学 | Chinese spelling proofreading pre-training model construction method based on contrast learning |
CN115293138A (en) * | 2022-08-03 | 2022-11-04 | 北京中科智加科技有限公司 | Text error correction method and computer equipment |
CN116542241A (en) * | 2023-06-25 | 2023-08-04 | 四川蔚丰云联信息科技有限公司 | Matching method of emergency plan and emergency medical rescue cooperative command platform system |
CN117474510A (en) * | 2023-12-25 | 2024-01-30 | 彩讯科技股份有限公司 | Feature selection-based spam filtering method |
CN118135578A (en) * | 2024-05-10 | 2024-06-04 | 沈阳出版社有限公司 | Text learning and proofreading system based on text recognition |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106528532A (en) * | 2016-11-07 | 2017-03-22 | 上海智臻智能网络科技股份有限公司 | Text error correction method and device and terminal |
CN112287670A (en) * | 2020-11-18 | 2021-01-29 | 北京明略软件系统有限公司 | Text error correction method, system, computer device and readable storage medium |
2021-10-18: CN CN202111210864.2A patent/CN113901797B/en active Active
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114492396A (en) * | 2022-02-17 | 2022-05-13 | 重庆长安汽车股份有限公司 | Text error correction method for automobile proper nouns and readable storage medium |
CN114372441B (en) * | 2022-03-23 | 2022-06-03 | 中电云数智科技有限公司 | Automatic error correction method and device for Chinese text |
CN114372441A (en) * | 2022-03-23 | 2022-04-19 | 中电云数智科技有限公司 | Automatic error correction method and device for Chinese text |
CN114896965A (en) * | 2022-05-17 | 2022-08-12 | 马上消费金融股份有限公司 | Text correction model training method and device and text correction method and device |
CN114896965B (en) * | 2022-05-17 | 2023-09-12 | 马上消费金融股份有限公司 | Text correction model training method and device, text correction method and device |
CN114742040A (en) * | 2022-06-09 | 2022-07-12 | 北京沃丰时代数据科技有限公司 | Text error correction method, text error correction device and electronic equipment |
CN115293138A (en) * | 2022-08-03 | 2022-11-04 | 北京中科智加科技有限公司 | Text error correction method and computer equipment |
CN114997148B (en) * | 2022-08-08 | 2022-11-04 | 湖南工商大学 | Chinese spelling proofreading pre-training model construction method based on contrast learning |
CN114997148A (en) * | 2022-08-08 | 2022-09-02 | 湖南工商大学 | Chinese spelling proofreading pre-training model construction method based on contrast learning |
CN116542241A (en) * | 2023-06-25 | 2023-08-04 | 四川蔚丰云联信息科技有限公司 | Matching method of emergency plan and emergency medical rescue cooperative command platform system |
CN116542241B (en) * | 2023-06-25 | 2023-09-08 | 四川蔚丰云联信息科技有限公司 | Matching method of emergency plan and emergency medical rescue cooperative command platform system |
CN117474510A (en) * | 2023-12-25 | 2024-01-30 | 彩讯科技股份有限公司 | Feature selection-based spam filtering method |
CN118135578A (en) * | 2024-05-10 | 2024-06-04 | 沈阳出版社有限公司 | Text learning and proofreading system based on text recognition |
CN118135578B (en) * | 2024-05-10 | 2024-09-24 | 沈阳出版社有限公司 | Text learning and proofreading system based on text recognition |
Also Published As
Publication number | Publication date |
---|---|
CN113901797B (en) | 2024-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113901797A (en) | Text error correction method, device, equipment and storage medium | |
CN112800201B (en) | Natural language processing method and device and electronic equipment | |
CN111814466A (en) | Information extraction method based on machine reading understanding and related equipment thereof | |
CN110232923B (en) | Voice control instruction generation method and device and electronic equipment | |
CN112926345B (en) | Multi-feature fusion neural machine translation error detection method based on data enhancement training | |
CN111177367B (en) | Case classification method, classification model training method and related products | |
WO2021143009A1 (en) | Text clustering method and apparatus | |
CN112580346B (en) | Event extraction method and device, computer equipment and storage medium | |
CN116306600B (en) | MacBert-based Chinese text error correction method | |
CN111985228A (en) | Text keyword extraction method and device, computer equipment and storage medium | |
CN111984792A (en) | Website classification method and device, computer equipment and storage medium | |
CN112818110B (en) | Text filtering method, equipment and computer storage medium | |
CN115438650B (en) | Contract text error correction method, system, equipment and medium fusing multi-source characteristics | |
CN112307820A (en) | Text recognition method, device, equipment and computer readable medium | |
Moeng et al. | Canonical and surface morphological segmentation for nguni languages | |
CN113656547A (en) | Text matching method, device, equipment and storage medium | |
CN113868422A (en) | Multi-label inspection work order problem traceability identification method and device | |
CN115757695A (en) | Log language model training method and system | |
WO2021239631A1 (en) | Neural machine translation method, neural machine translation system, learning method, learning system, and programm | |
CN117112850A (en) | Address standardization method, device, equipment and storage medium | |
CN112328747A (en) | Event context generation method and device, terminal equipment and storage medium | |
CN113204956B (en) | Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device | |
CN117235137B (en) | Professional information query method and device based on vector database | |
CN113177405A (en) | Method, device and equipment for correcting data errors based on BERT and storage medium | |
CN114610882A (en) | Abnormal equipment code detection method and system based on electric power short text classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||