CN113177405A - Method, device and equipment for correcting data errors based on BERT and storage medium - Google Patents


Info

Publication number
CN113177405A
Authority
CN
China
Prior art keywords
data
candidate
abnormal
abnormal data
corrected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110596473.2A
Other languages
Chinese (zh)
Inventor
马丹
黄少波
曾增烽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110596473.2A priority Critical patent/CN113177405A/en
Publication of CN113177405A publication Critical patent/CN113177405A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures


Abstract

Embodiments of this application relate to the field of data processing and disclose a method, apparatus, device, and storage medium for correcting data errors based on BERT. The method includes: acquiring source data to be corrected, identifying abnormal data in the source data, and determining a candidate data set corresponding to the abnormal data, where the candidate data set includes one or more candidate data items; calling a BERT-based masked language model to perform mask processing on the abnormal data and obtain a candidate data ranking result corresponding to the abnormal data; determining replacement data for the abnormal data according to the candidate data ranked first in that ranking result; and replacing the abnormal data with the replacement data to obtain the target data of the source data to be corrected. This can effectively improve data error-correction accuracy. The application also relates to blockchain technology; for example, in a data error-correction scenario, the above data can be written to a blockchain.

Description

Method, device and equipment for correcting data errors based on BERT and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for data error correction based on BERT.
Background
With the rapid development of computer technology, techniques such as question-answering robots and dialog systems are increasingly applied in people's daily work and life. The input to such systems is usually text or speech, and input errors commonly occur in both. For example, when a user speaks to a question-answering robot, the robot may misrecognize the speech, so that the text ultimately fed into it contains wrong characters; when the user types, a character may be mistyped or omitted. Therefore, in the field of error correction, improving error-correction accuracy has become a problem that urgently needs to be solved.
Disclosure of Invention
By implementing this method, after the candidate data set corresponding to the abnormal data is obtained, the candidate data in the set can be ranked, and the final replacement data is determined from the ranking result, thereby improving error-correction accuracy.
In a first aspect, an embodiment of the present application discloses a method for correcting data errors based on BERT, where the method includes:
acquiring source data to be corrected, identifying abnormal data in the source data to be corrected, and determining a candidate data set corresponding to the abnormal data, wherein the candidate data set comprises one or more candidate data;
calling a mask language model based on BERT to perform mask processing on the abnormal data to obtain a candidate data sorting result corresponding to the abnormal data;
determining replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data;
and replacing the abnormal data according to the replacement data to obtain target data of the source data to be corrected.
In a second aspect, an embodiment of the present application discloses a data error correction apparatus, where the apparatus includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring source data to be corrected, identifying abnormal data in the source data to be corrected and determining a candidate data set corresponding to the abnormal data, and the candidate data set comprises one or more candidate data;
the calling unit is used for calling a mask language model based on BERT to sort each candidate data in the candidate data set to obtain a candidate data sorting result corresponding to the abnormal data;
the determining unit is used for determining the replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data;
and the replacing unit is used for replacing the abnormal data according to the replacing data to obtain the target data of the source data to be corrected.
In a third aspect, an embodiment of the present application discloses a device including a processor and a memory, where the memory stores a computer program comprising program instructions, and the processor is configured to call the program instructions to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application disclose a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
In the embodiments of the application, the device may acquire source data to be corrected, identify abnormal data in it, and determine a candidate data set corresponding to the abnormal data, where the candidate data set includes one or more candidate data items. A BERT-based masked language model can then be called to perform mask processing on the abnormal data and obtain a candidate data ranking result corresponding to the abnormal data. The replacement data for the abnormal data is determined according to the candidate data ranked first in that result, and the abnormal data is replaced with the replacement data to obtain the target data of the source data to be corrected. By implementing this method, after the candidate data set is obtained, all candidate data in the set can be ranked and the final replacement data determined from the ranking result, thereby improving error-correction accuracy.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a BERT-based data error correction method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of another BERT-based data error correction method provided in the embodiments of the present application;
FIG. 3 is a schematic flow chart of another BERT-based data error correction method provided in the embodiments of the present application;
FIG. 4 is a schematic flowchart of another BERT-based data error correction method provided in the embodiments of the present application;
fig. 5 is a schematic structural diagram of a data error correction apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart of a BERT-based data error correction method according to an embodiment of the present application. The data error correction method described in this embodiment is applied to a device and may be executed by the device, where the device may be a server or a terminal. As shown in fig. 1, the data error correction method includes the following steps:
S101: Obtain source data to be corrected, identify abnormal data in the source data, and determine a candidate data set corresponding to the abnormal data, where the candidate data set includes one or more candidate data items.
In one implementation, the source data to be corrected may be obtained first. For example, the source data may be a sentence input by a user into a question-answering robot, entered either as text or as speech converted to text. After the source data is obtained, abnormal data in it can be identified and a candidate data set corresponding to the abnormal data obtained. The subsequent steps then rank each candidate in the candidate data set, so that the replacement data for the abnormal data can be determined from the ranking result. For example, if the source data to be corrected is a sentence, the abnormal data may be a wrong word in the sentence, and the candidate data set may be a candidate word set containing one or more candidate words. Fig. 2 shows the flow of another BERT-based data error correction method of the present application. In the flow of fig. 2, assuming the source data to be corrected is a sentence, a wrong word in the sentence may first be identified; a candidate word set corresponding to the wrong word is then determined, and the candidate words in it are ranked so that the replacement word can be chosen according to the candidate word ranking result.
In an implementation, this application does not limit how abnormal data is identified in the source data to be corrected and how its candidate data set is obtained. For example, a deep neural network model may be used for both, such as a Recurrent Neural Network (RNN) model, a Conditional Random Field (CRF) model, a seq2seq (sequence-to-sequence) model, or another model for identifying abnormal data. It can be understood that, when an abnormal data recognition model is used, the model must first be trained. The training data set used for this training may be collected for a target field (for example, the education field, the insurance field, or the academic research field), so that the trained model performs data error correction for that field; applying the model within the target field then also improves the accuracy of abnormal data recognition. The collected training data should be relatively accurate, i.e., contain no abnormal data. After collection, abnormal-data processing is applied to the set, and the processed set is the training data set actually used to train the abnormal data recognition model.
Specifically, the abnormal-data processing may replace correct data in the collected training data set with abnormal data at a random probability. For example, if the training data set contains a number of training sentences and the random probability is 15%, then 15% of the correct characters in the set may be replaced by wrong characters; of the replacements, for example, 80% may be homophones of the correct character and 20% random wrong characters. The training data set for the target field may also be collected in other ways, which this application does not limit.
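As a minimal sketch (not part of the patent text), the corruption step above might look like the following. The homophone table, character pool, and function names are illustrative assumptions; the 15% replacement rate and the 80%/20% homophone-versus-random split are the example values from the description:

```python
import random

# Illustrative homophone table; a real system would use a full
# Chinese homophone dictionary (assumption for this sketch).
HOMOPHONES = {"他": ["她", "它"], "在": ["再"], "的": ["得", "地"]}
CHARSET = list("天地人你我他在再的得地是不了")  # pool for random wrong characters

def corrupt(sentence, p_replace=0.15, p_homophone=0.8, rng=random):
    """Replace each character with probability p_replace; of the
    replacements, ~80% are homophones and ~20% random characters."""
    out = []
    for ch in sentence:
        if rng.random() < p_replace:
            if ch in HOMOPHONES and rng.random() < p_homophone:
                out.append(rng.choice(HOMOPHONES[ch]))  # homophone error
            else:
                out.append(rng.choice(CHARSET))  # random wrong character
        else:
            out.append(ch)  # keep the correct character
    return "".join(out)
```

Passing a seeded `random.Random` instance as `rng` makes the corruption reproducible across training runs.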
S102: and calling a mask language model based on BERT to perform mask processing on the abnormal data to obtain a candidate data sorting result corresponding to the abnormal data.
In an implementation manner, in the flow of data error correction shown in fig. 2, after the candidate data set is obtained, each candidate data in the candidate data set may be sorted to obtain a candidate data sorting result corresponding to the abnormal data. Specifically, the source data to be corrected may be input to a mask language model based on BERT, so as to obtain a candidate data sorting result corresponding to the abnormal data according to the mask language model. The BERT-based mask language model is obtained by performing fine-tuning processing on an initial mask language model according to a training data set of a target domain, where the target domain may refer to a specific domain, for example, the target domain may be an education domain, an insurance domain, an academic research domain, and the like. After the mask language model based on the BERT is obtained, the mask language model based on the BERT may be used to perform mask processing on the abnormal data position corresponding to the abnormal data in the source data to be corrected, so as to determine the occurrence probability of each candidate data in the candidate data set at the abnormal data position. Then, after determining the occurrence probability of each candidate data at the abnormal data position, the occurrence probabilities of each candidate data at the abnormal data position in the candidate data set may be sorted in a descending order, so as to obtain a candidate data sorting result corresponding to the abnormal data.
S103: and determining the replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data.
S104: and replacing the abnormal data according to the replacement data to obtain target data of the source data to be corrected.
In step S103 and step S104, the candidate data in the first bit in the candidate data sorting result of the abnormal data may be determined as the replacement data of the abnormal data, and the abnormal data in the source data to be corrected is replaced by the replacement data to obtain the replaced source data to be corrected, where the replaced source data to be corrected is the target data of the source data to be corrected.
In one implementation, when correcting the source data, correct data may sometimes be misidentified as abnormal data, i.e., correct data could be erroneously "corrected". To guard against this, selection rules for the replacement data can be set, and the abnormal data is replaced only when a rule is satisfied. Specifically, the candidate data ranked first in the candidate data ranking result corresponding to the abnormal data is determined as the candidate replacement data. It is then detected whether the candidate replacement data meets a preset selection rule; if so, the candidate replacement data is determined as the replacement data for the abnormal data, and the abnormal data in the source data is replaced with it, yielding the replaced source data, which is the target data.
In one implementation, detecting that the candidate replacement data satisfies the preset selection rule may work as follows: detect whether the occurrence probability of the candidate replacement data is greater than or equal to a preset probability threshold; if so, the rule is satisfied. The probability threshold may be set in advance.
In another implementation, the confidence of the abnormal data may first be determined from the confidence set (the method for determining this confidence is described in the embodiment below and is not repeated here). It is then detected whether the difference between the occurrence probability of the candidate replacement data and the confidence of the abnormal data is greater than or equal to a preset threshold; if so, the candidate replacement data satisfies the selection rule. The preset threshold may be set in advance; the larger it is, the more reliable the candidate replacement data is as replacement data, i.e., the more accurate the target data obtained by the replacement.
In yet another implementation, the confidences in the confidence set may be sorted in descending order to obtain a confidence ranking result. The position of the candidate replacement data's confidence (its occurrence probability) in that ranking is taken as the first position, and the position of the abnormal data's confidence as the second position (again, determining the confidence of the abnormal data is described in the embodiment below). It is then detected whether the difference between the two positions is greater than or equal to a preset position threshold; if so, the candidate replacement data satisfies the selection rule. The position threshold may be set in advance; the larger it is, the more accurate the target data obtained by replacing the abnormal data with the replacement data.
Optionally, the specific implementation of determining that the detected candidate replacement data meets the preset selection rule may further include other manners, which are not limited in this application.
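The three alternative selection rules above can be sketched as one function (not the patent's own implementation; the function name and the default thresholds are illustrative, and the three rules are combined with OR purely for compactness):

```python
def passes_selection_rule(cand_prob, anomaly_conf, all_confidences,
                          prob_threshold=0.8, diff_threshold=0.3,
                          position_threshold=2):
    """Return True if the top-ranked candidate may replace the
    abnormal data, per the three rules described above."""
    # Rule 1: occurrence probability meets a preset probability threshold.
    if cand_prob >= prob_threshold:
        return True
    # Rule 2: probability exceeds the abnormal data's confidence
    # by at least a preset margin.
    if cand_prob - anomaly_conf >= diff_threshold:
        return True
    # Rule 3: in the descending-sorted confidence ranking, the gap
    # between the candidate's position and the abnormal data's
    # position is at least a preset position threshold.
    ranked = sorted(all_confidences, reverse=True)
    first = ranked.index(cand_prob)      # candidate's position
    second = ranked.index(anomaly_conf)  # abnormal data's position
    return second - first >= position_threshold
```

In a real system one of the rules would typically be chosen per deployment rather than OR-ed, and the thresholds tuned on held-out data.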
In the embodiment of the application, the device may obtain source data to be corrected, identify abnormal data in the source data to be corrected, and determine a candidate data set corresponding to the abnormal data, where the candidate data set may include one or more candidate data. Then, a mask language model based on the BERT may be called to rank each candidate data in the candidate data set to obtain a candidate data ranking result corresponding to the abnormal data. And determining the replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data. Furthermore, the abnormal data can be replaced according to the replacement data, so that the target data of the source data to be corrected is obtained. By implementing the method, after the candidate data set corresponding to the abnormal data is obtained, the occurrence probability of each candidate data as the abnormal data position corresponding to the abnormal data is determined, and all candidate data in the candidate data set are sequenced according to the occurrence probability, so that the final replaceable data can be determined according to the sequencing result and the selection rule, and the error correction accuracy is improved.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating another BERT-based data error correction method according to an embodiment of the present application. The data error correction method described in this embodiment is applied to a device, and may be executed by the device, where the device may be a server or a terminal. As shown in fig. 3, the data error correction method includes the steps of:
S301: Obtain source data to be corrected, identify abnormal data in the source data, and determine a candidate data set corresponding to the abnormal data.
S302: inputting source data to be corrected into a mask language model based on BERT, wherein the mask language model based on the BERT is obtained by carrying out fine adjustment processing on an initial mask language model according to a training data set of a target field.
In one implementation, the initial masked language model may be a BERT model, where the BERT model in this application is a pre-trained language model. Considering that the samples used to pre-train such a model are typically corpora from fields such as news, using the pre-trained model directly as the masked language model in another business field may give poor results. For example, suppose the application scenario of this application is a dialog system directed at a specific field, such as the education field, the insurance field, or the academic research field. The pre-training samples are sentences from news and similar domains rather than that specific field, i.e., they lack many field-specific words. In this case, the pre-trained language model can be fine-tuned with training samples related to the application scenario to obtain BERT models for different scenarios. In this application, a training data set of the target field may be obtained and used to fine-tune the initial masked language model, yielding the BERT-based masked language model. The training data set may include one or more training data items, and the amount of abnormal data in it should be kept small to ensure the accuracy of the masked language model. The target field may be the education field, insurance field, academic research field, etc., described above; the training data set of the target field is data related to that field, i.e., data in which the vocabulary of the target field appears.
S303: and performing mask processing on abnormal data positions corresponding to the abnormal data in the source data to be corrected by using a mask language model based on BERT, and determining the occurrence probability of each candidate data in the candidate data set at the abnormal data positions.
In one implementation, mask processing may be performed on the abnormal data position in the source data using the BERT-based masked language model, yielding a confidence set covering all reference data in a reference dictionary: the confidence set contains one confidence for each reference data item in the dictionary. After the confidence set is obtained, the confidence of each candidate in the candidate data set can be read from it. The reference dictionary contains a large amount of reference data in a fixed order, i.e., the position of each reference data item in the dictionary is fixed and invariant, and the order of the confidences produced by the masked language model matches that fixed order. Hence, after the confidence set is obtained, the confidence corresponding to each reference data item can be found by matching positions between the confidence set and the dictionary. Concretely, to determine the occurrence probability of any candidate at the abnormal data position: match the candidate against the reference data in the dictionary to find the specified position of the matching reference data, determine the target position in the confidence set corresponding to that specified position, and take the confidence at the target position as the candidate's confidence.
By this method, the confidence of each candidate can be determined from the confidence set, and each candidate's confidence is then taken as its occurrence probability at the abnormal data position.
For example, the source data to be corrected is taken as a statement for illustration, then the abnormal data may be understood as a wrong word, the candidate data corresponding to the wrong word may be understood as a candidate word, and the reference data in the reference dictionary may be understood as a reference word. Assume that the reference dictionary A is [ A ]1、A2、…、Ak、…、An]The confidence set p is [ p ]1、p2、…、pk、…、pn]Wherein A isnRepresenting respective reference words in a reference dictionary A, the subscript n representing the reference word AnAt the position of reference dictionary A, i.e. AnAt the nth position of the reference dictionary A, pnRepresenting individual confidences in a set of confidences p, with the subscript n representing the confidence pnAt the position of the confidence set p, i.e. pnAt the nth position of the confidence set p. It is understood that each reference word in the reference dictionary has a one-to-one correspondence with each confidence in the confidence set, that is, the confidence corresponding to a reference word at a certain position in the reference dictionary is the confidence at the position in the confidence set. For example, for reference word A2 in the reference dictionary, it can be known that the reference word is located at the second position in the reference dictionaryThen, when determining the confidence of the reference word from the confidence set, only the confidence at the second position in the confidence set needs to be determined, and the confidence at the second position is the confidence corresponding to the reference word, that is, the confidence of the reference word a2 is p 2. 
Then, for any candidate word in the candidate data set, assume that the candidate word is Ak. The candidate word Ak may first be matched with each reference word in the reference dictionary A, and the specified position of Ak in the reference dictionary A is determined to be the k-th position. After the specified position is determined, the target position corresponding to the specified position is determined in the confidence set p, and the target position is the k-th position of the confidence set p. The confidence pk at the k-th position of the confidence set p can then be determined as the confidence of the candidate word Ak, so the occurrence probability of the candidate word Ak at the wrong-word position is pk.
S304: and performing descending order on the occurrence probability of each candidate data in the candidate data set at the abnormal data position to obtain a candidate data ordering result corresponding to the abnormal data.
In one implementation, the occurrence probabilities of the candidate data in the candidate data set at the abnormal data position may be sorted in descending order to obtain an occurrence probability sorting result. After the occurrence probability sorting result is determined, the candidate data sorting result corresponding to the abnormal data can be determined according to the correspondence between the candidate data and the occurrence probabilities.
For example, suppose there are 7 candidate data in the candidate data set, namely candidate data 1, candidate data 2, candidate data 3, candidate data 4, candidate data 5, candidate data 6 and candidate data 7, whose occurrence probabilities at the abnormal data position are 0.25, 0.85, 0.5, 0.75, 0.80, 0.4 and 0.3 respectively. Sorting the occurrence probabilities in descending order gives the occurrence probability sorting result 0.85, 0.80, 0.75, 0.5, 0.4, 0.3 and 0.25; each occurrence probability in this result is then matched with its corresponding candidate data, so the candidate data sorting result corresponding to the abnormal data is candidate data 2 (0.85), candidate data 5 (0.80), candidate data 4 (0.75), candidate data 3 (0.5), candidate data 6 (0.4), candidate data 7 (0.3), and candidate data 1 (0.25).
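The descending sort above can be sketched in a few lines of Python; the function name is illustrative, and the probabilities are the ones from the example.

```python
def rank_candidates(occurrence_probabilities):
    """Sort (candidate, probability) pairs by occurrence probability,
    descending, so the best replacement candidate comes first."""
    return sorted(occurrence_probabilities.items(),
                  key=lambda item: item[1], reverse=True)

# The seven candidates from the example above:
probabilities = {"candidate 1": 0.25, "candidate 2": 0.85, "candidate 3": 0.5,
                 "candidate 4": 0.75, "candidate 5": 0.80, "candidate 6": 0.4,
                 "candidate 7": 0.3}
ranking = rank_candidates(probabilities)
# ranking[0] is ("candidate 2", 0.85), matching the sorting result in the text
```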
S305: and determining the replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data.
S306: and replacing the abnormal data according to the replacement data to obtain target data of the source data to be corrected.
In one implementation, it is considered that if a plurality of abnormal data exist in the statement to be recognized, then when the mask language model is used to mask the abnormal data positions corresponding to the abnormal data in the source data to be corrected, a plurality of masks will appear. If the plurality of masks are processed simultaneously, that is, the replacement data of each of the plurality of abnormal data are determined simultaneously by the mask language model, the processing efficiency and the accuracy of the mask language model may be reduced. Therefore, when the number of abnormal data is at least two, error correction may be performed on the abnormal data in the statement to be recognized in an iterative manner, that is, one mask is processed per iteration. Fig. 4 shows the flow of another BERT-based data error correction method provided by the present application, which mainly shows how error correction is implemented on the abnormal data in the source data to be corrected in an iterative manner. In the flow shown in Fig. 4, a training data set in a target field may be collected first, and the training data set is then used to perform fine-tuning on an initial mask language model, so as to obtain the BERT-based mask language model.
Further, in an iterative manner, the mask language model is used to correct the plurality of abnormal data in the statement to be recognized. Specifically, the error correction order corresponding to each of the at least two abnormal data may be determined according to the positions of the abnormal data in the source data to be corrected, where the order of the abnormal data positions in the source data to be corrected may serve as the error correction order. Then, after the error correction order of each abnormal data is determined, the statement to be recognized may be input to the mask language model to obtain the replacement data corresponding to the first abnormal data, and the first abnormal data is replaced according to this replacement data to obtain the first error correction source data of the source data to be corrected, where the first abnormal data is the abnormal data whose error correction order is the first bit. After the first error correction source data is obtained, the first error correction source data may be input to the mask language model to obtain the replacement data corresponding to the second abnormal data, and the second abnormal data is replaced according to this replacement data to obtain the second error correction source data of the source data to be corrected, where the second abnormal data is the abnormal data whose error correction order is the second bit. When the second error correction source data does not include any of the at least two abnormal data, the second error correction source data may be determined as the target data.
It can be seen from the above steps that one abnormal data in the source data to be corrected is replaced in each iteration, so the number of masks in the source data to be corrected decreases with each iteration. The iteration loop stops when no mask, that is, no abnormal data, remains in the source data to be corrected, at which point every abnormal data in the source data to be corrected has been corrected.
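The iterative loop can be sketched as follows. This is a minimal sketch under stated assumptions: the mask language model is abstracted as a callable (`mask_language_model`), the toy stand-in model and the example string are invented for illustration, and real data would be tokenized statements rather than single characters.

```python
def iterative_correction(source_data, abnormal_positions, mask_language_model):
    """Correct one abnormal data position per iteration, in position order.

    `mask_language_model` is an assumed callable that takes the partially
    corrected data (with exactly one position masked) and the masked
    position, and returns the replacement data for that position.
    """
    data = list(source_data)
    # The error correction order follows the positions of the abnormal
    # data in the source data to be corrected.
    for position in sorted(abnormal_positions):
        data[position] = "[MASK]"                      # mask only this position
        data[position] = mask_language_model(data, position)  # replace, then iterate
    return "".join(data)

# A toy stand-in for the real model, for illustration only:
def toy_model(data, position):
    return {1: "b", 3: "d"}[position]

corrected = iterative_correction("a?c?e", [3, 1], toy_model)
# corrected == "abcde": the first iteration fixes position 1,
# the second iteration fixes position 3.
```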
For specific implementation of steps S301 and S305 to S306, reference may be made to the detailed description of steps S101 and S103 to S104 in the foregoing embodiment, and details are not repeated here.
In the embodiment of the application, the device may obtain source data to be corrected, identify abnormal data in the source data to be corrected, and determine a candidate data set corresponding to the abnormal data, where the candidate data set may include one or more candidate data. The source data to be corrected is then input to a BERT-based mask language model, and the mask language model performs mask processing on the abnormal data position corresponding to the abnormal data in the source data to be corrected, thereby determining the occurrence probability of each candidate data in the candidate data set at the abnormal data position. The occurrence probabilities of the candidate data at the abnormal data position are then sorted in descending order to obtain the candidate data sorting result corresponding to the abnormal data, the replacement data corresponding to the abnormal data is determined according to the candidate data ranked first in the sorting result, and the abnormal data is replaced according to the replacement data to obtain the target data of the source data to be corrected. By implementing this method, after the candidate data set corresponding to the abnormal data is obtained, the occurrence probability of each candidate data at the abnormal data position corresponding to the abnormal data is determined, all the candidate data in the candidate data set are sorted according to their occurrence probabilities, and the final replacement word can then be determined according to the sorting result and the selection rule, which improves the error correction accuracy.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a data error correction apparatus according to an embodiment of the present application, where the data error correction apparatus includes:
an obtaining unit 501, configured to obtain source data to be corrected, identify abnormal data in the source data to be corrected, and determine a candidate data set corresponding to the abnormal data, where the candidate data set includes one or more candidate data;
a calling unit 502, configured to call a mask language model based on BERT to perform mask processing on the abnormal data, so as to obtain a candidate data sorting result corresponding to the abnormal data;
a determining unit 503, configured to determine, according to candidate data ranked first in a candidate data ranking result corresponding to the abnormal data, replacement data corresponding to the abnormal data;
a replacing unit 504, configured to replace the abnormal data according to the replacement data, so as to obtain target data of the source data to be corrected.
In an implementation manner, the invoking unit 502 is specifically configured to:
inputting the source data to be corrected to a BERT-based mask language model, wherein the BERT-based mask language model is obtained by performing fine-tuning on an initial mask language model according to a training data set of a target field;
masking abnormal data positions corresponding to abnormal data in the source data to be corrected by using the BERT-based mask language model, and determining the occurrence probability of each candidate data in the candidate data set at the abnormal data positions;
and performing descending order on the occurrence probability of each candidate data in the candidate data set at the abnormal data position to obtain a candidate data ordering result corresponding to the abnormal data.
In an implementation manner, the invoking unit 502 is specifically configured to:
masking the abnormal data position in the source data to be corrected by using the BERT-based mask language model to obtain a confidence set corresponding to all reference data in a reference dictionary, wherein the confidence set comprises a confidence corresponding to each reference data in the reference dictionary;
determining a confidence level of each candidate data in the candidate data set from the confidence level set;
and determining the confidence of each candidate data as the occurrence probability of each candidate data at the abnormal data position.
In an implementation manner, the invoking unit 502 is specifically configured to:
for any candidate data in the candidate data set, matching each reference data in the reference dictionary with the candidate data;
determining a designated position of the matched reference data in the reference dictionary;
determining a target position of the designated position in the confidence set, and determining the confidence of the target position in the confidence set as the confidence of the candidate data.
In an implementation manner, the determining unit 503 is specifically configured to:
determining candidate data ranked at the first position in a candidate data sorting result corresponding to the abnormal data as candidate replacement data of the abnormal data;
detecting whether the candidate replacement data meets a preset selection rule;
and when the candidate replacement data is detected to meet the preset selection rule, determining the candidate replacement data as the replacement data corresponding to the abnormal data.
In one implementation, the determining unit 503 is further configured to:
determining the confidence of the abnormal data according to the confidence set;
detecting whether a difference between the occurrence probability of the candidate replacement data and the confidence of the abnormal data is greater than or equal to a preset threshold;
and when the difference between the occurrence probability of the candidate replacement data and the confidence coefficient of the abnormal data is greater than or equal to the preset threshold value, determining that the candidate replacement data meets the preset selection rule.
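The preset selection rule described above can be sketched as a simple threshold check. The function name and the concrete probability and threshold values below are illustrative assumptions, not values from the original.

```python
def satisfies_selection_rule(candidate_probability, abnormal_confidence,
                             preset_threshold):
    """Accept the candidate replacement data only when its occurrence
    probability exceeds the confidence of the abnormal data by at least
    the preset threshold."""
    return candidate_probability - abnormal_confidence >= preset_threshold

# With a hypothetical preset threshold of 0.4:
satisfies_selection_rule(0.85, 0.30, 0.4)  # True: the difference is 0.55
satisfies_selection_rule(0.50, 0.30, 0.4)  # False: the difference is only 0.2
```

The threshold guards against replacing data the model is not substantially more confident about, so a first-ranked candidate that only narrowly beats the original abnormal data is rejected.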
In one implementation, the number of the abnormal data is at least two; the invoking unit 502 is further configured to:
determining an error correction sequence corresponding to each abnormal data in at least two abnormal data according to the abnormal data positions of the at least two abnormal data in the source data to be corrected;
replacing the first abnormal data according to replacement data corresponding to the first abnormal data to obtain first error correction source data of the source data to be error corrected, wherein the first abnormal data is abnormal data of which the error correction sequence is a first bit;
inputting the first error correction source data into the BERT-based mask language model to obtain replacement data corresponding to the second abnormal data, and replacing the second abnormal data according to the replacement data corresponding to the second abnormal data to obtain second error correction source data of the source data to be error corrected, wherein the second abnormal data is abnormal data with a second bit of error correction sequence;
and when the second error correction source data does not comprise any abnormal data in the at least two abnormal data, determining the second error correction source data as target data.
It can be understood that the functions of the functional units of the data error correction apparatus described in the embodiment of the present application may be specifically implemented according to the method in the embodiment of the method described in fig. 1 or fig. 3, and the specific implementation process may refer to the description related to the embodiment of the method in fig. 1 or fig. 3, which is not described herein again.
In the embodiment of the present application, an obtaining unit 501 obtains source data to be corrected, identifies abnormal data in the source data to be corrected, and determines a candidate data set corresponding to the abnormal data, where the candidate data set includes one or more candidate data; the calling unit 502 calls a mask language model based on the BERT to perform mask processing on the abnormal data to obtain a candidate data sorting result corresponding to the abnormal data; the determining unit 503 determines the replacement data corresponding to the abnormal data according to the candidate data ranked first in the sorting result of the candidate data corresponding to the abnormal data; the replacing unit 504 replaces the abnormal data according to the replacement data to obtain the target data of the source data to be corrected. By implementing the method, the data error correction accuracy can be improved.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present disclosure. The apparatus comprises: a processor 601, a memory 602, and a network interface 603. The processor 601, the memory 602, and the network interface 603 may exchange data therebetween.
The processor 601 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 602 may include both read-only memory and random access memory, and provides program instructions and data to the processor 601. A portion of the memory 602 may also include random access memory. Wherein, the processor 601, when calling the program instruction, is configured to perform:
acquiring source data to be corrected, identifying abnormal data in the source data to be corrected, and determining a candidate data set corresponding to the abnormal data, wherein the candidate data set comprises one or more candidate data;
calling a mask language model based on BERT to perform mask processing on the abnormal data to obtain a candidate data sorting result corresponding to the abnormal data;
determining replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data;
and replacing the abnormal data according to the replacement data to obtain target data of the source data to be corrected.
In one implementation, the processor 601 is specifically configured to:
inputting the source data to be corrected to a mask language model based on BERT, wherein the mask language model based on the BERT is obtained by carrying out fine tuning processing on an initial mask language model according to a training data set of a target field;
masking abnormal data positions corresponding to abnormal data in the source data to be corrected by using the BERT-based mask language model, and determining the occurrence probability of each candidate data in the candidate data set at the abnormal data positions;
and performing descending order on the occurrence probability of each candidate data in the candidate data set at the abnormal data position to obtain a candidate data ordering result corresponding to the abnormal data.
In one implementation, the processor 601 is specifically configured to:
masking the abnormal data position in the source data to be corrected by using the BERT-based mask language model to obtain a confidence set corresponding to all reference data in a reference dictionary, wherein the confidence set comprises a confidence corresponding to each reference data in the reference dictionary;
determining a confidence level of each candidate data in the candidate data set from the confidence level set;
and determining the confidence of each candidate data as the occurrence probability of each candidate data at the abnormal data position.
In one implementation, the processor 601 is specifically configured to:
for any candidate data in the candidate data set, matching each reference data in the reference dictionary with the candidate data;
determining a designated position of the matched reference data in the reference dictionary;
determining a target position of the designated position in the confidence set, and determining the confidence of the target position in the confidence set as the confidence of the candidate data.
In one implementation, the processor 601 is specifically configured to:
determining candidate data ranked at the first position in a candidate data sorting result corresponding to the abnormal data as candidate replacement data of the abnormal data;
detecting whether the candidate replacement data meets a preset selection rule;
and when the candidate replacement data is detected to meet the preset selection rule, determining the candidate replacement data as the replacement data corresponding to the abnormal data.
In one implementation, the processor 601 is further configured to:
determining the confidence of the abnormal data according to the confidence set;
detecting whether a difference between the occurrence probability of the candidate replacement data and the confidence of the abnormal data is greater than or equal to a preset threshold;
and when the difference between the occurrence probability of the candidate replacement data and the confidence coefficient of the abnormal data is greater than or equal to the preset threshold value, determining that the candidate replacement data meets the preset selection rule.
In one implementation, the number of the abnormal data is at least two; the processor 601 is further configured to:
determining an error correction sequence corresponding to each abnormal data in at least two abnormal data according to the abnormal data positions of the at least two abnormal data in the source data to be corrected;
replacing the first abnormal data according to replacement data corresponding to the first abnormal data to obtain first error correction source data of the source data to be error corrected, wherein the first abnormal data is abnormal data of which the error correction sequence is a first bit;
inputting the first error correction source data into the BERT-based mask language model to obtain replacement data corresponding to the second abnormal data, and replacing the second abnormal data according to the replacement data corresponding to the second abnormal data to obtain second error correction source data of the source data to be error corrected, wherein the second abnormal data is abnormal data with a second bit error correction sequence;
and when the second error correction source data does not comprise any abnormal data in the at least two abnormal data, determining the second error correction source data as target data.
In a specific implementation, the processor 601 and the memory 602 described in this embodiment of the present application may execute the implementation manner described in the data error correction method provided in fig. 1 or fig. 3 in this embodiment of the present application, and may also execute the implementation manner of the data error correction apparatus described in fig. 5 in this embodiment of the present application, which is not described herein again.
In this embodiment of the application, the processor 601 may obtain source data to be corrected, identify abnormal data in the source data to be corrected, and determine a candidate data set corresponding to the abnormal data, where the candidate data set includes one or more candidate data; calling a mask language model based on BERT to perform mask processing on the abnormal data to obtain a candidate data sorting result corresponding to the abnormal data; determining replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data; and replacing the abnormal data according to the replacement data to obtain target data of the source data to be corrected. By implementing the method, the data error correction accuracy can be improved.
The embodiment of the present application also provides a computer-readable storage medium, in which program instructions are stored, and when the program is executed, some or all of the steps of the data error correction method in the embodiment corresponding to fig. 1 or fig. 3 may be included.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
It should be emphasized that the data may also be stored in a node of a blockchain in order to further ensure the privacy and security of the data. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The BERT-based data error correction method, apparatus, device and storage medium provided by the embodiments of the present application are described in detail above, and a specific example is applied in the present application to explain the principles and embodiments of the present application, and the description of the above embodiments is only used to help understand the method and core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A BERT-based data error correction method, comprising:
acquiring source data to be corrected, identifying abnormal data in the source data to be corrected, and determining a candidate data set corresponding to the abnormal data, wherein the candidate data set comprises one or more candidate data;
calling a mask language model based on BERT to perform mask processing on the abnormal data to obtain a candidate data sorting result corresponding to the abnormal data;
determining replacement data corresponding to the abnormal data according to the candidate data ranked at the first position in the candidate data ranking result corresponding to the abnormal data;
and replacing the abnormal data according to the replacement data to obtain target data of the source data to be corrected.
2. The method of claim 1, wherein the invoking a BERT-based mask language model to mask the abnormal data to obtain a candidate data ordering result corresponding to the abnormal data comprises:
inputting the source data to be corrected to a mask language model based on BERT, wherein the mask language model based on the BERT is obtained by carrying out fine tuning processing on an initial mask language model according to a training data set of a target field;
masking abnormal data positions corresponding to abnormal data in the source data to be corrected by using the BERT-based mask language model, and determining the occurrence probability of each candidate data in the candidate data set at the abnormal data positions;
and performing descending order on the occurrence probability of each candidate data in the candidate data set at the abnormal data position to obtain a candidate data ordering result corresponding to the abnormal data.
3. The method according to claim 2, wherein the masking abnormal data positions in the source data to be corrected by using the BERT-based masking language model to determine the occurrence probability of each candidate data in the candidate data set at the abnormal data positions comprises:
masking the abnormal data position in the source data to be corrected by using the BERT-based mask language model to obtain a confidence set corresponding to all reference data in a reference dictionary, wherein the confidence set comprises a confidence corresponding to each reference data in the reference dictionary;
determining a confidence level of each candidate data in the candidate data set from the confidence level set;
and determining the confidence of each candidate data as the occurrence probability of each candidate data at the abnormal data position.
4. The method of claim 3, wherein determining the confidence level for each candidate data in the candidate data set from the confidence level set comprises:
for any candidate data in the candidate data set, matching each reference data in the reference dictionary with the candidate data;
determining a designated position of the matched reference data in the reference dictionary;
determining a target position of the designated position in the confidence set, and determining the confidence of the target position in the confidence set as the confidence of the candidate data.
5. The method according to claim 3 or 4, wherein the determining the replacement data corresponding to the abnormal data according to the candidate data ranked first in the candidate data ranking result corresponding to the abnormal data comprises:
determining candidate data ranked at the first position in a candidate data sorting result corresponding to the abnormal data as candidate replacement data of the abnormal data;
detecting whether the candidate replacement data meets a preset selection rule;
and when the candidate replacement data is detected to meet the preset selection rule, determining the candidate replacement data as the replacement data corresponding to the abnormal data.
6. The method according to claim 5, wherein before detecting that the candidate replacement data satisfies the preset selection rule, the method further comprises:
determining the confidence of the abnormal data according to the confidence set;
detecting whether a difference between the occurrence probability of the candidate replacement data and the confidence of the abnormal data is greater than or equal to a preset threshold;
and when the difference between the occurrence probability of the candidate replacement data and the confidence coefficient of the abnormal data is greater than or equal to the preset threshold value, determining that the candidate replacement data meets the preset selection rule.
7. The method of claim 1, wherein the number of anomalous data is at least two; the method further comprises the following steps:
determining an error correction sequence corresponding to each abnormal data in at least two abnormal data according to the abnormal data positions of the at least two abnormal data in the source data to be corrected;
the replacing the abnormal data according to the replacement data to obtain the target data of the source data to be corrected, including:
replacing the first abnormal data according to replacement data corresponding to the first abnormal data to obtain first error correction source data of the source data to be error corrected, wherein the first abnormal data is abnormal data of which the error correction sequence is a first bit;
inputting the first error correction source data into the BERT-based mask language model to obtain replacement data corresponding to the second abnormal data, and replacing the second abnormal data according to the replacement data corresponding to the second abnormal data to obtain second error correction source data of the source data to be error corrected, wherein the second abnormal data is abnormal data with a second bit of error correction sequence;
and when the second error correction source data does not comprise any abnormal data in the at least two abnormal data, determining the second error correction source data as target data.
8. A data error correction apparatus, comprising:
an acquisition unit configured to acquire source data to be corrected, identify abnormal data in the source data to be corrected, and determine a candidate data set corresponding to the abnormal data, wherein the candidate data set comprises one or more candidate data;
a calling unit configured to call a BERT-based mask language model to perform mask processing on the abnormal data to obtain a candidate data ranking result corresponding to the abnormal data;
a determining unit configured to determine replacement data corresponding to the abnormal data according to the candidate data ranked first in the candidate data ranking result corresponding to the abnormal data;
and a replacing unit configured to replace the abnormal data according to the replacement data to obtain target data of the source data to be corrected.
9. An apparatus comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202110596473.2A 2021-05-28 2021-05-28 Method, device and equipment for correcting data errors based on BERT and storage medium Pending CN113177405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110596473.2A CN113177405A (en) 2021-05-28 2021-05-28 Method, device and equipment for correcting data errors based on BERT and storage medium

Publications (1)

Publication Number Publication Date
CN113177405A true CN113177405A (en) 2021-07-27

Family

ID=76927784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110596473.2A Pending CN113177405A (en) 2021-05-28 2021-05-28 Method, device and equipment for correcting data errors based on BERT and storage medium

Country Status (1)

Country Link
CN (1) CN113177405A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016310A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium
CN112396049A (en) * 2020-11-19 2021-02-23 平安普惠企业管理有限公司 Text error correction method and device, computer equipment and storage medium
CN112668313A (en) * 2020-12-25 2021-04-16 平安科技(深圳)有限公司 Intelligent sentence error correction method and device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023306A (en) * 2022-01-04 2022-02-08 阿里云计算有限公司 Processing method for pre-training language model and spoken language understanding system
CN115169330A (en) * 2022-07-13 2022-10-11 平安科技(深圳)有限公司 Method, device, equipment and storage medium for correcting and verifying Chinese text
CN115169330B (en) * 2022-07-13 2023-05-02 平安科技(深圳)有限公司 Chinese text error correction and verification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111046152B (en) Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium
US20230196014A1 (en) Stylistic Text Rewriting for a Target Author
CN110232923B (en) Voice control instruction generation method and device and electronic equipment
WO2018033030A1 (en) Natural language library generation method and device
WO2022105083A1 (en) Text error correction method and apparatus, device, and medium
CN110162681B (en) Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN110929524A (en) Data screening method, device, equipment and computer readable storage medium
CN113177405A (en) Method, device and equipment for correcting data errors based on BERT and storage medium
CN111651978A (en) Entity-based lexical examination method and device, computer equipment and storage medium
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN113901797A (en) Text error correction method, device, equipment and storage medium
CN113343671B (en) Statement error correction method, device and equipment after voice recognition and storage medium
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN114818729A (en) Method, device and medium for training semantic recognition model and searching sentence
CN111079433B (en) Event extraction method and device and electronic equipment
CN116186658A (en) User identity verification data processing system
CN113343677B (en) Intention identification method and device, electronic equipment and storage medium
CN113934848A (en) Data classification method and device and electronic equipment
CN112527967A (en) Text matching method, device, terminal and storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN112445914A (en) Text classification method, device, computer equipment and medium
Israilovich et al. Mechanisms for optimization of detection and correction of errors in computer text processing systems
CN113486169B (en) Synonymous statement generation method, device, equipment and storage medium based on BERT model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination