CN113887201A - Text fixed-length error correction method, device, equipment and storage medium


Info

Publication number: CN113887201A
Application number: CN202111149204.8A
Authority: CN (China)
Prior art keywords: error correction, vector, text, training text, model
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 周柱君
Current assignee: Ping An Bank Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Ping An Bank Co Ltd
Application filed by Ping An Bank Co Ltd
Priority to: CN202111149204.8A

Classifications

    • G06F40/232: Handling natural language data; natural language analysis; orthographic correction, e.g. spell checking or vowelisation
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/2415: Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/048: Neural networks; activation functions
    • G06N3/08: Neural networks; learning methods

Abstract

The invention relates to artificial intelligence technology and discloses a text fixed-length error correction method comprising the following steps: vectorizing a data-enhanced text set to obtain training text vectors; inputting the training text vectors into a deep error correction model for error detection to obtain an output result indicating whether each character is spelled correctly; performing soft mask connection on the training text vectors according to the output result to obtain embedded data; performing error correction on the embedded data to obtain an error correction result; adjusting the model parameters of the deep error correction model according to a cross entropy loss value calculated from the error correction result and outputting a standard deep error correction model; and inputting a text to be corrected into the standard deep error correction model to obtain the corrected text based on a preset multi-round error correction mechanism. In addition, the invention relates to blockchain technology, and the error correction result can be stored in a node of a blockchain. The invention also provides a text fixed-length error correction device, electronic equipment and a storage medium. The invention can solve the problem of low accuracy of fixed-length text error correction.

Description

Text fixed-length error correction method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a text fixed-length error correction method, a text fixed-length error correction device, electronic equipment and a computer-readable storage medium.
Background
In different business activities of a bank, many business scenarios involve text recording: for example, a human agent may record the content of a customer's problem consultation, a customer service process summary, or a customer complaint summary in the form of text entry, and the recorded text may contain wrongly written characters introduced through manual negligence. In such cases, a text error correction technique is needed to find the wrongly written characters in the entered text and correct them according to the context.
Existing text error correction methods usually use a pre-trained language model to perform fixed-length error correction on a text, masking and then predicting each single character in turn. Because this mode of correction must proceed character by character, it is difficult to correct all errors in parallel at one time, which can introduce correction errors and results in low text error correction accuracy.
Disclosure of Invention
The invention provides a text fixed-length error correction method and device and a computer-readable storage medium, and mainly aims to solve the problem of low accuracy of fixed-length text error correction.
In order to achieve the above object, the present invention provides a text fixed-length error correction method, comprising:
acquiring an original text set, and performing data enhancement processing on the original text set to obtain a training text set;
vectorizing the training text set to obtain a training text vector;
inputting the training text vector into an error detection network of a preset deep error correction model for error detection to obtain an output result indicating whether spelling is correct;
performing soft mask connection processing on the training text vector according to the output result to obtain embedded data;
carrying out error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result;
calculating a cross entropy loss value of the deep error correction model according to the error correction result, adjusting model parameters of the deep error correction model according to the cross entropy loss value, and outputting a standard deep error correction model;
and inputting a pre-acquired text to be corrected into the standard deep error correction model, and obtaining the corrected text based on a preset multi-round error correction mechanism.
Optionally, the performing data enhancement processing on the original text set to obtain a training text set includes:
acquiring a preset corpus and word frequencies corresponding to the corpora in the corpus;
randomly selecting a plurality of replaced words from the original text set;
calculating confusion probability values between confusion words in a preset confusion dictionary and the replaced words by using a preset confusion probability calculation formula according to the word frequencies;
and sorting the confusion words according to the confusion probability values, and selecting the confusion word with the highest confusion probability value to replace the replaced word to obtain a training text set.
Optionally, the vectorizing the training text set to obtain a training text vector includes:
acquiring a word embedding vector, a position embedding vector and a segment embedding vector of each character of the training text in the training text set;
and summing the word embedding vector, the position embedding vector and the segment embedding vector to obtain a training text vector.
Optionally, the inputting the training text vector into an error detection network of a preset deep error correction model for error detection to obtain an output result indicating whether the spelling is correct includes:
respectively carrying out forward coding and reverse coding on the training text vector to obtain a forward coding vector and a reverse coding vector;
transversely combining the forward coding vector and the reverse coding vector to obtain a coding hidden vector;
and inputting the encoded hidden vector into a fully connected layer for binary classification to obtain an output result indicating whether the spelling is correct.
Optionally, the inputting the encoded hidden vector into a fully connected layer for binary classification to obtain an output result indicating whether the spelling is correct includes:
inputting the encoded hidden vector into the fully connected layer to obtain a classification probability;
when the classification probability is greater than or equal to a preset classification threshold, judging the output result to be a misspelling;
and when the classification probability is smaller than the preset classification threshold, judging the output result to be a correct spelling.
Optionally, the performing soft mask connection processing on the training text vector to obtain embedded data includes:
calculating by using a preset mask coefficient formula and the coding hidden vector to obtain a mask coefficient;
and calculating the misspelling vector in the training text vector according to the mask coefficient and a preset soft mask connection formula to obtain embedded data.
Optionally, the performing, by using the error correction network of the deep error correction model, error correction processing on the embedded data to obtain an error correction result includes:
encoding the embedded data by utilizing a plurality of encoding layers in the error correction network, and taking the hidden state of the last encoding layer in the plurality of encoding layers;
residual error connection is carried out on the hidden state and the training text vector to obtain a connection value;
and inputting the connection value into a full connection layer of the error correction network to obtain an error correction result.
In order to solve the above problem, the present invention further provides a text fixed length correction apparatus, including:
the data processing module is used for acquiring an original text set, performing data enhancement processing on the original text set to obtain a training text set, and performing vectorization processing on the training text set to obtain a training text vector;
the error detection module is used for inputting the training text vector into an error detection network of a preset deep error correction model for error detection to obtain an output result indicating whether spelling is correct;
the error correction processing module is used for performing soft mask connection processing on the training text vector according to the output result to obtain embedded data, and performing error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result;
the model training module is used for calculating a cross entropy loss value of the deep error correction model according to the error correction result, adjusting model parameters of the deep error correction model according to the cross entropy loss value and outputting a standard deep error correction model;
and the multi-round error correction module is used for inputting a pre-acquired text to be corrected into the standard deep error correction model and obtaining the corrected text based on a preset multi-round error correction mechanism.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the text fixed-length error correction method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium in which at least one computer program is stored, the at least one computer program being executed by a processor in an electronic device to implement the text fixed-length error correction method described above.
In the embodiment of the invention, a training text set is obtained by performing data enhancement processing on an original text set; the data enhancement processing can rapidly construct, at low cost, massive training samples usable for training a model. The vectorized training text set is input into an error detection network of a preset deep error correction model to obtain an output result indicating whether spelling is correct, and the training text vector undergoes soft mask connection processing according to this output result to obtain embedded data; the soft mask connection processing integrates the detection output into the input of the error correction network. The error correction network of the deep error correction model then performs error correction processing on the embedded data; this network has a strong fixed-length error correction capability and can correct service texts containing wrong characters in parallel at one time, which greatly improves the efficiency and timeliness of error correction. Finally, a pre-acquired text to be corrected is input into the standard deep error correction model, and the corrected text is obtained based on a preset multi-round error correction mechanism, which can further improve the error correction effect of the deep error correction model. Therefore, the text fixed-length error correction method, device, electronic equipment and computer-readable storage medium provided by the invention can solve the problem of low accuracy of fixed-length text error correction.
Drawings
Fig. 1 is a schematic flowchart of a text fixed-length error correction method according to an embodiment of the present invention;
fig. 2 is a functional block diagram of a text fixed-length error correction apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device for implementing the method for correcting the fixed length of the text according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a text fixed-length error correction method. The execution subject of the text fixed-length error correction method includes, but is not limited to, at least one of electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the text fixed-length error correction method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Fig. 1 is a schematic flow chart of a text fixed-length error correction method according to an embodiment of the present invention. In this embodiment, the text fixed-length error correction method includes:
and S1, acquiring an original text set, and performing data enhancement processing on the original text set to obtain a training text set.
In the embodiment of the present invention, the original text set may be text records related to different business activities of the bank, for example, texts of customer problem consultations recorded by a human agent, customer service process summaries, and customer complaint summaries.
Specifically, the performing data enhancement processing on the original text set to obtain a training text set includes:
acquiring a preset confusion dictionary and a preset enhancement number;
and screening out confusion words with the number consistent with that of the enhancement numbers from the confusion dictionary, and performing random replacement on characters in the original text set by using the confusion words to obtain a training text set.
Wherein, the enhancement number refers to the number of characters needing data replacement.
In detail, the confusion dictionary contains 5401 common Chinese characters together with five kinds of confusion characters corresponding to each of them, including homophones with the same tone, homophones with different tones, and characters with the same radicals and the same stroke counts.
For example, if the original text in the original text set is "apply for a credit card", replacing characters with same-tone homophones from the confusion dictionary yields one noisy variant of the sentence, while replacing characters with characters having the same radicals and stroke counts yields other noisy variants.
In another embodiment of the present invention, the performing data enhancement processing on the original text set to obtain a training text set may also include:
acquiring a preset corpus and word frequencies corresponding to the corpora in the corpus;
randomly selecting a plurality of replaced words from the original text set;
calculating confusion probability values between confusion words in a preset confusion dictionary and the replaced words by using a preset confusion probability calculation formula according to the word frequencies;
and sorting the confusion words according to the confusion probability values, and selecting the confusion word with the highest confusion probability value to replace the replaced word to obtain a training text set.
Specifically, the preset confusion probability calculation formula is as follows:
Figure BDA0003286576210000061
wherein, P (C)i) For the confusion probability value, n represents the number of confusing words in the obfuscated dictionary corresponding to the replaced word in the original text,
Figure BDA0003286576210000062
representing a sum of word frequencies, F, of said confusing words in said corpuscorpus(Ci) Representing said confusing word CiWord frequency in the corpus.
For example, the original text in the original text set is "a copy credit card product of a bank for which a client consults to apply for a query", a plurality of replaced words are randomly selected from the original text set, wherein the replaced words can be "query", "application", "copy", "credit" and "product", all five types of confusion words corresponding to the replaced words in the confusion dictionary are found, word frequencies corresponding to all five types of confusion words are found in the corpus, the confusion word with the maximum confusion probability value obtained by calculating the confusion probability value is used for replacing the replaced word, the query is replaced by "query", "deep" is replaced by "application", "sample" is replaced by "heart" and "shovel" replaces the product ", and therefore, the obtained sample in the training sample set is a copy card shovel product for the bank for which the client consults to find the deep request.
According to the embodiment of the invention, data enhancement processing on the original text set yields a large number of training samples containing wrongly written characters together with the corresponding correct character labels (a character in the original text is replaced by a wrongly written character, and the character before replacement serves as its correct character label). These labeled training samples are then input into the deep error correction model for training and model effect optimization, finally producing a service-text deep error correction model suitable for the service scenario.
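To make the augmentation concrete, the following is a minimal sketch of confusion-based replacement in Python; `confusion_dict` and `corpus_freq` are hypothetical stand-ins for the preset confusion dictionary and the corpus word frequencies, and the replacement count is illustrative.

```python
import random

def augment_text(text, confusion_dict, corpus_freq, num_replacements=2):
    # confusion_dict: character -> list of confusion characters (preset confusion dictionary)
    # corpus_freq:    character -> word frequency in the preset corpus
    chars = list(text)
    # Only positions whose character has confusion candidates are eligible for replacement.
    candidates = [i for i, c in enumerate(chars) if c in confusion_dict]
    for i in random.sample(candidates, min(num_replacements, len(candidates))):
        confusions = confusion_dict[chars[i]]
        total = sum(corpus_freq.get(c, 0) for c in confusions)
        if total == 0:
            continue
        # P(C_i) = F_corpus(C_i) / sum_j F_corpus(C_j); keep the highest-probability confusion.
        chars[i] = max(confusions, key=lambda c: corpus_freq.get(c, 0) / total)
    return "".join(chars)
```

The character before each replacement is retained as the correct character label for that position when building the labeled training set.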
And S2, carrying out vectorization processing on the training text set to obtain a training text vector.
In the embodiment of the invention, the training text vector is obtained by obtaining the word embedding vector (word embedding), the position embedding vector (position embedding) and the segment embedding vector (segment embedding) of each character of the training text in the training text set, and summing the three:
e_i = wordembedding(x_i) + positionembedding(x_i) + segmentembedding(x_i)
wherein x_i represents the i-th character in the training text and e_i represents the corresponding training text vector.
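As an illustration, a minimal sketch of this summation in PyTorch, assuming BERT-style embedding tables; the vocabulary size, maximum length and hidden dimension below are illustrative defaults, not values from the patent:

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Computes e_i = wordembedding(x_i) + positionembedding(x_i) + segmentembedding(x_i)."""
    def __init__(self, vocab_size=21128, max_len=512, hidden_dim=768):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, hidden_dim)
        self.position_embedding = nn.Embedding(max_len, hidden_dim)
        self.segment_embedding = nn.Embedding(2, hidden_dim)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.word_embedding(token_ids)
                + self.position_embedding(positions)   # broadcast over the batch
                + self.segment_embedding(segment_ids))
```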
And S3, inputting the training text vector into an error detection network of a preset deep error correction model for error detection, and obtaining an output result of whether the spelling is correct or not.
In the embodiment of the present invention, the deep error correction model is a detection-correction-combined deep model, which comprises two networks: a detection network and a correction network. The error detection network is essentially a bidirectional GRU model (Bi-GRU). The GRU (Gated Recurrent Unit), also called a gated recurrent unit structure, is built from an update gate and a reset gate; it can effectively capture semantic associations across long sequences and alleviates the vanishing or exploding gradient phenomena.
Specifically, the inputting the training text vector into an error detection network of a preset deep error correction model for error detection to obtain an output result includes:
respectively carrying out forward coding and reverse coding on the training text vector to obtain a forward coding vector and a reverse coding vector;
transversely combining the forward coding vector and the reverse coding vector to obtain a coding hidden vector;
and inputting the coding hidden vector into a full connection layer for carrying out two-classification processing to obtain an output result of whether the spelling is correct or not.
Further, the forward encoding vector and the reverse encoding vector are obtained according to the following formulas:
h_i^f = GRU(h_{i-1}^f, e_i)
h_i^b = GRU(h_{i+1}^b, e_i)
wherein h_i^f is the forward encoding vector, h_i^b is the reverse encoding vector, h_{i-1}^f is the hidden state of the previous character, h_{i+1}^b is the hidden state of the next character, and e_i represents the training text vector corresponding to the i-th character.
Further, in the embodiment of the present invention, the forward encoding vector and the reverse encoding vector are transversely combined to obtain the encoded hidden vector:
h_i^d = [h_i^f ; h_i^b]
wherein h_i^d is the encoded hidden vector, h_i^f is the forward encoding vector, and h_i^b is the reverse encoding vector.
Specifically, the inputting the encoded hidden vector into the fully connected layer for binary classification to obtain an output result indicating whether the spelling is correct includes:
inputting the encoded hidden vector into the fully connected layer to obtain a classification probability;
when the classification probability is greater than or equal to a preset classification threshold, judging the output result to be a misspelling;
and when the classification probability is smaller than the preset classification threshold, judging the output result to be a correct spelling.
The embodiment of the invention calculates the classification probability using the following formula:
P_d(y_i = k | X) = softmax(W · h_i^d + b)[k]
wherein P_d(y_i = k | X) is the classification probability, h_i^d is the encoded hidden vector, b is the bias term of the fully connected layer, W is the weight matrix of the fully connected layer, softmax is the activation function, X is the input text sequence, y_i is the output result corresponding to the i-th character, and [k] denotes taking the component of the softmax output corresponding to class k.
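Putting the formulas of this step together, the following is a sketch of the error detection network, assuming PyTorch and illustrative dimensions; the module returns both the per-character classification probabilities and the encoded hidden vectors, since the latter are reused by the soft mask connection:

```python
import torch
import torch.nn as nn

class DetectionNetwork(nn.Module):
    def __init__(self, hidden_dim=768, gru_dim=256):
        super().__init__()
        # bidirectional=True performs the forward and reverse encodings in one module
        self.bi_gru = nn.GRU(hidden_dim, gru_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * gru_dim, 2)  # two classes: spelled correctly / misspelled

    def forward(self, e):
        # e: training text vectors, shape (batch, seq_len, hidden_dim)
        h, _ = self.bi_gru(e)        # h transversely combines forward and reverse vectors
        return torch.softmax(self.fc(h), dim=-1), h   # P_d(y_i = k | X) and h_i^d
```

Characters whose misspelling probability reaches the preset classification threshold are judged to be misspelled; the rest are judged correct.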
And S4, performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct or not to obtain embedded data.
In the embodiment of the invention, the output result of the error detection network indicating whether each character is spelled correctly is integrated into the input of the error correction network by means of a soft masking connection.
In detail, the soft mask connection processing mainly processes the misspelled part of the output result to obtain the embedded data.
Specifically, the performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct to obtain embedded data includes:
calculating by using a preset mask coefficient formula and the coding hidden vector to obtain a mask coefficient;
and calculating the misspelling vector in the training text vector according to the mask coefficient and a preset soft mask connection formula to obtain embedded data.
Further, the preset mask coefficient formula is as follows:
p_i = σ(W_d · h_i^d + b_d)
wherein b_d is the bias term of the fully connected layer, W_d is the weight matrix of the fully connected layer, p_i is the mask coefficient, h_i^d is the encoded hidden vector, and σ is the sigmoid activation function.
Specifically, the embodiment of the present invention calculates the embedded data using the following formula:
e′_i = p_i · e_mask + (1 - p_i) · e_g
wherein e′_i is the embedded data, p_i is the mask coefficient, e_g is the misspelled character vector in the training text vector, and e_mask is the embedding of the special mask token ([MASK]).
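A sketch of the soft mask connection under the formulas above; the dimensions are assumptions (gru_dim=512 matches the 2×256 concatenated Bi-GRU output of the earlier sketch), and the coefficient is applied uniformly across positions:

```python
import torch
import torch.nn as nn

class SoftMaskConnection(nn.Module):
    def __init__(self, gru_dim=512, hidden_dim=768):
        super().__init__()
        self.fc = nn.Linear(gru_dim, 1)  # W_d and b_d of the mask coefficient formula

    def forward(self, h, e, e_mask):
        # h: encoded hidden vectors (batch, seq_len, gru_dim)
        # e: training text vectors (batch, seq_len, hidden_dim)
        # e_mask: embedding of the mask token (hidden_dim,), broadcast over positions
        p = torch.sigmoid(self.fc(h))        # p_i = sigmoid(W_d * h_i^d + b_d)
        return p * e_mask + (1.0 - p) * e    # e'_i = p_i * e_mask + (1 - p_i) * e_i
```

A large p_i pushes a position toward the mask embedding, so characters the detection network judges misspelled enter the correction network close to [MASK].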
S5, carrying out error correction processing on the embedded data by using the error correction network of the deep error correction model to obtain an error correction result.
In the embodiment of the invention, the error correction network (Correction Network) of the deep error correction model is a MacBERT pre-trained language model comprising 12 encoding layers (Encoder layers). MacBERT improves on the RoBERTa model in several aspects, in particular adopting MLM as correction (Mac) as its masking strategy; its structure is consistent with the BERT model, changing only the masking mode.
Specifically, the performing, by using the error correction network of the deep error correction model, error correction processing on the embedded data to obtain an error correction result includes:
encoding the embedded data by utilizing a plurality of encoding layers in the error correction network, and taking the hidden state of the last encoding layer in the plurality of encoding layers;
residual error connection is carried out on the hidden state and the training text vector to obtain a connection value;
and inputting the connection value into a full connection layer of the error correction network to obtain an error correction result.
Specifically, the connection value is input into the fully connected layer, the conditional probability corresponding to the connection value is calculated using the activation function in the fully connected layer, and the error correction result is obtained by comparing the conditional probability with a preset error correction threshold.
The embodiment of the invention takes all hidden states of the last encoding layer c in the error correction network, h_i^c.
Further, the embodiment of the present invention performs residual connection on the hidden state and the training text vector using the following formula to obtain the connection value:
h′_i = h_i^c + e_i
wherein h′_i is the connection value, h_i^c is the hidden state, and e_i is the training text vector.
In the embodiment of the invention, the value h′_i obtained after residual connection is input into a further fully connected layer, which maps the hidden state h′_i to a space whose dimensionality equals the number of words in the candidate word list (vocab); the output of this fully connected layer is input into a softmax function to calculate the conditional probability that the character x_i in the text sequence is corrected to the character j in the candidate word list (vocab). This calculation is shown by the following formula:
P_c(y_i = j | X) = softmax(W · h′_i + b)[j]
wherein P_c(y_i = j | X) is the conditional probability that the character x_i is corrected to the character j in the candidate word list (vocab), b is the bias term of the fully connected layer, W is the weight matrix of the fully connected layer, h′_i is the connection value, and j indexes the candidate word list.
In detail, the error correction result is obtained by comparing the conditional probability with a preset error correction threshold: when the conditional probability is greater than or equal to the error correction threshold, the correction is judged to be correct; when the conditional probability is less than the error correction threshold, the correction is judged to be erroneous.
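A sketch of this correction step, where `encoder` is a stand-in for the assumed 12-layer MacBERT encoder stack returning the hidden states of its last encoding layer; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CorrectionHead(nn.Module):
    def __init__(self, encoder, hidden_dim=768, vocab_size=21128):
        super().__init__()
        self.encoder = encoder                  # assumed MacBERT-style encoder stack
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, e_soft_masked, e):
        h = self.encoder(e_soft_masked)         # hidden states of the last encoding layer
        h_prime = h + e                         # residual connection: h'_i = h_i^c + e_i
        # P_c(y_i = j | X): probability of correcting x_i to each candidate character j
        return torch.softmax(self.fc(h_prime), dim=-1)
```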
S6, calculating a cross entropy loss value of the deep error correction model according to the error correction result, adjusting the model parameters of the deep error correction model according to the cross entropy loss value, and outputting a standard deep error correction model.
In the embodiment of the invention, the cross entropy loss value of the deep error correction model is calculated using the following formulas:
C = αC_1 + βC_2
C_1 = -(1/n) · Σ_i a_1 · log y_1
C_2 = -(1/n) · Σ_i a_2 · log y_2
wherein C is the cross entropy loss value of the deep error correction model, C_1 is the cross entropy loss value of the error detection network, C_2 is the cross entropy loss value of the error correction network, α and β are preset fixed weights, n is the total amount of data, y_1 is the output result, y_2 is the error correction result, a_1 is the preset prediction result, and a_2 is the preset detection result.
In detail, the prediction result is the preset corresponding standard result (label) in the error detection network, and the detection result is the preset corresponding standard result in the error correction network. The error correction result reflects the outcome of performing error correction processing on the embedded data using the error correction network, which may be either a correct correction or an erroneous correction.
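A sketch of the weighted loss C = αC_1 + βC_2, assuming binary cross entropy for the detection network and categorical cross entropy for the correction network; the α and β values are illustrative, not from the patent:

```python
import torch
import torch.nn.functional as F

def combined_loss(det_probs, det_labels, cor_logits, cor_labels, alpha=0.2, beta=0.8):
    # det_probs:  (batch, seq_len) misspelling probabilities from the detection network
    # det_labels: (batch, seq_len) 1 = misspelled, 0 = correct
    # cor_logits: (batch, seq_len, vocab_size) pre-softmax scores from the correction head
    # cor_labels: (batch, seq_len) indices of the correct characters in the candidate list
    c1 = F.binary_cross_entropy(det_probs, det_labels.float())              # detection loss C_1
    c2 = F.cross_entropy(cor_logits.view(-1, cor_logits.size(-1)),          # correction loss C_2
                         cor_labels.view(-1))
    return alpha * c1 + beta * c2                                           # C = alpha*C_1 + beta*C_2
```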
In the embodiment of the invention, during the model training stage, gradient descent back-propagation is performed using the cross entropy loss value to adjust the model parameters, so that the cross entropy loss value decreases continuously. While the cross entropy loss value decreases, the effect of the current model training is measured using several metrics: for example, a validation data set obtained in advance is used to validate the deep error correction model, and the metrics of the model on the validation data set, such as the accuracy value, recall value, precision value and f1 value, are used to judge how well the model has been trained. When a metric such as the accuracy value or the f1 value on the validation data set reaches a high level (for example, about 95%-98%), the model can be considered sufficiently trained; training can then be stopped and the model output as the standard deep error correction model.
S7, inputting the pre-acquired text to be corrected into the standard deep error correction model, and obtaining the corrected text based on a preset multi-round error correction mechanism.
In the embodiment of the invention, after the deep error correction model has been trained for a given service scenario on a large number of training samples produced by the data enhancement method combining the confusion dictionary with text character substitution, a multi-round error correction mechanism can be used to further improve the error correction effect when performing fixed-length error correction on a piece of service text containing wrongly written characters.
In detail, when a piece of service text containing wrongly written characters is input into the deep error correction model and undergoes one round of error correction, most of the wrongly written characters are corrected, but a small number may remain uncorrected in that round; after two or even three further rounds of error correction on the result of the previous round, essentially all wrongly written characters in the service text are corrected.
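A minimal sketch of such a multi-round mechanism; `correct_once` is a hypothetical single-pass inference helper wrapping the standard deep error correction model:

```python
def multi_round_correct(correct_once, text, max_rounds=3):
    # Re-feed each round's output as the next round's input,
    # stopping early once a round changes nothing.
    for _ in range(max_rounds):
        corrected = correct_once(text)
        if corrected == text:   # converged: this round made no further corrections
            break
        text = corrected
    return text
```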
In the embodiment of the invention, a training text set is obtained by performing data enhancement processing on an original text set; the data enhancement processing can rapidly construct, at low cost, massive training samples usable for training a model. The vectorized training text set is input into an error detection network of a preset deep error correction model to obtain an output result indicating whether spelling is correct, and the training text vector undergoes soft mask connection processing according to this output result to obtain embedded data; the soft mask connection processing integrates the detection output into the input of the error correction network. The error correction network of the deep error correction model then performs error correction processing on the embedded data; this network has a strong fixed-length error correction capability and can correct service texts containing wrong characters in parallel at one time, which greatly improves the efficiency and timeliness of error correction. Finally, a pre-acquired text to be corrected is input into the standard deep error correction model, and the corrected text is obtained based on a preset multi-round error correction mechanism, which can further improve the error correction effect of the deep error correction model. Therefore, the text fixed-length error correction method provided by the invention can solve the problem of low accuracy of fixed-length text error correction.
Fig. 2 is a functional block diagram of a text fixed length correction apparatus according to an embodiment of the present invention.
The text fixed-length correction apparatus 100 of the present invention can be installed in an electronic device. According to the functions realized, the text fixed-length correction apparatus 100 can comprise a data processing module 101, an error detection module 102, an error correction processing module 103, a model training module 104 and a multi-round error correction module 105. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the data processing module 101 is configured to obtain an original text set, perform data enhancement processing on the original text set to obtain a training text set, and perform vectorization processing on the training text set to obtain a training text vector;
the error detection module 102 is configured to input the training text vector into an error detection network of a preset deep error correction model to perform error detection, so as to obtain an output result indicating whether spelling is correct;
the error correction processing module 103 is configured to perform soft mask connection processing on the training text vector according to an output result indicating whether the spelling is correct or not to obtain embedded data, and perform error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result;
the model training module 104 is configured to calculate a cross entropy loss value of the deep error correction model according to the error correction result, adjust the model parameters of the deep error correction model according to the cross entropy loss value, and output a standard deep error correction model;
the multi-round error correction module 105 is configured to input a pre-acquired text to be corrected into the standard deep error correction model, and obtain a correct text after error correction based on a preset multi-round error correction mechanism.
In detail, the specific implementation of each module of the text fixed length correction apparatus 100 is as follows:
the method comprises the steps of firstly, obtaining an original text set, and carrying out data enhancement processing on the original text set to obtain a training text set.
In the embodiment of the present invention, the original text set may be text records related to different business activities of the bank, for example, relevant texts of customer problem consultation recorded by a human agent, a summary of service process of the customer, and a summary of complaint of the customer.
Specifically, the performing data enhancement processing on the original text set to obtain a training text set includes:
acquiring a preset confusion dictionary and a preset enhancement number;
and screening out confusion words with the number consistent with that of the enhancement numbers from the confusion dictionary, and performing random replacement on characters in the original text set by using the confusion words to obtain a training text set.
Wherein, the enhancement number refers to the number of characters needing data replacement.
In detail, the confusion dictionary contains 5401 Chinese common words and five kinds of confusion words corresponding to the Chinese common words: homonymic homonymous characters, homonymic heteronymous characters, and characters with the same radicals and the same stroke numbers.
For example, the original text in the original text set is "apply for credit card", and the original text is replaced by homophonic homonyms in the confusion dictionary to obtain "apply for credit card". The original text is replaced by the characters with the same radicals and stroke numbers in the confusion dictionary to obtain a 'card for applying for infringement' or a 'card for applying for immediate use'.
In another embodiment of the present invention, the performing data enhancement processing on the original text set to obtain a training text set may also include:
acquiring a preset corpus and word frequencies corresponding to the corpora in the corpus;
randomly selecting a plurality of replaced words from the original text set;
calculating confusion probability values between confusion words in a preset confusion dictionary and the replaced words by using a preset confusion probability calculation formula according to the word frequencies;
and sorting the confusion words according to the confusion probability values, and selecting the confusion word with the highest confusion probability value to replace the replaced word to obtain a training text set.
Specifically, the preset confusion probability calculation formula is as follows:
Figure BDA0003286576210000131
wherein, P (C)i) For the confusion probability value, n represents the number of confusing words in the obfuscated dictionary corresponding to the replaced word in the original text,
Figure BDA0003286576210000132
representing a sum of word frequencies, F, of said confusing words in said corpuscorpus(Ci) Representing said confusing word CiWord frequency in the corpus.
For example, the original text in the original text set is "a copy credit card product of a bank for which a client consults to apply for a query", a plurality of replaced words are randomly selected from the original text set, wherein the replaced words can be "query", "application", "copy", "credit" and "product", all five types of confusion words corresponding to the replaced words in the confusion dictionary are found, word frequencies corresponding to all five types of confusion words are found in the corpus, the confusion word with the maximum confusion probability value obtained by calculating the confusion probability value is used for replacing the replaced word, the query is replaced by "query", "deep" is replaced by "application", "sample" is replaced by "heart" and "shovel" replaces the product ", and therefore, the obtained sample in the training sample set is a copy card shovel product for the bank for which the client consults to find the deep request.
According to the embodiment of the invention, a large number of training samples with wrongly-written characters and corresponding correct character labels (characters in the original text are replaced by wrongly-written characters, and the characters before replacement can be regarded as the corresponding correct character labels) can be obtained by performing data enhancement processing on the original text set, and then the training samples with the labels are input into the depth error correction model for training and model effect optimization, so that the service text depth error correction model suitable for the service scene can be obtained finally.
And step two, vectorizing the training text set to obtain a training text vector.
In the embodiment of the invention, the training text vector is obtained by obtaining the word embedding vector, the position embedding vector and the segment embedding vector of each character of the training text in the training text set and summing the word embedding vector, the position embedding vector and the segment embedding vector.
Therefore, in the embodiment of the present invention, the training text vector is obtained by summing a word embedding vector (word embedding), a position embedding vector (position embedding) and a segment embedding vector (segment embedding) of each character in the training text, so as to obtain a training text vector:
ei=wordembedding(xi)+positionembedding(xi)+segmentembedding(xi)
wherein x isiRepresenting the ith character, e, in the training textiRepresenting a training text vector, wherein word embedding is the word embedding vector, position embedding is the position embedding vector, and segment embedding is the segment embedding vector.
And step three, inputting the training text vector into an error exploration network of a preset deep error correction model for error exploration to obtain an output result of whether the spelling is correct or not.
In the embodiment of the present invention, the depth error Correction model is a Detection-Correction combined depth model (Detection-Correction-combined model), which includes two Network models, namely a Detection Network (Detection Network) and an error Correction Network (Correction Network). Wherein the error detection network is substantially a bidirectional GRU model (Bi-GRU). The GRU model (Gated Current Unit) is also called a gate control cycle Unit structure, is constructed by an update gate and a reset gate, can effectively capture semantic association between long sequences, and relieves gradient disappearance or explosion phenomena.
Specifically, the inputting the training text vector into an error exploration network of a preset deep error correction model for error exploration to obtain an output result includes:
respectively carrying out forward coding and reverse coding on the training text vector to obtain a forward coding vector and a reverse coding vector;
transversely combining the forward coding vector and the reverse coding vector to obtain a coding hidden vector;
and inputting the coding hidden vector into a full connection layer for carrying out two-classification processing to obtain an output result of whether the spelling is correct or not.
Further, the forward encoding vector and the reverse encoding vector are obtained by calculation according to the following calculation formula:
Figure BDA0003286576210000151
Figure BDA0003286576210000152
wherein the content of the first and second substances,
Figure BDA0003286576210000153
for the purpose of the forward-encoded vector,
Figure BDA0003286576210000154
for the purpose of the reverse-direction encoding vector,
Figure BDA0003286576210000155
is a hidden state of the previous character,
Figure BDA0003286576210000156
hidden state for the next character, eiRepresenting the training text vector corresponding to the ith character.
Further, in the embodiment of the present invention, the forward encoded vector and the backward encoded vector are transversely combined to obtain an encoded hidden vector:
Figure BDA0003286576210000157
wherein the content of the first and second substances,
Figure BDA0003286576210000158
for the purpose of said encoding of the concealment vector(s),
Figure BDA0003286576210000159
for the purpose of the forward-encoded vector,
Figure BDA00032865762100001510
the vector is encoded for the inverse direction.
Specifically, the inputting the encoding hidden vector into the full-link layer for performing two-classification processing to obtain an output result indicating whether the spelling is correct includes:
inputting the coding hidden vector into the full-connection layer to obtain a classification probability;
when the classification probability is greater than or equal to a preset classification threshold value, judging the output result as misspelling;
and when the classification probability is smaller than a preset classification threshold value, judging the output result as the correct spelling.
The embodiment of the invention utilizes the following calculation formula to calculate and obtain the classification probability:
Figure BDA00032865762100001511
wherein, Pd(yiK | X) is the classification probability,
Figure BDA0003286576210000161
for the encoded concealment vector, b is a bias term for the fully-connected layer, W is a weight matrix for the fully-connected layer, softmax is an activation function, k and X are fixed parameters, yiFor the output result corresponding to the ith character, [ k ]]The integer part of the k value is solved.
And fourthly, performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct or not to obtain embedded data.
In the embodiment of the invention, the output result of whether the spelling of the error detection network is correct or not is integrated into the input of the error correction network in a soft masking connection (soft masking connection) mode.
In detail, the soft mask connection processing is mainly to process a misspelling part in the output result of whether the spelling is correct or not, so as to obtain embedded data.
Specifically, the performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct to obtain embedded data includes:
calculating by using a preset mask coefficient formula and the coding hidden vector to obtain a mask coefficient;
and calculating the misspelling vector in the training text vector according to the mask coefficient and a preset soft mask connection formula to obtain embedded data.
Further, the preset mask coefficient formula is as follows:
Figure BDA0003286576210000162
wherein, bdAs an offset term in the fully-connected layer, WdIs a weight matrix, p, in the full connection layeriFor the said mask coefficients,
Figure BDA0003286576210000163
for the coded concealment vector, X is a fixed parameter, and σ refers to the variance.
Specifically, the embodiment of the present invention obtains the embedded data by calculation using the following calculation formula:
e′i=pi·emask+(1-pi)·eg
wherein, e'iFor said embedded data, piIs the mask coefficient, egFor the misspelled vector in the training vector, emaskIs the embedded 'mask special character'.
And fifthly, carrying out error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result.
In the embodiment of the invention, an error Correction Network (Correction Network) of the deep error Correction model is a MacBert pre-training language model, and comprises 12 coding layers (Encoder layers). The MacBert pre-training language model improves the RoBERTA model in multiple aspects, particularly adopts MLM as a masked strategy for correction (Mac), and the structure of the MacBert pre-training language model is consistent with that of the Bert model and only changes a mask mode.
Specifically, the performing, by using the error correction network of the deep error correction model, error correction processing on the embedded data to obtain an error correction result includes:
encoding the embedded data by utilizing a plurality of encoding layers in the error correction network, and taking the hidden state of the last encoding layer in the plurality of encoding layers;
residual error connection is carried out on the hidden state and the training text vector to obtain a connection value;
and inputting the connection value into a full connection layer of the error correction network to obtain an error correction result.
Specifically, the connection value is input into the full connection layer, a conditional probability corresponding to the connection value is calculated by using an activation function in the full connection layer, and an error correction result is obtained by comparing the conditional probability with a preset error correction threshold.
The embodiment of the invention selects all hidden states (hidden states) of the last coding layer c in the error correction network
Figure BDA0003286576210000171
Further, the embodiment of the present invention performs residual error connection on the hidden state and the training text vector by using the following calculation formula to obtain a connection value:
Figure BDA0003286576210000172
wherein, h'iIs a value for the connection of the data,
Figure BDA0003286576210000173
is the hidden state, eiIs the training text vector.
In the embodiment of the invention, the residual error is connected (Resi)Value h 'obtained after dual Connection)'iInputting into a further fully-connected layer, which will be h'iMapping hidden states (hidden states) to a space with the same dimension number as the number of words in a candidate word list (vocab), and inputting the result of the mapping of the full connection layer into a softmax function to calculate a character x in a text sequenceiIs corrected to the conditional probability of character j in the candidate word list (vocab). This calculation is shown by the following equation:
Pc(yi=j|X)=softmax(Wh′i+b)[j]
wherein, Pc(yiJ | X) is the character XiIs corrected to a conditional probability of a character j in a candidate word list (vocab), b is an offset term of the fully-connected layer, W is a weight matrix, h 'of the fully-connected layer'iIs a connected value, j is a fixed parameter.
In detail, an error correction result is obtained by comparing a conditional probability with a preset error correction threshold, when the conditional probability is greater than or equal to the error correction threshold, the error correction result is judged to be correct for error correction, and when the conditional probability is less than the error correction threshold, the error correction result is judged to be an error correction.
And step six, calculating a cross entropy loss value of the depth error correction model according to the error correction result, adjusting model parameters of the depth error correction model according to the cross entropy loss value, and outputting a standard depth error correction model.
In the embodiment of the invention, the cross entropy loss value of the depth error correction model is calculated by using the following formula:
C=αC1+βC2
Figure BDA0003286576210000181
Figure BDA0003286576210000182
wherein C is the depth error correction modelCross entropy loss value of C1For the cross entropy loss value, C, of the error probed network2For the cross entropy loss value of the error correction network, alpha and beta are preset fixed weights, n is the total amount of data, y1As the output result, y2As a result of error correction, a1Is a preset prediction result, a2Is a preset detection result.
In detail, the prediction result is the preset corresponding standard result in the error probing network, and the detection result is the preset corresponding standard result in the error correction network. The error correction result records the outcome of performing error correction processing on the embedded data by using the error correction network, and the outcome may be either a correct correction or an incorrect correction.
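The weighted combination of the two cross entropy loss values can be illustrated with the following minimal sketch; the weights α and β and the example outputs and standard results are assumed values, since the embodiment does not fix them:

```python
import numpy as np

def cross_entropy(y, a):
    """C = -(1/n) * sum(a*ln(y) + (1-a)*ln(1-y)), with y the network
    output and a the preset standard result."""
    y = np.clip(y, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(a * np.log(y) + (1 - a) * np.log(1 - y))

alpha, beta = 0.8, 0.2              # assumed preset fixed weights

y1 = np.array([0.9, 0.2, 0.7])      # error probing network outputs (assumed)
a1 = np.array([1.0, 0.0, 1.0])      # preset prediction results
y2 = np.array([0.8, 0.6, 0.9])      # error correction network outputs (assumed)
a2 = np.array([1.0, 1.0, 1.0])      # preset detection results

C1 = cross_entropy(y1, a1)          # loss of the error probing network
C2 = cross_entropy(y2, a2)          # loss of the error correction network
C = alpha * C1 + beta * C2          # loss of the deep error correction model
```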
In the embodiment of the invention, in the model training stage, gradient descent and backward propagation are performed by using the cross entropy loss value to adjust the model parameters, so that the cross entropy loss value is continuously reduced. While the cross entropy loss value decreases, the effect of the current model training is measured by using measurement indexes: for example, a verification data set is obtained in advance to verify the deep error correction model, and the indexes of the model on the verification data set, such as the accuracy value, recall value, precision value, and F1 value, are used to evaluate the quality of the model training. When a measurement index such as the accuracy value or the F1 value on the verification data set is improved to a high level (for example, about 95%-98%), the model can be considered to be trained to a satisfactory degree; at this time, the training can be stopped, and the model is output as the standard deep error correction model.
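The training-stage control flow can be sketched as below; the model interface (loss, backward, step, evaluate_f1), the target F1 level, and the epoch limit are assumptions for illustration rather than components specified by the embodiment:

```python
def train_until_standard(model, train_batches, val_set,
                         target_f1=0.96, max_epochs=50):
    """Adjust model parameters by gradient descent until a measurement
    index (here, F1 on the verification data set) reaches a high level
    (e.g. about 95%-98%), then stop and output the standard model."""
    for _ in range(max_epochs):
        for batch in train_batches:
            loss = model.loss(batch)     # C = alpha*C1 + beta*C2
            model.backward(loss)         # backward propagation of the loss
            model.step()                 # gradient descent parameter update
        if model.evaluate_f1(val_set) >= target_f1:
            break                        # training effect is good enough
    return model                         # standard deep error correction model
```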
And step seven, inputting the pre-acquired text to be corrected into the standard deep error correction model, and obtaining the corrected correct text based on a preset multi-round error correction mechanism.
In the embodiment of the invention, after the deep error correction model for a certain service scene has been trained on a large number of training samples enhanced by the data enhancement method combining a confusion dictionary and text word substitution, a multi-round error correction mechanism can be used to further improve the error correction effect when fixed-length error correction is performed on a section of service text containing wrongly written words.
In detail, when a section of service text containing wrongly written words is input into the deep error correction model and subjected to one round of error correction, most of the wrongly written words are corrected, but a small number may remain uncorrected. After a second or even a third round of error correction is performed on the result of the previous round, the wrongly written words in the service text are essentially all corrected to the correct words.
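One possible reading of this multi-round mechanism is the fixed-point loop sketched below; the stopping rule (stop when the output no longer changes, or after a preset round limit) and the toy single-character corrector are assumptions for illustration, since the embodiment only states that two or three rounds are generally sufficient:

```python
def multi_round_correct(correct_once, text, max_rounds=3):
    """Apply one round of fixed-length error correction repeatedly.
    `correct_once` stands in for a forward pass of the standard deep
    error correction model and must return text of the same length."""
    for _ in range(max_rounds):
        corrected = correct_once(text)
        if corrected == text:   # nothing changed: all errors corrected
            break
        text = corrected        # feed the result into the next round
    return text

# Toy usage with a stand-in corrector that fixes one confusable character
fix_map = {"帐": "账"}
corrector = lambda s: "".join(fix_map.get(ch, ch) for ch in s)
print(multi_round_correct(corrector, "对帐单"))  # -> 对账单
```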
The embodiment of the invention obtains the training text set by performing data enhancement processing on the original text set; the data enhancement processing can rapidly construct, at low cost, massive training samples usable for training a model. The vectorized training text set is input into the error probing network of the preset deep error correction model to obtain an output result of whether the spelling is correct or not, and soft mask connection processing is performed on the training text vector according to this output result to obtain embedded data; the soft mask connection processing integrates the output result of whether the spelling is correct or not into the input of the error correction network. The error correction network of the deep error correction model is used to perform error correction processing on the embedded data; the error correction network has a strong text fixed-length error correction capability and can correct service texts containing wrong characters in parallel in one pass, which greatly improves the efficiency and timeliness of error correction. Then the pre-acquired text to be corrected is input into the standard deep error correction model, and the corrected correct text is obtained based on a preset multi-round error correction mechanism; the multi-round error correction mechanism can further improve the error correction effect of the deep error correction model. Therefore, the text fixed-length error correction device provided by the invention can solve the problem of low accuracy of text fixed-length error correction.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a text fixed-length error correction method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a text fixed-length error correction program, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device; it connects the various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules stored in the memory 11 (e.g., executing a text fixed-length error correction program) and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a text fixed length error correction program, but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a WI-FI interface or a Bluetooth interface), which is typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a display or an input unit such as a keyboard; optionally, it may also be a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The text fixed-length error correction program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when run in the processor 10, can implement:
acquiring an original text set, and performing data enhancement processing on the original text set to obtain a training text set;
vectorizing the training text set to obtain a training text vector;
inputting the training text vector into an error probing network of a preset deep error correction model for error probing to obtain an output result of whether the spelling is correct or not;
performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct or not to obtain embedded data;
carrying out error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result;
calculating a cross entropy loss value of the deep error correction model according to the error correction result, adjusting model parameters of the deep error correction model according to the cross entropy loss value, and outputting a standard deep error correction model;
and inputting the pre-acquired text to be corrected into the standard deep error correction model, and obtaining the corrected correct text based on a preset multi-round error correction mechanism.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring an original text set, and performing data enhancement processing on the original text set to obtain a training text set;
vectorizing the training text set to obtain a training text vector;
inputting the training text vector into an error probing network of a preset deep error correction model for error probing to obtain an output result of whether the spelling is correct or not;
performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct or not to obtain embedded data;
carrying out error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result;
calculating a cross entropy loss value of the deep error correction model according to the error correction result, adjusting model parameters of the deep error correction model according to the cross entropy loss value, and outputting a standard deep error correction model;
and inputting the pre-acquired text to be corrected into the standard deep error correction model, and obtaining the corrected correct text based on a preset multi-round error correction mechanism.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A text fixed-length error correction method is characterized by comprising the following steps:
acquiring an original text set, and performing data enhancement processing on the original text set to obtain a training text set;
vectorizing the training text set to obtain a training text vector;
inputting the training text vector into an error probing network of a preset deep error correction model for error probing to obtain an output result of whether the spelling is correct or not;
performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct or not to obtain embedded data;
carrying out error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result;
calculating a cross entropy loss value of the deep error correction model according to the error correction result, adjusting model parameters of the deep error correction model according to the cross entropy loss value, and outputting a standard deep error correction model;
and inputting the pre-acquired text to be corrected into the standard deep error correction model, and obtaining the corrected correct text based on a preset multi-round error correction mechanism.
2. The method of claim 1, wherein the performing data enhancement processing on the original text set to obtain a training text set comprises:
acquiring a preset corpus and word frequencies corresponding to the corpora in the corpus;
randomly selecting a plurality of replaced words from the original text set;
calculating confusion probability values between confusion words in a preset confusion dictionary and the replaced words by using a preset confusion probability calculation formula according to the word frequencies;
and sorting the confusion words according to the confusion probability values, and selecting the confusion word with the highest confusion probability value to replace the replaced word to obtain a training text set.
3. The method of claim 1, wherein the vectorizing the training text set to obtain a training text vector comprises:
acquiring a word embedding vector, a position embedding vector and a segment embedding vector of each character of the training text in the training text set;
and summing the word embedding vector, the position embedding vector and the segment embedding vector to obtain a training text vector.
4. The text fixed-length error correction method according to claim 1, wherein the inputting the training text vector into an error probing network of a preset deep error correction model for error probing to obtain an output result of whether the spelling is correct or not comprises:
respectively carrying out forward coding and reverse coding on the training text vector to obtain a forward coding vector and a reverse coding vector;
transversely combining the forward coding vector and the reverse coding vector to obtain a coding hidden vector;
and inputting the coding hidden vector into a full connection layer for carrying out two-classification processing to obtain an output result of whether the spelling is correct or not.
5. The method of claim 4, wherein the inputting the coding hidden vector into the full connection layer for two-classification processing to obtain an output result of whether the spelling is correct or not comprises:
inputting the coding hidden vector into the full-connection layer to obtain a classification probability;
when the classification probability is greater than or equal to a preset classification threshold value, judging the output result as misspelling;
and when the classification probability is smaller than a preset classification threshold value, judging the output result as the correct spelling.
6. The method of claim 1, wherein the performing soft mask connection processing on the training text vector to obtain embedded data comprises:
calculating a mask coefficient by using a preset mask coefficient formula and the coding hidden vector;
and calculating the misspelling vector in the training text vector according to the mask coefficient and a preset soft mask connection formula to obtain embedded data.
7. The method for fixed-length text error correction according to claim 1, wherein the error correction processing on the embedded data by using the error correction network of the deep error correction model to obtain an error correction result comprises:
encoding the embedded data by utilizing a plurality of encoding layers in the error correction network, and taking the hidden state of the last encoding layer in the plurality of encoding layers;
performing residual connection on the hidden state and the training text vector to obtain a connection value;
and inputting the connection value into a full connection layer of the error correction network to obtain an error correction result.
8. A text fixed-length error correction apparatus, comprising:
the data processing module is used for acquiring an original text set, performing data enhancement processing on the original text set to obtain a training text set, and performing vectorization processing on the training text set to obtain a training text vector;
the error probing module is used for inputting the training text vector into an error probing network of a preset deep error correction model for error probing to obtain an output result of whether spelling is correct or not;
the error correction processing module is used for performing soft mask connection processing on the training text vector according to the output result of whether the spelling is correct or not to obtain embedded data, and performing error correction processing on the embedded data by using an error correction network of the deep error correction model to obtain an error correction result;
the model training module is used for calculating a cross entropy loss value of the depth error correction model according to the error correction result, adjusting model parameters of the depth error correction model according to the cross entropy loss value and outputting a standard depth error correction model;
and the multi-round error correction module is used for inputting the pre-acquired text to be corrected into the standard deep error correction model and obtaining the corrected correct text based on a preset multi-round error correction mechanism.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the text fixed-length error correction method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the text fixed-length error correction method according to any one of claims 1 to 7.
CN202111149204.8A 2021-09-29 2021-09-29 Text fixed-length error correction method, device, equipment and storage medium Pending CN113887201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111149204.8A CN113887201A (en) 2021-09-29 2021-09-29 Text fixed-length error correction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111149204.8A CN113887201A (en) 2021-09-29 2021-09-29 Text fixed-length error correction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113887201A true CN113887201A (en) 2022-01-04

Family

ID=79007843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111149204.8A Pending CN113887201A (en) 2021-09-29 2021-09-29 Text fixed-length error correction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113887201A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065738A (en) * 2022-01-11 2022-02-18 湖南达德曼宁信息技术有限公司 Chinese spelling error correction method based on multitask learning

Similar Documents

Publication Publication Date Title
CN112667800A (en) Keyword generation method and device, electronic equipment and computer storage medium
CN112541338A (en) Similar text matching method and device, electronic equipment and computer storage medium
CN113378970B (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN115423535B (en) Product purchasing method, device, equipment and medium based on market priori big data
CN115392237B (en) Emotion analysis model training method, device, equipment and storage medium
CN113706322A (en) Service distribution method, device, equipment and storage medium based on data analysis
CN112507663A (en) Text-based judgment question generation method and device, electronic equipment and storage medium
CN113627160B (en) Text error correction method and device, electronic equipment and storage medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN114840684A (en) Map construction method, device and equipment based on medical entity and storage medium
CN114399775A (en) Document title generation method, device, equipment and storage medium
CN113887201A (en) Text fixed-length error correction method, device, equipment and storage medium
CN111460293B (en) Information pushing method and device and computer readable storage medium
CN112633988A (en) User product recommendation method and device, electronic equipment and readable storage medium
CN116630712A (en) Information classification method and device based on modal combination, electronic equipment and medium
CN114625340B (en) Commercial software research and development method, device, equipment and medium based on demand analysis
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN112749264A (en) Problem distribution method and device based on intelligent robot, electronic equipment and storage medium
CN113806540A (en) Text labeling method and device, electronic equipment and storage medium
CN111414452A (en) Search word matching method and device, electronic equipment and readable storage medium
CN114462411B (en) Named entity recognition method, device, equipment and storage medium
CN116663503A (en) Sentence error correction method, device, equipment and medium based on self-attention weight graph
CN117195898A (en) Entity relation extraction method and device, electronic equipment and storage medium
CN116612898A (en) Correlation weight analysis method, device, equipment and storage medium for disease factors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination