CN115688748A - Question error correction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115688748A
Authority
CN
China
Prior art keywords
question
text
error correction
error
corrected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110850150.1A
Other languages
Chinese (zh)
Inventor
钟维坚
何庆
徐海勇
陶涛
尚晶
田风
林锋
覃志智
唐苏东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Information Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110850150.1A priority Critical patent/CN115688748A/en
Publication of CN115688748A publication Critical patent/CN115688748A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a question error correction method and apparatus, an electronic device, and a storage medium. The question text samples used to train the question error correction model, and the corrected question text samples corresponding to them, can be selected according to the accuracy required of the model in actual training, which avoids the oversized parameter space that arises in an N-gram model when N is too large. Moreover, because the question error correction model uses word vectors, which are distributed representations, word semantics can be taken into account through them, which greatly alleviates the high sparsity of traditional statistical word representations.

Description

Question error correction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a question error correction method, apparatus, electronic device, and storage medium.
Background
Grammatical Error Correction (GEC) is one of the more difficult tasks in syntactic analysis within natural language processing. Grammatical error correction in current natural language mainly covers spelling errors and grammar errors.
The existing Chinese spelling error detection and correction techniques mainly fall into two categories: rule-based methods and methods based on statistical machine learning; the latter are the most versatile in language modeling and classification. For example, an error correction method based on a weighted noise channel model improved over the N-gram statistical language model draws on a Chinese spell-check algorithm based on an N-gram statistical language model and, on that basis, proposes a Chinese spelling correction algorithm based on a weighted noise channel model. The algorithm uses a trigram language model as the language model, word frequency probabilities as the conversion model, combined with confusion-word weights, and the beam search algorithm as the decoding algorithm.
Since the number of parameters of an N-gram language model grows exponentially with N, the parameter space becomes too large and error correction efficiency drops. Moreover, the N-gram statistical language model also suffers from the data smoothing problem caused by sparse word data in the text to be corrected, which reduces its error correction accuracy.
Disclosure of Invention
The invention provides a question error correction method, a question error correction device, electronic equipment and a storage medium, which are used for overcoming the defects in the prior art.
In a first aspect, the present invention provides a question error correction method, including:
acquiring a question text to be corrected, which contains an error question text and a context text of the error question text, and determining a word vector and a byte pair encoding text of the question text to be corrected;
inputting the word vectors and the byte pair coded texts into a coding layer of a question error correction model to obtain error attribute characteristics of the question texts to be corrected, which are output by the coding layer;
inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be corrected, which is output by the decoding layer;
the question error correction model is obtained by training on the basis of question text samples containing error information and corrected question text samples corresponding to the question text samples.
In one embodiment, the determining a word vector and a byte pair encoding text of the question text to be corrected specifically includes:
converting the question text to be corrected into a token form to obtain a word sequence;
and performing word segmentation processing on the word sequence based on a byte pair coding algorithm to obtain the word vector corresponding to the word sequence and the byte pair coded text.
In one embodiment, the inputting the word vector and the byte pair encoded text into an encoding layer of a question error correction model to obtain an error attribute characteristic of the question text to be error-corrected output by the encoding layer specifically includes:
inputting the byte pair coded text into the first type of convolutional neural network layer to obtain position coding characteristics of the word segmentation text corresponding to the word vector output by the first type of convolutional neural network layer;
determining the aggregation characteristics corresponding to the word vectors based on the position coding characteristics of the word segmentation texts corresponding to the word vectors;
and inputting the aggregation characteristics to the encoding end to obtain error attribute characteristics corresponding to the aggregation characteristics output by the encoding end.
In one embodiment, the determining the aggregation feature corresponding to the word vector based on the position coding feature of the segmented text corresponding to the word vector specifically includes:
splicing the position coding features and the word vectors to obtain splicing features corresponding to the word vectors;
inputting the splicing features into the first type of convolutional neural network layer to obtain error position features of the question text to be corrected, which are output by the first type of convolutional neural network layer;
determining the aggregated characteristic based on the error location characteristic.
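The splicing step above can be sketched as a simple per-token concatenation. Plain Python lists stand in for real tensors, and the function name is a hypothetical illustration, not the patent's implementation:

```python
# Illustrative sketch of the splicing step: the position coding feature of each
# segmented word is concatenated with its word vector before being fed to the
# first-type CNN layer that produces the error position features.

def splice_features(position_features, word_vectors):
    """Concatenate each token's position coding feature with its word vector."""
    assert len(position_features) == len(word_vectors)
    return [p + w for p, w in zip(position_features, word_vectors)]
```

For two tokens with 2-dimensional position features and 3-dimensional word vectors, each spliced feature is 5-dimensional.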
In an embodiment, the inputting the aggregation characteristic to the encoding end to obtain an error attribute characteristic corresponding to the aggregation characteristic output by the encoding end specifically includes:
inputting the aggregation characteristics to the encoding end to obtain deep interactive characteristics corresponding to the aggregation characteristics output by the encoding end;
and inputting the deep interactive features into the encoding end to obtain the error attribute features output by the encoding end.
In one embodiment, the question error correction model is trained based on:
inputting the question text sample and the corrected question text sample into a question error correction model to be trained to obtain a first type of sample characteristics corresponding to the question text sample and a second type of sample characteristics corresponding to the corrected question text sample;
calculating a first class matching score of the first class of sample features and a second class matching score of the second class of sample features, and calculating a loss function value based on the first class matching score and the second class matching score;
and training the question error correction model to be trained based on the loss function value to obtain the question error correction model.
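The training procedure above can be sketched as follows. The patent does not specify the form of the matching scores or the loss function, so cosine similarity as the matching score and a simple margin loss are assumptions for illustration; all names are hypothetical:

```python
# Sketch of the training objective: the corrected question sample's matching
# score (second class) should exceed the erroneous question sample's matching
# score (first class) by at least a margin. Cosine similarity and the margin
# loss are assumed here; the patent does not give the concrete formulas.
import math

def cosine_score(a, b):
    """Matching score between two feature vectors (assumed cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def loss_value(first_class_features, second_class_features, target, margin=0.2):
    """Margin loss combining the first-class and second-class matching scores."""
    s1 = cosine_score(first_class_features, target)   # erroneous question sample
    s2 = cosine_score(second_class_features, target)  # corrected question sample
    return max(0.0, margin + s1 - s2)
```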
In one embodiment, the context text is an abbreviated sentence text.
In a second aspect, the present invention provides a question error correction apparatus, including:
an acquisition module, configured to acquire a question text to be corrected, which contains an error question text and a context text of the error question text, and to determine a word vector of the question text to be corrected;
the coding module is used for inputting the word vectors into a coding layer of a question error correction model to obtain error attribute characteristics of the question text to be corrected output by the coding layer;
the decoding module is used for inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be corrected, which is output by the decoding layer;
the question error correction model is obtained by training on the basis of question text samples containing error information and corrected question text samples corresponding to the question text samples.
In a third aspect, the present invention provides an electronic device, comprising a processor and a memory storing a computer program, wherein the processor implements the steps of the question error correction method according to the first aspect when executing the program.
In a fourth aspect, the present invention provides a processor-readable storage medium storing a computer program for causing a processor to execute the steps of the question error correction method of the first aspect.
The question error correction method and apparatus, electronic device, and storage medium adopt a question error correction model that includes a coding layer and a decoding layer; the coding layer includes a first-type convolutional neural network layer and the encoding end of a Transformer layer, and the decoding layer includes a second-type convolutional neural network layer and the decoding end of a Transformer layer. The question text samples used to train the model, and the corrected question text samples corresponding to them, can be selected according to the accuracy required of the model in actual training, which avoids the oversized parameter space that arises in an N-gram model when N is too large. Moreover, because the model uses word vectors, which are distributed representations, word semantics can be taken into account through them; this alleviates the high sparsity of traditional statistical word representations, allows semantic information to be taken into account in the specific scene, and improves the ability to detect and correct the positions of spelling errors.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a question error correction method according to the present invention;
FIG. 2 is a schematic structural diagram of a question error correction model provided by the present invention;
FIG. 3 is a schematic structural diagram of a question error correction apparatus according to the present invention;
fig. 4 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The existing Chinese spelling error detection and correction techniques mainly fall into two categories: rule-based methods and methods based on statistical machine learning; the latter are the most versatile in language modeling and classification. In practice, these methods are more often combined than used independently.
(1) Rule-based spelling error correction methods judge whether a sentence contains spelling errors by using grammatical knowledge such as chunks, syntax, and grammar rules. Such methods require word segmentation, chunk recognition, phrase recognition, and the like on the Chinese text; if a segmentation result cannot satisfy the syntactic or grammatical rules, an error is considered likely.
(2) Classification-based methods. These are similar to the real-word methods in English spelling correction. The main idea is to collect a large-scale corpus of correct Chinese text and, through feature engineering, screen out a set of effective features as training samples to train a classification model. Some researchers proposed a MaxEnt-based method that trains a binary maximum-entropy classifier for each Chinese character on a large-scale corpus; the idea is that each character in a sentence can be treated as a binary classification problem with a correct and a wrong class, and each character in the sentence to be corrected is checked in turn. In addition, researchers proposed the logistic regression (LR) method HANSpeller++, which uses a confusion set to replace characters in the sentence to be corrected, extracts features from all resulting new sentences, and performs two rounds of classification with LR. The first round classifies on word segmentation features, language model features, edit distance features, and dictionary features to obtain the best 20 candidate sentences; the second round introduces new features for these 20 candidates, namely Web-based features, translation features, and translation rationality features, to obtain the best 5 candidate sentences. Finally, the corrected sentence is obtained through grammar rules.
(3) Statistical language model-based methods. These methods replace characters in the original sentence using a confusion set to obtain multiple candidate sentences, score the replaced sentences with a language model, compare the highest-scoring sentence with the original one to locate the misspelled characters, and give correction suggestion characters. They exploit the closeness of the conditional probabilities between adjacent characters or words to correct errors. For example, a Chinese error correction method for search engines based on a statistical language model combines the statistical analysis result of the language model with TF/IDF for error correction. Some researchers proposed a spelling correction method combining Chinese word segmentation with bigram and trigram language models. In addition, some researchers proposed combining a language model with pattern matching: the method first performs Chinese word segmentation with the longest-match method, combines the segmentation result with an n-gram grammar to obtain the co-occurrence frequency of each word with its left and right neighbors, and considers a word possibly misspelled if this frequency falls below a certain threshold.
In the prior art, the idea of the error correction method based on a weighted noise channel model improved over the N-gram statistical language model is as follows: a Chinese spell-check algorithm based on an N-gram language model is taken as reference, and on that basis a Chinese spelling error correction algorithm based on a weighted noise channel model is proposed. The algorithm uses a trigram language model as the language model, word frequency probabilities as the conversion model, combined with confusion-word weights, and the beam search algorithm as the decoding algorithm.
The specific implementation steps are as follows:
step 1, preparing a model. Firstly, performing Unigram, bigram and Trigram segmentation according to a training corpus, and counting word frequency.
1.1, constructing a word frequency table charDict according to Unigram statistics;
1.2. Train a trigram language model trigramLM according to the given N-gram language model formula, and use it as the N-gram probability dictionary and the language model P(I) of the error detection module. The model is stored in the ARPA format produced by the SRILM toolkit (containing unigram, bigram, and trigram probabilities at once): the first line is the model identifier; the next several lines give the orders of the model and the number of model parameters at each order, separated by a blank line; then the concrete parameters of each order follow, again separated by blank lines. Each row holds one parameter in three columns: the logarithm of the probability of the n-gram, the n-gram itself, and the back-off weight of the n-gram (the maximum order has no back-off weight).
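For illustration, an ARPA-format language model file as produced by SRILM looks like the following fragment; the entries and log probabilities here are made up:

```text
\data\
ngram 1=3
ngram 2=2
ngram 3=1

\1-grams:
-1.1761	北京	-0.3010
-1.1761	销户	-0.3010
-1.4771	<unk>

\2-grams:
-0.6021	北京 销户	-0.3010
-0.9031	销户 </s>

\3-grams:
-0.4771	北京 销户 </s>

\end\
```

Each parameter row shows the three columns described above: log probability, n-gram, and back-off weight (absent at the maximum order).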
1.3. Compute the confusion-set conversion model from the word frequency table charDict and the confusion set according to the following formula:
[Formula image in the original publication; the confusion-set conversion model formula is not reproduced here.]
The word frequency table counts the number of occurrences of each character in a piece of text, and the confusion set is a set constructed for each character whose members share the same pronunciation or a similar written form. For example, the confusion set of a given character contains other characters pronounced the same as it.
1.4. Convert the confusion words into pinyin according to the confusion set, compute the weight of each confusion word using the minimum edit distance between the pinyin strings according to the following formula, and construct the weighted confusion set:
[Formula image in the original publication; the pinyin minimum-edit-distance weighting formula is not reproduced here.]
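Step 1.4 can be sketched as follows. The weighting formula itself appears only as an image in the original publication, so the inverse-distance weight below is an assumption; the Levenshtein distance computation is standard:

```python
# Sketch of step 1.4: weight each confusion word by the minimum edit
# (Levenshtein) distance between its pinyin and the original word's pinyin.
# The concrete weighting formula is not reproduced in the source, so the
# inverse-distance weight here is a hypothetical stand-in.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def confusion_weight(pinyin_orig, pinyin_confused):
    """Hypothetical weight: closer pronunciations get a higher weight."""
    return 1.0 / (1.0 + edit_distance(pinyin_orig, pinyin_confused))
```

Identical pinyin strings thus get the maximum weight of 1.0, and the weight decays as the pronunciations diverge.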
and 2, inputting. The original Sentence Sennce, error detection module based on CBNP carry out error detection according to trigram LM to obtain position information error List of error word and initialized error correction Result Correct Result.
Step 3, correction. Traverse each position entry in errorList, obtain the confusion word set conList for the marked character, and substitute each confusion word in conList in turn to obtain all candidate sentences. Compute each sentence's probability according to the following formula and take the highest-scoring sentence newSen.
[Formula image in the original publication; the sentence probability formula is not reproduced here.]
Step 4, output. Determine whether newSen and Sentence are the same; if they differ, CorrectResult records all modified characters and their position information.
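Steps 2 to 4 can be sketched as the following toy pipeline. The character log probabilities and confusion sets are made-up stand-ins for the trigram language model and the weighted confusion set, and a simple per-character score replaces the real sentence probability formula:

```python
# Toy sketch of steps 2-4: for each flagged position, substitute confusion-set
# candidates and keep the highest-scoring sentence. LOG_P and CONFUSION are
# illustrative stand-ins, not the real trigramLM or weighted confusion set.

LOG_P = {"北": -1.0, "背": -3.0, "京": -1.0, "景": -3.0}  # toy char log-probs
CONFUSION = {"背": ["北"], "景": ["京"]}                   # toy confusion sets

def sentence_score(sentence):
    """Stand-in for the language model sentence probability."""
    return sum(LOG_P.get(ch, -5.0) for ch in sentence)

def correct(sentence, error_positions):
    """Replace each flagged character with its best confusion candidate."""
    best = sentence
    for pos in error_positions:
        for cand in CONFUSION.get(best[pos], []):
            new_sen = best[:pos] + cand + best[pos + 1:]
            if sentence_score(new_sen) > sentence_score(best):
                best = new_sen
    return best
```

With these toy values, the homophone string "背景" ("background") flagged at both positions is corrected to "北京" ("Beijing"), while an already-correct input is left unchanged.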
The technical scheme has the following defects:
(1) First, a considerable amount of training text is required to determine the model parameters, and when N is large the parameter space of the model becomes too large. Table 1 shows the parameter counts of N-gram language models.
TABLE 1 Parameter counts of N-gram language models
Model                                   Number of parameters
Unigram (unary language model)          20000
Bigram (binary language model)          20000^2 = 4×10^8
Trigram (ternary language model)        20000^3 = 8×10^12
Four-gram (quaternary language model)   20000^4 = 1.6×10^17
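The counts in Table 1 follow directly from a vocabulary of V = 20000 words: an N-gram model must in principle hold a parameter for every length-N word sequence, i.e. V^N of them:

```python
# Reproduce the parameter counts of Table 1 for a 20000-word vocabulary.
V = 20000

def ngram_param_count(n):
    """Number of length-n word sequences over a vocabulary of size V."""
    return V ** n

for n, name in enumerate(["Unigram", "Bigram", "Trigram", "Four-gram"], start=1):
    print(f"{name}: {ngram_param_count(n):.1e}")
```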
(2) Second, data sparsity may cause a data smoothing problem;
(3) Finally, the N-gram model is built from discrete word units with no relation between them, so it lacks the semantic advantage enjoyed by word vectors in a continuous space: words with similar meanings have similar word vectors, so when the model adjusts its parameters for one word or word sequence, similar words and word sequences change accordingly.
(4) In addition, the prior art has no corresponding completion technique for semantically ambiguous information caused by omissions, pronouns, and the like in questions.
Therefore, the embodiment of the invention provides a question error correction method to solve the technical problem. Fig. 1 is a schematic flow chart of a question error correction method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
s1, obtaining a question text to be corrected, which contains an error question text and a context text of the error question text, and determining a word vector and a byte pair encoding text of the question text to be corrected;
s2, inputting the word vectors and the byte pair coded texts into a coding layer of a question error correction model to obtain error attribute characteristics of the question texts to be corrected, which are output by the coding layer;
s3, inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be corrected, which is output by the decoding layer;
the question error correction model is obtained by training on the basis of question text samples containing error information and corrected question text samples corresponding to the question text samples.
Specifically, in the question error correction method provided in the embodiment of the present invention, the execution subject is a server; the server may be a local server or a cloud server, and the local server may be, for example, a computer.
First, step S1 is executed to obtain a question text to be corrected, which contains an error question text and a context text of the error question text. The question text to be corrected is the question text whose errors are to be corrected; it contains an error question text and the context text of that error question text. The error question text is a question text carrying error information; the context text is a text adjacent to the error question text, which may be a complete sentence or an omitted (elliptical) sentence before or after the error question text. The error type of the error question text may be a word error, a sentence pattern error, and so on; when the context text is an omitted sentence text, the context text may also be regarded as an error question text whose error type is an omission error. For example, the question text to be corrected may be "Where can a remote logout be handled in the background? What about that one?" (in the Chinese original, "background" is a homophone error for "Beijing"). Here "Where can a remote logout be handled in the background?" is the error question text, and "What about that one?" is the context text, an abbreviated sentence text.
After the question text to be corrected is obtained, a word vector (word embedding) representation and a byte pair encoding text of the question text can be determined through word segmentation. A word vector converts a word into a distributed representation: a continuous dense vector of fixed length. In the embodiment of the present invention, the word vectors may be represented as E = {e_1, e_2, e_3, …, e_i, …, e_m}, where E denotes the set of word vectors, e_i denotes the i-th word vector, and m is the number of word vectors in the question text to be corrected. A Byte-Pair Encoding (BPE) text refers to the text obtained by applying the BPE word segmentation algorithm to the question text to be corrected.
Then, step S2 is executed: the word vectors and the byte pair encoded text are input to the encoding (Encoder) layer of the question error correction model to obtain the error attribute characteristics of the question text to be corrected output by the encoding layer. The question error correction model may include an encoding layer and a decoding (Decoder) layer, and the error attribute characteristics of the question text to be corrected can be extracted through the encoding layer. The error attribute characteristics may include error position characteristics and error type characteristics: the error position characteristics represent the location of the error information in the question text to be corrected, and the error type characteristics represent the type of the error information, such as a word error, a sentence pattern error, or an omission error.
Finally, step S3 is executed: the error attribute characteristics are input to the decoding layer of the question error correction model to obtain the corrected question text of the question text to be corrected output by the decoding layer. The corrected question text is the correct and complete text corresponding to the question text to be corrected, and may include the correct text corresponding to the error question text and, when the context text is an omitted sentence text, the complete text corresponding to the context text. For the earlier example, the corrected question text of the question text to be corrected may be "Where can a remote logout be handled in Beijing? Where can remote logouts be handled?"
In the embodiment of the present invention, the coding layer of the question error correction model includes a first type of Convolutional Neural Network (CNN) layer and the encoding end of a Transformer layer, and the decoding layer likewise includes a second type of CNN layer and the decoding end of a Transformer layer. A CNN is a class of feedforward neural networks that involve convolution computations and have a deep structure, one of the representative algorithms of deep learning. Convolutional neural networks have representation learning capability and can perform shift-invariant classification of input information according to their hierarchical structure. The Transformer abandons the conventional CNN and RNN: its entire network structure is composed of attention mechanisms, consisting solely of self-attention and feed-forward neural network layers.
The number of first-type convolutional neural network layers and Transformer layers in the coding layer can be set as needed, as can the number of second-type convolutional neural network layers and Transformer layers in the decoding layer. For example, the number of first-type CNN layers may be 4 and the number of Transformer layers in the coding layer may be 8; that is, the coding layer can be regarded as a stacked model of 4 first-type CNN layers and the encoding ends of 8 Transformer layers, i.e., a Transformer-CNN model. The number of second-type CNN layers in the decoding layer may be 4 and the number of Transformer layers in the decoding layer may be 6; that is, the decoding layer can be regarded as a stacked model of 4 second-type CNN layers and the decoding ends of 6 Transformer layers, likewise a Transformer-CNN model. In the embodiment of the present invention, in both the coding layer and the decoding layer, the dimension of the word vectors may be set to 512, the dimension of the hidden layer to 1024, and the size of the convolution window to 3.
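The hyperparameters stated above can be collected into a configuration object; the class and field names are hypothetical, while the values come from the text:

```python
# Hyperparameters of the example embodiment, gathered into one config object.
# Class and field names are illustrative; values are those stated in the text.
from dataclasses import dataclass

@dataclass
class QuestionCorrectionConfig:
    encoder_cnn_layers: int = 4        # first-type CNN layers in the encoder
    encoder_transformer_layers: int = 8
    decoder_cnn_layers: int = 4        # second-type CNN layers in the decoder
    decoder_transformer_layers: int = 6
    word_vector_dim: int = 512
    hidden_dim: int = 1024
    conv_window: int = 3

cfg = QuestionCorrectionConfig()
```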
In the embodiment of the invention, the question error correction model can be obtained by training a question error correction model to be trained through question text samples containing error information and corrected question text samples corresponding to the question text samples, the question error correction model to be trained can comprise a coding layer and a decoding layer, and both the coding layer and the decoding layer can be a Transformer-CNN model. The question text sample and the corrected question text sample corresponding to the question text sample can be selected according to the precision of the question error correction model obtained according to actual training requirements.
The question error correction method provided by the embodiment of the invention adopts a question error correction model that includes a coding layer and a decoding layer; the coding layer includes a first-type convolutional neural network layer and a Transformer layer, and the decoding layer includes a second-type convolutional neural network layer and a Transformer layer. The question text samples used to train the model, and the corrected question text samples corresponding to them, can be selected according to the accuracy required of the model in actual training, which avoids the oversized parameter space that arises in an N-gram model when N is too large. Moreover, because the model uses word vectors, which are distributed representations, word semantics can be taken into account through them; this alleviates the high sparsity of traditional statistical word representations, allows semantic information to be incorporated into the specific scene, and improves the ability to detect and correct the positions of spelling errors.
In addition, in the embodiment of the present invention, the context text may be an omitted sentence text. When the context text is an omitted sentence text, the correct text corresponding to the error question text and the complete text corresponding to the context text can both be output through the question error correction model. In this way, the question error correction model can locate and focus on errors, and correct and complete them, thereby improving the question retrieval rate and the reply accuracy. That is to say, the question error correction method provided in the embodiment of the present invention can be applied to scenarios in which, for example, a user inputs a question into an intelligent question answering system that contains a misspelling, asks multiple questions at once, or omits terms such as pronouns.
On the basis of the foregoing embodiment, the method for correcting a question error provided in the embodiment of the present invention for determining a word vector and a byte pair encoded text of a question text to be corrected specifically includes:
converting the question text to be corrected into a token form to obtain a word sequence;
and performing word segmentation processing on the word sequence based on a byte pair coding algorithm to obtain the word vector corresponding to the word sequence and the byte pair coded text.
Specifically, in the embodiment of the present invention, when determining the word vector of the question text to be corrected, to address the fact that the question may contain a spelling error and that sparse words may be input, the question text to be corrected may first be converted into token form, obtaining a word sequence X = {x_1, x_2, …, x_j, …, x_n}, where x_j is the j-th word in the question text to be corrected and n is the number of words in it. Then, word segmentation is performed on the word sequence according to the byte pair encoding (BPE) algorithm, obtaining the word vector E corresponding to the word sequence X. Because each word may be composed of several sub-word units, m ≤ n. While obtaining the word vector E corresponding to the word sequence X, the BPE text BPE_x = {x_e1, x_e2, x_e3, …, x_em} can be obtained through the BPE algorithm.
The BPE algorithm is a bottom-up compression algorithm. Words are treated as sequences of word pieces, which makes it easier to handle words that never occurred in training. In an NMT task, BPE is first used to segment the training-set words into pieces; the piece embeddings are randomly initialized and then trained in RNNs or CNNs; the pieces are then combined to obtain word vectors, after which the NMT work proceeds. Consequently, if a rare or unknown word is encountered in the training set or elsewhere, the NMT task can still be performed by directly combining its pieces.
In the embodiment of the invention, determining the word vector via the byte pair encoding algorithm addresses the problem that sparse words may appear in the input.
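The bottom-up merging described above can be illustrated with a minimal BPE sketch. This is not the patent's implementation; the toy corpus, merge count, and function names are illustrative assumptions. Each word starts as a tuple of characters, and the most frequent adjacent symbol pair is merged repeatedly:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the two symbols
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(corpus_words, num_merges):
    """Bottom-up: start from characters, learn `num_merges` merge rules."""
    words = {tuple(w): f for w, f in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# hypothetical toy corpus: word -> frequency
corpus = {"lower": 5, "lowest": 2, "newer": 6, "wider": 3}
merges, segmented = learn_bpe(corpus, 3)
```

With this corpus the learned merges are ("e","r"), then ("w","er"), then ("l","o"), so "lower" is segmented into the pieces ("lo", "wer") — rare words such as "lowest" remain representable by smaller pieces.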
On the basis of the foregoing embodiment, in the question error correction method provided in the embodiment of the present invention, inputting the word vector and the byte pair encoded text into the coding layer of the question error correction model to obtain the error attribute characteristics of the question text to be corrected output by the coding layer specifically includes:
inputting the byte pair coded text into the first type of convolutional neural network layer to obtain position coding characteristics of the word segmentation text corresponding to the word vector output by the first type of convolutional neural network layer;
determining the aggregation characteristics corresponding to the word vectors based on the position coding characteristics of the word segmentation texts corresponding to the word vectors;
and inputting the aggregation characteristics to the encoding end to obtain error attribute characteristics corresponding to the aggregation characteristics output by the encoding end.
Specifically, in the embodiment of the present invention, the coding layer may include a first type convolutional neural network layer and a Transformer encoder. To compensate for the fact that the input of the question error correction model lacks time-sequence feature information, the byte pair encoded text can be position-encoded by the first type convolutional neural network layer, obtaining the position coding features. The first type convolutional neural network layer can position-encode the encoded text by the following formulas:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

or

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position index of the i-th word vector obtained after BPE processing, PE is the position coding function, and d_model is the dimension of the i-th word vector.

The resulting position coding features can be expressed as: P = {p_1, p_2, p_3, …, p_m}.
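The sinusoidal position coding can be sketched as follows; this is a generic illustration of the standard sin/cos scheme implied here, not code from the patent, and the sequence length and dimension are arbitrary assumptions:

```python
import numpy as np

def positional_encoding(m, d_model):
    """Sinusoidal position coding P = {p_1, ..., p_m}:
    even dimensions use sin, odd dimensions use cos."""
    pos = np.arange(m)[:, None]        # position index of each word vector
    i = np.arange(d_model)[None, :]    # dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((m, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

# hypothetical sequence of m = 8 word vectors with d_model = 16
P = positional_encoding(8, 16)
```

Each row p_i injects the time-sequence information that the model input otherwise lacks; values stay in [-1, 1], so they can be added to or concatenated with the word vectors without rescaling.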
Then, the aggregation feature corresponding to the word vector can be determined from the position coding feature P of the segmented text corresponding to the word vector. The aggregation feature is obtained by again performing error positioning and feature representation on the concatenation of the word vector and the position coding feature and then applying max-pooling; it can be used to represent whether the word corresponding to the word vector contains an error.
Finally, the aggregation features are input to the Transformer encoder, which continues to extract deep interaction features and ultimately outputs the error attribute features corresponding to the aggregation features.
In the embodiment of the invention, the error attribute features of the question text to be corrected are obtained through the coding layer: shallow error features are extracted by the first type convolutional neural network layer, and deep interaction features are extracted by the Transformer encoder, making the error attribute features more accurate.
On the basis of the foregoing embodiment, in the question error correction method provided in the embodiment of the present invention, determining the aggregation characteristics corresponding to the word vectors based on the position coding characteristics of the segmented texts corresponding to the word vectors specifically includes:
splicing the position coding features and the word vectors to obtain splicing features corresponding to the word vectors;
inputting the splicing features into the first type of convolutional neural network layer to obtain error position features of the question text to be corrected, which are output by the first type of convolutional neural network layer;
determining the aggregated feature based on the error location feature.
Specifically, in the embodiment of the present invention, when determining the aggregation feature corresponding to a word vector, the position coding feature is first concatenated with the word vector to obtain the spliced feature corresponding to the word vector, so that the spliced feature is rich in position information; it can be expressed as W(x_i) = W(p_i + BPE(x_i) + b), where BPE(x_i) = x_ei. The concatenation may be direct splicing or splicing by corresponding positions.
The obtained spliced features are input into the first type CNN layer, which again performs error positioning and feature representation on the question, locating erroneous phrases through the feature representation c_i = f(w · x_{i:i+h-1} + b), where c_i is the error position feature of the i-th word vector obtained by the convolutional layer in the first type CNN layer, b is the bias of each output-layer neuron in the first type CNN layer, and f is the activation function. Since the extent of an erroneous word must be confirmed, only the h surrounding words or characters are considered in the embodiment of the present invention, where h is a preset value; for the question error correction model, after h is set, c_i participates only within the range h.
Finally, all error position features are concatenated:

c = [c_1, c_2, …, c_m]

where c is the spliced feature.
The aggregation feature can be determined by applying max-pooling to the spliced feature c.
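The convolution-then-pool step can be sketched as follows. This is an illustrative NumPy version under assumed shapes (random inputs, window h = 3, ReLU as the activation f); it is not the patent's actual network:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def error_position_features(x, w, b, h):
    """c_i = f(w . x_{i:i+h-1} + b): a 1-D convolution of window size h
    over the spliced (word vector + position coding) sequence x of shape (n, d)."""
    n, d = x.shape
    return np.array([relu(np.sum(w * x[i:i + h]) + b) for i in range(n - h + 1)])

def aggregate(c):
    """Max-pooling over the concatenated error position features c = [c_1, ..., c_m]."""
    return np.max(c)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))   # 10 spliced feature vectors of dimension 4
w = rng.normal(size=(3, 4))    # one filter covering h = 3 neighbouring positions
c = error_position_features(x, w, b=0.1, h=3)
agg = aggregate(c)             # the aggregation feature (scalar per filter)
```

Restricting each c_i to a window of h positions is what limits error localization to the h surrounding words or characters; max-pooling then keeps only the strongest error signal per filter.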
In the embodiment of the invention, a method for determining the aggregation characteristics is provided, and the obtained aggregation characteristics can be more accurate through the first-type CNN layer.
On the basis of the foregoing embodiment, the question error correction method provided in the embodiment of the present invention is a method for inputting the aggregation characteristic to the encoding end to obtain an error attribute characteristic corresponding to the aggregation characteristic output by the encoding end, and specifically includes:
inputting the aggregation characteristics to the encoding end to obtain deep interactive characteristics corresponding to the aggregation characteristics output by the encoding end;
and inputting the deep interactive features to the encoding end to obtain the error attribute features output by the encoding end.
Specifically, in the embodiment of the present invention, when determining the error attribute features, the aggregation features are input to the Transformer encoder to obtain the deep interaction features corresponding to them. The deep interaction features are then input to the Transformer encoder again, which pinpoints the error position features and error type features more accurately. For example, the features output by the first type CNN layer are encoded by the Transformer encoder: for each first type CNN layer, starting from the word level, the corresponding Transformer encoder performs interaction between words, amplifies each word vector, and simultaneously performs error detection with the self-attention mechanism to detect erroneous words, thereby obtaining the error attribute features.
In the embodiment of the invention, a method for determining the error attribute features is provided; the error attribute features obtained through the Transformer encoder are more accurate.
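The self-attention interaction between words can be sketched as single-head scaled dot-product attention. This is a generic illustration under assumed shapes and random weights, not the patent's trained model:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: every position attends to every other,
    letting the encoder amplify word-word interactions when locating errors."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d_k = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k))   # (m, m) attention weights
    return attn @ v, attn

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))                  # 5 aggregated features, dimension 8
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
```

A word whose attention pattern is inconsistent with its context yields distinctive attention weights, which is the signal the self-attention error detection described above exploits.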
On the basis of the above embodiment, in the question error correction method provided in the embodiment of the present invention, the question error correction model is trained based on the following method:
inputting the question text sample and the corrected question text sample into a question error correction model to be trained to obtain a first type of sample characteristics corresponding to the question text sample and a second type of sample characteristics corresponding to the corrected question text sample;
calculating a first class matching score of the first class of sample features and a second class matching score of the second class of sample features, and calculating a loss function value based on the first class matching score and the second class matching score;
and training the question error correction model to be trained based on the loss function value to obtain the question error correction model.
Specifically, in the embodiment of the invention, when training the question error correction model, the question text sample and the corrected question text sample are first input into the question error correction model to be trained, obtaining the first type sample features corresponding to the question text sample and the second type sample features corresponding to the corrected question text sample.
A first class matching score p is then calculated for the first type sample features, and a second class matching score μ for the second type sample features, and the loss function value is calculated from p and μ. The loss function can be expressed as:

L(θ_crt) = − Σ_{(p, μ) ∈ S*} log P(p | μ; θ_crt)

where L(θ_crt) is the loss function value, S* is the set of matching scores, θ_crt is the parameter to be trained in the question error correction model to be trained, and P(p | μ; θ_crt) is the probability of the value p given μ under θ_crt.
The question error correction model to be trained is trained on the basis of the loss function value, through multiple rounds of correction, until the loss function converges, yielding the question error correction model.
Fig. 2 is a schematic structural diagram of a question error correction model according to an embodiment of the present invention. As shown in Fig. 2, the question error correction model includes a BPE layer 1, an encoding layer (Encoder layer) 2, and a decoding layer (Decoder layer) 3; the encoding layer 2 includes a first type CNN layer 21 and a Transformer encoder 22, the encoder 22 includes a multi-head self-attention layer 221 and a forward propagation layer 222, and the decoding layer 3 includes a forward propagation layer 31, an encoding-decoding attention layer 32, and a self-attention layer 33.
In the encoding layer 2, the encoder 22 may include an add-and-normalize layer between the multi-head self-attention layer 221 and the forward propagation layer 222, and another after the forward propagation layer 222.
In summary, the question error correction method provided in the embodiment of the present invention improves on conventional N-gram-based models, which use discrete unit words without considering sentence word vectors and semantic information. In the embodiment of the invention, a CNN model obtains the time-sequence features and position features of the words in a sentence, which are finally spliced and pooled into aggregation features. For sentence spelling error detection, the aggregation feature vector output by the CNN is encoded with a Transformer model, whose self-attention mechanism is used to detect errors. To address spelling correction, and the inability of traditional sentence correction algorithms to attend to information the question may omit, the embodiment of the invention further uses the Transformer model in combination with the context question to perform multiple rounds of error correction and question information completion.
Fig. 3 is a schematic structural diagram of a question error correction apparatus provided in an embodiment of the present invention, and as shown in fig. 3, the apparatus includes:
an obtaining module 31, configured to obtain a question text to be corrected, which includes an incorrect question text and a context text of the incorrect question text, and determine a word vector of the question text to be corrected;
the encoding module 32 is configured to input the word vector to an encoding layer of a question error correction model, so as to obtain an error attribute feature of the question text to be error-corrected, which is output by the encoding layer;
the decoding module 33 is configured to input the error attribute feature to a decoding layer of the question error correction model, so as to obtain a corrected question text of the question text to be error-corrected, which is output by the decoding layer;
the question error correction model is obtained by training on the basis of question text samples containing error information and correction question text samples corresponding to the question text samples.
On the basis of the foregoing embodiment, in the question error correction apparatus provided in the embodiment of the present invention, the obtaining module is specifically configured to:
converting the question text to be corrected into a token form to obtain a word sequence;
and performing word segmentation processing on the word sequence based on a byte pair coding algorithm to obtain the word vector corresponding to the word sequence and the byte pair coded text.
On the basis of the foregoing embodiment, in the question error correction apparatus provided in the embodiment of the present invention, the encoding module is specifically configured to:
inputting the byte pair coded text into the first type of convolutional neural network layer to obtain position coding characteristics of the word segmentation text corresponding to the word vector output by the first type of convolutional neural network layer;
determining aggregation characteristics corresponding to the word vectors based on position coding characteristics of the word segmentation texts corresponding to the word vectors;
and inputting the aggregation characteristics to the encoding end to obtain error attribute characteristics corresponding to the aggregation characteristics output by the encoding end.
On the basis of the foregoing embodiment, in the question error correction apparatus provided in the embodiment of the present invention, the encoding module is further specifically configured to:
splicing the position coding features and the word vectors to obtain splicing features corresponding to the word vectors;
inputting the splicing characteristics to the first type of convolutional neural network layer to obtain error position characteristics of the question text to be corrected, which are output by the first type of convolutional neural network layer;
determining the aggregated feature based on the error location feature.
On the basis of the foregoing embodiment, in the question error correction apparatus provided in the embodiment of the present invention, the encoding module is further specifically configured to:
inputting the aggregation characteristics to the encoding end to obtain deep interactive characteristics corresponding to the aggregation characteristics output by the encoding end;
and inputting the deep interactive features into the encoding end to obtain the error attribute features output by the encoding end.
On the basis of the above embodiment, the question error correction apparatus provided in the embodiment of the present invention further includes a training module, configured to:
inputting the question text sample and the corrected question text sample into a question error correction model to be trained to obtain a first type of sample characteristics corresponding to the question text sample and a second type of sample characteristics corresponding to the corrected question text sample;
calculating a first class matching score of the first class of sample features and a second class matching score of the second class of sample features, and calculating a loss function value based on the first class matching score and the second class matching score;
and training the question error correction model to be trained based on the loss function value to obtain the question error correction model.
On the basis of the foregoing embodiment, in the question error correction apparatus provided in the embodiment of the present invention, the context text is an omitted sentence pattern text.
Specifically, the functions of the modules in the question-sentence correcting apparatus provided in the embodiment of the present invention correspond to the operation flows of the steps in the above method embodiments one to one, and the achieved effects are also consistent.
Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor (processor) 410, a Communication Interface (Communication Interface) 420, a memory (memory) 430 and a Communication bus 440, wherein the processor 410, the Communication Interface 420 and the memory 430 are communicated with each other via the Communication bus 440. The processor 410 may call the computer program in the memory 430 to execute the steps of the question error correction method provided in the above embodiments, for example, including: acquiring a question text to be corrected, which contains an error question text and a context text of the error question text, and determining a word vector and a byte pair encoding text of the question text to be corrected; inputting the word vectors and the byte pair coded texts into a coding layer of a question error correction model to obtain error attribute characteristics of the question texts to be corrected, which are output by the coding layer; inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be corrected, which is output by the decoding layer; the question error correction model is obtained by training on the basis of question text samples containing error information and correction question text samples corresponding to the question text samples.
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, which includes a computer program stored on a non-transitory computer readable storage medium, the computer program including program instructions, when the program instructions are executed by a computer, the computer being capable of executing the steps of the question sentence correcting method provided in the above embodiments, for example, including: acquiring a question text to be corrected, which contains an error question text and a context text of the error question text, and determining a word vector and a byte pair encoding text of the question text to be corrected; inputting the word vectors and the byte pair coded texts into a coding layer of a question error correction model to obtain error attribute characteristics of the question texts to be corrected, which are output by the coding layer; inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be corrected, which is output by the decoding layer; the question error correction model is obtained by training on the basis of question text samples containing error information and corrected question text samples corresponding to the question text samples.
On the other hand, an embodiment of the present application further provides a processor-readable storage medium, where the processor-readable storage medium stores a computer program, where the computer program is configured to cause the processor to execute the steps of the question error correction method provided in each of the above embodiments, and the steps include, for example: acquiring a question text to be corrected, which contains an error question text and a context text of the error question text, and determining a word vector and a byte pair encoding text of the question text to be corrected; inputting the word vectors and the byte pair coded texts into a coding layer of a question error correction model to obtain error attribute characteristics of the question texts to be corrected, which are output by the coding layer; inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be error corrected, which is output by the decoding layer; the question error correction model is obtained by training on the basis of question text samples containing error information and corrected question text samples corresponding to the question text samples.
The processor-readable storage medium may be any available media or data storage device that can be accessed by a processor, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), solid State Disks (SSDs)), etc.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A question error correction method, comprising:
acquiring a question text to be corrected, which contains an error question text and a context text of the error question text, and determining a word vector and a byte pair encoding text of the question text to be corrected;
inputting the word vectors and the byte pair coding texts into a coding layer of a question error correction model to obtain error attribute characteristics of the question texts to be corrected, which are output by the coding layer;
inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be error corrected, which is output by the decoding layer;
the question error correction model is obtained by training on the basis of question text samples containing error information and corrected question text samples corresponding to the question text samples.
2. The question error correction method according to claim 1, wherein the determining of the word vector and the byte pair encoded text of the question text to be error corrected specifically comprises:
converting the question text to be corrected into a token form to obtain a word sequence;
and performing word segmentation processing on the word sequence based on a byte pair coding algorithm to obtain the word vector corresponding to the word sequence and the byte pair coded text.
3. The question error correction method according to claim 1, wherein the step of inputting the word vector and the byte pair encoded text into an encoding layer of a question error correction model to obtain the error attribute characteristics of the question text to be corrected output by the encoding layer specifically comprises:
inputting the byte pair coded text into the first type of convolutional neural network layer to obtain position coding characteristics of the word segmentation text corresponding to the word vector output by the first type of convolutional neural network layer;
determining the aggregation characteristics corresponding to the word vectors based on the position coding characteristics of the word segmentation texts corresponding to the word vectors;
and inputting the aggregation characteristics to the encoding end to obtain error attribute characteristics corresponding to the aggregation characteristics output by the encoding end.
4. The question error correction method according to claim 3, wherein the determining the aggregation characteristic corresponding to the word vector based on the position coding characteristic of the participle text corresponding to the word vector specifically comprises:
splicing the position coding features and the word vectors to obtain splicing features corresponding to the word vectors;
inputting the splicing characteristics to the first type of convolutional neural network layer to obtain error position characteristics of the question text to be corrected, which are output by the first type of convolutional neural network layer;
determining the aggregated feature based on the error location feature.
5. The question error correction method according to claim 3, wherein the inputting the aggregation characteristic to the encoding end to obtain an error attribute characteristic corresponding to the aggregation characteristic output by the encoding end specifically comprises:
inputting the aggregation characteristics to the encoding end to obtain deep interactive characteristics corresponding to the aggregation characteristics output by the encoding end;
and inputting the deep interactive features into the encoding end to obtain the error attribute features output by the encoding end.
6. The question error correction method according to any one of claims 1 to 5, characterized in that the question error correction model is trained on the basis of:
inputting the question text sample and the corrected question text sample into a question error correction model to be trained to obtain a first type of sample characteristics corresponding to the question text sample and a second type of sample characteristics corresponding to the corrected question text sample;
calculating a first class matching score of the first class of sample features and a second class matching score of the second class of sample features, and calculating a loss function value based on the first class matching score and the second class matching score;
and training the question error correction model to be trained based on the loss function value to obtain the question error correction model.
7. The question error correction method according to any one of claims 1 to 5, characterized in that the context text is an omitted sentence text.
8. A question error correction apparatus, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a question text to be corrected, which contains an error question text and a context text of the error question text, and determining a word vector of the question text to be corrected;
the coding module is used for inputting the word vectors into a coding layer of a question error correction model to obtain error attribute characteristics of the question text to be corrected output by the coding layer;
the decoding module is used for inputting the error attribute characteristics to a decoding layer of the question error correction model to obtain a corrected question text of the question text to be corrected, which is output by the decoding layer;
the question error correction model is obtained by training on the basis of question text samples containing error information and corrected question text samples corresponding to the question text samples.
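The three modules of the apparatus in claim 8 form a pipeline: acquisition (text to word vectors), encoding (word vectors to error attribute features), decoding (features to corrected text). The class below is a minimal sketch of that flow only; the embedding, encoder, and decoder passed in are hypothetical stubs, whereas the patent's actual layers are learned networks.

```python
class QuestionCorrectionApparatus:
    """Minimal sketch of claim 8's apparatus: acquisition, encoding,
    and decoding modules wired in sequence. The callables passed in
    stand in for the learned embedding/encoder/decoder layers."""

    def __init__(self, embed, encode, decode):
        self.embed = embed    # acquisition: text -> word vectors
        self.encode = encode  # encoding layer: vectors -> error attribute features
        self.decode = decode  # decoding layer: features -> corrected text

    def correct(self, erroneous_question, context_text):
        # Per the claim, the question text to be corrected contains both
        # the erroneous question text and its context text.
        text = context_text + " " + erroneous_question
        word_vectors = self.embed(text)
        error_features = self.encode(word_vectors)
        return self.decode(error_features)

# Toy stubs: tokens as "vectors", and a decoder that fixes one known typo
toy = QuestionCorrectionApparatus(
    embed=lambda text: text.lower().split(),
    encode=lambda vecs: tuple(vecs),
    decode=lambda feats: "how do i check my balance"
                         if "balence" in feats else " ".join(feats),
)
```

A real implementation would replace the three lambdas with trained modules; the point of the sketch is that each module consumes exactly what the previous one produces, matching the claim's wiring.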
9. An electronic device comprising a processor and a memory storing a computer program, wherein the processor implements the steps of the question error correction method according to any one of claims 1 to 7 when executing the computer program.
10. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program for causing a processor to execute the steps of the question error correction method according to any one of claims 1 to 7.
CN202110850150.1A 2021-07-27 2021-07-27 Question error correction method and device, electronic equipment and storage medium Pending CN115688748A (en)

Priority Applications (1)

Application Number: CN202110850150.1A
Priority Date: 2021-07-27
Filing Date: 2021-07-27
Title: Question error correction method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN115688748A
Publication Date: 2023-02-03

Family ID: 85059334

Country Status (1)

Country Link
CN (1) CN115688748A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115859951A (en) * 2023-02-28 2023-03-28 环球数科集团有限公司 Content error correction system for AIGC
CN115859951B (en) * 2023-02-28 2023-05-02 环球数科集团有限公司 Content error correction system for AIGC

Similar Documents

CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN111291195B (en) Data processing method, device, terminal and readable storage medium
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111651589B (en) Two-stage text abstract generation method for long document
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN112100354A (en) Man-machine conversation method, device, equipment and storage medium
CN110134950B (en) Automatic text proofreading method combining words
CN106611041A (en) New text similarity solution method
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN111651978A (en) Entity-based lexical examination method and device, computer equipment and storage medium
US11170169B2 (en) System and method for language-independent contextual embedding
CN100361124C (en) System and method for word analysis
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
Fusayasu et al. Word-error correction of continuous speech recognition based on normalized relevance distance
CN109815497B (en) Character attribute extraction method based on syntactic dependency
EP2759945A2 (en) Sampling and optimization in phrase-based machine translation using an enriched language model representation
KR20150092879A (en) Language Correction Apparatus and Method based on n-gram data and linguistic analysis
CN115688748A (en) Question error correction method and device, electronic equipment and storage medium
CN114548113A (en) Event-based reference resolution system, method, terminal and storage medium
CN115994544A (en) Parallel corpus screening method, parallel corpus screening device, and readable storage medium
KR102354898B1 (en) Vocabulary list generation method and device for Korean based neural network language model
CN111428475A (en) Word segmentation word bank construction method, word segmentation method, device and storage medium
CN112183117A (en) Translation evaluation method and device, storage medium and electronic equipment
Lyon et al. Reducing the Complexity of Parsing by a Method of Decomposition.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination