CN114495114A - Text sequence identification model calibration method based on CTC decoder - Google Patents
- Publication number
- CN114495114A (application CN202210402975.1A)
- Authority
- CN
- China
- Prior art keywords
- context
- character
- label
- sequence
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2193—Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
Abstract
The invention discloses a text sequence recognition model calibration method based on a CTC decoder, which comprises the following steps: inputting a text image support set into a training model to be calibrated to obtain a text sequence recognition result; calculating a context confusion matrix from the text sequence recognition results of the text image support set, the context confusion matrix representing the context distribution relation between predicted characters at adjacent time steps in a sequence; according to the context confusion matrix, selectively and adaptively adjusting the smoothing strength in label smoothing using the context-dependent prediction distribution, so as to achieve adaptive calibration of the sequence confidence; and retraining the model based on the context-selective loss function, and outputting the predicted text sequence and the calibrated confidence. The method extends label smoothing to text sequence recognition models based on a CTC decoder, introduces the context relation between characters in a sequence, and adaptively calibrates the predicted sequence, so that the confidence the model outputs for a predicted text is more accurate.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and text sequence processing, and particularly relates to a text sequence recognition model calibration method based on a CTC decoder.
Background
With the development of deep learning, deep neural network models have been deployed in fields such as medicine, transportation, and finance owing to their high prediction accuracy. For example, medical image recognition models provide auxiliary evidence for doctors diagnosing an illness; object detection and recognition models give vehicles intelligent analysis capability for controlling speed or direction; and OCR (optical character recognition) models provide strong support for digitizing financial document entry. However, as deep models spread into these fields, their potential risks are gradually being exposed. Scene text images are one of the data forms that exist widely in daily scenes across industries and fields, for example in medical diagnosis reports, examination orders, and financial systems. Compared with unstructured data such as ordinary single-frame images and isolated characters, structured sequence data is harder to predict, and obtaining and judging its reliability is more complicated.
Currently, confidence is one of the most direct indicators for evaluating the reliability of a prediction; the model prediction score is generally normalized into a probability and taken as its confidence. A reliable confidence accurately reflects the accuracy of the prediction: when the model is not sure about a prediction and outputs a relatively low confidence, manual intervention can be invoked to ensure the task is performed safely. However, it has been found that the output confidence of many existing deep neural network models is not calibrated; instead, there is an over-confidence problem in which the output confidence is higher than the accuracy. This miscalibration has several causes. On one hand, as model structures grow large and complex, the high fitting capacity brought by a large number of parameters leads to over-fitting; for over-fitted prediction label categories, the model tends to assign high confidence even to erroneous predictions. Moreover, the one-hot-encoding-based loss function and the softmax confidence computation enlarge the distance between positive and negative prediction samples; although this makes selecting correct samples convenient, it easily drives the prediction confidence toward over-confidence. On the other hand, the distributions of training data and test data differ, and it is difficult for a model to give a reliable confidence when it must deal with data never or rarely seen in the training set in real-world scenarios.
Due to the complex structure of text sequences, calibrating a scene text recognition model is also very difficult. Specifically, first, a text sequence is usually composed of multiple characters, and its confidence space grows as the number of characters increases. Second, text recognition is usually a time-sequential process in which the context relation between characters is important prior information, and the strength of the context dependence differs between characters. Therefore, calibration at the sequence level is difficult to achieve by simply calibrating all characters uniformly.
However, most existing confidence calibration methods are mainly designed for unstructured, simple data. These calibration methods can be broadly divided into two categories: post-processing calibration and calibration during prediction-model training. The post-processing approach typically learns a confidence-related regression function on a hold-out data set and transforms the output confidence. The calibration methods proposed earlier for traditional classifiers in machine learning are mostly based on the post-processing idea, such as Platt scaling, isotonic regression, and histogram binning. In deep learning, temperature scaling was proposed based on Platt scaling, calibrating confidence by introducing a temperature parameter. Calibration during prediction-model training generally adjusts the deep model directly; it mainly targets the over-confidence caused by over-fitting, and calibrates the model by alleviating over-fitting through dropout, label smoothing loss, entropy regularization, and the like. In addition, from the data side, some methods apply augmentation to the training data during training, such as MixUp and GAN-based methods. However, these methods either ignore the heterogeneous distribution of different data classes in the data set, or consider only the correlation between a local single prediction and its true label, neglecting the length and intrinsic context-dependent characteristics of sequence data; they are therefore difficult to migrate directly to confidence calibration of sequence data. A dedicated calibration design based on the characteristics of sequence data is thus needed to improve the calibration of sequence confidence.
Disclosure of Invention
In view of the above, it is necessary to provide a text sequence recognition model calibration method based on a CTC decoder to solve the technical problem of confidence calibration for scene text recognition models. The method revisits the essence of label smoothing: its effectiveness mainly comes from adding a Kullback-Leibler (KL) divergence term as a regularizer to the original loss function. Considering the context dependence inside a sequence, the context relation between characters is modeled in the form of a confusion matrix and used as prior language knowledge to guide the label probability distribution, and the smoothing strengths of different label classes are adaptively adjusted according to their context prediction error rates.
The invention discloses a text sequence recognition model calibration method based on a CTC decoder, which comprises the following steps:
step 1, inputting a text image support set into a training model to be calibrated to obtain a text sequence recognition result;
step 2, calculating a context confusion matrix from the text sequence recognition results of the text image support set, the context confusion matrix representing the context distribution relation between predicted characters at adjacent time steps in the sequence;
step 3, according to the context confusion matrix, selectively and adaptively adjusting the smoothing strength in label smoothing using the context-dependent prediction distribution, so as to achieve adaptive calibration of the sequence confidence;
and step 4, retraining the model based on the context-selective loss function, and finally outputting the predicted text sequence and the calibrated confidence.
Specifically, the process of computing the context confusion matrix comprises the following steps (the mathematical symbols were lost in extraction and are reconstructed here in a consistent notation):
initializing, for each of the K prediction classes i (i being the index of the class of the previous-time predicted character), a K-by-K context confusion matrix C^i whose elements are all 0;
aligning the text sequence recognition result Y = (y_1, ..., y_n) of the text image support set with the corresponding true label L = (l_1, ..., l_m), where n is the length of the recognition result and m is the length of the true label sequence;
if the recognition result is aligned with the true label, then, knowing that the previous-time character belongs to the class with index i, directly counting the case where the current-time character with true label class j is predicted as class k into element C^i(j,k) of the context confusion matrix; each element C^i(j,k) thus records the number of times that, given the previous-time predicted character belongs to class i, a current-time character whose true label belongs to class j is predicted as the k-th class label; for the character at the head of the text, the class of its preceding character is set to the space class by default;
if the recognition result is not aligned with the true label, the operation sequence transforming the predicted sequence into the true label is computed through the edit distance to obtain the alignment relation between the sequences, and the context confusion matrix is then obtained by counting as above.
Preferably, the process of obtaining the alignment relation between the sequences requires performing the following operations several times: deleting one character, inserting one character, or substituting one character, until the characters are correctly aligned. The delete operation corrects the case where a null symbol in the true label sequence is wrongly predicted as another character; the insert operation corrects the case where a character in the true label sequence is predicted as the null symbol; and the substitute operation corrects the case where a character in the true label sequence is predicted as a different character.
Specifically, selectively and adaptively changing the smoothing strength in label smoothing using the context-dependent prediction distribution in step 3 means that the smoothing strength is adaptively adjusted according to the context relation, and the label probability is adjusted to obtain the selective context-aware probability distribution (reconstructed notation):
q(k) = 1 if k = j and 0 otherwise, when j does not belong to S_i;
q(k) = (1 − ε)·[k = j] + ε·C^i(j,k) / Σ_k' C^i(j,k'), when j belongs to S_i;
where S_i denotes the error-prone set when the class i of the previous-time character is known; C^i(j,k) denotes the number of times the current character with label class j is predicted as class k given that the previous-time character belongs to class i; k indexes the classes; and ε denotes the smoothing strength. When the label class i of the previous character is known, it is first confirmed whether the label class j of the current character belongs to the error-prone set S_i; if not, no label smoothing is applied to the prediction; otherwise, the smoothing strength is adaptively adjusted according to the error rate;
the error-prone set is obtained as follows: for each of the K possible previous-time classes, according to the frequency with which predictions fall into each class in the context confusion matrix, the error-prone set S_i of classes for a previous-time character belonging to class i is obtained by counting; the division rule is:
acc_i(j) = C^i(j,j) / Σ_k C^i(j,k)
where acc_i(j) represents the prediction accuracy at the current time when the previous-time character belongs to class i and the true label is class j; if the class error rate 1 − acc_i(j) is greater than a set threshold θ, the corresponding class j is placed into the error-prone set S_i, so that once the class of the previous label is known, the corresponding error-prone set is obtained.
More specifically, the context-selective loss function described in step 4 is (reconstructed notation):
L = (1 − ε)·L_CTC + ε·Σ_t KL(q_t ‖ p_t)
where q_t denotes the selective context-aware probability distribution of the label class indexed at time t; p_t denotes the probability vector over all K label classes at time t; L_CTC denotes the CTC loss; KL denotes the KL divergence; p_t(k) represents the probability of the predicted class label k at time t, and q_t(j) represents the probability of the true class label j.
Compared with the prior art, the invention has the following beneficial effects:
the method extends label smoothing to text sequence recognition models based on a CTC decoder, introduces the context relation between characters in a sequence, and adaptively calibrates the predicted sequence, which improves the calibration performance of the text sequence recognition model and makes the confidence the model outputs for a predicted text more accurate.
Drawings
FIG. 1 shows a schematic flow diagram of a method embodying the present invention;
FIG. 2 is a schematic diagram showing the operation of modules according to an embodiment of the present invention;
fig. 3 shows a schematic flow of an alignment policy in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For ease of reference and clarity, the technical terms, acronyms, and abbreviations used hereinafter are summarized as follows:
CTC: connectionsist Temporal Classification (Link definition Temporal classifier)
KL divergence: Kullback-Leibler divergence
NLL: negative Log-Likelihood (Negative Log Likelihood)
LS: label smoothening (Label Smoothing)
CASSLS: context-aware Selective Label Smoothing
The invention provides a text sequence recognition model calibration method based on a CTC decoder, which aims to solve the above problems in the prior art.
Fig. 1 shows a schematic flow diagram of an embodiment of the invention. A text sequence identification model calibration method based on a CTC decoder comprises the following steps:
inputting the text image support set into a training model to be calibrated to obtain a text sequence recognition result;
calculating a context confusion matrix by using a text sequence identification result of the text image support set, wherein the context confusion matrix is used for representing the context distribution relation between predicted characters at adjacent moments in the sequence;
according to the context confusion matrix, selectively carrying out self-adaptive change on the smooth intensity in the label smoothing by utilizing context correlation prediction distribution so as to realize self-adaptive calibration of the sequence confidence coefficient;
and retraining the training model to be calibrated based on the context selective loss function, and finally outputting the predicted text sequence and the calibrated confidence coefficient.
Specifically, the present embodiment adopts the following steps to implement the inventive method.
Step 1, constructing a support data set, inputting the support data set into a corresponding scene text recognition pre-training model, and obtaining a recognition result, namely a corresponding text sequence.
The data distribution of the support set needs to be similar to that of the training set; generally, a validation set or part of the training set of the reference data set is selected as the support set. Here, the training data sets of IIIT5k, SVT, IC03, IC13, and IC15 are selected as the support set. The data to be tested are input into the corresponding scene text recognition pre-training model, and model prediction yields the corresponding predicted sequences. The confusion matrix is constructed in the next step.
Step 2, acquiring the sequence context prediction distribution relation using the support set prediction results, and representing it in the form of a confusion matrix as the context modeling output.
In step 2, for the input data, based on the predicted text Y = (y_1, ..., y_n) and the corresponding true label L = (l_1, ..., l_m) (n and m being the lengths of the corresponding text sequences), the context relation between predictions at adjacent times in the sequence is obtained; that is, given the class to which the character prediction at the previous time belongs, the probability that the character prediction at the next time belongs to each class has a certain correlation with it.
If the recognition result is aligned with the true label, then, knowing that the previous-time character belongs to the class with index i, the case where the current-time character with true label class j is predicted as class k is directly counted into the context confusion matrix C^i, whose element C^i(j,k) represents the number of times that, given the previous-time predicted character belongs to class i, a current-time character whose true label belongs to class j is predicted as the k-th class label. Specifically, for the character at the head of the text, the class of the preceding character is set to the space class by default.
The specific construction is as follows. First, for each prediction class, a confusion matrix whose elements are all 0 is initialized; for scene text recognition, the class number is initialized to K = 37 (containing 10 digits, 26 English letters, and 1 space class). Suppose there is a predicted sequence "cat" whose true label is "cat": the character class at the time immediately preceding the first character "c" is the space class, and the true character label "c" is correctly predicted as "c" at the current time, so one is added to the element at the position representing label "c" and prediction "c" in the confusion matrix of the space class. All samples in the support set are counted in the same way, finally yielding confusion matrices representing the context prediction frequency distributions of the different previous-time classes. FIG. 2 shows the context confusion matrices for the previous-time characters "3", "A", and "V"; for previous-time predictions of different classes, the distribution of the classes of the current character prediction differs, so a differentiated calibration operation is required.
Considering the situation where the predicted sequence is not aligned with its true sequence label due to prediction errors, the edit distance is used to compute the operation sequence between the true sequence and the predicted sequence, giving the alignment relation between them. The specific alignment strategy is shown in fig. 3. To realize one-to-one correspondence between the characters of the predicted text and the real text, the operations comprise (1) deleting one character (d); (2) inserting one character (i); (3) substituting one character (s); a position requiring no operation is indicated by the symbol "-". Taking the predicted sequence "lapaitmen" as an example, transforming it into the true label "apartment" requires the following operations: delete the character "l", substitute the character "i" with "r", and insert the character "t". Accordingly, in the process of counting the confusion matrix, a delete operation indicates that a null symbol label "#" in the true sequence was mispredicted as another character ("l"), and an insert operation indicates that the corresponding label "t" in the true sequence was predicted as the null symbol "#".
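The alignment strategy described above can be sketched as a standard Levenshtein backtrace; the operation symbols d/i/s/"-" follow the description in this step, while the function name and tie-breaking order are our own choices:

```python
def align_ops(pred, label):
    """Backtrace a Levenshtein table into an operation string:
    '-' match, 's' substitute, 'd' delete a spurious predicted character,
    'i' insert a label character the prediction missed."""
    n, m = len(pred), len(label)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if pred[i - 1] == label[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,          # delete from prediction
                          D[i][j - 1] + 1,          # insert label character
                          D[i - 1][j - 1] + cost)   # match / substitute
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (pred[i - 1] != label[j - 1]):
            ops.append('-' if pred[i - 1] == label[j - 1] else 's')
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append('d')
            i -= 1
        else:
            ops.append('i')
            j -= 1
    return ''.join(reversed(ops))
```

For the example pair, `align_ops("lapaitmen", "apartment")` yields exactly one delete, one substitute, and one insert among seven matches, as in the description.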
Step 3, using the confusion matrix to adaptively adjust label smoothing according to the context relation, and introducing a penalty term to realize adaptive calibration of the sequence confidence.
In step 3, for the CTC decoder, the optimization goal is the maximum likelihood of the sequence probability, which can be defined as follows (reconstructed notation):
p(Y|X) = Π_{t=1..T} p(π_t|X), with Y = B(π)
where p(Y|X) is the probability of outputting Y for a given input X, T is the total decoding step size, p(π_t|X) denotes the confidence of the character at time t in the decoding path π, and B denotes the mapping rule collapsing a path into a label sequence. The probability of the decoded path is directly taken as the confidence of the predicted sequence.
The label smoothing strategy, typically used with the cross-entropy loss, is then generalized to the CTC loss, and the label smoothing loss is derived. The label-smoothed probability distribution can be expressed as (reconstructed notation):
q'(k) = (1 − ε)·q(k) + ε·u(k)
where q'(k) is the smoothed label probability, ε is the smoothing factor, q(k) is the one-hot label distribution (1 if class k is the true label and 0 otherwise, i.e. a Dirac function), and u(k) is the uniform distribution of label probability over all K label classes, with value u(k) = 1/K. Substituting the above into a general text sequence recognition loss function gives:
L = −Σ_{t=1..T} Σ_k q'(k) log p_t(k)
where p_t(k) is the probability of predicting class k at time t, and T is the total prediction decoding step size.
This loss function can be decoupled into the sum of the standard negative log-likelihood (NLL) loss and a KL divergence term:
L = (1 − ε)·L_NLL + ε·Σ_{t=1..T} KL(u ‖ p_t) + c
where c = ε·T·log K is a constant term that has no effect in gradient backpropagation and can be neglected.
Since the overall loss is driven toward zero, the KL divergence penalty term can be understood as requiring the predicted probability distribution to stay close to the uniform distribution, thereby preventing the prediction probability from drifting toward over-confidence. Thus, although a CTC decoder-based text sequence recognition model does not one-hot encode the true labels, its core optimization goal is still maximization of the sequence confidence, and the CTC loss combined with standard label smoothing can be defined as:
L_LS = (1 − ε)·L_CTC + ε·Σ_{t=1..T} KL(u ‖ p_t)
where L_CTC is the CTC loss. The smoothing factor ε, as the weight of the penalty term, controls the strength of the calibration; in practice ε is set to a small fixed value.
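The label-smoothed CTC objective above can be sketched numerically as follows, assuming the CTC negative log-likelihood is computed elsewhere and the per-frame probabilities of the decoded path are available; eps = 0.05 is an illustrative value, not the patent's disclosed setting:

```python
import numpy as np

def ls_ctc_loss(ctc_nll, frame_probs, eps=0.05):
    """Label-smoothed CTC objective: (1 - eps) * L_CTC plus eps times the
    KL divergence between the uniform distribution u and the per-frame
    prediction p_t, summed over the decoding steps."""
    K = frame_probs.shape[1]
    u = np.full(K, 1.0 / K)
    kl = np.sum(u * (np.log(u) - np.log(frame_probs)), axis=1).sum()
    return (1 - eps) * ctc_nll + eps * kl
```

When the per-frame predictions are exactly uniform the KL term vanishes and the loss reduces to (1 − eps) times the CTC loss; any sharper prediction pays a positive penalty, which is the anti-over-confidence effect described above.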
The sequence context is further introduced into label smoothing. First, an error-prone set is screened out, and label smoothing is performed only on error-prone classes. For each of the possible previous-time prediction classes, according to the frequency with which predictions fall into each class in the corresponding confusion matrix, the error-prone set S_i of classes for a previous-time character belonging to class i can be obtained by counting. The division rule is:
acc_i(j) = C^i(j,j) / Σ_k C^i(j,k)
where acc_i(j) represents the prediction accuracy at the current time when the previous-time character belongs to class i and the true label is class j. If the class error rate 1 − acc_i(j) is greater than the set threshold θ, the corresponding class j is placed into the error-prone set S_i, so that once the class of the previous label is known, the corresponding error-prone set is obtained; in practice θ is set to a fixed value.
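A sketch of the error-prone-set screening under our reconstructed notation (the threshold value 0.4, the function name, and the row/column convention are illustrative only):

```python
import numpy as np

def error_prone_set(C_i, theta=0.4):
    """Given the confusion matrix C_i for previous-class context i
    (rows: true label j, cols: predicted class k), put class j into the
    error-prone set S_i when its error rate 1 - acc_i(j) exceeds theta."""
    totals = C_i.sum(axis=1)
    correct = np.diag(C_i)
    S = set()
    for j in range(C_i.shape[0]):
        if totals[j] > 0 and 1.0 - correct[j] / totals[j] > theta:
            S.add(j)
    return S
```

Only classes in the returned set receive label smoothing in the next step; classes never observed in a context (zero row total) are skipped.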
Further, the smoothing strength is adaptively adjusted according to the context relation given by the confusion matrix obtained in step 2. Adjusting the label probability yields the selective context-aware (CASSLS) probability distribution:
q(k) = 1 if k = j and 0 otherwise, when the current label class j does not belong to S_i;
q(k) = (1 − ε)·[k = j] + ε·C^i(j,k) / Σ_k' C^i(j,k'), when j belongs to S_i.
When the label class i of the previous character is known, it is first confirmed whether the label class j of the current character belongs to the error-prone set S_i; if not, no label smoothing is applied to the prediction; otherwise, the smoothing strength is adaptively adjusted according to the error rate.
Substituting this probability distribution into the label-smoothed CTC loss yields the CTC decoder-based selective context-aware loss:
L_CASSLS = (1 − ε)·L_CTC + ε·Σ_{t=1..T} KL(q_t ‖ p_t)
Considering that, when computing the KL divergence, the output probability is the probability of the predicted path, whose length may be misaligned with the length of the true label, only the positions remaining after the predicted-path mapping are kept for the predicted probabilities. Then, according to the edit-distance alignment strategy of step 2: for a delete operation, a one-hot encoding of the space class is added at the blank position of the corresponding target sequence; for an insert operation, a uniform-distribution probability vector is added at the blank position of the corresponding predicted sequence; and a substitute operation brings no change to the probability distribution.
Step 4, retraining the target model after adjusting the loss function, and finally outputting the predicted sequence and its calibrated confidence.
In step 4, the original loss function is adjusted according to the context-aware selective label smoothing strategy of step 3, and the over-confident target model is retrained so that it becomes calibrated. Because training proceeds by fine-tuning the model, the learning rate is set to a small value, and after 200,000 training iterations the predicted text and the calibrated confidence are finally output.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (6)
1. A text sequence recognition model calibration method based on a CTC decoder is characterized by comprising the following steps:
step 1, inputting a text image support set into a training model to be calibrated to obtain a text sequence recognition result;
step 2, calculating a context confusion matrix from the text sequence recognition results of the text image support set, wherein the context confusion matrix represents the contextual distribution relation between predicted characters at adjacent time steps in the sequence;
step 3, selectively and adaptively adjusting the smoothing intensity in label smoothing using the context-dependent prediction distribution, according to the context confusion matrix, so as to achieve adaptive calibration of the sequence confidence;
and 4, retraining the training model to be calibrated based on the context-selective loss function, and finally outputting the predicted text sequence and its calibrated confidence.
2. The CTC decoder-based text sequence recognition model calibration method of claim 1, wherein said process of computing a context confusion matrix comprises the steps of:
initially setting up, for each of the N prediction classes, a context confusion matrix C^(i) with all elements 0, where i is the index of the corresponding prediction class;
aligning the text sequence recognition result of the text image support set with the corresponding ground-truth label, where T is the length of the recognition result and L is the length of the ground-truth label sequence;
if the recognition result is aligned with the ground-truth label, and the class index i of the character at the previous time step is known, directly counting the context confusion matrix C^(i) in which the true character at the current time is predicted as each character, wherein each element c^(i)_{jk} of the context confusion matrix indicates the number of times that, when the true character at the previous time step is known to belong to the i-th class, a current-time character whose true label belongs to the j-th class is predicted as the k-th class label; for a character at the head of the text, the class of its previous character defaults to the space class;
if the recognition result is not aligned with the ground-truth label, computing via the edit distance the sequence of operations that transforms the predicted sequence into the ground-truth label, so as to obtain the alignment relation between the sequences, and then counting to obtain the context confusion matrix.
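The counting step for the aligned case can be sketched as below (the misaligned case would first run the edit-distance alignment of claim 3). The triple-nested list layout and the function name are assumptions of this sketch; the space-class default for text-initial characters follows the claim.

```python
def context_confusion(pairs, num_classes, space=0):
    """counts[i][j][k]: with previous true class i, a true class-j character
    was predicted as class k.  `pairs` holds aligned (prediction, label)
    sequences of equal length, encoded as class indices."""
    counts = [[[0] * num_classes for _ in range(num_classes)]
              for _ in range(num_classes)]
    for pred, label in pairs:
        prev = space  # text-initial characters use the space class as context
        for p, y in zip(pred, label):
            counts[prev][y][p] += 1
            prev = y    # the context is the *true* class of the previous char
    return counts
```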
3. A CTC decoder-based text sequence recognition model calibration method according to claim 2, wherein said process of obtaining the alignment relation between sequences requires performing the following operations several times: deleting one character, inserting one character, or substituting one character, until the characters are correctly predicted and aligned; the delete-one-character operation corrects the case where a blank symbol in the ground-truth label sequence is wrongly predicted as another character, the insert-one-character operation corrects the case where a corresponding character in the ground-truth label sequence is predicted as a blank symbol, and the substitute-one-character operation corrects the case where a corresponding character in the ground-truth label sequence is predicted as another character.
4. The CTC decoder-based text sequence recognition model calibration method of claim 2, wherein selectively and adaptively changing the smoothing intensity in label smoothing using the context-dependent prediction distribution in step 3 means that the smoothing intensity is adaptively adjusted according to the context relation; adjusting the label probability yields the following selective context-aware probability distribution formula:
wherein S_i represents the error-prone set corresponding to the case where the class i of the previous character y_{t-1} is known; n_{ik} represents the number of times the current character y_t is predicted as class k when the label of the previous character y_{t-1} belongs to class i; k represents the class index; n_{ij} represents the number of times the current character y_t is predicted as class j when the label of the previous character y_{t-1} belongs to class i; and ε represents the smoothing intensity; when the label class of the previous character y_{t-1} is known, it must first be confirmed whether the label of the current character y_t belongs to the error-prone set; if not, no label smoothing calibration is needed; otherwise, the smoothing strength is adaptively adjusted according to the error rate.
5. The CTC decoder-based text sequence recognition model calibration method of claim 4, wherein the error-prone set is obtained as follows: for the N different prediction classes of the previous character y_{t-1}, according to the frequency with which the current character is predicted as each class in the corresponding context confusion matrix, the error-prone set S_i for the case where the previous character y_{t-1} belongs to the i-th class is obtained by counting; the division criterion is as follows:
wherein acc_i represents the prediction accuracy at the current time when the previous character y_{t-1} belongs to the i-th class; if the error rate of a class is greater than the set threshold, the corresponding class is added to the error-prone set; once the class of the label of y_{t-1} is known, the corresponding error-prone set S_i can be obtained.
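The division rule can be sketched as follows; the threshold value and the names are illustrative, not from the claim. Here `counts[i][j]` is the context confusion row for previous-character class i and true class j, as built in step 2.

```python
def error_prone_sets(counts, threshold=0.3):
    """For each previous-character class i, collect the classes whose
    contextual error rate (1 - accuracy) exceeds the threshold."""
    sets = {}
    for i, matrix in enumerate(counts):
        s = set()
        for j, row in enumerate(matrix):
            total = sum(row)
            if total and (total - row[j]) / total > threshold:
                s.add(j)  # class j is error-prone in context i
        sets[i] = s
    return sets
```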
6. The CTC decoder-based text sequence recognition model calibration method of claim 4, wherein the context-selective loss function in step 4 is:
wherein q'_t(k) represents the selective context-aware probability distribution of the label class index k at time t; p_t(k) represents the probability of the prediction class label k at time t; L_CTC represents the CTC loss; KL represents the KL divergence; and p_t represents the probability vector over all label classes at time t.
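The structure of this loss (a CTC term plus a KL regularizer over the length-aligned time steps) can be sketched as below. The weight `alpha` and the direction KL(q' || p) are assumptions of this sketch; the claim text does not fix either.

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(q || p): how far prediction p is from the smoothed target q."""
    return sum(qk * math.log((qk + eps) / (pk + eps))
               for pk, qk in zip(p, q) if qk > 0.0)

def context_selective_loss(ctc_loss, pred_dists, target_dists, alpha=0.1):
    """CTC loss plus the KL term summed over length-aligned time steps."""
    kl = sum(kl_div(p, q) for p, q in zip(pred_dists, target_dists))
    return ctc_loss + alpha * kl
```

In practice `ctc_loss` would come from the decoder's CTC criterion and `pred_dists`/`target_dists` from the alignment-and-padding step described above.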
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210402975.1A CN114495114B (en) | 2022-04-18 | 2022-04-18 | Text sequence recognition model calibration method based on CTC decoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114495114A true CN114495114A (en) | 2022-05-13 |
CN114495114B CN114495114B (en) | 2022-08-05 |
Family
ID=81489555
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210402975.1A Active CN114495114B (en) | 2022-04-18 | 2022-04-18 | Text sequence recognition model calibration method based on CTC decoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114495114B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117151111A (en) * | 2023-08-15 | 2023-12-01 | 华南理工大学 | Text recognition model reliability regularization method based on perception and semantic relevance |
Citations (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934293A (en) * | 2019-03-15 | 2019-06-25 | 苏州大学 | Image-recognizing method, device, medium and obscure perception convolutional neural networks |
US10366362B1 (en) * | 2012-10-18 | 2019-07-30 | Featuremetrics, LLC | Feature based modeling for forecasting and optimization |
US10388272B1 (en) * | 2018-12-04 | 2019-08-20 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
CN110399845A (en) * | 2019-07-29 | 2019-11-01 | 上海海事大学 | Continuously at section text detection and recognition methods in a kind of image |
CN110634491A (en) * | 2019-10-23 | 2019-12-31 | 大连东软信息学院 | Series connection feature extraction system and method for general voice task in voice signal |
US20200027444A1 (en) * | 2018-07-20 | 2020-01-23 | Google Llc | Speech recognition with sequence-to-sequence models |
US10573312B1 (en) * | 2018-12-04 | 2020-02-25 | Sorenson Ip Holdings, Llc | Transcription generation from multiple speech recognition systems |
US20200357388A1 (en) * | 2019-05-10 | 2020-11-12 | Google Llc | Using Context Information With End-to-End Models for Speech Recognition |
CN112068555A (en) * | 2020-08-27 | 2020-12-11 | 江南大学 | Voice control type mobile robot based on semantic SLAM method |
US20200402500A1 (en) * | 2019-09-06 | 2020-12-24 | Beijing Dajia Internet Information Technology Co., Ltd. | Method and device for generating speech recognition model and storage medium |
CN112712804A (en) * | 2020-12-23 | 2021-04-27 | 哈尔滨工业大学(威海) | Speech recognition method, system, medium, computer device, terminal and application |
WO2021081562A2 (en) * | 2021-01-20 | 2021-04-29 | Innopeak Technology, Inc. | Multi-head text recognition model for multi-lingual optical character recognition |
US20210150200A1 (en) * | 2019-11-19 | 2021-05-20 | Samsung Electronics Co., Ltd. | Electronic device for converting handwriting input to text and method of operating the same |
CN112989834A (en) * | 2021-04-15 | 2021-06-18 | 杭州一知智能科技有限公司 | Named entity identification method and system based on flat grid enhanced linear converter |
CN113160803A (en) * | 2021-06-09 | 2021-07-23 | 中国科学技术大学 | End-to-end voice recognition model based on multilevel identification and modeling method |
CN113283336A (en) * | 2021-05-21 | 2021-08-20 | 湖南大学 | Text recognition method and system |
CN113516968A (en) * | 2021-06-07 | 2021-10-19 | 北京邮电大学 | End-to-end long-term speech recognition method |
CN113609859A (en) * | 2021-08-04 | 2021-11-05 | 浙江工业大学 | Special equipment Chinese named entity recognition method based on pre-training model |
EP3910534A1 (en) * | 2020-05-15 | 2021-11-17 | MyScript | Recognizing handwritten text by combining neural networks |
CN113887480A (en) * | 2021-10-19 | 2022-01-04 | 小语智能信息科技(云南)有限公司 | Burma language image text recognition method and device based on multi-decoder joint learning |
CN114023316A (en) * | 2021-11-04 | 2022-02-08 | 匀熵科技(无锡)有限公司 | TCN-Transformer-CTC-based end-to-end Chinese voice recognition method |
CN114155527A (en) * | 2021-11-12 | 2022-03-08 | 虹软科技股份有限公司 | Scene text recognition method and device |
Non-Patent Citations (1)
Title |
---|
SHUANGPING HUANG ET AL: "Context-Aware Selective Label Smoothing for Calibrating Sequence Recognition Model", 《ACM》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||