CN111428750A - Text recognition model training and text recognition method, device and medium


Info

Publication number
CN111428750A
Authority
CN
China
Prior art keywords
matrix
prediction
character
sequence
column
Legal status
Pending
Application number
CN202010106741.3A
Other languages
Chinese (zh)
Inventor
胡文阳
蔡晓聪
侯军
Current Assignee
Sensetime International Pte Ltd
Original Assignee
Sensetime International Pte Ltd
Application filed by Sensetime International Pte Ltd
Priority to CN202010106741.3A priority Critical patent/CN111428750A/en
Publication of CN111428750A

Classifications

    • G06F 18/2415: Pattern recognition; Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Neural networks; Combinations of networks
    • G06N 3/047: Neural networks; Probabilistic or stochastic networks
    • G06V 30/10: Character recognition


Abstract

The embodiments of the present application disclose a text recognition model training method, a text recognition method, an apparatus, and a medium. The training method includes: determining a column-wise feature matrix of a sample recognition object; determining a plurality of character prediction matrices of the sample recognition object according to the column-wise feature matrix, where the numbers of matrix columns in the plurality of character prediction matrices are not all the same, each matrix column of a character prediction matrix has a corresponding unit to be recognized in the sample recognition object, and each matrix column contains the value probability of its unit to be recognized for at least one prediction object; determining a model prediction loss parameter of the text recognition model according to the label sequence of the sample recognition object and the plurality of character prediction matrices; and adjusting the model parameters of the text recognition model according to the model prediction loss parameter.

Description

Text recognition model training and text recognition method, device and medium
Technical Field
The present application relates to the field of machine learning, and in particular, to a method, an apparatus, and a medium for text recognition model training and text recognition.
Background
Nowadays, with the rapid development of the internet, text often needs to be extracted quickly from speech and pictures. In text recognition scenarios, the text recognition model in use is typically a combination of a convolutional neural network and a recurrent neural network: the former determines a set of feature vectors, and the latter propagates sequence features so as to predict, for each position in the sequence, the value probability of each character. Because the output dimension of the convolutional neural network is fixed, the number of feature vectors it determines is fixed, and the output step length of the text predicted by the text recognition model is therefore also fixed. In practical use of the model, however, the speech length or picture width occupied by a single character is variable. With a fixed number of feature vectors, a single character in the recognized speech or picture may be split across two feature vectors, or several characters may share one feature vector. The model then cannot make accurate character predictions from the extracted features, which hurts the recognition accuracy of the text recognition model.
Disclosure of Invention
The application provides a text recognition model training method and apparatus, a text recognition method and apparatus, and a medium.
A first aspect of the embodiments of the present application provides a text recognition model training method, including:
determining a column-wise feature matrix of a sample recognition object;
determining a plurality of character prediction matrices of the sample recognition object according to the column-wise feature matrix, where the numbers of matrix columns in the plurality of character prediction matrices are not all the same, each matrix column of a character prediction matrix has a corresponding unit to be recognized in the sample recognition object, and each matrix column contains the value probability of its unit to be recognized for at least one prediction object;
determining a model prediction loss parameter of a text recognition model according to the label sequence of the sample recognition object and the plurality of character prediction matrices;
and adjusting the model parameters of the text recognition model according to the model prediction loss parameter.
Optionally, the determining the column-wise feature matrix of the sample recognition object includes:
performing convolution feature extraction on the sample recognition object to obtain a convolution feature matrix of the sample recognition object, where the convolution feature matrix includes a plurality of matrix columns;
inputting each matrix column of the convolution feature matrix into a recurrent neural network in turn, as the network's input sequence at successive moments, following the columns' order of arrangement in the convolution feature matrix;
determining the column-wise feature matrix according to the output sequences of the recurrent neural network for the different input sequences, where a target matrix column in the column-wise feature matrix fuses the features of the matrix columns other than the target matrix column.
Optionally, the character prediction matrices include a first character prediction matrix and a second character prediction matrix;
the determining a plurality of character prediction matrices of the sample recognition object according to the column-wise feature matrix includes:
inputting the column-wise feature matrix into a character classification network to obtain the first character prediction matrix of the sample recognition object;
and performing dimension reduction on every first number of adjacent matrix columns in the first character prediction matrix to obtain the second character prediction matrix.
Optionally, the number of matrix columns in the first character prediction matrix is an integral multiple of the first number, and the first number is an integer greater than 1.
Optionally, the character prediction matrices include a third character prediction matrix and a fourth character prediction matrix;
the determining a plurality of character prediction matrices of the sample recognition object according to the column-wise feature matrix includes:
performing dimension reduction on every second number of adjacent matrix columns in the column-wise feature matrix to obtain a dimension-reduction feature matrix of the sample recognition object;
inputting the column-wise feature matrix into a character classification network to obtain the third character prediction matrix;
and inputting the dimension-reduction feature matrix into a character classification network to obtain the fourth character prediction matrix.
Optionally, the number of matrix columns in the column-wise feature matrix is an integral multiple of the second number, and the second number is an integer greater than 1.
Optionally, the dimension reduction is one of average dimension reduction or maximum-value dimension reduction.
Optionally, the determining a model prediction loss parameter of the text recognition model according to the label sequence of the sample recognition object and the plurality of character prediction matrices includes:
determining, according to the label sequence of the sample recognition object, the itemized prediction loss parameter corresponding to each character prediction matrix;
and determining the model prediction loss parameter according to the itemized prediction loss parameters.
Optionally, the determining the model prediction loss parameter according to the itemized prediction loss parameters includes:
performing a weighted summation of the itemized prediction loss parameters according to the loss weight of each itemized prediction loss parameter to obtain the model prediction loss parameter.
Optionally, the determining, according to the label sequence of the sample recognition object, the itemized prediction loss parameter corresponding to each character prediction matrix includes:
determining, from candidate sequences, at least one target prediction sequence that satisfies the matching condition of the label sequence, where a candidate sequence is a sequence formed by the prediction objects corresponding to the value probabilities in the matrix columns of the character prediction matrix;
determining a first prediction probability corresponding to the target prediction sequence according to the value probabilities in the matrix columns of the character prediction matrix;
and determining the itemized prediction loss parameter according to the first prediction probability.
Optionally, the matching condition of the label sequence includes:
in the case that the candidate sequence contains only text characters, collapsing runs of consecutive identical text characters in the candidate sequence to one yields a sequence identical to the label sequence;
or, in the case that the candidate sequence contains text characters and virtual spacers, collapsing runs of consecutive identical text characters to one and then deleting the virtual spacers yields a sequence identical to the label sequence.
A second aspect of the embodiments of the present application provides a text recognition method, including:
inputting a target recognition object into a text recognition model trained with the text recognition model training method of the first aspect or any optional implementation thereof;
and acquiring the text recognition result output by the text recognition model.
A third aspect of the embodiments of the present application provides a text recognition model training apparatus, including:
a feature determining module, configured to determine a column-wise feature matrix of a sample recognition object;
a character prediction module, configured to determine a plurality of character prediction matrices of the sample recognition object according to the column-wise feature matrix, where the numbers of matrix columns in the plurality of character prediction matrices are not all the same, each matrix column of a character prediction matrix has a corresponding unit to be recognized in the sample recognition object, and each matrix column contains the value probability of its unit to be recognized for at least one prediction object;
a loss determining module, configured to determine a model prediction loss parameter of a text recognition model according to the label sequence of the sample recognition object and the plurality of character prediction matrices, the model prediction loss parameter being used to adjust the model parameters of the text recognition model;
and a parameter adjusting module, configured to adjust the model parameters of the text recognition model according to the model prediction loss parameter.
Optionally, the feature determining module is specifically configured to:
perform convolution feature extraction on the sample recognition object to obtain a convolution feature matrix of the sample recognition object, where the convolution feature matrix includes a plurality of matrix columns;
input each matrix column of the convolution feature matrix into a recurrent neural network in turn, as the network's input sequence at successive moments, following the columns' order of arrangement in the convolution feature matrix;
determine the column-wise feature matrix according to the output sequences of the recurrent neural network for the different input sequences, where a target matrix column in the column-wise feature matrix fuses the features of the matrix columns other than the target matrix column.
Optionally, the character prediction matrices include a first character prediction matrix and a second character prediction matrix;
the character prediction module is specifically configured to:
input the column-wise feature matrix into a character classification network to obtain the first character prediction matrix of the sample recognition object;
and perform dimension reduction on every first number of adjacent matrix columns in the first character prediction matrix to obtain the second character prediction matrix.
Optionally, the number of matrix columns in the first character prediction matrix is an integral multiple of the first number, and the first number is an integer greater than 1.
Optionally, the character prediction matrices include a third character prediction matrix and a fourth character prediction matrix;
the character prediction module is specifically configured to:
perform dimension reduction on every second number of adjacent matrix columns in the column-wise feature matrix to obtain a dimension-reduction feature matrix of the sample recognition object;
input the column-wise feature matrix into a character classification network to obtain the third character prediction matrix;
and input the dimension-reduction feature matrix into a character classification network to obtain the fourth character prediction matrix.
Optionally, the number of matrix columns in the column-wise feature matrix is an integral multiple of the second number, and the second number is an integer greater than 1.
Optionally, the dimension reduction is one of average dimension reduction or maximum-value dimension reduction.
Optionally, the loss determining module is specifically configured to:
determine, according to the label sequence of the sample recognition object, the itemized prediction loss parameter corresponding to each character prediction matrix;
and determine the model prediction loss parameter according to the itemized prediction loss parameters.
Optionally, in determining the model prediction loss parameter according to the itemized prediction loss parameters, the loss determining module is specifically configured to:
perform a weighted summation of the itemized prediction loss parameters according to the loss weight of each itemized prediction loss parameter to obtain the model prediction loss parameter.
Optionally, in determining the itemized prediction loss parameter corresponding to each character prediction matrix according to the label sequence of the sample recognition object, the loss determining module is specifically configured to:
determine, from candidate sequences, at least one target prediction sequence that satisfies the matching condition of the label sequence, where a candidate sequence includes at least one sequence object, the sequence object being the prediction object corresponding to a value probability in a matrix column of the character prediction matrix;
determine a first prediction probability corresponding to the target prediction sequence according to the value probabilities in the matrix columns of the character prediction matrix;
and determine the itemized prediction loss parameter according to the first prediction probability.
Optionally, the matching condition of the label sequence includes:
in the case that the candidate sequence contains only text characters, collapsing runs of consecutive identical text characters in the candidate sequence to one yields a sequence identical to the label sequence;
or, in the case that the candidate sequence contains text characters and virtual spacers, collapsing runs of consecutive identical text characters to one and then deleting the virtual spacers yields a sequence identical to the label sequence.
A fourth aspect of the embodiments of the present application provides a text recognition apparatus, including:
an input module, configured to input a target recognition object into a text recognition model trained with the text recognition model training method of the first aspect or any optional implementation thereof;
and an acquisition module, configured to acquire the text recognition result output by the text recognition model.
A fifth aspect of the embodiments of the present application provides a text recognition model training apparatus, including a processor and a memory;
the processor is connected to the memory, the memory is configured to store program code, and the processor is configured to call the program code to execute the text recognition model training method of the first aspect or any optional implementation thereof.
A sixth aspect of the embodiments of the present application provides a text recognition apparatus, including a processor and a memory;
the processor is connected to the memory, the memory is configured to store program code, and the processor is configured to call the program code to execute the text recognition method of the second aspect or any optional implementation thereof.
A seventh aspect of the embodiments of the present application provides a computer storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of any one of the above aspects.
In the embodiments of the present application, after the column-wise feature matrix of a sample recognition object is determined, a plurality of character prediction matrices of the sample recognition object are determined according to the column-wise feature matrix, and a model prediction loss parameter of the text recognition model is then determined according to the label sequence of the sample recognition object and the plurality of character prediction matrices; the model prediction loss parameter is used to adjust the model parameters of the text recognition model. The numbers of matrix columns in the plurality of character prediction matrices are not all the same, different matrix columns of a character prediction matrix correspond one-to-one to different units to be recognized in the sample recognition object, and each matrix column contains the value probability of its unit to be recognized for at least one prediction object. Training the text recognition model on character prediction matrices of the sample recognition object at several different scales improves the model's ability to recognize text characters at different scales, and thereby its recognition accuracy.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a network diagram of a text recognition model provided by an embodiment of the present application;
FIG. 2 is a diagram of the receptive fields of feature vectors provided in an embodiment of the present application;
FIG. 3 is a network diagram of another text recognition model provided by an embodiment of the present application;
FIG. 4 is a flowchart illustrating a text recognition model training method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a character prediction matrix according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating another text recognition model training method according to an embodiment of the present application;
FIG. 7 is a schematic view of a pooling layer provided by embodiments of the present application;
FIG. 8 is a schematic diagram of mean pooling provided by an embodiment of the present application;
FIG. 9 is another schematic diagram of mean pooling provided by an embodiment of the present application;
FIG. 10 is a flowchart illustrating another text recognition model training method according to an embodiment of the present application;
FIG. 11 is a schematic view of another pooling layer provided by embodiments of the present application;
FIG. 12 is a schematic structural diagram of a text recognition model training apparatus according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of another training apparatus for text recognition models according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
The text recognition model training method provided by the application is a model training method based on the CTC (Connectionist Temporal Classification) algorithm; before introducing it, the basic principle of CTC is briefly reviewed.
Among text recognition models, a commonly used architecture combines a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a classification network (e.g., a softmax layer). The CNN extracts convolution features of the object to be recognized and determines its feature vectors, and the RNN propagates state information across the feature vectors to obtain the prediction result, i.e., the text characters contained in the object to be recognized.
Referring to fig. 1, fig. 1 is a network diagram of a text recognition model provided in an embodiment of the present application; it illustrates a model that recognizes the text contained in a picture in an OCR (Optical Character Recognition) scenario. When training the model, the sample picture is labeled first: the text characters it actually contains are used as its label. After the sample picture is input to the CNN, the CNN performs convolution feature extraction on it to obtain the picture's feature matrix, which may carry texture features, color features, and other abstract convolution features. A plurality of feature vectors are determined from this feature matrix and input to the RNN, which propagates their state information and outputs the corresponding slice features. The slice features are fed to a softmax layer, which performs text prediction for each feature vector and produces a prediction matrix; each matrix column of the prediction matrix contains prediction probabilities for the different text characters, and the text character with the maximum prediction probability in each column is output as the final prediction result. A loss function is then determined from the prediction matrix and the label of the sample picture, and the model is trained and optimized through algorithms such as back propagation (BP), finally yielding a text recognition model that can predict the text characters contained in a picture to be recognized.
In the above process, the number of columns of the sample picture's feature matrix equals the output step length (i.e., the number of text characters the prediction matrix corresponds to), and its number of rows equals the feature dimension of the sample picture's convolution features. When determining the feature vectors from the feature matrix, each column of the feature matrix is taken as a separate feature vector containing the feature data of all feature dimensions. Since the convolutional layers, pooling layers, and activation functions in the CNN operate on local regions with translation invariance, each column of the feature matrix (i.e., each feature vector) corresponds to a rectangular region of the sample picture, called a receptive field. The receptive fields are arranged left to right in the sample picture in the same order as their feature vectors are arranged left to right in the feature matrix, so each feature vector is associated with one receptive field. Referring to fig. 2, fig. 2 illustrates the receptive fields of the feature vectors for the sample picture in fig. 1: the four dashed boxes in the sample picture represent four receptive fields, and the arrows indicate the correspondence between the four feature vectors and the four receptive fields. Likewise, the slice features output by the RNN and the columns of the prediction matrix output by the softmax layer also correspond to receptive fields of the sample picture. For example, after the sample picture in fig. 1 is classified by the softmax layer, the output prediction matrix may include four columns, which from left to right carry the prediction probabilities for the different text characters in receptive fields 1, 2, 3, and 4 of fig. 2. Finally, the text recognition model outputs the text character with the highest prediction probability in each column of the prediction matrix; for the picture in fig. 1 the outputs are, in order, "H", "e", "e", and "r": the text character corresponding to receptive field 1 in fig. 2 is "H", the characters corresponding to receptive fields 2 and 3 are both "e", and the character corresponding to receptive field 4 is "r". In fact, the text characters in the label of the sample picture in fig. 1 are "H", "e", and "r", in that order. The model therefore suffers from the problem that the predicted output text characters are difficult to align with the text characters actually present in the input picture to be recognized. Similarly, because characters take different lengths to pronounce, people speak at different speeds, and so on, the text characters predicted by the model are also difficult to align with the text characters actually present in an audio file when performing text recognition on audio.
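To make the alignment problem concrete, here is a minimal sketch (with a made-up three-character alphabet and made-up probabilities; none of these numbers come from the application) of decoding a four-column prediction matrix by taking the per-column argmax:

```python
import numpy as np

# Hypothetical prediction matrix: one row per receptive field (4 fields),
# one entry per character of the alphabet. All probabilities are made up.
alphabet = ["H", "e", "r"]
pred = np.array([
    [0.90, 0.05, 0.05],  # receptive field 1 -> "H"
    [0.10, 0.80, 0.10],  # receptive field 2 -> "e"
    [0.20, 0.70, 0.10],  # receptive field 3 -> "e" again
    [0.10, 0.20, 0.70],  # receptive field 4 -> "r"
])

decoded = "".join(alphabet[i] for i in pred.argmax(axis=1))
print(decoded)  # "Heer": four receptive fields, but the true label is "Her"
```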
The CTC algorithm addresses the above alignment problem. It introduces a blank symbol, which differs from spaces, spacers, and other symbols in a document: it is a purely logical separator. Referring to fig. 3, fig. 3 is a network diagram of another text recognition model provided in an embodiment of the present application; this model includes a CNN, an RNN, a classification layer (e.g., a softmax layer), and a transcription layer.
When training this recognition model, the sample picture is labeled first, taking the text characters it actually contains as its label. Then, based on the CTC algorithm and according to the output step length of the text recognition model, blank characters are inserted into the label of the sample picture between repeated text characters. For example, if the label of the sample picture is "too", the output step length is 5, and the blank character is written as "-", the label after inserting blanks may take one of the forms "t-o-o", "to-o-", "tto-o", and so on. After labeling, the sample picture is input to the combination of the CNN, the RNN, and the classification layer to obtain a prediction matrix in which each column contains prediction probabilities for the different text characters and for the blank character. Decoding is then performed by the transcription layer based on the CTC algorithm: among all text character combinations corresponding to the prediction matrix, the probabilities of the target combinations whose text characters match the label of the sample picture are determined, and the sum of those probabilities is taken as the probability that the model predicts the label of the sample picture. A text character combination matches the label, and is thus a target combination, if the label can be obtained by collapsing the consecutive identical text characters before the first blank to one, collapsing the consecutive identical text characters after the last blank to one, collapsing the identical text characters between any two blanks to one, and then deleting the blank characters. A loss function is determined from the predicted probability of the sample picture's label, and the model is trained and optimized through an error back propagation algorithm, finally yielding a text recognition model that predicts and decodes the various combinations of text characters contained in a picture to be recognized and outputs the text characters it contains.
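As an illustration of the collapse rule just described, the following sketch enumerates every length-5 sequence over {"t", "o", "-"} that collapses to the label "too"; the forms "t-o-o", "to-o-", and "tto-o" mentioned above all appear in the result (the brute-force enumeration is for illustration only, not part of the training procedure):

```python
from itertools import groupby, product

def collapse(path, blank="-"):
    # CTC decoding rule: merge runs of identical symbols, then drop blanks.
    return "".join(ch for ch, _ in groupby(path) if ch != blank)

label, step = "too", 5
alignments = ["".join(p) for p in product("to-", repeat=step)
              if collapse("".join(p)) == label]
print(alignments)  # includes 't-o-o', 'to-o-', 'tto-o', among others
```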
In this process, inserting the blank character into the label of the sample picture lets the text recognition model learn the characteristics of the blank character, so that it can output combinations of text characters containing blanks; and the transcription layer's decoding of those combinations resolves the problem that the predicted output contains runs of repeated text characters because several receptive fields correspond to one text character.
The text recognition model in the present application can be applied to recognizing text in pictures, in audio files, in video files, and in objects of other forms.
Referring to fig. 4, fig. 4 is a schematic flowchart of a text recognition model training method provided in an embodiment of the present application, and as shown in the drawing, the method may include the following steps:
S401, determining the column-wise feature matrix of the sample recognition object.
When training an image text recognition model, the sample recognition object may be a sample picture carrying a label sequence in advance; when training an audio text recognition model, it may be a sample audio file carrying a label sequence in advance. The label sequence is the annotation of the sample recognition object and contains the text characters actually corresponding to it.
Specifically, convolution feature extraction is performed on the sample recognition object to obtain its convolution feature matrix, which includes a plurality of matrix columns. The matrix columns of the convolution feature matrix are then input in turn into a recurrent neural network, as its input sequences at successive moments, following their order of arrangement in the convolution feature matrix (for example, from left to right). The column-wise feature matrix is determined according to the output sequences of the recurrent neural network for the different input sequences; a target matrix column in the column-wise feature matrix (any matrix column in it, or a matrix column determined according to some rule) fuses the features of the matrix columns other than itself.
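A minimal PyTorch sketch of this step, under assumed shapes and layer sizes (the application does not specify any of them): the CNN output is split into matrix columns, the columns are fed to a bidirectional LSTM in their left-to-right order, and the LSTM outputs form the column-wise feature matrix, in which every column fuses context from the others:

```python
import torch
import torch.nn as nn

conv = nn.Sequential(                          # stand-in CNN (sizes are assumptions)
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d((2, 1)),                      # shrink height, keep the column count
)
rnn = nn.LSTM(input_size=64 * 16, hidden_size=128,
              bidirectional=True, batch_first=True)

image = torch.randn(1, 1, 32, 100)             # sample picture: 32x100, grayscale
fmap = conv(image)                             # convolution feature matrix (1, 64, 16, 100)
columns = fmap.permute(0, 3, 1, 2).flatten(2)  # (1, 100, 1024): one vector per matrix column
features, _ = rnn(columns)                     # column-wise feature matrix (1, 100, 256):
print(features.shape)                          # each column fuses the other columns' context
```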
S402, determining a plurality of character prediction matrices of the sample recognition object according to the column-wise feature matrix, where the numbers of matrix columns in the character prediction matrices are not all the same, each matrix column of a character prediction matrix has a corresponding unit to be recognized in the sample recognition object, and each matrix column contains the value probability of its unit to be recognized for at least one prediction object.
Different matrix columns of a character prediction matrix correspond one-to-one to different units to be recognized in the sample recognition object; that is, each matrix column of the character prediction matrix has a unique corresponding unit to be recognized in the sample recognition object. Each matrix column of the column-wise feature matrix contains features along different abstract feature dimensions, and predictions for the sample recognition object can be made from the features in each matrix column, i.e., the character prediction matrix of the sample recognition object is determined. Each matrix column of the character prediction matrix corresponds to a different unit to be recognized, where the unit to be recognized is a receptive field in the object: if the object is a picture, the unit may be an image segment of a certain width (in the receptive-field example of fig. 2, each receptive field is one unit to be recognized); if the object is speech, the unit may be a speech segment of a certain duration. The number of matrix columns in a character prediction matrix thus determines the length of the text character combinations fed to the transcription layer, and hence the transcription layer's decoding result. Determining several character prediction matrices with different numbers of matrix columns therefore diversifies the input data fed to the transcription layers across different lengths, diversifies the transcription layers' decoding, and improves the coverage of the actual label sequence over the object to be recognized.
Each matrix column of the character prediction matrix contains the value probability of its unit to be recognized for at least one prediction object, where a prediction object may be a text character and/or a virtual spacer (i.e., the blank mentioned above). The value probabilities in each matrix column of the character prediction matrix sum to 1.
In an optional implementation, the column-wise feature matrix may be input directly into a character classification network to obtain the first character prediction matrix of the sample recognition object; average dimension reduction is then performed on every first number of adjacent matrix columns in the first character prediction matrix to obtain the second character prediction matrix, and the first and second character prediction matrices serve as different character prediction matrices of the sample recognition object. The number of matrix columns in the first character prediction matrix is an integral multiple of the first number, and the first number is an integer greater than 1. The first number can take several different values, and performing dimension reduction on the first character prediction matrix several times with different first numbers yields several second character prediction matrices. In a concrete implementation, the dimension reduction on the first character prediction matrix can be realized by connecting a pooling layer after the classification layer of the text recognition model.
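A minimal sketch of this first variant, assuming a first number of 2 and illustrative shapes: average-pooling every 2 adjacent matrix columns of the first character prediction matrix yields the second one, and because averaging probability columns preserves normalization, every pooled matrix column still sums to 1:

```python
import torch
import torch.nn.functional as F

# First character prediction matrix: 8 matrix columns over 5 prediction objects
# (e.g., 4 text characters plus the blank); all sizes are assumptions.
first = torch.softmax(torch.randn(1, 5, 8), dim=1)  # (batch, objects, columns)
second = F.avg_pool1d(first, kernel_size=2)         # pool adjacent column pairs
print(second.shape)                                 # (1, 5, 4): half the columns
print(second.sum(dim=1))                            # every column still sums to 1
```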
In another implementation, dimension reduction may be performed on every second number of adjacent matrix columns in the column-wise feature matrix to obtain a dimension-reduction feature matrix of the sample recognition object; the column-wise feature matrix is input into a character classification network to obtain the third character prediction matrix, the dimension-reduction feature matrix is input into a character classification network to obtain the fourth character prediction matrix, and the third and fourth character prediction matrices serve as different character prediction matrices of the sample recognition object. The number of matrix columns in the column-wise feature matrix is an integral multiple of the second number, and the second number is an integer greater than 1. The second number can take several different values, and performing dimension reduction on the column-wise feature matrix several times with different second numbers yields several dimension-reduction feature matrices. In a concrete implementation, pooling layers can be connected before the RNN layer of the text recognition model to obtain feature matrices of different scales, and each feature matrix is processed in turn by the RNN and the classification layer to obtain character prediction matrices of different scales.
S403, determining the model prediction loss parameter of the text recognition model according to the label sequence of the sample recognition object and the plurality of character prediction matrices.
Here, the itemized prediction loss parameter corresponding to each character prediction matrix is determined according to the label sequence of the sample recognition object, and the model prediction loss parameter is then determined from the itemized prediction loss parameters.
In one implementation, the itemized prediction loss parameters can simply be added to obtain the model prediction loss parameter; in another, a loss weight is preset for each itemized prediction loss parameter, and the itemized prediction loss parameters are weighted and summed according to their respective loss weights to obtain the model prediction loss parameter.
When determining an itemized prediction loss parameter, the candidate sequences formed by the prediction objects corresponding to the value probabilities in the matrix columns of the character prediction matrix are determined first; at least one target prediction sequence satisfying the matching condition of the label sequence of the sample recognition object is then determined from the candidate sequences; the first prediction probability corresponding to each target prediction sequence is determined from the value probabilities in the matrix columns of the character prediction matrix; and the itemized prediction loss parameter is determined from the first prediction probabilities. The first prediction probability of a target prediction sequence is the product of the value probabilities of the prediction objects it contains. The itemized prediction loss parameter can be expressed as the negative log-likelihood of the first prediction probabilities; for convenience of computation the logarithm of the likelihood is taken, the itemized prediction loss parameter is minimized, and the model parameters are adjusted through error back propagation.
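A sketch of one itemized loss term as described above (the function names, and the two-term weighting shown at the end, are illustrative assumptions): the first prediction probability of a target prediction sequence is the product of the value probabilities it selects, one per matrix column, and the loss term is the negative log of the summed probabilities:

```python
import torch

def first_prediction_probability(pred, path, vocab):
    # Product of the value probabilities the target prediction sequence selects,
    # one per matrix column; pred has shape (columns, prediction objects).
    p = pred.new_tensor(1.0)
    for column, ch in zip(pred, path):
        p = p * column[vocab.index(ch)]
    return p

def itemized_loss(pred, target_sequences, vocab):
    # Negative log of the summed probability of all target prediction sequences.
    total = sum(first_prediction_probability(pred, s, vocab) for s in target_sequences)
    return -torch.log(total)

# Weighted summation into the model prediction loss (weights are assumptions):
# model_loss = 0.5 * itemized_loss(m1, seqs1, vocab) + 0.5 * itemized_loss(m2, seqs2, vocab)
```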
The matching condition of the label sequence may be: in the case that the candidate sequence contains only text characters, collapsing runs of consecutive identical text characters in the candidate sequence to one yields a sequence identical to the label sequence; or, in the case that the candidate sequence contains text characters and virtual spacers, collapsing runs of consecutive identical text characters to one and then deleting the virtual spacers yields a sequence identical to the label sequence.
For example, referring to fig. 5, fig. 5 is a schematic diagram of a character prediction matrix provided in an embodiment of the present application. As shown in fig. 5, the character prediction matrix includes four matrix columns, and each matrix column contains the value probabilities of its corresponding unit to be recognized for "A", "B", and "-", where "-" represents the virtual spacer. The candidate sequences of this character prediction matrix include AAB-, AAAA, BAA-, BBAA, and the other four-character sequences composed of "A", "B", and "-"; because the value probability of "B" in the fourth matrix column is zero, the candidate sequences do not include sequences ending in "B". Suppose the label sequence is AB. By the matching condition, for the candidate sequence AAB-: collapsing the two consecutive repeated A's before the virtual spacer to one and deleting the virtual spacer yields AB, which is identical to the label sequence, so AAB- satisfies the matching condition and is a target prediction sequence. For the candidate sequence BBAA: collapsing its two consecutive repeated B's to one and its two consecutive repeated A's to one yields BA, which differs from the label sequence, so BBAA does not satisfy the matching condition and cannot serve as a target prediction sequence.
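The same collapse rule, written as a self-contained check, reproduces this example:

```python
from itertools import groupby

def collapse(seq, blank="-"):
    # Merge runs of identical characters, then drop the virtual spacers.
    return "".join(ch for ch, _ in groupby(seq) if ch != blank)

print(collapse("AAB-"))  # "AB": matches the label sequence, a target prediction sequence
print(collapse("BBAA"))  # "BA": differs from "AB", so BBAA is rejected
```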
S404, adjusting the model parameters of the text recognition model according to the model prediction loss parameter.
The model parameters of the text recognition model can be adjusted through an error back propagation algorithm; the adjusted parameters can include the weight matrices of each network layer of the text recognition model, among others. In one implementation, the model parameters are adjusted by running error back propagation on the model prediction loss parameter of each sample recognition object. In another implementation, the model prediction loss parameters of the sample recognition objects are added to obtain a total loss parameter, and the model parameters are adjusted by running error back propagation on the total loss parameter. In the error back propagation algorithm, the parameters of the initial text recognition model are updated by propagating the error loss information backward so that the error loss converges; this backward pass, driven by the error loss, corrects the parameters of the initialized classification layer, the RNN, and the CNN during training, making the model's reconstruction error loss smaller and smaller.
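The mechanics of the update can be sketched as follows (a runnable toy: the linear layer and mean-squared loss are stand-ins for the text recognition network and the model prediction loss parameter, and the optimizer choice and learning rate are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 4)                       # stand-in for the CNN + RNN + classifier
inputs, targets = torch.randn(8, 16), torch.randn(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for _ in range(100):
    loss = F.mse_loss(model(inputs), targets)  # stand-in for the model prediction loss
    optimizer.zero_grad()
    loss.backward()                            # error back propagation
    optimizer.step()                           # adjust the model parameters
```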
In the embodiment of the application, after the column-wise feature matrix of the sample recognition object is determined, a plurality of character prediction matrices of the sample recognition object are determined according to the column-wise feature matrix, and the model prediction loss parameter of the text recognition model is then determined according to the label sequence of the sample recognition object and the plurality of character prediction matrices; the model prediction loss parameter is used to adjust the model parameters of the text recognition model. The numbers of matrix columns in the plurality of character prediction matrices are not all the same, different matrix columns of a character prediction matrix correspond one-to-one to different units to be recognized in the sample recognition object, and each matrix column contains the value probability of its unit to be recognized for at least one prediction object. Training the text recognition model on character prediction matrices of the sample recognition object at several different scales improves the model's ability to recognize text characters at different scales, and thereby its recognition accuracy.
Referring to fig. 6, fig. 6 is a schematic flowchart of another text recognition model training method provided in the embodiment of the present application, and as shown in the drawing, the method may include the following steps:
S601, determining the column-wise feature matrix of the sample recognition object.
S602, inputting the column-wise feature matrix into a character classification network to obtain the first character prediction matrix of the sample recognition object.
S603, performing dimension reduction on every first number of adjacent matrix columns in the first character prediction matrix to obtain the second character prediction matrix.
Here, the dimension reduction performed on every first number of adjacent matrix columns in the first character prediction matrix may be implemented by a pooling layer, and may be average dimension reduction or maximum-value dimension reduction. The dimension reduction parameter of the pooling layer, i.e., the value of the first number, is related to the number of matrix columns of the character prediction matrix being pooled: the pooling layer reduces every first number of adjacent matrix columns of the first character prediction matrix, so the number of matrix columns in the first character prediction matrix must be divisible by the first number, and the first number is not equal to 1. Referring to fig. 7, fig. 7 is a schematic diagram of a pooling layer provided by an embodiment of the present application; it shows a pooling layer connected after the classification layer (softmax layer) that reduces a first character prediction matrix comprising five matrix columns to a second character prediction matrix comprising four matrix columns.
In one implementation, a mean pooling layer may be connected after the classification layer of the text recognition model to implement average dimension reduction. Optionally, average dimension reduction of the first character prediction matrix may be performed with different dimension reduction parameters (first numbers), by connecting several mean pooling layers with different dimension reduction parameters after the classification layer. A transcription layer is connected after each mean pooling layer to decode the corresponding character prediction matrix.
Referring to fig. 8, fig. 8 is a schematic diagram of mean pooling provided by an embodiment of the present application. As shown in the figure, if the first character prediction matrix includes 8 matrix columns, the first number may be 2 or 4; with a first number of 2, average dimension reduction turns the first character prediction matrix into a second character prediction matrix including 4 matrix columns. Referring to fig. 9, fig. 9 is another schematic diagram of mean pooling, for the case of fig. 8 where the first number is 4: with a first number of 4, average dimension reduction turns the first character prediction matrix into a second character prediction matrix including 2 matrix columns.
In another implementation, a maximum pooling layer may be connected after the classification layer of the text recognition model to implement maximum-value dimension reduction. Optionally, maximum-value dimension reduction of the first character prediction matrix may be performed with different dimension reduction parameters (first numbers), by connecting several maximum pooling layers with different dimension reduction parameters after the classification layer. A transcription layer is connected after each maximum pooling layer to decode the corresponding character prediction matrix.
In yet another implementation, both mean pooling layers and maximum pooling layers may be connected after the classification layer of the text recognition model, implementing average dimension reduction and maximum-value dimension reduction respectively, with a transcription layer connected after each pooling layer to decode the corresponding character prediction matrix. The first number of a mean pooling layer and the first number of a maximum pooling layer may be the same or different. Optionally, several mean pooling layers with mutually different first numbers may be connected after the classification layer; optionally, several maximum pooling layers with mutually different first numbers may likewise be connected after the classification layer.
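A sketch of such a multi-branch head on a shared classification output, assuming two mean-pooling branches (first numbers 2 and 4) and one maximum-pooling branch; the branch count and first numbers are illustrative:

```python
import torch
import torch.nn.functional as F

probs = torch.softmax(torch.randn(1, 5, 8), dim=1)  # classifier output: 8 matrix columns

branches = {
    "mean, first number 2": F.avg_pool1d(probs, kernel_size=2),  # (1, 5, 4)
    "mean, first number 4": F.avg_pool1d(probs, kernel_size=4),  # (1, 5, 2)
    "max,  first number 2": F.max_pool1d(probs, kernel_size=2),  # (1, 5, 4); note the
}                                                                # columns no longer sum to 1
for name, matrix in branches.items():
    print(name, tuple(matrix.shape))  # each branch feeds its own transcription layer / loss
```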
S604, determining the itemized prediction loss parameters corresponding to the first character prediction matrix and the second character prediction matrix, respectively.
S605, adding the itemized prediction loss parameters to obtain the model prediction loss parameter.
S606, adjusting the model parameters of the text recognition model according to the model prediction loss parameter.
For steps S601 to S602 and steps S604 to S606, refer to the related description in the embodiment corresponding to fig. 4, which is not repeated here.
In the embodiment of the application, after the column-wise feature matrix of the sample recognition object is determined, the first character prediction matrix of the sample recognition object is determined according to the column-wise feature matrix; dimension reduction of the first character prediction matrix yields the second character prediction matrix; the model prediction loss parameter of the text recognition model is determined according to the label sequence of the sample recognition object, the first character prediction matrix, and the second character prediction matrix; and the model parameters of the text recognition model are adjusted according to the model prediction loss parameter. The numbers of matrix columns in the character prediction matrices are not all the same, different matrix columns of a character prediction matrix correspond one-to-one to different units to be recognized in the sample recognition object, and each matrix column contains the value probability of its unit to be recognized for at least one prediction object. By reducing the dimension of the first character prediction matrix, several character prediction matrices of the sample recognition object at different scales are obtained for training the text recognition model, which improves the model's ability to recognize text characters at different scales and thereby its recognition accuracy.
Referring to fig. 10, fig. 10 is a schematic flowchart of another text recognition model training method provided in the embodiment of the present application, and as shown in the drawing, the method may include the following steps:
S1001, determining the column feature matrix of the sample recognition object.
S1002, performing dimension reduction on every adjacent second number of matrix columns in the column feature matrix to obtain a dimension-reduced feature matrix of the sample recognition object.
Here, the dimension reduction performed on every adjacent second number of matrix columns in the column feature matrix may be implemented by a pooling layer, and the dimension reduction may be either average dimension reduction or maximum-value dimension reduction. The dimension-reduction parameter of the pooling layer, that is, the value of the second number, is related to the number of matrix columns of the column feature matrix to be pooled: since the pooling layer reduces every adjacent second number of matrix columns, the number of matrix columns in the column feature matrix should be divisible by the second number, and the second number is not equal to 1. Referring to fig. 11, fig. 11 is a schematic diagram of another pooling layer provided in the embodiment of the application; it illustrates a group consisting of a pooling layer, an RNN, a classification layer and a transcription layer connected after the CNN, where the pooling layer reduces a column feature matrix of five matrix columns to a dimension-reduced feature matrix of four matrix columns.
In one implementation, an average pooling layer may be connected after the CNN and before the RNN of the text recognition model to implement average dimension reduction on the column feature matrix. Optionally, the average dimension reduction on the column feature matrix may be performed with different dimension-reduction parameters (different values of the second number), in which case a plurality of average pooling layers with different dimension-reduction parameters may be connected after the CNN of the text recognition model, and an RNN, a classification layer and a transcription layer may be connected after each average pooling layer, so as to implement prediction and decoding for the different feature matrices (the column feature matrix and each averaged feature matrix).
In another implementation, a maximum pooling layer may be connected after the CNN and before the RNN of the text recognition model to implement maximum-value dimension reduction on the column feature matrix. Optionally, the maximum-value dimension reduction on the column feature matrix may be performed with different dimension-reduction parameters (different values of the second number), in which case a plurality of maximum pooling layers with different dimension-reduction parameters may be connected after the CNN and before the RNN of the text recognition model, and an RNN, a classification layer and a transcription layer may be connected after each maximum pooling layer, so as to implement prediction and decoding for the different feature matrices.
In another implementation, an average pooling layer and a maximum pooling layer may both be connected after the CNN and before the RNN of the text recognition model, so as to implement the average dimension reduction and the maximum-value dimension reduction respectively, and an RNN, a classification layer and a transcription layer may be connected after each pooling layer to implement prediction and decoding of the corresponding character prediction matrix. The second number corresponding to the average pooling layer and the second number corresponding to the maximum pooling layer may be the same or different. Optionally, a plurality of average pooling layers with mutually different second numbers may be connected after the CNN and before the RNN; likewise, a plurality of maximum pooling layers with mutually different second numbers may be connected after the CNN and before the RNN.
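The sketch below (assuming PyTorch; layer sizes and the pooling choice are illustrative, not the application's exact network) shows one such branch from fig. 11: a pooling layer between the CNN backbone and the RNN, followed by a classification layer whose per-column output is ready for a CTC-style transcription layer.

```python
import torch
import torch.nn as nn

class PooledBranch(nn.Module):
    """One hypothetical branch of fig. 11: pooling between the CNN and the RNN."""

    def __init__(self, feat_dim=512, hidden=256, num_classes=37, second_number=2):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=second_number)  # or nn.MaxPool1d
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, conv_features):
        # conv_features: (batch, feat_dim, matrix columns) from the CNN backbone
        x = self.pool(conv_features)      # reduce every `second_number` adjacent columns
        x = x.permute(2, 0, 1)            # (columns, batch, feat_dim) for the RNN
        x, _ = self.rnn(x)
        logits = self.classifier(x)       # one class distribution per remaining column
        return logits.log_softmax(dim=-1) # ready for CTC-style transcription

branch = PooledBranch()
out = branch(torch.randn(4, 512, 8))      # 8 columns pooled down to 4
print(out.shape)                          # torch.Size([4, 4, 37]): (columns, batch, classes)
```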
S1003, inputting the column feature matrix into the character classification network to obtain the third character prediction matrix.
S1004, inputting the dimension-reduced feature matrix into the character classification network to obtain the fourth character prediction matrix.
S1005, determining the subentry prediction loss parameter corresponding to each of the third character prediction matrix and the fourth character prediction matrix.
Here, S1004 is executed after S1002, and S1005 is executed after both S1003 and S1004; S1003 may be executed before S1004, after S1004, or simultaneously with S1004.
S1006, adding the subentry prediction loss parameters to obtain the model prediction loss parameter.
S1007, adjusting the model parameters of the text recognition model according to the model prediction loss parameter.
For steps S1001 and S1003 to S1007, reference may be made to the related description in the embodiment corresponding to fig. 4, which is not repeated here.
In the embodiment of the application, after the column feature matrix of the sample recognition object is determined, the third character prediction matrix of the sample recognition object is determined according to the column feature matrix, dimension reduction is performed on the column feature matrix and the fourth character prediction matrix of the sample recognition object is determined from the dimension-reduced feature matrix, and the model prediction loss parameter of the text recognition model is then determined according to the label tag sequence of the sample recognition object, the third character prediction matrix and the fourth character prediction matrix, where the model prediction loss parameter is used to adjust the model parameters of the text recognition model. The character prediction matrices do not all include the same number of matrix columns; different matrix columns of a character prediction matrix correspond one-to-one to different units to be recognized in the sample recognition object, and each matrix column includes the value probability of its unit to be recognized for at least one prediction object. By performing dimension reduction on the column feature matrix, feature matrices of different scales (the column feature matrix and the dimension-reduced feature matrix) are obtained, a plurality of character prediction matrices of the sample recognition object at different scales are derived from them, and training the text recognition model with these character prediction matrices of different scales improves its recognition capability for text characters of different scales and thereby its recognition accuracy.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a text recognition model training apparatus according to an embodiment of the present application, and as shown in the drawing, the text recognition model training apparatus 12 includes:
a feature determination module 120, configured to determine a column feature matrix of the sample identification object;
the character prediction module 121 is configured to determine, according to the column feature matrix, a plurality of character prediction matrices of the sample recognition object, where the number of matrix columns included in the plurality of character prediction matrices is not all the same, each matrix column of the character prediction matrices has a corresponding unit to be recognized in the sample recognition object, and a matrix column of the character prediction matrix includes a value probability of the unit to be recognized for at least one prediction object;
a loss determining module 122, configured to determine a model prediction loss parameter of a text recognition model according to the tagging tag sequence of the sample recognition object and the plurality of character prediction matrices, where the model prediction loss parameter is used to adjust a model parameter of the text recognition model;
and the parameter adjusting module 123 is configured to adjust the model parameters of the text recognition model according to the model prediction loss parameters.
Optionally, the feature determining module 120 is specifically configured to:
performing convolution feature extraction on the sample identification object to obtain a convolution feature matrix of the sample identification object, wherein the convolution feature matrix comprises a plurality of matrix columns;
inputting each matrix column of the convolution feature matrix into the recurrent neural network in turn, in the order in which the columns are arranged in the convolution feature matrix, as the input sequence of the recurrent neural network at successive moments;
determining the column feature matrix according to the output sequences of the recurrent neural network for the different input sequences, where a target matrix column in the column feature matrix fuses the features of the matrix columns other than the target matrix column (a sketch of this arrangement follows this list).
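Under stated assumptions (PyTorch, a toy CNN backbone, and a bidirectional LSTM standing in for the recurrent neural network), a minimal sketch of this column-to-timestep arrangement:

```python
import torch
import torch.nn as nn

class ColumnFeatureExtractor(nn.Module):
    def __init__(self, channels=64, height=8, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((height, None)),  # fix the feature-map height only
        )
        # bidirectional, so a target column fuses features of the other columns
        self.rnn = nn.LSTM(channels * height, hidden, bidirectional=True)

    def forward(self, images):
        # images: (batch, 1, H, W) sample recognition objects
        fmap = self.cnn(images)   # convolution feature matrix: (batch, C, height, columns)
        b, c, h, w = fmap.shape
        # one input vector per matrix column, in column order = one RNN timestep each
        seq = fmap.permute(3, 0, 1, 2).reshape(w, b, c * h)
        out, _ = self.rnn(seq)    # (columns, batch, 2 * hidden)
        return out                # the column feature matrix

extractor = ColumnFeatureExtractor()
cols = extractor(torch.randn(2, 1, 32, 100))
print(cols.shape)                 # torch.Size([100, 2, 512])
```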
Optionally, the character prediction matrix includes a first character prediction matrix and a second character prediction matrix;
the character prediction module 121 is specifically configured to:
inputting the column feature matrix into the character classification network to obtain the first character prediction matrix of the sample recognition object;
and performing dimension reduction on every adjacent first number of matrix columns in the first character prediction matrix to obtain the second character prediction matrix.
Optionally, the number of matrix columns in the first character prediction matrix is an integral multiple of the first number, and the first number is an integer greater than 1.
Optionally, the character prediction matrix includes a third character prediction matrix and a fourth character prediction matrix:
the character prediction module 121 is specifically configured to:
performing dimension reduction on every adjacent second number of matrix columns in the column feature matrix to obtain the dimension-reduced feature matrix of the sample recognition object;
inputting the column feature matrix into the character classification network to obtain the third character prediction matrix;
and inputting the dimension-reduced feature matrix into the character classification network to obtain the fourth character prediction matrix.
Optionally, the number of matrix columns in the column feature matrix is an integral multiple of the second number, and the second number is an integer greater than 1.
Optionally, the dimension reduction process includes one of an average dimension reduction process or a maximum dimension reduction process.
Optionally, the loss determining module 122 is specifically configured to:
determining a subentry prediction loss parameter corresponding to each character prediction matrix according to the label tag sequence of the sample identification object;
and determining the model prediction loss parameter according to the subentry prediction loss parameter.
Optionally, when determining the model prediction loss parameter according to the subentry prediction loss parameters, the loss determining module 122 is specifically configured to:
perform weighted summation of the subentry prediction loss parameters according to the respective loss weight of each subentry prediction loss parameter to obtain the model prediction loss parameter.
Optionally, when determining the subentry prediction loss parameter corresponding to each character prediction matrix according to the label tag sequence of the sample recognition object, the loss determining module 122 is specifically configured to:
determining at least one target prediction sequence meeting the matching condition of the label tag sequence from candidate sequences, wherein the candidate sequences comprise at least one sequence object, and the sequence object is the prediction object corresponding to the value probability in a matrix column of the character prediction matrix;
determining a first prediction probability corresponding to the target prediction sequence according to the value probability in the matrix column of the character prediction matrix;
and determining the subentry prediction loss parameter according to the first prediction probability.
Optionally, the matching condition of the label tag sequence includes:
in the case that the candidate sequence contains only text characters, a sequence identical to the label tag sequence is obtained after each run of consecutive identical text characters in the candidate sequence is merged into a single character;
or, in the case that the candidate sequence contains text characters and virtual spacers, a sequence identical to the label tag sequence is obtained after each run of consecutive identical text characters is merged into a single character and the virtual spacers are deleted (a sketch of this collapse rule follows).
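This matching condition mirrors the collapse rule of CTC decoding. The sketch below (plain Python; the '-' spacer symbol, toy probabilities and brute-force enumeration are hypothetical illustrations) shows both the collapse and the first prediction probability summed over all matching target prediction sequences, as described in the module above.

```python
from itertools import product

BLANK = "-"  # hypothetical symbol for the virtual spacer

def collapse(candidate):
    """Merge runs of identical characters, then delete virtual spacers."""
    merged = []
    for ch in candidate:
        if not merged or ch != merged[-1]:
            merged.append(ch)
    return "".join(ch for ch in merged if ch != BLANK)

def first_prediction_probability(columns, label):
    """Sum the probabilities of all candidate sequences matching `label`.

    columns: one {prediction object: value probability} dict per matrix
    column. Brute force is exponential in the column count; practical
    systems use the CTC forward algorithm instead.
    """
    total = 0.0
    for candidate in product(*(col.keys() for col in columns)):
        if collapse(candidate) == label:
            p = 1.0
            for col, ch in zip(columns, candidate):
                p *= col[ch]
            total += p
    return total

cols = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}, {"b": 0.9, BLANK: 0.1}]
print(collapse("aab"), collapse("a-b"))          # ab ab
print(first_prediction_probability(cols, "ab"))  # ~0.72 = "aab" + "a-b" + "-ab"
```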
In a specific implementation, the text recognition model training device may execute, through each built-in functional module thereof, each step in the text recognition model training method shown in fig. 4, fig. 6, or fig. 10, and specific implementation details may refer to implementation details of each step in the embodiment corresponding to fig. 4, fig. 6, or fig. 10, which are not described herein again.
In the embodiment of the application, after the feature determination module determines the column feature matrix of the sample recognition object, the character prediction module determines a plurality of character prediction matrices of the sample recognition object according to the column feature matrix; the loss determining module then determines the model prediction loss parameter of the text recognition model according to the label tag sequence of the sample recognition object and the plurality of character prediction matrices, and the parameter adjusting module adjusts the model parameters of the text recognition model according to the model prediction loss parameter. The character prediction matrices do not all include the same number of matrix columns; different matrix columns of a character prediction matrix correspond one-to-one to different units to be recognized in the sample recognition object, and each matrix column includes the value probability of its unit to be recognized for at least one prediction object. Training the text recognition model with a plurality of character prediction matrices of the sample recognition object at different scales improves its recognition capability for text characters of different scales and thereby its recognition accuracy.
The embodiment of the application further provides a text recognition method, which may include the following steps: inputting a target recognition object into a text recognition model obtained by training with the text recognition model training method shown in any one of fig. 4, fig. 6 or fig. 10; and acquiring the text recognition result output by the text recognition model.
Alternatively, the target recognition object may include one of an image recognition object, an audio recognition object, or a video recognition object, such as a license plate number, voice chat information, and the like.
Optionally, the text recognition result includes a character or a character string contained in the target recognition object. The character or character string may specifically include one or more of letters, numbers and symbols.
In the embodiment of the application, a plurality of character prediction matrices can be obtained for the target recognition object, and the combination of text characters with the maximum prediction probability is then output according to the plurality of character prediction matrices. The plurality of character prediction matrices of the target recognition object enrich the prediction results available for the target recognition object and thereby improve the recognition accuracy of the text recognition model.
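One plausible reading of this inference step, sketched under the same assumptions as above (greedy best-path decoding per matrix, then selecting the branch whose best path has the highest probability); the application does not pin down the exact combination rule, so recognize() is a hypothetical helper reusing collapse() from the previous sketch.

```python
import math

def best_path(columns):
    """Greedy decoding: take the most probable prediction object per column."""
    chars, log_p = [], 0.0
    for col in columns:
        ch, p = max(col.items(), key=lambda kv: kv[1])
        chars.append(ch)
        log_p += math.log(p)
    return collapse(chars), log_p   # collapse() from the previous sketch

def recognize(prediction_matrices):
    # prediction_matrices: the character prediction matrices of the target
    # recognition object at different scales, each a list of column dicts
    candidates = [best_path(m) for m in prediction_matrices]
    text, _ = max(candidates, key=lambda c: c[1])
    return text
```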
An embodiment of the present application further provides a text recognition apparatus, where the apparatus may include: an input module, configured to input the target recognition object into a text recognition model obtained by training with the text recognition model training method shown in any one of fig. 4, fig. 6 or fig. 10; and an acquisition module, configured to acquire the text recognition result output by the text recognition model.
Alternatively, the target recognition object may include one of an image recognition object, an audio recognition object, or a video recognition object, such as a license plate number, voice chat information, and the like.
Optionally, the text recognition result includes a character or a character string contained in the target recognition object. The character or character string may specifically include one or more of letters, numbers and symbols.
In the embodiment of the application, the input module inputs the target recognition object into the text recognition model, the text recognition model can obtain a plurality of character prediction matrixes for the target recognition object, and the acquisition module can acquire the combination of the text characters with the maximum prediction probability output by the text recognition model according to the character prediction matrixes. Through the character prediction matrixes of the target recognition object, the diversity of the text recognition model on the prediction result of the target recognition object is improved, and therefore the recognition accuracy of the text recognition model is improved.
Referring to fig. 13, fig. 13 is a schematic structural diagram of another text recognition model training device according to an embodiment of the present application. As shown in fig. 13, the text recognition model training device 130 may include: at least one processor 1301 (e.g., a CPU), at least one network interface 1304, a user interface 1303, a memory 1305, and at least one communication bus 1302. The communication bus 1302 is used to implement connection and communication between these components. The user interface 1303 may include a display and a keyboard; optionally, it may also include a standard wired interface and a standard wireless interface. The network interface 1304 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1305 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1305 may also be at least one storage device located remotely from the processor 1301. As shown in fig. 13, the memory 1305, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the text recognition model training device 130 shown in fig. 13, the network interface 1304 is mainly used for connecting to a server that provides the object to be recognized or a server that receives the recognition result; the user interface 1303 is mainly an interface through which the user provides input; and the processor 1301 may be used to invoke the device control application program stored in the memory 1305 to implement:
determining a column feature matrix of a sample recognition object;
determining a plurality of character prediction matrices of the sample recognition object according to the column feature matrix, where the plurality of character prediction matrices do not all include the same number of matrix columns, different matrix columns in a character prediction matrix correspond one-to-one to different units to be recognized in the sample recognition object, and each matrix column of a character prediction matrix includes the value probability of the unit to be recognized for at least one prediction object;
and determining a model prediction loss parameter of the text recognition model according to the label tag sequence of the sample recognition object and the plurality of character prediction matrices, where the model prediction loss parameter is used to adjust the model parameters of the text recognition model.
It should be understood that the text recognition model training device 130 described in this embodiment may perform the description of the text recognition model training method in the embodiment corresponding to fig. 4, fig. 6, or fig. 10, and may also perform the description of the text recognition model training device 12 in the embodiment corresponding to fig. 12, which is not repeated herein. In addition, the beneficial effects of the same method are not described in detail.
Referring to fig. 14, fig. 14 is a schematic structural diagram of another text recognition device according to an embodiment of the present application. As shown in fig. 14, the text recognition device 140 may include: at least one processor 1401 (e.g., a CPU), at least one network interface 1404, a user interface 1403, a memory 1405, and at least one communication bus 1402. The communication bus 1402 is used to implement connection and communication between these components. The user interface 1403 may include a display and a keyboard; optionally, it may also include a standard wired interface and a standard wireless interface. The network interface 1404 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1405 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1405 may also be at least one storage device located remotely from the processor 1401. As shown in fig. 14, the memory 1405, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the text recognition device 140 shown in fig. 14, the network interface 1404 is mainly used for connecting to a server that provides the object to be recognized or a server that receives the recognition result; the user interface 1403 is mainly an interface through which the user provides input; and the processor 1401 may be used to invoke the device control application program stored in the memory 1405 to implement:
inputting the target recognition object into a text recognition model obtained by training with the text recognition model training method shown in any one of fig. 4, fig. 6 or fig. 10; and acquiring the text recognition result output by the text recognition model.
It should be understood that the text recognition device 140 described in this embodiment may perform the description of the text recognition method in the foregoing embodiment, and may also perform the description of the text recognition apparatus in the foregoing embodiment, which is not repeated here. In addition, the beneficial effects of using the same method are not described again.
It should further be noted that an embodiment of the present application also provides a computer storage medium, in which the computer program executed by the aforementioned text recognition model training apparatus 12 or the computer program executed by the aforementioned text recognition apparatus is stored, the computer program including program instructions. When a processor executes the program instructions, it can perform the text recognition model training method described in the embodiment corresponding to fig. 4, fig. 6 or fig. 10, or the text recognition method described in the foregoing embodiment, which is therefore not repeated here. In addition, the beneficial effects of using the same method are not described again. For technical details not disclosed in the computer storage medium embodiments of the present application, refer to the description of the method embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; all equivalent variations and modifications made in accordance with the present application remain within its scope.

Claims (20)

1. A text recognition model training method is characterized by comprising the following steps:
determining a column feature matrix of a sample recognition object;
determining a plurality of character prediction matrices of the sample recognition object according to the column feature matrix, wherein the plurality of character prediction matrices do not all include the same number of matrix columns, each matrix column of a character prediction matrix has a corresponding unit to be recognized in the sample recognition object, and each matrix column of a character prediction matrix comprises the value probability of the unit to be recognized for at least one prediction object;
determining a model prediction loss parameter of a text recognition model according to a label tag sequence of the sample recognition object and the plurality of character prediction matrices;
and adjusting model parameters of the text recognition model according to the model prediction loss parameter.
2. The method of claim 1, wherein the determining the column feature matrix of the sample recognition object comprises:
performing convolution feature extraction on the sample identification object to obtain a convolution feature matrix of the sample identification object, wherein the convolution feature matrix comprises a plurality of matrix columns;
inputting each matrix column of the convolution feature matrix into a recurrent neural network in turn, in the order in which the columns are arranged in the convolution feature matrix, as the input sequence of the recurrent neural network at successive moments;
determining the column feature matrix according to the output sequences of the recurrent neural network for the different input sequences, wherein a target matrix column in the column feature matrix fuses the features of the matrix columns other than the target matrix column in the column feature matrix.
3. The method of claim 1 or 2, wherein the character prediction matrices comprise a first character prediction matrix and a second character prediction matrix;
the determining the plurality of character prediction matrices of the sample recognition object according to the column feature matrix comprises:
inputting the column feature matrix into a character classification network to obtain the first character prediction matrix of the sample recognition object;
and performing dimension reduction on every adjacent first number of matrix columns in the first character prediction matrix to obtain the second character prediction matrix.
4. The method of claim 3, wherein the number of matrix columns in the first character prediction matrix is an integer multiple of the first number, and wherein the first number is an integer greater than 1.
5. The method of claim 1 or 2, wherein the character prediction matrices comprise a third character prediction matrix and a fourth character prediction matrix;
the determining the plurality of character prediction matrices of the sample recognition object according to the column feature matrix comprises:
performing dimension reduction on every adjacent second number of matrix columns in the column feature matrix to obtain a dimension-reduced feature matrix of the sample recognition object;
inputting the column feature matrix into a character classification network to obtain the third character prediction matrix;
and inputting the dimension-reduced feature matrix into the character classification network to obtain the fourth character prediction matrix.
6. The method of claim 5, wherein the number of matrix columns in the column feature matrix is an integral multiple of the second number, and the second number is an integer greater than 1.
7. The method of any of claims 3-6, wherein the dimension reduction process comprises one of a mean dimension reduction process or a maximum dimension reduction process.
8. The method of any one of claims 1-7, wherein the determining the model prediction loss parameter of the text recognition model according to the label tag sequence of the sample recognition object and the plurality of character prediction matrices comprises:
determining a subentry prediction loss parameter corresponding to each character prediction matrix according to the label tag sequence of the sample identification object;
and determining the model prediction loss parameter according to the subentry prediction loss parameter.
9. The method of claim 8, wherein the determining the model prediction loss parameter according to the subentry prediction loss parameters comprises:
and carrying out weighted summation on the subentry prediction loss parameters according to the respective loss weight of each subentry prediction loss parameter to obtain the model prediction loss parameters.
10. The method of claim 8 or 9, wherein the determining, according to the label tag sequence of the sample recognition object, the subentry prediction loss parameter corresponding to each character prediction matrix comprises:
determining at least one target prediction sequence meeting the matching condition of the label tag sequence from candidate sequences, wherein the candidate sequences comprise at least one sequence object, and the sequence object is the prediction object corresponding to the value probability in a matrix column of the character prediction matrix;
determining a first prediction probability corresponding to the target prediction sequence according to the value probability in the matrix column of the character prediction matrix;
and determining the subentry prediction loss parameter according to the first prediction probability.
11. The method of claim 10, wherein the matching condition of the label tag sequence comprises:
in the case that the candidate sequence contains only text characters, a sequence identical to the label tag sequence is obtained after each run of consecutive identical text characters in the candidate sequence is merged into a single character;
or, in the case that the candidate sequence contains text characters and virtual spacers, a sequence identical to the label tag sequence is obtained after each run of consecutive identical text characters is merged into a single character and the virtual spacers are deleted.
12. A method of text recognition, the method comprising:
inputting a target recognition object into a text recognition model obtained by training according to the text recognition model training method of any one of claims 1-11;
and acquiring a text recognition result output by the text recognition model.
13. The method of claim 12, wherein the target recognition object comprises one of an image recognition object, an audio recognition object, or a video recognition object;
and the text recognition result output by the text recognition model comprises characters or character strings contained in the target recognition object.
14. A text recognition model training apparatus, comprising:
the feature determining module is used for determining a column feature matrix of the sample recognition object;
the character prediction module is used for determining a plurality of character prediction matrices of the sample recognition object according to the column feature matrix, wherein the plurality of character prediction matrices do not all include the same number of matrix columns, each matrix column of a character prediction matrix has a corresponding unit to be recognized in the sample recognition object, and each matrix column of a character prediction matrix comprises the value probability of the unit to be recognized for at least one prediction object;
the loss determining module is used for determining model prediction loss parameters of a text recognition model according to the labeling label sequence of the sample recognition object and the character prediction matrixes, and the model prediction loss parameters are used for adjusting the model parameters of the text recognition model;
and the parameter adjusting module is used for adjusting the model parameters of the text recognition model according to the model prediction loss parameters.
15. The apparatus of claim 14, wherein the feature determination module is specifically configured to:
performing convolution feature extraction on the sample identification object to obtain a convolution feature matrix of the sample identification object, wherein the convolution feature matrix comprises a plurality of matrix columns;
inputting each matrix column of the convolution feature matrix into a recurrent neural network in turn, in the order in which the columns are arranged in the convolution feature matrix, as the input sequence of the recurrent neural network at successive moments;
determining the column feature matrix according to the output sequences of the recurrent neural network for the different input sequences, wherein a target matrix column in the column feature matrix fuses the features of the matrix columns other than the target matrix column in the column feature matrix.
16. The apparatus of claim 14 or 15, wherein the character prediction matrices comprise a first character prediction matrix and a second character prediction matrix;
the character prediction module is specifically configured to:
inputting the column feature matrix into a character classification network to obtain the first character prediction matrix of the sample recognition object;
and performing dimension reduction on every adjacent first number of matrix columns in the first character prediction matrix to obtain the second character prediction matrix.
17. A text recognition apparatus, characterized in that the apparatus comprises:
an input module, configured to input a target recognition object into a text recognition model trained by the text recognition model training method according to any one of claims 1 to 11;
and the acquisition module is used for acquiring the text recognition result output by the text recognition model.
18. A text recognition model training apparatus, comprising: a processor and a memory;
the processor is coupled to the memory, wherein the memory is configured to store program code and the processor is configured to call the program code to perform the method of any of claims 1-11.
19. A text recognition apparatus, comprising: a processor and a memory;
the processor is coupled to the memory, wherein the memory is configured to store program code and the processor is configured to call the program code to perform the method of any of claims 12-13.
20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method of any of claims 1-11, or the method of any of claims 12-13.
CN202010106741.3A 2020-02-20 2020-02-20 Text recognition model training and text recognition method, device and medium Pending CN111428750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010106741.3A CN111428750A (en) 2020-02-20 2020-02-20 Text recognition model training and text recognition method, device and medium

Publications (1)

Publication Number Publication Date
CN111428750A true CN111428750A (en) 2020-07-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240510