CN113435451A - Model, training method and device of model, and recognition and device of character sequence - Google Patents


Info

Publication number
CN113435451A
CN113435451A (application CN202110718174.1A)
Authority
CN
China
Prior art keywords
module
recognition model
sequence
character
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110718174.1A
Other languages
Chinese (zh)
Inventor
谢念
王靓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110718174.1A priority Critical patent/CN113435451A/en
Publication of CN113435451A publication Critical patent/CN113435451A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The application relates to the technical field of artificial intelligence, and in particular to a recognition model capable of recognizing character sequences. The model comprises an encoding module, configured to obtain context features from input data, and a first decoding module, configured to obtain a recognized character sequence from the context features. The first decoding module comprises: a character position prediction module, which can obtain a feature map from the context features, the feature map reflecting the position information of the characters in the character sequence; and a first sequence processing module, which can obtain the character sequence from the context features and the feature map. The recognition model can be trained by knowledge distillation from a serial attention-based sequence decoder, so that it achieves high character recognition accuracy while maintaining relatively high operational efficiency.

Description

Model, training method and device of model, and recognition and device of character sequence
Technical Field
The present application relates to image processing technology in the field of artificial intelligence, and in particular, to a model, a method and an apparatus for training a model, a method and an apparatus for recognizing a character sequence, a server, a computing device, a computer-readable storage medium, and a computer program product.
Background
Image-based character recognition refers to the process of recognizing the character shapes in an image containing characters as characters. The image may be captured by an electronic device such as a scanner, a digital camera, a video camera, or a mobile phone, or may be obtained directly as a received image file.
The conventional character recognition technology is optical character recognition (OCR). With the development of artificial intelligence, techniques have emerged for recognizing characters in an image through a deep learning model (also called a deep neural network). The accuracy of a deep neural network generally increases with the complexity of the network, but its operational efficiency decreases at the same time.
When characters in an image are recognized based on a deep neural network, achieving both high recognition accuracy and high operational efficiency is a technical problem to be solved.
Disclosure of Invention
In view of the above problems of the prior art, the present application provides a model, a training method and apparatus for the model, a character sequence recognition method and apparatus, a server, a computing device, a computer-readable storage medium, and a computer program product, which maintain relatively high operational efficiency while achieving high recognition accuracy when the model of the present application is used for character sequence recognition.
To achieve the above object, a first aspect of the present application provides a recognition model, including: the encoding module is used for obtaining context characteristics according to input data; a first decoding module, configured to obtain a recognized character sequence according to the context feature, where the first decoding module includes: the character position prediction module is used for obtaining a feature map according to the context features, and the feature map reflects the position information of characters in the character sequence; and the first sequence processing module is used for obtaining the character sequence according to the context characteristics and the characteristic diagram.
Therefore, the recognition model predicts the recognized character sequence from the feature map and the context features instead of using a serial attention-based decoder, and thus has higher operational efficiency. Moreover, the character position prediction module can be trained by knowledge distillation from a decoder that uses serial attention, so the model can also achieve high recognition accuracy.
As a possible implementation manner of the first aspect, the character position prediction module includes a cascade of a downsampling convolutional layer, a fully connected layer, and an upsampling convolutional layer.
The outer layers of the encoding part and the decoding part of the character position prediction module use CNN layers, and the inner layer uses a fully connected layer to connect the encoding part and the decoding part. The reason is that the outer encoding and decoding networks are used to extract and generate local information, while the inner fully connected network is used to extract and generate global information. If CNNs were also used in the inner layer, multiple CNN layers would be needed to provide a sufficiently large global receptive field, whereas a fully connected layer obtains a global receptive field easily; and because the feature size at the inner layer is small, the number of parameters introduced by the fully connected layer is not large. Therefore, the structure of the character position prediction module provided by the embodiment of the present application has the characteristic of high operational efficiency.
As a possible implementation manner of the first aspect, the first decoding module further includes a parallel-based attention module, configured to obtain an updated feature map according to the feature map and the context feature information; the first sequence processing module is specifically configured to obtain a character sequence according to the context feature and the updated feature map.
Therefore, the feature map output by the character position prediction module is processed based on the parallel attention module, and the feature map is refined, so that the accuracy of reflecting the position information of the characters in the character sequence by the feature map is improved.
As a possible implementation of the first aspect, the first decoding module specifically comprises a cascade of two or more parallel-based attention modules.
From the above, the more parallel-based attention modules are cascaded, that is, the more layers the network structure has, the more accurately the feature map is refined layer by layer, and the more accurately the processed feature map reflects the position information of the characters in the character sequence.
As a possible implementation manner of the first aspect, the parallel-based attention module is further cascaded with a second sequence processing module, and the second sequence processing module is configured to process the feature map to be input to the parallel-based attention module.
Therefore, by adding a plurality of second sequence processing modules, the number of layers of the network (or referred to as increasing the depth of the network) can be increased, so that the trained network has higher accuracy in identifying the sequence.
As a possible implementation manner of the first aspect, an image correction module is further cascaded before the encoding module, and is used for correcting data of an input image.
The image correction module can be implemented by a neural network with spatial transformation invariance. Image correction facilitates the processing of subsequent modules and improves the recognition accuracy of characters in the image.
A second aspect of the present application provides a training method for a model, where the model is a first recognition model formed by any one of the recognition models provided in the first aspect, and the training method includes: training a second recognition model, wherein the second recognition model comprises a cascade of coding modules and a serial attention-based sequence decoder; freezing network parameters of an encoding module and a serial attention-based sequence decoder; the first recognition model is trained based on knowledge distillation using the second recognition model.
Therefore, training the first recognition model in this training mode realizes the training of its character position prediction module. Because the second recognition model, which uses a serial attention-based sequence decoder, is used to train the first recognition model, the first recognition model attains a sequence recognition accuracy basically equivalent to that of a traditional serial attention-based sequence decoder; and because the first recognition model does not itself use a serial attention-based sequence decoder, it can maintain high operational efficiency.
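To make the two training stages concrete, the following is a minimal sketch in PyTorch, assuming the teacher (second recognition model) and the student (first recognition model) share the same encoding module object and expose illustrative flags such as `return_attention`; none of these names come from the application itself.

```python
import torch

def train_two_stage(teacher, student, loader, distill_loss, epochs=1, lr=1e-4):
    # Stage 1: train the second recognition model (encoder + serial attention
    # decoder) with an ordinary cross-entropy loss on the labelled characters.
    ce = torch.nn.CrossEntropyLoss()
    opt_t = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            logits = teacher(images)                       # (B, T, num_classes)
            loss = ce(logits.flatten(0, 1), labels.flatten())
            opt_t.zero_grad()
            loss.backward()
            opt_t.step()

    # Freeze the encoding module and the serial attention-based sequence decoder.
    # Because the encoder is shared, the student's encoder is frozen as well.
    for p in teacher.parameters():
        p.requires_grad_(False)
    teacher.eval()

    # Stage 2: train the remaining (decoding) parts of the first recognition
    # model by knowledge distillation from the frozen second recognition model.
    opt_s = torch.optim.Adam(
        [p for p in student.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                t_out = teacher(images, return_attention=True)   # assumed flag
            s_out = student(images, return_feature_map=True)     # assumed flag
            loss = distill_loss(s_out, t_out, labels)
            opt_s.zero_grad()
            loss.backward()
            opt_s.step()
```

The form of the distillation loss itself is discussed in the implementation manners below.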
As a possible implementation manner of the second aspect, training the first recognition model based on knowledge distillation using the second recognition model includes: performing knowledge distillation on the feature map to obtain a first parameter; performing knowledge distillation on an attention matrix formed by the serial attention-based sequence decoder to obtain a second parameter; and training the first recognition model based on a difference between the first parameter and the second parameter.
From the above, the feature map reflects the position information of the characters in the character sequence, and the attention matrix may also reflect the position information of the characters in the predicted character sequence, so the knowledge distillation described above may be employed to train the first recognition model. The first parameter and the second parameter obtained by knowledge distillation relate to the distribution of all or part of the elements in the feature map or the attention matrix. The values of these elements may be, for example, values within a certain threshold, or, in other embodiments, the values of elements obtained by pooling the feature map matrix, where the pooling may use a maximum or a mean mode. In some possible implementations, the parameters may directly be the values of the elements, or may be values obtained by weighting the element values with a certain coefficient, for example values weighted by the temperature value T used in the distillation process, values obtained by normalizing the elements, or values obtained by weighting and then normalizing the elements.
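As one hedged illustration of distilling the position information, the sketch below treats the first and second parameters as temperature-softened distributions over positions, one per decoding step, and trains on their divergence; the tensor shapes and the temperature are assumptions, not values taken from the application.

```python
import torch.nn.functional as F

def position_distill_loss(student_feature_map, teacher_attention, T=2.0):
    # student_feature_map: (B, steps, positions) from the character position
    #                      prediction module (or its refined version)
    # teacher_attention:   (B, steps, positions) attention matrix of the
    #                      serial attention-based sequence decoder
    p_student = F.log_softmax(student_feature_map / T, dim=-1)   # "first parameter"
    p_teacher = F.softmax(teacher_attention / T, dim=-1)          # "second parameter"
    # Train the first recognition model on the difference between the two.
    return F.kl_div(p_student, p_teacher, reduction="batchmean") * (T * T)
```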
As a possible implementation manner of the second aspect, the knowledge distillation of the characteristic diagram to obtain the first parameter specifically includes: knowledge distillation of the updated feature map obtained using the parallel-based attention module is performed to obtain the first parameter.
Therefore, when the first recognition model includes a plurality of cascaded parallel-based attention modules, these modules gradually refine the feature map. In this case, instead of distilling the original feature map to obtain the first parameter, a refined feature map may be distilled, for example the feature map refined by the last parallel-based attention module.
As a possible implementation manner of the second aspect, the obtained first parameter and the second parameter are related to position information of characters in the character sequence.
From the above, since the feature map reflects the position information of the characters in the character sequence, and the attention matrix may also reflect the position information of the characters in the predicted character sequence, the obtained first parameter and second parameter may relate to the position information of the characters in the character sequence.
As a possible implementation manner of the second aspect, training the first recognition model based on knowledge distillation using the second recognition model includes: knowledge distillation is carried out on the character sequence output by the first recognition model to obtain a third parameter; knowledge distillation is carried out on the character sequence output by the second recognition model to obtain a fourth parameter; the first recognition model is trained based on a difference between the third parameter and the fourth parameter.
In this way, the character content prediction information can be distilled, and the third parameter and the fourth parameter obtained by distillation may be related to the code vector corresponding to the character sequence, the confidence value corresponding to each code vector, the probability distribution of the sequence in the sequence set, or the like.
As a possible implementation manner of the second aspect, the obtained third parameter and the fourth parameter are related to a probability distribution of characters in the character sequence in the character set.
From the above, the third parameter may, in one way, relate to the probability distribution of the sequence output by the first recognition model over the sequence set. For example, the probability distribution of each sequence in the sequence set may be the logits of the output sequence; the logits are the values fed into softmax in the neural network and reflect the probability distribution of each character over the whole character set, and they generally correspond to the output of the fully connected layer (the softmax layer is connected after the fully connected layer).
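A sketch of distilling the character-content prediction under the same assumptions: the third and fourth parameters are taken as temperature-weighted distributions of the logits over the character set, and a hard-label cross-entropy term on the sample data (see the implementation manner that follows) is optionally mixed in. The weight and temperature are illustrative.

```python
import torch.nn.functional as F

def content_distill_loss(student_logits, teacher_logits, labels=None, T=2.0, alpha=0.5):
    # student_logits, teacher_logits: (B, seq_len, charset_size) pre-softmax outputs
    p_s = F.log_softmax(student_logits / T, dim=-1)    # "third parameter"
    p_t = F.softmax(teacher_logits / T, dim=-1)         # "fourth parameter"
    soft = F.kl_div(p_s, p_t, reduction="batchmean") * (T * T)
    if labels is None:
        return soft
    # Optional hard-label term when training on sample data as well.
    hard = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    return alpha * soft + (1.0 - alpha) * hard
```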
As a possible implementation manner of the second aspect, the method further includes: training the first recognition model based on sample data. As a possible implementation manner of the second aspect, the first recognition model and the second recognition model each further include an image correction module cascaded before the encoding module; training the second recognition model further comprises training the image correction module; and freezing the network parameters also includes freezing the network parameters of the image correction module.
Therefore, the training of the image correction module can be realized at the same time during the training.
A third aspect of the present application provides a training apparatus for a model, the model being a first recognition model formed by any one of the recognition models provided in the first aspect, the training apparatus comprising: the training module is used for training a second recognition model, and the second recognition model comprises a coding module and a serial attention-based sequence decoder which are cascaded; the configuration module is used for freezing network parameters of the coding module and the serial attention-based sequence decoder; the training module is further configured to train the first recognition model based on knowledge distillation using the second recognition model.
As a possible implementation manner of the third aspect, the training module is specifically configured to: knowledge distillation is carried out on the characteristic diagram to obtain a first parameter; knowledge distillation is carried out on an attention matrix formed by a serial attention-based sequence decoder to obtain second parameters; the first recognition model is trained based on a difference between the first parameter and the second parameter.
As a possible implementation manner of the third aspect, the knowledge distillation of the characteristic diagram to obtain the first parameter specifically includes: knowledge distillation is performed on feature maps obtained using a parallel-based attention module.
As a possible implementation manner of the third aspect, the obtained first parameter and the second parameter relate to position information of a character in the character sequence.
As a possible implementation manner of the third aspect, the training module is specifically configured to: knowledge distillation is carried out on the character sequence output by the first recognition model to obtain a third parameter; knowledge distillation is carried out on the character sequence output by the second recognition model to obtain a fourth parameter; the first recognition model is trained based on a difference between the third parameter and the fourth parameter.
As a possible implementation manner of the third aspect, the third parameter and the fourth parameter are related to a probability distribution of characters in the character sequence in the character set.
As a possible implementation manner of the third aspect, the training module is further configured to train the first recognition model based on the sample data.
As a possible implementation manner of the third aspect, the first recognition model and the second recognition model further respectively include an image correction module cascaded before the encoding module; the training module is also used for training the image correction module; the configuration module is also used for freezing the network parameters of the image correction module.
A fourth aspect of the present application provides a method for recognizing a character sequence, where the method includes: acquiring input data; obtaining context characteristics according to the data by utilizing an encoding module; obtaining a feature map by using a character position prediction module according to the context features, wherein the feature map reflects the position information of characters in the character sequence; and acquiring a character sequence according to the context characteristics and the characteristic diagram by utilizing a first sequence processing module. In some possible implementations, specifically, any one of the recognition models provided in the first aspect may be used to perform the recognition of the character sequence.
A fifth aspect of the present application provides a device for recognizing a character sequence, the device comprising: the acquisition module is used for acquiring input data; the recognition module is used for acquiring context characteristics according to the data by using the coding module, acquiring a characteristic diagram according to the context characteristics by using the character position prediction module, and acquiring a character sequence according to the context characteristics and the characteristic diagram by using the first sequence processing module, wherein the characteristic diagram reflects the position information of characters in the character sequence. In some possible implementations, specifically, any one of the recognition models provided in the first aspect may be used to perform the recognition of the character sequence.
Thus, the recognition of character sequences in images using the recognition model described above can be achieved.
A sixth aspect of the present application provides a server comprising: a processor, a memory; wherein the memory is for storing program instructions that, when executed by the processor, cause the server to implement any one of the methods provided in the second aspect of the present application, or to implement the method provided in the fourth aspect of the present application.
A seventh aspect of the present application provides a computing device comprising: a processor, a memory; wherein the memory is configured to store program instructions that, when executed by the processor, cause the computing device to implement any one of the methods provided in the second aspect of the present application, or wherein the program instructions, when executed by the processor, cause the computing device to implement the method provided in the fourth aspect of the present application.
An eighth aspect of the present application provides a computer readable storage medium having stored thereon program instructions which, when executed by a computer, cause the computer to carry out any one of the methods provided by the second aspect of the present application, or which, when executed by a computer, cause the computer to carry out the method provided by the fourth aspect of the present application.
A ninth aspect of the present application provides a computer program product comprising instructions stored thereon which, when run on a computer, cause the computer to carry out any one of the methods provided by the second aspect of the present application or alternatively to carry out the method provided by the fourth aspect of the present application.
Drawings
FIG. 1a is a schematic diagram of a recognition model provided by an embodiment of the present application;
FIG. 1b is a schematic diagram of an encoding module provided in an embodiment of the present application;
FIG. 1c is a diagram of an embodiment of an image encoding module according to an embodiment of the present disclosure;
FIG. 1d is a diagram illustrating an embodiment of a sequence coding module according to the present application;
fig. 1e is a schematic diagram of an embodiment of an encoding module according to the present application;
FIG. 1f is a diagram illustrating an embodiment of an image correction module according to an embodiment of the present disclosure;
FIG. 2a is a schematic diagram of a framework for training a first recognition module according to an embodiment of the present application;
FIG. 2b is a flowchart of a recognition model training method provided by an embodiment of the present application;
FIG. 2c is a schematic view of an embodiment of FIG. 2 a;
FIG. 2d is a schematic diagram of a serial-based attention mechanism provided by an embodiment of the present application;
FIG. 2e is a flow chart of training a first recognition model based on knowledge distillation provided by an embodiment of the present application;
FIG. 2f is a flow chart of training a first recognition model based on knowledge distillation as provided in another embodiment of the present application;
FIG. 3 is a schematic diagram of a training apparatus for a model provided in an embodiment of the present application;
fig. 4a is a flowchart of a text recognition method provided in an embodiment of the present application;
fig. 4b is a schematic diagram of a text recognition apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of experimental data for recognition of text in an image according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a server provided in an embodiment of the present application.
It should be understood that the dimensions and forms of the various blocks in the block diagrams described above are for reference only and should not be construed as exclusive of the embodiments of the present application. The relative positions and the inclusion relations among the blocks shown in the structural schematic diagram are only used for schematically representing the structural associations among the blocks, and do not limit the physical connection manner of the embodiment of the application.
Detailed Description
The technical solution provided by the present application is further described below by referring to the drawings and the embodiments. It should be understood that the system structure and the service scenario provided in the embodiments of the present application are mainly for illustrating possible implementation manners of the technical solutions of the present application, and should not be construed as the only limitations on the technical solutions of the present application. As can be known to those skilled in the art, with the evolution of the system structure and the appearance of new service scenarios, the technical solution provided in the present application is also applicable to similar technical problems.
It should be understood that the schemes for character sequence recognition provided by the embodiments of the present application include models, training methods and apparatuses for models, recognition and apparatuses for character sequences, servers, computing devices, computer-readable storage media, and computer program products. Since the principles of solving the problems of these solutions are the same or similar, some of the repeated parts may not be repeated in the following descriptions of the specific embodiments, but it should be understood that these specific embodiments are referred to and can be combined with each other.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. In the case of inconsistency, the meaning described in the present specification or the meaning derived from the content described in the present specification shall control. In addition, the terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
When recognizing characters in an image, one technical scheme is to use an attentional scene text recognizer with flexible rectification (ASTER). The basic principle of this scheme is as follows: an input image containing text is corrected through an image correction module (which may use an STN network); an image feature sequence is extracted from the corrected image through a convolutional neural network; the extracted image feature sequence is encoded by a sequence model (sequence encoding); the sequence-encoded features are decoded by a traditional serial attention-based sequence decoder; and each character in the image is output. Conventional serial attention-based sequence decoders include, for example, attention-based Recurrent Neural Networks (RNN), attention-based Long Short-Term Memory (LSTM), and Bi-directional LSTM (Bi-LSTM).
The above technical solution adopts a conventional serial attention-based sequence decoder, and the decoding process adopts an autoregressive mode, that is: when decoding the current character, it needs to rely on the information of the previous character. For this solution, the computational complexity of the decoder increases as the string length increases, thus causing a longer decoding delay.
Another technical scheme for recognizing characters in images is described in "Towards Accurate Scene Text Recognition with Semantic Reasoning Networks". This scheme uses a parallel attention-based sequence decoder so that all characters can be predicted simultaneously. Its basic principle is as follows: the input image containing text is encoded through a Backbone Network to obtain an image feature sequence; the features of all characters are predicted simultaneously from the image feature sequence by a parallel attention-based sequence decoder; the character features are semantically modeled by a Global Semantic Reasoning Module (GSRM) to obtain semantic features; the character features and the semantic features are fused in a Fusion module; and all the characters in the image are output.
This scheme adopts a parallel attention-based sequence decoder, so all characters can be predicted simultaneously, the computational complexity of the decoder does not increase with the length of the character string, and the latency is low. However, because the decoder does not depend on the information of previous characters, the accuracy of a parallel attention-based sequence decoder is generally lower than that of a traditional serial attention-based sequence decoder. Since the accuracy alone is not satisfactory, a large amount of post-processing is added (for example, the semantic reasoning module and the fusion module), which makes the whole model very large and its operational efficiency low.
The embodiment of the present application provides an improved character sequence recognition scheme. The model used has an encoder-decoder structure, in which the character position prediction module in the decoder can be trained, in a knowledge distillation manner, on knowledge extracted from a serial attention-based sequence decoder, thereby improving recognition accuracy under a given model size limitation. In addition, when applied to recognizing characters in an image, no heavy post-processing is needed, so the operational efficiency is relatively high.
The character sequence recognition scheme provided by the embodiment of the present application can be applied to scenes whose input is an image and whose output is one or more sequences. A first application is, for example, any scene of text recognition in an image, such as recognizing the characters contained in a received or captured picture of an identity card, a bank card, a license plate, a guideboard, a book, a poster, and the like; or, using Augmented Reality (AR) technology, recognizing the characters in an image and translating them instantly. A second application is, for example, recognizing characters from the text in an image and then playing them back as speech. A third application is, for example, recognizing a musical score in an image, such as recognizing a staff in the image, or recognizing the score and playing the corresponding music.
The character sequence recognition scheme provided by the embodiment of the present application is not limited to picture input. For example, in a translation scenario, the input is a character string and the output is a translated character string; in another scenario, the input is a waveform file of acquired speech and the output is a recognized character string.
For a further understanding of the present application, various embodiments of the present application will be described in detail below with reference to the accompanying drawings.
[ identification model provided in embodiments of the present application ]
Fig. 1a illustrates a recognition model provided by an embodiment of the present application, which is used for recognizing a character sequence. The recognition model includes an encoding module 100 and a first decoding module 200. The encoding module 100 is configured to obtain a context feature according to input data. The first decoding module 200 is configured to obtain a recognized character sequence according to the context feature. Wherein the first decoding module 200 includes a character position prediction module 210 and a first sequence processing module 220. The character position prediction module 210 is configured to obtain a feature map according to the context feature, where the feature map reflects position information of the character in the character sequence. The first sequence processing module 220 is configured to obtain the character sequence according to the context feature and the feature map.
In this embodiment, the recognition model may be used to recognize characters in an image, and in other embodiments, the recognition model may be used for sequence-to-sequence recognition, such as speech waveform-to-text recognition. Specifically, the purpose of the recognition model can be realized by training according to the used sample data. For example, if the sample data is an image containing characters and the label of the sample data is the content of the characters, the recognition model can be used for recognizing the characters in the image after being trained. For another example, if the sample data is data including a speech waveform, and the tag of the sample data is a character corresponding to speech, the recognition model may be used to recognize the speech waveform as a character after training, and if a module for converting speech into a speech waveform is further provided, the speech may be recognized as a character.
In this embodiment, the encoding module 100 is configured to encode the received image data into a feature sequence. In some embodiments, the encoding module 100 may be implemented using a neural network, such as a Convolutional Neural Network (CNN), Bidirectional Encoder Representations from Transformers (BERT), a Convolutional Recurrent Neural Network (CRNN), a fully connected network (FC), and the like. In some embodiments, the encoding module 100 may be implemented by a combination of multiple network modules, which may be the same or different. For example, in this embodiment, as shown in fig. 1b, the encoding module 100 includes an image encoding module 110 and a sequence encoding module 120, where the image encoding module 110 is configured to extract an image feature sequence from an image, and the sequence encoding module 120 is configured to further encode the image feature sequence into a feature sequence.
In some embodiments, the image encoding module 110 may be a CNN, BERT, CRNN, FC, etc. network, for example, when a CNN network, may include any one or more of a convolutional layer, a pooling layer, and a fully connected layer of the CNN network.
In this embodiment, the image encoding module 110 adopts a GhostNet structure, which is a deep neural network for image processing. Compared with a conventional CNN, a Ghost bottleneck (G-bneck) is used to replace the convolutional layer. The G-bneck contains a Ghost module, which first generates an intrinsic feature map Y' through a conventional convolution operation, then generates a number of shadow (ghost) feature maps Y'' from the intrinsic feature map Y' by applying simple linear operators, and the intrinsic feature map Y' and the shadow feature maps Y'' together form the feature map Y to be generated. Compared with directly generating the feature map Y by a conventional convolution operation, the convolution part of the Ghost module only generates the intrinsic feature map Y', which has a smaller size or fewer channels; features are computed jointly by convolution and linear combination rather than entirely by convolution, so the computation of the Ghost module is smaller.
In this embodiment, the GhostNet used by the image encoding module 110 includes a convolutional layer, several Ghost bottleneck (G-bneck) layers, a pooling layer, and a fully connected layer, as shown in FIG. 1c, which illustrates one embodiment of the image encoding module 110. In FIG. 1c, the expansion (#exp) indicates the expansion size, the output (#out) indicates the number of output channels, and SE indicates whether a Squeeze-and-Excitation (SE) module is used. The GhostNet network structure provided by this specific embodiment consists, in order, of: a 2D convolutional layer, several groups of G-bneck layers (groups of 2, 6, and 4 G-bnecks, among others), a 2D convolutional layer (Conv2D), a pooling layer (AvgPool), a 2D convolutional layer (Conv2D), and a fully connected layer (FC). As can be seen, the input of the GhostNet network in FIG. 1c is a 224 × 224 × 3 image, i.e., a 3-channel (RGB) image with a pixel size of 224 × 224, and the output is an image feature sequence consisting of 1000 feature values.
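For illustration only, a Ghost module along the lines described above might look as follows in PyTorch; the channel ratio and the depthwise convolution used as the cheap linear operation are assumptions, not the exact operators of the GhostNet used in FIG. 1c.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch: intrinsic feature map Y' by ordinary convolution, shadow feature
    maps Y'' by a cheap depthwise operation, concatenated to form Y.
    Assumes out_ch is even and ratio=2 for simplicity."""

    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        intrinsic_ch = out_ch // ratio                  # channels of Y'
        shadow_ch = out_ch - intrinsic_ch               # channels of Y''
        self.primary = nn.Sequential(                   # ordinary convolution -> Y'
            nn.Conv2d(in_ch, intrinsic_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(intrinsic_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(                     # cheap linear op on Y' -> Y''
            nn.Conv2d(intrinsic_ch, shadow_ch, cheap_kernel,
                      padding=cheap_kernel // 2, groups=intrinsic_ch, bias=False),
            nn.BatchNorm2d(shadow_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y_intrinsic = self.primary(x)
        y_shadow = self.cheap(y_intrinsic)
        return torch.cat([y_intrinsic, y_shadow], dim=1)  # Y = [Y', Y'']
```

With ratio=2, half of the output channels come from ordinary convolution and the other half from the cheap operation, which is where the computational saving comes from.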
In some embodiments, the sequence encoding module 120 may include any one or combination of BERT, RNN, LSTM, CNN, etc. networks.
A specific implementation of the sequence encoding module 120 in this embodiment may adopt the structure shown in fig. 1d. The sequence encoding module 120 shown in fig. 1d includes a one-dimensional convolution (1DConv) layer and a Gated Linear Unit (GLU) layer. The advantage is that the convolution modules of the convolutional layer can process data in parallel, making computation more efficient; and the convolutional layer can have multiple layers, in which case long-range relations between words can be captured, thereby better capturing more complex relations. The GLU layer splits the convolution result into two parts, applies a sigmoid activation to one part (mapping it to the interval from 0 to 1), and then performs an element-wise operation with the other part of the vector; the sigmoid function controls the information flow in the network, that is, which information is allowed to propagate onward. Therefore, the GLU layer provides both a nonlinear path (the path through the sigmoid function) and a linear path; the nonlinear path ensures the perception capability of the features, and the linear path alleviates the vanishing-gradient problem.
In the embodiment shown in fig. 1d, a residual connection structure is further adopted, that is, the input feature sequence and the output feature sequence of the sequence encoding module 120 are added together as the output of the sequence encoding module 120. The residual connection is optional; with it, the output carries both high-level and low-level features.
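A minimal sketch of such a 1D-convolution + GLU block with the optional residual connection, assuming channel-first tensors; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvGLUBlock(nn.Module):
    """Sketch of the sequence encoding block described above: a 1D convolution
    whose output is split in two, one half gated by the sigmoid of the other
    (GLU), plus an optional residual connection."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Produce 2*channels so the GLU can split the result into two halves.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        self.glu = nn.GLU(dim=1)   # a * sigmoid(b): the gating described above

    def forward(self, x):          # x: (batch, channels, sequence_length)
        return x + self.glu(self.conv(x))   # residual connection (optional)
```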
In some embodiments, the image encoding module 110 or the sequence encoding module 120 may further include an attention module, for example an attention module based on a self-attention mechanism or a multi-head attention mechanism, which may be placed after any network layer in the network structure constituting the image encoding module 110 or the sequence encoding module 120 as needed.
In some embodiments, the character position prediction module 210 may be implemented using any one of other networks such as CNN, RNN, FC, BERT, and the like, or a combination of any multiple networks.
In this embodiment, the character position prediction module 210 is formed by an encoding-decoding structure that mixes CNN and fully connected layers. The encoding part employs a CNN module, for example a convolutional layer and/or a pooling layer, and reduces (downsamples) the size of the input features layer by layer (the input features are the context features output by the encoding module 120). The encoding part and the decoding part may be directly connected, or, as in this embodiment, connected through an FC layer. The decoding part may also employ a CNN module, for example a convolutional layer and/or a pooling layer, and enlarges (upsamples) the feature size output by the encoding part layer by layer to a feature map of the desired size. In this embodiment, the enlarged feature size can be matched with the attention matrix size of the second recognition network distilled during training, so the character position prediction module 210 is also called a Feature Alignment module in this embodiment.
In this embodiment, as described above, the outer layers of the encoding part and the decoding part of the character position prediction module 210 use CNN layers, and the inner layer uses a fully connected layer to connect the encoding part and the decoding part. This is because the outer encoding and decoding networks are used to extract and generate local information, while the inner fully connected network is used to extract and generate global information: if CNNs were also used in the inner layer, multiple CNN layers would be needed to provide a sufficiently large global receptive field, whereas a fully connected layer obtains the global receptive field easily, and because the feature size at the inner layer is small, the number of parameters introduced by the fully connected layer is not large. Therefore, the structure of the character position prediction module 210 provided in this embodiment of the present application has relatively high operational efficiency. The convolutional layer used for upsampling may be referred to as a deconvolution layer.
In some embodiments, when the feature map size output by the decoding part of the character position prediction module 210 does not match the attention matrix size of the second recognition network, any one or more of an FC layer, a convolutional layer, a pooling layer, or the like may be further cascaded after the decoding part to adjust the feature map size so that it matches the attention matrix size of the second recognition network. In other embodiments, the attention matrix of the second recognition network may instead be scaled up or down using other networks such as an FC layer, a convolutional layer, a pooling layer, or any combination thereof, to match the feature map output by the character position prediction module 210. Note that the matching described here is for facilitating knowledge extraction from the matched feature map and the attention matrix of the second recognition network during network training.
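A rough sketch of the encode-FC-decode structure of the character position prediction module, assuming the context features form a 1D sequence and using illustrative sizes; the real module operates on whatever dimensions match the attention matrix of the second recognition network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharPositionPredictor(nn.Module):
    """Sketch: outer downsampling CNN, inner fully connected bottleneck for the
    global receptive field, outer upsampling (deconvolution) CNN producing a
    feature map of shape (batch, decoding_steps, positions)."""

    def __init__(self, in_ch=256, mid=128, steps=32, seq_len=64):
        super().__init__()
        self.down = nn.Sequential(                 # outer encoding CNN: downsample
            nn.Conv1d(in_ch, mid, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(mid, mid, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Inner fully connected layer; seq_len must match the input length here.
        self.bottleneck = nn.Linear(mid * (seq_len // 4), mid * (seq_len // 4))
        self.up = nn.Sequential(                   # outer decoding CNN: upsample
            nn.ConvTranspose1d(mid, mid, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose1d(mid, steps, 4, stride=2, padding=1))

    def forward(self, context):                    # context: (B, in_ch, seq_len)
        b = context.size(0)
        z = self.down(context)                     # (B, mid, seq_len // 4)
        mid_ch, red_len = z.size(1), z.size(2)
        z = self.bottleneck(z.flatten(1))          # global receptive field via FC
        z = z.view(b, mid_ch, red_len)
        fmap = self.up(z)                          # (B, steps, seq_len)
        return F.softmax(fmap, dim=-1)             # per-step distribution over positions
```

If the produced map still does not match the teacher's attention matrix size, an extra FC layer, convolution, or interpolation step can be appended, as described above.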
In this embodiment, the first sequence processing module 220 obtains the character sequence according to the context feature output by the encoding module 100 and the feature map output by the character position predicting module 210, and for this reason, the context feature output by the encoding module 100 includes feature information of each character, and the feature map reflects position information of characters in the character sequence, so that the first sequence processing module 220 predicts characters at each position based on the feature information of each character in combination with the position information of each character, and the predicted characters constitute the character sequence.
For example, in this embodiment, the context features (embodied in a matrix form) and the feature map (embodied in a matrix form) may be subjected to matrix multiplication and then input to the first sequence processing module 220, and in other embodiments, the context features and the feature map may be cascaded and then input to the first sequence processing module 220.
In some embodiments, the first sequence processing module 220 may include any one of CNN, BERT, CRNN, FC, RNN, or a combination of networks.
In this embodiment, a specific implementation of the first sequence processing module 220 may be formed by cascading an output layer after the network shown in fig. 1d (or after the encoding-module implementation shown in fig. 1e), with each character output through the output layer. The output layer may include a fully connected layer, and for each position it outputs the character corresponding to the character embedding vector with the maximum confidence (the confidence may be calculated by a softmax function).
In this embodiment, as shown in fig. 1e, at least one parallel-based attention module may further be connected between the character position prediction module 210 and the first sequence processing module 220; in this embodiment, a cascade of two or more parallel-based attention modules may specifically be included. The cascaded parallel-based attention modules sequentially process the feature map output by the character position prediction module 210 and refine it step by step, so as to improve the accuracy with which the feature map reflects the position information of the characters in the character sequence. The more attention modules are cascaded, that is, the more layers the network structure has, the higher the accuracy of the processed feature map in reflecting the position information of the characters in the character sequence, and the more accurately the first sequence processing module 220 predicts the character sequence from the processed feature map.
The connection mode of the attention module in the network based on the parallel can also be flexibly set corresponding to the above-mentioned context characteristics and the mode of inputting the characteristic diagram into the first sequence processing module 220. For example, in this embodiment, after performing a matrix multiplication operation on the context feature and the feature map, the parallel-based attention module may be configured before inputting to the first sequence processing module 220. In other embodiments, the contextual features are input to the first sequence processing module 220, and the feature map is input to the first sequence processing module 220 after being processed by the parallel-based attention module. In any case, as long as the first sequence processing module 220 can directly or indirectly obtain the context feature and the feature map, the character of each position can be predicted.
In some embodiments, at least one of the parallel-based attention modules may process a feature map input thereto (including an updated feature map output by a previous parallel-based attention module cascaded to the parallel-based attention module), for example, by self-encoding the feature map input thereto, by using a self-attention mechanism.
In this embodiment, at least one of the parallel-based attention modules processes the feature map input to it (including an updated feature map output by a preceding cascaded parallel-based attention module) in combination with the context features. This embodiment is explained below by describing the attention mechanism in terms of Key-Value pairs (Key-Value) and a Query:

When the cascaded parallel-based attention modules process the feature map layer by layer, for the i-th parallel-based attention module the output of the previous layer serves as the input of the i-th layer and is denoted X_current_layer_i, which is used as the Key; the context feature output by the encoding module 100, X_contextual_feature, is used as the Query; and the Value is also taken as X_contextual_feature. The output of the i-th parallel-based attention module is denoted X_i_out.

The execution process of the i-th parallel-based attention module can be expressed by the following formula (1):

X_i_out = softmax( s(Key, Query) ) · Value = softmax( s(X_current_layer_i, X_contextual_feature) ) · X_contextual_feature    (1)

where s(Key, Query) is an attention scoring function, which can be calculated by one of the following models (k denotes the Key and q denotes the Query):

1) Dot-product model: k^T q.

2) Scaled dot-product model: k^T q / sqrt(d_k), where d_k is a constant scaling factor, e.g., 6, 8, 9, etc.

3) Additive model: tanh(Wk + Uq), where W and U are learnable parameters.

4) Bilinear model: k^T W q, where W is a learnable parameter.

As can be seen from formula (1), the parallel-based attention module operates through matrix multiplications, and therefore processes its inputs and produces its outputs in parallel.
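A sketch of formula (1) with scaled dot-product scoring (one of the four models listed above); all query positions are handled in one matrix multiplication, which is what makes the module parallel. Tensor layouts are assumptions.

```python
import math
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Sketch of formula (1): softmax(s(Key, Query)) applied to the Value, with
    Key = previous layer output and Query = Value = context features."""

    def __init__(self, d_k=64):
        super().__init__()
        self.scale = 1.0 / math.sqrt(d_k)

    def forward(self, x_layer, x_context):
        # x_layer   : (B, steps, d)     output of the previous layer, used as Key
        # x_context : (B, positions, d) context features, used as Query and Value
        scores = torch.matmul(x_layer, x_context.transpose(1, 2)) * self.scale
        attn = torch.softmax(scores, dim=-1)          # softmax(s(Key, Query))
        return torch.matmul(attn, x_context)          # applied to the Value
```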
In this embodiment, for the cascaded parallel-based attention modules, a second sequence processing module may further be cascaded before any layer of parallel-based attention module; the second sequence processing module may include any one of, or a combination of, networks such as CNN, BERT, CRNN, FC, and RNN. In this embodiment, as shown in fig. 1e, the network shown in fig. 1d is used as the second sequence processing module. By adding a number of second sequence processing modules, the number of layers of the network can be increased (that is, the depth of the network can be increased), so that the trained network recognizes sequences with higher accuracy.
In some embodiments, if position-related information can be obtained for the data input to the encoding module 100, the position information may be further encoded to obtain a position embedding, and the position embedding may be concatenated onto the input of any one of the parallel-based attention modules or onto the input of the first sequence processing module 220; introducing the position code can improve the recognition accuracy of the recognition model. In this embodiment, as shown in fig. 1e, the position code is concatenated onto the input of the first parallel-based attention module in the cascade.
In some embodiments, the encoding module 100 may be further cascaded with an image correction module for correcting data of an input image. The image correction module is used for carrying out space transformation on characters, and the space transformation comprises translation, rotation, scaling, cutting and the like. In some embodiments, the image correction module may be implemented using a neural network, such as a CNN, BERT network, or other network.
In this embodiment, the image correction module is implemented by a neural network with spatial transformation invariance, for example a Spatial Transformer Network (STN). The main functions of a spatial transformer network are: converting the input into the form expected by the next module, automatically selecting the features of the region of interest, and realizing spatial transformation of various deformed data. The network has spatial transformation invariance (invariance to translation, rotation, scaling, and cropping is collectively referred to as spatial invariance). Spatial transformation invariance means that if the network obtains the same detection result for a picture that has been translated, rotated, scaled, or cropped as it does for the untransformed picture, the network is said to have spatial transformation invariance.
Fig. 1f shows a specific embodiment of the image correction module, which is an STN network based on the TPS (Thin Plate Spline) model. As shown in fig. 1f, the correction principle of the TPS-based STN network is as follows: the input original image I is first downsampled to obtain an image Id; a group of Control Points is then predicted through a Localization Network, and these control points are used to determine the position and direction of the characters in the input image data; a Grid Generator obtains the position correspondence between the control points on the original image I and the corrected image, and this correspondence may be called the correction parameters or the TPS transformation parameters; the Sampler then obtains the corrected image Ir based on the position correspondence and the original image I. In fig. 1f, the original image I is downsampled first, the control points are predicted from the smaller image Id obtained by downsampling, and the control points are mapped back to the size of the original image I when they are input to the Grid Generator for calculation, which reduces the network parameters and the amount of computation. In other embodiments, the input original image I may be input directly to the Localization Network without downsampling.
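The TPS-based STN of FIG. 1f involves control-point prediction and thin-plate-spline interpolation; purely to illustrate the Localization Network, Grid Generator, and Sampler flow, the following is a deliberately simplified sketch that substitutes an affine transform for TPS (an assumption, not the structure of FIG. 1f), with fixed small layer sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSTN(nn.Module):
    """Simplified affine STN sketch: localization network predicts the transform
    parameters, a grid is generated from them, and the sampler warps the image."""

    def __init__(self):
        super().__init__()
        self.localization = nn.Sequential(          # predicts transform parameters
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 6))
        # Initialise to the identity transform so training starts from "no correction".
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, img):                          # img: (B, 3, H, W)
        theta = self.localization(img).view(-1, 2, 3)
        grid = F.affine_grid(theta, img.size(), align_corners=False)  # grid generator
        return F.grid_sample(img, grid, align_corners=False)          # sampler
```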
In some embodiments, the image correction module may be further cascaded with a module for cropping a region image containing characters from the image. This module may include any one of, or a combination of, a Region Proposal Network (RPN), a CNN, an FC network, and the like; it extracts from the input image a region image containing all the characters to be recognized, which is then processed by the image correction module.
[ identification model training method provided by the embodiment of the application ]
Next, a method for training a model provided in an embodiment of the present application is described with reference to the accompanying drawings. The model training method may be used to train the recognition model, which will be referred to as the first recognition model for the sake of description.
In the training, a second recognition model formed by the coding module 100 and the serial attention-based sequence decoder is used, and the first recognition model is trained based on a distillation training mode. The basic framework in this training is shown in fig. 2 a. Referring now to the flowchart depicted in fig. 2b, the training method comprises the steps of:
s10: training a second recognition model consisting of the cascade of the coding module and a serial attention-based sequence decoder.
In this embodiment, when training the first recognition model, a second recognition model needs to be constructed on the basis of the first recognition model. The second recognition model shares the encoding module of the first recognition model, and its decoding module consists of a serial attention-based sequence decoder. The serial attention-based sequence decoder may be a network such as an attention-based RNN, attention-based LSTM, attention-based Bi-LSTM, attention-based BERT, or a Transformer (a sequence-to-sequence neural network).
The serial attention-based sequence decoder is used to decode the feature sequence output by the encoding module 100 into individual character outputs. When the serial attention-based sequence decoder decodes, the characters are decoded sequentially in time order, and the output of the current character depends on the output of at least one previous character, so the accuracy of the output is high. The serial attention-based sequence decoder applies an attention module each time a character is generated at a time step, so the attention value a_{t,i} of each time step reflects the positional relationship of the character generated at that time step, that is, the position information of each character in the generated character sequence. Specifically, the size of an attention value reflects the probability of the character appearing at the position corresponding to that attention value, i.e. the probability distribution of each character over the positions.
Fig. 2c includes a specific embodiment of the serial attention-based sequence decoder. The output character predicted at the previous time step is subjected to embedded coding (Embedding) and position embedded coding (Position Embedding), and is then encoded by the encoding module shown in fig. 1d (which may be cascaded into multiple layers, denoted by N Layers in the figure) to generate x_pre; for convenience of description, x_pre is referred to herein as the character feature of the previous time step. Based on x_pre and the context feature X_contextual-feature output by the encoding module 100, the character feature x_current-attn of the current time step can be generated in combination with the attention mechanism, and x_current-attn is normalized to predict the output character of the current time step.
When the attention mechanism is described in terms of Key-Value pairs and a Query, the attention a_{t,i} is calculated from the Key and the Query, a_{t,i} is then applied to the Value to obtain an output, and from the individual a_{t,i} an attention matrix M_attention can also be obtained. In the following, the serial attention mechanism is described with reference to the implementation of the attention mechanism shown in fig. 2d:
As shown in the schematic diagram of fig. 2d, the character feature x_current of the current time step, obtained by a linear transformation of the character feature x_pre of the previous time step, serves as the Query, and the vectors c_i of the context feature output by the encoding module 100 serve as the Keys. Where the vector matrix of the context feature output by the encoding module 100 is X_contextual-feature, the context feature vector corresponding to the current time step is c_i = x_i-contextual-feature, thereby obtaining the following formula (2):
a_{t,i} = softmax(s(c_i, Query)) = exp(s(c_i, Query)) / Σ_j exp(s(c_j, Query))  (2)
wherein, the s (Key, Query) is an attention scoring function, and as mentioned above, the attention scoring function can be calculated by using a dot product model, a scaled dot product model, an additive model, or a bilinear model.
At each time step (i.e. each time t), each a_{t,i} is calculated; the a_{t,i} over all time steps form a two-dimensional matrix, namely the attention matrix M_attention of the serial attention-based sequence decoder, i.e. the attention distribution matrix.
It should be noted that the above is only an example, and the values of Query and Key can be chosen flexibly; for example, the Query may also be the character feature x_pre of the previous time step.
When the Value also takes x_i-contextual-feature, the character feature output at the current time step after applying the attention mechanism can be x_current-attn = x_pre + a_{t,i}·x_i-contextual-feature, and normalizing x_current-attn (e.g. by softmax) gives the vector of the character predicted for the current time step.
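A minimal sketch of one such decoding step is given below, assuming a dot-product scoring function, PyTorch tensors, and the shapes indicated in the comments; the function name and shapes are assumptions for illustration.

import torch
import torch.nn.functional as F

def serial_attention_step(x_pre, context, W_q):
    # x_pre   : (B, D)    character feature of the previous time step
    # context : (B, L, D) context features output by the encoding module (Keys and Values)
    # W_q     : (D, D)    linear map producing the Query from x_pre
    query = x_pre @ W_q                                   # Query of the current time step
    scores = torch.einsum('bd,bld->bl', query, context)   # dot-product scoring s(Key, Query)
    a_t = F.softmax(scores, dim=-1)                        # attention values a_{t,i}
    attended = torch.einsum('bl,bld->bd', a_t, context)    # attention applied to the Values
    x_current_attn = x_pre + attended                      # residual combination, as in the text
    return x_current_attn, a_t

# Collecting a_t over all time steps yields the attention matrix M_attention.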
When the second recognition model is trained, sample data may be used for training. The loss function (or training target) used for training the second recognition model may be the difference between the label of the sample data (i.e. the real result) and the sequence output by the second recognition model (i.e. the predicted result); the loss function L_model2 used for training the second recognition model can be described as the following formula (3):
L_model2 = L_hard(y, q)  (3)
where y represents the encoding of the label of the data, and q is the encoding corresponding to the output result of the second recognition model. During training, any conventional manner may be used, for example a gradient descent algorithm or a Newton algorithm, continuously adjusting the network parameters so that the difference gradually converges; a method such as an adversarial network may also be used.
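For illustration, training the second recognition model with the hard loss of formula (3) might look like the following sketch, assuming the loss is realized as character-level cross-entropy and the model outputs per-step logits; the function name, data format and shapes are assumptions.

import torch
import torch.nn as nn

def train_second_model(model2, loader, epochs=10, lr=1e-3):
    # Gradient-descent training of the teacher with the hard loss L_hard(y, q).
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model2.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:            # labels: (B, T) character indices
            logits = model2(images)              # (B, T, vocab_size) assumed
            loss = criterion(logits.flatten(0, 1), labels.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()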
S20: after the second recognition model is trained, freezing (Freeze) the network parameters of the coding module and the serial attention-based sequence decoder, i.e. freezing the network parameters involved in the second recognition model.
By freezing these network parameters, the network parameters of the shared encoding module 100 are no longer adjusted when training of the first recognition model subsequently continues, so that only the decoding module 200 of the first recognition model is trained.
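A minimal sketch of this freezing step, assuming PyTorch modules, could be:

def freeze_second_model(encoder, serial_decoder):
    # Freeze the shared encoding module and the serial attention-based sequence
    # decoder so that subsequent training only updates the decoding module of
    # the first recognition model.
    for module in (encoder, serial_decoder):
        for p in module.parameters():
            p.requires_grad_(False)
        module.eval()   # also fix BatchNorm/Dropout behavior, if any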
S30: training the first recognition model based on knowledge distillation using the second recognition model.
In this embodiment, Knowledge Distillation (KD) is used to train the first recognition model. Its basic principle is to first train a teacher model, and then use the output of the teacher model as the target of a student model to train the student model. In this embodiment, the second recognition model is the trained teacher model, and the first recognition model is the student model.
In this embodiment, when training the first recognition model based on knowledge distillation by using the second recognition model, distillation may be performed based on character position prediction information, in which case, as shown in the flowchart of fig. 2e, the step S30 may include the following steps:
S301: performing knowledge distillation on the feature map generated by the first recognition model to obtain a first parameter. The feature map reflects the position information of the characters in the character sequence, and therefore the distilled first parameter is related to the position information of the characters in the character sequence.
In this embodiment, when sample data is input into the encoding module 100, the decoding module 200 of the first recognition model outputs the predicted character sequence, and in the process of predicting the character sequence the character position prediction module 210 of the decoding module 200 outputs the feature map; knowledge distillation is performed on this feature map to obtain the first parameter. In this embodiment, the feature map may be represented in matrix form, and the first parameter obtained by distillation is associated with the distribution of all, or a part of, the elements in the feature map matrix.
In some embodiments, the distilled first parameter may be related to the distribution of the values of all elements in the feature map matrix. In other embodiments, the distilled first parameter may be related to the distribution of the values of only some elements in the feature map matrix, for example those within a certain threshold; in still other embodiments, it may be related to the values of some elements obtained by pooling the feature map matrix, where the pooling may be max pooling or mean pooling.
In some embodiments, the first parameter obtained by distillation is related to the values of these elements, which may include the following cases: the first parameter may be the element values themselves, values obtained by weighting the element values by a certain coefficient (for example, by the temperature value T used in the distillation process), normalized values obtained by normalizing the elements in the feature map matrix, or normalized values obtained after weighting the elements in the feature map matrix.
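As one possible illustration of the normalized, temperature-weighted variant, the first parameter could be computed from the feature map as in the sketch below, assuming a (batch, step, position) layout; the function name, shapes and optional pooling are assumptions.

import torch.nn.functional as F

def distill_feature_map(feature_map, T=0.2, pool=None):
    # feature_map: (B, steps, positions) assumed, one row of position scores
    # per predicted character.
    if pool == 'max':                                    # optional max pooling, as described
        feature_map = F.max_pool1d(feature_map, kernel_size=2)
    elif pool == 'mean':                                 # optional mean pooling
        feature_map = F.avg_pool1d(feature_map, kernel_size=2)
    # Weight the elements by the temperature T, then normalize.
    return F.softmax(feature_map / T, dim=-1)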
S302: knowledge distillation is performed on the attention matrix formed by the serial attention-based sequence decoder to obtain second parameters. As described above, it is noted that the moment matrix may also reflect the position information of the character in the predicted character sequence, and therefore the distilled second parameter may also be related to the position information of the character in the character sequence.
In this embodiment, after the sample data is input into the encoding module 100 in step S301, since the second recognition model and the first recognition model share the encoding module 100, the second recognition model also outputs a predicted character sequence; in the process of predicting the character sequence, from the attention values a_{t,i} of the character predictions at the respective time steps of the second recognition model, an attention matrix M_attention can be obtained. For this attention matrix, the same distillation manner as in step S301 is adopted to obtain the second parameter, which is not described again.
S303: training the first recognition model based on a difference between the first and second parameters.
In this embodiment, a loss function (or training target) L_model1 is used to describe the difference between the first parameter and the second parameter, which can be described as the following formula (4):

L_model1 = L_soft1(p1, q1)  (4)

where L_soft1 is the first loss function, p1 is the first parameter obtained by distilling the feature map of the first recognition model, and q1 is the second parameter obtained by distilling the attention matrix M_attention of the second recognition model. In the present embodiment, the first parameter is p1 = softmax(v1_i / T) = exp(v1_i/T) / Σ_j exp(v1_j/T), and the second parameter is q1 = softmax(z1_i / T) = exp(z1_i/T) / Σ_j exp(z1_j/T), where v1_i corresponds to the elements of the feature map of the first recognition model, z1_i corresponds to the elements of the attention matrix M_attention of the second recognition model, T is a temperature value between 0 and 1, e.g. 0.05, 0.2 or 0.6, and softmax is the normalization function.
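Under these definitions, the first loss function L_soft1 could, for example, be realized as a KL divergence between the softened student and teacher distributions, as in the following sketch; the KL form and the customary T^2 scaling are assumptions, since the embodiment does not fix the exact form of L_soft1.

import torch.nn.functional as F

def soft_position_loss(feature_map, attention_matrix, T=0.2):
    # feature_map      : (B, steps, positions) from the first (student) model
    # attention_matrix : (B, steps, positions) from the second (teacher) model
    p1_log = F.log_softmax(feature_map / T, dim=-1)        # log p1 (student)
    q1 = F.softmax(attention_matrix / T, dim=-1).detach()  # q1 (teacher), no gradient
    return F.kl_div(p1_log, q1, reduction='batchmean') * (T ** 2)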
In some embodiments, when the first recognition model includes a plurality of cascaded parallel-based attention modules, these modules refine the feature map step by step. In this case, when distilling the feature map to obtain the first parameter, the feature map may be distilled after some refinement; for example, distillation may be performed on the feature map refined by the i-th of the cascaded parallel-based attention modules. In this embodiment, distillation is performed on the feature map refined by the last, i.e. the N-th, parallel-based attention module.
In this embodiment, when the second recognition model is used to train the first recognition model based on knowledge distillation, distillation of the character content prediction information may also be performed, in which case, as shown in the flowchart of fig. 2f, the step S30 may include the following steps:
S311: performing knowledge distillation on the character sequence output by the first recognition model to obtain a third parameter.
In some embodiments, the distilled third parameter reflects a character content distribution (Token distribution), which may be related to the encoding vectors corresponding to the character sequences, the confidence values corresponding to the respective encoding vectors, the probability distribution of the sequences in the sequence set, and the like.
In this embodiment, the third parameter is specifically related to the probability distribution, over the sequence set, of the sequence output by the first recognition model; for example, it may be the logits values of the output sequence. The logits are the values before the softmax in the neural network and reflect the probability distribution of each character over the whole character set; generally the logits correspond to the output of a fully-connected layer (the softmax layer follows the fully-connected layer).
S312: performing knowledge distillation on the character sequence output by the second recognition model to obtain a fourth parameter. In this step, the same distillation manner as in step S311 is adopted to obtain the fourth parameter, which is not described again.
S313: training the first recognition model based on a difference between the third and fourth parameters.
In this embodiment, a loss function (or training target) L_model1 is used to describe the difference between the third parameter and the fourth parameter, which can be described as the following formula (5):

L_model1 = L_soft2(p2, q2)  (5)

where L_soft2 is the second loss function, p2 is the third parameter obtained by distilling the character sequence output by the first recognition model, and q2 is the fourth parameter obtained by distilling the character sequence output by the second recognition model. In the present embodiment, the third parameter is p2 = softmax(v2_i / T) = exp(v2_i/T) / Σ_j exp(v2_j/T), and the fourth parameter is q2 = softmax(z2_i / T) = exp(z2_i/T) / Σ_j exp(z2_j/T), where v2_i is the probability distribution, over the sequence set, of each sequence output by the first recognition model, z2_i is the probability distribution, over the sequence set, of each sequence output by the second recognition model, T is a temperature value between 0 and 1, e.g. 0.05, 0.2 or 0.6, and softmax is the normalization function.
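Analogously, L_soft2 can be sketched as classic logit distillation over the character content, with the same caveats (the KL form and T^2 scaling are assumptions):

import torch.nn.functional as F

def soft_content_loss(student_logits, teacher_logits, T=0.2):
    # student_logits, teacher_logits: (B, steps, vocab_size) assumed,
    # the pre-softmax outputs of the first and second recognition models.
    p2_log = F.log_softmax(student_logits / T, dim=-1)       # p2 from the first model
    q2 = F.softmax(teacher_logits / T, dim=-1).detach()      # q2 from the second model
    return F.kl_div(p2_log, q2, reduction='batchmean') * (T ** 2)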
In some embodiments, the first recognition model may also be trained by combining steps S301 to S303 with steps S311 to S313, that is, by combining formula (4) and formula (5). In this case, the loss function (or training target) L_model1 used for the training can be described as the following formula (6):
L_model1 = a·L_soft1(p1, q1) + b·L_soft2(p2, q2)  (6)
wherein a and b are weighting coefficients.
In other embodiments, the first recognition model may further be trained with sample data; the loss function (or training target) used may then be the difference between the label of the sample data (i.e. the real result) and the sequence output by the first recognition model (i.e. the predicted result), and the overall loss function L used can be described as the following formula (7):
L = c·L_model1 + d·L_hard1(y, q3)  (7)
where c and d are weighting coefficients, y represents the encoding of the label of the data, q3 is the encoding corresponding to the output result of the first recognition model, and L_model1 may be the loss function shown in formula (4), formula (5) or formula (6) above.
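Putting formulas (6) and (7) together, the overall training objective of the first recognition model can be assembled as in the following sketch; the coefficient values are placeholders.

def total_student_loss(l_soft1, l_soft2, l_hard1, a=1.0, b=1.0, c=1.0, d=1.0):
    # Formula (6): combine the position and content distillation terms.
    l_model1 = a * l_soft1 + b * l_soft2
    # Formula (7): add the hard loss against the sample labels.
    return c * l_model1 + d * l_hard1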
In some embodiments, when the first recognition model and the second recognition model respectively further include an image correction module cascaded before the encoding module, that is, when the first recognition model and the second recognition model further include a common image correction module, the training of the second recognition model in step S10 further includes training the image correction module; the step S20 of freezing the network parameters further includes freezing the network parameters of the image correction module. In other embodiments, the correction module may be trained separately.
[ Model training apparatus provided by an embodiment of the present application ]
An embodiment of the present application further provides a training apparatus for a model, which may be used to train the first recognition model in the above embodiments. For the processing details of the training apparatus and of each module it includes, reference may be made to the descriptions in the corresponding training methods or in the summary; only a brief description is given here. As shown in fig. 3, the training apparatus 500 includes:
a training module 510, configured to train a second recognition model, where the second recognition model is composed of the coding module and a serial attention-based sequence decoder in cascade. Specifically, the training module 510 may be configured to perform step S10 of the training method of the model described above and various alternative embodiments thereof.
A configuration module 520, configured to freeze the network parameters of the encoding module and the serial attention-based sequence decoder. Specifically, the configuration module 520 may be configured to perform step S20 of the training method of the model described above and its various alternative embodiments.
The training module 510 is further configured to train the first recognition model based on knowledge distillation using the second recognition model. Specifically, the training module 510 may be further configured to perform step S30 in the training method of the model and various optional embodiments thereof.
In some embodiments, the training module 510 is specifically configured to: knowledge distillation is carried out on the characteristic diagram to obtain a first parameter; knowledge distillation is carried out on an attention matrix formed by the serial attention-based sequence decoder to obtain second parameters; training the first recognition model based on a difference between the first and second parameters. Specifically, the training module 510 may be specifically configured to perform steps S301 to S303 in the training method of the model described above and various optional embodiments thereof.
In some embodiments, the knowledge distillation of the feature map to obtain the first parameter specifically includes: performing knowledge distillation on the updated feature map obtained using the parallel-based attention module.
In some embodiments, the obtained first parameter and the second parameter relate to position information of characters in the character sequence.
In some embodiments, the training module 510 is specifically configured to: knowledge distillation is carried out on the character sequence output by the first recognition model to obtain a third parameter; knowledge distillation is carried out on the character sequence output by the second recognition model to obtain a fourth parameter; training the first recognition model based on a difference between the third and fourth parameters. Specifically, the training module 510 may be specifically configured to perform steps S311 to S313 in the training method of the model described above and various optional embodiments thereof.
In some embodiments, the third parameter and the fourth parameter relate to a probability distribution of characters in the character sequence in a character set.
In some embodiments, the training module 510 is also for training the first recognition model based on sample data.
In some embodiments, the first recognition model and the second recognition model further respectively comprise an image correction module cascaded before the encoding module; the training module is also used for training the image correction module; the configuration module is further configured to freeze network parameters of the image correction module.
[ Character sequence recognition method provided by an embodiment of the present application ]
The embodiment of the present application further provides a method for recognizing a character sequence, where the method uses the first recognition model to recognize the character sequence in an image, and as shown in a flowchart in fig. 4a, the method includes:
S50: acquiring input image data;
S52: obtaining, with the encoding module, a context feature from the image data;
S54: obtaining a feature map by using the character position prediction module according to the context features, wherein the feature map reflects the position information of the characters in the character sequence;
S56: obtaining the character sequence according to the context feature and the feature map by using the first sequence processing module.
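A hedged end-to-end sketch of steps S50 to S56, assuming the trained first recognition model exposes its modules under the (hypothetical) attribute names shown, is:

import torch

@torch.no_grad()
def recognize(first_model, image):
    # S52: context features from the encoding module
    context = first_model.encoder(image)
    # S54: position feature map from the character position prediction module
    feature_map = first_model.char_position_predictor(context)
    # S56: character sequence from the first sequence processing module
    char_logits = first_model.sequence_processor(context, feature_map)
    return char_logits.argmax(dim=-1)   # predicted character indices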
[ Character sequence recognition apparatus provided by an embodiment of the present application ]
An embodiment of the present application further provides a device for recognizing a character sequence, which recognizes the character sequence in an image by using the first recognition model, as shown in fig. 4b, where the device 600 includes:
an obtaining module 610, configured to obtain input data;
an identifying module 620, configured to obtain a context feature from the data by using the encoding module in fig. 1a, obtain a feature map from the context feature by using the character position predicting module, and obtain the character sequence from the context feature and the feature map by using the first sequence processing module, where the feature map reflects position information of characters in the character sequence.
Fig. 5 shows experimental run-time data for recognizing characters in images using the character sequence recognition scheme of an embodiment of the present application, where the upper curve is the run time of character recognition based on the serial attention sequence decoder, and the lower curve is the run time of the character sequence recognition model of the present application. It can be seen that the computational complexity of the serial attention based sequence decoder increases with the number of characters and its operation efficiency is low. Compared with character recognition based on the serial attention sequence decoder, the present application has a shorter run time and higher operation efficiency, and its computational complexity does not increase with the number of characters.
In addition, the following table shows the comparison of experimental data when the embodiment of the application and the Aster model are used for character recognition:
Model                                                     Recognition accuracy    Model size
Aster model                                               87.357%                 28.7M
Technical scheme of the embodiment of the application     88.551%                 7.7M
As can be seen from the table, the character recognition accuracy of the scheme of this embodiment is higher than that of the Aster model, and, since no complex post-processing is needed, the model size is far smaller than that of the Aster model.
Fig. 6 is a schematic structural diagram of a computing device 700 provided by an embodiment of the present application. The computing device 700 includes: a processor 710, a memory 720.
It is to be appreciated that the computing device 700 illustrated in FIG. 6 may also include a communication interface that may be employed to communicate with other devices.
The processor 710 may be coupled to the memory 720. The memory 720 may be used for storing the program codes and data. Therefore, the memory 720 may be a storage unit inside the processor 710, an external storage unit independent of the processor 710, or a component including a storage unit inside the processor 710 and an external storage unit independent of the processor 710.
Optionally, the computing device 700 may also include a bus. The memory 720 and the communication interface may be connected to the processor 710 via a bus. The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one line is shown in FIG. 6, but it is not intended that there be only one bus or one type of bus.
It should be understood that, in the embodiment of the present application, the processor 710 may employ a Central Processing Unit (CPU). The processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. Or the processor 710 may employ one or more integrated circuits for executing related programs to implement the technical solutions provided in the embodiments of the present application.
The memory 720 may include both read-only memory and random-access memory, and provides instructions and data to the processor 710. A portion of the processor 710 may also include non-volatile random access memory. For example, the processor 710 may also store information of the device type.
When the computing device 700 is running, the processor 710 executes the computer-executable instructions in the memory 720 to perform the operational steps of the above-described method.
It should be understood that the computing device 700 according to the embodiment of the present application may correspond to a corresponding main body for executing the method according to the embodiments of the present application, and the above and other operations and/or functions of each module in the computing device 700 are respectively for implementing corresponding flows of each method of the embodiment, and are not described herein again for brevity.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
As shown in fig. 7, an embodiment of the present application further provides a server 800, including: a processor 810, a memory 820; the memory 820 is used for storing program instructions, which when executed by the processor 810, cause the server to perform the operational steps of the above-described method embodiments and alternative embodiments. For alternative implementations of the processor 810 and the memory 820, reference may be made to the description related to the embodiments of the computing device described above, and details are not repeated. In addition, in some embodiments, the server 800 may further include a communication interface and a bus, for which, reference may also be made to the description related to the embodiment of the computing device, which is not described again.
In some embodiments, the server 800 may receive image data collected by a terminal (e.g., a mobile phone, a tablet, a computer, a smart camera, etc.) through a communication interface, and perform the operation steps of the above method embodiments or alternative embodiments to identify a character sequence in the image data, where the character sequence may be a text, and may further transmit the identified text data to the terminal or other devices capable of performing text display or voice playing through the communication interface, so as to display the text data to a user, or play a voice corresponding to the text to the user through a speaker.
The present application also provides a computer readable storage medium, on which a computer program is stored, the computer program being used for executing the method of the present application when executed by a processor, the method including at least one of the solutions described in the above embodiments.
The computer storage media of the embodiments of the present application may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Additionally, the terms first, second, third and the like in the description and in the claims, or module A, module B, module C and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, it being understood that specific orders or sequences may be interchanged where permissible to effect embodiments of the application described herein in other sequences than illustrated or described herein.
In the above description, reference numbers indicating steps, such as S301, S302 … …, etc., do not necessarily indicate that the steps are executed in this order, and the order of the preceding and following steps may be interchanged or executed simultaneously, if permitted.
The term "comprising" as used in the specification and claims should not be construed as being limited to the contents listed thereafter; it does not exclude other elements or steps. It should therefore be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, and groups thereof. Thus, the expression "an apparatus comprising the devices a and B" should not be limited to an apparatus consisting of only the components a and B.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments, as would be apparent to one of ordinary skill in the art from this disclosure.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application.

Claims (28)

1. A recognition model, comprising: the encoding module is used for obtaining context characteristics according to input data; a first decoding module, configured to obtain a recognized character sequence according to the context feature, where the first decoding module includes:
the character position prediction module is used for obtaining a feature map according to the context features, and the feature map reflects the position information of the characters in the character sequence;
and the first sequence processing module is used for obtaining the character sequence according to the context characteristic and the characteristic graph.
2. The model of claim 1, wherein the character position prediction module comprises: a cascaded down-sampling convolutional layer, a fully-connected layer, and an up-sampling convolutional layer.
3. The model of any of claims 1-2, wherein said first decoding module further comprises a parallel-based attention module for obtaining an updated feature map based on said feature map and said contextual feature information;
the first sequence processing module is specifically configured to obtain the character sequence according to the context feature and the updated feature map.
4. The model of claim 3, characterized in that said first decoding module comprises in particular a cascade of two or more of said parallel-based attention modules.
5. The model according to claim 3 or 4, characterized in that said parallel-based attention module is further cascaded with a second sequence processing module for processing said feature map to be input to said parallel-based attention module.
6. The model of any one of claims 1-5, wherein an image correction module is further cascaded before the coding module, and is used for correcting input image data.
7. A method for training a model, wherein the model is a first recognition model, and the first recognition model comprises: the encoding module is used for obtaining context characteristics according to input data; a first decoding module, configured to obtain a recognized character sequence according to the context feature, where the first decoding module includes: the character position prediction module is used for obtaining a feature map according to the context features, and the feature map reflects the position information of the characters in the character sequence; the first sequence processing module is used for obtaining the character sequence according to the context characteristic and the characteristic graph; the training method comprises the following steps:
training a second recognition model comprising the coding module in cascade with a serial attention-based sequence decoder;
freezing network parameters of the encoding module and the serial attention based sequence decoder;
training the first recognition model based on knowledge distillation using the second recognition model.
8. The method of claim 7, wherein training the first recognition model based on knowledge distillation using the second recognition model comprises:
knowledge distillation is carried out on the characteristic diagram to obtain a first parameter;
knowledge distillation is carried out on an attention matrix formed by the serial attention-based sequence decoder to obtain second parameters;
training the first recognition model based on a difference between the first and second parameters.
9. The method of claim 8, wherein the first decoding module further comprises a parallel-based attention module for obtaining an updated feature map from the feature map and the contextual feature information;
the knowledge distillation of the characteristic diagram to obtain the first parameter specifically comprises: knowledge distillation of the updated feature map obtained using the parallel-based attention module to obtain a first parameter.
10. The method according to claim 8 or 9,
the obtained first parameter and the second parameter are related to position information of characters in the character sequence.
11. The method of any of claims 7-10, wherein training the first recognition model based on knowledge distillation using the second recognition model comprises:
knowledge distillation is carried out on the character sequence output by the first recognition model to obtain a third parameter;
knowledge distillation is carried out on the character sequence output by the second recognition model to obtain a fourth parameter;
training the first recognition model based on a difference between the third and fourth parameters.
12. The method of claim 11,
the obtained third parameter and the fourth parameter are related to the probability distribution of the characters in the character sequence in the character set.
13. The method of any of claims 7-12, further comprising:
training the first recognition model based on sample data.
14. The method according to any one of claims 7-13, wherein the first recognition model and the second recognition model each further comprise an image correction module cascaded before the encoding module;
training the second recognition model further comprises training the image correction module;
freezing the network parameters further comprises freezing the network parameters of the image correction module.
15. An apparatus for training a model, wherein the model is a first recognition model, and the first recognition model includes: the encoding module is used for obtaining context characteristics according to input data; a first decoding module, configured to obtain a recognized character sequence according to the context feature, where the first decoding module includes: the character position prediction module is used for obtaining a feature map according to the context features, and the feature map reflects the position information of the characters in the character sequence; the first sequence processing module is used for obtaining the character sequence according to the context characteristic and the characteristic graph; the training apparatus includes:
a training module for training a second recognition model comprising the coding module and a serial attention-based sequence decoder in cascade;
a configuration module for freezing network parameters of the encoding module and the serial attention based sequence decoder;
the training module is further configured to train the first recognition model based on knowledge distillation using the second recognition model.
16. The apparatus of claim 15, wherein the training module is specifically configured to:
knowledge distillation is carried out on the characteristic diagram to obtain a first parameter;
knowledge distillation is carried out on an attention matrix formed by the serial attention-based sequence decoder to obtain second parameters;
training the first recognition model based on a difference between the first and second parameters.
17. The apparatus of claim 16, wherein the first decoding module further comprises a parallel-based attention module for obtaining an updated feature map from the feature map and the contextual feature information;
the knowledge distillation of the feature map to obtain the first parameter specifically comprises: knowledge distillation is performed on the updated feature map obtained using the parallel-based attention module to obtain the first parameter.
18. The apparatus of claim 16 or 17,
the obtained first parameter and the second parameter are related to position information of characters in the character sequence.
19. The apparatus according to any one of claims 15-18, wherein the training module is specifically configured to:
knowledge distillation is carried out on the character sequence output by the first recognition model to obtain a third parameter;
knowledge distillation is carried out on the character sequence output by the second recognition model to obtain a fourth parameter;
training the first recognition model based on a difference between the third and fourth parameters.
20. The apparatus of claim 19,
the third parameter and the fourth parameter are related to probability distribution of characters in the character sequence in a character set.
21. The apparatus according to any of claims 15-20, wherein the training module is further configured to train the first recognition model based on sample data.
22. The apparatus according to any of claims 15-21, wherein the first recognition model and the second recognition model each further comprise an image correction module cascaded before the encoding module;
the training module is also used for training the image correction module;
the configuration module is further configured to freeze network parameters of the image correction module.
23. A method for recognizing a sequence of characters, comprising:
acquiring input data;
obtaining context features from the data using an encoding module;
obtaining a feature map by using a character position prediction module according to the context features, wherein the feature map reflects the position information of the characters in the character sequence;
and obtaining the character sequence according to the context characteristic and the characteristic graph by utilizing a first sequence processing module.
24. An apparatus for recognizing a character sequence, comprising:
the acquisition module is used for acquiring input data;
the identification module is used for obtaining context characteristics according to the data by utilizing the coding module, obtaining a characteristic diagram according to the context characteristics by utilizing the character position prediction module, and obtaining the character sequence according to the context characteristics and the characteristic diagram by utilizing the first sequence processing module, wherein the characteristic diagram reflects the position information of the characters in the character sequence.
25. A server, comprising:
a processor, a memory;
wherein the memory is for storing program instructions that, when executed by the processor, cause the server to implement the method of any one of claims 7-14, or that, when executed by the processor, cause the server to implement the method of claim 23.
26. A computing device, comprising:
a processor, a memory;
wherein the memory is to store program instructions that, when executed by the processor, cause the computing device to implement the method of any of claims 7-14 or that, when executed by the processor, cause the computing device to implement the method of claim 23.
27. A computer-readable storage medium on which program instructions are stored, which program instructions, when executed by a computer, cause the computer to carry out the method of any one of claims 7-14, or which program instructions, when executed by a computer, cause the computer to carry out the method of claim 23.
28. A computer program product comprising instructions stored thereon, which, when run on a computer, cause the computer to carry out the method of any one of claims 7 to 14 or the method of claim 23.
CN202110718174.1A 2021-06-28 2021-06-28 Model, training method and device of model, and recognition and device of character sequence Pending CN113435451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718174.1A CN113435451A (en) 2021-06-28 2021-06-28 Model, training method and device of model, and recognition and device of character sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718174.1A CN113435451A (en) 2021-06-28 2021-06-28 Model, training method and device of model, and recognition and device of character sequence

Publications (1)

Publication Number Publication Date
CN113435451A true CN113435451A (en) 2021-09-24

Family

ID=77755039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718174.1A Pending CN113435451A (en) 2021-06-28 2021-06-28 Model, training method and device of model, and recognition and device of character sequence

Country Status (1)

Country Link
CN (1) CN113435451A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049581A (en) * 2021-09-27 2022-02-15 中国科学院信息工程研究所 Weak supervision behavior positioning method and device based on action fragment sequencing
CN114049581B (en) * 2021-09-27 2024-07-05 中国科学院信息工程研究所 Weak supervision behavior positioning method and device based on action segment sequencing
CN113919293A (en) * 2021-09-29 2022-01-11 北京搜狗科技发展有限公司 Formula recognition model training method and device
CN113963167A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN114202648B (en) * 2021-12-08 2024-04-16 北京百度网讯科技有限公司 Text image correction method, training device, electronic equipment and medium
CN114202648A (en) * 2021-12-08 2022-03-18 北京百度网讯科技有限公司 Text image correction method, training method, device, electronic device and medium
CN114207673A (en) * 2021-12-20 2022-03-18 商汤国际私人有限公司 Sequence identification method and device, electronic equipment and storage medium
WO2023118936A1 (en) * 2021-12-20 2023-06-29 Sensetime International Pte. Ltd. Sequence recognition method and apparatus, electronic device, and storage medium
CN114898376A (en) * 2022-05-31 2022-08-12 深圳市星桐科技有限公司 Formula identification method, device, equipment and medium
CN114898376B (en) * 2022-05-31 2024-06-25 深圳市星桐科技有限公司 Formula identification method, device, equipment and medium
CN115019316A (en) * 2022-06-13 2022-09-06 深圳市星桐科技有限公司 Training method of text recognition model and text recognition method
CN116189800A (en) * 2023-02-23 2023-05-30 深圳大学 Pattern recognition method, device, equipment and storage medium based on gas detection
CN116189800B (en) * 2023-02-23 2023-08-18 深圳大学 Pattern recognition method, device, equipment and storage medium based on gas detection
CN116151459B (en) * 2023-02-28 2024-06-11 国网河南省电力公司电力科学研究院 Power grid flood prevention risk probability prediction method and system based on improved Transformer
CN116151459A (en) * 2023-02-28 2023-05-23 国网河南省电力公司电力科学研究院 Power grid flood prevention risk probability prediction method and system based on improved Transformer
CN117217288B (en) * 2023-09-21 2024-04-05 摩尔线程智能科技(北京)有限责任公司 Fine tuning method and device for large model, electronic equipment and storage medium
CN117217288A (en) * 2023-09-21 2023-12-12 摩尔线程智能科技(北京)有限责任公司 Fine tuning method and device for large model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113435451A (en) Model, training method and device of model, and recognition and device of character sequence
RU2691214C1 (en) Text recognition using artificial intelligence
US20200250436A1 (en) Video object segmentation by reference-guided mask propagation
CN110119757B (en) Model training method, video category detection method, device, electronic equipment and computer readable medium
CN111079532B (en) Video content description method based on text self-encoder
CN109874029B (en) Video description generation method, device, equipment and storage medium
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
WO2021098689A1 (en) Text recognition method for natural scene, storage apparatus, and computer device
CN113313022A (en) Training method of character recognition model and method for recognizing characters in image
WO2023273628A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN114663798B (en) Single-step video content identification method based on reinforcement learning
CN117523593B (en) Patient medical record data processing method and system
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
CN110909578A (en) Low-resolution image recognition method and device and storage medium
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN114723760A (en) Portrait segmentation model training method and device and portrait segmentation method and device
CN114445832A (en) Character image recognition method and device based on global semantics and computer equipment
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
CN115908805A (en) U-shaped image segmentation network based on convolution enhanced cross self-attention deformer
CN116597260A (en) Image processing method, electronic device, storage medium, and computer program product
CN114240811B (en) Method for generating new image based on multiple images
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN116246213A (en) Data processing method, device, equipment and medium
CN116597336A (en) Video processing method, electronic device, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination