CN113033249A

CN113033249A - Character recognition method, device, terminal and computer storage medium thereof

Info

Publication number: CN113033249A
Application number: CN201911253120.1A
Authority: CN
Inventors: 白翔; 王勃飞; 徐清泉; 许永超; 刘少丽
Original assignee: ZTE Corp; Huazhong University of Science and Technology
Current assignee: ZTE Corp; Huazhong University of Science and Technology
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2021-06-25
Also published as: WO2021115159A1

Abstract

The application discloses a character recognition method, a character recognition device, a terminal and a computer storage medium thereof, wherein feature extraction is carried out on an input picture through a convolutional neural network, then an attention mechanism module with a plurality of channels is input to obtain the attention weight of each channel, each channel of a depth feature map is zoomed to obtain a plurality of attention feature maps, then a full connection layer is input to carry out feature fusion to obtain a character type prediction result, a loss function is designed according to character type marks of the input picture and the character type prediction result during model training, the attention weight is optimized, the character recognition accuracy is improved, and the recognition robustness of a difficult sample is stronger.

Description

Character recognition method, device, terminal and computer storage medium thereof

Technical Field

The embodiment of the application relates to the technical field of computer vision, in particular to a character recognition method, a character recognition device, a terminal and a computer storage medium thereof.

Background

Handwritten Chinese Character Recognition (HCCR) has been a very active and challenging research direction in the field of computer vision, since the 20 th century in the 60 th era, and has made great progress, and many applications in real life are closely related to it, such as mail sorting, bank check reading, book and handwritten note transcription, and so on. Despite much research, the recognition of handwritten chinese characters remains a very challenging task, on one hand, because of the large number of chinese character categories and the large number of near-form characters, which are easily confused; on the other hand, different people have great writing style difference, so that even the same type of characters have obvious visual difference, which brings great difficulty to the recognition of handwritten Chinese characters.

Most existing methods based on deep learning use convolutional neural networks to classify handwritten Chinese characters by learning global semantic features from the entire image, but this is not sufficient for the recognition of visually similar characters because there are often only subtle differences between confusing characters. In particular, the global attention provided by these methods may locate entire characters well, but the attention areas between characters of different classes have large overlap and lack of distinctiveness, which may result in a high recognition error rate for near-word and words of large differences within the classes.

Disclosure of Invention

The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

In a first aspect, embodiments of the present application provide a method for training a character recognition network model, a method for character recognition, an apparatus, a terminal, and a computer storage medium thereof, which can improve accuracy of visually confusable character recognition.

In a second aspect, an embodiment of the present application provides a method for training a character recognition network model, including the following steps:

standardizing each picture in the original data set, and carrying out character type labeling on each picture to obtain a standard training data set with the character type labels;

inputting each picture in the standard training data set into a convolutional neural network, extracting the convolutional characteristic of the picture, and obtaining a depth characteristic map containing the convolutional characteristic;

inputting the depth feature map into an attention mechanism module with a plurality of channels to obtain an attention weight of each channel, and rescaling each channel of the depth feature map by using the attention weight to obtain a plurality of attention feature maps;

inputting each attention feature map into a full-connection layer respectively to obtain a plurality of attention feature vectors;

performing feature fusion on the attention feature vectors, and inputting the attention feature vectors into a character full-connection layer to perform character type prediction;

and designing a target loss function according to the character type prediction result and the character type label, performing iteration by using a back propagation algorithm, minimizing the target loss function, and optimizing the attention weight.

In a third aspect, an embodiment of the present application provides a text recognition method, including:

standardizing the picture to be tested, and zooming to a preset height H and a preset width W;

inputting a picture to be tested into a convolutional neural network, extracting the convolutional characteristic of the picture to be tested, and obtaining a depth characteristic graph containing the convolutional characteristic;

and performing feature fusion on the attention feature vectors, and inputting the attention feature vectors into a character full-connection layer to perform character type prediction.

In a fourth aspect, an embodiment of the present application provides a device for training a character recognition network model, including: the computer program is stored in a memory and can be run on a processor, and the processor when executing the computer program implements the word recognition network model training method according to the embodiment of the second aspect.

In a fifth aspect, an embodiment of the present application provides a text recognition apparatus, including: a memory, a processor and a computer program stored on the memory and operable on the processor, the processor implementing the word recognition method according to the embodiment of the third aspect when executing the computer program.

In a sixth aspect, an embodiment of the present application provides a terminal, including the character recognition network model training apparatus according to the fourth aspect or including the character recognition apparatus according to the fifth aspect.

In a seventh aspect, an embodiment of the present application provides a computer storage medium storing computer-executable instructions for performing the method for training a word recognition network model according to the embodiment of the second aspect or performing the method for word recognition according to the embodiment of the third aspect.

According to the scheme provided by the embodiment of the application: the method comprises the steps of extracting features of an input picture through a convolutional neural network, obtaining distinctive attention features through an attention mechanism module, obtaining character category prediction results after feature fusion, designing a loss function according to character category labels of the input picture and the character category prediction results during model training, and optimizing the attention weight, so that accuracy of character recognition is improved, and recognition robustness on difficult samples is higher.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.

FIG. 1 is a schematic flow chart of a method for training a character recognition network model and a method for character recognition according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of a method for training a character recognition network model according to an embodiment of the present application;

fig. 3 is a network structure diagram of a character recognition network model provided in an embodiment of the present application, where "CA" denotes a Channel Attention mechanism (Channel Attention);

FIG. 4 is a diagram of a convolutional neural network architecture provided in an embodiment of the present application;

FIG. 5 is a block diagram of an attention module provided in an embodiment of the present disclosure;

FIG. 6 is a flow chart of a text recognition method according to another embodiment of the present application;

FIG. 7 is a block diagram of a device for training a character recognition network model according to another embodiment of the present application;

fig. 8 is a structural diagram of a character recognition apparatus according to another embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart.

According to our daily experience, when a person identifies a specific character among a plurality of confusing Chinese characters, the specific Chinese character category is usually determined by observing detailed features in candidate Chinese characters and then comparing their similarities and differences. For example, "bird" and "Wu" are two Chinese characters that are easily visually confused, but we can distinguish them by observing whether there is a "left-word" in their upper half; similarly, for "diffuse" and "disrespectful," we can judge the components of their left half.

Recently, a handwritten Chinese character recognition method based on a Recurrent Neural Network (RNN) and an attention mechanism is proposed, which uses a residual convolutional neural network as a backbone network and corrects character prediction by iteratively updating attention distribution using the RNN. The method can utilize an attention mechanism to locate local regions of characters to identify visually similar kanji characters. However, this method has two major disadvantages: firstly, based on the method of updating attention distribution by iteration, the initial error may be accumulated and the identification precision is improved to a limited extent, depending on the prediction result of the previous iteration; secondly, the method uses RNN for multiple iterations, the training time is longer, the process is more complex, the RNN cannot fully utilize GPU for parallel computation due to an internal mechanism, and the problems of gradient disappearance, gradient explosion and the like easily occur in the back propagation process.

Under such a background, it is necessary to design a simple and effective text recognition method capable of mining locally distinctive features.

Based on the above, the present application provides a method for training a character recognition network model, a method for character recognition, a device, a terminal and a computer storage medium thereof, wherein features of an input picture are extracted through a convolutional neural network, then, distinctive attention features are obtained through an attention mechanism module, a character type prediction result is obtained after feature fusion, a loss function is designed according to character type labels of the input picture and the character type prediction result during model training, and the attention weight is optimized, so that the accuracy of character recognition is improved, and the recognition robustness of a difficult sample is stronger.

The embodiments of the present application will be further explained with reference to the drawings.

As shown in fig. 1, fig. 1 is a schematic flowchart of a character recognition network model training method and a character recognition method provided in an embodiment of the present application, where solid arrows represent training steps, and dashed arrows represent recognition steps.

The character recognition network model comprises a deep convolution neural network, a multi-channel attention mechanism module, a comparative attention feature learning branch and a multi-attention feature fusion module.

Deep convolutional neural network: a neural network useful for classification, the network consisting essentially of convolutional and pooling layers. The convolution layer is used for extracting picture characteristics; the role of the pooling layer is to reduce the dimensionality of the eigenvectors output by the convolutional layer, reducing overfitting. Parameters in the network can be updated by a back propagation algorithm. In the embodiment of the application, the deep convolutional neural network is composed of 14 convolutional layers and 4 pooling layers.

An attention mechanism module: in a manner of simulating human observation, generally speaking, when people look at a picture, in addition to holding an image as a whole, people pay more attention to some local information of the picture, such as the position of a table, the type of goods, and the like. In the field of computer vision, the essence of the attention mechanism is to select information needing more attention from input information and perform feature extraction from key parts. The introduction of the attention mechanism can increase the expression capacity of the model under the condition of hardly increasing the complexity of the model on one hand; on the other hand, the attention mechanism only selects and processes input information important for the model, and the efficiency of the neural network can be improved.

Comparing attention feature learning branches: the global features of the image are extracted, so that general objects can be well classified, but for the fine-grained classification problem of handwritten Chinese characters, the local features with distinguishing characters need to be concerned. The purpose of learning the contrast attention features is to enable a multi-channel attention mechanism module to locate a plurality of local regions for an input sample, and train under supervision of a contrast loss function and a region center loss function to obtain scattered attention regions, so that the model can be more likely to locate the features with distinguishing force of characters, and the recognition error rate of visually similar characters is reduced.

Referring to fig. 2 and 3, an embodiment of the present application provides a method for training a character recognition network model, including the following steps:

step S100: standardizing each picture in the original data set, and carrying out character type labeling on each picture to obtain a standard training data set with the character type labels;

step S200: inputting each picture in the standard training data set into a convolutional neural network, extracting the convolution characteristics of the pictures, and obtaining a depth characteristic map containing the convolution characteristics;

step S300: inputting the depth feature map into an attention mechanism module with a plurality of channels to obtain the attention weight of each channel, and rescaling each channel of the depth feature map by using the attention weight to obtain a plurality of attention feature maps;

step S400: inputting each attention feature map into a full-connection layer respectively to obtain a plurality of attention feature vectors;

step S500: performing feature fusion on the plurality of attention feature vectors, and inputting the feature vectors into a character full-connection layer to perform character type prediction;

step S600: and designing a target loss function according to the character type prediction result and the character type label, and performing iteration by using a back propagation algorithm to minimize the target loss function and optimize the attention weight.

In an embodiment, step S100 specifically includes: counting each picture I in the original data set_i(i ═ 1, ·, N), and the mean and variance of each picture, scaling the height and width of each picture to a preset height H and a preset width W, where generally, the default values of the preset height H and the preset width W are both 96, where N is the number of pictures in the original data set; and for each picture I_iAnd carrying out character type labeling to obtain a standard training data set with the character type label.

In an embodiment, referring to fig. 4, step S200 specifically includes: the convolutional neural network comprises 2 convolutional layers (conv1, conv2) and 4 convolutional modules, and the picture I to be standardized_i(i ═ 1, ·, N) are input into 2 convolutional layers (Conv1, Conv2), each of which is followed by a Batch Normalization layer (BN) and a nonlinear activation function ReLU, obtaining a signature graph with a size of 96 × 64, then the signature graph is input into a maximum pooling layer with a step size of 2 for sampling, obtaining a signature graph of 48 × 64, and then the signature graph is input into 4 convolution modules (Conv-Block), each of which is composed of 3 convolutional layers with a convolution kernel size of 3 × 3 and 3 Batch Normalization layers, wherein the 3 Batch Normalization layers are followed by the 3 convolutional layers respectivelyThen, the convolution module (Conv-Block) is a "bottleneck" structure, with the middle layer of 3 convolution layers having fewer channels than the upper and lower layers; each convolution module (Conv-Block) is connected with the largest pooling layer with the step size of 2, the resolution of the input feature map is halved, and finally after 4 convolution modules (Conv-Block), the depth feature map X with the size of 6X 448 is output_iThese depth profiles X_iContains high-level semantic information obtained through 14 convolutional layers.

In an embodiment, referring to fig. 5, step S300 specifically includes: a depth profile X of size 6X 448 output by the last convolution module (Conv-Block)_iAs input, the data is sent to an attention mechanism module with a plurality of channels to calculate an attention feature map

In this embodiment, the value of S is 2; the attention mechanism module uses the channel attention mechanism introduced by the SENET method for reference, and firstly uses a global flattening pool to assemble an input depth feature map X in the H multiplied by W spatial dimension_iTo generate a channel descriptor z^s＝[z₁,…，z_C]Wherein z is^sThe c element of (a)^cThe calculation method comprises the following steps:

wherein S is 1, S is the number of attention mechanism modules;

wherein C is 1, C, C is the number of channels;

at z^sThe channel descriptors are processed using the gating mechanism with Sigmoid activation to obtain the attention weight of each attention mechanism module:

where σ is Sigmoid function, δ is ReLU function,

r is the channel compression ratio;

each attention mechanism module re-aligns depth feature map X using attention weights_iIs scaled to obtain a plurality of attention feature maps

Wherein

Picture I representing normalization_iC channel of corresponding attention feature map

And scalar quantity

The product between them.

In an embodiment, step S400 specifically includes: inputting the plurality of attention feature maps obtained in step S300 into a comparative attention feature learning branch for extracting attention features of the local distinctive zones, i.e. each attention feature map

To full-junction layers containing 768 neurons respectively:

wherein the operator F_flatt(. cndot.) tiling the matrix into 1-dimensional vectors.

In an embodiment, step S500 specifically includes: attentions are paid toEigenvector f_i ^s(S is 1, …, S), and then input to a full-link layer containing 3755 neurons to perform character type prediction:

Y_i＝soft max(W·[f_i ¹,…,f_i ^S])

wherein [ ·]Denotes cascade operation, Y_iRepresenting a picture I_iCorresponding scores of 3755 Chinese characters, the category with the highest score is the predicted result of character category

In an embodiment, step S600 specifically includes: using character category label gt as expected output of network model to predict result

Designing a target loss function between the expected output of the network model and the predicted output of the network model for the predicted output of the network model, and minimizing a cross entropy loss function L in the training process_clsTo ensure each attention feature map

Can locate areas important for character classification; regarding the comparative attention feature learning branch, taking the plurality of attention features obtained in step S300 as input, and using metric learning loss functions, i.e., a contrast loss function and a region center loss function, to focus the attention feature map of the network model on different regions with distinctive features of the input picture; in particular, a contrast loss function is applied to the attention feature to capture separable attention areas;

defining the target loss function as:

L_total＝L_cls+λ(L_center+L_contra)

wherein L is_clsAs a cross-entropy loss function, L_centerFor the region-centered loss function for reducing the distance between the individual attention features of the same type of character, L_contraFor drawing a picture I_iA plurality of attention feature vectors f_i ^sA contrast loss function of zooming out in a high-dimensional space, wherein lambda is a hyper-parameter used for controlling the weight occupied by the two loss functions;

the contrast loss function is defined as:

wherein D (I)_i) Is defined as:

wherein m is a preset threshold; the contrast loss function is to input picture I_iA plurality of attention feature vectors f_i ^sIn the high-dimensional space, the distance between every two vectors is larger than a preset threshold m, and m is set to be 40 in the embodiment, so that the local features of the characters positioned by the attention feature maps are different, and thus the character recognition network model is more likely to mine the distinguishing features of the characters.

The area center loss function is defined as:

the region center loss function is used for reducing the distance between the attention features of the same type of characters, so that a plurality of attention features learned by the same type of characters are respectively similar, and each attention feature map is convenient for

Are activated in the same character part, wherein

Is y_iThe center of the s-th attention feature of the class, d the dimension of the feature, NoteCenter of gravity characteristics

Initializing with a Gaussian distribution with a mean value of 0 and a variance of 1, and then updating the feature center according to a region center loss function algorithm.

And (4) according to the designed target loss function, utilizing a back propagation algorithm to carry out iteration, and minimizing the cross entropy loss function in the training process to realize an optimal network model. Aiming at the off-line handwritten Chinese character recognition task, the original data set is used for iterative training in the training process to obtain the parameters of the network model.

Referring to fig. 6, an embodiment of the present application provides a character recognition method, which recognizes a handwritten chinese character image by using a character recognition network model trained in the above embodiment of the present application, and includes the following steps:

step A100: picture I to be tested_iCarrying out standardization, and zooming to a preset height H and a preset width W;

step A200: picture I to be tested_iInputting the convolution neural network to extract a picture I to be tested_iObtaining a depth feature map X containing the convolution features_i；

Step A300: depth feature map X_iInputting an attention mechanism module with a plurality of channels, obtaining the attention weight of each channel, and using the attention weight to re-align the depth feature map X_iIs scaled to obtain a plurality of attention feature maps

Step A400: each attention feature map

Respectively inputting into the full-connection layer to obtain multiple attention feature vectors f_i ^s；

Step A500: a plurality of attention feature vectors f_i ^sPerforming feature fusion, inputting into character full-connection layer for character classificationAnd (6) predicting.

In an embodiment, step a200 specifically includes: the convolutional neural network comprises 2 convolutional layers (conv1, conv2) and 4 convolutional modules, and pictures I to be tested_iInputting 2 convolutional layers (Conv1, Conv2), each of which is followed by a Batch Normalization layer (BN) and a nonlinear activation function ReLU to obtain a feature map with the size of 96 × 64, then inputting the feature map into a maximum pooling layer with the step length of 2 for sampling to obtain a feature map with the size of 48 × 64, and then inputting the feature map into 4 convolution modules (Conv-Block), wherein each convolution module is composed of 3 convolutional layers with the convolution kernel size of 3 × 3 and 3 Batch Normalization layers, wherein the 3 Batch Normalization layers are respectively followed by 3 convolutional layers, and the convolution module (Conv-Block) is a 'bottleneck' structure, and the channels of the middle layers of the 3 convolutional layers are less than that of the upper layer and the lower layer; each convolution module (Conv-Block) is connected with the largest pooling layer with the step size of 2, the resolution of the input feature map is halved, and finally after 4 convolution modules (Conv-Block), the depth feature map X with the size of 6X 448 is output_iDepth feature map X_iContains high-level semantic information obtained through 14 convolutional layers.

In an embodiment, step a300 specifically includes: a depth profile X of size 6X 448 output by the last convolution module (Conv-Block)_iAs input, the data is sent to an attention mechanism module with a plurality of channels to calculate an attention feature map

wherein S is 1, S is the number of attention mechanism modules;

wherein C is 1, C, C is the number of channels;

where σ is Sigmoid function, δ is ReLU function,

r is the channel compression ratio;

Wherein

And scalar quantity

The product between them.

In an embodiment, step a400 specifically includes: inputting the plurality of attention feature maps obtained in the step A300 into a comparative attention feature learning branch for extracting bureauAttention characteristics characterised by a distinct region, i.e. each attention profile

To full-junction layers containing 768 neurons respectively:

In an embodiment, step a500 specifically includes: a plurality of attention feature vectors f_i ^s(S is 1, …, S), and then input to a full-link layer containing 3755 neurons to perform character type prediction:

Y_i＝soft max(W·[f_i ¹,…,f_i ^S])

wherein [ ·]Denotes cascade operation, Y_iRepresenting pictures I to be tested_iCorresponding scores of 3755 Chinese characters, the category with the highest score is the predicted result of character category

Through the above technical scheme that this application was conceived, compare with prior art, have following technological effect:

(1) the accuracy is high: aiming at the problem of low character recognition precision of the handwritten Chinese characters with large differences between the shapes and styles of the handwritten Chinese characters, the method creatively utilizes a multiple contrast attention mechanism to extract the distinguishing characteristics of the Chinese characters, and more accurately recognizes the handwritten Chinese characters.

(2) The speed is high: the provided character recognition network model has higher training speed while ensuring the recognition precision.

(3) The universality is strong: the method can accurately identify the Chinese characters with similar characters, can realize complete end-to-end training, has small model parameter, is simple and effective, and is easy for the product to fall on the ground.

(4) The robustness is strong: the method can overcome the shape change of the handwritten Chinese characters caused by the writing styles of different individuals, and achieves the highest recognition precision on the standard handwritten Chinese character test set.

Referring to fig. 7, an embodiment of the present application provides a character recognition network model training apparatus 100, including: a memory 101, a processor 102 and a computer program stored on the memory and executable on the processor, the processor implementing the word recognition network model training method in the above embodiments when executing the computer program, for example, executing the above-described method steps S100 to S600 of fig. 2. The processor 102 and the memory 101 may be connected by a bus or other means, and fig. 7 illustrates the connection by a bus as an example.

Referring to fig. 8, an embodiment of the present application provides a text recognition apparatus 200, including: a memory 201, a processor 202 and a computer program stored on the memory and executable on the processor, the processor implementing the word recognition method in the above embodiments when executing the computer program, for example, performing the above described method steps a100 to a500 of fig. 6. The processor 202 and the memory 201 may be connected by a bus or other means, and fig. 8 illustrates a bus connection as an example.

An embodiment of the present application further provides a terminal, including the character recognition network model training apparatus 100 described in the foregoing embodiment or including the character recognition apparatus 200 described in the foregoing embodiment. The terminal may be any type of smart terminal, such as a smart phone, a tablet computer, a laptop computer, or a desktop computer.

Furthermore, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, which are executed by a processor or a controller, for example, by a processor 102 in fig. 7, and can cause the processor 102 to execute the text recognition network model training method in the above embodiment, for example, execute the above-described method steps S100 to S600 in fig. 2. As another example, execution by one of the processors 202 in fig. 8 may cause the processor 202 to perform the text recognition method in the above-described embodiment, for example, perform the above-described method steps a100 to a500 of fig. 6.

One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

While the preferred embodiments of the present invention have been described, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are included in the scope of the present invention defined by the claims.

Claims

1. A character recognition network model training method is characterized by comprising the following steps:

2. The method of claim 1, wherein normalizing each image in the raw data set comprises:

counting each picture I in the original data set_i(i ═ 1, ·, N), scaling the height and width of each picture to a preset height H and a preset width W, where N is the number of pictures in the original data set.

3. The method of claim 2, wherein the convolutional neural network comprises convolutional layers and convolutional modules;

inputting each picture in the standard training data set into a convolutional neural network, extracting the convolutional characteristic of the picture, and obtaining a depth characteristic map containing the convolutional characteristic, wherein the method comprises the following steps:

picture I to be standardized_i(i ═ 1, ·, N) are respectively input into several convolution layers, each convolution layer is followed by a batch normalization layer and nonlinear activation function ReLU, then input into maximum pooling layer to make sampling, then input into several described convolution modules, every convolution module is formed from several convolution layers with same quantity and batch normalization layer, every batch normalization layer is followed by every convolution layer, every convolution module is connected with maximum pooling layer, and finally one described convolution module outputs depth characteristic graph X containing convolution characteristic_i。

4. The method for training a character recognition network model according to claim 1 or 3, wherein the attention weight is obtained by the following steps:

the attention mechanism module aggregates the input depth feature maps in spatial dimensions using global average pooling to generate channel descriptors, which are processed using a gating mechanism with Sigmoid activation to derive an attention weight for each channel.

5. The method of claim 3, wherein the inputting the depth feature map into an attention mechanism module having a plurality of channels, obtaining an attention weight for each channel, and rescaling each channel of the depth feature map using the attention weight to obtain a plurality of attention feature maps comprises:

the attention mechanism module uses a global flattening pool to assemble the input depth feature map X in the spatial dimension H X W_iTo generate a channel descriptor z^s＝[z₁,…，z_C]Wherein z is^sThe c element of (a)^cThe calculation method comprises the following steps:

wherein S is 1, S is the number of attention mechanism modules;

wherein C is 1, C, C is the number of channels;

at z^sThe channel descriptors are processed using a gating mechanism with Sigmoid activation to obtain the attention weight of each attention mechanism module:

where σ is Sigmoid function, δ is ReLU function,

r is the channel compression ratio;

each attention mechanism module re-aligns the depth feature map X using the attention weights_iIs scaled to obtain a plurality of attention feature maps

Wherein

Picture I representing normalization_iCorresponding c channel of the attention feature map

And scalar quantity

The product between them.

6. The method as claimed in claim 5, wherein said inputting each said attention feature map into a full-connected layer to obtain a plurality of attention feature vectors comprises:

mapping a plurality of said attention profiles

Respectively input to the full connection layer:

7. The method as claimed in claim 6, wherein said performing feature fusion on a plurality of attention feature vectors, and inputting the feature fusion into a character class fully-connected layer for character class prediction comprises:

a plurality of said attention feature vectors f_i ^s(S is 1, …, S), and then inputting the character into a character full-connection layer for character type prediction:

Y_i＝softmax(W·[f_i ¹,…,f_i ^S])

wherein [ ·]Denotes cascade operation, Y_iRepresenting a picture I_iAnd the corresponding scores belonging to the character categories, wherein the category with the highest score is the result of character category prediction.

8. The method of claim 7, wherein the step of designing an objective loss function according to the result of the character class prediction and the character class label, and performing iteration by using a back propagation algorithm to minimize the objective loss function and optimize the attention weight comprises:

defining the target loss function as:

L_total＝L_cls+λ(L_center+L_contra)

wherein L is_clsAs a cross-entropy loss function, L_centerFor the region-centered loss function for reducing the distance between the individual attention features of the same type of character, L_contraFor drawing a picture I_iA plurality of said attention feature vectors f_i ^sA contrast loss function of zooming out in a high-dimensional space, wherein lambda is a hyper-parameter used for controlling the weight occupied by the two loss functions;

the contrast loss function is defined as:

wherein D (I)_i) Is defined as:

wherein m is a preset threshold;

the area center loss function is defined as:

wherein

Is y_iThe center of the s-th attention feature of the class, d the dimension of the feature, the center of the attention feature

Initializing Gaussian distribution with the mean value of 0 and the variance of 1, and then updating a characteristic center according to a regional center loss function algorithm;

and according to the target loss function, utilizing a back propagation algorithm to carry out iteration, minimizing the cross entropy loss function, and optimizing the attention weight.

9. A method for recognizing a character, comprising:

inputting the picture to be tested into a convolutional neural network, extracting the convolutional characteristic of the picture to be tested, and obtaining a depth characteristic graph containing the convolutional characteristic;

10. The method of claim 9, wherein the convolutional neural network comprises convolutional layers and convolutional modules;

inputting the picture to be tested into a convolutional neural network, extracting the convolutional characteristic of the picture to be tested, and obtaining a depth characteristic graph containing the convolutional characteristic, wherein the method comprises the following steps:

the picture I to be tested is processed_iInputting the data into the plurality of convolution layers, each convolution layer is connected with a batch normalization layer and a nonlinear activation function ReLU, then inputting the data into a maximum pooling layer for sampling, and then inputting the data into the plurality of convolution modules, each convolution module is composed of a plurality of convolution layers and batch normalization layers with the same number, each batch normalization layer is connected with each convolution layer, and the last convolution module outputs a depth feature graph X containing convolution features_i。

11. The character recognition method of claim 9 or 10, wherein the attention weight is obtained by:

12. The method of claim 10, wherein the inputting the depth feature map into an attention mechanism module having a plurality of channels, obtaining an attention weight for each channel, and rescaling each channel of the depth feature map using the attention weight to obtain a plurality of attention feature maps comprises:

wherein S is 1, S is the number of attention mechanism modules;

wherein C is 1, C, C is the number of channels;

where σ is Sigmoid function, δ is ReLU function,

r is the channel compression ratio;

Wherein

And scalar quantity

The product between them.

13. The method of claim 12, wherein said inputting each said attention feature map into a full-concatenation layer to obtain a plurality of attention feature vectors comprises:

mapping a plurality of said attention profiles

Respectively input to the full connection layer:

14. The method of claim 13, wherein said feature fusing a plurality of said attention feature vectors and inputting the fused attention feature vectors into a character class full-link layer for character class prediction comprises:

Y_i＝softmax(W·[f_i ¹,…,f_i ^S])

15. A character recognition network model training device comprises: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of training a word recognition network model according to any one of claims 1 to 8 when executing the computer program.

16. A character recognition apparatus comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of word recognition according to any one of claims 9 to 14 when executing the computer program.

17. A terminal comprising the character recognition network model training apparatus of claim 15 or comprising the character recognition apparatus of claim 16.

18. A computer storage medium storing computer-executable instructions for performing the method of any of claims 1 to 8 or for performing the method of any of claims 9 to 14.