CN113378815B - Scene text positioning and identifying system and training and identifying method thereof - Google Patents

Scene text positioning and identifying system and training and identifying method thereof

Info

Publication number
CN113378815B
CN113378815B CN202110666699.5A
Authority
CN
China
Prior art keywords
text
character
frame
module
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110666699.5A
Other languages
Chinese (zh)
Other versions
CN113378815A (en)
Inventor
刘凯
孙乐
朱均可
叶堂华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202110666699.5A priority Critical patent/CN113378815B/en
Publication of CN113378815A publication Critical patent/CN113378815A/en
Application granted granted Critical
Publication of CN113378815B publication Critical patent/CN113378815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a system for locating and recognizing scene text, together with a training and recognition method for it, belonging to the technical field of computers. The system comprises a locator, a grouping module and a recognizer, the locator being connected to the recognizer through the grouping module. The locator outputs character boxes, character connection boxes and text boxes; the character boxes and character connection boxes locate the text. The grouping module crops character pictures according to the character boxes and text boxes, groups them and sends the groups to the recognizer, which outputs a recognition result for each group; scene text localization and recognition are then completed from the localization and recognition results. The invention enables a computer to detect and recognize text in natural scenes accurately and efficiently.

Description

Scene text positioning and identifying system and training and identifying method thereof
Technical Field
The invention relates to a scene text positioning and identifying system and a training and identifying method thereof, belonging to the technical field of computers.
Background
Text is one of the most important sources of information for humans and their most important information carrier. Since the advent of computers, people have wanted computers to help recognize and process text automatically. Text recognition is a branch of computer vision research and an important component of computer science. In early application scenarios, text recognition was applied to images containing text in simple, canonical formats, such as black-and-white documents, tickets, passports, drivers' licenses and identification cards. Reading text from such images is typically done by optical character recognition software. Early solutions recognized and detected text according to its characteristics in the image; the steps were generally image input, image preprocessing, text feature extraction, comparison and recognition, followed by manual correction of misrecognized text and output of the result. With the development of deep learning, a large number of deep-learning-based solutions have emerged that can be applied to simple scenes with high accuracy.
With the advent of the artificial intelligence era, a more challenging task has emerged in the last decade: reading text from natural scene images. Text detection and recognition in natural scene pictures is a hotspot and a difficult problem in current research. Text in an image may appear in many different settings, such as handwritten text, text in documents, text in menus, and street-view text such as street names, advertisements, signboards, restaurant names and coffee shop names. In some cases the text is the explicit focus of the photograph, as in a scanned document or a restaurant menu, but it may also be incidental content in a scene image: the text is present, but capturing it was not the intention of the photographer. Text in natural scene images differs greatly from text in scanned images; some of the major differences are uneven lighting, blur, non-horizontal orientation, perspective distortion, varied fonts and sizes, text deformation and occlusion.
Text in natural scene images contains important semantic information that helps us analyze and understand the corresponding environment. Applications of natural scene text recognition include image retrieval, image classification, autonomous driving and human-computer interaction, all of which can greatly improve our daily lives. In recent years, models based on deep neural networks have come to dominate the field of scene text detection and recognition.
However, the following problems remain:
1) Low localization and recognition efficiency. To achieve better localization and recognition, some scene text models adopt complex network structures with a large amount of computation, so localization and recognition are slow and the models are difficult to apply in real scenes;
2) Inaccurate text localization. This problem has two aspects. First, the text may not be located accurately: scene text is complex and variable, so the accuracy of natural scene text localization models still has room for improvement. Second, the text position may be marked inaccurately: most current localization models mark text positions with rectangular boxes, and because of the inherent properties of rectangles this form of annotation cannot tightly mark irregularly shaped text;
3) Inaccurate character recognition. Recognition is affected by blur, occlusion by clutter, light spots and other factors acting on characters in natural scenes; traditional text recognition frameworks are not robust to such interference, and a small amount of dirt, occlusion or other noise can prevent the model from recognizing the whole text correctly. In addition, in Chinese scene text recognition, the large variety of Chinese characters often leads to very low recognition accuracy.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a scene text localization and recognition system and a training and recognition method for it, which enable a computer to detect and recognize text in natural scenes accurately and efficiently.
The scene text localization and recognition system of the invention comprises a locator, a grouping module and a recognizer, wherein the locator is connected to the recognizer through the grouping module;
the locator outputs character boxes, character connection boxes and text boxes, and the character boxes and character connection boxes locate the text; the grouping module crops character pictures according to the character boxes and text boxes, groups them and sends the groups to the recognizer; the recognizer outputs a recognition result for each group, and scene text localization and recognition are finally completed from the localization and recognition results;
the structure of the locator adopts a network architecture of a target detection network CenterNet2 based on a central point, and comprises a depth residual convolution network, a weighted bidirectional feature pyramid network, a deconvolution module, a region of interest extraction and stacking head; the locator indicates the position and the size of the characters by predicting the position, the width and the height of the central point of the character frame and the offset of the central point, indicates the connection relation between two characters in a text segment by predicting the position, the width and the height of the central point of the character frame and indicates the position and the size of the text segment by predicting the position, the width and the height of the central point of the text frame and the offset of the central point;
the structure of the recognizer adopts a network architecture of a semantic enhancement coding and decoding frame Seed facing scene text recognition, and comprises a convolution feature extraction network VGG16, a two-way long and short-term memory network BiLSTM, a semantic pre-training depth two-way converter language model Bert and a gating circulation unit network based on Bahdanau attention.
Further, the input to the locator is an RGB three-channel picture. The picture is normalized and standardized, and the processed picture is fed into a 101-layer deep residual convolutional network ResNet, which can be divided into a stem module and four residual convolution modules; deformable convolution is used in the last two residual convolution modules. The feature maps output by the four residual convolution modules are fed into a weighted bidirectional feature pyramid network BiFPN. The feature map output by the BiFPN is sent to a deconvolution module consisting of three deconvolution groups, each containing one convolution and one deconvolution; each deconvolution doubles the size of the feature map, and the resulting feature map is sent to three convolution branches to output predictions. The scene text localization and recognition system predicts center-point heatmaps for text, characters and character connections, together with the width, height and center offset of their boxes, and regresses the width, height and center offset of the text, character and character connection boxes. The loss function used for the center-point predictions of text, characters and character connections is the focal loss, and the minimum absolute deviation (L1) loss is used for the offsets and the box widths and heights. Features are then extracted from the corresponding feature map according to the character boxes and are further subdivided into three categories, text box, character box and character connection box, by the cascade head. Text, including regular-shaped and irregular-shaped text, is accurately located from the character boxes and character connection boxes, and characters can be grouped according to the text boxes.
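By way of illustration only, a minimal sketch of these two loss terms (assuming PyTorch, Gaussian-rendered ground-truth center heatmaps, and illustrative function names that are not part of the invention):

import torch
import torch.nn.functional as F

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    # pred, gt: (B, 3, H, W) heatmaps for text, character and character-connection centers
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()                      # pixels that are box centers
    neg = 1.0 - pos                             # all other pixels
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

def size_offset_l1_loss(pred_wh, gt_wh, pred_off, gt_off, center_mask):
    # minimum absolute deviation (L1) loss on widths/heights and center offsets,
    # evaluated only at ground-truth center pixels
    m = center_mask.unsqueeze(1).float()        # (B, 1, H, W)
    n = m.sum().clamp(min=1.0)
    return (F.l1_loss(pred_wh * m, gt_wh * m, reduction='sum')
            + F.l1_loss(pred_off * m, gt_off * m, reduction='sum')) / n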
Further, the character pictures grouped by the grouping module are sent sequentially to the convolutional feature extraction network (VGG16) to generate character features. The character features of one group are fed together into the encoder, a bidirectional long short-term memory network (BiLSTM) with 256 hidden units, which produces a hidden-layer output. This hidden-layer output is sent to two modules: the first is a semantic module, and the second is a decoder module built from a gated recurrent unit (GRU) with Bahdanau attention. The semantic module has two linear layers and generates the semantic information; the semantic information is also sent to the second module through one of the linear layers. The loss of the whole network consists of two parts: the first is the cross-entropy loss between the predicted result and the ground truth, and the second is the cosine embedding loss between the predicted semantic information and the transcription label's word embedding from the semantically pre-trained deep bidirectional transformer language model (Bert).
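By way of illustration only, a minimal sketch of this two-part loss (assuming PyTorch; bert_word_embedding stands in for the transcription label's embedding taken from a pre-trained Bert model and is a hypothetical name):

import torch
import torch.nn.functional as F

def recognizer_loss(decoder_logits, target_ids, pred_semantics, bert_word_embedding):
    # decoder_logits: (T, num_classes) per-step character predictions from the GRU decoder
    # target_ids:     (T,) ground-truth character indices
    # pred_semantics / bert_word_embedding: (D,) semantic vectors
    ce = F.cross_entropy(decoder_logits, target_ids)
    cos = F.cosine_embedding_loss(pred_semantics.unsqueeze(0),
                                  bert_word_embedding.unsqueeze(0),
                                  torch.ones(1))
    return ce + cos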
Further, the second module is a single-layer attention gated recurrent unit (GRU) with 512 hidden units and 512 attention units.
The invention also provides a training and identifying method of the scene text positioning and identifying system, which mainly comprises the following steps:
step 1: dataset preprocessing
Synthesizing a text data set using the data generation method of the SynthText synthetic text data set;
Combining the synthesized text data set with an existing ICDAR (International Conference on Document Analysis and Recognition) data set to form the text data set required for the second stage of weak supervision training;
Converting the data set annotations into the annotation format required for model training;
step 2: positioner for training model
Shuffling the data set according to the preset weak supervision training parameters;
Training the locator in the first weak supervision stage, which uses only the synthetic data set;
Training the locator in the second weak supervision stage, which uses mixed training on the synthetic data set and the real data set at a ratio of 5:1, i.e. about 83% synthetic data and 17% real data;
Selecting the optimal model parameters according to the performance of the trained locator on the validation set under different hyperparameters;
step 3: identifier for training model
Extracting character positions from the real data set with the trained text locator, combining them with the synthetic data set and packaging the result into the recognition data set required by the recognizer;
Feeding the recognition data set into the recognition model for training;
Selecting the optimal model parameters according to the performance of the trained recognizer on the validation set under different hyperparameters;
step 4: recognition effect of test model
Assembling the locator and the recognizer and testing the model as a whole.
Further, the SynthText synthetic text data set consists of 800,000 images containing approximately 8 million synthetic word instances, each annotated with its text string and character-level bounding boxes.
Further, the locator requires four data sets: the synthetic data set used in the first weak supervision stage, the mixed data set used in the second weak supervision stage, a validation set and a test set. The mixed data set is a combination of the synthetic data set and a real data set, with real data to synthetic data at a ratio of 1:5; the test set is a real data set.
Further, the recognizer requires three data sets: a mixed data set, a validation set and a test set. The mixed data set combines a data set generated from the real data set using the locator's predictions with the synthetic data set, at a generated-to-synthetic ratio of 1:3; the test set is a real data set.
1. The invention can accurately detect curved and deformed text:
the method is suitable for positioning the text affected by noise such as illumination, visual angle, shielding, dirt and the like in a natural scene. The data set mainly utilized is a synthetic text (SynthText) data set, and the data generation method used in the synthetic text data set (SynthText) is a method for generating a text synthetic image, and the generated text can be naturally fused in the existing natural scene, and the method uses the existing deep learning and segmentation technology to well embed a text box into a background image. Therefore, the invention can achieve higher positioning accuracy by only training with the synthesized data set under the condition of not training with the real data set. The invention adopts a target detector (CenterNet) based on a central point as a target probability detector in a first stage, and adopts a Cascade Head (Cascade-Head) to carry out further classification and regression in a second stage, so that the locator part of the invention can accurately locate texts;
2. The invention can accurately recognize text in natural scenes:
the recognizer part of the invention utilizes a localizer to accurately extract characters, and accurately recognizes the characters by a recognition model based on a semantic enhanced coding and decoding framework (Seed) for scene-oriented text recognition. The semantic enhancement coding and decoding framework (Seed) oriented to scene text recognition is adopted, the recognition speed is improved by using a feature extraction network (VGG 16), and the recognition accuracy is improved by using a pre-trained deep bi-directional transformer language model (Bert).
Drawings
FIG. 1 is an overall structure diagram of a scene text positioning recognition model of the present invention;
FIG. 2 shows the text localization and recognition results of the model in FIG. 1;
FIG. 3 is a block diagram of the locator in FIG. 1;
FIG. 4 is an example of a character box in FIG. 1;
FIG. 5 is an example of a character connection box in FIG. 1;
FIG. 6 is an example of a text box in FIG. 1;
FIG. 7 is a structural diagram of the recognizer in FIG. 1.
Description of the embodiments
The scene text localization and recognition system of the invention comprises a locator, a grouping module and a recognizer, the locator being connected to the recognizer through the grouping module. As shown in fig. 1 and fig. 2, the locator outputs character boxes, character connection boxes and text boxes. The character boxes and character connection boxes locate the text. The grouping module crops character pictures according to the character boxes and groups them according to the text boxes; the groups are fed into the recognizer, which outputs a recognition result for each group. Scene text localization and recognition are then completed from the localization and recognition results.
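By way of illustration only, a minimal sketch of this grouping step (assuming boxes in (x1, y1, x2, y2) form and a numpy image array; the helper names are illustrative):

def box_center(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def group_characters(image, char_boxes, text_boxes):
    groups = [[] for _ in text_boxes]
    for cb in char_boxes:
        cx, cy = box_center(cb)
        for i, tb in enumerate(text_boxes):
            if tb[0] <= cx <= tb[2] and tb[1] <= cy <= tb[3]:   # character center falls in this text box
                x1, y1, x2, y2 = map(int, cb)
                groups[i].append((cx, image[y1:y2, x1:x2]))      # keep the crop with its x position
                break
    # order the crops left to right inside each text box before recognition
    return [[crop for _, crop in sorted(g, key=lambda t: t[0])] for g in groups]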
On the locator side, the invention applies, for the first time, the probability-interpretation-based two-stage object detection theory to scene text localization by combining several mature one-stage object detectors and detection heads, thereby improving localization accuracy. In addition, the locator is trained in a weakly supervised manner combined with traditional deep learning methods, so character positions can be extracted effectively.
On the recognizer side, the Seed text recognition model (a semantics-enhanced encoder-decoder framework for scene text recognition) is simplified and improved; the recognizer extracts character features at the character positions output by the locator and outputs the result. Compared with recognition from whole text-line features, recognition from character features is faster and more accurate.
Specifically, the locator adopts the network architecture of the center-point-based object detection network CenterNet2, which includes a deep residual convolutional network, a weighted bidirectional feature pyramid network, a deconvolution module, region-of-interest extraction and a cascade head, as shown in fig. 3.
The input to the locator is an RGB three-channel picture. The picture is normalized and standardized, and the processed picture is fed into a 101-layer deep residual convolutional network (ResNet), which can be divided into a stem module and four residual convolution modules; deformable convolution is used in the last two residual convolution modules. The feature maps output by the four residual convolution modules are fed into a weighted bidirectional feature pyramid network (BiFPN). The feature map output by the BiFPN is sent to a deconvolution module containing three deconvolution groups, each with one convolution and one deconvolution; each deconvolution doubles the size of the feature map, and the resulting feature map is sent to three convolution branches to output predictions. The predictions take the form of text boxes, character boxes and character connection boxes: a text box encloses a whole piece of text, a character box encloses a single character, and a character connection box encloses the link between two characters, that is, two adjacent and related characters fall inside one character connection box. The model predicts center-point heatmaps for text, characters and character connections, together with the width, height and center offset of their boxes. We classify the pixels of an image into two classes: (1) the center point of a box and (2) not the center point of a box, and regress the width, height and center offset of the text, character and character connection boxes. The loss function used for the center-point predictions of text, characters and character connections is the focal loss, and the minimum absolute deviation (L1) loss is used for the offsets and the box widths and heights. Features are then extracted from the corresponding feature map according to the character boxes and are further subdivided into three categories, namely text boxes, character boxes and character connection boxes, using a cascade head. Text, including regular-shaped and irregular-shaped text, can be accurately located from the character boxes and character connection boxes. As shown in fig. 2, the model detects the text COLCHESTER, whose shape is irregular, while POST and OFFICE are regular in shape (the rectangles indicate the text locations). Characters can then be grouped according to the text boxes.
The character box represents the position and size of a character. As shown in fig. 4, the rectangular box marks the position and size of the P in the word POST; this rectangular box is the character box. The locator part of the invention indicates the position and size of a character by predicting the center point position, width and height of the character box and the center offset. For example, if the locator predicts a center point (x, y), width and height (w, h) and offset (x1, y1), the offset-corrected center of the character box is (x+x1, y+y1) and its width and height are (w, h).
The character connection box represents the connection relationship between two characters. As shown in fig. 5, the rectangular box marks the connection between P and O in the word POST, i.e. P and O are two connected characters within one piece of text; likewise, in the word POST, O and S, and S and T, are connected. The rectangular box in the figure is the character connection box. The locator part of the invention represents the connection between two characters in a piece of text by predicting the center point position, width and height of the character connection box and the center offset. For example, if the locator predicts a center point (x, y), width and height (w, h) and offset (x1, y1), the offset-corrected center of the character connection box is (x+x1, y+y1) and its width and height are (w, h).
The text box represents the position and size of a text segment. As shown in fig. 6, the rectangular box marks the position and size of the text segment POST; the rectangular box in the figure is the text box. The locator part of the invention indicates the position and size of a text segment by predicting the center point position, width and height of the text box and the center offset. For example, if the locator predicts a center point (x, y), width and height (w, h) and offset (x1, y1), the offset-corrected center of the text box is (x+x1, y+y1) and its width and height are (w, h).
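By way of illustration only, a minimal sketch of how a box is reconstructed from these predictions (the corner convention is an assumption):

def decode_box(x, y, w, h, dx, dy):
    # (x, y): predicted center; (w, h): width and height; (dx, dy): center offset
    cx, cy = x + dx, y + dy                                  # offset-corrected center
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)  # (x1, y1, x2, y2) corners

# e.g. decode_box(100, 40, 20, 30, 0.4, -0.2) -> (90.4, 24.8, 110.4, 54.8)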
The recognizer adopts the network architecture of the semantics-enhanced encoder-decoder framework (Seed) for scene text recognition and comprises a convolutional feature extraction network (VGG16), a bidirectional long short-term memory network (BiLSTM), a semantically pre-trained deep bidirectional transformer language model (Bert) and a gated recurrent unit network with Bahdanau attention, as shown in fig. 7.
Because the locator outputs text boxes, character boxes and character connection boxes, the grouping module crops character pictures according to the character boxes, groups them according to the text boxes, and normalizes and standardizes the character pictures within each group. The grouped character pictures are sent sequentially to the convolutional feature extraction network (VGG16) to generate character features. The character features of one group are fed together into the encoder, a bidirectional long short-term memory network (BiLSTM) with 256 hidden units, which produces a hidden-layer output. This hidden-layer output is sent to two modules: the first is a semantic module and the second is a decoder module built from a gated recurrent unit (GRU) with Bahdanau attention. The semantic module has two linear layers and generates the semantic information. The second module consists of a single-layer attention GRU with 512 hidden units and 512 attention units. The semantic information is also sent to the second module through one of the linear layers. The loss of the whole network consists of two parts: the first is the cross-entropy loss between the predicted result and the ground truth, and the second is the cosine embedding loss between the predicted semantic information and the transcription label's word embedding from the semantically pre-trained deep bidirectional transformer language model (Bert).
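By way of illustration only, a condensed sketch of this recognizer pipeline (assuming PyTorch and torchvision; the crop size, dimensions and module wiring are illustrative, and the Bahdanau attention decoder is only stubbed to keep the sketch short):

import torch
import torch.nn as nn
from torchvision.models import vgg16

class RecognizerSketch(nn.Module):
    def __init__(self, num_classes, sem_dim=768):
        super().__init__()
        self.cnn = vgg16(weights=None).features                     # convolutional feature extractor
        self.encoder = nn.LSTM(512, 256, bidirectional=True, batch_first=True)  # 256 hidden units
        self.semantic = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                                      nn.Linear(512, sem_dim))      # two linear layers
        self.decoder_cell = nn.GRUCell(512 + sem_dim, 512)          # stands in for the attention GRU decoder
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, char_crops):
        # char_crops: (T, 3, 32, 32) character crops of one text group, in reading order
        feats = self.cnn(char_crops).mean(dim=(2, 3))                # (T, 512), one feature per character
        enc, _ = self.encoder(feats.unsqueeze(0))                    # (1, T, 512) hidden-layer output
        semantics = self.semantic(enc.mean(dim=1))                   # (1, sem_dim) predicted semantic vector
        h = enc.new_zeros(1, 512)
        logits = []
        for t in range(enc.size(1)):
            context = enc[:, t]                                      # attention omitted in this sketch
            h = self.decoder_cell(torch.cat([context, semantics], dim=1), h)
            logits.append(self.classifier(h))
        return torch.stack(logits, dim=1), semantics                 # per-character logits and semantics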
In addition, the invention provides a training and recognition method for the scene text localization and recognition system, which mainly comprises the following steps:
step 1: dataset preprocessing
Since real scene text data sets lack character-level annotations, and the invention requires training with data sets that have character-level annotations, we mainly use the SynthText synthetic text data set for training. SynthText consists of 800,000 images containing about 8 million synthetic word instances, each annotated at the word level and the character level. To make the model better suited to the target application scenario, we also use the method provided with SynthText to embed the text of that application scenario into different scene pictures. The method uses deep-learning-based image processing techniques such as depth estimation and image segmentation and can fully automatically generate a large number of realistic scene text images, so that the trained model generalizes to real application scenes.
To make the model better suited to real scenes, we mix the generated data set and the real data set for training. Because both the locator and the recognizer need a combination of generated and real data, the data sets must be partitioned according to the corresponding hyperparameters so that they meet the training requirements. The locator requires four data sets: the synthetic data set used in the first weak supervision stage, the mixed data set used in the second weak supervision stage (synthetic and real data mixed at a 5:1 ratio), a validation set and a test set (real data). The recognizer requires three data sets: a mixed data set (a data set generated from the real data set using the locator's predictions, combined with the synthetic data set at a 1:3 ratio), a validation set and a test set (real data). The synthetic text data set is combined with existing ICDAR (International Conference on Document Analysis and Recognition) data sets to form the text data set required for the second stage of weak supervision training.
The annotation format of the real data set differs somewhat from that of the generated data set, so corresponding tools are written to unify the annotation formats and make training of the whole model easier.
Step 2: training the localizer of the model.
To give the model better generalization, a data set shuffling strategy suited to training this model is designed. The strategy randomly samples data from the data sets according to the configured proportions; when the available data cannot reach the amount required by the weak supervision proportion, additional samples are drawn at random from the real data set to satisfy that proportion.
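By way of illustration only, a minimal sketch of this sampling strategy (assuming plain Python lists of samples and a synthetic set large enough for its share; the 5:1 proportion matches the second weak supervision stage):

import random

def mix_datasets(synth, real, num_samples, synth_parts=5, real_parts=1):
    total_parts = synth_parts + real_parts
    n_synth = num_samples * synth_parts // total_parts
    n_real = num_samples - n_synth
    batch = random.sample(synth, n_synth)
    if n_real <= len(real):
        batch += random.sample(real, n_real)
    else:
        # not enough real data: draw the remainder at random (with replacement)
        batch += [random.choice(real) for _ in range(n_real)]
    random.shuffle(batch)
    return batch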
Training of the locator in the first weak supervision stage uses only the synthetic data set. After training on the synthetic data set for a period of time, the locator acquires a certain ability to identify characters.
Training of the locator in the second weak supervision stage uses mixed training on the synthetic data set and the real data set. The training ratio is 5:1, i.e. about 83% synthetic data and 17% real data. During training, when the model encounters data without character-level annotations, the text region is cropped according to the original annotation and fed to the model on its own for prediction. Whether the prediction is accurate is judged from the ratio of the number of characters in the annotated text to the number of predicted character boxes, and from the ratio of the area covered by all predicted character boxes to the area of the original annotation box. If the prediction is accurate, the prediction result is used as a real label for training; if not, the text region is segmented with a traditional image processing method such as connected-component detection, and the segmentation result is used as the real label for training.
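By way of illustration only, a minimal sketch of this pseudo-label check (the count tolerance and area threshold are assumptions, not values given above):

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def pseudo_label_ok(pred_char_boxes, gt_char_count, gt_text_box,
                    count_tol=0.2, area_thresh=0.5):
    # accept the predicted character boxes for a text region only when the character
    # count and the covered area roughly agree with the original word-level annotation
    if gt_char_count == 0 or not pred_char_boxes:
        return False
    count_ratio = len(pred_char_boxes) / gt_char_count
    area_ratio = sum(box_area(b) for b in pred_char_boxes) / box_area(gt_text_box)
    return abs(count_ratio - 1.0) <= count_tol and area_ratio >= area_thresh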
The locator finally outputs character boxes and connection boxes; a mask map can be generated from them, and the final prediction result is then derived from the mask map. The optimal model parameters are selected according to the locator's results on the validation set.
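By way of illustration only, a minimal sketch of deriving text regions from such a mask map (assuming numpy and scipy; the exact post-processing is not specified here):

import numpy as np
from scipy import ndimage

def boxes_to_regions(char_boxes, link_boxes, height, width):
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in list(char_boxes) + list(link_boxes):
        mask[int(y1):int(y2), int(x1):int(x2)] = 1      # paint character and connection boxes
    labels, _ = ndimage.label(mask)                     # characters joined by connection boxes merge
    slices = ndimage.find_objects(labels)               # one bounding slice per text instance
    return [(s[1].start, s[0].start, s[1].stop, s[0].stop) for s in slices]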
Step 3: identifier for training model
Character-level labels are extracted from the real data set using the trained text locator, combined with the synthetic data set and packaged into the recognition data set required by the recognizer.
The recognition data set is fed into the recognition model for training. The feature extraction network of the Seed model is replaced by the lighter VGG16 feature extraction network, and the spatial transformer network (STN) is removed, which speeds up text recognition. Meanwhile, the semantic module is replaced by a pre-trained deep bidirectional transformer language model (Bert), which improves recognition accuracy.
The optimal model parameters are selected according to the performance of the trained recognizer on the validation set under different hyperparameters.
Step 4: testing the effects of the model
The locator and the recognizer with the optimal parameters are assembled into the final model, and the combined model is tested to evaluate its overall performance. According to the evaluation results, as shown in fig. 2, the model can complete the scene text localization and recognition task efficiently.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention; it is not meant to limit the scope of the invention nor to restrict the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within its scope.

Claims (2)

1. A scene text localization and recognition system, characterized by comprising a locator, a grouping module and a recognizer, wherein the locator and the recognizer are connected through the grouping module;
the locator outputs character boxes, character connection boxes and text boxes, and the character boxes and character connection boxes locate the text; the grouping module crops character pictures according to the character boxes and text boxes, groups them and sends the groups to the recognizer; the recognizer outputs a recognition result for each group, and scene text localization and recognition are finally completed from the localization and recognition results;
the locator adopts the network architecture of a center-point-based object detection network and comprises a deep residual convolutional network, a weighted bidirectional feature pyramid network, a deconvolution module, region-of-interest extraction and a cascade head; the locator indicates the position and size of a character by predicting the center point position, width, height and center offset of the character box, indicates the connection between two characters in a text segment by predicting the center point position, width and height of the character connection box, and indicates the position and size of a text segment by predicting the center point position, width, height and center offset of the text box;
the recognizer adopts the network architecture of a semantics-enhanced encoder-decoder framework for scene text recognition and comprises a convolutional feature extraction network, a bidirectional long short-term memory network, a semantically pre-trained deep bidirectional transformer language model and a gated recurrent unit network with Bahdanau attention;
the input to the locator is an RGB three-channel picture; the picture is normalized and standardized, and the processed picture is fed into a 101-layer deep residual convolutional network, which can be divided into a stem module and four residual convolution modules, deformable convolution being used in the last two residual convolution modules; the feature maps output by the four residual convolution modules are fed into a weighted bidirectional feature pyramid network; the feature map output by the weighted bidirectional feature pyramid network is sent to a deconvolution module consisting of three deconvolution groups, each containing one convolution and one deconvolution, each deconvolution doubling the size of the feature map, and the resulting feature map is sent to three convolution branches to output predictions; the scene text localization and recognition system predicts center-point heatmaps for text, characters and character connections, together with the width, height and center offset of their boxes, and regresses the width, height and center offset of the text, character and character connection boxes; the loss function used for the center-point predictions of text, characters and character connections is the focal loss, and the minimum absolute deviation loss is used for the offsets and the box widths and heights; features are extracted from the corresponding feature map according to the character boxes and are further subdivided into three categories, text box, character box and character connection box, by the cascade head; text, including regular-shaped and irregular-shaped text, is accurately located from the character boxes and character connection boxes, and characters can be grouped according to the text boxes;
the character pictures grouped by the grouping module are sent sequentially to the convolutional feature extraction network to generate character features; the character features of one group are fed together into the encoder, a bidirectional long short-term memory network with 256 hidden units, which produces a hidden-layer output; this hidden-layer output is sent to two modules, the first being a semantic module and the second being a decoder module built from a gated recurrent unit with Bahdanau attention; the semantic module has two linear layers and generates the semantic information, and the semantic information is also sent to the second module through one of the linear layers; the loss of the whole network consists of two parts, the first being the cross-entropy loss between the predicted result and the ground truth, and the second being the cosine embedding loss between the predicted semantic information and the transcription label's word embedding from the semantically pre-trained deep bidirectional transformer language model.
2. The system of claim 1, wherein the second module is a single-layer attention gated recurrent unit with 512 hidden units and 512 attention units.
CN202110666699.5A 2021-06-16 2021-06-16 Scene text positioning and identifying system and training and identifying method thereof Active CN113378815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110666699.5A CN113378815B (en) 2021-06-16 2021-06-16 Scene text positioning and identifying system and training and identifying method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110666699.5A CN113378815B (en) 2021-06-16 2021-06-16 Scene text positioning and identifying system and training and identifying method thereof

Publications (2)

Publication Number Publication Date
CN113378815A CN113378815A (en) 2021-09-10
CN113378815B true CN113378815B (en) 2023-11-24

Family

ID=77574647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110666699.5A Active CN113378815B (en) 2021-06-16 2021-06-16 Scene text positioning and identifying system and training and identifying method thereof

Country Status (1)

Country Link
CN (1) CN113378815B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549874B (en) * 2022-03-02 2024-03-08 北京百度网讯科技有限公司 Training method of multi-target image-text matching model, image-text retrieval method and device
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN117037173B (en) * 2023-09-22 2024-02-27 武汉纺织大学 Two-stage English character detection and recognition method and system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN109255758A (en) * 2018-07-13 2019-01-22 杭州电子科技大学 Image enchancing method based on full 1*1 convolutional neural networks
CN109472298A (en) * 2018-10-19 2019-03-15 天津大学 Depth binary feature pyramid for the detection of small scaled target enhances network
CN110263779A (en) * 2018-03-19 2019-09-20 腾讯科技(深圳)有限公司 Text filed detection method and device, Method for text detection, computer-readable medium
CN110619059A (en) * 2019-08-13 2019-12-27 浙江工业大学 Building marking method based on transfer learning
CN111079627A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Railway wagon brake beam body breaking fault image identification method
CN111340028A (en) * 2020-05-18 2020-06-26 创新奇智(北京)科技有限公司 Text positioning method and device, electronic equipment and storage medium
CN112149642A (en) * 2020-10-28 2020-12-29 腾讯科技(深圳)有限公司 Text image recognition method and device
WO2021050776A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc. Contextual grounding of natural language phrases in images
CN112686219A (en) * 2021-03-11 2021-04-20 北京世纪好未来教育科技有限公司 Handwritten text recognition method and computer storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263779A (en) * 2018-03-19 2019-09-20 腾讯科技(深圳)有限公司 Text filed detection method and device, Method for text detection, computer-readable medium
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN109255758A (en) * 2018-07-13 2019-01-22 杭州电子科技大学 Image enchancing method based on full 1*1 convolutional neural networks
CN109472298A (en) * 2018-10-19 2019-03-15 天津大学 Depth binary feature pyramid for the detection of small scaled target enhances network
CN110619059A (en) * 2019-08-13 2019-12-27 浙江工业大学 Building marking method based on transfer learning
WO2021050776A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc. Contextual grounding of natural language phrases in images
CN111079627A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Railway wagon brake beam body breaking fault image identification method
CN111340028A (en) * 2020-05-18 2020-06-26 创新奇智(北京)科技有限公司 Text positioning method and device, electronic equipment and storage medium
CN112149642A (en) * 2020-10-28 2020-12-29 腾讯科技(深圳)有限公司 Text image recognition method and device
CN112686219A (en) * 2021-03-11 2021-04-20 北京世纪好未来教育科技有限公司 Handwritten text recognition method and computer storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Decoupled Attention Network for Text Recognition;Tianwei Wang等;《arXiv:1912.10205v1》;1-9 *
FTPN: Scene Text Detection With Feature Pyramid Based Text Proposal Network;FAGUI LIU等;《IEEE Access》;第7卷;44219-44228 *
Research on Text Detection and Recognition in Images Based on Deep Learning; Li Qianyu (李倩羽); China Master's Theses Full-text Database, Information Science and Technology Series; I138-1689 *

Also Published As

Publication number Publication date
CN113378815A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
Long et al. Scene text detection and recognition: The deep learning era
Luo et al. Traffic sign recognition using a multi-task convolutional neural network
CN113378815B (en) Scene text positioning and identifying system and training and identifying method thereof
Ye et al. Text detection and recognition in imagery: A survey
CN110502655B (en) Method for generating image natural description sentences embedded with scene character information
CN115424282A (en) Unstructured text table identification method and system
CN114596566B (en) Text recognition method and related device
CN113239753A (en) Improved traffic sign detection and identification method based on YOLOv4
CN117370498B (en) Unified modeling method for 3D open vocabulary detection and closed caption generation
CN113920494A (en) Transformer-based license plate character recognition method
Li et al. Review network for scene text recognition
Huang et al. Attention after attention: Reading text in the wild with cross attention
Wang et al. Robust recognition of Chinese text from cellphone-acquired low-quality identity card images using convolutional recurrent neural network.
Govindaraju et al. Newspaper image understanding
Rani et al. Object Detection in Natural Scene Images Using Thresholding Techniques
Ti Scene Text Detection and Recognition by CRAFT and a Four-Stage Network
CN117235605B (en) Sensitive information classification method and device based on multi-mode attention fusion
Fanjie et al. Sust and rust: two datasets for uyghur scene text recognition
Umatia et al. Text Recognition from Images
Wei et al. Research on the text detection and recognition in natural scene images
CN117557871B (en) Three-dimensional model labeling method, device, equipment and storage medium
CN114241495B (en) Data enhancement method for off-line handwritten text recognition
Xu et al. Multi-modal learning with text merging for textvqa
Gunna Scene Text Recognition for Indian Languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant