CN113378815A

CN113378815A - Model for scene text positioning recognition and training and recognition method thereof

Info

Publication number: CN113378815A
Application number: CN202110666699.5A
Authority: CN
Inventors: 刘凯; 孙乐; 朱均可; 叶堂华
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2021-06-16
Filing date: 2021-06-16
Publication date: 2021-09-10
Anticipated expiration: 2041-06-16
Also published as: CN113378815B

Abstract

The invention discloses a scene text positioning and recognizing model and a training and recognizing method thereof, belonging to the technical field of computers. The system comprises a locator, a grouping module and an identifier, wherein the locator and the identifier are connected through the grouping module; the output of the locator is a character frame, a character connection frame and a text frame, the character frame and the character connection frame locate the position of the text, the grouping module cuts character pictures according to the character frame and the text frame, the character pictures are grouped and sent into the recognizer, the recognizer outputs the recognition results of each group, and finally the positioning and recognition of the scene text are completed according to the positioning and recognition results; the invention can enable the computer to accurately and efficiently complete the task of detecting and identifying the text in the natural scene.

Description

Model for scene text positioning recognition and training and recognition method thereof

Technical Field

The invention relates to a scene text positioning and recognizing model and a training and recognizing method thereof, belonging to the technical field of computers.

Background

Text is one of the most important information sources for human beings, and is the most important information carrier for human beings. After the advent of computers, people have thought to have computers that help us automatically recognize and process text. Text recognition is one of the branches of computer vision research and is also an important component of computer science. In early application scenarios, text recognition was applied to some images containing text that was simple and had a canonical format, such as black and white images, tickets, passports, driver licenses, identification cards, and the like. Reading text from these images is typically accomplished by optical character recognition software. The initial solution is to perform recognition and detection according to the characteristics of the text in the image, and the steps are generally image input, image preprocessing, character characteristic extraction, comparison and recognition, finally, the wrong characters are corrected through manual correction, and the result is output. With the development of deep learning, a large number of deep learning-based solutions have appeared, and the solutions can be applied to simple scenes to achieve high precision.

With the advent of the artificial intelligence age, a more challenging task has emerged in the past decade to read text from images of natural scenes. Character detection and recognition in natural scene pictures are a hot and difficult problem of current research. Text in an image may exist in many different scenarios, such as handwritten text, text in a document, text in a menu, street view text, such as a street name, an advertisement, a registration pad, a restaurant name, a coffee shop name, and so forth. In some cases, the text is explicitly focused on a scanned document or a photograph of a restaurant menu, but it may also be incidental content in the scene image. In these images, text is present, but not the intention of the author of the image. The text in the image of the natural scene is very different from the text in the scanned image. Some of the main differences are irregular lighting, blurring, non-horizontal orientation, perspective distortion, font and size, text distortion, text occlusion, etc.

The text in the natural scene image contains important semantic information, which helps us to analyze and understand the corresponding environment. Related applications of natural scene text recognition include image retrieval, image classification, automatic driving, automatic navigation, human-computer interaction, and the like, which can greatly facilitate our lives. In recent years, models based on deep neural networks have dominated the field of scene text detection and recognition.

However, the following problems still remain:

1) the positioning and identifying efficiency is not high. In order to achieve a good positioning and identifying effect, some scene character positioning and identifying models often adopt complex network structures, and the network structures are often large in calculated amount, so that the model identification and positioning efficiency is low, and the model identification and identification models are difficult to apply in a real scene;

2) the positioning of the text is not precise. The problem of inaccurate positioning of the text in the natural scene can be divided into two aspects, the first aspect is that the text cannot be accurately positioned, and because the characters in the natural scene are complex and changeable, the accuracy of the current scene text positioning model is improved to a certain extent; the second aspect is that the position marking of the text is not accurate, and in most text positioning models at present, the prediction result often adopts a rectangular box to mark the text position, because of the inherent property of the rectangular box, the marking form cannot mark characters with irregular shapes;

3) the character recognition is inaccurate. The recognition effect of characters under natural scenes is affected by factors such as blurring, sundry shielding and light spots, the anti-interference capability of a traditional text recognition model frame is not strong, and the whole text cannot be recognized accurately by the model due to noise such as a small part of stains and shielding. Meanwhile, in the application scene of Chinese character recognition in a natural scene, because of various types of Chinese characters, the accuracy rate of the model recognition is very low.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a scene text positioning and recognizing model and a training and recognizing method thereof, which can enable a computer to accurately and efficiently complete the task of detecting and recognizing texts in natural scenes.

The technical scheme of the invention is as follows:

the invention relates to a scene text positioning and identifying model which comprises a positioner, a grouping module and an identifier, wherein the positioner and the identifier are connected through the grouping module;

the output of the locator is a character frame, a character connection frame and a text frame, the character frame and the character connection frame locate the position of the text, the grouping module cuts character pictures according to the character frame and the text frame, the character pictures are grouped and sent into the recognizer, the recognizer outputs the recognition results of each group, and finally the positioning and recognition of the scene text are completed according to the positioning and recognition results;

the structure of the locator adopts a network architecture of a target detection network CenterNet2 based on a central point, and the locator comprises a depth residual error convolution network, a weighted bidirectional feature pyramid network, a deconvolution module, an interesting region extraction and stacking head; the locator expresses the position and the size of a character by predicting the position, the width and the height of the central point of a character box and the offset of the central point, expresses the connection relation between two characters in a section of text by predicting the position, the width and the height of the central point of a character connection box and the offset of the central point, and expresses the position and the size of the section of text by predicting the position, the width and the height of the central point of the text box and the offset of the central point;

the structure of the recognizer adopts a network architecture of a semantic enhancement coding and decoding box Seed oriented to scene text recognition, and the network architecture comprises a convolution feature extraction network VGG16, a bidirectional long-short term memory network BilSTM, a semantic pre-training deep bidirectional converter language model Bert and a gated cyclic unit network based on Bahdana attention.

Furthermore, the input of the locator is an RGB three-channel picture, the picture is normalized and standardized, then the processed picture is sent into a depth residual convolution network ResNet of 101 layers, the network can be divided into a rhizome module and four residual convolution modules, and a feature map output by the four residual convolution blocks is sent into a weighted bidirectional feature pyramid network BiFPN by using deformable convolution constrained convolution in the last two residual convolution modules; sending the feature graph output by the BiFPN into a deconvolution module, wherein the deconvolution module comprises three deconvolution groups, each group comprises a convolution and a deconvolution, the size of the feature graph is doubled in each deconvolution, and finally the obtained feature graph is sent to three convolution branches to output a prediction result; the model prediction of the scene text positioning recognition is in the form of a central point thermodynamic diagram of text, character and character connection and offsets of a width, a height and a central point of a frame of the text, the character and the character connection, the offsets of the width, the height and the central point of the frame of the character connection are regressed, a Loss function used for predicting the central point of the text, the character and the character connection is a Focal Loss (Focal Loss) function, the Loss is caused by using a minimum absolute value deviation L1 for the offset and the width and the height of the frame, then, features are extracted from a corresponding feature diagram according to the character frame, and the Cascade Head Cascade Head is further subdivided into three types of a text frame, a character frame and a character connection frame; and accurately positioning texts according to the character boxes and the character connection boxes, wherein the texts comprise texts in regular shapes and texts in irregular shapes, and grouping characters according to the text boxes.

Further, the character pictures grouped by the grouping module are sequentially sent to a convolution feature extraction network (VGG16) to generate character features, the generated character features of the same group are sent to an encoder of a bidirectional long and short term memory network (BilSTM) with 256 hiding units together, the bidirectional long and short term memory network can obtain a hidden layer output, the hidden layer output is sent to two modules, the first module is a semantic module, and the second module is a decoder module of a gated round robin unit (GRU) based on the attention of Bahdana; the semantic module is provided with two linear layers and generates semantic information through the semantic module; the semantic information is also sent to a second module through one of the linear layers; the loss of the whole network consists of two parts, the first loss is the cross entropy loss of a predicted result and a real result, and the second loss is the cosine embedding loss of predicted semantic information embedded from a transcription tag word in a semantic pre-trained deep two-way converter language model (Bert).

Further, the second module consists of a single layer attention gated round robin unit (GRU) of 512 hidden units and 512 attention units.

The invention also provides a method for training and recognizing the model for positioning and recognizing the scene text, which mainly comprises the following steps:

step 1: data set preprocessing

Synthesizing a text data set (SynthText) using a data generation method of the synthesized text data set;

combining the synthesized text data set with an existing document analysis and recognition international conference offered (ICDAR) data set to form a required text data set for a second stage of the weakly supervised training;

converting the data set labels into data set labels required by model training;

step 2: positioner for training model

The data set is disordered according to preset weak supervision training parameters;

performing a weakly supervised first phase of training of the localizer, which is performed using only the synthetic dataset;

performing weak supervision second stage training of the localizer, wherein the training proportion is 5:1, namely the synthesized data set accounts for 83% and the real data set accounts for 17%;

selecting the optimal model parameters according to the test effect of the trained positioner on the verification set under different hyper-parameters;

and step 3: recognizer for training model

Extracting the character positions in the real data set by using the trained text positioner, combining the extracted character positions with the synthesized data set, and packaging the combined data set into an identification data set required by the identifier;

sending the recognition data set into a recognition model for training;

selecting an optimal model parameter according to the test effect of the trained recognizer on the verification set under different hyper-parameters;

and 4, step 4: testing the recognition effect of a model

The positioner and recognizer are assembled and the entire model is tested.

Further, the synthetic text data set SynthText is a set of text data set consisting of 80 ten thousand images, approximately 800 thousand synthetic word instances, each annotated with its text string, characters, and character-level bounding boxes.

Further, the locator portion requires a four-part data set: the method comprises the following steps of synthesizing a data set required by a weak supervision first stage, a mixed data set required by a weak supervision second stage, a verification set and a test set, wherein the mixed data set is a synthesis of the synthesized data set and a real data set, and the ratio of the synthesized data set to the real data set is 1: 5; the test set is a real data set.

Further, the recognizer portion requires three parts of the data set: a mixed dataset, a validation set and a test set, wherein the mixed dataset is a composite of a dataset generated from a real dataset by using a locator prediction result and a composite dataset, and the ratio of the mixed dataset to the composite dataset is 1: 3; the test set is a real data set.

Advantageous effects

1. The invention can accurately detect the bending deformation text:

the method is suitable for positioning the text influenced by the noises such as illumination, visual angle, shading, dirt and the like in a natural scene. The method mainly utilizes a data set which is a synthetic text (SynthText) data set, a data generation method used in the synthetic text data set (SynthText) is a method for generating a text synthetic image, the generated text can be naturally fused in the existing natural scene, and the method well embeds a text box into a background image by using the existing deep learning and segmentation technology. Therefore, the invention can achieve higher positioning accuracy by only utilizing the synthesized data set training under the condition of not utilizing the real data set training. The invention adopts a target detector (CenterNet) based on a central point as a first-stage target probability detector, and adopts a Cascade Head (Cascade-Head) to further classify and regress in a second stage, so that the locator part can accurately locate the text;

2. the invention can accurately identify the text in the natural scene:

the recognizer part of the invention utilizes a locator to accurately extract characters and accurately recognizes characters through a recognition model based on a semantic enhanced coding and decoding framework (Seed) oriented to scene text recognition. A semantic enhancement coding and decoding framework (Seed) oriented to scene text recognition is adopted, a feature extraction network (VGG16) is used for improving the recognition speed, and a pre-trained depth bidirectional transformer language model (Bert) is used for improving the recognition accuracy.

Drawings

FIG. 1 is an overall structure diagram of a scene text positioning recognition model according to the present invention;

FIG. 2 is a diagram of the text localization recognition effect of FIG. 1;

FIG. 3 is a block diagram of the fixture of FIG. 1;

FIG. 4 is an example of a character box of FIG. 1;

FIG. 5 is an example of the character connection box of FIG. 1;

FIG. 6 is an example of a text box in FIG. 1;

fig. 7 is a block diagram of the identifier of fig. 1.

Detailed Description

The invention relates to a scene text positioning and recognizing model, which comprises a positioner, a grouping module and a recognizer, wherein the positioner and the recognizer are connected through the grouping module, as shown in figures 1 and 2, a character box, a character connecting box and a text box are output by the positioner. The character boxes and character connection boxes can locate the position of the text. The grouping module cuts the character pictures according to the character frame and the text frame, groups the character pictures and sends the character pictures into the recognizer, and the recognizer outputs the recognition results of each group. And finally, completing the positioning and recognition of the scene text according to the positioning and recognition result.

In the aspect of a locator, the invention combines some mature one-stage target detectors and detection heads to be applied to scene text positioning by utilizing a probability interpretation-based two-stage target detection theory for the first time, thereby improving the positioning accuracy. In addition, the positioner is trained in a weak supervision mode combining the traditional deep learning method, and the position of the character can be effectively extracted.

In the aspect of an identifier, a semantic enhanced coding and decoding framework (Seed) text identification model for scene text identification is simplified and improved, and the identifier extracts character features according to character positions output by a locator and outputs results. Compared with the characteristic recognition mode of the input text line, the recognition mode of the input character characteristic has higher text recognition speed and higher accuracy.

Specifically, the structure of the locator employs a network architecture of a central point-based object detection network (centret 2), which includes a depth residual convolution network, a weighted bidirectional feature pyramid network, a deconvolution module, a region of interest extraction, and a stacking header, as shown in fig. 3.

The input of the locator is an RGB three-channel picture, the picture is normalized and standardized, then the processed picture is fed into a depth residual convolution network (ResNet) with 101 layers, the network can be divided into a root module and four residual convolution modules, and deformable convolution (deformable convolution) is used in the last two residual convolution modules. We use four residual convolution block output profiles, which are fed into a weighted bidirectional feature pyramid network (bipfn). And (2) sending the feature map output by the weighted bidirectional feature pyramid network (BiFPN) into a deconvolution module, wherein the module comprises three deconvolution groups, each group comprises a convolution and a deconvolution, the size of the feature map is doubled by each deconvolution, and finally the obtained feature map is sent to three convolution branches to output a prediction result. The prediction result is in the form of a text box, a character box, wherein the text box predicts a whole text, the character box predicts all characters, and the character box predicts a connection between two characters even if two adjacent and related characters are in one character box. The model predictions are in the form of text, characters and a thermodynamic diagram of the center point of the character connection and the offsets of their box width, height and center point. We classify the pixel points of the image into two categories: (1) and the frame center point (2) is not the frame center point, and the width and the height of the text, the character and the character connected frame and the offset of the center point are regressed. The Loss function used for the prediction of the center point of text, characters, and character connections is the Focal Loss (Focal local) function, and the minimum absolute deviation (L1) Loss is used for the offset and the width and height of the box. Then extracting features from the corresponding feature map according to the character frame, and further subdividing the feature map into three types of text frames, character frames and character connection frames by using a Cascade Head (Cascade Head). The text can be accurately positioned according to the character boxes and the character connection boxes, and the text comprises regular-shaped text and irregular-shaped text. As shown in FIG. 2, the model detects that in addition to the COLCHESTER text, the shape is irregular, and POST and OFFICE are regular (rectangles can indicate the location of the text). The characters can be grouped according to the text box.

The character box is used to indicate the position and size of the character, and as shown in fig. 4, the rectangular box indicates the position and size of P in the word POST, and this rectangular box is the character box. The locator of the present invention represents the position and size of the character by predicting the center point position, width and height, and center point offset of the character box. For example, if the locator predicts that the center point is (x, y), the width and the height are (w, h), and the offset is (x1, y1), the center point of the character frame corrected by the offset is (x + x1, y + y1), and the width and the height are (w, h).

The character connection box is used to represent the connection relationship between two characters, as shown in fig. 5, the rectangular box represents the connection relationship between P and 0 in the word POST, that is, the two characters P and 0 are two characters connected in a text, similarly, in the POST word, P and 0, O and S, S and T are all connected, and the rectangular box in the figure is the character connection box. The locator part of the invention expresses the connection relation between two characters in a section of text by predicting the center point position, width and height of the character connection box and the center point offset. For example, if the locator predicts that the center point is (x, y), the width and the height are (w, h), and the offset is (x1, y1), the center point of the character box after being finally corrected by the offset is (x + x1, y + y1), and the width and the height are (w, h).

The text box is used to indicate the position and size of a piece of text, and as shown in fig. 6, the rectangular box indicates the position and size of the piece of text of POST, and the rectangular box in the figure is the text box. The locator part of the invention expresses the position and the size of a section of text by predicting the position, the width and the height of the central point of the text box and the offset of the central point. For example, the center point predicted by the locator is (x, y), the width and the height are (w, h), and the offset is (x1, y1), then the center point of the text box finally corrected by the offset is (x + x1, y + y1), and the width and the height are (w, h).

The structure of the recognizer adopts a network architecture of a semantic enhanced coding and decoding block (Seed) oriented to scene text recognition, and the network architecture comprises a convolution feature extraction network (VGG16), a bidirectional long-short term memory network (BilSTM), a semantic pre-training deep bidirectional converter language model (Bert) and a gated cyclic unit network based on Bahdana u attention, as shown in FIG. 7.

Because the output result of the positioner is a text box, a character box and a character connection box, the grouping module cuts out character pictures according to the character box, groups the individual character pictures according to the text box, and normalizes and standardizes the character pictures in the same group. The character pictures grouped by the grouping module are sequentially sent to a convolution feature extraction network (VGG16) to generate character features, the generated character features of the same group are sent to a bidirectional long and short term memory network (BilSTM) encoder with 256 hiding units, a hidden layer output is obtained through the bidirectional long and short term memory network and is sent to two modules, the first module is a semantic module, and the second module is a decoder module of a gated cycle unit (GRU) based on the attention of Bahdana. The semantic module has two linear layers, and semantic information is generated through the semantic module. The second module consists of a single-layer attention gated round robin unit (GRU) of 512 hidden units and 512 attention units. Meanwhile, semantic information is sent into a second module through one of the linear layers, the loss of the whole network consists of two parts, the first loss is the cross entropy loss of a predicted result and a real result, and the second loss is the cosine embedding loss of the predicted semantic information embedded from a transcription tag word in a semantic pre-training depth bidirectional converter language model (Bert).

The invention relates to a method for training and recognizing a scene text positioning and recognizing model, which mainly comprises the following steps of:

step 1: data set preprocessing

Because the real scene text data set lacks character-level labels, and the invention requires the data set with character-level labels for training, we mainly adopt a synthetic text data set (synthttext) data set for training. A synthetic text data set (SynthText) is a set of text data consisting of 80 ten thousand images, with approximately 800 ten thousand synthetic word instances. Each synthetic word instance is labeled at both the word level and the character level. In order to make the model better applied to the corresponding application scene, we also utilize the method provided in the synthetic text data set (SynthText), and embed the text used by the corresponding application scene into different scene pictures. The method utilizes image processing methods based on deep learning, such as depth estimation, image segmentation and the like, and can fully automatically generate a large number of vivid scene text images, so that the trained model can be popularized to a real application scene.

In order to make our model better applicable to real scenes, we perform hybrid training with the generated dataset and the real dataset. Because training both the localizer and the localizer model of us needs to synthesize a data set and a real data set, the data set needs to be divided according to corresponding hyper-parameters, so that the data set can meet the training requirements of us. The locator portion requires a four-part data set: the required synthetic dataset for the first stage of weak supervision, the mixed dataset required for the second stage of weak supervision (synthetic dataset to real dataset, their ratio is 1: 5), the validation set and the test set (real dataset). The recognizer portion requires three parts of the data set: the mixed dataset (the composite of the dataset generated from the real dataset using the locator predictors and the composite dataset in a ratio of 1: 3), the validation set, and the test set (the real dataset). The synthesized text data set is combined with existing document analysis and recognition international conference offered (ICDAR) data sets to form the desired text data set for the second stage of the weakly supervised training.

The label of the real data set and the label format of the generated data set have certain difference, and corresponding tools are written to unify the label formats, so that the training of the whole model is facilitated.

Step 2: and training a positioner of the model.

In order to enable the model to have better generalization, a data set scrambling strategy which accords with the training of the model is designed, data are randomly sampled from the data set according to the preset data set proportion, and when the data quantity does not reach the data quantity required by the weak supervision proportion, the data are randomly extracted from the real data set so as to meet the proportion requirement of the weak supervision.

Training of the localizer is performed in a weakly supervised first phase, which is trained using only the synthetic dataset. After a period of training on the composite data set, the locator will have some ability to recognize the character.

Training of the localizer is performed in a weakly supervised second phase, which is a hybrid training using the synthetic dataset and the real dataset. The training ratio was 5:1, i.e. 83% for the synthetic dataset and 17% for the real dataset. In the training process, when the model meets data labeled by a non-character set, the text region is cut according to the original label and is independently sent to the model for recognition, and whether the prediction is accurate or not is judged according to the ratio of the number of characters in the text labeled by the data set to the number of predicted character boxes and the area ratio of the occupied area of all the predicted character boxes to the area of the original label box. And if the prediction is accurate, training by using the prediction result as a real label. If the prediction is not accurate, a text region is segmented by adopting a traditional image processing method such as connected domain detection and the like, and the segmentation result is used as a real label for training.

The final output of the locator is a character frame and a connection frame, a mask image can be generated according to the character frame and the connection frame, and the final prediction result is generated according to the mask image. And selecting the optimal model parameters according to the test result of the positioner on the verification set.

And step 3: recognizer for training model

And extracting the character labels in the real data set by using the trained text positioner, combining the extracted character labels with the synthesized data set, and packaging the combined data set into the identification data set required by the recognizer.

And sending the recognition data set into a recognition model for training. The feature extraction network of the seed model is replaced by a more lightweight feature extraction network (VGG16), and a Space Transformation Network (STN) is eliminated, so that the speed of model text recognition is increased. Meanwhile, the semantic module is changed into a pre-trained deep bidirectional converter language model (Bert), so that the accuracy of model identification is improved.

And selecting the optimal model parameters according to the test effect of the trained recognizer on the verification set under different hyper-parameters.

And 4, step 4: testing the effects of the model

The locator and the recognizer under the optimal parameters are assembled to form a final model, the combined model is tested, and the comprehensive performance of the whole model is evaluated. According to the evaluation result, the model can complete the task of scene character positioning and recognition efficiently at this time as shown in fig. 2.

The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. The model for positioning and identifying the scene text is characterized by comprising a positioner, a grouping module and an identifier, wherein the positioner and the identifier are connected through the grouping module;

the locator outputs a character frame, a character connection frame and a text frame, the character frame and the character connection frame locate the position of the text, the grouping module cuts character pictures according to the character frame and the text frame, the character pictures are grouped and sent into the recognizer, the recognizer outputs recognition results of each group, and finally the scene text is located and recognized according to the locating and recognition results;

the locator adopts a network architecture of a target detection network based on a central point, and comprises a depth residual error convolution network, a weighted bidirectional characteristic pyramid network, a deconvolution module, an interesting region extraction and stacking head; the locator expresses the position and the size of a character by predicting the position, the width and the height of the central point of a character box and the offset of the central point, expresses the connection relation between two characters in a section of text by predicting the position, the width and the height of the central point of a character connection box and the offset of the central point, and expresses the position and the size of the section of text by predicting the position, the width and the height of the central point of the text box and the offset of the central point;

the structure of the recognizer adopts a network architecture of a semantic enhanced coding and decoding box facing scene text recognition, and the network architecture comprises a convolution feature extraction network, a bidirectional long-short term memory network, a semantic pre-training deep bidirectional converter language model and a gated cyclic unit network based on Bahdana attention.

2. The model of claim 1, wherein the input to the locator is an RGB three-channel picture, the picture is normalized and normalized, and then the processed picture is fed into a 101-layer depth residual convolution network, which is divided into a root module and four residual convolution modules, and the feature map output from the four residual convolution blocks is fed into a weighted bi-directional feature pyramid network using deformable convolution among the last two residual convolution modules; then sending the feature graph output by the weighted bidirectional feature pyramid network into a deconvolution module, wherein the deconvolution module comprises three deconvolution groups, each group comprises a convolution and a deconvolution, the feature graph size is doubled by each deconvolution, and finally the obtained feature graph is sent to three convolution branches to output prediction results; the model prediction of the scene text positioning recognition is in the form of a central point thermodynamic diagram connected with texts, characters and the offsets of the width, height and central point of a frame of the text, characters and characters, the offsets of the width, height and central point of the frame connected with the texts, characters and characters are regressed, a loss function used for predicting the central point connected with the texts, characters and characters is a focus loss function, the minimum absolute value deviation loss is used for the offsets and the width and height of the frame, then the characteristics are extracted from a corresponding characteristic diagram according to the character frame, and the model prediction is further subdivided into three types of text frames, character frames and character connection frames by utilizing the layering head; and accurately positioning texts according to the character boxes and the character connection boxes, wherein the texts comprise texts in regular shapes and texts in irregular shapes, and grouping characters according to the text boxes.

3. The model of claim 2, wherein the character pictures grouped by the grouping module are sequentially sent to a convolutional feature extraction network to generate character features, and the generated character features of the same group are sent to an encoder of a bidirectional long and short term memory network of 256 hidden units, wherein the bidirectional long and short term memory network obtains a hidden layer output which is sent to two modules, the first module is a semantic module, and the second module is a decoder module of a gated cyclic unit based on the attention of bahdana; the semantic module is provided with two linear layers and generates semantic information through the semantic module; the semantic information is also sent to a second module through one of the linear layers; the loss of the whole network consists of two parts, wherein the first loss is the cross entropy loss of a predicted result and a real result, and the second loss is the cosine embedding loss of predicted semantic information embedded from a transcription tag word in a semantic pre-trained deep bidirectional converter language model.

4. The model of claim 3, wherein said second module consists of a single-layer attention gated cyclic unit of 512 hidden units and 512 attention units.

5. A training and recognition method for a scene text positioning and recognition model mainly comprises the following steps:

step 1: data set preprocessing

Synthesizing the text data set by using a data generation method for synthesizing the text data set;

combining the synthesized text data set with the data set provided by the existing document analysis and international conference recognition to form a text data set required by the second stage of the weak supervision training;

converting the data set labels into data set labels required by model training;

step 2: positioner for training model

and step 3: recognizer for training model

sending the recognition data set into a recognition model for training;

and 4, step 4: testing the recognition effect of a model

The positioner and recognizer are assembled and the entire model is tested.

6. The method of claim 5, wherein the synthetic text data set is a set of text data sets consisting of 80 ten thousand images, approximately 800 thousand synthetic word instances, each text instance annotated with its text string, character, and character-level bounding box.

7. The method of claim 6, wherein the locator portion requires a four-part data set: the method comprises the following steps of synthesizing a data set required by a weak supervision first stage, and a mixed data set, a verification set and a test set required by a weak supervision second stage, wherein the mixed data set is a synthesis of a synthesized data set and a real data set, and the ratio of the synthesized data set to the real data set is 1: 5; the test set is a real data set.

8. The method of claim 7, wherein the identifier portion requires three parts of the data set: a mixed data set, a verification set and a test set, wherein the mixed data set is a synthesis of a data set and a synthesis data set generated from a real data set by utilizing a locator prediction result, and the proportion of the mixed data set to the synthesis data set is 1: 3; the test set is a real data set.