WO2023005253A1

WO2023005253A1 - Method, apparatus and system for training text recognition model framework

Info

Publication number: WO2023005253A1
Application number: PCT/CN2022/085149
Authority: WO
Inventors: 章成全; 吕鹏原; 李煜林; 庾悦晨; 姚锟; 韩钧宇; 刘经拓; 丁二锐; 吴甜; 王海峰
Original assignee: 北京百度网讯科技有限公司
Priority date: 2021-07-28
Filing date: 2022-04-02
Publication date: 2023-02-02
Also published as: CN113591864B; CN113591864A; KR20230030005A

Abstract

The present disclosure relates to the technical field of artificial intelligence, and in particular to the technical fields of computer vision and deep learning, and can be applied in smart city and smart finance scenarios. Provided are a method, apparatus and system for training a text recognition model framework. The method comprises: performing feature processing on a sample image on the basis of a preset text detection model, so as to obtain at least two types of feature information related to text information in the sample image; performing fusion processing on the at least two types of feature information of the sample image on the basis of a preset feature fusion model, so as to obtain a fused feature of the sample image; and inputting the fused feature into the feature fusion model, and respectively adjusting parameters of the text detection model and the feature fusion model on the basis of the fused feature, so as to obtain a text recognition model framework. The text detection model and the feature fusion model in the text recognition model framework have a relatively high relevance, so as to realize the integrity and comprehensiveness of a training process, thereby improving the accuracy and reliability of the text recognition model framework.

Description

Training method, device and system for text recognition model framework

This disclosure claims priority to Chinese patent application No. CN202110858410.X, filed with the China Patent Office on July 28, 2021 and entitled "Training method, device and system for text recognition model framework", the entire content of which is incorporated into this application by reference.

Technical field

The present disclosure relates to the field of artificial intelligence technology, specifically the field of computer vision and deep learning technology, and in particular to a training method, device and system for a text recognition model framework, which can be applied to smart cities and smart financial scenarios.

Background art

With the development of artificial intelligence technology, the recognition of text information in images has evolved from manual recognition to automatic recognition. For example, a text recognition model framework (also called a structured analysis framework model for auxiliary training of text recognition models) is pre-trained, and on the basis of this framework a text recognition model is trained and generated for recognizing text information in an image to be recognized.

In the prior art, the text recognition model framework is usually trained based on a text detection model and a feature fusion model, where the text detection model and the feature fusion model are two independent models, and the feature fusion model is trained on the offline recognition results of the text detection model.

However, the text detection model and the feature fusion model are independent of each other during the training process, which may lead to a technical problem of low accuracy of the trained text recognition model framework.

Summary of the invention

The present disclosure provides a training method and device for a text recognition model framework for improving the accuracy of the text recognition model framework.

According to a first aspect of the present disclosure, a method for training a text recognition model framework is provided, the method including: performing feature processing on a sample image based on a preset text detection model to obtain at least two types of feature information related to the text information in the sample image;

performing fusion processing on at least two types of feature information of the sample image based on a preset feature fusion model to obtain fusion features of the sample image;

inputting the fusion features into the feature fusion model, and adjusting the parameters of the text detection model and the feature fusion model respectively based on the fusion features, to obtain a text recognition model framework, wherein the text recognition model framework includes an adjusted text detection model and an adjusted feature fusion model.

According to a second aspect of the present disclosure, a text recognition method is provided, including:

obtaining an image to be recognized;

inputting the image to be recognized into a pre-trained text recognition model to obtain text information in the image to be recognized, wherein the text recognition model is generated by training on images to be trained based on a pre-trained text recognition model framework, the text recognition model framework is obtained by the training method described in the first aspect, and the images to be trained include text information.

According to a third aspect of the present disclosure, a training device for a text recognition model framework is provided, the device comprising:

A processing unit, configured to perform feature processing on the sample image based on a preset text detection model to obtain at least two types of feature information related to the text information in the sample image;

a fusion unit, configured to perform fusion processing on at least two types of feature information of the sample image based on a preset feature fusion model, to obtain fusion features of the sample image;

A training unit, configured to input the fusion features into the feature fusion model and adjust the parameters of the text detection model and the feature fusion model respectively based on the fusion features, to obtain a text recognition model framework, wherein the text recognition model framework includes an adjusted text detection model and an adjusted feature fusion model.

According to a fourth aspect of the present disclosure, a text recognition device is provided, including:

an acquisition unit, configured to acquire an image to be identified;

A recognition unit, configured to input the image to be recognized into a pre-trained text recognition model to obtain text information in the image to be recognized, wherein the text recognition model is generated by training on images to be trained based on a pre-trained text recognition model framework, the text recognition model framework is obtained by the training method of the first aspect, and the images to be trained include text information.

According to a fifth aspect of the present disclosure, there is provided an electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method described in the first aspect, or to enable the at least one processor to execute the method described in the second aspect.

According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method described in the first aspect, or the computer instructions are used to cause the computer to execute the method described in the second aspect.

According to a seventh aspect of the present disclosure, there is provided a computer program product, the computer program product comprising a computer program stored in a readable storage medium, wherein at least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program to make the electronic device execute the method described in the first aspect, or the at least one processor executes the computer program to make the electronic device execute the method described in the second aspect.

According to an eighth aspect of the present disclosure, a training system for a text recognition model framework is provided, the system comprising:

A text detection model, configured to perform feature processing on the sample image to obtain at least two types of feature information related to the text information in the sample image;

A feature fusion model, configured to perform fusion processing on at least two types of feature information of the sample image to obtain fusion features of the sample image;

The feature fusion model is further configured to adjust the parameters of the text detection model and the feature fusion model respectively to obtain a text recognition model framework, wherein the text recognition model framework includes the adjusted text detection model and the adjusted feature fusion model.

According to a ninth aspect, an embodiment of the present application provides a computer program, including program code, and when a computer runs the computer program, the program code executes the method described in the first aspect or the second aspect above.

It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood through the following description.

Description of drawings

The accompanying drawings are used to better understand the present solution, and do not constitute a limitation to the present disclosure. In the drawings:

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a scene of a training method of a text recognition model framework according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;

FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;

FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;

FIG. 7 is a schematic diagram according to a sixth embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device used to implement the training method of the text recognition model framework and the text recognition method of the embodiments of the present disclosure;

FIG. 9 is a schematic diagram according to a seventh embodiment of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Text recognition technology refers to the recognition of text information in images, and text recognition technology is widely used in various fields, such as education, finance, medical care, transportation, and insurance.

For example, when the text recognition technology is applied in the medical field, the text information in the medical record image can be recognized based on the text recognition technology. As another example, when the text recognition technology is applied in the insurance field, the text information in the insurance policy image can be recognized based on the text recognition technology, which will not be listed here.

With the development of deep learning technology in artificial intelligence technology, deep learning technology can be combined with other technologies. For example, deep learning technology can be applied to text recognition technology, thereby improving the accuracy and reliability of text information recognition.

For example, a text recognition model for recognizing text information can be trained based on deep learning technology. The training of the text recognition model usually needs to be based on the text recognition model framework, that is, generally speaking, the text recognition model framework is first trained, and then the text recognition model is trained on the basis of the text recognition model framework.

In the related art, the text recognition model framework is usually obtained by training two independent models, namely the text detection model and the feature fusion model, and when the text recognition model framework is trained, the feature fusion model is trained on the offline recognition results of the text detection model.

Specifically, the text detection model can be an Optical Character Recognition (OCR) model, and the feature fusion model can be a Transformer model. The Transformer model is trained based on the offline recognition results of the OCR model to obtain a text recognition model framework.

However, the OCR model and the Transformer model are independent of each other during the training process, which may lead to the technical problem of low accuracy of the trained text recognition model framework.

To avoid the above technical problem, the inventors of the present disclosure arrived at the inventive concept of the present disclosure through creative work: fusion features are obtained based on the text detection model and the feature fusion model, and, based on the fusion features, the text detection model and the feature fusion model are trained as a whole to obtain the text recognition model framework.

Based on the above inventive concept, the present disclosure provides a training method, device and system for a text recognition model framework, which belong to the fields of computer vision and deep learning within artificial intelligence technology and can be applied to smart city and smart finance scenarios, so as to improve the accuracy of the text recognition model framework.

Please refer to FIG. 1 . FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure.

As shown in Figure 1, the training method of the text recognition model framework provided by the embodiment of the present disclosure includes:

S101: Perform feature processing on the sample image based on a preset text detection model to obtain at least two types of feature information related to text information in the sample image.

Exemplarily, the execution subject of this embodiment may be a training device for the text recognition model framework (hereinafter referred to as the training device). The training device may be a server (such as a local server or a cloud server), a terminal device, a processor, a chip, or the like, which is not limited in this embodiment.

The sample image includes text information. For example, for the medical field, the sample image may be an image of a medical record, and the sample image includes text information such as patient identity and case information. As another example, for the insurance field, the sample image may be an image of an insurance policy, and the sample image includes text information such as the identity of the insurer and text information of the insurance content.

It should be understood that the number of sample images used for training the text recognition model framework can be set by the training device based on requirements, historical records, and experiments, which is not limited in this embodiment.

The text detection model is a model that can detect features related to text information in sample images. For example, for the medical field, the text detection model can detect the text information of the patient identity in the image of the medical record.

Specifically, the text detection model may be an optical character recognition model.

In this embodiment, the feature information characterizes features related to the text information in the sample image. The at least two types of feature information may include information related to text content, information related to the visual appearance of the text, the spatial relationships among the items of text information, and so on, which are not listed one by one here.

S102: Perform fusion processing on at least two types of feature information of the sample image based on a preset feature fusion model to obtain fusion features of the sample image.

The feature fusion model refers to a model that can perform fusion processing on various types of feature information. For example, the feature fusion model may be a Transformer model.

Fusion processing may be splicing multiple types of feature information, combining multiple types of feature information, or connecting multiple types of feature information. For details, reference may be made to the related art, which will not be repeated here.
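As an illustration of the splicing variant of fusion processing, the following is a minimal sketch that joins two feature vectors end to end. The function name `fuse_by_concatenation` and the example values are assumptions made for illustration only; the disclosure does not prescribe a particular fusion operation, and a Transformer-based fusion model would perform considerably more than this.

```python
from typing import List


def fuse_by_concatenation(text_feature: List[float],
                          visual_feature: List[float]) -> List[float]:
    """Splice two feature vectors end to end into one fused vector.

    Hypothetical helper: it only illustrates "splicing" as one possible
    form of fusion processing.
    """
    return list(text_feature) + list(visual_feature)


# Illustrative values only: a 2-dimensional text feature and a
# 3-dimensional visual feature of the same sample image.
fused = fuse_by_concatenation([0.1, 0.2], [0.7, 0.8, 0.9])
print(fused)
```

The fused vector keeps every component of both inputs, so a downstream model can still attend to text-related and vision-related information separately.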

S103: Input the fusion features into the feature fusion model, and adjust the parameters of the text detection model and the feature fusion model respectively based on the fusion features, to obtain a text recognition model framework.

Wherein, the text recognition model framework includes an adjusted text detection model and an adjusted feature fusion model.

In this embodiment, the fusion features are input into the feature fusion model, the parameters of the text detection model can be adjusted based on the fusion features, and the parameters of the feature fusion model can be adjusted to form a text recognition model framework.

It can be understood that training the text recognition model framework is an iterative process, that is, a process of repeatedly adjusting the parameters of the text detection model and the parameters of the feature fusion model. When the number of iterations reaches a preset threshold, or the value of the loss function during iteration falls below a preset loss threshold, the training has met the requirement, and the text recognition model framework is obtained.
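The iterative adjustment and its two stopping conditions can be sketched with a toy example. Here each model is reduced to a single scalar weight (`det_w` for the text detection model, `fus_w` for the feature fusion model), and both are updated from one shared loss by plain gradient descent. The scalar models, the squared-error loss, the learning rate, and the threshold values are all assumptions chosen for illustration; the disclosure does not specify them.

```python
def train_jointly(x, target, lr=0.1, max_iterations=100, loss_threshold=1e-4):
    """Jointly adjust two toy 'models' until a stop condition is met."""
    det_w = 1.0  # stand-in parameter of the text detection model
    fus_w = 1.0  # stand-in parameter of the feature fusion model
    loss = float("inf")
    iterations = 0
    while iterations < max_iterations:      # stop condition 1: iteration count
        feature = det_w * x                 # detection model output
        fused = fus_w * feature             # fusion model output
        error = fused - target
        loss = error ** 2
        if loss < loss_threshold:           # stop condition 2: loss threshold
            break
        # Gradients of the shared loss with respect to both parameters.
        grad_fus = 2.0 * error * feature
        grad_det = 2.0 * error * fus_w * x
        det_w -= lr * grad_det
        fus_w -= lr * grad_fus
        iterations += 1
    return det_w, fus_w, loss, iterations


det_w, fus_w, loss, iterations = train_jointly(x=1.0, target=2.0)
print(det_w, fus_w, loss, iterations)
```

Because the loss is computed on the output of the second model and its gradient flows back into the first, neither parameter is tuned in isolation, which is the point of training the two models as a whole.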

Based on the above analysis, this embodiment of the present disclosure provides a training method for a text recognition model framework. The method includes: performing feature processing on a sample image based on a preset text detection model to obtain at least two types of feature information related to the text information in the sample image; performing fusion processing on the at least two types of feature information of the sample image based on a preset feature fusion model to obtain fusion features of the sample image; and inputting the fusion features into the feature fusion model and adjusting the parameters of the text detection model and the feature fusion model respectively based on the fusion features, to obtain a text recognition model framework that includes the adjusted text detection model and the adjusted feature fusion model. By introducing the technical feature of adjusting the parameters of both the text detection model and the feature fusion model based on the fusion features, this embodiment makes the text detection model and the feature fusion model in the text recognition model framework highly correlated, so that the training process is integral and comprehensive. This avoids the drawback of the related art, in which the text detection model and the feature fusion model are independent of each other, the framework is trained without consideration of the whole, and the accuracy of the text recognition model framework is therefore low. The accuracy and reliability of the text recognition model framework are thereby improved.

Please refer to FIG. 2 , which is a schematic diagram according to a second embodiment of the present disclosure.

As shown in Figure 2, the training method of the text recognition model framework provided by the embodiment of the present disclosure includes:

S201: Determine position information of a text line in a sample image based on a text detection model, and determine at least two types of feature information according to the position information.

Wherein, regarding the same features of this embodiment and the first embodiment, this embodiment will not repeat them.

As analyzed above, the training method of the text recognition model framework of this embodiment can be applied to different fields, such as the insurance field and the medical field. This embodiment is described below by taking its application to the insurance field as an example.

As shown in FIG. 3 , the sample image is an insurance policy image, and the insurance policy image includes text information, such as “name: XXX”, “insurance type: XXXXXX”, and “insurance period: XXXX” as shown in FIG. 3 .

In some embodiments, the sample image may be transmitted to the training device by scanning, and the text detection model in the training device determines the position information of the text line in the sample image.

In some other embodiments, as shown in FIG. 3, the training device can also be connected to an external device (such as a storage device) and receive the sample image transmitted by the external device, so that the text detection model in the training device determines the position information of the text line in the sample image.

Wherein, the text line refers to the line where the text information is located. The position information of the text line refers to information related to the position of the line where the text information is located, and specifically may be the line where the text information is located and the coordinates in the sample image.

For example, when a text detection model recognizes a sample image, it may select a text line in the sample image based on a preset rectangular frame, and determine the coordinates of the rectangular frame in the sample image.

In this embodiment, the position information of the text line in the sample image is determined first, and the at least two types of feature information are then determined from the sample image based on that position information. Because the feature information is located with relatively high accuracy, the accuracy and reliability of the at least two types of feature information are improved.

In some embodiments, determining at least two kinds of characteristic information according to the location information includes: performing a cropping operation on the sample image according to the location information to obtain a text region, and obtaining at least two kinds of characteristic information from the text region.

For example, in combination with the above embodiment, after the position information is determined, the area selected by the rectangular frame can be cropped from the sample image based on the position information. This area is the text area, and at least two types of feature information can be obtained by recognizing the text information in the text area.

In this embodiment, the text area is cut out from the sample image based on the position information, which can make the text area include almost all the text information, avoid the omission of the text information, and also make the cutting operation have higher accuracy, so that the text The region has relatively high accuracy and reliability, thereby making the at least two types of feature information determined based on the text region have relatively high comprehensiveness and reliability.
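The cropping operation can be sketched as follows, assuming the image is a list of pixel rows and the position information is a rectangular frame given as `(left, top, right, bottom)` with exclusive right and bottom edges. The function name `crop_text_region`, the coordinate convention, and the toy image are assumptions for illustration; the disclosure only states that the area selected by the rectangular frame is cut out of the sample image.

```python
def crop_text_region(image, box):
    """Return the sub-image selected by a rectangular frame.

    image: list of rows, each row a list of pixel values.
    box:   (left, top, right, bottom); right and bottom are exclusive.
    """
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("rectangular frame lies outside the image")
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x5 "image" whose pixel value encodes its row and column.
image = [[10 * r + c for c in range(5)] for r in range(4)]
region = crop_text_region(image, (1, 1, 4, 3))
print(region)
```

Validating the frame against the image bounds before slicing avoids silently returning a smaller region than the detected text line, which would drop text information.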

In some embodiments, obtaining at least two types of feature information from the text area includes: extracting image features of the sample image from the text area, and identifying the image features to obtain at least two types of feature information.

Image features can be understood along two major dimensions: a content dimension and an appearance dimension. In this embodiment the sample image is an image that includes text information, so the image features of the content dimension are the features related to the content of the text information, such as the text content, and the image features of the appearance dimension are the features related to the color and texture of the text information.

Therefore, in this embodiment, the at least two types of feature information may include two types of feature information determined respectively from the two major dimensions (that is, the content dimension and the appearance dimension). Of course, in light of the analysis in the first embodiment above, the two major dimensions can also be split into smaller dimensions, and three or more types of feature information can be determined based on the smaller dimensions. This is not limited in this embodiment.

In this embodiment, since the text area has high accuracy and comprehensiveness, the image features extracted from the text area also have high accuracy and comprehensiveness. When the image features are recognized to obtain the feature information, they can be analyzed along multiple dimensions to obtain feature information in multiple dimensions, which improves the accuracy, comprehensiveness, and reliability of the feature information.

In some embodiments, the at least two types of feature information include textual features and visual features.

Among them, text features can be understood as feature information based on the content dimension, and visual features can be understood as feature information based on the appearance dimension.

S202: Perform fusion processing on text features and visual features based on a preset feature fusion model to obtain fusion features of the sample image.

For the implementation principle of S202, reference may be made to the first embodiment, which will not be repeated here.

S203: Construct multiple text feature blocks for representing text features, and construct multiple visual feature blocks for representing visual features.

For example, multiple text feature blocks having a mapping relationship with text features are constructed, and multiple text feature blocks can be used to represent text features.

Exemplarily, the number of text feature blocks can be determined based on requirements, historical records, and experiments, and the text features can be mapped to multiple text feature blocks, and the multiple text feature blocks can represent the text features.

Specifically, a text feature block can be a 2*2 (pixel) feature block. Based on the semantic information of the text features, the text features can be split and stored into multiple 2*2 (pixel) feature blocks, so as to obtain multiple text feature blocks.

Among them, the semantic information can be understood as the information related to the field classification of the text information, the information related to the position of the text information between fields, or the information related to the representation meaning of the text information.

Similarly, for the principle and implementation of constructing multiple visual feature blocks for representing visual features, please refer to the construction of multiple text feature blocks for representing text features, which will not be repeated here.
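Splitting a feature map into 2*2 feature blocks can be sketched as below. This sketch divides the map on a regular spatial grid only; it does not model the semantic information that the embodiment uses to guide the split, and the function name `split_into_blocks` and the toy 4x4 feature map are assumptions for illustration. The same routine would apply to a visual feature map.

```python
def split_into_blocks(feature_map, block=2):
    """Split a 2D feature map into non-overlapping block x block tiles.

    Tiles are returned in row-major order. The height and width of the
    feature map must be multiples of `block`.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % block or width % block:
        raise ValueError("feature map size must be a multiple of the block size")
    blocks = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            blocks.append([row[c:c + block] for row in feature_map[r:r + block]])
    return blocks


# Toy 4x4 feature map with values 0..15.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
blocks = split_into_blocks(feature_map)
print(len(blocks), blocks[0], blocks[3])
```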

S204: The feature fusion model adjusts the parameters of the text detection model and the parameters of the feature fusion model respectively according to the fusion features and the multiple text feature blocks; and/or the feature fusion model adjusts the parameters of the text detection model and the parameters of the feature fusion model respectively according to the fusion features and the multiple visual feature blocks.

In one example, the parameters of the text detection model can be adjusted by combining the fusion feature and multiple text feature blocks, and the parameters of the feature fusion model can be adjusted.

In some embodiments, adjusting the parameters of the text detection model in combination with the fusion feature and a plurality of text feature blocks, and adjusting the parameters of the feature fusion model may include the following steps:

The first step: the feature fusion model randomly covers some of the text features in the fusion feature, and performs prediction and completion processing on the covered part of the text features according to multiple text feature blocks, and obtains the part of the text features after the prediction and completion.

As analyzed above, training the text recognition model framework is an iterative process. Therefore, during training, the part of the text features in the fusion features that is randomly masked (covered) in the current iteration is not the same as the part randomly masked in the previous iteration.

Exemplarily, the text features randomly masked in different iterations may be completely different.

For example, in the first iteration, the randomly masked text features are the first six percent of the text features, and in the second iteration, the randomly masked text features are those between the first six percent and the first twelve percent of the text features, and so on; the cases are not listed one by one.

As another example, in the first iteration, the randomly masked text features are six percent of the text features, and in the second iteration, the randomly masked text features are six percent of the text features chosen from those not masked in the first iteration.

Alternatively, the text features randomly masked in different iterations may be not completely the same, that is, they may partially overlap between iterations.

For example, in the first iteration, the randomly masked text features are six percent of the text features, and in the second iteration, the randomly masked text features are also six percent of the text features, where some text features appear both in the six percent masked in the first iteration and in the six percent masked in the second iteration.
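The two masking schedules described above can be sketched as follows: one in which each iteration masks features that no earlier iteration has masked, and one in which masks are drawn independently and may therefore overlap. The helper name `sample_mask`, the feature count of 100, and the fixed random seed are assumptions for illustration; only the six percent ratio comes from the examples above.

```python
import random


def sample_mask(num_features, ratio, rng, exclude=frozenset()):
    """Pick the indices of text features to mask in one iteration.

    exclude: indices that must not be chosen, e.g. those already masked
             in earlier iterations (gives completely different masks).
    """
    candidates = [i for i in range(num_features) if i not in exclude]
    count = max(1, round(num_features * ratio))
    return set(rng.sample(candidates, count))


rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# Schedule A: completely different masks in consecutive iterations.
first = sample_mask(100, 0.06, rng)
second = sample_mask(100, 0.06, rng, exclude=first)

# Schedule B: independent draws, so the two masks may share features.
third = sample_mask(100, 0.06, rng)
fourth = sample_mask(100, 0.06, rng)

print(sorted(first), sorted(second))
```

In schedule A every feature is masked at most once until the pool is exhausted, while schedule B is simpler but gives no such guarantee.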

As analyzed above, multiple text feature blocks can be used to represent the text features. Therefore, when part of the text features is masked, the masked part can be predicted based on the multiple text feature blocks, so as to obtain the predicted and completed part of the text features.

For example, if the text feature in the fusion feature is A, the masked part of text feature A is a1, and the unmasked part is a2, then the training device can infer the content of the masked part a1 (that is, the predicted and completed part of the text features) based on the multiple text feature blocks and the unmasked part a2.

The second step: adjust the parameters of the text detection model and the feature fusion model according to the predicted and completed part of the text features and the features of the fusion features except the covered part of the text features.

In conjunction with the above-mentioned embodiment, this step can be understood as: the training device adjusts the parameters of the text detection model according to the content of the inferred partial text feature (ie, the partial text feature after prediction and completion), and the partial document feature is a2, and Adjust the parameters of the feature fusion model.

In this embodiment, by covering part of the text features in the fusion feature, and based on a plurality of text feature blocks, the covered part of the text features is predicted and completed, so that the two models ( That is, the parameters of the text detection model and the feature fusion model) are adjusted separately, fully considering the relationship between the text features of each part of the text feature (including the relationship between the text content and the position), which can be The technical effect of improving the recognition and discrimination capabilities of the two models, thereby improving the accuracy and reliability of the trained text recognition model framework.
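The cover-then-complete idea of the two steps above can be illustrated with a toy sketch. It is hedged in several ways: feature vectors are plain Python lists, the predictor simply averages the visible vectors as a stand-in for the feature fusion model (the disclosure does not prescribe this), and the mean squared error over covered positions is an assumed training signal. All function names are hypothetical.

```python
def mask_features(features, masked_idx, mask_value=0.0):
    """Return a copy of the features with the chosen positions covered."""
    masked = set(masked_idx)
    return [[mask_value] * len(v) if i in masked else list(v)
            for i, v in enumerate(features)]


def predict_masked(masked_features, masked_idx):
    """Stand-in predictor: fill each covered slot with the mean of the visible vectors."""
    masked = set(masked_idx)
    visible = [v for i, v in enumerate(masked_features) if i not in masked]
    dim = len(masked_features[0])
    mean = [sum(v[d] for v in visible) / len(visible) for d in range(dim)]
    return {i: list(mean) for i in masked_idx}


def reconstruction_loss(features, predictions):
    """Mean squared error computed over the covered positions only."""
    total, count = 0.0, 0
    for i, pred in predictions.items():
        for p, t in zip(pred, features[i]):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In a real system the loss would be backpropagated to update both the text detection model and the feature fusion model; the sketch only shows how the covered part a1 is separated from the visible part a2 and compared with its prediction.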

In other embodiments, adjusting the parameters of the text detection model in combination with the fusion feature and a plurality of text feature blocks, and adjusting the parameters of the feature fusion model may include the following steps:

The first step: the feature fusion model replaces the text features in the fusion features according to at least part of the text feature blocks in the plurality of text feature blocks, and obtains the replaced text features.

The second step: adjust the parameters of the text detection model and the feature fusion model respectively according to the visual features in the fusion features and the replaced text features.

Wherein, the principle of replacing the text features in the fusion features in this embodiment may be full replacement or partial replacement, which is not limited in this embodiment.

As for the principle of replacing the text features in the fused features, please refer to the principle of covering some of the text features in the fused features in the above embodiment, which will not be repeated here.

Similarly, in this embodiment, the text features in the fusion feature are replaced, and the parameters of the two models (namely, the text detection model and the feature fusion model) are adjusted respectively based on the replaced text features and the visual features in the fusion feature. This can improve the recognition and discrimination capabilities of the two models, thereby improving the accuracy and reliability of the trained text recognition model framework.

In another example, the parameters of the text detection model and the parameters of the feature fusion model can be adjusted by combining the fusion feature and multiple visual feature blocks.
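The full or partial replacement mentioned above can be sketched as below. This is only an illustration under stated assumptions: the function name, the `ratio` parameter (1.0 for full replacement, a smaller value for partial replacement), the random choice of a replacement block, and the returned boolean flags are all hypothetical and not taken from the disclosure.

```python
import random


def replace_text_features(text_features, blocks, ratio=1.0, seed=0):
    """Replace a share of the text features with vectors drawn from text feature blocks.

    Returns the replaced features and a flag per position telling whether it was replaced.
    """
    rng = random.Random(seed)
    n = len(text_features)
    k = max(1, round(n * ratio))
    chosen = set(rng.sample(range(n), k))
    replaced = [list(rng.choice(blocks)) if i in chosen else list(v)
                for i, v in enumerate(text_features)]
    flags = [i in chosen for i in range(n)]
    return replaced, flags
```

The flags make it possible to build a training signal from the replaced positions, for example by asking the model which positions no longer agree with the visual features; that particular objective is an assumption, since the text leaves it open.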

In some embodiments, adjusting the parameters of the text detection model in combination with the fusion feature and a plurality of visual feature blocks, and adjusting the parameters of the feature fusion model may include the following steps:

The first step: the feature fusion model randomly covers some of the visual features in the fusion feature, and performs prediction and completion processing on the covered part of the visual features according to multiple visual feature blocks, and obtains the partial visual features after prediction and completion.

The second step: adjust the parameters of the text detection model and the feature fusion model according to the predicted and completed part of the visual features and the features of the fusion features except for the covered part of the visual features.

For the implementation principle of this embodiment, reference may be made to the implementation principle of combining the fusion feature and multiple text feature blocks in the above-mentioned embodiments, which will not be repeated here.

Similarly, in this embodiment, part of the visual features in the fusion feature is covered, and the covered part is predicted and completed based on the plurality of visual feature blocks, so that the parameters of the two models (that is, the text detection model and the feature fusion model) are adjusted respectively. This can improve the recognition and discrimination capabilities of the two models, thereby improving the accuracy and reliability of the trained text recognition model framework.

In other embodiments, adjusting the parameters of the text detection model in combination with the fusion feature and a plurality of visual feature blocks, and adjusting the parameters of the feature fusion model may include the following steps:

The first step: the feature fusion model replaces the visual features in the fused features according to at least part of the visual feature blocks in the plurality of visual feature blocks, and obtains the replaced visual features.

The second step: adjust the parameters of the text detection model and the feature fusion model respectively according to the text features in the fusion features and the replaced visual features.

Similarly, in this embodiment, the visual features in the fusion feature are replaced, and the parameters of the two models (that is, the text detection model and the feature fusion model) are adjusted respectively based on the replaced visual features and the text features in the fusion feature. This can improve the recognition and discrimination capabilities of the two models, thereby improving the accuracy and reliability of the trained text recognition model framework.

In another example, the parameters of the text detection model and the parameters of the feature fusion model can be adjusted by combining the fusion feature, multiple text feature blocks, and multiple visual feature blocks.

For example, this may include the following steps:

The first step: the feature fusion model determines a first adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and multiple text feature blocks.

The second step: the feature fusion model determines a second adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and a plurality of visual feature blocks.

The third step: adjust the parameters of the text detection model and the feature fusion model respectively according to the weighted average information of the first adjustment task result and the second adjustment task result.
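The weighted averaging in the third step can be sketched as follows. The sketch assumes each adjustment task result is a single scalar (for example, a loss value) and that the weights are chosen by the practitioner; neither assumption comes from the disclosure, and the function name is hypothetical.

```python
def weighted_average(task_results, weights):
    """Combine per-task results into one value used to adjust both models."""
    if len(task_results) != len(weights):
        raise ValueError("one weight is required per task result")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r * w for r, w in zip(task_results, weights)) / total_weight
```

The same function applies unchanged to two or three training tasks, since it only requires one weight per result.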

In combination with the above embodiments, in some embodiments, covering part of the text features in the fusion feature can be used as a first training task to obtain a first training result; covering part of the visual features in the fusion feature can be used as a second training task to obtain a second training result; and replacing the text features in the fusion feature can be used as a third training task to obtain a third training result. Weighted average processing is performed on the first training result, the second training result, and the third training result to obtain the adjustment parameters finally used for the text detection model and for the feature fusion model. The parameters of the text detection model are then adjusted based on the adjustment parameters for the text detection model, and the parameters of the feature fusion model are adjusted based on the adjustment parameters for the feature fusion model.

In other embodiments, covering part of the text features in the fusion feature can be used as a first training task to obtain a first training result; covering part of the visual features in the fusion feature can be used as a second training task to obtain a second training result; and replacing the visual features in the fusion feature can be used as a third training task to obtain a third training result. Weighted average processing is performed on the first training result, the second training result, and the third training result to obtain the adjustment parameters finally used for the text detection model and for the feature fusion model. The parameters of the text detection model are then adjusted based on the adjustment parameters for the text detection model, and the parameters of the feature fusion model are adjusted based on the adjustment parameters for the feature fusion model.

The combination of training tasks in this embodiment may be a combination of three training tasks as above. It should be understood that the above combinations of three training tasks are only exemplary and cannot be understood as limiting the manner of combination; other combinations are also possible and will not be listed one by one here. The combination may also be a combination of two training tasks, whose implementation principle is the same as that of the combination of three training tasks and is not repeated here.

In this embodiment, the text recognition model framework is obtained through multi-task training: training based on the fusion feature and the plurality of text feature blocks serves as one training task, and training based on the fusion feature and the plurality of visual feature blocks serves as another training task. The adjustment parameters for the text detection model and for the feature fusion model are determined based on the training results of the two training tasks, so that the parameters of both models are adjusted. Adjusting the parameters of the text detection model and the feature fusion model through multi-task training can improve the accuracy and reliability of the adjustment.

It is worth noting that, in this embodiment, adjusting the parameters of the text detection model and the feature fusion model by different methods improves the flexibility and diversity of parameter adjustment, and thereby the flexibility and diversity of training the text recognition model framework.

FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure. As shown in FIG. 4, the text recognition method of this embodiment includes:

S401: Acquire an image to be recognized.

S402: Input the image to be recognized into a pre-trained text recognition model to obtain text information in the image to be recognized.

Wherein, the text recognition model is generated by training on images to be trained based on a pre-trained text recognition model framework, the text recognition model framework is obtained by the training method described in any of the above embodiments, and the images to be trained include text information.

Based on the above analysis, the text recognition model framework includes a text detection model and a feature fusion model and has high accuracy and reliability. Therefore, the text recognition model generated based on the text recognition model framework also has high accuracy and reliability, which improves the effectiveness and reliability of recognition when the image to be recognized is recognized based on the text recognition model.

Fig. 5 is a schematic diagram according to the fourth embodiment of the present disclosure. As shown in Fig. 5, the training device 500 of the text recognition model framework of this embodiment includes:

The processing unit 501 is configured to perform feature processing on the sample image based on a preset text detection model to obtain at least two types of feature information related to text information in the sample image.

The fusion unit 502 is configured to perform fusion processing on at least two types of feature information of the sample image based on a preset feature fusion model to obtain fusion features of the sample image.

The training unit 503 is configured to input the fusion feature into the feature fusion model, and adjust the parameters of the text detection model and the feature fusion model respectively based on the fusion feature, to obtain a text recognition model framework, wherein the text recognition model framework includes the adjusted text detection model and the adjusted feature fusion model.

Fig. 6 is a schematic diagram according to the fifth embodiment of the present disclosure. As shown in Fig. 6, the training device 600 of the text recognition model framework of this embodiment includes:

The processing unit 601 is configured to perform feature processing on the sample image based on a preset text detection model to obtain at least two types of feature information related to text information in the sample image.

As can be seen with reference to FIG. 6, in some embodiments, the processing unit 601 includes:

The first determination subunit 6011 is configured to determine the position information of the text line in the sample image based on the text detection model.

The second determining subunit 6012 is configured to determine the at least two types of feature information according to the position information.

In some embodiments, the second determining subunit 6012 includes:

The cropping module is configured to perform a cropping operation on the sample image according to the position information to obtain a text area.

An acquisition module, configured to acquire at least two types of feature information from the text area.

In some embodiments, the acquisition module is configured to extract image features of the sample image from the text area, and identify the image features to obtain at least two types of feature information.
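The cropping operation performed by the cropping module can be sketched as follows. The box format `(left, top, right, bottom)` with exclusive right and bottom edges, the nested-list image representation, and the function name are assumptions for illustration; the disclosure does not fix the form of the position information.

```python
def crop_text_region(image, box):
    """Crop a text area from an image given as a list of pixel rows.

    box is (left, top, right, bottom); right and bottom are exclusive.
    Coordinates are clamped to the image bounds before cropping.
    """
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]
```

The cropped region would then be passed to the acquisition module, which extracts the at least two types of feature information from it.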

The fusion unit 602 is configured to perform fusion processing on at least two types of feature information of the sample image based on a preset feature fusion model to obtain fusion features of the sample image.

Wherein, at least two types of feature information include text features and visual features.

A construction unit 603, configured to construct multiple text feature blocks for representing text features, and construct multiple visual feature blocks for representing visual features.

The training unit 604 is configured to input the fusion feature into the feature fusion model, and adjust the parameters of the text detection model and the feature fusion model respectively based on the fusion feature, to obtain a text recognition model framework, wherein the text recognition model framework includes the adjusted text detection model and the adjusted feature fusion model.

As can be seen from FIG. 6, in some embodiments, the training unit 604 includes:

The first covering subunit 60411 is used to randomly cover some text features in the fused features.

The first prediction and completion subunit 60412 is configured to perform prediction and completion processing on the covered partial text features according to multiple text feature blocks, and obtain the partial text features after prediction and completion.

The first adjustment subunit 60413 is used to adjust the parameters of the text detection model and the feature fusion model respectively according to the predicted and completed part of the text features and the features of the fusion features except the covered part of the text features.

In other embodiments, the training unit 604 includes:

The second covering subunit 60414 is used to randomly cover some visual features in the fused features.

The second prediction and completion subunit 60415 is used to perform prediction and completion processing on the covered partial visual features according to multiple visual feature blocks, and obtain the partial visual features after prediction and completion.

The second adjustment subunit 60416 is used to adjust the parameters of the text detection model and the feature fusion model respectively according to the predicted and completed part of the visual features and the features of the fusion features except the covered part of the visual features.

As can be seen from FIG. 6, in some embodiments, the training unit 604 may further include:

The first replacement subunit 60417 is configured to replace the text features in the fusion feature according to at least part of the text feature blocks in the plurality of text feature blocks to obtain replaced text features.

The third adjustment subunit 60418 is used to adjust the parameters of the text detection model and the feature fusion model respectively according to the visual features in the fused features and the replaced text features.

In other embodiments, the training unit 604 includes:

The second replacement subunit 60419 is configured to replace the visual features in the fusion feature according to at least part of the visual feature blocks in the plurality of visual feature blocks to obtain the replaced visual features.

The fourth adjustment subunit 60420 is used to adjust the parameters of the text detection model and the feature fusion model respectively according to the text features in the fusion features and the replaced visual features.

As can be seen from FIG. 6, in some embodiments, if the feature fusion model adjusts the parameters of the text detection model and the feature fusion model according to the fusion feature, multiple text feature blocks, and multiple visual feature blocks, the training unit 604 may further include:

The third determination subunit 60421 is configured to determine a first adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and the plurality of text feature blocks.

The fourth determination subunit 60422 is configured to determine a second adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and multiple visual feature blocks.

The fifth adjustment subunit 60423 is configured to adjust the parameters of the text detection model and the feature fusion model respectively according to the weighted average information of the first adjustment task result and the second adjustment task result.

Fig. 7 is a schematic diagram according to the sixth embodiment of the present disclosure. As shown in Fig. 7, the text recognition device 700 of this embodiment includes:

The acquiring unit 701 is configured to acquire an image to be recognized.

The recognition unit 702 is configured to input the image to be recognized into a pre-trained text recognition model to obtain text information in the image to be recognized, wherein the text recognition model is generated by training on images to be trained based on a pre-trained text recognition model framework, the text recognition model framework is obtained by the training method described in any one of the above embodiments, and the images to be trained include text information.

According to the embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.

According to an embodiment of the present disclosure, the present disclosure also provides a computer program product. The computer program product includes a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device executes the solution provided by any one of the above embodiments.

FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 8, the electronic device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data necessary for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Multiple components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, etc.; an output unit 807, such as various types of displays, speakers, etc.; a storage unit 808, such as a magnetic disk, an optical disk, etc.; and a communication unit 809, such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 809 allows the device 800 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.

The computing unit 801 may be various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 801 executes the various methods and processes described above, such as the training method of the text recognition model framework and the text recognition method. For example, in some embodiments, the training method of the text recognition model framework and the text recognition method can be implemented as a computer software program that is tangibly contained in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the text recognition model framework and the text recognition method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured in any other appropriate way (for example, by means of firmware) to execute the training method of the text recognition model framework and the text recognition method.

Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing devices, so that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).

The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.

A computer system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server can also be a server of a distributed system, or a server combined with a blockchain.

According to another aspect of the embodiments of the present disclosure, the embodiments of the present disclosure also provide a training system for a text recognition model framework, including:

The text detection model is used to perform feature processing on the sample image to obtain at least two kinds of feature information related to the text information in the sample image.

The feature fusion model is used to fuse at least two kinds of feature information of the sample image to obtain the fusion feature of the sample image.

The feature fusion model is also used to adjust the parameters of the text detection model and the feature fusion model respectively to obtain a text recognition model framework, wherein the text recognition model framework includes an adjusted text detection model and an adjusted feature fusion model.

Based on the above analysis, it can be known that in some embodiments, the text detection model may be an optical character recognition model, and the feature fusion model may be a Transformer model.

It can be seen from FIG. 9 that, in some embodiments, the training system 900 of the text recognition model framework of the embodiment of the present disclosure includes:

The optical character recognition model 901 is used to detect the text information in the sample image, obtain the position information of the text line in the sample image, and transmit the position information to the region feature extractor 902.

The region feature extractor 902 is configured to perform a cropping operation on the sample image according to the position information to obtain a text region, and transmit the text region to the text recognizer 903 and the visual recognizer 904 respectively.

The text recognizer 903 is configured to determine text features in the text region, and transmit the text features to the Transformer model 905.

The visual recognizer 904 is configured to determine the visual features in the text region, and transmit the visual features to the Transformer model 905.

The Transformer model 905 is configured to fuse the text features and the visual features to obtain fusion features, and to adjust the parameters of the optical character recognition model 901 and the parameters of the Transformer model 905 based on the fusion features to obtain a text recognition model framework.
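The data flow among components 901 to 905 can be sketched as a chain of function calls. Everything in the sketch is a hypothetical stand-in: the toy detector, recognizers, and the concatenation used for fusion only mimic the interfaces, whereas the disclosure uses an optical character recognition model and a Transformer model, and the parameter adjustment step is omitted.

```python
def run_pipeline(image, detect, extract_region,
                 recognize_text, recognize_visual, fuse):
    """Pass one sample image through the five components and collect fused features."""
    fused = []
    for box in detect(image):                        # 901: positions of text lines
        region = extract_region(image, box)          # 902: crop the text region
        text_feat = recognize_text(region)           # 903: text features
        visual_feat = recognize_visual(region)       # 904: visual features
        fused.append(fuse(text_feat, visual_feat))   # 905: fusion
    return fused


# Toy stand-ins so the sketch runs end to end (not the disclosed models).
def toy_detect(image):
    return [(0, 0, len(image[0]), 1)]  # one text line covering the first row


def toy_extract(image, box):
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def toy_text(region):
    return [float(len(region[0]))]  # region width as a fake text feature


def toy_visual(region):
    pixels = [p for row in region for p in row]
    return [sum(pixels) / len(pixels)]  # mean intensity as a fake visual feature


def toy_fuse(text_feat, visual_feat):
    return text_feat + visual_feat  # concatenation as a placeholder for fusion
```

Passing the components in as arguments mirrors the statement that they may be integrated or independent: any of them can be swapped without changing the flow.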

It should be understood that the components in the foregoing embodiments may be integrated or formed independently, which is not limited in this embodiment.

For example, the optical character recognition model, the region feature extractor, the text recognizer, and the visual recognizer may each be independent components. For another example, the region feature extractor may be integrated into the optical character recognition model, while the text recognizer and the visual recognizer are two independent components. Other arrangements are not listed here.

For the implementation principles of the above features, reference may be made to the descriptions in the above method embodiments, which will not be repeated here.

According to another aspect of the embodiments of the present application, the embodiments of the present application further provide a computer program, including program code, and when the computer runs the computer program, the program code executes the method described in any of the above embodiments.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order; no limitation is imposed herein as long as the desired result of the technical solution provided by the present disclosure can be achieved.

The specific implementation manners described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims

A training method of a text recognition model framework, said method comprising:

performing feature processing on the sample image based on a preset text detection model to obtain at least two types of feature information related to the text information in the sample image;

performing fusion processing on at least two types of feature information of the sample image based on a preset feature fusion model to obtain fusion features of the sample image;

inputting the fusion feature into the feature fusion model, and adjusting the parameters of the text detection model and the feature fusion model respectively based on the fusion feature, to obtain a text recognition model framework, wherein the text recognition model framework includes an adjusted text detection model and an adjusted feature fusion model.
The method according to claim 1, wherein performing feature processing on the sample image based on a preset text detection model to obtain at least two types of feature information related to the text information in the sample image comprises:

determining position information of text lines in the sample image based on the text detection model, and determining the at least two types of feature information according to the position information.
The method according to claim 2, wherein determining the at least two types of feature information according to the position information comprises:

performing a cropping operation on the sample image according to the position information to obtain a text area, and obtaining the at least two types of feature information from the text area.
The method according to claim 3, wherein obtaining the at least two types of feature information from the text area comprises:

extracting image features of the sample image from the text area, and identifying the image features to obtain the at least two types of feature information.
The method according to any one of claims 2 to 4, wherein the at least two types of feature information include text features and visual features; after determining the at least two types of feature information according to the position information, the method further comprises:

constructing a plurality of text feature blocks for characterizing the text features, and constructing a plurality of visual feature blocks for characterizing the visual features;

and, inputting the fusion feature into the feature fusion model and adjusting the parameters of the text detection model and the feature fusion model respectively based on the fusion feature comprises: adjusting, by the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of text feature blocks; and/or adjusting, by the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of visual feature blocks.
The method according to claim 5, wherein adjusting, by the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of text feature blocks comprises:

randomly covering, by the feature fusion model, part of the text features in the fusion feature, and performing prediction and completion processing on the covered part of the text features according to the plurality of text feature blocks, to obtain a predicted and completed part of the text features;

adjusting the parameters of the text detection model and the feature fusion model respectively according to the predicted and completed part of the text features and the features in the fusion feature other than the covered part of the text features.
The method according to claim 5, wherein adjusting, by the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of visual feature blocks comprises:

randomly covering, by the feature fusion model, part of the visual features in the fusion feature, and performing prediction and completion processing on the covered part of the visual features according to the plurality of visual feature blocks, to obtain a predicted and completed part of the visual features;

adjusting the parameters of the text detection model and the feature fusion model respectively according to the predicted and completed part of the visual features and the features in the fusion feature other than the covered part of the visual features.
The method according to claim 5, wherein adjusting, by the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of text feature blocks comprises:

performing, by the feature fusion model, replacement processing on the text features in the fusion feature according to at least some of the plurality of text feature blocks, to obtain replaced text features;

adjusting the parameters of the text detection model and the feature fusion model respectively according to the visual features in the fusion feature and the replaced text features.
The method according to claim 5, wherein adjusting, by the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of visual feature blocks comprises:

performing, by the feature fusion model, replacement processing on the visual features in the fusion feature according to at least some of the plurality of visual feature blocks, to obtain replaced visual features;

adjusting the parameters of the text detection model and the feature fusion model respectively according to the text features in the fusion feature and the replaced visual features.
The method according to any one of claims 5 to 9, wherein, if the feature fusion model adjusts the parameters of the text detection model and the feature fusion model respectively according to the fusion feature, the plurality of text feature blocks, and the plurality of visual feature blocks, adjusting the parameters of the text detection model and the feature fusion model respectively comprises:

determining, by the feature fusion model, a first adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and the plurality of text feature blocks;

determining, by the feature fusion model, a second adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and the plurality of visual feature blocks;

adjusting the parameters of the text detection model and the feature fusion model respectively according to weighted average information of the first adjustment task result and the second adjustment task result.
A text recognition method, comprising:

obtaining an image to be recognized;

inputting the image to be recognized into a pre-trained text recognition model to obtain text information in the image to be recognized, wherein the text recognition model is generated by training on an image to be trained based on a pre-trained text recognition model framework, the text recognition model framework is obtained through the training method according to any one of claims 1 to 10, and the image to be trained includes text information.
A training device of a text recognition model framework, said device comprising:

A processing unit, configured to perform feature processing on the sample image based on a preset text detection model to obtain at least two types of feature information related to the text information in the sample image;

a fusion unit, configured to perform fusion processing on at least two types of feature information of the sample image based on a preset feature fusion model, to obtain fusion features of the sample image;

a training unit, configured to input the fusion feature into the feature fusion model, and adjust the parameters of the text detection model and the feature fusion model based on the fusion feature to obtain a text recognition model framework, wherein the text recognition model framework includes an adjusted text detection model and an adjusted feature fusion model.
The device according to claim 12, wherein the processing unit comprises:

A first determination subunit, configured to determine position information of text lines in the sample image based on the text detection model;

a second determining subunit, configured to determine the at least two types of feature information according to the position information.
The device according to claim 13, wherein the second determining subunit comprises:

A cropping module, configured to perform a cropping operation on the sample image according to the position information to obtain a text area;

an obtaining module, configured to obtain the at least two types of feature information from the text area.
The device according to claim 14, wherein the obtaining module is configured to extract image features of the sample image from the text area, and identify the image features to obtain the at least two types of feature information.
The device according to any one of claims 13 to 15, wherein the at least two types of feature information include text features and visual features; the device further comprises:

a construction unit, configured to construct a plurality of text feature blocks for characterizing the text features, and construct a plurality of visual feature blocks for characterizing the visual features;

and the training unit is configured to adjust, by means of the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of text feature blocks; and/or to adjust, by means of the feature fusion model, the parameters of the text detection model and the feature fusion model respectively according to the fusion feature and the plurality of visual feature blocks.
The device according to claim 16, wherein the training unit comprises:

The first covering subunit is used to randomly cover some text features in the fusion features;

The first prediction and completion subunit is used to perform prediction and completion processing on the covered part of the text features according to the plurality of text feature blocks, and obtain the part of the text features after the prediction and completion;

a first adjustment subunit, configured to adjust the parameters of the text detection model and the feature fusion model respectively according to the predicted and completed part of the text features and the features in the fusion features other than the covered part of the text features.
The device according to claim 16, wherein the training unit comprises:

The second covering subunit is used to randomly cover some visual features in the fusion features;

The second prediction and completion subunit is used to perform prediction and completion processing on the covered part of the visual features according to the plurality of visual feature blocks, and obtain the partial visual features after prediction and completion;

a second adjustment subunit, configured to adjust the parameters of the text detection model and the feature fusion model respectively according to the predicted and completed part of the visual features and the features in the fusion features other than the covered part of the visual features.
The device according to claim 16, wherein the training unit comprises:

The first replacement subunit is configured to replace the text features in the fusion features according to at least some of the text feature blocks in the plurality of text feature blocks, to obtain replaced text features;

The third adjustment subunit is configured to adjust the parameters of the text detection model and the feature fusion model respectively according to the visual features in the fusion features and the replaced text features.
The device according to claim 16, wherein the training unit comprises:

The second replacement subunit is configured to replace the visual features in the fusion feature according to at least part of the visual feature blocks in the plurality of visual feature blocks to obtain replaced visual features;

The fourth adjustment subunit is configured to adjust the parameters of the text detection model and the feature fusion model respectively according to the text features in the fusion features and the replaced visual features.
The device according to any one of claims 16 to 20, wherein, if the feature fusion model adjusts the parameters of the text detection model and the feature fusion model respectively according to the fusion feature, the plurality of text feature blocks, and the plurality of visual feature blocks, the training unit comprises:

A third determining subunit, configured to determine a first adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and the plurality of text feature blocks;

A fourth determining subunit, configured to determine a second adjustment task result for adjusting the text detection model and the feature fusion model according to the fusion feature and the plurality of visual feature blocks;

The fifth adjustment subunit is configured to adjust the parameters of the text detection model and the feature fusion model according to the weighted average information of the first adjustment task result and the second adjustment task result.
A text recognition device, comprising:

an acquisition unit, configured to acquire an image to be recognized;

a recognition unit, configured to input the image to be recognized into a pre-trained text recognition model to obtain text information in the image to be recognized, wherein the text recognition model is generated by training on an image to be trained based on a pre-trained text recognition model framework, the text recognition model framework is obtained through the training method according to any one of claims 1 to 10, and the image to be trained includes text information.
An electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 10, or to enable the at least one processor to perform the method according to claim 11.
A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 10, or the computer instructions are used to cause the computer to perform the method according to claim 11.
A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 10, or which, when executed by a processor, implements the method according to claim 11.
A training system of a text recognition model framework, said system comprising:

A text detection model, configured to perform feature processing on the sample image to obtain at least two types of feature information related to the text information in the sample image;

A feature fusion model, configured to perform fusion processing on at least two types of feature information of the sample image to obtain fusion features of the sample image;

The feature fusion model is further configured to adjust the parameters of the text detection model and the feature fusion model respectively to obtain a text recognition model framework, wherein the text recognition model framework includes an adjusted text detection model and an adjusted feature fusion model.
A computer program, including program code, wherein when a computer runs the computer program, the program code executes the method according to any one of claims 1 to 10, or the program code executes the method according to claim 11.