CN115035510A - Text recognition model training method, text recognition device, and medium


Info

Publication number
CN115035510A
Authority
CN
China
Prior art keywords
training
image data
text
recognition model
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210959313.4A
Other languages
Chinese (zh)
Other versions
CN115035510B (en)
Inventor
莫秀云
王国鹏
王洁瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Original Assignee
Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd filed Critical Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Priority to CN202210959313.4A
Publication of CN115035510A
Application granted
Publication of CN115035510B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/16 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content

Abstract

The application relates to artificial intelligence technology and provides a text recognition model training method, a text recognition method, a device, and a medium. A test set and a training set are built from first-type image data and from second-type image data respectively, so that a text recognition model with high generalization capability is trained on different types of image data; data from different scenes can be trained jointly, which reduces labeling cost compared with training a model on a single scene. The training set is compressed according to word frequency, which realizes compressed training of the text recognition model and improves training efficiency; and because the importance of characters is considered when the training set is compressed, the trained text recognition model can still recognize text accurately. The efficiency of model training and optimization is thereby improved.

Description

Text recognition model training method, text recognition device, and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text recognition model training method, a text recognition device, and a medium.
Background
Currently, a text recognition model typically needs to recognize thousands of characters accurately to meet basic text recognition requirements. However, the Chinese character set contains a large number of rare characters that seldom appear in daily life, so corpus data for them is scarce. In addition, annotators must label all of the text content in each picture, which makes labeling costly. Moreover, because the data styles of different text recognition tasks differ greatly, different types of text image data are not equally easy to collect; for example, document-type image data such as contracts is abundant, while natural-scene pictures taken with a mobile phone are far fewer.
In recent years, text recognition has mainly been performed with deep learning methods, such as a CNN-RNN (Convolutional Neural Network-Recurrent Neural Network) model, a CNN combined with a Seq2Seq (sequence-to-sequence) model, and a CNN-Seq2Seq model with an attention mechanism. Although deep learning models can recognize text content accurately and objectively, they are mainly trained on statically distributed data from a single fixed scene and are severely lacking in the ability to learn continually and to generalize knowledge. Because data distributions shift, a model faces the challenge of retaining and accumulating knowledge while learning new tasks. Specifically, current text recognition models trained with deep learning mainly suffer from the following problems:
1) The text recognition model overfits single-scene data and generalizes poorly. In text recognition research, most models and algorithms target only one scene, so the models transfer and generalize poorly. For example, a recognition model trained on scanned document-type text data performs poorly on natural-scene text data.
2) When training data is scarce, recognition accuracy is extremely low.
3) Iterative optimization of the text recognition model is inefficient. Most existing text recognition models are learned from tens of millions of training samples; when data for a new task is added, the model usually has to be retrained from scratch to achieve a good recognition effect, and the wait for an update can reach hours, days, or even weeks.
Disclosure of Invention
The embodiments of the present application provide a text recognition model training method, a text recognition method, a device, and a medium, aiming to solve the problems of poor generalization capability, low accuracy, and inefficient optimization of text recognition models.
In a first aspect, an embodiment of the present application provides a text recognition model training method, which includes:
acquiring first type image data and second type image data, preprocessing the first type image data to obtain a first image data set, and preprocessing the second type image data to obtain a second image data set;
acquiring a pre-constructed dictionary, and splitting the first image data set by using the dictionary to obtain a first training set and a first testing set;
splitting the second image data set according to a configuration proportion to obtain a second training set and a second testing set;
detecting high-frequency words and low-frequency words in the first training set;
compressing the first training set according to the high-frequency words and the low-frequency words to obtain a third training set;
combining the second training set and the third training set to obtain a fourth training set;
training a preset recognition model by using the first training set and the first test set to obtain a first recognition model;
training the first recognition model using the fourth training set;
in the training process of the first recognition model, testing the model of each iteration by using the first test set and the second test set respectively to obtain a test result;
selecting a target recognition model from the models of each iteration according to the test result;
and acquiring an image to be recognized, and performing text recognition on the image to be recognized by using the target recognition model to obtain a recognition result.
In a second aspect, an embodiment of the present application provides a text recognition method that uses a target recognition model trained with the text recognition model training method of the first aspect, and the method includes:
acquiring an image to be identified;
and performing text recognition on the image to be recognized by using the target recognition model to obtain a recognition result.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the text recognition model training method according to the first aspect and/or the text recognition method according to the second aspect.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the text recognition model training method according to the first aspect and/or the text recognition method according to the second aspect.
The embodiments of the present application provide a text recognition model training method, a text recognition method, a device, and a medium that can train a text recognition model with strong generalization capability on different types of image data, reducing labeling cost compared with training a model on a single scene. The training data is further compressed by word frequency, which reduces the computational cost of model training while preserving model accuracy, and improves the efficiency of model training and optimization.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a text recognition model training method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a text recognition model training method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a preprocessing process in a text recognition model training method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a text recognition method according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a text recognition model training apparatus provided in an embodiment of the present application;
FIG. 6 is a schematic block diagram of a text recognition apparatus provided in an embodiment of the present application;
fig. 7 is a schematic block diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a text recognition model training method according to an embodiment of the present application; fig. 2 is a schematic flowchart of a text recognition model training method according to an embodiment of the present application, where the text recognition model training method is applicable to a server and can be executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S101 to S110.
S101, collecting first type image data and second type image data, preprocessing the first type image data to obtain a first image data set, and preprocessing the second type image data to obtain a second image data set.
In this embodiment, the technical solution is described with a server as the execution subject. A user terminal (a smart device such as a smartphone or a tablet computer) can exchange data with the server. Specifically, the server provides a text recognition model training platform that the user terminal can log in to. A user interaction interface of the platform is displayed on the terminal, and the interface contains at least one picture upload entry. When the user selects images as training data and uploads them to the server through the upload entry, subsequent text recognition model training can be carried out on the server.
In this embodiment, the first type of image data may be document-type text data such as scanned contracts; the second type of image data may be natural-scene text data containing characters and shot with an electronic device such as a mobile phone, for example, a license plate image taken with a mobile phone or a company plaque image taken with a tablet computer.
Obviously, the first type image data is easier to acquire and thus has a larger data amount, and the second type image data is harder to acquire and thus has a smaller data amount.
Namely: the data amount of the first type image data is larger than the data amount of the second type image data.
For example: when the first type of image data is scanned contracts, the data amount may reach 8 million lines of text, while when the second type of image data is company plaque images shot with a mobile phone, the data amount may be only 10,000 lines of text.
In this embodiment, in order to train a model using the first type image data and the second type image data, the first type image data and the second type image data need to be preprocessed into a usable data format.
Specifically, the preprocessing the first type image data to obtain a first image data set includes:
detecting the line text region of each piece of first-type image data with a text detection model, and cropping the detected line text regions to obtain the line text features of each piece of first-type image data;
labeling the line text features of each piece of first-type image data to obtain the characters included in each line text feature;
and combining the marked first type image data to obtain the first image data set.
For example, referring to fig. 3, when the first type of image data is a scanned menu, a text detection model detects the line text regions of the menu and crops them, and the cropped line text features are then labeled; for example, the line image containing the text "confectionary" is labeled with the characters "confectionary". All the labeled line text features are then combined to obtain the first image data set.
The text detection model may be any model with a text detection function, such as the DBNet text detection model.
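As a minimal sketch of this preprocessing step, the snippet below crops detected line regions and pairs each crop with its label. Here `detect_text_lines` (standing in for a detector such as DBNet) and `annotate` (standing in for the manual labeling step) are hypothetical callables, not APIs named in this application.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class LineSample:
    image: np.ndarray  # cropped line-text region
    label: str         # characters the line contains

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def preprocess(images: List[np.ndarray],
               detect_text_lines: Callable[[np.ndarray], List[Box]],
               annotate: Callable[[np.ndarray], str]) -> List[LineSample]:
    dataset = []
    for img in images:
        for (x1, y1, x2, y2) in detect_text_lines(img):
            crop = img[y1:y2, x1:x2]  # cut out the detected line region
            dataset.append(LineSample(crop, annotate(crop)))
    return dataset
```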
In this embodiment, a manner of preprocessing the second type image data to obtain the second image data set is similar to a manner of preprocessing the first type image data to obtain the first image data set, and is not described herein again.
S102, a pre-constructed dictionary is obtained, and the dictionary is utilized to split the first image data set to obtain a first training set and a first testing set.
In this embodiment, the characters in the dictionary may include the Chinese characters, English characters, symbols, and the like that are used in daily life; the characters that the subsequent models can recognize are exactly the characters in the dictionary.
For example: the dictionary may be a 6,000-character dictionary.
In this embodiment, the splitting the first image data set by using the dictionary to obtain a first training set and a first test set includes:
acquiring each character in the dictionary;
detecting the line texts containing each character in the first image data set to obtain the line text features corresponding to each character;
extracting a first preset number of line text features from the line text features corresponding to each character to construct the first test set;
constructing the first training set using the remaining data in the first image dataset other than the first test set.
Wherein the first predetermined number may be configured by a user, such as 50.
For example: when the first image data set contains 8 million lines of text features, 50 line text features containing each character are extracted from those 8 million lines; the extracted line text features form the first test set, and the remaining line text features form the first training set.
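A minimal sketch of this split, assuming each sample carries its label string (the `LineSample` from the earlier sketch). How lines shared between characters are handled is not specified here, so a line already drawn for one character is simply reused:

```python
import random

def split_by_dictionary(samples, dictionary, per_char=50, seed=0):
    rng = random.Random(seed)
    test_idx = set()
    for ch in dictionary:
        # line text features whose label contains this character
        candidates = [i for i, s in enumerate(samples) if ch in s.label]
        rng.shuffle(candidates)
        test_idx.update(candidates[:per_char])  # first preset number, e.g. 50
    first_test = [samples[i] for i in sorted(test_idx)]
    first_train = [s for i, s in enumerate(samples) if i not in test_idx]
    return first_train, first_test
```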
S103, splitting the second image data set according to a configuration proportion to obtain a second training set and a second testing set.
The configuration ratio can be customized, for example 3:7.
For example: when the configuration ratio is 3:7, the data in the second image data set can be randomly split 3:7, with 30% of the data used as the second test set and 70% used as the second training set.
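A sketch of this random split under the 3:7 reading above:

```python
import random

def split_by_ratio(samples, test_ratio=0.3, seed=0):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_ratio)
    # (second training set, second test set)
    return shuffled[cut:], shuffled[:cut]
```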
And S104, detecting the high-frequency words and the low-frequency words in the first training set.
It should be noted that, because the first training set contains a large amount of data, the training data can be compressed according to word frequency to reduce the data volume and the time consumed by model training, realizing compressed training of the model.
Therefore, the high frequency words and the low frequency words in the first training set need to be detected first.
Specifically, the detecting high-frequency words and low-frequency words in the first training set includes:
acquiring the total number of all characters in the dictionary;
calculating the product of the total quantity and a preset value to obtain a target quantity;
calculating the occurrence frequency of each character in the first training set;
extracting the target number of characters from the characters of the first training set, in descending order of occurrence frequency, as the high-frequency words;
and determining the remaining characters of the first training set, other than the high-frequency words, as the low-frequency words.
The preset value may be configured by a user, for example, 50%.
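A minimal sketch of this detection step, assuming the preset value of 50% mentioned above:

```python
from collections import Counter

def detect_frequency_words(first_train, dictionary, preset_value=0.5):
    target = int(len(dictionary) * preset_value)  # total count * preset value
    counts = Counter(ch for s in first_train for ch in s.label)
    ranked = [ch for ch, _ in counts.most_common()]  # descending frequency
    high_freq = set(ranked[:target])
    low_freq = set(ranked[target:])  # remaining training-set characters
    return high_freq, low_freq
```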
And S105, compressing the first training set according to the high-frequency words and the low-frequency words to obtain a third training set.
It can be understood that, since the first training set may contain millions of samples, training on it directly is computationally expensive and time-consuming; therefore, the first training set is compressed first.
Specifically, the compressing the first training set according to the high frequency words and the low frequency words to obtain a third training set includes:
acquiring each image data in the first training set;
calculating the total number of line text features across all the image data as a first numerical value;
for each high-frequency word, acquiring the number of image data containing the high-frequency word as a second numerical value corresponding to each high-frequency word;
calculating the quotient of a second numerical value corresponding to each high-frequency word and the first numerical value to obtain the weight corresponding to each high-frequency word;
acquiring a first coefficient configured in advance;
calculating the product of the weight corresponding to each high-frequency word, the first coefficient and the first numerical value to obtain the corpus extraction amount corresponding to each high-frequency word;
randomly extracting line text features from the first training set according to the corpus extraction amount corresponding to each high-frequency word;
randomly extracting a second preset number of line text features for each low-frequency word from the first training set;
and combining the extracted line text features to obtain the third training set.
The first coefficient may be configured according to experimental results. For example: if a large number of experiments show that the model trains best when the first coefficient is 0.1, the first coefficient can be configured as 0.1.
The second preset number can also be customized to guarantee the recognition rate of the low-frequency words. For example: the second preset number may be 200.
To illustrate the compression process further: the first training set may include multiple pieces of image data, and each piece of image data may include multiple lines of text, that is, multiple line text features. When the first training set includes 1 million line text features, the first numerical value is 1,000,000. For a high-frequency character X, when 500,000 images in the first training set include the character X, the second numerical value is 500,000. The quotient of the second value (500,000) and the first value (1,000,000) gives the importance weight of X as 0.5. With a first coefficient of 0.1, the corpus extraction amount of X is the product of the importance weight 0.5, the first coefficient 0.1, and the first value 1,000,000, that is, 0.5 × 0.1 × 1,000,000 = 50,000. Accordingly, 50,000 line text features containing the high-frequency character X are randomly extracted from the first training set. Meanwhile, to maintain the recognition rate of the low-frequency words, 200 line text features may be randomly drawn for each low-frequency word. The line text features extracted for the high-frequency and low-frequency words are combined to obtain the third training set.
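A sketch of this compression, following the numbers above. Note that the source counts image data containing the word when computing the second numerical value; this sketch counts line text features instead, a simplifying assumption:

```python
import random

def compress_training_set(first_train, high_freq, low_freq,
                          first_coeff=0.1, per_low_freq=200, seed=0):
    rng = random.Random(seed)
    total = len(first_train)  # first numerical value
    picked = set()
    for ch in high_freq:
        lines = [i for i, s in enumerate(first_train) if ch in s.label]
        weight = len(lines) / total                 # importance weight
        amount = int(weight * first_coeff * total)  # corpus extraction amount
        picked.update(rng.sample(lines, min(amount, len(lines))))
    for ch in low_freq:
        lines = [i for i, s in enumerate(first_train) if ch in s.label]
        picked.update(rng.sample(lines, min(per_low_freq, len(lines))))
    return [first_train[i] for i in sorted(picked)]  # third training set
```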
In this way, the million-scale first training set is effectively compressed, and the training efficiency of the model is improved markedly while the recognition rates of both high-frequency and low-frequency words are maintained.
And S106, combining the second training set and the third training set to obtain a fourth training set.
It can be understood that the second training set consists of specific natural-scene text data, while the third training set is a compressed version of the document-type text data that has abundant samples. Therefore, after the second and third training sets are combined, the resulting fourth training set not only contains sufficient training data but also preserves the training effect of the model, because the training samples were compressed according to the importance weights of the characters.
S107, training a preset recognition model by using the first training set and the first test set to obtain a first recognition model.
The preset recognition model may be any model with a character recognition function, such as a CRNN (Convolutional Recurrent Neural Network) model.
Specifically, the preset recognition model is trained by using the data in the first training set, the accuracy of the model obtained by training is tested by using the data in the first test set, and the model with the highest accuracy is selected as the first recognition model.
And S108, training the first recognition model by utilizing the fourth training set.
In this embodiment, when the first recognition model is trained with the fourth training set, an optimal hyper-parameter configuration can be determined from a large number of experimental results.
For example: the hyper-parameters of the model may be configured as follows: the learning rate is set to 1, the training batch size is set to 256, and the number of iterations is set to 100.
In this embodiment, the training the first recognition model by using the fourth training set includes:
starting an iteration after inputting the fourth training set to the first recognition model;
in each iteration process, acquiring the output probability of the current iteration model in real time;
calculating a CTC loss for the model of the current iteration based on the output probabilities;
acquiring a second coefficient and a third coefficient which are configured in advance;
calculating the difference between 1 and the output probability, and raising this difference to the power of N to obtain a fifth numerical value, where the value of N is the third coefficient;
calculating the product of the second coefficient, the fifth numerical value and the CTC loss to obtain the real-time loss of the current iterative model;
when the real-time loss reaches convergence, stopping the current iteration.
The CTC (Connectionist Temporal Classification) loss guides the training of the model.
The second coefficient and the third coefficient may also be configured by a user, for example: the second coefficient may be configured to be 1, and the third coefficient may be configured to be 2.
Through this configuration of the loss function, difficult samples are given greater weight, which improves the training effect of the model.
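A minimal sketch of this loss in PyTorch. The text does not state how the output probability is obtained; recovering it per sequence as exp(-CTC loss) is the usual focal-CTC trick and an assumption here, as are the coefficient values (second coefficient alpha = 1, third coefficient N = gamma = 2):

```python
import torch
import torch.nn.functional as F

def real_time_loss(log_probs, targets, input_lengths, target_lengths,
                   alpha=1.0, gamma=2.0):
    # Per-sample CTC loss; log_probs has shape (T, batch, num_classes)
    # and holds log-softmax outputs of the recognition model.
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     reduction="none", zero_infinity=True)
    p = torch.exp(-ctc)  # assumed output probability of the current model
    # second coefficient * (1 - output probability)^N * CTC loss
    return (alpha * (1.0 - p) ** gamma * ctc).mean()
```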
And S109, in the training process of the first recognition model, testing the model of each iteration by using the first test set and the second test set respectively to obtain a test result.
In this embodiment, the first test set and the second test set are used to test the model of each iteration, which ensures that the trained model is applicable both to the scene corresponding to the first type of image data (e.g., document text) and to the scene corresponding to the second type of image data (e.g., natural-scene text), thereby improving the generalization capability of the model.
And S110, selecting a target recognition model from the models of each iteration according to the test result.
Specifically, the selecting a target recognition model from the models of each iteration according to the test result includes:
obtaining, according to the test result, the model with the highest accuracy among the models of each iteration as the target recognition model.
In the above embodiment, the model with the highest accuracy on the test results of the first test set and the second test set is taken as the final text recognition model, which ensures the accuracy of model recognition.
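A sketch of the selection step; the text does not specify how accuracies on the two test sets are combined, so averaging them is an assumption:

```python
def select_target_model(checkpoints, evaluate, first_test, second_test):
    # checkpoints: model saved after each iteration
    # evaluate(model, data) -> accuracy in [0, 1]
    best_model, best_score = None, -1.0
    for model in checkpoints:
        score = (evaluate(model, first_test) +
                 evaluate(model, second_test)) / 2
        if score > best_score:
            best_model, best_score = model, score
    return best_model  # target recognition model
```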
This embodiment trains a text recognition model with strong generalization capability on different types of image data, so that data from different scenes can be trained jointly, reducing labeling cost compared with training a model on a single scene. The training data is further compressed by word frequency, which reduces the computational cost of model training while preserving model accuracy and improves the efficiency of model training and optimization.
As shown in fig. 4, the text recognition method uses the target recognition model obtained by training with the text recognition model training method described in S101-S110, and includes steps S201-S202.
S201, acquiring an image to be identified.
The image to be recognized may be an image captured by a tool such as a scanner or a camera, which is not limited in the present application.
S202, performing text recognition on the image to be recognized by using the target recognition model to obtain a recognition result.
In this embodiment, text recognition is performed with the target recognition model obtained by training with the text recognition model training method of S101-S110, which ensures the accuracy of text recognition.
The embodiment of the application also provides a text recognition model training device, which is used for executing any embodiment of the text recognition model training method. Specifically, referring to fig. 5, fig. 5 is a schematic block diagram of a text recognition model training apparatus 100 according to an embodiment of the present application.
As shown in fig. 5, the text recognition model training apparatus 100 includes a preprocessing unit 101, a splitting unit 102, a detecting unit 103, a compressing unit 104, a combining unit 105, a training unit 106, a testing unit 107, and a selecting unit 108.
The preprocessing unit 101 is configured to acquire first-type image data and second-type image data, preprocess the first-type image data to obtain a first image data set, and preprocess the second-type image data to obtain a second image data set.
In this embodiment, the first type of image data may be document-type text data such as scanned contracts; the second type of image data may be natural-scene text data containing characters and shot with an electronic device such as a mobile phone, for example, a license plate image taken with a mobile phone or a company plaque image taken with a tablet computer.
Obviously, the first type image data is easier to acquire and thus has a larger data amount, and the second type image data is harder to acquire and thus has a smaller data amount.
Namely: the data amount of the first type image data is larger than the data amount of the second type image data.
For example: when the first type of image data is scanned contracts, the data amount may reach 8 million lines of text, while when the second type of image data is company plaque images shot with a mobile phone, the data amount may be only 10,000 lines of text.
In this embodiment, in order to train a model using the first type image data and the second type image data, the first type image data and the second type image data need to be preprocessed into usable data forms.
Specifically, the preprocessing unit 101 performs preprocessing on the first type image data to obtain a first image data set, including:
detecting the line text region of each piece of first-type image data with a text detection model, and cropping the detected line text regions to obtain the line text features of each piece of first-type image data;
labeling the line text features of each piece of first-type image data to obtain the characters included in each line text feature;
and combining the marked first type image data to obtain the first image data set.
For example, referring to fig. 3, when the first type of image data is a scanned menu, a text detection model detects the line text regions of the menu and crops them, and the cropped line text features are then labeled; for example, the line image containing the text "confectionary" is labeled with the characters "confectionary". All the labeled line text features are then combined to obtain the first image data set.
The text detection model may be any model with a text detection function, such as the DBNet text detection model.
In this embodiment, a manner of preprocessing the second type image data to obtain the second image data set is similar to a manner of preprocessing the first type image data to obtain the first image data set, and is not described herein again.
The splitting unit 102 is configured to obtain a pre-constructed dictionary, and split the first image data set by using the dictionary to obtain a first training set and a first test set.
In this embodiment, the characters in the dictionary may include the Chinese characters, English characters, symbols, and the like that are used in daily life; the characters that the subsequent models can recognize are exactly the characters in the dictionary.
For example: the dictionary may be a 6,000-character dictionary.
In this embodiment, the splitting unit 102 splits the first image data set by using the dictionary to obtain a first training set and a first test set, including:
acquiring each character in the dictionary;
detecting the line texts containing each character in the first image data set to obtain the line text features corresponding to each character;
extracting a first preset number of line text features from the line text features corresponding to each character to construct the first test set;
constructing the first training set using the remaining data in the first image dataset other than the first test set.
Wherein the first predetermined number may be configured by a user, such as 50.
For example: when the first image data set contains 8 million lines of text features, 50 line text features containing each character are extracted from those 8 million lines; the extracted line text features form the first test set, and the remaining line text features form the first training set.
The splitting unit 102 is further configured to split the second image data set according to a configuration ratio to obtain a second training set and a second test set.
The configuration ratio can be customized, for example 3:7.
For example: when the configuration ratio is 3:7, the data in the second image data set can be randomly split 3:7, with 30% of the data used as the second test set and 70% used as the second training set.
The detecting unit 103 is configured to detect a high frequency word and a low frequency word in the first training set.
It should be noted that, because the first training set contains a large amount of data, the training data can be compressed according to word frequency to reduce the data volume and the time consumed by model training, realizing compressed training of the model.
Therefore, the high frequency words and the low frequency words in the first training set need to be detected first.
Specifically, the detecting unit 103 detects the high frequency words and the low frequency words in the first training set, including:
acquiring the total number of all characters in the dictionary;
calculating the product of the total quantity and a preset value to obtain a target quantity;
calculating the occurrence frequency of each character in the first training set;
extracting the target number of characters from the characters of the first training set, in descending order of occurrence frequency, as the high-frequency words;
and determining the remaining characters of the first training set, other than the high-frequency words, as the low-frequency words.
The preset value may be configured by a user, for example, 50%.
The compressing unit 104 is configured to compress the first training set according to the high frequency words and the low frequency words to obtain a third training set.
It can be understood that, since the first training set may contain millions of samples, training on it directly is computationally expensive and time-consuming; therefore, the first training set is compressed first.
Specifically, the compressing unit 104 compresses the first training set according to the high frequency words and the low frequency words to obtain a third training set, which includes:
acquiring each image data in the first training set;
calculating the total number of line text features across all the image data as a first numerical value;
for each high-frequency word, acquiring the number of image data containing the high-frequency word as a second numerical value corresponding to each high-frequency word;
calculating the quotient of the second numerical value corresponding to each high-frequency word and the first numerical value to obtain the weight corresponding to each high-frequency word;
acquiring a first coefficient configured in advance;
calculating the product of the weight corresponding to each high-frequency word, the first coefficient and the first numerical value to obtain the corpus extraction amount corresponding to each high-frequency word;
randomly extracting line text features from the first training set according to the corpus extraction amount corresponding to each high-frequency word;
randomly extracting a second preset number of line text features for each low-frequency word from the first training set;
and combining the extracted line text features to obtain the third training set.
The first coefficient may be configured according to experimental results. For example: if a large number of experiments show that the model trains best when the first coefficient is 0.1, the first coefficient can be configured as 0.1.
The second preset number can also be customized to guarantee the recognition rate of the low-frequency words. For example: the second preset number may be 200.
To illustrate the compression process further: the first training set may include multiple pieces of image data, and each piece of image data may include multiple lines of text, that is, multiple line text features. When the first training set includes 1 million line text features, the first numerical value is 1,000,000. For a high-frequency character X, when 500,000 images in the first training set include the character X, the second numerical value is 500,000. The quotient of the second value (500,000) and the first value (1,000,000) gives the importance weight of X as 0.5. With a first coefficient of 0.1, the corpus extraction amount of X is the product of the importance weight 0.5, the first coefficient 0.1, and the first value 1,000,000, that is, 0.5 × 0.1 × 1,000,000 = 50,000. Accordingly, 50,000 line text features containing the high-frequency character X are randomly extracted from the first training set. Meanwhile, to maintain the recognition rate of the low-frequency words, 200 line text features may be randomly drawn for each low-frequency word. The line text features extracted for the high-frequency and low-frequency words are combined to obtain the third training set.
In this way, the million-scale first training set is effectively compressed, and the training efficiency of the model is improved markedly while the recognition rates of both high-frequency and low-frequency words are maintained.
The combining unit 105 is configured to combine the second training set and the third training set to obtain a fourth training set.
It can be understood that the second training set consists of specific natural-scene text data, while the third training set is a compressed version of the document-type text data that has abundant samples. Therefore, after the second and third training sets are combined, the resulting fourth training set not only contains sufficient training data but also preserves the training effect of the model, because the training samples were compressed according to the importance weights of the characters.
The training unit 106 is configured to train a preset recognition model by using the first training set and the first test set to obtain a first recognition model.
The preset recognition model may be any model with a character recognition function, such as a CRNN (Convolutional Recurrent Neural Network) model.
Specifically, the preset recognition model is trained by using the data in the first training set, the accuracy of the model obtained by training is tested by using the data in the first test set, and the model with the highest accuracy is selected as the first recognition model.
The training unit 106 is further configured to train the first recognition model by using the fourth training set.
In this embodiment, when the first recognition model is trained with the fourth training set, an optimal hyper-parameter configuration can be determined from a large number of experimental results.
For example: the hyper-parameters of the model may be configured as follows: the learning rate is set to 1, the training batch size is set to 256, and the number of iterations is set to 100.
In this embodiment, the training the first recognition model by using the fourth training set includes:
starting an iteration after inputting the fourth training set to the first recognition model;
in each iteration process, acquiring the output probability of the current iteration model in real time;
calculating a CTC loss for the model of the current iteration based on the output probabilities;
acquiring a second coefficient and a third coefficient which are configured in advance;
calculating the difference between 1 and the output probability, and raising this difference to the power of N to obtain a fifth numerical value, where the value of N is the third coefficient;
calculating the product of the second coefficient, the fifth numerical value and the CTC loss to obtain the real-time loss of the current iterative model;
when the real-time loss reaches convergence, stopping the current iteration.
The CTC (Connectionist Temporal Classification) loss guides the training of the model.
The second coefficient and the third coefficient may also be configured by a user, for example: the second coefficient may be configured to be 1, and the third coefficient may be configured to be 2.
Through this configuration of the loss function, difficult samples are given greater weight, which improves the training effect of the model.
The testing unit 107 is configured to test the model of each iteration by using the first test set and the second test set respectively in the training process of the first recognition model, so as to obtain a test result.
In this embodiment, the first test set and the second test set are used to test the model of each iteration, which ensures that the trained model is applicable both to the scene corresponding to the first type of image data (e.g., document text) and to the scene corresponding to the second type of image data (e.g., natural-scene text), thereby improving the generalization capability of the model.
The selecting unit 108 is configured to select a target recognition model from the models of each iteration according to the test result.
Specifically, the selecting unit 108 selects a target recognition model from the models of each iteration according to the test result, including:
obtaining, according to the test result, the model with the highest accuracy among the models of each iteration as the target recognition model.
In the above embodiment, the model with the highest accuracy on the test results of the first test set and the second test set is taken as the final text recognition model, which ensures the accuracy of model recognition.
This embodiment trains a text recognition model with strong generalization capability on different types of image data, so that data from different scenes can be trained jointly, reducing labeling cost compared with training a model on a single scene. The training data is further compressed by word frequency, which reduces the computational cost of model training while preserving model accuracy and improves the efficiency of model training and optimization.
Embodiments of the present application further provide a text recognition apparatus, where the text recognition apparatus is configured to execute any of the foregoing text recognition methods. Specifically, referring to fig. 6, fig. 6 is a schematic block diagram of a text recognition apparatus 200 according to an embodiment of the present application.
As shown in fig. 6, the text recognition apparatus 200 performs text recognition using the target recognition model trained by the text recognition model training apparatus 100. The text recognition apparatus 200 includes an acquisition module 201 and a recognition module 202.
The acquiring module 201 is configured to acquire an image to be identified.
The image to be recognized may be an image captured by a tool such as a scanner or a camera, which is not limited in the present application.
The recognition module 202 is configured to perform text recognition on the image to be recognized by using the target recognition model to obtain a recognition result.
In this embodiment, the target recognition model trained by the text recognition model training device 100 is used to perform text recognition, so as to ensure the accuracy of text recognition.
The text recognition model training means and/or the text recognition means may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 7.
Referring to fig. 7, fig. 7 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server or a server cluster. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Referring to fig. 7, the computer apparatus 500 includes a processor 502, a memory, which may include a storage medium 503 and an internal memory 504, and a network interface 505 connected by a device bus 501.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a text recognition model training method and/or a text recognition method.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the computer program 5032 in the storage medium 503 to run, and when the computer program 5032 is executed by the processor 502, the processor 502 may perform a text recognition model training method and/or a text recognition method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 7 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation of the computer device 500 to which the present application may be applied, and that a particular computer device 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the text recognition model training method and/or the text recognition method disclosed in the embodiment of the present application.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 7 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 7, and are not described herein again.
It should be understood that, in the embodiments of the present application, the processor 502 may be a Central Processing Unit (CPU), and the processor 502 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be a nonvolatile computer-readable storage medium or a volatile computer-readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the text recognition model training method and/or the text recognition method disclosed in the embodiments of the present application.
It should be noted that all the data involved in the present application are legally acquired.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions when the actual implementation is performed, or units having the same function may be grouped into one unit, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the elements may be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a backend server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A text recognition model training method is characterized by comprising the following steps:
acquiring first type image data and second type image data, preprocessing the first type image data to obtain a first image data set, and preprocessing the second type image data to obtain a second image data set; wherein a data amount of the first type image data is larger than a data amount of the second type image data;
acquiring a pre-constructed dictionary, and splitting the first image data set by using the dictionary to obtain a first training set and a first test set;
splitting the second image data set according to a configured proportion to obtain a second training set and a second test set;
detecting high-frequency words and low-frequency words in the first training set;
compressing the first training set according to the high-frequency words and the low-frequency words to obtain a third training set;
combining the second training set and the third training set to obtain a fourth training set;
training a preset recognition model by using the first training set and the first test set to obtain a first recognition model;
training the first recognition model using the fourth training set;
in the training process of the first recognition model, testing the model of each iteration by using the first test set and the second test set respectively to obtain a test result;
and selecting a target recognition model from the models of each iteration according to the test result.
2. The method for training the text recognition model according to claim 1, wherein the preprocessing the first type of image data to obtain a first image data set comprises:
detecting a line text region of each first type of image data by using a text detection model, and cropping the detected line text regions to obtain the line text features of each first type of image data;
labeling the line text features of each first type of image data to obtain the characters included in each line text feature;
and combining the labeled first type image data to obtain the first image data set.
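For illustration only, the preprocessing in claim 2 could be sketched as follows in Python. Here `detect_lines` (any line-level text detector returning (x1, y1, x2, y2) boxes) and `annotate` (the labeling step) are hypothetical callables, and images are assumed to be numpy-style arrays; none of these names come from the claim itself:

```python
def preprocess_first_type(images, detect_lines, annotate):
    """Detect, crop and label line text regions (claim 2), yielding an image data set."""
    dataset = []
    for img in images:
        for (x1, y1, x2, y2) in detect_lines(img):   # line text regions
            crop = img[y1:y2, x1:x2]                 # crop out one line text feature
            label = annotate(crop)                   # characters contained in the line
            dataset.append((crop, label))
    return dataset                                   # list of (image, label) line samples
```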
3. The method for training the text recognition model according to claim 1, wherein the splitting the first image data set by using the dictionary to obtain a first training set and a first testing set comprises:
acquiring each character in the dictionary;
detecting the line texts containing each character in the first image data set to obtain the line text features corresponding to each character;
extracting a first preset number of line text features from the line text features corresponding to each character to construct the first test set;
and constructing the first training set from the remaining data in the first image data set other than the first test set.
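A minimal sketch of the dictionary-driven split in claim 3, assuming `samples` is a list of (image, label) line samples as produced above; `first_preset` and `seed` are illustrative stand-ins for the first preset number and the sampling randomness:

```python
import random
from collections import defaultdict

def split_by_dictionary(dictionary, samples, first_preset=2, seed=0):
    """Reserve up to `first_preset` line text features containing each dictionary
    character for the first test set; the rest becomes the first training set."""
    rng = random.Random(seed)
    by_char = defaultdict(list)                      # character -> indices of lines containing it
    for idx, (_, label) in enumerate(samples):
        for ch in set(label):
            if ch in dictionary:
                by_char[ch].append(idx)

    test_idx = set()
    for ch in dictionary:
        candidates = [i for i in by_char.get(ch, []) if i not in test_idx]
        test_idx.update(rng.sample(candidates, min(first_preset, len(candidates))))

    first_test = [samples[i] for i in sorted(test_idx)]
    first_train = [s for i, s in enumerate(samples) if i not in test_idx]
    return first_train, first_test
```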
4. The method for training the text recognition model according to claim 1, wherein the detecting the high-frequency words and the low-frequency words in the first training set comprises:
acquiring the total number of all characters in the dictionary;
calculating the product of the total number and a preset value to obtain a target number;
calculating the occurrence frequency of each character in the first training set;
extracting the target number of characters from the characters of the first training set as the high-frequency words, in descending order of occurrence frequency;
and determining the remaining characters in the first training set, other than the high-frequency words, as the low-frequency words.
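Claim 4 amounts to ranking the characters of the first training set by frequency and taking the top total_number × preset_value of them as high-frequency words. A sketch, with `preset_ratio` as a stand-in for the preset value:

```python
from collections import Counter

def split_high_low_frequency(dictionary, training_labels, preset_ratio=0.1):
    """Split the characters of the first training set into high- and low-frequency words."""
    target_number = int(len(dictionary) * preset_ratio)    # product of total number and preset value
    freq = Counter(ch for label in training_labels for ch in label if ch in dictionary)
    ranked = [ch for ch, _ in freq.most_common()]          # descending occurrence frequency
    high_freq = set(ranked[:target_number])
    low_freq = set(ranked) - high_freq                     # the remaining characters of the set
    return high_freq, low_freq
```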
5. The method of claim 1, wherein the compressing the first training set according to the high frequency words and the low frequency words to obtain a third training set comprises:
acquiring each image data in the first training set;
calculating the total number of line text features in the image data as a first numerical value;
for each high-frequency word, acquiring the number of image data items containing the high-frequency word as a second numerical value corresponding to that high-frequency word;
calculating the quotient of the second numerical value corresponding to each high-frequency word and the first numerical value to obtain the weight corresponding to each high-frequency word;
acquiring a first coefficient configured in advance;
calculating the product of the weight corresponding to each high-frequency word, the first coefficient and the first numerical value to obtain the corpus extraction amount corresponding to each high-frequency word;
randomly extracting line text features from the first training set according to the corpus extraction amount corresponding to each high-frequency word;
randomly extracting a second preset number of line text features for each low-frequency word from the first training set;
and combining the extracted line text features to obtain the third training set.
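The compression in claim 5 then draws, for each high-frequency word, weight × first_coefficient × total lines (where weight = lines containing the word / total lines), plus a fixed quota per low-frequency word. A sketch under the reading that each line text feature counts as one image data item; the default values are illustrative only:

```python
import random
from collections import defaultdict

def compress_training_set(train_samples, high_freq, low_freq,
                          first_coefficient=0.3, second_preset=50, seed=0):
    """Draw a re-balanced third training set from the first training set (claim 5)."""
    rng = random.Random(seed)
    total = len(train_samples)                       # first numerical value
    by_char = defaultdict(list)
    for sample in train_samples:
        _, label = sample
        for ch in set(label):
            by_char[ch].append(sample)

    picked = []
    for ch in high_freq:
        lines = by_char.get(ch, [])
        weight = len(lines) / total                  # second numerical value / first numerical value
        quota = int(weight * first_coefficient * total)   # corpus extraction amount
        picked.extend(rng.sample(lines, min(quota, len(lines))))
    for ch in low_freq:
        lines = by_char.get(ch, [])
        picked.extend(rng.sample(lines, min(second_preset, len(lines))))

    # The same line may be drawn for several words; keep one copy of each.
    seen, third_training_set = set(), []
    for s in picked:
        if id(s) not in seen:
            seen.add(id(s))
            third_training_set.append(s)
    return third_training_set
```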
6. The method of training a text recognition model according to claim 1, wherein the training the first recognition model using the fourth training set comprises:
starting iterative training after inputting the fourth training set into the first recognition model;
during each iteration, acquiring the output probability of the model at the current iteration in real time;
calculating a CTC loss for the model at the current iteration based on the output probability;
acquiring a second coefficient and a third coefficient which are configured in advance;
calculating the difference between 1 and the output probability, and raising the difference to the power of N to obtain a fifth numerical value, wherein the value of N is the third coefficient;
calculating the product of the second coefficient, the fifth numerical value and the CTC loss to obtain the real-time loss of the model at the current iteration;
and stopping the iteration when the real-time loss converges.
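The real-time loss in claim 6 is a focal-style reweighting of the CTC loss, i.e. loss = a · (1 − p)^N · CTC with a the second coefficient and N the third, which down-weights samples the model already predicts with high probability. A PyTorch sketch; taking p = exp(−CTC) as the output probability is our reading, since the claim does not spell out how p is obtained:

```python
import torch
import torch.nn.functional as F

def real_time_loss(log_probs, targets, input_lengths, target_lengths,
                   second_coefficient=1.0, third_coefficient=2.0, blank=0):
    """Focal-style reweighting of CTC (claim 6). `log_probs` has shape
    (T, batch, classes) and must already be log-softmaxed."""
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=blank, reduction="none", zero_infinity=True)
    p = torch.exp(-ctc)                              # sequence probability implied by the CTC loss
    fifth_value = (1.0 - p) ** third_coefficient     # (1 - p)^N
    loss = second_coefficient * fifth_value * ctc    # second coefficient * fifth value * CTC loss
    return loss.mean()
```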
7. The method for training the text recognition model according to claim 1, wherein the selecting the target recognition model from the models of each iteration according to the test result comprises:
and obtaining a model with the highest accuracy from the models of each iteration according to the test result to serve as the target recognition model.
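Taken together, claims 1 to 7 describe a single pipeline: pre-train on the large generic set, re-balance it, fine-tune on the merged set, and keep the best-testing snapshot. A high-level orchestration sketch reusing the helpers above; `pretrain`, `finetune` (yielding one model snapshot per iteration) and `evaluate` (returning an accuracy) are hypothetical stand-ins for the training and test steps:

```python
def train_text_recognizer(first_images, second_images, dictionary,
                          detect_lines, annotate,
                          pretrain, finetune, evaluate, split_ratio=0.9):
    """End-to-end orchestration of claims 1-7 using the helpers sketched above."""
    first_set = preprocess_first_type(first_images, detect_lines, annotate)
    second_set = preprocess_first_type(second_images, detect_lines, annotate)

    train1, test1 = split_by_dictionary(dictionary, first_set)          # claim 3
    cut = int(len(second_set) * split_ratio)                            # configured proportion
    train2, test2 = second_set[:cut], second_set[cut:]

    high, low = split_high_low_frequency(dictionary, [lbl for _, lbl in train1])  # claim 4
    train3 = compress_training_set(train1, high, low)                   # claim 5
    train4 = train2 + train3                                            # fourth training set

    model = pretrain(train1, test1)              # first recognition model
    best_model, best_acc = model, -1.0
    for snapshot in finetune(model, train4):     # claim 6 supplies the loss
        # Test each iteration on both test sets (claim 1); averaging the two
        # accuracies is one possible reading of "highest accuracy" (claim 7).
        acc = 0.5 * (evaluate(snapshot, test1) + evaluate(snapshot, test2))
        if acc > best_acc:
            best_model, best_acc = snapshot, acc
    return best_model                            # target recognition model
```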
8. A text recognition method, wherein a target recognition model is obtained by training with the text recognition model training method according to any one of claims 1 to 7, and the method comprises:
acquiring an image to be identified;
and performing text recognition on the image to be recognized by using the target recognition model to obtain a recognition result.
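Claim 8 is plain inference with the selected model. A usage sketch; since the model is trained on line crops, a line detector is applied first, and `predict` is a hypothetical inference method of the trained model:

```python
def recognize_image(target_model, image, detect_lines):
    """Detect line regions, then run the target recognition model on each crop."""
    results = []
    for (x1, y1, x2, y2) in detect_lines(image):
        results.append(target_model.predict(image[y1:y2, x1:x2]))
    return results            # recognition result, one string per detected line
```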
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the text recognition model training method according to any one of claims 1 to 7 and/or the text recognition method according to claim 8 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the text recognition model training method according to any one of claims 1 to 7 and/or the text recognition method according to claim 8.
CN202210959313.4A 2022-08-11 2022-08-11 Text recognition model training method, text recognition device, and medium Active CN115035510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210959313.4A CN115035510B (en) 2022-08-11 2022-08-11 Text recognition model training method, text recognition device, and medium

Publications (2)

Publication Number Publication Date
CN115035510A true CN115035510A (en) 2022-09-09
CN115035510B CN115035510B (en) 2022-11-15

Family

ID=83131114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210959313.4A Active CN115035510B (en) 2022-08-11 2022-08-11 Text recognition model training method, text recognition device, and medium

Country Status (1)

Country Link
CN (1) CN115035510B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543398B1 (en) * 2012-02-29 2013-09-24 Google Inc. Training an automatic speech recognition system using compressed word frequencies
CN108564077A (en) * 2018-04-03 2018-09-21 哈尔滨哈船智控科技有限责任公司 A deep-learning-based method for detecting and recognizing digits in videos or pictures
CN109977957A (en) * 2019-03-04 2019-07-05 苏宁易购集团股份有限公司 An invoice recognition method and system based on deep learning
CN111507349A (en) * 2020-04-15 2020-08-07 深源恒际科技有限公司 Dynamic data enhancement method in OCR (optical character recognition) model training
CN112418304A (en) * 2020-11-19 2021-02-26 北京云从科技有限公司 OCR (optical character recognition) model training method, system and device
CN113705554A (en) * 2021-08-13 2021-11-26 北京百度网讯科技有限公司 Training method, device and equipment of image recognition model and storage medium
CN113836303A (en) * 2021-09-26 2021-12-24 平安科技(深圳)有限公司 Text type identification method and device, computer equipment and medium
CN113936195A (en) * 2021-12-16 2022-01-14 云账户技术(天津)有限公司 Sensitive image recognition model training method and device and electronic equipment
CN114706943A (en) * 2022-03-17 2022-07-05 网易(杭州)网络有限公司 Intention recognition method, apparatus, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUO Long et al.: "Basic image features for script identification of text images", Journal of Applied Sciences (《应用科学学报》) *

Also Published As

Publication number Publication date
CN115035510B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN107193974B (en) Regional information determination method and device based on artificial intelligence
CN110941978B (en) Face clustering method and device for unidentified personnel and storage medium
CN110909165A (en) Data processing method, device, medium and electronic equipment
CN111241217B (en) Data processing method, device and system
CN112163637B (en) Image classification model training method and device based on unbalanced data
CN110390084A (en) Text duplicate checking method, apparatus, equipment and storage medium
CN112732893B (en) Text information extraction method and device, storage medium and electronic equipment
CN110046251A (en) Community content methods of risk assessment and device
CN114880505A (en) Image retrieval method, device and computer program product
CN110059212A (en) Image search method, device, equipment and computer readable storage medium
CN110968664A (en) Document retrieval method, device, equipment and medium
CN113538070A (en) User life value cycle detection method and device and computer equipment
CN115035510B (en) Text recognition model training method, text recognition device, and medium
CN109165572B (en) Method and apparatus for generating information
CN114996360A (en) Data analysis method, system, readable storage medium and computer equipment
CN115017256A (en) Power data processing method and device, electronic equipment and storage medium
CN113627542A (en) Event information processing method, server and storage medium
CN111813975A (en) Image retrieval method and device and electronic equipment
CN110909737A (en) Picture character recognition method and system
CN111311197A (en) Travel data processing method and device
CN114626340B (en) Behavior feature extraction method based on mobile phone signaling and related device
CN116187299B (en) Scientific and technological project text data verification and evaluation method, system and medium
CN115237739B (en) Analysis method, device and equipment for board card running environment and readable storage medium
CN117033548A (en) Data retrieval method, device, computer equipment and medium for defect analysis
CN114282531A (en) Question detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant