WO2023173546A1 - Method and apparatus for training text recognition model, and computer device and storage medium - Google Patents

Method and apparatus for training text recognition model, and computer device and storage medium

Info

Publication number
WO2023173546A1
WO2023173546A1 · PCT/CN2022/090160 · CN2022090160W
Authority
WO
WIPO (PCT)
Prior art keywords
training
image
text
neural network
images
Prior art date
Application number
PCT/CN2022/090160
Other languages
French (fr)
Chinese (zh)
Inventor
郑喜民
朱翌
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023173546A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • This application relates to the technical field of natural language processing of artificial intelligence technology. Specifically, this application relates to a training method, device, computer equipment and storage medium for a text recognition model.
  • Text recognition requires certain image processing to identify the text content in the image.
  • Text recognition can be used in many fields, such as sorting letters and packages, editing and proofreading manuscripts, summarizing and analyzing large numbers of statistical reports and cards, processing bank checks, statistically summarizing commodity invoices, identifying commodity codes, managing commodity warehouses, document retrieval, recognizing various certificates, and office automation for financial bill processing, making it convenient for users to enter information quickly and improving work efficiency in all walks of life.
  • CRNN Convolutional Recurrent Neural Network
  • This model first uses a Convolutional Neural Network (CNN) to extract a feature sequence from the input image, then uses a Recurrent Neural Network (RNN) to predict the label distribution of the feature sequence obtained from the convolutional layer, and finally introduces Connectionist Temporal Classification (CTC) to convert the label distribution obtained from the recurrent layer into the final recognition result through operations such as deduplication and integration (a minimal sketch of such a pipeline is given below).
  • CNN Convolutional Neural Networks
  • RNN Recurrent Neural Networks
  • CTC Connectionist Temporal Classification
  • The performance of the convolutional neural network, however, is highly dependent on the training data: the more diverse the training data and the larger its volume, the better the trained model tends to perform, whereas when the amount of training data is small, the recognition accuracy of the trained text recognition model is low.
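  • As an illustration only, a minimal CRNN-style pipeline of the kind described above might be sketched as follows (a sketch assuming PyTorch; the layer sizes and structure are assumptions, not the specific network of this application):

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN sketch: CNN feature extractor -> BiLSTM -> per-timestep logits for CTC."""
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(                      # extracts a feature sequence from the image
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_h = img_height // 4                       # height after two 2x2 poolings
        self.rnn = nn.LSTM(128 * feat_h, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)          # num_classes includes the CTC "blank" symbol

    def forward(self, x):                              # x: (batch, 1, H, W)
        f = self.cnn(x)                                # (batch, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h) # one feature vector per horizontal step
        seq, _ = self.rnn(f)                           # label distribution over time steps
        return self.fc(seq)                            # decoded downstream with CTC (deduplication/merging)

# CTC loss ties the per-timestep label distribution to the target text during training
ctc_loss = nn.CTCLoss(blank=0)
```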
  • the main purpose of this application is to provide a training method, device, computer equipment and storage medium for a text recognition model, so as to increase the amount of training data and thereby improve the recognition accuracy of the text recognition model.
  • this application provides a training method for a text recognition model, which includes:
  • This application also provides a training device for a text recognition model, which includes:
  • the acquisition module is used to acquire the first image containing text information
  • An amplification processing module used to perform random amplification processing on the first image to obtain multiple second images
  • a marking module used to mark the first image and the plurality of second images as reference images
  • a calculation module used to obtain the text features of the text information in each of the reference images, and calculate the similarity of the text features of each two of the reference images
  • An input module configured to use the two reference images with a similarity greater than a preset similarity threshold as a reference image pair, and input the reference image pair into the neural network model for training;
  • a judgment module used to obtain the training results after the neural network model is trained, and judge whether the training results meet the requirements
  • a determination module configured to use the trained neural network model as a text recognition model when it is determined that the training results meet the requirements.
  • This application also provides a computer device, including a memory and a processor.
  • the memory stores a computer program.
  • When the processor executes the computer program, it implements a training method for a text recognition model, wherein the method includes the following steps:
  • This application also provides a computer-readable storage medium.
  • a computer program is stored on the computer-readable storage medium.
  • When the computer program is executed by a processor, it implements a training method for a text recognition model, wherein the method includes the following steps:
  • the training method, device, computer equipment and storage medium of a text recognition model provided by this application can improve the recognition accuracy of the text recognition model.
  • Figure 1 is a schematic flowchart of a training method for a text recognition model according to an embodiment of the present application
  • Figure 2 is a schematic structural block diagram of a text recognition model training device according to an embodiment of the present application
  • FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • This application proposes a training method for a text recognition model.
  • the embodiments of this application can acquire and process relevant data based on artificial intelligence technology.
  • Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, etc.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometric technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the text recognition model training method proposed in this application uses a server as the execution subject.
  • The server can be an independent server, or it can be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms.
  • CDN Content Delivery Network
  • the training method of the text recognition model includes:
  • the object recognized by the text recognition model of the present application is an image containing text information, and the text information in the image is recognized to realize the text recognition function of the image.
  • the first image with text information obtained in this embodiment can be an image uploaded by the user.
  • For example, the user can obtain it by scanning a paper document or other media containing text information, or it can be a screenshot of mobile phone screen content, etc.
  • After acquiring the first image containing text information, the first image can also be pre-processed, for example by adjusting image parameters such as its size, brightness and sharpness.
  • In addition, the first image is usually in color with multiple colors, and the characters of the text information are mostly colors with relatively dark brightness values.
  • To make it easier to extract each character of the text information, the first image can be binarized using a brightness value as the standard and converted into a black-and-white image, highlighting the text information and avoiding color interference in the first image.
  • Specifically, the server obtains the color brightness values in the first image and compares them with a preset color brightness value to obtain a comparison result, which indicates whether each color brightness value in the first image is greater than, equal to, or less than the preset value.
  • According to the comparison result, the parts of the first image whose color brightness value is greater than the preset color brightness value are converted to white and the rest to black, which makes it easier to extract each character of the text information in the first image.
  • the preset color brightness value can be adjusted as needed.
  • The server in this embodiment can also judge the background color of the first image and convert a first image with a black background and white text information into one with a white background and black text information; that is, an image with white text on a black background is converted into an image with black text on a white background.
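  • A minimal sketch of this brightness-based binarization and black-background inversion (assuming OpenCV and NumPy; the preset brightness value of 128 is only an illustrative choice):

```python
import cv2
import numpy as np

def binarize_for_text(img_bgr: np.ndarray, preset_brightness: int = 128) -> np.ndarray:
    """Convert a colour image to black text on a white background using a brightness threshold."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)            # brightness value per pixel
    # pixels brighter than the preset value become white (255), the rest black (0)
    _, bw = cv2.threshold(gray, preset_brightness, 255, cv2.THRESH_BINARY)
    # if the result is mostly black, the background is black (white text on black): invert it
    if np.mean(bw) < 127:
        bw = cv2.bitwise_not(bw)
    return bw
```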
  • Random amplification processing is a method of expanding data: it can increase the number of samples in the training set, effectively alleviate model overfitting, and give the model stronger generalization ability.
  • The purpose of random amplification processing is to make the training data as close as possible to the test data, thereby improving prediction accuracy.
  • random amplification processing can force the network to learn more robust features, thereby giving the model stronger generalization capabilities.
  • This embodiment performs random amplification processing on the first image, for example enlarging, reducing, cropping, adjusting brightness or adjusting saturation; a single random amplification method can be used, or multiple methods can be combined, finally yielding multiple second images.
  • The image amplification technology of this embodiment has a positive effect on target detection in deep learning: it can increase the amount of data in each category, keep the categories balanced, avoid the overfitting problems caused by sample imbalance, and also reduce, to a certain extent, the amount of data that must be collected in the early sampling stage.
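  • A minimal sketch of generating several second images from one first image by randomly combining such operations (assuming torchvision; the particular transforms and parameter ranges are illustrative assumptions):

```python
from PIL import Image
from torchvision import transforms

# each second image applies a random combination of cropping/scaling and brightness/saturation changes
augment = transforms.Compose([
    transforms.RandomApply([transforms.RandomResizedCrop(size=(32, 128), scale=(0.8, 1.0))], p=0.7),
    transforms.ColorJitter(brightness=0.3, saturation=0.3),
    transforms.RandomHorizontalFlip(p=0.2),
])

def make_second_images(first_image: Image.Image, n: int = 8) -> list:
    """Randomly amplify one first image into n second images."""
    return [augment(first_image) for _ in range(n)]
```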
  • This embodiment marks the first image and the multiple second images as reference images, generates a data set including all reference images, and then obtains the text features of the text information in each reference image from the data set and calculates the similarity of the text features of every two reference images.
  • Specifically, the text position information of the text information can be identified in the reference image, the reference image can be corrected according to the text position information to obtain a corrected reference image, and the encoding network of the recognition model can be used to perform feature extraction on the text information of the corrected reference image to obtain the text features.
  • Then, based on the word features contained in the text features of every two reference images, a vector space model for calculating the similarity between their text features is constructed; the word features of each reference image are represented as word vectors, the cosine of the angle between the word vectors of the two reference images is calculated according to the cosine distance algorithm, and this cosine value is taken as the similarity of the text features of the two reference images.
  • The text position information may be the position information of a text box containing the text information in the reference image. For example, a text area containing text information is identified in the reference image and the position information of that text area is used as the text position information; alternatively, a text area containing content is identified in the reference image, the position of the corresponding virtual text box within the whole reference image is calculated, and that position information is used as the text position information of the text information.
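  • Written out, the cosine similarity used here between two word vectors a and b is the standard formula:

```latex
\mathrm{sim}(\mathbf{a}, \mathbf{b}) = \cos\theta
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```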
  • According to the calculated similarity of the text features of every two reference images, the two reference images whose similarity is greater than the preset similarity threshold are used as a reference image pair; the reference image pair is used as training data and input into the neural network model for training, so that the trained text recognition model can exploit the correlation between training data and improve the recognition accuracy of the text recognition model.
  • the preset similarity threshold can be customized, for example, set to 0.9.
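  • A minimal sketch of forming reference image pairs from the pairwise similarities (plain Python; the 0.9 threshold follows the example above and the data structures are assumptions for illustration):

```python
from itertools import combinations

def build_reference_pairs(text_features: dict, similarity, threshold: float = 0.9) -> list:
    """Return all (image_a, image_b) pairs whose text-feature similarity exceeds the threshold.

    text_features maps a reference-image id to its text feature; similarity(f1, f2) -> float.
    """
    pairs = []
    for (id_a, feat_a), (id_b, feat_b) in combinations(text_features.items(), 2):
        if similarity(feat_a, feat_b) > threshold:
            pairs.append((id_a, id_b))   # each qualifying pair becomes one training sample
    return pairs
```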
  • This application also considers introducing a blockchain structure and making full use of its characteristics (for example, that data on the blockchain cannot be tampered with): the training data is uploaded to the blockchain for certificate storage before training, and the associated data generated during training is likewise uploaded to the blockchain, so that, if needed later, a triggered supervision server can retrieve the relevant data saved on the blockchain and trace it back to reconstruct the training process, and then detect whether any risky behaviour occurred during training based on the reconstructed process, protecting the data security of the data provider and improving the security and credibility of the training process.
  • This embodiment can set iteration conditions for the neural network model.
  • The iteration conditions include the number of training iterations or the training duration, etc.
  • When the above iteration conditions are reached, the training ends and the training results of the trained neural network model are obtained.
  • These training results are used to judge whether the training results meet the requirements.
  • If they do, the trained neural network model is used as a text recognition model to identify text information in images.
  • The training results may include the recognized text information of each reference image in the reference image pair, which is marked as the target text information of each reference image in the pair.
  • This embodiment can calculate the similarity of the target text information of the two reference images in the reference image pair to obtain a predicted similarity, and judge whether the predicted similarity is consistent with the similarity of the corresponding text features; if so, the trained neural network model serves as a text recognition model that can accurately identify text information in images.
  • The training method of a text recognition model obtains a first image containing text information, performs random amplification processing on the first image to obtain multiple second images, marks the first image and the multiple second images as reference images, obtains the text features of the text information in each reference image, calculates the similarity of the text features of every two reference images, and uses the two reference images whose similarity is greater than the preset similarity threshold as a reference image pair.
  • The reference image pair is input into the neural network model for training, the training results after training are obtained, and it is judged whether the training results meet the requirements.
  • If they do, the trained neural network model is used as the text recognition model. Through this data amplification processing the amount of training data is increased, thereby improving the recognition accuracy of the text recognition model; and because the neural network model is trained with two reference images of high similarity, the trained text recognition model can exploit the correlation between training data, further improving its recognition accuracy.
  • determining whether the training results meet the requirements may specifically include:
  • A preset cross-entropy loss function can be used to calculate the loss value of the neural network model after each round of training; when the loss value meets the preset threshold or is less than the preset loss value, the training results of the neural network model meet the requirements, which means the neural network model satisfies the training requirements and training of the text recognition model is complete, improving the text recognition accuracy of the model.
  • the cross-entropy loss function is used to evaluate the degree to which the predicted value of the text recognition model is different from the true value. The better the loss function, the better the performance of the text recognition model.
  • The cross-entropy loss function is often used in classification problems, especially when neural networks perform classification.
  • Since cross entropy involves calculating the probability of each category, it almost always appears together with the sigmoid (or softmax) function.
  • the loss function in this embodiment is not specifically limited. For example, it can be a mean square error function, a covariance function, etc.
  • The preset loss value in this embodiment can be determined according to the actual situation, and it differs from the loss threshold that applies when training of the text recognition model finally ends; generally, the preset loss value here is greater than that final loss threshold. For example, if the loss threshold when the text recognition model is finally trained is 0.002, the preset loss value here should be larger than 0.002, for example 0.005.
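  • A small illustration of this check (assuming PyTorch cross-entropy; the 0.005 preset loss value follows the example above and is not a required setting):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()          # combines log-softmax and negative log-likelihood

def training_round_ok(logits: torch.Tensor, targets: torch.Tensor, preset_loss: float = 0.005) -> bool:
    """Return True when this round's loss is below the preset loss value, i.e. the result meets the requirement."""
    loss = criterion(logits, targets)      # logits: (batch, num_classes), targets: (batch,) class indices
    return loss.item() < preset_loss
```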
  • the method further includes:
  • If the loss value of the text recognition model is not less than the preset loss value, forward propagation can be carried out in the neural network structure of the text recognition model according to the loss value, and the relevant parameters of the text recognition model can be adjusted.
  • The reference image pair is then input again, and the text recognition model with the reset parameters is retrained until the loss value of the text recognition model is less than the preset loss value.
  • At that point, training of the text recognition model ends, and a text recognition model whose training results meet the requirements, i.e. a trained text recognition model, is obtained.
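  • A minimal sketch of this retraining loop (assuming PyTorch; the optimizer, learning rate and preset loss value are illustrative assumptions):

```python
import torch

def train_until_converged(model, pair_loader, criterion, preset_loss=0.005, max_epochs=100):
    """Keep adjusting the model parameters and re-feeding the reference image pairs
    until the loss value drops below the preset loss value."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for images, targets in pair_loader:            # each batch comes from reference image pairs
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()                             # propagate the loss through the network
            optimizer.step()                            # adjust the relevant parameters
            epoch_loss += loss.item()
        if epoch_loss / max(len(pair_loader), 1) < preset_loss:
            break                                       # training ends: requirement met
    return model
```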
  • the method may further include:
  • the target image is input into the text recognition model to obtain text information of the target image.
  • This embodiment obtains the target image to be recognized, inputs the target image into the text recognition model, and obtains the text information of the target image output by the text recognition model.
  • the target image to be recognized may be a text image uploaded by the user, or may be a text image collected directly through a camera by an electronic device that performs the text recognition method.
  • The acquisition method of the target image to be recognized is not limited here. Since the text recognition model of this application does not require sample labeling, the text recognition model can be obtained at a lower cost, and the cost of directly using it for text recognition is also low. In addition, because no sample labeling is needed during training, the recognition accuracy is no longer affected by the labeling method and is no longer limited by the number of labeled training samples; a model trained with a large number of training samples has higher recognition accuracy and reliability, so the text recognition model trained by this application can accurately identify the text information of the target image.
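  • A minimal sketch of this inference step (the character set and greedy CTC-style decoding below are assumptions for illustration; they are not prescribed by this application):

```python
import torch

CHARSET = "-0123456789abcdefghijklmnopqrstuvwxyz"   # assumed character set; index 0 is the blank symbol

def recognize_text(model, target_image: torch.Tensor) -> str:
    """Feed the target image to the trained text recognition model and return the recognized text."""
    model.eval()
    with torch.no_grad():
        logits = model(target_image.unsqueeze(0))    # add a batch dimension
    indices = logits.argmax(dim=-1).squeeze(0).tolist()   # greedy choice per time step
    # collapse repeated symbols and drop blanks, in the spirit of CTC decoding
    out, prev = [], None
    for i in indices:
        if i != prev and i != 0:
            out.append(CHARSET[i])
        prev = i
    return "".join(out)
```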
  • calculating the similarity of text features of each two reference images may specifically include:
  • a common method is to calculate the cosine distance between text features.
  • Cosine distance can reflect the difference between two vectors in space; similar semantic relationships are aggregated pairwise until all semantic relationships have been aggregated, and the most aggregated semantic relationships are filtered out as the semantic recognition result of the text features. For example, when most semantic relations are gathered in area A, the semantic relation closest to the centre of area A is selected from area A as the semantic recognition result.
  • Specifically, the Word2Vec word vector model can be used to convert the text features of each reference image into word vectors to obtain the text vector of each reference image; the cosine distance between the text vectors of every two reference images is then calculated and used as the similarity.
  • the Word2Vec word vector model is a model that learns semantic knowledge from a large amount of text in an unsupervised manner. It trains a large amount of text and represents the words in the text in the form of vectors. This vector is called a word vector. We can calculate the distance between the word vectors of two words to learn the connection between the two words.
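  • A minimal sketch of this Word2Vec-based similarity (assuming gensim 4.x and NumPy; tokenization and averaging word vectors into a text vector are illustrative assumptions):

```python
import numpy as np
from gensim.models import Word2Vec

def text_vector(model: Word2Vec, tokens: list) -> np.ndarray:
    """Average the word vectors of the tokens to get one text vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# train word vectors in an unsupervised way on the (tokenized) text of the reference images
corpus = [["invoice", "total", "amount"], ["invoice", "total", "sum"]]   # toy tokenized texts
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
sim = cosine_similarity(text_vector(w2v, corpus[0]), text_vector(w2v, corpus[1]))
```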
  • training the input neural network model with the reference image may specifically include:
  • Determining whether the training results meet the requirements includes:
  • the trained neural network model is verified according to the verification image. If the verification result does not meet the preset iteration stop conditions, it is determined that the training result does not meet the requirements.
  • one reference image can be randomly selected from the reference image pair as the training image, and the other reference image in the reference image pair can be used as the verification image.
  • the training image can be used to train the neural network model.
  • The verification result may be that the predicted similarity is the same as, or different from, the similarity of the corresponding text features. For example, the similarity between the text information of the training image output by the neural network model and the text information of the verification image output by the neural network model may be calculated to obtain the predicted similarity, and it is then judged whether the predicted similarity is consistent with the similarity of the corresponding text features; if so, the trained neural network model is used as a text recognition model that can accurately identify text information in images.
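  • A minimal sketch of this verification check (plain Python; recognize and text_similarity stand in for the model's recognition output and the text-feature similarity, and the tolerance criterion is an assumption):

```python
def verify_pair(recognize, text_similarity, train_img, verify_img,
                reference_similarity: float, tolerance: float = 0.05) -> bool:
    """Compare the predicted similarity of the recognized texts against the reference similarity."""
    predicted = text_similarity(recognize(train_img), recognize(verify_img))
    # "consistent" is taken here as agreement within a small tolerance (an assumed criterion)
    return abs(predicted - reference_similarity) <= tolerance
```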
  • the random amplification process on the first image to obtain multiple second images may specifically include:
  • the random amplification processing method includes but is not limited to flipping, translating, scaling the image, adjusting the weight of each RGB channel of the image, and rotating the image.
  • the first image can be flipped, and then the flipped first image can be enlarged to obtain a second image.
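  • For instance, a flip followed by an enlargement could be sketched as follows (assuming Pillow; the 2x scale factor is only an example):

```python
from PIL import Image, ImageOps

def flip_then_enlarge(first_image: Image.Image, scale: float = 2.0) -> Image.Image:
    """Flip the first image horizontally, then enlarge it to obtain one second image."""
    flipped = ImageOps.mirror(first_image)                       # horizontal flip
    new_size = (int(flipped.width * scale), int(flipped.height * scale))
    return flipped.resize(new_size)                              # enlarge by the scale factor
```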
  • an embodiment of the present application also provides a training device for a text recognition model, including:
  • Acquisition module 11 used to acquire the first image containing text information
  • the amplification processing module 12 is used to perform random amplification processing on the first image to obtain multiple second images;
  • Marking module 13 used to mark the first image and multiple second images as reference images
  • the calculation module 14 is used to obtain the text features of the text information in each of the reference images, and calculate the similarity of the text features of each two of the reference images;
  • the input module 15 is used to use the two reference images whose similarity is greater than the preset similarity threshold as a reference image pair, and input the reference image pair into the neural network model for training;
  • the judgment module 16 is used to obtain the training results after training the neural network model and judge whether the training results meet the requirements
  • the determination module 17 is configured to use the trained neural network model as a text recognition model when it is determined that the training results meet the requirements.
  • the object recognized by the text recognition model of this application is an image containing text information, and the text information in the image is recognized to realize the text recognition function of the image.
  • the first image with text information obtained in this embodiment can be an image uploaded by the user.
  • For example, the user can obtain it by scanning a paper document or other media containing text information, or it can be a screenshot of mobile phone screen content, etc.
  • After acquiring the first image containing text information, the first image can also be pre-processed, for example by adjusting image parameters such as its size, brightness and sharpness.
  • In addition, the first image is usually in color with multiple colors, and the characters of the text information are mostly colors with relatively dark brightness values.
  • To make it easier to extract each character of the text information, the first image can be binarized using a brightness value as the standard and converted into a black-and-white image, highlighting the text information and avoiding color interference in the first image.
  • Specifically, the server obtains the color brightness values in the first image and compares them with a preset color brightness value to obtain a comparison result, which indicates whether each color brightness value in the first image is greater than, equal to, or less than the preset value.
  • According to the comparison result, the parts of the first image whose color brightness value is greater than the preset color brightness value are converted to white and the rest to black, which makes it easier to extract each character of the text information in the first image.
  • the preset color brightness value can be adjusted as needed.
  • The server in this embodiment can also judge the background color of the first image and convert a first image with a black background and white text information into one with a white background and black text information; that is, an image with white text on a black background is converted into an image with black text on a white background.
  • Random amplification processing is a method of expanding data: it can increase the number of samples in the training set, effectively alleviate model overfitting, and give the model stronger generalization ability.
  • the purpose of random amplification processing is to make the training data as close as possible to the test data, thereby improving the prediction accuracy.
  • random amplification processing can force the network to learn more robust features, thereby giving the model stronger generalization capabilities.
  • This embodiment performs random amplification processing on the first image, for example enlarging, reducing, cropping, adjusting brightness or adjusting saturation; a single random amplification method can be used, or multiple methods can be combined, finally yielding multiple second images.
  • The image amplification technology of this embodiment has a positive effect on target detection in deep learning: it can increase the amount of data in each category, keep the categories balanced, avoid the overfitting problems caused by sample imbalance, and also reduce, to a certain extent, the amount of data that must be collected in the early sampling stage.
  • This embodiment marks the first image and the multiple second images as reference images, generates a data set including all reference images, and then obtains the text features of the text information in each reference image from the data set and calculates the similarity of the text features of every two reference images. Specifically, the text position information of the text information can be identified in the reference image, the reference image can be corrected according to the text position information to obtain a corrected reference image, and the encoding network of the recognition model can be used to perform feature extraction on the text information of the corrected reference image to obtain the text features. Then, based on the word features contained in the text features of every two reference images, a vector space model for calculating the similarity between their text features is constructed; the word features of each reference image are represented as word vectors, the cosine of the angle between the word vectors of the two reference images is calculated according to the cosine distance algorithm, and this cosine value is taken as the similarity of the text features of the two reference images.
  • The text position information may be the position information of a text box containing the text information in the reference image. For example, a text area containing text information is identified in the reference image and the position information of that text area is used as the text position information; alternatively, a text area containing content is identified in the reference image, the position of the corresponding virtual text box within the whole reference image is calculated, and that position information is used as the text position information of the text information.
  • The two reference images whose similarity is greater than the preset similarity threshold are used as a reference image pair; the reference image pair is used as training data and input into the neural network model for training, so that the trained text recognition model can exploit the correlation between training data and improve the recognition accuracy of the text recognition model.
  • the preset similarity threshold can be customized, for example, set to 0.9.
  • This application can also introduce a blockchain structure and make full use of its characteristics (for example, that data on the blockchain cannot be tampered with).
  • Before training, the training data can be uploaded to the blockchain for certificate storage.
  • During training, the associated data generated in the training process is uploaded to the blockchain for certificate storage, so that, if needed later, a triggered supervision server can retrieve the relevant data stored on the blockchain and trace it back to reconstruct the training process, and then detect whether any risky behaviour occurred during training based on the reconstructed process, protecting the data security of the data provider and improving the security and credibility of the training process.
  • This embodiment can set the iteration conditions of the neural network model.
  • the iteration conditions include the number of training times or the training duration, etc.
  • When the iteration conditions are reached, the training is ended.
  • the training results after training of the neural network model are obtained.
  • If the training results meet the requirements, the trained neural network model is used as a text recognition model to identify text information in images.
  • The training results may include the recognized text information of each reference image in the reference image pair, which is marked as the target text information of each reference image in the pair.
  • This embodiment can calculate the similarity of the target text information of the two reference images in the reference image pair to obtain a predicted similarity, and judge whether the predicted similarity is consistent with the similarity of the corresponding text features; if so, the trained neural network model serves as a text recognition model that can accurately identify text information in images.
  • each component of the text recognition model training device proposed in this application can implement the functions of any of the above text recognition model training methods, and the specific structure will not be described again.
  • an embodiment of the present application also provides a computer device, the internal structure of which can be shown in Figure 3.
  • the computer device includes a processor, memory, network interface, and database connected through a system bus.
  • The processor of the computer device is used to provide computing and control capabilities.
  • The memory of the computer device includes a storage medium and an internal memory.
  • The storage medium stores an operating system, a computer program and a database; the internal memory provides an environment for the operating system and computer program in the storage medium to run.
  • the database of the computer device is used to store data related to the training method of the text recognition model.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • The computer program, when executed by the processor, implements a method for training a text recognition model.
  • the above-mentioned processor executes the above-mentioned text recognition model training method, including the following steps:
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • A computer program is stored thereon; when the computer program is executed by a processor, it implements a training method for a text recognition model, which includes the following steps:
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (SSRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
  • SRAM static RAM
  • DRAM dynamic RAM
  • SDRAM synchronous DRAM
  • SSRSDRAM double data rate SDRAM
  • ESDRAM expanded SDRAM
  • SLDRAM Synchlink DRAM
  • Rambus direct RAM
  • DRDRAM direct Rambus dynamic RAM
  • RDRAM memory bus dynamic RAM
  • This application provides a training method, device, computer equipment and storage medium for a text recognition model, which acquire a first image containing text information, perform random amplification processing on the first image to obtain multiple second images, mark the first image and the multiple second images as reference images, obtain the text features of the text information in each reference image, calculate the similarity of the text features of every two reference images, use the two reference images whose similarity is greater than the preset similarity threshold as a reference image pair, input the reference image pair into the neural network model for training, obtain the training results after training, and judge whether the training results meet the requirements.
  • If they do, the trained neural network model is used as a text recognition model.
  • The amount of training data is increased through data amplification, and the neural network model is trained with two reference images of high similarity, so that the trained text recognition model can exploit the correlation between training data, thereby improving the recognition accuracy of the text recognition model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The present application relates to the technical field of natural language processing of artificial intelligence technology. Provided in the present application are a method and apparatus for training a text recognition model, and a computer device and a storage medium. The method comprises: performing random augmentation processing on a first image, so as to obtain a plurality of second images; marking the first image and the plurality of second images as reference images; acquiring a text feature of text information in each reference image, and calculating the similarity of the text features of every two reference images; taking two reference images, the similarity between which is greater than a preset similarity threshold, as a reference image pair, and inputting the reference image pair into a neural network model for training; acquiring a training result after the neural network model is trained, and determining whether the training result meets a requirement; and if so, taking the trained neural network model as a text recognition model. In this way, the data volume of training data is increased by means of a data augmentation processing mode, such that the recognition accuracy of a text recognition model is improved.

Description

文本识别模型的训练方法、装置、计算机设备及存储介质Training method, device, computer equipment and storage medium for text recognition model
本申请要求于2022年3月15日提交中国专利局、申请号为202210253870.4,发明名称为“文本识别模型的训练方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application submitted to the China Patent Office on March 15, 2022, with application number 202210253870.4, and the invention name is "Training method, device, computer equipment and storage medium for text recognition model", and its entire content incorporated herein by reference.
技术领域Technical field
本申请涉及人工智能技术的自然语言处理技术领域,具体而言,本申请涉及一种文本识别模型的训练方法、装置、计算机设备及存储介质。This application relates to the technical field of natural language processing of artificial intelligence technology. Specifically, this application relates to a training method, device, computer equipment and storage medium for a text recognition model.
背景技术 Background Art
文本识别任务要求通过一定的图像处理来识别图像中的文本内容。文本识别可应用于许多领域,如信件和包裹的分拣、稿件的编辑和校对、大量统计报表和卡片的汇总与分析、银行支票的处理、商品发票的统计汇总、商品编码的识别、商品仓库的管理、文档检索、各类证件识别和财务票据处理的办公自动化等,方便用户快速录入信息,提高各行各业的工作效率。The text recognition task requires certain image processing to identify the text content in the image. Text recognition can be used in many fields, such as sorting letters and packages, editing and proofreading of manuscripts, summarizing and analyzing a large number of statistical reports and cards, processing bank checks, statistical summarization of commodity invoices, identification of commodity codes, commodity warehouses Management, document retrieval, identification of various documents and office automation of financial bill processing, etc., facilitate users to quickly enter information and improve work efficiency in all walks of life.
发明人发现,目前的文本识别方法,常用深度学习方式,进行不分割地端到端处理,目前效果较好且比较常用的算法模型是CRNN(ConvolutionalRecurrentNeural Network,卷积循环神经网络),该模型首先使用卷积神经网络(Convolutional NeuralNetworks,CNN)从输入图像中提取出特征序列,然后使用循环神经网络(Recurrent Neural Networks,RNN)预测从卷积层获取的特征序列的标签分布,最后引入联结主义时序分类(Connectionist temporal classification,CTC)把从循环层获取的标签分布通过去重、整合等操作转换成最终的识别结果,而卷积神经网络的性能对训练数据的依赖性很高,当训练数据多样性越多,数据量越大时,训练得到的模型性能往往更好,但是当训练数据的数据量较少时,则训练得到的文本识别模型的识别准确率较低。The inventor found that the current text recognition method commonly uses deep learning methods to perform end-to-end processing without segmentation. The algorithm model that currently has better results and is more commonly used is CRNN (ConvolutionalRecurrentNeural Network). This model first Use Convolutional Neural Networks (CNN) to extract feature sequences from the input image, then use Recurrent Neural Networks (RNN) to predict the label distribution of the feature sequences obtained from the convolutional layer, and finally introduce connectionist time series Classification (Connectionist temporal classification, CTC) converts the label distribution obtained from the loop layer into the final recognition result through operations such as deduplication and integration. The performance of the convolutional neural network is highly dependent on the training data. When the training data is diverse The more characteristics and the larger the amount of data, the better the performance of the trained model will be. However, when the amount of training data is smaller, the recognition accuracy of the trained text recognition model will be lower.
技术问题technical problem
本申请的主要目的为提供一种文本识别模型的训练方法、装置、计算机设备及存储介质,以提高训练数据的数据量,进而提高文本识别模型的识别准确率。The main purpose of this application is to provide a training method, device, computer equipment and storage medium for a text recognition model, so as to increase the amount of training data and thereby improve the recognition accuracy of the text recognition model.
技术解决方案Technical solutions
为了实现上述发明目的,本申请提供一种文本识别模型的训练方法,其包括:In order to achieve the above-mentioned object of the invention, this application provides a training method for a text recognition model, which includes:
获取含有文本信息的第一图像;Obtain the first image containing text information;
对所述第一图像进行随机扩增处理,得到多张第二图像;Perform random amplification processing on the first image to obtain multiple second images;
将所述第一图像和多张第二图像标记为参考图像;Mark the first image and the plurality of second images as reference images;
获取每张所述参考图像中文本信息的文本特征,计算每两张所述参考图像的文本特征的相似度;Obtain the text features of the text information in each of the reference images, and calculate the similarity of the text features of each two of the reference images;
将相似度大于预设相似度阈值的两张所述参考图像作为参考图像对,将所述参考图像对输入神经网络模型进行训练;Use the two reference images whose similarity is greater than the preset similarity threshold as a reference image pair, and input the reference image pair into the neural network model for training;
获取所述神经网络模型训练后的训练结果,判断所述训练结果是否满足要求;Obtain the training results after training the neural network model, and determine whether the training results meet the requirements;
若是,将训练后的所述神经网络模型作为文本识别模型。If so, use the trained neural network model as a text recognition model.
本申请还提供一种文本识别模型的训练装置,其包括:This application also provides a training device for a text recognition model, which includes:
获取模块,用于获取含有文本信息的第一图像;The acquisition module is used to acquire the first image containing text information;
扩增处理模块,用于对所述第一图像进行随机扩增处理,得到多张第二图像;An amplification processing module, used to perform random amplification processing on the first image to obtain multiple second images;
标记模块,用于将所述第一图像和多张第二图像标记为参考图像;A marking module, used to mark the first image and the plurality of second images as reference images;
计算模块,用于获取每张所述参考图像中文本信息的文本特征,计算每两张所述参考图像的文本特征的相似度;A calculation module, used to obtain the text features of the text information in each of the reference images, and calculate the similarity of the text features of each two of the reference images;
输入模块,用于将相似度大于预设相似度阈值的两张所述参考图像作为参考图像对,将所述参考图像对输入神经网络模型进行训练;An input module, configured to use the two reference images with a similarity greater than a preset similarity threshold as a reference image pair, and input the reference image pair into the neural network model for training;
判断模块,用于获取所述神经网络模型训练后的训练结果,判断所述训练结果是否满足要求;A judgment module, used to obtain the training results after the neural network model is trained, and judge whether the training results meet the requirements;
判定模块,用于在判定所述训练结果满足要求时,将训练后的所述神经网络模型作为文本识别模型。A determination module, configured to use the trained neural network model as a text recognition model when it is determined that the training results meet the requirements.
本申请还提供一种计算机设备，包括存储器和处理器，所述存储器存储有计算机程序，所述处理器执行所述计算机程序时实现一种文本识别模型的训练方法，其中，所述方法包括以下步骤：This application also provides a computer device, including a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements a training method for a text recognition model, wherein the method includes the following steps:
获取含有文本信息的第一图像;Obtain the first image containing text information;
对所述第一图像进行随机扩增处理,得到多张第二图像;Perform random amplification processing on the first image to obtain multiple second images;
将所述第一图像和多张第二图像标记为参考图像;Mark the first image and the plurality of second images as reference images;
获取每张所述参考图像中文本信息的文本特征,计算每两张所述参考图像的文本特征的相似度;Obtain the text features of the text information in each of the reference images, and calculate the similarity of the text features of each two of the reference images;
将相似度大于预设相似度阈值的两张所述参考图像作为参考图像对,将所述参考图像对输入神经网络模型进行训练;Use the two reference images whose similarity is greater than the preset similarity threshold as a reference image pair, and input the reference image pair into the neural network model for training;
获取所述神经网络模型训练后的训练结果,判断所述训练结果是否满足要求;Obtain the training results after training the neural network model, and determine whether the training results meet the requirements;
若是,将训练后的所述神经网络模型作为文本识别模型。If so, use the trained neural network model as a text recognition model.
本申请还提供一种计算机可读存储介质，所述计算机可读存储介质上存储有计算机程序，该计算机程序被处理器执行时实现一种文本识别模型的训练方法，其中，所述方法包括以下步骤：This application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements a training method for a text recognition model, wherein the method includes the following steps:
获取含有文本信息的第一图像;Obtain the first image containing text information;
对所述第一图像进行随机扩增处理,得到多张第二图像;Perform random amplification processing on the first image to obtain multiple second images;
将所述第一图像和多张第二图像标记为参考图像;Mark the first image and the plurality of second images as reference images;
获取每张所述参考图像中文本信息的文本特征,计算每两张所述参考图像的文本特征的相似度;Obtain the text features of the text information in each of the reference images, and calculate the similarity of the text features of each two of the reference images;
将相似度大于预设相似度阈值的两张所述参考图像作为参考图像对,将所述参考图像对输入神经网络模型进行训练;Use the two reference images whose similarity is greater than the preset similarity threshold as a reference image pair, and input the reference image pair into the neural network model for training;
获取所述神经网络模型训练后的训练结果,判断所述训练结果是否满足要求;Obtain the training results after training the neural network model, and determine whether the training results meet the requirements;
若是,将训练后的所述神经网络模型作为文本识别模型。If so, use the trained neural network model as a text recognition model.
有益效果beneficial effects
本申请所提供的一种文本识别模型的训练方法、装置、计算机设备及存储介质,可提高文本识别模型的识别准确率。The training method, device, computer equipment and storage medium of a text recognition model provided by this application can improve the recognition accuracy of the text recognition model.
附图说明Description of the drawings
图1为本申请一实施例的文本识别模型的训练方法的流程示意图;Figure 1 is a schematic flowchart of a training method for a text recognition model according to an embodiment of the present application;
图2为本申请一实施例的文本识别模型的训练装置的结构示意框图;Figure 2 is a schematic structural block diagram of a text recognition model training device according to an embodiment of the present application;
图3为本申请一实施例的计算机设备的结构示意框图。FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
本发明的最佳实施方式Best Mode of Carrying Out the Invention
本申请提出一种文本识别模型的训练方法,本申请实施例可以基于人工智能技术对相关的数据进行获取和处理。其中,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。This application proposes a training method for a text recognition model. The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Among them, artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or digital computer-controlled machines to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. .
人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统等技术。人工智能软件技术主要包括计算机视觉技术、机器人技术、生物识别技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等方向。Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, etc. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometric technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
本申请提出的一种文本识别模型的训练方法,以服务器为执行主体,服务器可以是独立的服务器,也可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器。The text recognition model training method proposed in this application uses a server as the execution subject. The server can be an independent server, or it can provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, and cloud communications. , middleware services, domain name services, security services, content delivery network (Content Delivery Network, CDN), and cloud servers for basic cloud computing services such as big data and artificial intelligence platforms.
该文本识别模型的训练方法用于解决在训练数据的数据量较少时,则训练得到的文本识别模型的识别准确率较低的技术问题。参考图1,其中一个实施例中,该文本识别模型的训练方法包括:This text recognition model training method is used to solve the technical problem that when the amount of training data is small, the recognition accuracy of the trained text recognition model is low. Referring to Figure 1, in one embodiment, the training method of the text recognition model includes:
S11、获取含有文本信息的第一图像;S11. Obtain the first image containing text information;
S12、对所述第一图像进行随机扩增处理,得到多张第二图像;S12. Perform random amplification processing on the first image to obtain multiple second images;
S13、将所述第一图像和多张第二图像标记为参考图像;S13. Mark the first image and multiple second images as reference images;
S14、获取每张所述参考图像中文本信息的文本特征,计算每两张所述参考图像的文本特征的相似度;S14. Obtain the text features of the text information in each reference image, and calculate the similarity of the text features of each two reference images;
S15、将相似度大于预设相似度阈值的两张所述参考图像作为参考图像对,将所述参考图像对输入神经网络模型进行训练;S15. Use the two reference images whose similarity is greater than the preset similarity threshold as a reference image pair, and input the reference image pair into the neural network model for training;
S16、获取所述神经网络模型训练后的训练结果,判断所述训练结果是否满足要求;S16. Obtain the training results after training the neural network model, and determine whether the training results meet the requirements;
S17、若是,将训练后的所述神经网络模型作为文本识别模型。S17. If yes, use the trained neural network model as a text recognition model.
如上述步骤S11所述,本申请的文本识别模型所识别的对象为含有文本信息的图像,对图像中的文本信息进行识别,实现图像的文本识别功能。本实施例获取的具有文本信息的第一图像可以是用户上传的图像,如用户可通过对具有文本信息的纸质或者其他介质文档进行扫描获得,也可以为截取手机屏幕内容的截屏图像等等。As described in step S11 above, the object recognized by the text recognition model of the present application is an image containing text information, and the text information in the image is recognized to realize the text recognition function of the image. The first image with text information obtained in this embodiment can be an image uploaded by the user. For example, the user can obtain it by scanning a paper or other media document with text information, or it can also be a screenshot of the mobile phone screen content, etc. .
在一实施例中,在获取到含有文本信息的第一图像后,还可对第一图像进行预处理,如调整第一图像的图像尺寸、亮度、清晰度等等图像参数。此外,通常的第一图像为彩色,具有多种颜色,文本信息的字符颜色多为亮度值比较暗的颜色,为利于将第一图像中的文本信息的每个字符提取出来,还可以设定亮度值为标准对第一图像进行二值化处理,将第一图像转换为黑白图像,以凸显第一图像中的文本信息,避免第一图像中的颜色干扰。In one embodiment, after acquiring the first image containing text information, the first image can also be pre-processed, such as adjusting image parameters such as image size, brightness, and sharpness of the first image. In addition, the first image is usually in color and has multiple colors. The character color of the text information is mostly a color with a relatively dark brightness value. In order to facilitate the extraction of each character of the text information in the first image, it is also possible to set The first image is binarized using the brightness value as the standard, and the first image is converted into a black and white image to highlight the text information in the first image and avoid color interference in the first image.
具体的,服务器获取第一图像中的颜色亮度值,将第一图像中的颜色亮度值与预设颜色亮度值进行比对,得到比对结果,该比对结果中包含第一图像中的颜色亮度值大于、等于或小于预设颜色亮度值;根据比对结果,将第一图像中的颜色亮度值大于预设颜色亮度值的第一图像转换为白色,反之则转换为黑色,以利于将第一图像中的文本信息的每个字符提取出来。其中,预设颜色亮度值可根据需要可进行调整。Specifically, the server obtains the color brightness value in the first image, compares the color brightness value in the first image with the preset color brightness value, and obtains a comparison result, which includes the color in the first image. The brightness value is greater than, equal to or less than the preset color brightness value; according to the comparison result, the first image in the first image whose color brightness value is greater than the preset color brightness value is converted to white, and vice versa is converted to black to facilitate the conversion. Each character of the text information in the first image is extracted. Among them, the preset color brightness value can be adjusted as needed.
在一实施例中,当检测到第一图像的背景为黑色、文本信息为白色时, 即黑底白字的情况。为避免影响文本信息的识别,本实施例的服务器还可对第一图像的背景颜色进行判断,将背景颜色为黑色、文本信息为白色的第一图像转换为背景颜色为白色、文本信息为黑色的图像,即将黑底白字的图像转换为白底黑字的图像。In one embodiment, when it is detected that the background of the first image is black and the text information is white, it is a situation of white text on a black background. In order to avoid affecting the recognition of text information, the server in this embodiment can also determine the background color of the first image, and convert the first image with a background color of black and text information of white into a first image with a background color of white and text information of black image, that is, convert an image with white text on a black background into an image with black text on a white background.
As described in step S12 above, random amplification processing is a method of expanding data. It can increase the number of samples in the training set, effectively alleviate model over-fitting, and give the model stronger generalization ability. The purpose of random amplification processing is to make the training data as close as possible to the test data, thereby improving prediction accuracy. In addition, random amplification processing can force the network to learn more robust features, so that the model generalizes better.
In this embodiment, random amplification processing is performed on the first image, for example enlarging, reducing, cropping, brightness adjustment or saturation adjustment; a single random amplification method may be used, or several random amplification methods may be combined, finally yielding multiple second images. The image amplification technique of this embodiment has a positive effect on object detection in deep learning: it can increase the amount of data in each category and keep the categories balanced, avoiding the over-fitting problems caused by sample imbalance, and it can also reduce, to a certain extent, the amount of data that has to be collected in the earlier sampling stage.
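A minimal sketch of such a random amplification step, assuming Python with Pillow; the parameter ranges and the number of second images are illustrative choices rather than values specified by this application:

    import random
    from PIL import Image, ImageEnhance

    def random_amplify(first_image: Image.Image, num_copies: int = 8) -> list:
        """Produce `num_copies` second images from one first image."""
        second_images = []
        for _ in range(num_copies):
            img = first_image.copy()
            # Randomly enlarge or reduce.
            scale = random.uniform(0.8, 1.2)
            img = img.resize((max(1, int(img.width * scale)), max(1, int(img.height * scale))))
            # Random crop covering roughly 90% of each side.
            left = random.randint(0, max(1, img.width // 10))
            top = random.randint(0, max(1, img.height // 10))
            img = img.crop((left, top, left + int(img.width * 0.9), top + int(img.height * 0.9)))
            # Random brightness and saturation adjustment.
            img = ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))
            img = ImageEnhance.Color(img).enhance(random.uniform(0.7, 1.3))
            second_images.append(img)
        return second_images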
As described in steps S13-S14 above, this embodiment marks the first image and the multiple second images as reference images, generates a data set containing all reference images, then obtains the text features of the text information in each reference image from the data set, and calculates the similarity of the text features of every two reference images. Specifically, the text position information of the text information can be identified in a reference image; the reference image is corrected according to the text position information to obtain a corrected reference image; the encoding network of the recognition model performs feature extraction on the text information of the corrected reference image to obtain text features; then, based on the word features contained in the text features of every two reference images, a vector space model for calculating the similarity between the text features of the two reference images is constructed; according to the vector space model, the word features of the two reference images are represented as word vectors; and, following the cosine distance algorithm, the cosine of the angle between the word vectors of the two reference images is calculated and taken as the similarity of the text features of the two reference images.
The text position information may be the position information, within the reference image, of a text box containing the text information. For example, a text area containing text information is identified in the reference image, and the position information of that text area is taken as the text position information of the text information; for instance, after a text area containing content is identified in the reference image, the position information of the corresponding virtual text box within the whole reference image is calculated and used as the text position information of the text information.
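The vector space model and cosine computation described above could be sketched as follows; whitespace tokenization and simple term-frequency word vectors are assumptions, and are only one possible realization of the word features mentioned in this embodiment:

    import math
    from collections import Counter

    def cosine_similarity(text_a: str, text_b: str) -> float:
        """Cosine of the angle between the word vectors of two recognized texts."""
        vec_a, vec_b = Counter(text_a.split()), Counter(text_b.split())
        vocab = set(vec_a) | set(vec_b)
        dot = sum(vec_a[w] * vec_b[w] for w in vocab)
        norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
        norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0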
As described in step S15 above, based on the calculated similarity of the text features of every two reference images, this embodiment takes two reference images whose similarity is greater than a preset similarity threshold as a reference image pair, uses the reference image pairs as training data, and inputs the reference image pairs into the neural network model for training, so that the trained text recognition model can exploit the correlation between training data, improving the recognition accuracy of the text recognition model. The preset similarity threshold can be set as desired, for example to 0.9.
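Pair construction can be sketched as below; `reference_images`, `text_features` and `similarity_fn` are hypothetical placeholders for the data set and the similarity calculation described above:

    from itertools import combinations

    SIMILARITY_THRESHOLD = 0.9  # example preset similarity threshold

    def build_reference_pairs(reference_images, text_features, similarity_fn):
        """Return all pairs of reference images whose text features are similar enough."""
        pairs = []
        for i, j in combinations(range(len(reference_images)), 2):
            if similarity_fn(text_features[i], text_features[j]) > SIMILARITY_THRESHOLD:
                pairs.append((reference_images[i], reference_images[j]))
        return pairs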
At the same time, this application also considers introducing a blockchain structure and making full use of the relevant properties of the blockchain (for example, that data on the blockchain cannot be tampered with). Before training, the training data are uploaded to the blockchain for evidence storage; during training, the data associated with the training process are uploaded to the blockchain for evidence storage, so that later, if necessary, a triggered supervision server can retrieve the relevant data stored on the blockchain and trace back through them to reconstruct the training process, and then detect from the reconstructed training process whether any risky behavior occurred during training, thereby protecting the data security of the data provider and improving the security and credibility of the training process.
As described in steps S16-S17 above, this embodiment can set iteration conditions for the neural network model, such as the number of training iterations or the training duration. When the neural network model satisfies the iteration conditions, training ends; at this point the training result of the trained neural network model is obtained and it is judged whether the training result meets the requirements. When the training result is judged to meet the requirements, the trained neural network model is used as the text recognition model for recognizing text information in images.
The training result may include the recognized text information of each reference image in a reference image pair, marked as the target text information of that reference image. This embodiment can calculate the similarity of the target text information of the two reference images in a reference image pair to obtain a predicted similarity, and judge whether the predicted similarity is consistent with the similarity of the corresponding text features; if so, the trained neural network model is used as the text recognition model to accurately recognize text information in images.
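One way to read this consistency check, reusing the cosine_similarity helper sketched earlier, is the following; treating "consistent" as agreement within a small tolerance is an assumption, since the application does not fix an exact criterion:

    def training_result_meets_requirement(predicted_text_a: str, predicted_text_b: str,
                                          feature_similarity: float,
                                          tolerance: float = 0.05) -> bool:
        # Similarity of the target text information recognized for the two images of the pair.
        predicted_similarity = cosine_similarity(predicted_text_a, predicted_text_b)
        # The predicted similarity is taken as consistent with the text-feature
        # similarity when the two values agree within the tolerance (assumption).
        return abs(predicted_similarity - feature_similarity) <= tolerance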
In the training method of a text recognition model provided by this application, a first image containing text information is obtained; random amplification processing is performed on the first image to obtain multiple second images; the first image and the multiple second images are marked as reference images; the text features of the text information in each reference image are obtained; the similarity of the text features of every two reference images is calculated; two reference images whose similarity is greater than a preset similarity threshold are taken as a reference image pair; the reference image pairs are input into a neural network model for training; the training result of the trained neural network model is obtained and it is judged whether the training result meets the requirements; and, when the training result is judged to meet the requirements, the trained neural network model is used as the text recognition model. In this way, data amplification increases the amount of training data and thus improves the recognition accuracy of the text recognition model, and training the neural network model with pairs of highly similar reference images allows the trained text recognition model to exploit the correlation between training data, further improving its recognition accuracy.
In one embodiment, judging whether the training result meets the requirements may specifically include:
calculating the loss value of the trained neural network model according to the training result and a preset loss function;
judging whether the loss value is lower than a preset loss value;
if so, judging that the training result meets the requirements;
if not, judging that the training result does not meet the requirements.
In this embodiment, after each training run of the neural network model, a preset cross-entropy loss function can be used to calculate the loss value of the neural network model after that training run. When the loss value satisfies a preset threshold, or is less than the preset loss value, the training result of the neural network model meets the requirements, which indicates that the neural network model has reached the training requirement; training of the text recognition model is then complete, which improves the text recognition accuracy of the text recognition model.
The cross-entropy loss function is used to evaluate how far the predicted values of the text recognition model deviate from the true values; the better the loss function, the better the performance of the text recognition model usually is. Cross-entropy is frequently used in classification problems, especially when neural networks perform classification; since cross-entropy involves calculating the probability of each class, it almost always appears together with the sigmoid (or softmax) function. The loss function of this embodiment is not specifically limited and may, for example, be a mean squared error function, a covariance function, and so on.
In addition, the preset loss value of this embodiment can be determined according to the actual situation, and it is different from the loss threshold corresponding to the finally trained text recognition model; generally, the preset loss value here is larger than that final loss threshold. For example, if the loss threshold corresponding to the finally trained text recognition model is 0.002, the preset loss value here should be larger than 0.002, for example 0.005.
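A PyTorch sketch of this loss check, assuming character-level classification logits; the tensor shapes and the 0.005 figure are illustrative only and follow the example above:

    import torch
    import torch.nn as nn

    PRESET_LOSS = 0.005  # example preset loss value (looser than the final threshold, e.g. 0.002)
    criterion = nn.CrossEntropyLoss()

    def result_meets_requirement(logits: torch.Tensor, labels: torch.Tensor) -> bool:
        """logits: (batch, num_classes); labels: (batch,) character class indices."""
        loss = criterion(logits, labels)
        return loss.item() < PRESET_LOSS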
In one embodiment, after it is judged that the training result does not meet the requirements, the method further includes:
updating the parameters of the neural network model based on the loss value, training the neural network model with the updated parameters on the reference image pairs again until the training result meets the requirements, and outputting the trained text recognition model.
When the loss value of the text recognition model is not less than the preset loss value, a forward pass can be carried out in the neural network structure of the text recognition model according to the loss value and the relevant parameters of the text recognition model adjusted; the reference image pairs are then input into the text recognition model with the re-set parameters for retraining until the loss value of the text recognition model is less than the preset loss value. At this point the training of the text recognition model ends, and a text recognition model whose training result meets the requirements, i.e. a trained text recognition model, is obtained.
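A hedged sketch of this retraining loop in PyTorch; `model`, `criterion` and `reference_pair_loader` are assumed to be defined elsewhere, and the optimizer, learning rate and epoch cap are illustrative choices rather than part of this application:

    import torch

    def train_until_converged(model, reference_pair_loader, criterion,
                              preset_loss=0.005, max_epochs=1000):
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        for _ in range(max_epochs):
            last_loss = float("inf")
            for images, labels in reference_pair_loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)  # forward pass on a batch of reference pairs
                loss.backward()                          # propagate the loss through the network
                optimizer.step()                         # adjust the relevant parameters
                last_loss = loss.item()
            if last_loss < preset_loss:                  # training result meets the requirement
                break
        return model                                     # trained text recognition model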
In one embodiment, after using the trained neural network model as the text recognition model, the method may further include:
obtaining a target image to be recognized;
inputting the target image into the text recognition model to obtain the text information of the target image.
In this embodiment, the target image to be recognized is obtained and input into the text recognition model, and the text information of the target image is obtained from the output of the text recognition model. The target image to be recognized may be a text image uploaded by a user, or a text image collected directly through a camera by the electronic device executing the text recognition method; the way the target image to be recognized is acquired is not limited here. Since the text recognition model of this application does not require sample labeling, the model can be obtained at a relatively low cost, and the cost of using it directly for text recognition is also low. Moreover, because the text recognition model does not need sample labeling during training, its recognition accuracy is no longer affected by the labeling method and is no longer limited by the number of training samples; a model trained with a large number of training samples has higher recognition accuracy and reliability. Therefore, the text recognition model trained by this application can accurately recognize the text information of the target image.
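Inference on a target image could then look like the following sketch, reusing the binarization helper above; the tensor layout and the assumption that the model decodes directly to a text string are illustrative, since the actual decoding depends on the model architecture:

    import numpy as np
    import torch

    def recognize_text(model, target_image_path: str) -> str:
        """Run the trained text recognition model on one target image."""
        image = binarize_first_image(target_image_path)       # preprocessing sketched earlier
        tensor = torch.from_numpy(np.asarray(image)).float()
        tensor = tensor.unsqueeze(0).unsqueeze(0) / 255.0     # shape (1, 1, H, W)
        with torch.no_grad():
            text_info = model(tensor)                         # assumed to return the decoded text
        return text_info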
In one embodiment, calculating the similarity of the text features of every two reference images may specifically include:
converting the text features of each reference image into vector form to obtain the text vector of each reference image;
calculating the cosine distance between the text vectors of every two reference images to obtain the similarity of the text features of the two reference images.
In this embodiment, a common way to measure the similarity between text features is to calculate the cosine distance between them. The cosine distance reflects the difference between two vectors in space. Similar semantic relationships are clustered together until all semantic relationships have been clustered, and the most clustered semantic relationship is then selected as the semantic recognition result of the text features; for example, when most semantic relationships are clustered in region A, the semantic relationship closest to the center of region A is selected as the semantic recognition result.
In this embodiment, the Word2Vec word-vector model can be used to convert the text features of each reference image into word vectors, obtaining the text vector of each reference image; the cosine distance between the text vectors of every two reference images is then calculated and taken as the similarity.
The Word2Vec word-vector model is a model that learns semantic knowledge from a large amount of text in an unsupervised manner. By training on a large amount of text, it represents the words in the text as vectors, which we call word vectors; the relationship between two words can then be learned by calculating the distance between their word vectors.
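A sketch of this step with gensim's Word2Vec (gensim 4.x API assumed); `corpus_texts` is a hypothetical collection of recognized texts used to train the word vectors, and averaging word vectors into a single text vector is one simple choice among several:

    import numpy as np
    from gensim.models import Word2Vec

    sentences = [text.split() for text in corpus_texts]   # `corpus_texts` assumed available
    w2v = Word2Vec(sentences, vector_size=100, min_count=1)

    def text_vector(text: str) -> np.ndarray:
        vectors = [w2v.wv[w] for w in text.split() if w in w2v.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

    def word_vector_similarity(text_a: str, text_b: str) -> float:
        a, b = text_vector(text_a), text_vector(text_b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))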
In one embodiment, inputting the reference image pair into the neural network model for training may specifically include:
randomly selecting one reference image from the reference image pair as a training image, and using the other reference image of the pair as a verification image;
inputting the training image into the neural network model for training;
and judging whether the training result meets the requirements includes:
verifying the trained neural network model according to the verification image, and judging that the training result does not meet the requirements if the verification result does not satisfy a preset iteration stop condition.
In this embodiment, one reference image can be randomly selected from a reference image pair as the training image, and the other reference image of the pair used as the verification image; the neural network model is trained with the training image and verified with the verification image after each training run. If the verification result does not satisfy the preset iteration stop condition, the training result is judged not to meet the requirements. The verification result may include whether the predicted similarity is the same as or different from the similarity of the corresponding text features; for example, the similarity between the text information of the training image output by the neural network model and the text information of the verification image output by the neural network model can be calculated to obtain a predicted similarity, and it is judged whether the predicted similarity is consistent with the similarity of the corresponding text features; if so, the trained neural network model is used as the text recognition model to accurately recognize text information in images.
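The pair-based train/verify split can be sketched as follows; `train_step` and `validate` are hypothetical helpers standing in for one training update and the iteration-stop check, which this application does not spell out:

    import random

    def train_with_pair_validation(model, reference_pairs, train_step, validate, max_epochs=100):
        """One image of each pair trains the model, the other verifies it."""
        split = [(a, b) if random.random() < 0.5 else (b, a) for a, b in reference_pairs]
        train_images = [t for t, _ in split]
        verification_images = [v for _, v in split]
        for _ in range(max_epochs):
            for image in train_images:
                train_step(model, image)                 # one training update (assumed helper)
            if validate(model, verification_images):     # preset iteration stop condition satisfied
                break
        return model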
In one embodiment, performing random amplification processing on the first image to obtain multiple second images may specifically include:
performing, on the first image, at least one random amplification method among flipping, translating, scaling, rotating and adjusting the weights of the RGB channels of the image, to obtain multiple second images.
In this embodiment, the random amplification methods include, but are not limited to, flipping, translating and scaling the image, adjusting the weights of the RGB channels of the image, and rotating the image. For example, the first image can be flipped and the flipped first image then enlarged to obtain one second image.
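A sketch of these amplification options with Pillow and NumPy; the ranges are illustrative assumptions, and each call applies a single randomly chosen transformation:

    import random
    import numpy as np
    from PIL import Image

    def amplify_once(first_image: Image.Image) -> Image.Image:
        """Apply one randomly chosen amplification to produce a second image."""
        choice = random.choice(["flip", "translate", "scale", "rotate", "rgb"])
        if choice == "flip":
            return first_image.transpose(Image.FLIP_LEFT_RIGHT)
        if choice == "translate":
            dx, dy = random.randint(-10, 10), random.randint(-10, 10)
            return first_image.transform(first_image.size, Image.AFFINE, (1, 0, dx, 0, 1, dy))
        if choice == "scale":
            s = random.uniform(0.8, 1.2)
            return first_image.resize((max(1, int(first_image.width * s)),
                                       max(1, int(first_image.height * s))))
        if choice == "rotate":
            return first_image.rotate(random.uniform(-10, 10), expand=True)
        # Re-weight the R, G and B channels.
        arr = np.asarray(first_image.convert("RGB")).astype(np.float32)
        weights = np.array([random.uniform(0.8, 1.2) for _ in range(3)])
        return Image.fromarray(np.clip(arr * weights, 0, 255).astype(np.uint8))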
Referring to Figure 2, an embodiment of the present application further provides a training device for a text recognition model, including:
an acquisition module 11, configured to acquire a first image containing text information;
an amplification processing module 12, configured to perform random amplification processing on the first image to obtain multiple second images;
a marking module 13, configured to mark the first image and the multiple second images as reference images;
a calculation module 14, configured to obtain the text features of the text information in each reference image and calculate the similarity of the text features of every two reference images;
an input module 15, configured to take two reference images whose similarity is greater than a preset similarity threshold as a reference image pair and input the reference image pair into a neural network model for training;
a judgment module 16, configured to obtain the training result of the trained neural network model and judge whether the training result meets the requirements;
a determination module 17, configured to use the trained neural network model as the text recognition model when it is determined that the training result meets the requirements.
The object recognized by the text recognition model of this application is an image containing text information; the text information in the image is recognized so as to realize the image's text recognition function. The first image with text information obtained in this embodiment may be an image uploaded by a user, for example one obtained by scanning a paper document or a document on another medium that carries text information, or it may be a screenshot of the content of a mobile phone screen, and so on.
In one embodiment, after the first image containing text information is acquired, the first image may also be pre-processed, for example by adjusting image parameters such as its size, brightness and sharpness. In addition, the first image is usually in color and contains many colors, while the characters of the text information are mostly in colors with relatively dark brightness values; to make it easier to extract each character of the text information from the first image, the first image may also be binarized against a preset brightness value and converted into a black-and-white image, so as to highlight the text information in the first image and avoid color interference.
Specifically, the server obtains the color brightness values in the first image and compares them with a preset color brightness value to obtain a comparison result, which indicates whether a color brightness value in the first image is greater than, equal to or less than the preset color brightness value; according to the comparison result, the parts of the first image whose color brightness value is greater than the preset color brightness value are converted to white, and the others are converted to black, which makes it easier to extract each character of the text information from the first image. The preset color brightness value can be adjusted as needed.
In one embodiment, when it is detected that the background of the first image is black and the text information is white, i.e. white text on a black background, the server of this embodiment may also judge the background color of the first image and, to avoid affecting the recognition of the text information, convert the first image whose background is black and whose text information is white into an image whose background is white and whose text information is black; that is, an image with white text on a black background is converted into an image with black text on a white background.
In this embodiment, random amplification processing is a method of expanding data; it can increase the number of samples in the training set, effectively alleviate model over-fitting, and give the model stronger generalization ability. The purpose of random amplification processing is to make the training data as close as possible to the test data, thereby improving prediction accuracy. In addition, random amplification processing can force the network to learn more robust features, so that the model generalizes better.
In this embodiment, random amplification processing is performed on the first image, for example enlarging, reducing, cropping, brightness adjustment or saturation adjustment; a single random amplification method may be used, or several methods may be combined, finally yielding multiple second images. The image amplification technique of this embodiment has a positive effect on object detection in deep learning: it can increase the amount of data in each category and keep the categories balanced, avoiding the over-fitting problems caused by sample imbalance, and it can also reduce, to a certain extent, the amount of data that has to be collected in the earlier sampling stage.
This embodiment marks the first image and the multiple second images as reference images, generates a data set containing all reference images, then obtains the text features of the text information in each reference image from the data set, and calculates the similarity of the text features of every two reference images. Specifically, the text position information of the text information can be identified in a reference image; the reference image is corrected according to the text position information to obtain a corrected reference image; the encoding network of the recognition model performs feature extraction on the text information of the corrected reference image to obtain text features; then, based on the word features contained in the text features of every two reference images, a vector space model for calculating the similarity between the text features of the two reference images is constructed; according to the vector space model, the word features of the two reference images are represented as word vectors; and, following the cosine distance algorithm, the cosine of the angle between the word vectors of the two reference images is calculated and taken as the similarity of the text features of the two reference images.
The text position information may be the position information, within the reference image, of a text box containing the text information. For example, a text area containing text information is identified in the reference image, and the position information of that text area is taken as the text position information of the text information; for instance, after a text area containing content is identified in the reference image, the position information of the corresponding virtual text box within the whole reference image is calculated and used as the text position information of the text information.
Based on the calculated similarity of the text features of every two reference images, this embodiment takes two reference images whose similarity is greater than a preset similarity threshold as a reference image pair, uses the reference image pairs as training data, and inputs the reference image pairs into the neural network model for training, so that the trained text recognition model can exploit the correlation between training data, improving the recognition accuracy of the text recognition model. The preset similarity threshold can be set as desired, for example to 0.9.
At the same time, this application can also introduce a blockchain structure and make full use of the relevant properties of the blockchain (for example, that data on the blockchain cannot be tampered with). Before training, the training data are uploaded to the blockchain for evidence storage; during training, the data associated with the training process are uploaded to the blockchain for evidence storage, so that later, if necessary, a triggered supervision server can retrieve the relevant data stored on the blockchain and trace back through them to reconstruct the training process, and then detect from the reconstructed training process whether any risky behavior occurred during training, thereby protecting the data security of the data provider and improving the security and credibility of the training process.
This embodiment can set iteration conditions for the neural network model, such as the number of training iterations or the training duration. When the neural network model satisfies the iteration conditions, training ends; at this point the training result of the trained neural network model is obtained and it is judged whether the training result meets the requirements. When the training result is judged to meet the requirements, the trained neural network model is used as the text recognition model for recognizing text information in images.
The training result may include the recognized text information of each reference image in a reference image pair, marked as the target text information of that reference image. This embodiment can calculate the similarity of the target text information of the two reference images in a reference image pair to obtain a predicted similarity, and judge whether the predicted similarity is consistent with the similarity of the corresponding text features; if so, the trained neural network model is used as the text recognition model to accurately recognize text information in images.
As described above, it can be understood that each component of the training device for a text recognition model proposed in this application can implement the functions of any of the above training methods for a text recognition model; the specific structure will not be described again.
Referring to Figure 3, an embodiment of the present application further provides a computer device, whose internal structure may be as shown in Figure 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a storage medium and an internal memory. The storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the storage medium. The database of the computer device is used to store data related to the training method of the text recognition model. The network interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program implements a training method for a text recognition model.
The processor executes the above training method for a text recognition model, which includes the following steps:
obtaining a first image containing text information;
performing random amplification processing on the first image to obtain multiple second images;
marking the first image and the multiple second images as reference images;
obtaining the text features of the text information in each reference image, and calculating the similarity of the text features of every two reference images;
taking two reference images whose similarity is greater than a preset similarity threshold as a reference image pair, and inputting the reference image pair into a neural network model for training;
obtaining the training result of the trained neural network model, and judging whether the training result meets the requirements;
if so, using the trained neural network model as the text recognition model.
An embodiment of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile, and on which a computer program is stored; when executed by a processor, the computer program implements a training method for a text recognition model, which includes the following steps:
obtaining a first image containing text information;
performing random amplification processing on the first image to obtain multiple second images;
marking the first image and the multiple second images as reference images;
obtaining the text features of the text information in each reference image, and calculating the similarity of the text features of every two reference images;
taking two reference images whose similarity is greater than a preset similarity threshold as a reference image pair, and inputting the reference image pair into a neural network model for training;
obtaining the training result of the trained neural network model, and judging whether the training result meets the requirements;
if so, using the trained neural network model as the text recognition model.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program, which can be stored in a computer-readable storage medium; when executed, the computer program may include the processes of the above method embodiments. Any reference to memory, storage, database or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
In summary, this application provides a training method, device, computer device and storage medium for a text recognition model: a first image containing text information is obtained; random amplification processing is performed on the first image to obtain multiple second images; the first image and the multiple second images are marked as reference images; the text features of the text information in each reference image are obtained; the similarity of the text features of every two reference images is calculated; two reference images whose similarity is greater than a preset similarity threshold are taken as a reference image pair; the reference image pairs are input into a neural network model for training; the training result of the trained neural network model is obtained and it is judged whether the training result meets the requirements; and, when the training result is judged to meet the requirements, the trained neural network model is used as the text recognition model. Data amplification thus increases the amount of training data, and training the neural network model with pairs of highly similar reference images allows the trained text recognition model to exploit the correlation between the training data, thereby improving the recognition accuracy of the text recognition model.

Claims (20)

  1. A training method for a text recognition model, wherein the training method comprises:
    obtaining a first image containing text information;
    performing random amplification processing on the first image to obtain multiple second images;
    marking the first image and the multiple second images as reference images;
    obtaining text features of the text information in each of the reference images, and calculating a similarity of the text features of every two of the reference images;
    taking two of the reference images whose similarity is greater than a preset similarity threshold as a reference image pair, and inputting the reference image pair into a neural network model for training;
    obtaining a training result of the trained neural network model, and judging whether the training result meets requirements;
    if so, using the trained neural network model as the text recognition model.
  2. The method according to claim 1, wherein the judging whether the training result meets the requirements comprises:
    calculating a loss value of the trained neural network model according to the training result and a preset loss function;
    judging whether the loss value is lower than a preset loss value;
    if so, judging that the training result meets the requirements;
    if not, judging that the training result does not meet the requirements.
  3. The method according to claim 2, wherein after the judging that the training result does not meet the requirements, the method further comprises:
    updating parameters of the neural network model based on the loss value, training the neural network model with the updated parameters on the reference image pair again until the training result meets the requirements, and outputting the trained text recognition model.
  4. The method according to claim 1, wherein after the using the trained neural network model as the text recognition model, the method further comprises:
    obtaining a target image to be recognized;
    inputting the target image into the text recognition model to obtain text information of the target image.
  5. The method according to claim 1, wherein the calculating the similarity of the text features of every two of the reference images comprises:
    converting the text features of each of the reference images into vector form to obtain a text vector of each of the reference images;
    calculating a cosine distance between the text vectors of every two of the reference images to obtain the similarity of the text features of the two reference images.
  6. The method according to claim 1, wherein the inputting the reference image pair into the neural network model for training comprises:
    randomly selecting one reference image from the reference image pair as a training image, and using the other reference image of the reference image pair as a verification image;
    inputting the training image into the neural network model for training;
    and the judging whether the training result meets the requirements comprises:
    verifying the trained neural network model according to the verification image, and judging that the training result does not meet the requirements if a verification result does not satisfy a preset iteration stop condition.
  7. The method according to claim 1, wherein the performing random amplification processing on the first image to obtain multiple second images comprises:
    performing, on the first image, at least one random amplification method among flipping, translating, scaling, rotating and adjusting weights of RGB channels of the image, to obtain multiple second images.
  8. A training device for a text recognition model, wherein the training device comprises:
    an acquisition module, configured to acquire a first image containing text information;
    an amplification processing module, configured to perform random amplification processing on the first image to obtain multiple second images;
    a marking module, configured to mark the first image and the multiple second images as reference images;
    a calculation module, configured to obtain text features of the text information in each of the reference images and calculate a similarity of the text features of every two of the reference images;
    an input module, configured to take two of the reference images whose similarity is greater than a preset similarity threshold as a reference image pair and input the reference image pair into a neural network model for training;
    a judgment module, configured to obtain a training result of the trained neural network model and judge whether the training result meets requirements;
    a determination module, configured to use the trained neural network model as the text recognition model when it is determined that the training result meets the requirements.
  9. A computer device, wherein the computer device comprises:
    a processor;
    a memory;
    wherein the memory stores a computer program, and when executing the computer program the processor implements a training method for a text recognition model, wherein the method comprises the following steps:
    obtaining a first image containing text information;
    performing random amplification processing on the first image to obtain multiple second images;
    marking the first image and the multiple second images as reference images;
    obtaining text features of the text information in each of the reference images, and calculating a similarity of the text features of every two of the reference images;
    taking two of the reference images whose similarity is greater than a preset similarity threshold as a reference image pair, and inputting the reference image pair into a neural network model for training;
    obtaining a training result of the trained neural network model, and judging whether the training result meets requirements;
    if so, using the trained neural network model as the text recognition model.
  10. The computer device according to claim 9, wherein the judging whether the training result meets the requirements comprises:
    calculating a loss value of the trained neural network model according to the training result and a preset loss function;
    judging whether the loss value is lower than a preset loss value;
    if so, judging that the training result meets the requirements;
    if not, judging that the training result does not meet the requirements.
  11. The computer device according to claim 10, wherein after the judging that the training result does not meet the requirements, the method further comprises:
    updating parameters of the neural network model based on the loss value, training the neural network model with the updated parameters on the reference image pair again until the training result meets the requirements, and outputting the trained text recognition model.
  12. The computer device according to claim 9, wherein after the using the trained neural network model as the text recognition model, the method further comprises:
    obtaining a target image to be recognized;
    inputting the target image into the text recognition model to obtain text information of the target image.
  13. The computer device according to claim 9, wherein the calculating the similarity of the text features of every two of the reference images comprises:
    converting the text features of each of the reference images into vector form to obtain a text vector of each of the reference images;
    calculating a cosine distance between the text vectors of every two of the reference images to obtain the similarity of the text features of the two reference images.
  14. The computer device according to claim 9, wherein the inputting the reference image pair into the neural network model for training comprises:
    randomly selecting one reference image from the reference image pair as a training image, and using the other reference image of the reference image pair as a verification image;
    inputting the training image into the neural network model for training;
    and the judging whether the training result meets the requirements comprises:
    verifying the trained neural network model according to the verification image, and judging that the training result does not meet the requirements if a verification result does not satisfy a preset iteration stop condition.
  15. The computer device according to claim 9, wherein the performing random amplification processing on the first image to obtain multiple second images comprises:
    performing, on the first image, at least one random amplification method among flipping, translating, scaling, rotating and adjusting weights of RGB channels of the image, to obtain multiple second images.
  16. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor the computer program implements a training method for a text recognition model, wherein the method comprises the following steps:
    obtaining a first image containing text information;
    performing random amplification processing on the first image to obtain multiple second images;
    marking the first image and the multiple second images as reference images;
    obtaining text features of the text information in each of the reference images, and calculating a similarity of the text features of every two of the reference images;
    taking two of the reference images whose similarity is greater than a preset similarity threshold as a reference image pair, and inputting the reference image pair into a neural network model for training;
    obtaining a training result of the trained neural network model, and judging whether the training result meets requirements;
    if so, using the trained neural network model as the text recognition model.
  17. The computer-readable storage medium according to claim 16, wherein the judging whether the training result meets the requirements comprises:
    calculating a loss value of the trained neural network model according to the training result and a preset loss function;
    judging whether the loss value is lower than a preset loss value;
    if so, judging that the training result meets the requirements;
    if not, judging that the training result does not meet the requirements.
  18. The computer-readable storage medium according to claim 17, wherein after the judging that the training result does not meet the requirements, the method further comprises:
    updating parameters of the neural network model based on the loss value, training the neural network model with the updated parameters on the reference image pair again until the training result meets the requirements, and outputting the trained text recognition model.
  19. The computer-readable storage medium according to claim 16, wherein after the using the trained neural network model as the text recognition model, the method further comprises:
    obtaining a target image to be recognized;
    inputting the target image into the text recognition model to obtain text information of the target image.
  20. The computer-readable storage medium according to claim 16, wherein the calculating the similarity of the text features of every two of the reference images comprises:
    converting the text features of each of the reference images into vector form to obtain a text vector of each of the reference images;
    calculating a cosine distance between the text vectors of every two of the reference images to obtain the similarity of the text features of the two reference images.
PCT/CN2022/090160 2022-03-15 2022-04-29 Method and apparatus for training text recognition model, and computer device and storage medium WO2023173546A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210253870.4 2022-03-15
CN202210253870.4A CN114724162A (en) 2022-03-15 2022-03-15 Training method and device of text recognition model, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023173546A1 true WO2023173546A1 (en) 2023-09-21

Family

ID=82238595

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090160 WO2023173546A1 (en) 2022-03-15 2022-04-29 Method and apparatus for training text recognition model, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN114724162A (en)
WO (1) WO2023173546A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117457032A (en) * 2023-12-25 2024-01-26 山东万里红信息技术有限公司 Storage medium destroying method based on volume identification

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity
CN111401375A (en) * 2020-03-09 2020-07-10 苏宁云计算有限公司 Text recognition model training method, text recognition device and text recognition equipment
US20210295162A1 (en) * 2019-01-04 2021-09-23 Ping An Technology(Shenzhen)Co.,Ltd. Neural network model training method and apparatus, computer device, and storage medium
CN114005012A (en) * 2021-11-05 2022-02-01 北京市商汤科技开发有限公司 Training method, device, equipment and storage medium of multi-mode pre-training model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104510B (en) * 2019-11-15 2023-05-09 南京中新赛克科技有限责任公司 Text classification training sample expansion method based on word embedding
CN112818975A (en) * 2021-01-27 2021-05-18 北京金山数字娱乐科技有限公司 Text detection model training method and device and text detection method and device
CN114036907A (en) * 2021-11-18 2022-02-11 国网江苏省电力有限公司电力科学研究院 Text data amplification method based on domain features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity
US20210295162A1 (en) * 2019-01-04 2021-09-23 Ping An Technology(Shenzhen)Co.,Ltd. Neural network model training method and apparatus, computer device, and storage medium
CN111401375A (en) * 2020-03-09 2020-07-10 苏宁云计算有限公司 Text recognition model training method, text recognition device and text recognition equipment
CN114005012A (en) * 2021-11-05 2022-02-01 北京市商汤科技开发有限公司 Training method, device, equipment and storage medium of multi-mode pre-training model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117457032A (en) * 2023-12-25 2024-01-26 山东万里红信息技术有限公司 Storage medium destroying method based on volume identification
CN117457032B (en) * 2023-12-25 2024-03-22 山东万里红信息技术有限公司 Storage medium destroying method based on volume identification

Also Published As

Publication number Publication date
CN114724162A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
WO2020098074A1 (en) Face sample picture marking method and apparatus, computer device, and storage medium
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
US20230119593A1 (en) Method and apparatus for training facial feature extraction model, method and apparatus for extracting facial features, device, and storage medium
CN110866530A (en) Character image recognition method and device and electronic equipment
CN111079841A (en) Training method and device for target recognition, computer equipment and storage medium
CN111476268A (en) Method, device, equipment and medium for training reproduction recognition model and image recognition
CN111191695A (en) Website picture tampering detection method based on deep learning
WO2021164481A1 (en) Neural network model-based automatic handwritten signature verification method and device
US20170185913A1 (en) System and method for comparing training data with test data
US11893773B2 (en) Finger vein comparison method, computer equipment, and storage medium
US20230215125A1 (en) Data identification method and apparatus
CN113723070A (en) Text similarity model training method, text similarity detection method and text similarity detection device
CN113806613B (en) Training image set generation method, training image set generation device, computer equipment and storage medium
WO2023173546A1 (en) Method and apparatus for training text recognition model, and computer device and storage medium
CN117593752B (en) PDF document input method, PDF document input system, storage medium and electronic equipment
CN113283388B (en) Training method, device, equipment and storage medium of living body face detection model
CN111898544B (en) Text image matching method, device and equipment and computer storage medium
CN114328942A (en) Relationship extraction method, apparatus, device, storage medium and computer program product
Machado et al. Improving face detection
CN114519416A (en) Model distillation method and device and electronic equipment
CN113516148A (en) Image processing method, device and equipment based on artificial intelligence and storage medium
CN114283429A (en) Material work order data processing method, device, equipment and storage medium
Li et al. Unsupervised steganalysis over social networks based on multi-reference sub-image sets
US20240176951A1 (en) Electronic document validation
US11631267B1 (en) Systems and methods for utilizing a tiered processing scheme

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931573

Country of ref document: EP

Kind code of ref document: A1