CN118196819A - Text character detection method, computer device, and storage medium - Google Patents

Text character detection method, computer device, and storage medium

Info

Publication number
CN118196819A
CN118196819A (application CN202211605061.1A)
Authority
CN
China
Prior art keywords
feature vector
loss function
character
dimension
data
Prior art date
Legal status
Pending
Application number
CN202211605061.1A
Other languages
Chinese (zh)
Inventor
许玉辉
Current Assignee
Beijing Kuangshi Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority claimed from CN202211605061.1A
Publication of CN118196819A

Landscapes

  • Image Analysis (AREA)

Abstract

The present application relates to a text character detection method, a computer device, a storage medium and a computer program product. The text detection method comprises the following steps: acquiring a certificate image of a target certificate; inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category. With this method, the loss function of the field frame and the loss function of the character frame are calculated separately, so the trained text detection model accurately learns the features of the character frame as well as those of the field frame. Text detection based on this model therefore preserves the convergence accuracy of the character frame while preserving that of the field frame, and accurate character information of the target certificate image can be obtained.

Description

Text character detection method, computer device, and storage medium
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular, to a text character detection method, a computer device, a storage medium, and a computer program product.
Background
With the rapid development of artificial intelligence, neural-network-based text recognition has emerged; it is applied in particular to the acquisition and verification of identity information.
In the related art, a neural network model can be trained with the YOLO algorithm (You Only Look Once, a deep-neural-network object recognition and localization algorithm) to detect text information in an image. Specifically, the model is trained on a plurality of sample images containing text information: a prediction is made on each sample image to obtain predicted text information, a loss function is calculated from that prediction, and after repeated training rounds the loss function converges and a trained text detection model is obtained. However, a sample image contains regions of different types, such as field frame regions and character frame regions. When the sample image is trained as a whole and a single loss function converges, the convergence precision differs across region types and the gap in detection precision between regions is large; consequently, a model trained in this way yields low-precision text detection results.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text character detection method, a computer device, a storage medium, and a computer program product that are capable of balancing character detection accuracy and field detection accuracy.
In a first aspect, the present application provides a text character detection method. The method comprises the following steps:
acquiring a certificate image of a target certificate;
Inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
In one embodiment, training the text detection model based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category includes:
Performing weighted calculation on the first regression loss function and the second regression loss function based on a first weight corresponding to the first regression loss function and a second weight corresponding to the second regression loss function, and determining a coordinate frame regression loss function;
Determining a comprehensive loss function according to the target classification loss function and the coordinate frame regression loss function;
and updating network parameters of the text detection model to be trained according to the comprehensive loss function until the comprehensive loss function meets the target training completion condition, and obtaining the trained text detection model.
In one embodiment, the first weight is greater than the second weight.
In one embodiment, the first regression loss function and the second regression loss function are generated by:
acquiring training data, wherein the training data comprises a plurality of sample images, the sample images comprise sample image data and sample marking data, and the sample marking data comprises first sample marking data of a character frame type and second sample marking data of a field frame type;
Determining predictive marker data corresponding to the sample image according to a text detection model to be trained, the sample image data, the first sample marker data and the second sample marker data, wherein the predictive marker data comprises first predictive marker data of a character frame type and second predictive marker data of a field frame type;
Determining a first regression loss function corresponding to the character frame category according to the target regression loss function, the first sample marking data and the first prediction marking data, and determining a second regression loss function corresponding to the field frame category according to a preset regression loss function, the second sample marking data and the second prediction marking data.
In one embodiment, inputting the certificate image into the text detection model and detecting characters in the certificate image through the text detection model to obtain the character detection result information of the target certificate includes:
Downsampling the certificate image to obtain a first compressed feature vector;
Downsampling the first compressed feature vector to obtain a second compressed feature vector;
downsampling the second compressed feature vector to obtain a third compressed feature vector, and determining the third compressed feature vector as a feature vector of a first dimension;
performing up-sampling processing on the feature vector of the first dimension to obtain a target feature vector;
Fusing the target feature vector and the second compressed feature vector to obtain a feature vector with a second dimension;
Fusing the feature vector of the second dimension with the first compressed feature vector to obtain a feature vector of a third dimension;
and determining character detection result information of the target certificate based on the feature vector of the first dimension, the feature vector of the second dimension and the feature vector of the third dimension.
In one embodiment, determining the character detection result information of the target certificate based on the feature vector of the first dimension, the feature vector of the second dimension and the feature vector of the third dimension includes:
Performing convolution calculation on the feature vector of the first dimension to obtain mark data of the first dimension;
performing convolution calculation on the feature vector of the second dimension to obtain mark data of the second dimension;
performing convolution calculation on the feature vector of the third dimension to obtain marking data of the third dimension;
and screening the marking data of the first dimension, the marking data of the second dimension and the marking data of the third dimension according to the size information of the character frame and the size information of the field frame, respectively, to obtain first character detection result information of the character frame category and second character detection result information of the field frame category.
In one embodiment, the sample image is a composite sample image; the sample image is synthesized by:
Erasing text information of the initial certificate image to obtain a first image;
labeling the field frames in the first image to obtain a second image;
and obtaining a random text through a random text synthesis algorithm, and carrying out fusion processing on the random text and the second image to obtain a sample image.
In a second aspect, the application further provides a text character detection device. The device comprises:
The first acquisition module is used for acquiring a certificate image of the target certificate;
the first determining module is used for inputting the certificate image into a text detection model, detecting characters in the certificate image through the text detection model, and obtaining character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
In one embodiment, the text character detecting apparatus further includes:
the second determining module is used for carrying out weighted calculation on the first regression loss function and the second regression loss function based on a first weight corresponding to the first regression loss function and a second weight corresponding to the second regression loss function, and determining a coordinate frame regression loss function;
the third determining module is used for determining a comprehensive loss function according to the target classification loss function and the coordinate frame regression loss function;
And the training module is used for updating the network parameters of the text detection model to be trained according to the comprehensive loss function until the comprehensive loss function meets the target training completion condition, so as to obtain the trained text detection model.
In one embodiment, the text character detecting apparatus further includes:
The second acquisition module is used for acquiring training data, wherein the training data comprises a plurality of sample images, the sample images comprise sample image data and sample marking data, and the sample marking data comprises first sample marking data of a character frame type and second sample marking data of a field frame type;
A fourth determining module, configured to determine, according to the text detection model to be trained, the sample image data, the first sample marking data and the second sample marking data, the prediction marking data corresponding to the sample image, where the prediction marking data includes first prediction marking data of the character frame category and second prediction marking data of the field frame category;
And a fifth determining module, configured to determine a first regression loss function corresponding to the character frame category according to the target regression loss function, the first sample marking data and the first prediction marking data, and determine a second regression loss function corresponding to the field frame category according to a preset regression loss function, the second sample marking data and the second prediction marking data.
In one embodiment, the first determining module is specifically configured to:
Downsampling the certificate image to obtain a first compressed feature vector;
Downsampling the first compressed feature vector to obtain a second compressed feature vector;
downsampling the second compressed feature vector to obtain a third compressed feature vector, and determining the third compressed feature vector as a feature vector of a first dimension;
performing up-sampling processing on the feature vector of the first dimension to obtain a target feature vector;
Fusing the target feature vector and the second compressed feature vector to obtain a feature vector with a second dimension;
Fusing the feature vector of the second dimension with the first compressed feature vector to obtain a feature vector of a third dimension;
and determining character detection result information of the target certificate based on the feature vector of the first dimension, the feature vector of the second dimension and the feature vector of the third dimension.
In one embodiment, the first determining module is further specifically configured to:
Performing convolution calculation on the feature vector of the first dimension to obtain mark data of the first dimension;
performing convolution calculation on the feature vector of the second dimension to obtain mark data of the second dimension;
performing convolution calculation on the feature vector of the third dimension to obtain marking data of the third dimension;
And screening the marking data of the first dimension, the marking data of the second dimension and the marking data of the third dimension according to the size information of the character frame and the size information of the field frame, respectively, to obtain first character detection result information of the character frame category and second character detection result information of the field frame category.
In one embodiment, the sample image is a composite sample image; the text detection device further includes:
The erasing module is used for erasing text information of the initial certificate image to obtain a first image;
the labeling module is used for labeling the field frames in the first image to obtain a second image;
and the fusion module is used for obtaining a random text through a random text synthesis algorithm, and carrying out fusion processing on the random text and the second image to obtain a sample image.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a certificate image of a target certificate;
Inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a certificate image of a target certificate;
Inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring a certificate image of a target certificate;
Inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
With the above text character detection method, computer device, storage medium and computer program product, the method comprises: acquiring a certificate image of a target certificate; inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category. In this way, the loss function of the field frame and the loss function of the character frame are calculated separately, so the trained text detection model accurately learns the features of both the character frame and the field frame, and field detection and character detection are decoupled. Text detection with the resulting model preserves the convergence accuracy of the character frame as well as that of the field frame, avoids missed and repeated detection of character frames, greatly improves text detection accuracy, and yields accurate character information of the target certificate image.
Drawings
FIG. 1 is a flow diagram of a text character detection method in one embodiment;
FIG. 2 is a flow diagram of training steps for a text detection model in one embodiment;
FIG. 3 is a flow chart illustrating the steps for calculating a loss function in one embodiment;
FIG. 4 is a flowchart illustrating steps for determining character detection result information in one embodiment;
FIG. 5 is a flowchart illustrating steps for determining a multi-size character detection result in one embodiment;
FIG. 6 is a flow chart illustrating the steps of synthesizing a sample image in one embodiment;
FIG. 7 is a flow chart illustrating the step of determining feature vectors for multiple dimensions in one embodiment;
FIG. 8 is a flowchart illustrating a step of predicting the predicted tag data corresponding to the sample image according to one embodiment;
FIG. 9A is a schematic illustration of an initial image in one embodiment;
FIG. 9B is a schematic diagram of a first image in one embodiment;
FIG. 9C is a diagram of address information in one embodiment;
FIG. 9D is a schematic diagram of a sample image in one embodiment;
FIG. 9E is a schematic diagram of sample label data corresponding to a sample image in one embodiment;
FIG. 10 is a block diagram of a text character detecting device in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Practical text detection scenarios present many technical problems: the shooting background of an image containing text to be detected is complex and changeable; because capture scenes vary widely, image quality may often be low, with blurring, shadows, reflections, occlusion and the like; and the image angle may change arbitrarily with the shooting angle.
In the related art, an electronic device generally performs text detection by line detection, i.e., detecting the text region of each line; if a field is long, the region may be split at punctuation marks. By algorithm type, conventional text detection methods fall roughly into two classes: segmentation-based and coordinate-frame-regression-based. Line detection can locate text regions effectively, but in real scenes the accuracy of line recognition is lower than that of character recognition, so it cannot meet service requirements.
Because line recognition has lower precision, a higher-precision character recognition method can be adopted instead, performing recognition by way of character detection. However, character detection is prone to missed detection and repeated detection of characters: missed detection can occur when digits and letters are mixed into the text, and a Chinese character with left and right components may produce multiple character detection frames, which degrades recognition accuracy.
Against this background, the missed-detection and repeated-detection problems of characters can be addressed with a YOLO detection algorithm. However, because the area of a text region frame (field frame) in a certificate image is large compared with the area of a character frame, the field frame contributes a larger proportion than the character frame when the loss is computed with an IOU Loss function. The convergence of the loss function therefore skews toward the larger text regions, so the trained model detects text region frames with high accuracy but character frames with low accuracy. For this reason, the text detection model used in this text character detection method decouples text field detection from character detection and computes the loss function of the text coordinate frame and the loss function of the character coordinate frame separately. This keeps the convergence accuracy of the text coordinate frame high while avoiding missed and repeated detection of character coordinate frames, greatly improving character detection accuracy and balancing the detection accuracy of text coordinate frames and character coordinate frames.
In one embodiment, as shown in FIG. 1, a text character detection method is provided. The method is described here as applied to a terminal, but it may equally be applied to a server, or to a system comprising a terminal and a server and implemented through their interaction. The terminal may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet of Things device or portable wearable device; the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart in-vehicle device and the like; the portable wearable device may be a smart watch, smart bracelet, head-mounted device and the like. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers. In this embodiment, the text character detection method includes the following steps:
Step 102, obtaining a certificate image of a target certificate.
The target certificate is a certificate of a target object determined by the actual application scenario; it may be a certificate, card, license or ticket, for example an identity card, bank card or business license.
In this embodiment, the terminal may obtain the document image corresponding to the target document according to the requirement of the actual application scenario, for example, the document image of the target document obtained by the image acquisition device or the document image of the target document input by the user.
Step 104, inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate.
The text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
In this embodiment, the terminal may acquire training data, input sample image data, first sample marking data and second sample marking data to a prediction module in a text detection model to be trained, and determine prediction marking data corresponding to the sample image; determining a first regression loss function corresponding to the character frame category according to a preset regression loss function, first sample marking data and first prediction marking data, and determining a second regression loss function corresponding to the field frame category according to the preset regression loss function, second sample marking data and second prediction marking data; determining a composite loss function based on the first regression loss function and the second regression loss function; and updating network parameters of the text detection model to be trained according to the comprehensive loss function, and returning to the step of executing the acquisition of training data until the comprehensive loss function meets the preset training completion condition, so as to obtain the trained text detection model.
On this basis, the terminal can input the acquired certificate image into the trained text detection model, which processes the input image, detects the characters it contains, and obtains the character detection result information corresponding to the target certificate. The specific processing may be as follows: the text detection model extracts features from the certificate image to obtain feature vectors of several different dimensions, each containing the coordinate frame feature information of the certificate image at a different scale. The terminal inputs these multi-dimensional feature vectors into the prediction module of the trained text detection model, which predicts from them to produce an output result. From this output the terminal obtains the prediction marking data corresponding to the certificate image data and the text information corresponding to each piece of prediction marking data, i.e. the text information of the certificate image of the target certificate.
The prediction marking data may include marking data of several categories, for example marking data corresponding to the field coordinate frame category and marking data corresponding to the character coordinate frame category.
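As a rough illustration of the inference flow just described, the sketch below pushes a certificate-image tensor through a stand-in detector and reads back a raw prediction map. Everything here is an assumption for illustration (the layer stack, the 6-channel output layout, the 288×288 input); the patented model's actual structure, with three scales and feature fusion, is detailed in the embodiments below.

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Stand-in for the text detection model; the architecture is illustrative."""
    def __init__(self, out_ch=6):  # assumed layout: x, y, w, h, objectness, class
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, out_ch, kernel_size=1)

    def forward(self, x):
        return self.head(self.backbone(x))  # raw prediction map

model = TinyDetector().eval()
image = torch.rand(1, 3, 288, 288)  # stand-in for a preprocessed certificate image
with torch.no_grad():
    pred = model(image)
print(pred.shape)  # torch.Size([1, 6, 72, 72])
```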
According to the above text detection method, a certificate image of a target certificate is acquired; the certificate image is input into a text detection model, and the characters in the certificate image are detected by the text detection model to obtain the character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category. Because the loss function of the field frame and the loss function of the character frame are calculated separately, the trained text detection model accurately learns the features of both the character frame and the field frame, and field detection and character detection are decoupled. Text detection with the resulting model therefore preserves the convergence accuracy of the field frame as well as that of the character frame, avoids missed and repeated detection of character frames, and greatly improves character detection accuracy.
In one embodiment, as shown in fig. 2, the process of training the text detection model based on the first regression loss function corresponding to the character frame category and the second regression loss function corresponding to the field frame category includes:
Step 202, performing weighted calculation on the first regression loss function and the second regression loss function based on the first weight corresponding to the first regression loss function and the second weight corresponding to the second regression loss function, and determining the coordinate frame regression loss function.
In this embodiment, the terminal may determine, according to the area of the character coordinate frame and the area of the field coordinate frame, a first weight corresponding to the character frame category and a second weight corresponding to the field frame category; the value of the first weight is larger than that of the second weight. The first weight may be a scaling factor corresponding to the character frame category, the second weight may be a scaling factor of the field frame category, for example, the first weight may be 3, and the second weight may be 1.
Specifically, the terminal may weight the first regression loss function corresponding to the character frame category by the first weight, weight the second regression loss function corresponding to the field frame category by the second weight, and sum the two, obtaining the coordinate frame regression loss function of the text detection model to be trained.
Specifically, the terminal may calculate the coordinate frame regression loss function Lbox by the following formula:

Lbox = αLsbox + βLcbox

where Lsbox is the second regression loss function corresponding to the field frame category, α is the second weight corresponding to the field frame category, Lcbox is the first regression loss function corresponding to the character frame category, and β is the first weight corresponding to the character frame category.
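A minimal sketch of this weighted combination, assuming the two per-category regression losses have already been reduced to scalar tensors; the example values β = 3 and α = 1 follow the weights suggested above.

```python
import torch

def box_regression_loss(loss_char, loss_field, beta=3.0, alpha=1.0):
    """Lbox = alpha * Lsbox + beta * Lcbox, as in the formula above.

    beta (character frame weight) > alpha (field frame weight), biasing
    convergence toward the smaller character frames."""
    return alpha * loss_field + beta * loss_char

l_cbox = torch.tensor(0.42)  # first regression loss (character frames), made-up value
l_sbox = torch.tensor(0.17)  # second regression loss (field frames), made-up value
print(box_regression_loss(l_cbox, l_sbox))  # tensor(1.4300)
```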
Step 204, determining the comprehensive loss function according to the target classification loss function and the coordinate frame regression loss function.
In this embodiment, the terminal may determine the comprehensive loss function as the sum of the target classification loss function and the coordinate frame regression loss function.
Specifically, the target classification loss function (Classification Loss) may be calculated with BCELoss (Binary Cross Entropy); that is, the terminal may determine the target classification loss function from the target loss function and the class loss function.
Specifically, the terminal may calculate the comprehensive loss function by the following formula:

Losstotal = Lbox + γLobj + ρLclass

where Losstotal is the comprehensive loss function, Lbox is the coordinate frame regression loss function, Lobj is the target loss function, Lclass is the class loss function, γ is the weight corresponding to the target loss function, determined according to the actual application scenario, and ρ is the weight corresponding to the class loss function, likewise determined according to the actual application scenario.
Since the target loss function and the class loss function are both binary cross entropy loss functions, they can be calculated by the following formula:

L = −(1/N)·Σi [yi·log σ(xi) + (1 − yi)·log(1 − σ(xi))]

where σ(x) is the predicted value of the target result, lying in [0,1] and representing the confidence of the prediction; yi is the actual value of the target result, with yi = 1 for positive samples and yi = 0 for negative samples; and N is the number of feature points, each of which is classified.
And step 206, updating network parameters of the text detection model to be trained according to the comprehensive loss function until the comprehensive loss function meets the target training completion condition, and obtaining the trained text detection model.
The target training completion condition may be that the comprehensive loss function has converged, that the number of training iterations has reached a target count, or that the training duration has reached a target duration. For example, the target count may be 100 or 300, and the target duration may be 1 hour or 2 hours; the embodiments of the present invention do not specifically limit these values.
Specifically, the terminal determines the network parameters of the text detection model to be trained according to the comprehensive loss function and updates the model based on them, obtaining an updated text detection model. The terminal then feeds further sample images from the training data into the updated model and repeats the training steps above until the calculated loss function has converged, the number of training iterations has reached the target count, or the training duration has reached the target duration, thereby obtaining the trained text detection model.
With this scheme, separate convergence is achieved by decoupling the loss function of the character frame from that of the field frame; at the same time, increasing the proportion coefficient of the character frame's IOU loss biases convergence toward the character frame and balances the feature distributions of the different samples. Model convergence is accelerated while character frame detection becomes more accurate.
In one embodiment, as shown in FIG. 3, the first and second regression loss functions are generated by:
In step 302, training data is acquired; the training data comprises a plurality of sample images, each comprising sample image data and sample marking data.

The sample marking data comprises first sample marking data of the character frame category and second sample marking data of the field frame category.
Step 304, determining prediction mark data corresponding to the sample image according to the text detection model to be trained, the sample image data, the first sample mark data and the second sample mark data.
The prediction marking data comprises first prediction marking data of the character frame category and second prediction marking data of the field frame category.
Step 306, determining a first regression loss function corresponding to the character frame category according to the target regression loss function, the first sample marking data and the first prediction marking data, and determining a second regression loss function corresponding to the field frame category according to the preset regression loss function, the second sample marking data and the second prediction marking data.
Based on the scheme, the loss function of the field frame and the loss function of the character frame can be calculated respectively, the trained text detection model can accurately learn the characteristics of the character frame and accurately learn the characteristics of the field frame, the field detection and the character detection in the text detection are decoupled, the convergence accuracy of the field frame is ensured, the convergence accuracy of the character frame is also ensured, and the accuracy of the text detection is greatly improved under the condition that the missing detection and the repeated detection of the character frame are not caused.
In one embodiment, as shown in FIG. 4, the specific implementation of step 104, inputting the certificate image into the text detection model and detecting the characters in the certificate image through the text detection model to obtain the character detection result information of the target certificate, may include:

Step 402, downsampling the certificate image to obtain a first compressed feature vector.
In this embodiment, the terminal may perform downsampling processing on the certificate image to obtain a compressed first compressed feature vector.
Step 404, performing downsampling processing on the first compressed feature vector to obtain a second compressed feature vector.
In this embodiment, the terminal may perform downsampling processing on the first compressed feature vector to obtain a compressed second compressed feature vector, where the dimensions of the first compressed feature vector and the second compressed feature vector are not the same. For example, the dimension of the first compressed feature vector may be 72 x 72 and the dimension of the second compressed feature vector may be 36 x 36.
And step 406, performing downsampling processing on the second compressed feature vector to obtain a third compressed feature vector, and determining the third compressed feature vector as the feature vector of the first dimension.
In this embodiment, the terminal may perform downsampling processing on the second compressed feature vector to obtain a compressed third compressed feature vector, where the dimensions of the third compressed feature vector and the second compressed feature vector are not the same. For example, the dimension of the third compressed feature vector may be 18×18 and that of the second may be 36×36. The terminal may then take the third compressed feature vector, i.e. the output of the first output branch of the network, as the feature vector of the first dimension, for example 18×18.
The downsampling process to obtain the first compressed feature vector, the downsampling process to obtain the second compressed feature vector, and the downsampling process to obtain the third compressed feature vector may be different.
In step 408, the feature vector of the first dimension is up-sampled to obtain a target feature vector.
In this embodiment, the terminal may perform upsampling processing on the feature vector in the first dimension, that is, upsampling processing on the third compressed feature vector, to obtain a feature vector after upsampling processing, that is, a target feature vector.
In step 410, the target feature vector and the second compressed feature vector are fused to obtain a feature vector of a second dimension.
In this embodiment, the terminal may perform fusion splicing processing on the calculated target feature vector and the second compressed feature vector through a concat algorithm to obtain a feature vector of a second dimension, where the feature vector of the second dimension may be 36×36 dimensions.
In step 412, the feature vector of the second dimension is fused with the first compressed feature vector to obtain a feature vector of the third dimension.
In this embodiment, the terminal may perform fusion splicing processing on the feature vector of the second dimension and the first compressed feature vector through a concat algorithm to obtain a feature vector of a third dimension, where the feature vector of the third dimension may be 72×72 dimensions.

Step 414, determining character detection result information of the target certificate based on the feature vector of the first dimension, the feature vector of the second dimension and the feature vector of the third dimension.
In this embodiment, the terminal may input feature vectors of multiple dimensions to a prediction module in the text detection model to be trained, so that the terminal may predict, through the prediction module, based on the feature vectors of multiple dimensions, to obtain an output result. The terminal can obtain character detection result information of the certificate image based on the output result of the prediction module. The character detection result information may include a plurality of types of detection results, for example, a character detection result corresponding to a field coordinate frame type, and a character detection result corresponding to a character coordinate frame.
In this embodiment, the above feature extraction procedure prunes the network structure of the text detection model and compresses its size, which improves CPU utilization, simplifies the model structure, and accelerates training and convergence without significant degradation of training precision or detection precision.
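A minimal PyTorch sketch of the pipeline of steps 402 to 414, assuming a 288×288 input so the three scales come out to 18×18, 36×36 and 72×72 as in the examples above. The layer types and channel widths are illustrative assumptions; a single strided convolution stands in for each downsampling stage, and nearest-neighbor interpolation for the upsampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeScaleExtractor(nn.Module):
    """Three downsampling stages, one upsampling stage, and concat fusion."""
    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv2d(3, 32, 3, stride=4, padding=1)    # image -> 72x72
        self.down2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)   # 72x72 -> 36x36
        self.down3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)  # 36x36 -> 18x18

    def forward(self, img):
        c1 = F.relu(self.down1(img))  # first compressed feature vector
        c2 = F.relu(self.down2(c1))   # second compressed feature vector
        p1 = F.relu(self.down3(c2))   # third compressed = first dimension (18x18)
        t = F.interpolate(p1, scale_factor=2)  # target feature vector (upsampled)
        p2 = torch.cat([t, c2], dim=1)         # second dimension (36x36)
        p3 = torch.cat([F.interpolate(p2, scale_factor=2), c1], dim=1)  # third (72x72)
        return p1, p2, p3

feats = ThreeScaleExtractor()(torch.rand(1, 3, 288, 288))
print([f.shape[-1] for f in feats])  # [18, 36, 72]
```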
In one embodiment, as shown in FIG. 5, the specific implementation of step 414, determining the character detection result information of the target certificate based on the feature vector of the first dimension, the feature vector of the second dimension and the feature vector of the third dimension, may include:
Step 502, performing convolution calculation on the feature vector of the first dimension to obtain the mark data of the first dimension.
In this embodiment, the terminal may perform a convolution calculation at the first scale on the feature vector of the first dimension through a preset convolution algorithm to obtain the marking data of the first dimension, i.e. the character detection result at the first scale; for example, 18×18 marking data may be obtained.
And step 504, performing convolution calculation on the feature vector of the second dimension to obtain the mark data of the second dimension.
In this embodiment, the terminal may perform a convolution calculation at the second scale on the feature vector of the second dimension through a preset convolution algorithm to obtain the marking data of the second dimension, i.e. the character detection result at the second scale; for example, 36×36 marking data may be obtained.
And step 506, performing convolution calculation on the feature vector of the third dimension to obtain the marking data of the third dimension.
In this embodiment, the terminal may perform a convolution calculation at the third scale on the feature vector of the third dimension through a preset convolution algorithm to obtain the marking data of the third dimension, i.e. the character detection result at the third scale; for example, 72×72 marking data may be obtained.
Step 508, screening the marking data of the first dimension, the second dimension and the third dimension according to the size information of the character frame and the size information of the field frame, respectively, to obtain first character detection result information of the character frame category and second character detection result information of the field frame category.

In this embodiment, the terminal may screen the marking data of the three dimensions according to the size information of the character frame and extract the marking data that matches it, i.e. the first character detection result information of the character frame category; likewise, the terminal may screen the same marking data according to the size information of the field frame and extract the marking data that matches it, i.e. the second character detection result information of the field frame category.
In this embodiment, extracting features and making predictions at multiple scales ensures the accuracy and comprehensiveness of feature extraction from the image and yields accurate character detection results.
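The per-scale prediction heads and the size-based screening of steps 502 to 508 might look as follows. The 1×1 convolution heads, the channel counts (matching the extractor sketch above) and the area threshold are all assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

heads = nn.ModuleList(
    nn.Conv2d(c, 6, kernel_size=1)  # one head per scale -> marking data
    for c in (128, 192, 224)        # channels of the three feature vectors above
)

def split_by_size(boxes, char_max_area=32 * 32):
    """Route small boxes to the character frame result, large ones to the field frame."""
    areas = boxes[:, 2] * boxes[:, 3]  # w * h in pixels
    return boxes[areas <= char_max_area], boxes[areas > char_max_area]

boxes = torch.tensor([[10., 10., 20., 24., 1., 0.],   # (x, y, w, h, obj, cls)
                      [5., 5., 200., 40., 1., 0.]])
char_boxes, field_boxes = split_by_size(boxes)
print(len(char_boxes), len(field_boxes))  # 1 1
```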
In one embodiment, as shown in FIG. 6, the sample image is a composite sample image, synthesized by the following process:
and step 602, erasing text information of the initial certificate image to obtain a first image.
In this embodiment, the terminal may acquire a plurality of initial images; for each initial certificate image among them, the terminal erases the text information it contains with a preset erasing tool, obtaining the background image with the text removed, i.e. the first image.

Specifically, the initial certificate image may be an identity card image, and the terminal may remove the text information it contains with a preset erasing tool (e.g., Photoshop) to obtain an identity card image containing only the background, i.e. the first image. The text information contained in the initial image may include both character information and field information.
Step 604, labeling the field frame in the first image to obtain a second image;
The preset labeling tool, used to mark the coordinate positions of the text generation regions, may be the open-source labelme tool, the labelImage tool, or the like.
In this embodiment, the terminal may label the coordinate positions of the areas needing to be spliced in the first image by using a preset labeling tool, and label a plurality of field frames in the first image to obtain the second image. The second image is an image labeled with a plurality of field boxes.
Step 606, obtaining a random text through a random text synthesis algorithm, and performing fusion processing on the random text and the second image to obtain a sample image.
In this embodiment, the terminal may generate a random text based on the text requirements of each field frame through a preset random text synthesis algorithm, and paste the generated random text to the second image, so as to obtain the sample image.
Specifically, the terminal may generate random text according to the text information required by each field frame. For example, through a preset random text synthesis algorithm, the terminal may randomly select 2 to 10 characters from a preset character set to obtain the random text of the field frame corresponding to the name category. Random text for the field frames of the address category and the identity information category is generated in a similar way, and the details are not repeated here.
Optionally, the terminal may place the random text in the field frame corresponding to each category in a preset black background image, randomly determine a target font format, and adjust the random text to the target font format. In addition, the terminal can also adjust the spacing among a plurality of characters in the random text and adjust the line spacing among a plurality of lines of characters to generate a random text image to be spliced. In this way, the terminal can determine the region to be spliced in the second image based on the coordinate information of the field frames of each category, and perform superposition processing on the random text image to be spliced corresponding to the category and the region to be spliced corresponding to the category in the second image to obtain the synthesized sample image.
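A compact sketch of steps 602 to 606, pasting random text into the annotated field frames of an erased background; the character set, font handling and box format are simplified assumptions (a Latin stand-in charset is used so the snippet runs with Pillow's default font).

```python
import random
from PIL import Image, ImageDraw

CHARSET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"  # stand-in; a real set would hold Chinese characters

def synthesize_sample(background, field_boxes):
    """Paste random text into each annotated field frame of the 'second image'."""
    img = background.convert("RGB")
    draw = ImageDraw.Draw(img)
    labels = []
    for (x, y, w, h) in field_boxes:
        text = "".join(random.choices(CHARSET, k=random.randint(2, 10)))
        draw.text((x, y), text, fill="black")  # real code would pick a font sized to h
        labels.append(((x, y, w, h), text))
    return img, labels

sample, labels = synthesize_sample(Image.new("RGB", (400, 250), "white"),
                                   [(30, 40, 200, 24), (30, 90, 300, 24)])
```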
Optionally, the terminal may further apply simulated-reality processing to the synthesized sample image to obtain the processed sample image. The simulated-reality processing includes one or more of Gaussian blur, salt-and-pepper noise, contrast stretching, sharpening and motion blur, so that low-quality conditions of real scenes can be simulated and the fidelity of the sample image is ensured.
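Two of the listed degradations, sketched with Pillow and NumPy; the probabilities and magnitudes are assumptions.

```python
import numpy as np
from PIL import Image, ImageFilter

def simulate_reality(img, blur_prob=0.5, noise_prob=0.5):
    """Randomly apply Gaussian blur and salt-and-pepper noise to a sample."""
    if np.random.rand() < blur_prob:
        img = img.filter(ImageFilter.GaussianBlur(radius=1.5))
    arr = np.asarray(img).copy()
    if np.random.rand() < noise_prob:
        mask = np.random.rand(*arr.shape[:2])
        arr[mask < 0.01] = 0    # pepper
        arr[mask > 0.99] = 255  # salt
    return Image.fromarray(arr)

degraded = simulate_reality(Image.new("RGB", (400, 250), "white"))
```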
In this embodiment, generating random text information avoids collecting users' private and sensitive data while still producing massive numbers of sample images for subsequent training of the text detection model.
In one embodiment, the specific detection process of the text detection model may include:
S1, training data are acquired.
The training data comprises a plurality of sample images, the sample images comprise sample image data and sample marking data, and the sample marking data comprises first sample marking data of a character frame type and second sample marking data of a field frame type.
In this embodiment, the sample image data may be pixel data on the sample image. The electronic device may obtain training data comprising a plurality of sample images, each sample image may be a sample image labeled with a character frame and a field frame.
S2, determining prediction mark data corresponding to the sample image according to the text detection model to be trained, the sample image data, the first sample mark data and the second sample mark data.
The predictive marker data comprises first predictive marker data of a character frame type and second predictive marker data of a field frame type.
In this embodiment, the text detection model to be trained may be a neural network model. The terminal may input the sample image data, the first sample marking data and the second sample marking data together into the text detection model to be trained, obtaining its output result. The output result may be the prediction marking data of the sample image, which includes the first prediction marking data of the character frame category and the second prediction marking data of the field frame category.

The terminal can then perform text detection based on the first prediction marking data and the second prediction marking data to obtain the predicted text information corresponding to the sample image.
S3, determining a first regression loss function corresponding to the character frame type according to a preset regression loss function, first sample marking data and first prediction marking data, and determining a second regression loss function corresponding to the field frame type according to the preset regression loss function, second sample marking data and second prediction marking data.
The preset regression loss function may be the CIoU Loss function.
In this embodiment, the terminal may calculate the regression loss function of the character frame category, i.e. the first regression loss function, from the first sample marking data and the first prediction marking data of the character frame category according to the CIoU Loss function. Similarly, the terminal may calculate the regression loss function of the field frame category, i.e. the second regression loss function, from the second sample marking data and the second prediction marking data of the field frame category according to the CIoU Loss function.
For example, the CIoU Loss function can be calculated by the following formulas:

LCIoU = LDIoU + θω

LDIoU = 1 − IoU + σ²(box, boxgt)/c²

θ = ω/[(1 − IoU) + ω]

ω = (4/π²)·[arctan(wgt/hgt) − arctan(w/h)]²

where θ is the first influence factor; ω is the second influence factor; box is the center coordinate of the sample coordinate frame; boxgt is the center coordinate of the predicted coordinate frame; LDIoU is the loss value calculated with the preset DIoU Loss function; σ²(box, boxgt) is the squared distance between the center coordinate of the sample coordinate frame and that of the predicted coordinate frame; c² is the squared diagonal length of the minimum enclosing rectangle of the two frames; wgt is the length of the predicted coordinate frame and hgt its width; and w is the length of the sample coordinate frame and h its width.
Wherein the first influence factor and the second influence factor may increase the speed of model convergence and the accuracy of the prediction.
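For reference, a sketch of the CIoU loss matching the formulas above, with θ and ω as the two influence factors and boxes given as (center x, center y, width, height). This follows the standard CIoU formulation; it is not code from the patent.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss per box pair; pred/target have shape (N, 4) as (cx, cy, w, h)."""
    px1, py1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    px2, py2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    tx1, ty1 = target[:, 0] - target[:, 2] / 2, target[:, 1] - target[:, 3] / 2
    tx2, ty2 = target[:, 0] + target[:, 2] / 2, target[:, 1] + target[:, 3] / 2

    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    union = pred[:, 2] * pred[:, 3] + target[:, 2] * target[:, 3] - inter
    iou = inter / (union + eps)

    center_dist = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)  # enclosing rectangle width
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)  # enclosing rectangle height
    c2 = cw ** 2 + ch ** 2 + eps                    # squared diagonal

    omega = (4 / math.pi ** 2) * (torch.atan(target[:, 2] / target[:, 3])
                                  - torch.atan(pred[:, 2] / pred[:, 3])) ** 2
    theta = omega / ((1 - iou) + omega + eps)
    return 1 - iou + center_dist / c2 + theta * omega  # L_DIoU + theta * omega

print(ciou_loss(torch.tensor([[50., 50., 20., 30.]]),
                torch.tensor([[52., 49., 22., 28.]])))
```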
S4, determining a comprehensive loss function based on the first regression loss function and the second regression loss function.
In this embodiment, the first regression loss function may be a loss function of a character frame class, and the second regression loss function may be a loss function of a field frame class. The integrated loss function may be an overall loss function in the text detection model to be trained. In this way, the terminal can calculate the coordinate frame regression loss of the text detection model to be trained according to the first regression loss function and the second regression loss function, and calculate the comprehensive loss function of the text detection model to be trained based on the coordinate frame regression loss function.
And S5, updating network parameters of the text detection model to be trained according to the comprehensive loss function, and returning to execute the step of acquiring training data until the comprehensive loss function meets the preset training completion condition, so as to obtain the trained text detection model.
The preset training completion condition may be that the comprehensive loss function has converged, the iteration number of the training data has reached the target number, or the training duration has reached the target duration, etc. For example, the target number of times may be 100 times, 300 times, and the like, and the target duration may be 1 hour, 2 hours, and the like, and the specific values of the target number of times and the target duration are not particularly limited in the embodiment of the present invention.
Specifically, the terminal determines the network parameters of the text detection model to be trained according to the comprehensive loss function and updates the model based on them, obtaining an updated text detection model. The terminal then feeds further sample images from the training data into the updated model and repeats the training steps above until the calculated loss function has converged, the number of training iterations has reached the target count, or the training duration has reached the target duration, thereby obtaining the trained text detection model.
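A schematic training loop for S1 to S5, assuming `model` returns per-category box predictions plus objectness/class scores, `loader` yields matched targets, and `ciou_loss` is the sketch above; the loss weights and epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, loader, alpha=1.0, beta=3.0, gamma=1.0, rho=1.0, epochs=300):
    """Update network parameters from the comprehensive loss until training completes."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(epochs):  # e.g. a target count of 300 iterations
        for img, char_t, field_t, obj_t, cls_t in loader:
            pred_char, pred_field, obj_p, cls_p = model(img)
            l_box = alpha * ciou_loss(pred_field, field_t).mean() \
                  + beta * ciou_loss(pred_char, char_t).mean()  # L_box
            l_obj = F.binary_cross_entropy_with_logits(obj_p, obj_t)
            l_cls = F.binary_cross_entropy_with_logits(cls_p, cls_t)
            loss = l_box + gamma * l_obj + rho * l_cls          # Loss_total
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```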
In one embodiment, the text detection model to be trained may include a feature extraction module and a prediction module. Correspondingly, the specific processing procedure of determining the prediction mark data corresponding to the sample image according to the text detection model to be trained, the sample image data, the first sample mark data and the second sample mark data includes:
And extracting the characteristics of the sample image data, the first sample marking data and the second sample marking data to obtain characteristic vectors with multiple dimensions.
Wherein the feature vectors for the plurality of dimensions may be determined based on size information of the sample image.
In this embodiment, the terminal may input a sample image including sample image data, first sample tag data and second sample tag data to a feature extraction module in a text detection model to be trained, so that the terminal may perform feature extraction on the training data through the feature extraction module to obtain a plurality of feature vectors of different dimensions corresponding to the sample image, where the feature vectors of each dimension respectively include feature information of coordinate frames of different dimensions of the sample image.
And predicting the prediction mark data corresponding to the sample image based on the feature vectors of the multiple dimensions.
In this embodiment, the terminal may input feature vectors of multiple dimensions to a prediction module in the text detection model to be trained, so that the terminal may predict, through the prediction module, based on the feature vectors of multiple dimensions, to obtain an output result. The terminal can obtain the prediction mark data corresponding to the sample image data based on the output result of the prediction module. The predicted tag data may include tag data of a plurality of categories, for example, may include tag data corresponding to a category of a field coordinate frame, and may also include tag data corresponding to a character coordinate frame.
In this embodiment, a text detection model with high accuracy may be obtained.
In one embodiment, as shown in fig. 7, the specific processing procedure of step 202 "extracting features from the sample image data, the first sample tag data, and the second sample tag data to obtain feature vectors of multiple dimensions" includes:
step 702, performing downsampling processing on the sample image data, the first sample tag data and the second sample tag data to obtain a first compressed feature vector.
In this embodiment, the terminal may perform downsampling processing on a sample image including sample image data, first sample tag data, and second sample tag data, to obtain a compressed first compressed feature vector.
Step 704, performing downsampling processing on the first compressed feature vector to obtain a second compressed feature vector.
In this embodiment, the terminal may perform downsampling processing on the first compressed feature vector to obtain a compressed second compressed feature vector, where the dimensions of the first compressed feature vector and the second compressed feature vector are not the same. For example, the dimension of the first compressed feature vector may be 72 x 72 and the dimension of the second compressed feature vector may be 36 x 36.
Step 706, performing downsampling processing on the second compressed feature vector to obtain a third compressed feature vector, and determining the third compressed feature vector as the feature vector of the first dimension.
In this embodiment, the terminal may perform downsampling processing on the second compressed feature vector to obtain a compressed third compressed feature vector, where the dimensions of the third compressed feature vector and the second compressed feature vector are not the same. For example, the dimension of the third compressed feature vector may be 18×18 and the dimension of the second compressed feature vector may be 36×36. In this way, the terminal may determine the third compressed feature vector as the feature vector of the first dimension, that is, take the output of the first output branch of the network as the feature vector of the first dimension; for example, the feature vector of the first dimension may be of 18×18 dimensions.
The downsampling operations used to obtain the first compressed feature vector, the second compressed feature vector, and the third compressed feature vector may differ from one another.
Step 708, upsampling the feature vector of the first dimension to obtain a target feature vector.
In this embodiment, the terminal may perform upsampling processing on the feature vector in the first dimension, that is, upsampling processing on the third compressed feature vector, to obtain a feature vector after upsampling processing, that is, a target feature vector.
And 710, fusing the target feature vector and the second compressed feature vector to obtain a feature vector of a second dimension.
In this embodiment, the terminal may perform fusion splicing on the calculated target feature vector and the second compressed feature vector through a concat operation to obtain a feature vector of the second dimension, where the feature vector of the second dimension may be 36×36 dimensions.
Step 712, fusion processing is performed on the feature vector of the second dimension and the first compressed feature vector to obtain a feature vector of the third dimension.
In this embodiment, the terminal may perform fusion splicing on the feature vector of the second dimension and the first compressed feature vector through a concat operation to obtain a feature vector of the third dimension, where the feature vector of the third dimension may be 72×72 dimensions.
In this embodiment, the above feature extraction process prunes the network structure of the text detection model, compresses the model size, improves CPU utilization, and simplifies the model structure, thereby accelerating model training and convergence without a significant drop in training precision or detection precision.
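A minimal Python (PyTorch) sketch of the down/up-sampling scheme in steps 702 to 712 follows. Only the spatial sizes (72×72 → 36×36 → 18×18) come from the text; the 144×144 input size, the channel widths, the stride-2 convolutions, the nearest-neighbor upsampling, and feeding only the image tensor (rather than image plus mark data) are all assumptions:

    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        def __init__(self, c=64):
            super().__init__()
            def down(ci, co):  # stride-2 conv block (assumed)
                return nn.Sequential(
                    nn.Conv2d(ci, co, 3, stride=2, padding=1), nn.ReLU())
            self.down1 = down(3, c)   # e.g. 144x144 -> 72x72
            self.down2 = down(c, c)   # 72x72 -> 36x36
            self.down3 = down(c, c)   # 36x36 -> 18x18
            self.up = nn.Upsample(scale_factor=2, mode="nearest")

        def forward(self, x):
            f1 = self.down1(x)    # first compressed feature vector
            f2 = self.down2(f1)   # second compressed feature vector
            f3 = self.down3(f2)   # third = feature vector of the first dimension
            target = self.up(f3)  # upsampled target feature vector (step 708)
            d2 = torch.cat([target, f2], dim=1)       # second dimension (step 710)
            d3 = torch.cat([self.up(d2), f1], dim=1)  # third dimension (step 712)
            return f3, d2, d3

Note that the step-712 fusion implies an extra upsampling (not stated explicitly in the text) so that the 36×36 feature matches the 72×72 first compressed feature before concatenation.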
In one embodiment, as shown in fig. 8, the specific processing procedure of the step of "predicting the prediction flag data corresponding to the sample image based on the feature vectors of multiple dimensions" includes:
Step 802, performing convolution calculation on the feature vector of the first dimension to obtain predictive marker data of the first dimension.
In this embodiment, the terminal may perform a first-scale convolution calculation on the feature vector of the first dimension through a preset convolution algorithm to obtain the prediction mark data of the first dimension, for example, prediction mark data of 18×18 dimensions.
And step 804, performing convolution calculation on the feature vector of the second dimension to obtain the prediction mark data of the second dimension.
In this embodiment, the terminal may perform a second-scale convolution calculation on the feature vector of the second dimension through a preset convolution algorithm to obtain the prediction mark data of the second dimension, for example, prediction mark data of 36×36 dimensions.
And step 806, performing convolution calculation on the feature vector of the third dimension to obtain prediction mark data of the third dimension.
In this embodiment, the terminal may perform a third-scale convolution calculation on the feature vector of the third dimension through a preset convolution algorithm to obtain the prediction mark data of the third dimension, for example, prediction mark data of 72×72 dimensions.
Step 808, screening the first size predictive marker data, the second size predictive marker data and the third size predictive marker data according to the size information of the character frame and the size information of the field frame, respectively, to obtain the first predictive marker data of the character frame type and the second predictive marker data of the field frame type.
In this embodiment, the terminal may screen, according to the size information of the character frame, among the predictive flag data of the first size, the predictive flag data of the second size, and the predictive flag data of the third size, to extract predictive flag data conforming to the size information of the character frame, that is, predictive flag data of the character frame class; the terminal may also screen among the first size of predictive flag data, the second size of predictive flag data, and the third size of predictive flag data according to the size information of the field frame, and extract predictive flag data conforming to the size information of the field frame, that is, predictive flag data of the field frame type.
In this embodiment, by performing feature extraction and prediction on the sample image at multiple dimensions, the accuracy and comprehensiveness of feature extraction from the sample image can be ensured, and accurate prediction mark data can be obtained.
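As a concrete illustration of steps 802 to 808, the sketch below pairs one 1×1 convolution head with each feature dimension and then screens the pooled predictions by box size. The channel widths follow the extractor sketch above, while the per-cell output layout and the pixel size priors for the character and field frames are hypothetical:

    import torch.nn as nn

    # One prediction head per feature dimension; 5 outputs per grid cell
    # (x, y, w, h, objectness) is an assumed layout.
    head18 = nn.Conv2d(64, 5, kernel_size=1)    # 18x18 predictions
    head36 = nn.Conv2d(128, 5, kernel_size=1)   # 36x36 predictions
    head72 = nn.Conv2d(192, 5, kernel_size=1)   # 72x72 predictions

    def screen_by_size(preds, lo, hi):
        # Keep decoded boxes whose longer side lies in [lo, hi] pixels.
        return [p for p in preds if lo <= max(p["w"], p["h"]) <= hi]

    # Toy decoded boxes pooled from all three scales; the size priors for
    # the two categories are assumptions.
    all_preds = [{"w": 12, "h": 16}, {"w": 180, "h": 40}, {"w": 10, "h": 14}]
    char_preds = screen_by_size(all_preds, 4, 32)      # character frame class
    field_preds = screen_by_size(all_preds, 32, 512)   # field frame class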
In one embodiment, the specific process of step "determining the composite loss function based on the first regression loss function and the second regression loss function" includes:
And determining a first weight corresponding to the first regression loss function and a second weight corresponding to the second regression loss function. And carrying out weighted calculation according to the first weight, the second weight, the first regression loss function and the second regression loss function, and determining the regression loss function of the coordinate frame. And determining a comprehensive loss function according to the preset classification loss function and the coordinate frame regression loss function.
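A minimal sketch of this weighting, assuming scalar weights and, per claim 3, a character-frame weight larger than the field-frame weight (the specific values 2.0 and 1.0 are illustrative only):

    def composite_loss(cls_loss, char_reg_loss, field_reg_loss,
                       w_char=2.0, w_field=1.0):
        # Coordinate frame regression loss: weighted sum of the first
        # (character frame) and second (field frame) regression losses.
        box_reg_loss = w_char * char_reg_loss + w_field * field_reg_loss
        # Comprehensive loss: classification loss plus regression loss.
        return cls_loss + box_reg_loss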
The steps of the text character detection method provided above are described in detail below with reference to a specific embodiment. Where the actual application scenario of the text detection model is identity card images, identity images are private data, so training data containing multiple sample images may be obtained by synthesis. The process of synthesizing training data may include: a background picture processing stage, a map-region labeling stage, a corpus preparation stage, a field picture synthesis stage, and a final mapping and image processing stage.
Background picture processing stage. A plurality of initial images are acquired; where an initial image is an initial document image, erasing processing may be applied to it. In one example, the initial document image may be an actual identity image (identity document data), so the terminal may erase the actual information in the identity image; the erased initial document image (i.e., the first image) may be a user's identity card image containing name, gender, ethnicity, year of birth, month and day of birth, address, citizen identity number information, and the like, as shown in fig. 9A. During erasure, it must be ensured that the erased character field areas do not differ noticeably from the background texture of the identity card.
Map-region labeling stage. Field frame labeling is performed on the first image through a preset labeling tool to obtain a second image. The terminal can determine the coordinate information of each field frame in the first image, and label the field frames on the first image based on this coordinate information and the preset labeling tool to obtain the second image. The preset labeling tool may be the labelme labeling tool or the like, and the second image may, as shown in fig. 9B, include a name field frame, a gender field frame, an ethnicity field frame, a date-of-birth field frame, an address field frame, a citizen identity number field frame, and the like.
Corpus preparation stage. Random text is obtained through a preset random text synthesis algorithm. The terminal can obtain the random text for each field frame through the preset text synthesis algorithm according to the text format of the field frame's category. For example, the terminal may randomly select 2 to 10 characters from a preset character set through the preset random text synthesis algorithm to obtain the random text for the field frame of the name category. The generation of random text for the field frame of the address category, the field frame of the identity information category, and so on is similar to that for the name category and is not repeated here.
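A minimal sketch of such random text generation for the name field follows; the character set shown is a placeholder, since the preset character set is not disclosed:

    import random

    # Hypothetical character set; the preset set used in practice is not disclosed.
    NAME_CHARSET = "李王张刘陈杨赵黄周吴"

    def random_name_text(charset=NAME_CHARSET, lo=2, hi=10):
        # Randomly select 2 to 10 characters, as described for the name field.
        return "".join(random.choices(charset, k=random.randint(lo, hi)))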
Field picture synthesis stage. The terminal may generate the synthesized field frame image based on the characters of the random text and a background image of a preset target color. For example, the synthesized image for the address field frame may be as shown in fig. 9C, and the address may be: XX Country XXX village XXX in XX City and XX county of XX province community X zone XXX X group XX house.
Mapping and image processing stage. Mapping means attaching the synthesized field frame images to the corresponding field frame positions, that is, performing splicing processing, to obtain the sample image. The terminal can also apply Gaussian blur, salt-and-pepper noise, contrast stretching, sharpening, motion blur, and the like to the sample image to obtain a more realistic sample image. The synthesized sample image may be as shown in fig. 9D, and the sample mark data corresponding to the synthesized sample image may be as shown in fig. 9E.
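The degradations listed above can be sketched with OpenCV as follows; all kernel sizes and strengths are illustrative assumptions:

    import cv2
    import numpy as np

    def degrade(img):
        img = cv2.GaussianBlur(img, (5, 5), 1.0)             # Gaussian blur
        noise = np.random.rand(*img.shape[:2])
        img[noise < 0.01] = 0                                # pepper noise
        img[noise > 0.99] = 255                              # salt noise
        img = cv2.convertScaleAbs(img, alpha=1.2, beta=-10)  # contrast stretch
        sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
        img = cv2.filter2D(img, -1, sharpen)                 # sharpening
        motion = np.zeros((7, 7))
        motion[3, :] = 1.0 / 7                               # horizontal streak
        return cv2.filter2D(img, -1, motion)                 # motion blur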
The method provided by the embodiments of the invention targets the low image quality of identity card images in real scenes, and can meet the practical requirements of character detection and field detection through the yolov algorithm. The method designs separate loss functions for field frame detection and character frame detection, increases the weight of the character detection loss, and decouples field detection from character detection, improving character detection accuracy. The network structure is lightweight, and the upsampling process in the final stage is removed, which accelerates model inference. When the method provided by the embodiments of the invention is used to train the text detection model, character labels and field labels can be annotated separately, achieving separate convergence of the character coordinate frames and the field coordinate frames, reducing the image input size in the forward inference stage, and improving inference speed.
It should be understood that, although the steps in the flowcharts related to the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a text character detection device for realizing the above related text character detection method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the text character detection device or devices provided below may refer to the limitation of the text character detection method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 10, there is provided a text character detecting apparatus 1000 including:
a first obtaining module 1001, configured to obtain a document image of a target document;
a first determining module 1002, configured to input a document image into a text detection model, and detect characters in the document image through the text detection model to obtain character detection result information of a target document; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
In one embodiment, the text character detecting apparatus further includes:
The second determining module is used for carrying out weighted calculation on the first regression loss function and the second regression loss function based on the first weight corresponding to the first regression loss function and the second weight corresponding to the second regression loss function, and determining a coordinate frame regression loss function;
the third determining module is used for determining a comprehensive loss function according to the target classification loss function and the coordinate frame regression loss function;
and the training module is used for updating the network parameters of the text detection model to be trained according to the comprehensive loss function until the comprehensive loss function meets the target training completion condition, and obtaining the trained text detection model.
In one embodiment, the text character detecting apparatus further includes:
The second acquisition module is used for acquiring training data, the training data comprises a plurality of sample images, the sample images comprise sample image data and sample marking data, and the sample marking data comprises first sample marking data of a character frame type and second sample marking data of a field frame type;
A fourth determining module, configured to determine, according to a text detection model to be trained, sample image data, first sample tag data, and second sample tag data, prediction tag data corresponding to a sample image, where the prediction tag data includes first prediction tag data of a character frame type and second prediction tag data of a field frame type;
And a fifth determining module, configured to determine a first regression loss function corresponding to the character frame category according to the target regression loss function, the first sample tag data, and the first prediction tag data, and determine a second regression loss function corresponding to the field frame category according to the preset regression loss function, the second sample tag data, and the second prediction tag data.
In one embodiment, the first obtaining module is specifically configured to:
Downsampling the certificate image to obtain a first compressed feature vector;
Downsampling the first compressed feature vector to obtain a second compressed feature vector;
downsampling the second compressed feature vector to obtain a third compressed feature vector, and determining the third compressed feature vector as a feature vector of a first dimension;
performing up-sampling processing on the feature vector of the first dimension to obtain a target feature vector;
Fusing the target feature vector and the second compressed feature vector to obtain a feature vector with a second dimension;
Fusing the feature vector of the second dimension with the first compressed feature vector to obtain a feature vector of a third dimension;
and determining character detection result information of the target certificate based on the feature vector of the first dimension, the feature vector of the second dimension and the feature vector of the third dimension.
In one embodiment, the first obtaining module is further specifically configured to:
Performing convolution calculation on the feature vector of the first dimension to obtain mark data of the first dimension;
performing convolution calculation on the feature vector of the second dimension to obtain mark data of the second dimension;
performing convolution calculation on the feature vector of the third dimension to obtain marking data of the third dimension;
And screening the first size mark data, the second size mark data and the third size mark data according to the size information of the character frame and the size information of the field frame to obtain first character detection result information of the character frame type and second character detection result information of the field frame type.
In one embodiment, the sample image is a composite sample image; the text detection device further includes:
The erasing module is used for erasing text information of the initial certificate image to obtain a first image;
the labeling module is used for labeling the field frames in the first image to obtain a second image;
and the fusion module is used for obtaining a random text through a random text synthesis algorithm, and carrying out fusion processing on the random text and the second image to obtain a sample image.
The respective modules in the above-described text character detecting apparatus 1000 may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing training data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text character detection method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 11 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, or the like, but are not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A method for text character detection, the method comprising:
acquiring a certificate image of a target certificate;
Inputting the certificate image into a text detection model, and detecting characters in the certificate image through the text detection model to obtain character detection result information of the target certificate; the text detection model is trained based on a first regression loss function corresponding to the character frame category and a second regression loss function corresponding to the field frame category.
2. The method of claim 1, wherein training the text detection model based on a first regression loss function corresponding to a character box category and a second regression loss function corresponding to a field box category comprises:
Performing weighted calculation on the first regression loss function and the second regression loss function based on a first weight corresponding to the first regression loss function and a second weight corresponding to the second regression loss function, and determining a coordinate frame regression loss function;
Determining a comprehensive loss function according to the target classification loss function and the coordinate frame regression loss function;
and updating network parameters of the text detection model to be trained according to the comprehensive loss function until the comprehensive loss function meets the target training completion condition, and obtaining the trained text detection model.
3. The method of claim 2, wherein the first weight is greater than the second weight.
4. The method of claim 2, wherein the first regression loss function and the second regression loss function are generated by:
acquiring training data, wherein the training data comprises a plurality of sample images, the sample images comprise sample image data and sample marking data, and the sample marking data comprises first sample marking data of a character frame type and second sample marking data of a field frame type;
Determining predictive marker data corresponding to the sample image according to a text detection model to be trained, the sample image data, the first sample marker data and the second sample marker data, wherein the predictive marker data comprises first predictive marker data of a character frame type and second predictive marker data of a field frame type;
Determining a first regression loss function corresponding to the character frame category according to the target regression loss function, the first sample marking data and the first prediction marking data, and determining a second regression loss function corresponding to the field frame category according to a preset regression loss function, the second sample marking data and the second prediction marking data.
5. The method of claim 1, wherein the inputting the document image into a text detection model, detecting characters in the document image by the text detection model, and obtaining character detection result information of the target document, comprises:
Downsampling the certificate image to obtain a first compressed feature vector;
Downsampling the first compressed feature vector to obtain a second compressed feature vector;
downsampling the second compressed feature vector to obtain a third compressed feature vector, and determining the third compressed feature vector as a feature vector of a first dimension;
performing up-sampling processing on the feature vector of the first dimension to obtain a target feature vector;
Fusing the target feature vector and the second compressed feature vector to obtain a feature vector with a second dimension;
Fusing the feature vector of the second dimension with the first compressed feature vector to obtain a feature vector of a third dimension;
and determining character detection result information of the target certificate based on the feature vector of the first dimension, the feature vector of the second dimension and the feature vector of the third dimension.
6. The method of claim 5, wherein the determining character detection result information of the target document based on the feature vector of the first dimension, the feature vector of the second dimension, and the feature vector of the third dimension comprises:
Performing convolution calculation on the feature vector of the first dimension to obtain mark data of the first dimension;
performing convolution calculation on the feature vector of the second dimension to obtain mark data of the second dimension;
performing convolution calculation on the feature vector of the third dimension to obtain marking data of the third dimension;
And screening the first size mark data, the second size mark data and the third size mark data according to the size information of the character frame and the size information of the field frame to obtain first character detection result information of the character frame type and second character detection result information of the field frame type.
7. The method of claim 4, wherein the sample image is a composite sample image; the sample image is synthesized by:
Erasing text information of the initial certificate image to obtain a first image;
labeling the field frames in the first image to obtain a second image;
and obtaining a random text through a random text synthesis algorithm, and carrying out fusion processing on the random text and the second image to obtain a sample image.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.