CN111428717B - Text recognition method, text recognition device, electronic equipment and computer readable storage medium - Google Patents

Text recognition method, text recognition device, electronic equipment and computer readable storage medium

Info

Publication number
CN111428717B
CN111428717B
Authority
CN
China
Prior art keywords
text
text box
picture
initial
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010226050.7A
Other languages
Chinese (zh)
Other versions
CN111428717A (en)
Inventor
李月
黄光伟
史新艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202010226050.7A priority Critical patent/CN111428717B/en
Publication of CN111428717A publication Critical patent/CN111428717A/en
Application granted granted Critical
Publication of CN111428717B publication Critical patent/CN111428717B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/24 Aligning, centring, orientation detection or correction of the image
    • G06V 10/242 Aligning, centring, orientation detection or correction of the image by image rotation, e.g. by 90 degrees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Character Input (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The application provides a text recognition method, a text recognition device, an electronic device and a computer-readable storage medium. The method comprises the following steps: acquiring a picture to be identified that contains text information; recognizing the picture to be identified through a pre-trained text detection model, and determining at least one text box containing text in the picture and the tilt direction corresponding to each text box; correcting the text direction of each text box according to its tilt direction to obtain corrected text boxes; and identifying the text information in the corrected text boxes. Embodiments of the application can improve the accuracy of identifying the direction of text content and, in turn, the accuracy of recognizing text in pictures.

Description

Text recognition method, text recognition device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a text recognition method, a text recognition device, an electronic device, and a computer readable storage medium.
Background
To recognize the text content in a picture, all text boxes in the picture must first be detected by a text detection method, each detected text box must be rotated to the horizontal, and finally each text-box picture (with the text direction horizontal and forward) is fed into a recognition model to recognize the text it contains.
A text detection method can obtain the position information of a text box in a picture (center-point coordinates, width, height and angle), but cannot reflect the true orientation of the characters. As shown in fig. 1a, text boxes may be identical in shape and direction while the direction of the text inside them varies greatly. As a result, rotating a text box may leave the text direction wrong, as shown in fig. 1b, which leads to text-content recognition errors.
Disclosure of Invention
The application provides a text recognition method, a text recognition device, an electronic device and a computer-readable storage medium, to solve the problem that text-content recognition errors are easily caused when the true orientation of characters cannot be recognized.
In order to solve the above problems, the present application discloses a text recognition method, comprising:
acquiring a picture to be identified containing text information;
Identifying the picture to be identified through a pre-trained text detection model, and determining at least one text box containing text in the picture to be identified and the corresponding inclination direction of each text box;
correcting the text direction of each text box according to the inclined direction to obtain corrected text boxes with corrected text directions;
identifying text information in the corrected text box.
Optionally, before the obtaining the picture to be identified containing the text information, the method further includes:
determining a pre-trained text detection model;
the determining the text detection model trained in advance comprises:
Acquiring a sample picture; the sample picture comprises at least one initial text box marked in advance, initial position information of each initial text box in the sample picture and initial inclination directions of each initial text box;
Sequentially inputting the sample pictures into an initial text detection model to train the initial text detection model, and determining at least one predictive text box corresponding to the sample pictures, the predictive position information of each predictive text box in the sample pictures and the predictive inclination direction of each predictive text box;
Calculating a loss value of the initial text detection model according to the initial position information, the predicted position information, the initial inclination direction and the predicted inclination direction;
And under the condition that the loss value is in a preset range, taking the trained initial text detection model as the text detection model.
Optionally, the calculating, according to each initial position information, each predicted position information, each initial tilt direction, and each predicted tilt direction, a loss value of the initial text detection model includes:
calculating a position loss value according to the initial position information and the predicted position information;
calculating a tilt loss value according to each initial tilt direction and each predicted tilt direction;
And calculating the loss value of the initial text detection model according to the position loss value, the position weight, the inclination loss value and the inclination weight.
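The weighted combination described above can be sketched as follows; the weight values are not given in the text, so the defaults here are hypothetical:

```python
def detection_loss(position_loss: float, tilt_loss: float,
                   position_weight: float = 1.0, tilt_weight: float = 1.0) -> float:
    """Combine the position loss and the tilt loss into the model's total loss.

    The weight defaults are hypothetical; the text only states that a
    position weight and a tilt weight are applied.
    """
    return position_weight * position_loss + tilt_weight * tilt_loss
```

For example, `detection_loss(2.0, 3.0, position_weight=0.5, tilt_weight=2.0)` yields `7.0`.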
Optionally, the text detection model includes a classification-result acquisition layer and a tilt-direction acquisition layer, and the recognizing the picture to be identified through a pre-trained text detection model and determining at least one text box containing text in the picture to be identified and the tilt direction corresponding to each text box comprises:
invoking the classification result acquisition layer to process the picture to be identified, and acquiring a pixel classification result and a connection classification result on the picture to be identified;
determining at least one text box in the picture to be identified according to the pixel classification result and the connection classification result;
invoking the tilt-direction acquisition layer to process the picture to be identified, and determining candidate tilt directions of the text in the at least one text box and a tilt threshold corresponding to each candidate tilt direction;
and determining the tilt direction corresponding to the at least one text box according to the largest of the tilt thresholds.
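Selecting the final tilt direction by the largest tilt threshold amounts to an argmax over the candidate directions; a minimal sketch (the function name is illustrative):

```python
def pick_tilt_direction(tilt_thresholds):
    """Return the index of the candidate direction whose tilt threshold
    (confidence score) is largest, as described above."""
    return max(range(len(tilt_thresholds)), key=tilt_thresholds.__getitem__)
```

With four direction-interval scores such as `[0.1, 0.7, 0.15, 0.05]`, the chosen interval is index 1.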
Optionally, the correcting processing is performed on each text box according to the oblique direction to obtain a corrected text box, which includes:
when the tilt angle of the text box is determined, according to the tilt direction, to be between 90° and 180°, rotating the text box counterclockwise by 90° to obtain the corrected text box;
when the tilt angle of the text box is determined, according to the tilt direction, to be between 180° and 270°, rotating the text box counterclockwise by 180° to obtain the corrected text box;
and when the tilt angle of the text box is determined, according to the tilt direction, to be between 270° and 360°, rotating the text box counterclockwise by 270° to obtain the corrected text box.
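The three rotation rules can be sketched with NumPy; this is a sketch of the claimed behaviour, using `np.rot90`, which rotates counterclockwise:

```python
import numpy as np

def correct_text_box(box_img: np.ndarray, tilt_angle: float) -> np.ndarray:
    """Rotate a cropped text-box image counterclockwise by 90, 180 or 270
    degrees depending on which interval the tilt angle falls in."""
    if 90 <= tilt_angle < 180:
        return np.rot90(box_img, k=1)   # 90 degrees counterclockwise
    if 180 <= tilt_angle < 270:
        return np.rot90(box_img, k=2)   # 180 degrees counterclockwise
    if 270 <= tilt_angle < 360:
        return np.rot90(box_img, k=3)   # 270 degrees counterclockwise
    return box_img                      # below 90 degrees: leave as-is
```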
Optionally, the identifying text information in the corrected text box includes:
and inputting the corrected text box into a text recognition model, and determining text information contained in the corrected text box through the text recognition model.
In order to solve the above problems, the present application discloses a text recognition apparatus comprising:
The picture to be identified acquisition module is used for acquiring a picture to be identified containing text information;
The picture to be identified identifying module is used for identifying the picture to be identified through a pre-trained text detection model, and determining at least one text box containing text in the picture to be identified and the corresponding inclination direction of each text box;
The corrected text box acquisition module is used for carrying out correction processing on the text direction of each text box according to the inclined direction to obtain corrected text boxes with corrected text directions;
And the text information determining module is used for identifying the text information in the corrected text box.
Optionally, the method further comprises:
A text detection model determination module for determining the text detection model trained in advance;
The text detection model determination module includes:
The sample picture acquisition unit is used for acquiring a sample picture; the sample picture comprises at least one initial text box marked in advance, initial position information of each initial text box in the sample picture and initial inclination directions of each initial text box;
The prediction text box determining unit is used for inputting the sample pictures into an initial text detection model in sequence to train the initial text detection model, determining at least one prediction text box corresponding to the sample pictures, and the prediction position information of each prediction text box in the sample pictures and the prediction inclination direction of each prediction text box;
A loss value calculation unit, configured to calculate a loss value of the initial text detection model according to each of the initial position information, each of the predicted position information, each of the initial tilt directions, and each of the predicted tilt directions;
The text detection model acquisition unit is used for taking the trained initial text detection model as the text detection model under the condition that the loss value is in a preset range.
Optionally, the loss value calculation unit includes:
a position loss value calculating subunit, configured to calculate a position loss value according to each initial position information and each predicted position information;
A tilt loss value calculating subunit, configured to calculate a tilt loss value according to each initial tilt direction and each predicted tilt direction;
And the loss value calculating subunit is used for calculating the loss value of the initial text detection model according to the position loss value, the position weight, the inclination loss value and the inclination weight.
Optionally, the text detection model includes: the classification result acquisition layer and the inclination direction acquisition layer, the picture identification module to be identified comprises:
The positive-pixel acquisition unit is used for invoking the classification-result acquisition layer to process the picture to be identified and acquire a pixel classification result and a connection classification result for the picture to be identified;
a text box determining unit, configured to determine at least one text box in the picture to be identified according to the pixel classification result and the connection classification result;
The inclination threshold determining unit is used for calling the inclination direction acquisition layer to process the picture to be identified, and determining an inclination direction to be confirmed of the text in the at least one text box and an inclination threshold corresponding to the inclination direction to be confirmed;
And the inclination direction determining unit is used for determining the inclination direction corresponding to the at least one text box according to the maximum inclination threshold value in the inclination threshold values.
Optionally, the correction text box acquisition module includes:
A first correction box acquisition unit for rotating the text box counterclockwise by 90 degrees to obtain the correction text box when the inclination angle of the text box is determined to be between 90 degrees and 180 degrees according to the inclination direction;
a second correction frame acquisition unit configured to rotate the text frame counterclockwise by 180 ° to obtain the corrected text frame when it is determined that the tilt angle of the text frame is between 180 ° and 270 ° according to the tilt direction;
And a third correction box acquisition unit, configured to rotate the text box counterclockwise by 270 ° to obtain the correction text box when it is determined that the tilt angle of the text box is between 270 ° and 360 ° according to the tilt direction.
Optionally, the text information determining module includes:
And the text information determining unit is used for inputting the corrected text box into a text recognition model and determining the text information contained in the corrected text box through the text recognition model.
In order to solve the above problems, the present application discloses an electronic device including:
A processor, a memory, and a computer program stored on the memory and executable on the processor, the processor implementing the text recognition method of any of the above when the program is executed.
In order to solve the above-described problems, the present application discloses a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform any of the text recognition methods described above.
Compared with the prior art, the application has the following advantages:
According to the text recognition scheme provided by the embodiment of the application, a picture to be identified containing text information is obtained; the picture is recognized through a pre-trained text detection model to determine at least one text box containing text and the tilt direction corresponding to each text box; the text direction of each text box is corrected according to its tilt direction to obtain corrected text boxes; and the text information in the corrected text boxes is identified. The embodiment of the application fuses a text-direction classification network with a text-box position detection network, and obtains the accurate direction of the text content from the coarse-angle classification result together with the text-box position detection result, thereby improving the accuracy of picture text recognition.
Drawings
FIG. 1a shows a schematic diagram of a text picture;
FIG. 1b shows a schematic representation of a rotated text picture;
FIG. 2 is a flowchart showing steps of a text recognition method according to an embodiment of the present application;
FIG. 3 is a flowchart showing steps of a text recognition method according to an embodiment of the present application;
FIG. 4a shows a schematic diagram of a word to be rotated according to an embodiment of the present application;
FIG. 4b is a schematic diagram of a data annotation according to an embodiment of the present application;
FIG. 4c is a schematic diagram of a network structure according to an embodiment of the present application;
FIG. 4d is a schematic diagram of a text box result provided by an embodiment of the present application;
FIG. 4e shows a schematic diagram of a text image provided by an embodiment of the present application;
FIG. 4f is a schematic diagram of a virtual pen tip frame according to an embodiment of the present application;
fig. 5 shows a schematic structural diagram of a text recognition device according to an embodiment of the present application;
Fig. 6 shows a schematic structural diagram of a text recognition device according to an embodiment of the present application.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description.
Referring to fig. 2, a flowchart of the steps of a text recognition method provided by an embodiment of the present application is shown. In some embodiments the method may be executed by a processor, and it may specifically include the following steps:
step 101: and acquiring a picture to be identified containing the text information.
In some embodiments, the application can be applied to scenarios in which characters in pictures are recognized.
The picture to be identified refers to a picture which contains text information and is used for text identification.
In some examples, the picture to be identified may be a picture randomly selected from the internet, for example, a picture selected from a website and including text information.
In some examples, the picture to be identified may be a picture entered by a user, e.g., a picture entered by a user when searching for desired information using the picture, etc.
It will be appreciated that the above examples are only examples listed for better understanding of the technical solution of the embodiments of the present application, and are not to be construed as the only limitation of the embodiments of the present application.
After obtaining the picture to be recognized containing the text information, step 102 is performed.
Step 102: and identifying the picture to be identified through a pre-trained text detection model, and determining at least one text box containing text in the picture to be identified and the corresponding inclination direction of each text box.
The text detection model is a model obtained by pre-training that is used, at minimum, to detect the tilt direction of text information in a picture and the connection (link) information between text pixels. The training process of the text detection model is described in detail in the following embodiments.
The text box refers to a box, identified by the text detection model, that is formed by connecting four image coordinates and encloses text in the picture. As shown in fig. 4a, in the left half of fig. 4a the word is enclosed by the coordinates labelled 1, 2, 3 and 4, and a text box can be formed by connecting these four coordinates. It will be appreciated that the above examples are only listed for a better understanding of the technical solution of the embodiments of the present application and are not the only limitation of the embodiments.
The tilt direction refers to the direction in which the text tilts, determined relative to the forward display direction of the picture. Since a picture may be photographed either horizontally or vertically, the forward display direction of the picture can be determined from the photographing direction, and the tilt direction of the text in the picture is then determined relative to that forward direction. As shown in fig. 4a, when the picture is displayed forward, the tilt direction of the word in the upper drawing of the left half of fig. 4a is up to the right, and the tilt direction of the word in the lower drawing is up to the left.
After the picture to be identified is acquired, the picture to be identified can be input into a text detection model, and the picture to be identified is identified through the text detection model, so that at least one text box contained in the picture to be identified and the corresponding inclination direction of each text box are determined.
After determining at least one text box containing text in the picture to be identified, and the corresponding tilt direction of each text box, step 103 is performed.
Step 103: and correcting the text direction of each text box according to the inclined direction to obtain corrected text boxes with corrected text directions.
The correction processing refers to the operation of correcting a tilted text box, specifically correcting the coordinates of its four corners and the direction of its characters.
Correcting a text box means correcting the text direction of a tilted text box. As shown in fig. 4a, after correcting the text boxes in the left half of fig. 4a, the corrected text boxes shown in the right half of fig. 4a are obtained.
It will be appreciated that the above examples are only examples listed for better understanding of the technical solution of the embodiments of the present application, and are not to be construed as the only limitation of the embodiments of the present application.
After the text direction of each text box is corrected according to the oblique direction, a corrected text box with corrected text direction is obtained, and step 104 is executed.
Step 104: identifying text information in the corrected text box.
The target text information is text information contained in a picture to be identified, which is obtained by identifying the correction text box.
After correction processing has been performed on each text box according to its tilt direction, recognition processing may be performed on the corrected text boxes to determine the text information contained in each, and the recognition results are then combined to obtain the target text information.
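The overall flow of steps 101-104 can be sketched as follows; all callables are stand-ins for the detection, correction and recognition components described in the text:

```python
def recognize_picture(picture, detect, correct, recognize):
    """Step 101: `picture` is the picture to be identified.
    Step 102: `detect` returns (text-box image, tilt angle) pairs.
    Step 103: `correct` rotates each box to the forward text direction.
    Step 104: `recognize` reads the text in each corrected box."""
    boxes = detect(picture)
    texts = [recognize(correct(img, angle)) for img, angle in boxes]
    return " ".join(texts)  # combine the per-box results into the target text
```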
According to the embodiment, the inclined text box is identified by adopting the text detection model, so that the problem of text identification errors caused by inaccurate identification of the text content direction can be avoided.
According to the text recognition method provided by the embodiment of the application, the picture to be recognized containing text information is obtained, the picture to be recognized is recognized through a pre-trained text detection model, at least one text box containing text in the picture to be recognized and the corresponding inclination direction of each text box are determined, the text direction of each text box is corrected according to the inclination direction, the corrected text box with corrected text direction is obtained, and the text information in the corrected text box is recognized. According to the embodiment of the application, the text direction classification network and the text box position detection network are fused, and then the accurate text content direction is finally obtained according to the large angle classification result and the text box position detection result, so that the picture text recognition accuracy is improved.
Next, the implementation process of the present embodiment will be described in detail with reference to fig. 3.
Referring to fig. 3, a step flowchart of a text recognition method provided by an embodiment of the present application is shown, where the text recognition method specifically includes the following steps:
step 201: a pre-trained text detection model is determined.
The embodiment of the application can be applied to the process of identifying the text in the picture containing the text.
When a picture containing text needs to be identified, a text detection model may be determined first, and in particular, may be described in connection with the following specific implementation manner.
In a specific implementation of the present application, the step 201 may include:
Substep S1: and obtaining a sample picture.
When the tilted text in a picture is recognized, a pre-trained text detection model can be used for the recognition. The text detection model can be as shown in fig. 4c: the backbone network on the left is used for feature extraction and adopts the lightweight MobileNet-V2 model, balancing model depth against model parameter size. The network on the right is a feature-map fusion network: the output feature maps bottleneck, bottleneck2, bottleneck, bottleneck and conv2d are taken out, and feature maps of different sizes are fused through convolution and up-sampling to obtain 3 groups of final results, namely X1: 112×112×2, X2: 112×112×16, and X3: 112×112×4. X1 represents, at each pixel position, the predicted probability of a positive pixel (text pixel) and of a negative pixel (background pixel); X2 represents, at each pixel position, the predicted probability of a link or no link between that pixel and each of its 8 neighbouring pixels; X3 represents, at each pixel position, the predicted probability corresponding to each of the 4 text-direction intervals.
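The three output maps can be illustrated with plain NumPy by treating each head as a 1×1 convolution, i.e. a per-pixel linear map over channels. Only the output channel counts (2, 16, 4) and the 112×112 spatial size come from the text; the fused-feature channel count of 32 is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
fused = rng.standard_normal((112, 112, 32))  # fused feature map (H, W, C); C=32 assumed

def head(x: np.ndarray, out_ch: int) -> np.ndarray:
    """A 1x1 convolution head: a per-pixel linear map over the channel axis."""
    w = rng.standard_normal((x.shape[-1], out_ch))
    return x @ w

x1 = head(fused, 2)   # X1: positive (text) vs negative (background) pixel scores
x2 = head(fused, 16)  # X2: link / no-link scores for each of the 8 neighbours
x3 = head(fused, 4)   # X3: scores for the 4 text-direction intervals
```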
In this embodiment, the different sizes may correspond to three sizes of large, medium and small, and the specific numerical division for the large, medium and small sizes may be determined according to the service requirement, which is not limited in this embodiment.
First, a training process of the text detection model is described.
The sample picture is a picture containing inclined text and used for training a text detection model.
Each sample picture comprises at least one initial text box marked in advance, initial position information of each initial text box in the sample picture and initial inclination direction of each initial text box.
In this embodiment, the initial text box may be pre-labelled by service personnel. Specifically, the personnel may label four vertices according to the position of the text in the sample picture, such that the four vertices just enclose the text; connecting the four vertices then forms a quadrilateral frame, which is the initial text box. For example, as shown in fig. 4b, when labelling an initial text box, the forward display direction of the picture may be used as the reference and the left vertex of the picture as the origin; four vertices that just enclose the text "ABC" in the picture are labelled, and the coordinates of these four vertices are combined into an initial text box.
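A single annotation record of the kind described might look like the following; all concrete values and field names are hypothetical:

```python
# Hypothetical annotation for one sample picture: four vertices, labelled
# 1-4 from the picture's origin, that just enclose the text "ABC",
# plus the initial tilt direction as an index into the 4 coarse intervals.
annotation = {
    "text": "ABC",
    "vertices": [(12, 40), (58, 12), (70, 32), (24, 60)],  # (x, y) in pixels
    "tilt_direction": 0,  # e.g. 0 = tilted up to the right
}
```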
After the sample picture is acquired, a substep S2 is performed.
Substep S2: and sequentially inputting the sample pictures into an initial text detection model, and determining at least one prediction text box corresponding to the sample pictures, prediction position information of each prediction text box in the sample pictures and prediction inclination directions of each prediction text box.
An initial text detection model refers to a text detection model that can identify text in a picture containing the text, but has not been trained.
The predicted text box is a text box, formed by four enclosing coordinates, obtained when the initial text detection model performs text recognition on the sample picture. As shown in fig. 4b, the tilted text is "ABC", and the coordinates of the four points labelled 1, 2, 3 and 4 can be combined; the frame enclosed by these four coordinates is the text box.
The predicted position information refers to a position of the predicted text box in the sample picture, which is obtained through the initial text detection model, and specifically, the position of the predicted text box in the sample picture can be determined according to coordinates of four vertexes of the predicted text box.
The predicted tilt direction refers to a tilt direction of a predicted text box obtained through an initial text detection model, as shown in fig. 4b, after the predicted text box in the sample picture is identified through the initial text detection model, the tilt direction of the predicted text box can be predicted by combining with the display direction of the sample picture, and if the tilt direction of the word "ABC" in the picture is: tilting upward to the right.
Of course, the present application is not limited thereto: for the initial text detection model, the detected tilt direction of a predicted text box is not necessarily the same as the tilt direction of the text box enclosing the text in the sample picture. The above examples are listed only for a better understanding of the technical solution of the embodiments of the present application and do not constitute the only limitation of the embodiments of the present application.
Of course, when determining a text box, it is also necessary to determine the connectivity information between the pixels of the picture, where the connectivity information refers to the connection relationship between pixels: text pixels that have a connection relationship with one another form a text box.
After the sample picture is obtained, the sample picture can be sequentially input into an initial text detection model, at least one prediction text box corresponding to the sample picture, the prediction position information of each prediction text box in the sample picture and the prediction inclination direction of each prediction text box are determined through the initial text detection model, and then the sub-step S3 is executed.
Substep S3: and calculating a loss value of the initial text detection model according to the initial position information, the predicted position information, the initial inclination direction and the predicted inclination direction.
The loss value may represent the degree of deviation between each predicted position information and each initial position information, and between each predicted inclination direction and each initial inclination direction, of the sample picture.
After at least one predicted text box corresponding to the sample picture, the predicted position information of each predicted text box in the sample picture and the predicted inclination direction of each predicted text box are determined through the initial text detection model, the loss value of the initial text detection model can be calculated by combining each initial position information, each predicted position information, each initial inclination direction and each predicted inclination direction, as described in detail in the following implementations.
In a specific implementation of the present application, the above sub-step S3 may include:
substep S31: and calculating a position loss value according to the initial position information and the predicted position information.
In this embodiment, the position loss value refers to a loss value calculated from the predicted position information and the initial position information. Specifically, the position loss value may include a pixel classification loss value and a connectivity classification loss value, and it may be calculated as follows: a cross-entropy loss function between the initial position information and the predicted position information is computed, yielding the position loss value.
After the predicted position information of each predicted text box is obtained, a position loss value can be calculated by combining each predicted position information and each initial position information.
Substep S32: and calculating a tilt loss value according to each initial tilt direction and each predicted tilt direction.
The tilt loss value refers to a loss value calculated by predicting the tilt direction and the initial tilt direction.
After the predicted tilt directions of the respective predicted text boxes are obtained, tilt loss values may be calculated in combination with the respective predicted tilt directions and the respective initial tilt directions, and specifically, tilt loss values may be obtained by calculating a cross entropy loss function between the initial tilt directions and the predicted tilt directions.
Substep S33: and calculating the loss value of the initial text detection model according to the position loss value, the position weight, the inclination loss value and the inclination weight.
In this embodiment, the position loss value may include a connected classification loss value and a pixel classification loss value, i.e., the position loss value is determined by both the connected classification loss value and the pixel classification loss value.
Pixels are divided into positive and negative pixels: all pixels falling within a text region are labeled positive pixels, all pixels falling outside the text regions are labeled negative pixels, and pixels in regions where multiple texts overlap are also labeled negative pixels.
The connectivity relationship is defined between a pixel and each of its eight neighboring pixels. For a given pixel and one of its eight neighbors: if both pixels are positive, the connectivity between them is positive; if one pixel is positive and the other is negative, the connectivity between them is positive; and if both pixels are negative, the connectivity between them is negative.
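The positive/negative pixel labeling above can be sketched in a few lines. Below is a minimal illustration assuming each text instance is represented as a set of pixel coordinates; this representation and the name `label_pixels` are illustrative assumptions, not the patent's data format:

```python
from collections import Counter

def label_pixels(height, width, text_regions):
    """Label each pixel 1 (positive) or 0 (negative).

    text_regions: list of pixel-coordinate sets, one set per text instance.
    Pixels inside exactly one text region are positive; pixels outside all
    regions, and pixels where multiple text instances overlap, are negative.
    """
    counts = Counter(p for region in text_regions for p in region)
    labels = [[0] * width for _ in range(height)]
    for (row, col), n in counts.items():
        if n == 1:  # inside exactly one text region -> positive pixel
            labels[row][col] = 1
    return labels
```

Pixels covered by exactly one text instance come out positive; background pixels and pixels in multi-text overlap regions come out negative, matching the labeling rule described above.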
In this embodiment, the connectivity classification loss value and the pixel classification loss value may be calculated first, and the position loss value then calculated by combining the two. The connectivity classification loss value is obtained through a cross-entropy function based on the connectivity classification between pixels in the labeled data and the connectivity classification result predicted by the network. The pixel classification loss value is calculated analogously based on the pixel classification result.
After obtaining the position loss value and the tilt loss value, the loss value may be calculated by combining the position weight corresponding to the position loss value and the tilt weight corresponding to the tilt loss value, as shown in the following formula (1):
L = λ1·L_pixel + λ2·L_link + λ3·L_direction (1)
where L is the loss value, L_direction is the tilt loss value, L_link is the connectivity classification loss value, L_pixel is the pixel classification (positive or negative) loss value, λ3 is the tilt weight, λ2 is the connectivity classification weight, and λ1 is the pixel classification weight.
Based on the pixel classification results and the connectivity classification results (0 or 1), groups of positive pixels that have connectivity relationships between them are formed into text boxes using a union-find algorithm. The connectivity weight is the weight corresponding to the connectivity classification loss value.
L_pixel is calculated by a cross-entropy loss function based on the labeled data and the network prediction result X1; L_link is calculated by a cross-entropy loss function based on the labeled data and the network prediction result X2; L_direction is calculated by a cross-entropy loss function based on the labeled data and the network prediction result X3. λ1, λ2 and λ3 are the weight parameters of the respective loss functions and can be adjusted according to the actual training effect.
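As a sanity check of formula (1), the weighted combination can be sketched in pure Python. The (probabilities, labels) input format below is an assumed simplification of the network outputs X1, X2 and X3, not the patent's actual tensor layout:

```python
import math

def cross_entropy(probs, labels):
    """Mean cross-entropy: probs[i] is a class-probability list for sample i,
    labels[i] is the index of the true class for sample i."""
    eps = 1e-12
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)

def detection_loss(pixel, link, direction, lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted sum of formula (1): L = λ1·L_pixel + λ2·L_link + λ3·L_direction.

    Each argument is a (probs, labels) pair for the corresponding prediction:
    pixel classification (positive/negative), connectivity classification,
    and direction classification over the four 90° intervals.
    """
    l_pixel = cross_entropy(*pixel)
    l_link = cross_entropy(*link)
    l_direction = cross_entropy(*direction)
    return lam1 * l_pixel + lam2 * l_link + lam3 * l_direction
```

With equal weights, perfect pixel and link predictions, and a uniform direction prediction over the four intervals, the loss reduces to the direction term's log 4, which is easy to verify by hand.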
After calculating the loss value of the initial text detection model from the initial position information, the predicted position information, the initial tilt direction, and the predicted tilt direction, the sub-step S4 is executed.
Substep S4: and under the condition that the loss value is in a preset range, taking the trained initial text detection model as the text detection model.
The preset range can be preset by a developer according to the actual application scene and the actual requirement, such as 3-5, and the specific numerical values of the preset range are not limited in the embodiment of the application.
When the loss value is within the preset range, the initial text detection model can be considered trained, such that it meets the preset requirement when identifying text pictures; at this point, the trained initial text detection model can be used as the final text detection model. For example, if the preset range is 3-5, then when the loss value falls within 3-5, training of the initial text detection model is considered complete, and the trained initial text detection model can be used as the final text detection model. When the loss value falls outside the range of 3-5, the initial text detection model is considered not yet successfully trained, and training samples can be added to continue training the initial text detection model.
After determining the pre-trained text detection model, step 202 is performed.
Step 202: and acquiring a picture to be identified containing the text information.
The picture to be identified is a picture containing text information and used for text identification.
In some examples, the picture to be identified may be a picture randomly selected from the internet, for example, a picture selected from a website and including text information.
In some examples, the picture to be identified may be a picture entered by a user, for example, a picture entered when the user searches for desired information using the picture, as described in connection with figs. 4e and 4f. Application background: the user specifies the word to be translated through a pen tip; since the relative positions of the pen tip and the camera are fixed, the position of the pen tip in the image is fixed. The selection method is: 1. with the pen-tip position as the center of the bottom edge, set a rectangular area of fixed size (as shown in fig. 4f) as a virtual pen tip according to the size of the text in the image; 2. respectively calculate the overlap area between this rectangular area and each detected text box; 3. find the text box for which the overlap area accounts for the largest proportion of the pen-tip rectangle's area, and select that text box as the word to be translated designated by the user.
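The selection steps above can be sketched as follows, assuming the detected text boxes have been reduced to axis-aligned rectangles given as (x1, y1, x2, y2); this box format and the function names are illustrative assumptions:

```python
def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def pick_word_box(pen_rect, text_boxes):
    """Return the detected text box whose overlap with the fixed pen-tip
    rectangle covers the largest fraction of that rectangle's area."""
    pen_area = (pen_rect[2] - pen_rect[0]) * (pen_rect[3] - pen_rect[1])
    return max(text_boxes, key=lambda box: overlap_area(pen_rect, box) / pen_area)
```

The ratio is taken against the fixed pen-tip rectangle's area, so the chosen box is the one that best covers the virtual pen tip, matching step 3 above.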
It will be appreciated that the above examples are only examples listed for better understanding of the technical solution of the embodiments of the present application, and are not to be construed as the only limitation of the embodiments of the present application.
After obtaining the picture to be recognized containing the text information, step 203 is performed.
Step 203: and calling the classification result acquisition layer to process the picture to be identified, and acquiring positive pixels on the picture to be identified.
In this embodiment, the initial text detection model may include a classification result acquisition layer and an oblique direction acquisition layer, where the classification result acquisition layer may be used to identify a pixel classification result and a connected classification result in the acquired picture, and the oblique direction acquisition layer may be used to identify an oblique direction of text in the acquired picture.
The network obtains, for each pixel on the picture, a predicted value of whether it is a positive pixel (i.e., part of the text) and predicted values of whether neighboring pixels are connected. Based on these values (0 or 1), a union-find algorithm finally groups the positive pixels that are connected to each other, and each group forms a text box.
Step 204: and determining at least one text box in the picture to be identified according to the positive pixel.
After the predicted values of the positive pixels on the picture to be identified are obtained, they can be combined with the predicted values of whether the pixels are connected to form groups of positive pixels that are connected to each other, and each such group forms a text box.
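This grouping can be sketched with a union-find (disjoint-set) structure; the mask and link encodings below are assumptions made for illustration, not the patent's data format:

```python
def group_positive_pixels(pixel_mask, positive_links):
    """Group positive pixels into text-box components with union-find.

    pixel_mask: 2D list of 0/1 pixel-classification predictions.
    positive_links: set of frozenset({p, q}) pairs of neighboring pixels
    whose connectivity prediction is positive.
    Returns a list of sets of pixel coordinates, one set per text box.
    """
    positives = [(r, c) for r, row in enumerate(pixel_mask)
                 for c, v in enumerate(row) if v == 1]
    parent = {p: p for p in positives}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for pair in positive_links:
        p, q = tuple(pair)
        if p in parent and q in parent:
            parent[find(p)] = find(q)  # union the two components

    groups = {}
    for p in positives:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())
```

Each returned set is one connected group of positive pixels, i.e., one candidate text box.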
Step 205: and calling the inclination direction acquisition layer to process the picture to be identified, and determining the inclination direction to be confirmed of the at least one text box and an inclination threshold value corresponding to the inclination direction to be confirmed.
The tilt direction to be confirmed refers to the tilt direction of at least one text box obtained by identifying pixels in the at least one text box through a second neural network layer in the text detection model.
The inclination threshold value refers to a threshold value corresponding to an inclination direction to be confirmed, which corresponds to a text pixel in at least one text box obtained through the second neural network layer recognition.
And processing the picture to be identified by calling the inclination direction acquisition layer, so that the inclination direction to be confirmed of the text in at least one text box and an inclination threshold value corresponding to the inclination direction to be confirmed can be determined.
Step 206: and determining the inclination direction corresponding to the at least one text box according to the maximum inclination threshold value in the inclination threshold values.
After the tilt thresholds corresponding to the candidate tilt directions of the at least one text box are obtained, their magnitudes may be compared, and the tilt direction of the at least one text box determined according to the comparison result. As shown in fig. 4c, based on X3, the largest of the probability values corresponding to the four text-direction intervals is extracted, and the direction interval corresponding to this largest probability value is taken as the text direction of the text box to which the pixel belongs; that is, the text direction corresponding to the pixel is classified into one of the four direction intervals.
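Extracting the maximum of the four interval probabilities amounts to an argmax. A minimal sketch follows; the probability-list input is an assumption about how the direction output X3 is exposed:

```python
def classify_direction(direction_probs):
    """Pick the 90° direction interval with the largest probability.

    direction_probs: four probabilities, one per interval
    [0°, 90°), [90°, 180°), [180°, 270°), [270°, 360°).
    Returns the (low, high) degree bounds of the winning interval.
    """
    intervals = [(0, 90), (90, 180), (180, 270), (270, 360)]
    best = max(range(4), key=lambda i: direction_probs[i])
    return intervals[best]
```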
Step 207: and when the inclination angle of the text box is determined to be between 90 degrees and 180 degrees according to the inclination direction, rotating the text box by 90 degrees anticlockwise to obtain the corrected text box.
When the inclination angle of the text box is determined to be between 90 and 180 degrees according to the inclination direction, the text box is rotated 90 degrees anticlockwise to obtain a corrected text box, and specifically, if the text direction is between 90 and 180 degrees, the picture is rotated 90 degrees anticlockwise by taking the upper left corner point as the rotation center to obtain the corrected text box.
Step 208: and when the inclination angle of the text box is determined to be between 180 and 270 degrees according to the inclination direction, rotating the text box by 180 degrees anticlockwise to obtain the corrected text box.
When the inclination angle of the text box is determined to be between 180 and 270 degrees according to the inclination direction, the text box is rotated 180 degrees anticlockwise to obtain a corrected text box, and specifically, if the text direction is between 180 and 270 degrees, the picture is rotated 180 degrees anticlockwise by taking the upper left corner point as the rotation center to obtain the corrected text box.
Step 209: and when the inclination angle of the text box is determined to be 270-360 degrees according to the inclination direction, rotating the text box by 270 degrees anticlockwise to obtain the corrected text box.
When the inclination angle of the text box is determined to be between 270 and 360 degrees according to the inclination direction, the text box is rotated 270 degrees anticlockwise, so that a corrected text box is obtained, and specifically, if the text direction is between 270 and 360 degrees, the picture is rotated 270 degrees anticlockwise with the upper left corner point as the rotation center.
Of course, when it is determined that the tilt angle of the text box is between 0 and 90 degrees according to the tilt direction, correction processing for the text box is not required.
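Steps 207 to 209 reduce to rotating the cropped box counterclockwise by 90° times the index of the direction interval. A pure-Python sketch on a grid of pixel values follows; the list-of-rows representation is an assumption for illustration:

```python
def rot90_ccw(grid):
    """Rotate a 2D grid (list of rows) 90° counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]

def correct_text_box(crop, interval_index):
    """Apply the coarse correction of steps 207-209: rotate the cropped
    text box counterclockwise by 0°, 90°, 180° or 270° according to which
    90° direction interval (index 0..3) its tilt angle falls in."""
    for _ in range(interval_index):
        crop = rot90_ccw(crop)
    return crop
```

Interval index 0, i.e., a tilt angle between 0 and 90 degrees, needs no rotation, matching the note above.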
Step 210: and inputting the corrected text box into a text recognition model, and determining text information contained in the corrected text box through the text recognition model.
After the text box is corrected to obtain a corrected text box, OpenCV's minAreaRect may be used to obtain the rectangular text box corresponding to the set of pixels, where the text box is represented by a center point coordinate, a width W, a height H and a tilt angle a (as shown in fig. 4d). The picture is rotated counterclockwise by angle a with the center point as the rotation center, so that the text in the text box is rotated to horizontal. Finally, the picture portion within the corrected text box is extracted and sent to a subsequent text recognition model to recognize the text content. That is, two models are adopted in this embodiment: one is a text detection model and the other is a text recognition model; the text detection model can be used to detect and correct oblique text in a picture, and the text recognition model can be used to recognize the text information in the corrected text box.
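The fine deskew rotates the picture counterclockwise by angle a about the box center; the underlying point rotation can be sketched without OpenCV. This is a pure-math illustration in a y-up coordinate system (OpenCV's image coordinates are y-down, so the signs there differ):

```python
import math

def rotate_point(point, center, angle_deg):
    """Rotate a point counterclockwise by angle_deg about center
    (standard y-up coordinates)."""
    rad = math.radians(angle_deg)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + x * math.cos(rad) - y * math.sin(rad),
            center[1] + x * math.sin(rad) + y * math.cos(rad))
```

Applying this to each corner of the minimum-area rectangle (or, equivalently, to every pixel of the crop) brings the text to horizontal before it is passed to the text recognition model.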
After the at least one piece of text information within the corrected text boxes is obtained, these pieces may be combined to determine the target text information contained in the picture to be identified; specifically, it may be determined which texts are connected, and those texts are then merged to form the final target text information.
According to the text recognition method provided by the embodiment of the application, the picture to be recognized containing text information is obtained, the picture to be recognized is recognized through a pre-trained text detection model, at least one text box containing text in the picture to be recognized and the corresponding inclination direction of each text box are determined, the text direction of each text box is corrected according to the inclination direction, a corrected text box with corrected text direction is obtained, and the text information in the corrected text box is recognized. According to the embodiment of the application, the text direction classification network and the text box position detection network are fused, and then the accurate text content direction is finally obtained according to the large angle classification result and the text box position detection result, so that the picture text recognition accuracy is improved.
Referring to fig. 5, a schematic structural diagram of a text recognition device provided by an embodiment of the present application is shown, where the text recognition device may specifically include the following modules:
a to-be-identified picture obtaining module 310, configured to obtain a to-be-identified picture including text information;
The picture to be identified identifying module 320 is configured to identify the picture to be identified by using a pre-trained text detection model, and determine at least one text box containing text in the picture to be identified and a tilting direction corresponding to each text box;
A corrected text box obtaining module 330, configured to correct the text direction of each text box according to the oblique direction, so as to obtain a corrected text box after the text direction correction;
a text information determination module 340, configured to identify text information in the corrected text box.
According to the text recognition device provided by the embodiment of the application, the picture to be recognized containing text information is obtained, the picture to be recognized is recognized through the pre-trained text detection model, at least one text box containing text in the picture to be recognized and the corresponding inclined direction of each text box are determined, the text direction of each text box is corrected according to the inclined direction, the corrected text box with corrected text direction is obtained, and the text information in the corrected text box is recognized. According to the embodiment of the application, the text direction classification network and the text box position detection network are fused, and then the accurate text content direction is finally obtained according to the large angle classification result and the text box position detection result, so that the picture text recognition accuracy is improved.
Referring to fig. 6, a schematic structural diagram of a text recognition device provided by an embodiment of the present application is shown, where the text recognition device may specifically include the following modules:
A text detection model determination module 410 for determining the text detection model trained in advance;
the picture to be identified acquisition module 420 is configured to acquire a picture to be identified including text information;
The picture to be identified identifying module 430 is configured to identify the picture to be identified by using a pre-trained text detection model, and determine at least one text box containing text in the picture to be identified and a tilting direction corresponding to each text box;
A corrected text box obtaining module 440, configured to correct the text direction of each text box according to the oblique direction, so as to obtain a corrected text box after the text direction correction;
a text information determination module 450, configured to identify text information in the corrected text box.
Optionally, the text detection model determining module 410 includes:
The sample picture acquisition unit is used for acquiring a sample picture; the sample picture comprises at least one initial text box marked in advance, initial position information of each initial text box in the sample picture and initial inclination directions of each initial text box;
The prediction text box determining unit is used for inputting the sample pictures into an initial text detection model in sequence to train the initial text detection model, determining at least one prediction text box corresponding to the sample pictures, and the prediction position information of each prediction text box in the sample pictures and the prediction inclination direction of each prediction text box;
A loss value calculation unit, configured to calculate a loss value of the initial text detection model according to each of the initial position information, each of the predicted position information, each of the initial tilt directions, and each of the predicted tilt directions;
the text detection model acquisition unit is used for taking the trained initial text detection model as the text detection model under the condition that the loss value is in a preset range;
Optionally, the loss value calculation unit includes:
a position loss value calculating subunit, configured to calculate a position loss value according to each initial position information and each predicted position information;
A tilt loss value calculating subunit, configured to calculate a tilt loss value according to each initial tilt direction and each predicted tilt direction;
And the loss value calculating subunit is used for calculating the loss value of the initial text detection model according to the position loss value, the position weight, the inclination loss value and the inclination weight.
Optionally, the text detection model includes: a classification result obtaining layer and an oblique direction obtaining layer, the to-be-identified picture identifying module 430 includes:
a positive pixel obtaining unit 431, configured to invoke the classification result obtaining layer to process the picture to be identified, and obtain a pixel classification result and a connected classification result on the picture to be identified;
A text box determining unit 432, configured to determine at least one text box in the picture to be identified according to the pixel classification result and the connected classification result;
The tilt threshold determining unit 433 is configured to invoke the tilt direction acquiring layer to process the to-be-identified picture, determine a to-be-identified tilt direction of the text in the at least one text box, and a tilt threshold corresponding to the to-be-identified tilt direction;
and the tilt direction determining unit 434 is configured to determine a tilt direction corresponding to the at least one text box according to a maximum tilt threshold value of the tilt threshold values.
Optionally, the corrected text box acquisition module 440 includes:
A first correction box acquisition unit 441 configured to, when it is determined that the tilt angle of the text box is between 90 ° and 180 ° according to the tilt direction, rotate the text box counterclockwise by 90 °, to obtain the corrected text box;
a second correction frame acquiring unit 442 for rotating the text frame counterclockwise by 180 ° to obtain the correction text frame when it is determined that the tilt angle of the text frame is between 180 ° and 270 ° according to the tilt direction;
And a third correction box acquiring unit 443 for rotating the text box counterclockwise by 270 ° to obtain the correction text box when the tilt angle of the text box is determined to be between 270 ° and 360 ° according to the tilt direction.
Optionally, the text information determining module 450 includes:
a text information determining unit 451 for inputting the corrected text box to a text recognition model, and determining the text information contained in the corrected text box by the text recognition model.
According to the text recognition device provided by the embodiment of the application, the picture to be recognized containing text information is obtained, the picture to be recognized is recognized through the pre-trained text detection model, at least one text box containing text in the picture to be recognized and the corresponding inclination direction of each text box are determined, the text direction of each text box is corrected according to the inclination direction, the corrected text box with corrected text direction is obtained, and the text information in the corrected text box is recognized. According to the embodiment of the application, the text direction classification network and the text box position detection network are fused, and then the accurate text content direction is finally obtained according to the large angle classification result and the text box position detection result, so that the picture text recognition accuracy is improved.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
Additionally, the embodiment of the application also provides electronic equipment, which comprises: a processor, a memory, and a computer program stored on the memory and executable on the processor, the processor implementing the text recognition method of any of the above when the program is executed.
Embodiments of the present application also provide a non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the text recognition method described in any of the above embodiments.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.
The foregoing has outlined a detailed description of a text recognition method, a text recognition device, an electronic device and a computer readable storage medium, wherein specific examples are provided herein to illustrate the principles and embodiments of the present application and to help understand the method and core idea of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present application, the present description should not be construed as limiting the present application in view of the above.

Claims (10)

1. A method of text recognition, comprising:
acquiring a picture to be identified containing text information;
Identifying the picture to be identified through a pre-trained text detection model, and determining at least one text box containing text in the picture to be identified and the corresponding inclination direction of each text box; wherein this comprises: acquiring a predicted value of a positive pixel in the picture to be identified and a connectivity predicted value among pixels, and determining the at least one text box based on the predicted value of the pixel and the connectivity predicted value; the positive pixels are pixels falling in a text region;
correcting the text direction of each text box according to the inclined direction to obtain corrected text boxes with corrected text directions;
Identifying text information in the corrected text box;
Wherein determining the tilt direction comprises: identifying pixels in the text box, determining the tilt direction to be confirmed of the text box, and determining a tilt threshold value corresponding to the text pixels in the text box in the tilt direction to be confirmed; and taking the inclination direction to be confirmed corresponding to the maximum value of the inclination threshold value as the inclination direction corresponding to the text box.
2. The method of claim 1, further comprising, prior to said capturing the picture to be identified comprising text information:
determining a pre-trained text detection model;
said determining the pre-trained text detection model comprises:
Acquiring a sample picture; the sample picture comprises at least one initial text box marked in advance, initial position information of each initial text box in the sample picture and initial inclination directions of each initial text box;
Sequentially inputting the sample pictures into an initial text detection model to train the initial text detection model, and determining at least one predictive text box corresponding to the sample pictures, the predictive position information of each predictive text box in the sample pictures and the predictive inclination direction of each predictive text box;
Calculating a loss value of the initial text detection model according to the initial position information, the predicted position information, the initial inclination direction and the predicted inclination direction;
And under the condition that the loss value is in a preset range, taking the trained initial text detection model as the text detection model.
3. The method according to claim 2, wherein calculating a loss value of the initial text detection model based on each of the initial position information, each of the predicted position information, each of the initial tilt directions, and each of the predicted tilt directions includes:
calculating a position loss value according to the initial position information and the predicted position information;
calculating a tilt loss value according to each initial tilt direction and each predicted tilt direction;
and calculating the loss value of the initial text detection model according to the position loss value, a position weight, the tilt loss value, and a tilt weight.
4. The method of claim 3, wherein the text detection model comprises a classification result acquisition layer and a tilt direction acquisition layer, and wherein identifying the picture to be identified through the pre-trained text detection model, and determining at least one text box containing text in the picture to be identified and the tilt direction corresponding to each text box, comprises:
invoking the classification result acquisition layer to process the picture to be identified, and obtaining a pixel classification result and a connection classification result for the picture to be identified;
determining at least one text box in the picture to be identified according to the pixel classification result and the connection classification result;
invoking the tilt direction acquisition layer to process the picture to be identified, and determining candidate tilt directions of the text in the at least one text box and a tilt threshold value corresponding to each candidate tilt direction;
and determining the tilt direction corresponding to the at least one text box according to the maximum of the tilt threshold values.
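Grouping positive pixels into text boxes via pixel and connection classifications, as claim 4 describes, amounts to a connected-components pass that only merges neighbours joined by a positive connection (in the spirit of PixelLink-style detectors). A minimal sketch using 4-connectivity; the input encodings and thresholding are assumptions:

```python
from collections import deque

def group_text_pixels(pixel_pos, links):
    """Group positive pixels into connected components.

    pixel_pos: 2-D list of booleans, True where the pixel classification
        result is positive (the pixel falls in a text region).
    links: dict mapping (r, c, dr, dc) -> bool, True where the connection
        classification between pixel (r, c) and its neighbour
        (r + dr, c + dc) is positive.
    Returns a list of components, each a list of (r, c) pixels; the
    bounding box of each component would yield one text box.
    """
    h, w = len(pixel_pos), len(pixel_pos[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if not pixel_pos[r][c] or seen[r][c]:
                continue
            comp, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                comp.append((cr, cc))
                for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and pixel_pos[nr][nc] and not seen[nr][nc]
                            and links.get((cr, cc, dr, dc), False)):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(comp)
    return components
```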
5. The method of claim 1, wherein performing correction processing on the text direction of each text box according to the tilt direction to obtain a corrected text box comprises:
when the tilt angle of the text box is determined to be between 90 degrees and 180 degrees according to the tilt direction, rotating the text box by 90 degrees anticlockwise to obtain the corrected text box;
when the tilt angle of the text box is determined to be between 180 degrees and 270 degrees according to the tilt direction, rotating the text box by 180 degrees anticlockwise to obtain the corrected text box;
and when the tilt angle of the text box is determined to be between 270 degrees and 360 degrees according to the tilt direction, rotating the text box by 270 degrees anticlockwise to obtain the corrected text box.
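The quadrant-based correction in claim 5 always rotates by a multiple of 90 degrees anticlockwise. A sketch operating on a pixel matrix — the angle-to-quadrant mapping follows the claim, while the matrix representation of the text box is an assumption:

```python
def correct_text_box(box_pixels, tilt_angle):
    """Rotate a text box anticlockwise by 90, 180, or 270 degrees
    depending on which quadrant its tilt angle falls in.

    box_pixels: 2-D list (rows of pixels); tilt_angle in degrees, [0, 360).
    """
    if 90 <= tilt_angle < 180:
        quarter_turns = 1
    elif 180 <= tilt_angle < 270:
        quarter_turns = 2
    elif 270 <= tilt_angle < 360:
        quarter_turns = 3
    else:
        quarter_turns = 0  # already upright, no correction needed

    rotated = box_pixels
    for _ in range(quarter_turns):
        # one 90-degree anticlockwise turn: columns become rows, reversed
        rotated = [list(row) for row in zip(*rotated)][::-1]
    return rotated
```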
6. The method of claim 1, wherein the identifying text information in the corrected text box comprises:
and inputting the corrected text box into a text recognition model, and determining text information contained in the corrected text box through the text recognition model.
7. A text recognition device, comprising:
The picture to be identified acquisition module is used for acquiring a picture to be identified containing text information;
The picture to be identified identifying module is used for identifying the picture to be identified through a pre-trained text detection model, and determining at least one text box containing text in the picture to be identified and the tilt direction corresponding to each text box, including: acquiring a predicted value for positive pixels in the picture to be identified and a connection predicted value between pixels, and determining the at least one text box based on the pixel predicted value and the connection predicted value, wherein the positive pixels are pixels falling in a text region;
The corrected text box acquisition module is used for performing correction processing on the text direction of each text box according to the tilt direction, to obtain a corrected text box with a corrected text direction;
A text information determining module for identifying text information in the corrected text box;
Wherein determining the tilt direction comprises: identifying the pixels in the text box and determining candidate tilt directions of the text box; determining, for each candidate tilt direction, a tilt threshold value corresponding to the text pixels in the text box; and taking the candidate tilt direction corresponding to the maximum tilt threshold value as the tilt direction of the text box.
8. The apparatus as recited in claim 7, further comprising:
A text detection model determination module for determining the pre-trained text detection model;
The text detection model determination module includes:
The sample picture acquisition unit is used for acquiring a sample picture, wherein the sample picture comprises at least one pre-labeled initial text box, initial position information of each initial text box in the sample picture, and an initial tilt direction of each initial text box;
The predicted text box determination unit is used for sequentially inputting sample pictures into an initial text detection model to train the initial text detection model, and determining at least one predicted text box corresponding to each sample picture, predicted position information of each predicted text box in the sample picture, and a predicted tilt direction of each predicted text box;
A loss value calculation unit, configured to calculate a loss value of the initial text detection model according to each piece of initial position information, each piece of predicted position information, each initial tilt direction, and each predicted tilt direction;
The text detection model acquisition unit is used for taking the trained initial text detection model as the text detection model when the loss value falls within a preset range.
9. An electronic device, comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the text recognition method of any one of claims 1 to 6 when executing the program.
10. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method of any one of claims 1 to 6.
CN202010226050.7A 2020-03-26 2020-03-26 Text recognition method, text recognition device, electronic equipment and computer readable storage medium Active CN111428717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010226050.7A CN111428717B (en) 2020-03-26 2020-03-26 Text recognition method, text recognition device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111428717A CN111428717A (en) 2020-07-17
CN111428717B true CN111428717B (en) 2024-04-26

Family

ID=71555698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010226050.7A Active CN111428717B (en) 2020-03-26 2020-03-26 Text recognition method, text recognition device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111428717B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985465A (en) * 2020-08-17 2020-11-24 中移(杭州)信息技术有限公司 Text recognition method, device, equipment and storage medium
CN111985459B (en) * 2020-09-18 2023-07-28 北京百度网讯科技有限公司 Table image correction method, apparatus, electronic device and storage medium
CN112200191B (en) * 2020-12-01 2021-07-20 北京京东尚科信息技术有限公司 Image processing method, image processing device, computing equipment and medium
CN112651399B (en) * 2020-12-30 2024-05-14 中国平安人寿保险股份有限公司 Method for detecting same-line characters in inclined image and related equipment thereof
CN112329777B (en) * 2021-01-06 2021-05-04 平安科技(深圳)有限公司 Character recognition method, device, equipment and medium based on direction detection
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN113313117B (en) * 2021-06-25 2023-07-25 北京奇艺世纪科技有限公司 Method and device for identifying text content
CN114596566B (en) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 Text recognition method and related device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1878281A (en) * 2006-07-13 2006-12-13 北京中星微电子有限公司 Method and apparatus for processing object edge in digital video image
CN110020676A (en) * 2019-03-18 2019-07-16 华南理工大学 Method for text detection, system, equipment and medium based on more receptive field depth characteristics
CN110046616A (en) * 2019-03-04 2019-07-23 北京奇艺世纪科技有限公司 Image processing model generation, image processing method, device, terminal device and storage medium
CN110490198A (en) * 2019-08-12 2019-11-22 上海眼控科技股份有限公司 Text orientation bearing calibration, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9576348B2 (en) * 2014-11-14 2017-02-21 Adobe Systems Incorporated Facilitating text identification and editing in images


Also Published As

Publication number Publication date
CN111428717A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN111428717B (en) Text recognition method, text recognition device, electronic equipment and computer readable storage medium
CN112348815B (en) Image processing method, image processing apparatus, and non-transitory storage medium
CN110390269B (en) PDF document table extraction method, device, equipment and computer readable storage medium
CN108520229B (en) Image detection method, image detection device, electronic equipment and computer readable medium
WO2022213879A1 (en) Target object detection method and apparatus, and computer device and storage medium
US20200012876A1 (en) Text detection method, storage medium, and computer device
CN107067003B (en) Region-of-interest boundary extraction method, device, equipment and computer storage medium
CN109582880B (en) Interest point information processing method, device, terminal and storage medium
CN105517679B (en) Determination of the geographic location of a user
WO2018233055A1 (en) Method and apparatus for entering policy information, computer device and storage medium
CN112115936A (en) Text recognition method and device, storage medium and electronic equipment
CN110751149A (en) Target object labeling method and device, computer equipment and storage medium
WO2023279847A1 (en) Cell position detection method and apparatus, and electronic device
CN111291661A (en) Method and equipment for identifying text content of icons in screen
WO2022002262A1 (en) Character sequence recognition method and apparatus based on computer vision, and device and medium
CN112926421B (en) Image processing method and device, electronic equipment and storage medium
CN109740487B (en) Point cloud labeling method and device, computer equipment and storage medium
CN111310758A (en) Text detection method and device, computer equipment and storage medium
CN115937003A (en) Image processing method, image processing device, terminal equipment and readable storage medium
CN112906532B (en) Image processing method and device, electronic equipment and storage medium
CN117115823A (en) Tamper identification method and device, computer equipment and storage medium
CN113936137A (en) Method, system and storage medium for removing overlapping of image type text line detection areas
CN112837404A (en) Method and device for constructing three-dimensional information of planar object
CN112307799A (en) Gesture recognition method, device, system, storage medium and equipment
CN113392825A (en) Text recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant