CN114220108A - Text recognition method, readable storage medium and text recognition device for natural scene - Google Patents
- Publication number
- CN114220108A (application CN202111565107.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- text region
- angle
- character
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a text recognition method for a natural scene, which comprises the following steps: acquiring a text image to be recognized, and performing text region detection on the text image to be recognized to obtain a first text region of a rectangular frame; performing perspective transformation on the first text region, and rotating the first text region after the perspective transformation to obtain a second text region, wherein the long side of a rectangular frame of the second text region is parallel to the X axis; training based on a deep learning model to obtain an angle detection model, detecting the angle of the characters in the second text region by using the angle detection model, and adjusting the character angle of the second text region of the rectangular frame according to the angle detected by the angle detection model to obtain a third text region, wherein the character angle in the third text region is 0 degree; and performing single-character segmentation and single-character recognition on the characters in the third text region. The present application also provides a computer-readable storage medium and a text recognition apparatus.
Description
Technical Field
The present application relates to the field of text recognition technologies, and in particular, to a text recognition method, a readable storage medium, and a text recognition apparatus for a natural scene.
Background
Under the current trend of scientific and technological development, recognizing characters from images has become common. Such technology can be broadly divided into optical character recognition, character recognition in natural scenes, and the like. Optical Character Recognition (OCR) is mainly oriented to high-definition document images; techniques of this kind assume that the input image has a clean background, simple fonts, and orderly arranged characters. When these assumptions hold, a trained network model can achieve high recognition accuracy, and the training process is fast.
Scene Text Recognition (STR) is mainly oriented to natural scene images containing characters. In practice, however, the characters in natural-scene text often appear at varying angles and with other irregular attributes, which makes them difficult to recognize.
Disclosure of Invention
In view of the above prior art, the technical problem to be solved by the present application is to provide a text recognition method, a readable storage medium, and a text recognition device for a natural scene, which help improve the recognition efficiency for texts containing characters at different angles.
In order to solve the above technical problem, the present application provides a text recognition method for a natural scene, including:
acquiring a text image to be recognized, and performing text region detection on the text image to be recognized to obtain a first text region of a rectangular frame;
performing perspective transformation on the first text region, and rotating the first text region after the perspective transformation to obtain a second text region, so that the long side of a rectangular frame of the second text region is parallel to the X axis;
training based on a deep learning model to obtain an angle detection model, detecting the angle of the characters in the second text region by using the angle detection model, and adjusting the character angle of the second text region of the rectangular frame according to the angle detected by the angle detection model to obtain a third text region, so that the included angle of the characters in the third text region is 0 degree;
performing single character segmentation and single character recognition on the characters in the third text region;
wherein the X axis and the Y axis are mutually perpendicular and form an image coordinate system, and the character angle is the included angle between the characters and the Y axis.
In one possible implementation manner, the step of rotating the perspective-transformed first text region to obtain the second text region includes:
judging whether the ratio of the length of the rectangular frame of the first text region along the Y axis to its length along the X axis is greater than 1.5;
if so, rotating the first text region of the rectangular frame counterclockwise by 90 degrees;
otherwise, rotating the first text region of the rectangular frame by 0 degrees, i.e., leaving it unrotated.
In one possible implementation, the step of obtaining the angle detection model based on deep learning model training includes:
cropping, from natural scenes, rectangular-frame text images in which the characters are horizontally arranged side by side and the character angle is 0 degrees, to serve as a data set;
dividing the data set into six parts, and respectively recording the six parts as first part data, second part data, third part data, fourth part data, fifth part data and sixth part data;
rotating each character of each text image in the first part of data counterclockwise by 0 degrees to obtain a first training data set; rotating each character of each text image in the second part of data counterclockwise by 90 degrees to obtain a second training data set; rotating each character of each text image in the third part of data counterclockwise by 180 degrees to obtain a third training data set; rotating each character of each text image in the fourth part of data counterclockwise by 270 degrees to obtain a fourth training data set; rotating each character of each text image in the fifth part of data counterclockwise by 45 degrees to obtain a fifth training data set; and rotating each character of each text image in the sixth part of data clockwise by 45 degrees to obtain a sixth training data set;
extracting character angle features of the first through sixth training data sets relative to the text images by using the feature layer of a ShuffleNet V2 network model to generate a feature map, and performing learning training based on the ShuffleNet V2 network model until it converges, so as to obtain the angle detection model.
In one possible implementation, the number of text images in the first training data set, the second training data set, the third training data set, the fourth training data set, the fifth training data set, and the sixth training data set is set to be the same.
In a possible implementation manner, the step of performing text angle adjustment on the second text region of the rectangular frame according to the angle detected by the angle detection model to obtain the third text region includes:
if the angle of the characters in the second text region detected by the angle detection model is 0 degree, maintaining the angle of the characters in the second text region unchanged;
if the character angle in the second text region detected by the angle detection model is 90 degrees, rotating the second text region counterclockwise by 270 degrees;
if the character angle in the second text region detected by the angle detection model is 180 degrees, rotating the second text region counterclockwise by 180 degrees;
if the character angle in the second text region detected by the angle detection model is 270 degrees, rotating the second text region counterclockwise by 90 degrees;
and if the character angle in the second text region detected by the angle detection model is 45 degrees, rotating the second text region counterclockwise by 315 degrees.
In one possible implementation manner, the step of performing text region detection on the text image to be recognized to obtain a first text region of a rectangular frame includes:
continuously performing five convolution operations on the text image using a 3 x 3 convolution kernel, and performing cascade fusion on the results of the five convolution operations based on a feature map pyramid network to obtain a feature map of the text image;
predicting the feature map by using a DBNet learning network to obtain a probability map about the text;
carrying out threshold operation on the probability map to obtain a segmentation result about the text;
and extracting the contour of the segmentation result, and calculating a circumscribed rectangular frame of the contour, wherein the circumscribed rectangular frame bounds the first text region.
In one possible implementation, the step of performing single-character segmentation and single-character recognition on the characters in the third text region includes:
segmenting all single characters in the third text region and obtaining the circumscribed rectangular frame of each single character by using the YOLOv3 model;
and inputting the single characters into a single character recognition model one by one for character recognition according to the sequence of the horizontal coordinates of the top left corner vertexes of the circumscribed rectangular frames of all the single characters from small to large.
In one possible implementation, the single character recognition model is the ResNet50 learning model.
The present application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the method for recognizing text in a natural scene.
The present application further provides a text recognition apparatus comprising a memory and one or more processors, the memory being coupled to the processors; the memory is for storing computer program code comprising computer instructions which, when executed by the text recognition apparatus, cause the text recognition apparatus to perform the text recognition method for a natural scene described above.
In the text recognition method for a natural scene described above, text region detection is first performed on the text image to be recognized to obtain a first text region of a rectangular frame; the perspective-transformed first text region is rotated to obtain a second text region whose rectangular frame has its long side parallel to the X axis, i.e., a second text region in a horizontal rectangular frame; the trained angle detection model then detects the angle of the characters in the second text region, and the character angle of the second text region is adjusted according to the detected angle to obtain a third text region in which the included angle of the characters is 0 degrees, meaning the characters have no angular deviation from the Y axis and are unified into the orientation in which human eyes conventionally view text. This avoids the increased difficulty and reduced efficiency of subsequent character recognition that characters at varying angles would otherwise cause.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart of a text recognition method for a natural scene according to an embodiment of the present application;
FIG. 2 is a diagram illustrating results of a first text region obtained, perspective transformation and rotation performed on the first text region, a second text region obtained, and a third text region obtained according to an embodiment of the present application;
fig. 3 is a flowchart illustrating steps for performing single-character segmentation and single-character recognition on characters in the third text region according to an embodiment of the present application.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element.
It will be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like, as used herein, refer to an orientation or positional relationship indicated in the drawings that is solely for the purpose of facilitating the description and simplifying the description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be considered as limiting the present application.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
A text recognition method, a readable storage medium, and a text recognition apparatus for a natural scene according to embodiments of the present application will now be described with reference to the drawings.
Referring to fig. 1, a text recognition method for a natural scene provided in an embodiment of the present application includes the following steps:
step S100: acquiring a text image to be recognized, and performing text region detection on the text image to be recognized to obtain a first text region of the rectangular frame, wherein each text in a first column in fig. 2 represents a text result of performing text region detection to obtain the first text region of the rectangular frame.
Step S200: and performing perspective transformation on the first text region, and rotating the first text region after the perspective transformation to obtain a second text region, so that the long side of the rectangular frame of the second text region is parallel to the X axis.
Step S300: and training based on the deep learning model to obtain an angle detection model.
Step S400: and detecting the angle of the characters in the second text region by using the angle detection model.
Step S500: performing character angle adjustment on the second text region of the rectangular frame according to the angle detected by the angle detection model to obtain a third text region, so that the included angle of the characters in the third text region is 0 degrees.
Step S600: and performing single-character segmentation and single-character recognition on the characters in the third text region.
In the above steps, the X axis and the Y axis are perpendicular to each other and form the image coordinate system shown in fig. 2. It should be noted that the character angle is the angle between the characters and the Y axis, and can be understood as the angular deviation of the characters from the Y axis as observed from the usual visual angle of human eyes. To aid understanding: the character angle in the text region of the fifth rectangular box in the first column of text in fig. 2 is 0 degrees, and the character angle in the text region of the first rectangular box in the second column of text in fig. 2 is 0 degrees; the character angle in the text region of the fifth rectangular box in the third column of text in fig. 2 is 90 degrees; the character angle in the text region of the sixth rectangular box in the third column of text in fig. 2 is 270 degrees; the character angle in the text region of the seventh rectangular box in the third column of text in fig. 2 is 45 degrees; and the character angle in the text region of the second rectangular box in the third column of text in fig. 2 is 180 degrees.
Referring to fig. 3, in step S100, the step of performing text region detection on the text image to be recognized to obtain a first text region of a rectangular frame includes:
step S110: the text image is continuously subjected to five convolution operations using a 3 × 3 convolution kernel.
Step S120: performing cascade fusion on the result of the five times of convolution based on a feature map pyramid network (FPN) to obtain a feature map of the text image; wherein, the feature in the feature map is a feature related to the text image characteristic.
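As an illustration of step S110, the following is a minimal NumPy sketch of a single 3 x 3 "same" convolution on one channel. It is a stand-in, not the trained detector: the actual pipeline applies five learned convolutions and fuses their outputs through the FPN.

```python
import numpy as np

def conv3x3(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Single-channel 3x3 convolution (cross-correlation, as is conventional
    in deep learning) with 'same' zero padding and stride 1."""
    padded = np.pad(image, 1)
    h, w = image.shape
    out = np.empty((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            # Weighted sum of the 3x3 neighbourhood centred on pixel (i, j)
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out
```

Applying such an operation five times in succession, then fusing the five results, yields the feature map passed to step S130.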
Step S130: and predicting the feature map by using a DBNet learning network to obtain a probability map about the text.
Step S140: and performing threshold operation on the probability map to obtain a segmentation result about the text.
Step S150: extracting the contour of the segmentation result, and calculating a circumscribed rectangular frame of the contour, wherein the circumscribed rectangular frame bounds the first text region.
In one embodiment, the threshold of the threshold operation in step S140 is 0.2.
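The threshold operation of step S140 amounts to an element-wise comparison of the probability map against a fixed value; the 0.2 default below follows the embodiment above, and the function name is illustrative:

```python
import numpy as np

def binarize_probability_map(prob_map: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Turn a per-pixel text probability map into a binary segmentation mask:
    pixels above the threshold are text (1), the rest background (0)."""
    return (prob_map > threshold).astype(np.uint8)
```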
In step S200, the first text region is subjected to perspective transformation, and the result of the perspective transformation of the text in the first column in fig. 2 is the text in the second column.
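The perspective transformation maps the four corners of a skewed text quadrilateral onto an axis-aligned rectangle. The sketch below estimates the 3 x 3 homography from four point correspondences in plain NumPy; in practice a library routine (e.g. OpenCV's perspective-transform utilities) would be used instead, and the function names here are illustrative.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve the 8-unknown linear system for the homography H (with h33 = 1)
    mapping each src corner (x, y) to its dst corner (u, v)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, point):
    """Apply H to a 2-D point using homogeneous coordinates."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return (x / w, y / w)
```

Warping every pixel of the first text region with such a matrix produces the fronto-parallel view that is then rotated in the next step.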
In step S200, the step of rotating the perspective-transformed first text region to obtain the second text region includes: judging whether the ratio of the length of the rectangular frame of the first text region along the Y axis to its length along the X axis is greater than 1.5; if so, rotating the first text region of the rectangular frame counterclockwise by 90 degrees; otherwise, rotating it by 0 degrees, i.e., leaving it unrotated. The third column of text in fig. 2 is the result after the second column of text has been rotated.
It can be understood that the length of the rectangular frame of the first text region along the Y axis is its height, and its length along the X axis is its width. The rotation of the perspective-transformed first text region in step S200 is performed to obtain a horizontal rectangular frame, i.e., one in which the long side of the rectangular frame of the second text region is parallel to the X axis. If the rectangular frame is a square, either side may be designated as the long side, i.e., either the side of the first text region along the X axis or the side along the Y axis may be set as the long side.
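The rotation rule above (rotate 90 degrees counterclockwise when the frame's height-to-width ratio exceeds 1.5, otherwise leave as-is) can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def make_horizontal(region: np.ndarray, ratio_threshold: float = 1.5) -> np.ndarray:
    """Rotate a cropped text region 90 degrees counterclockwise when its
    height/width ratio exceeds the threshold, so the long side ends up
    parallel to the X axis; otherwise return it unchanged."""
    h, w = region.shape[:2]
    if h / w > ratio_threshold:
        return np.rot90(region)  # one 90-degree counterclockwise turn
    return region
```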
Step S300: the step of obtaining the angle detection model based on deep learning model training comprises the following steps:
cropping, from natural scenes, rectangular-frame text images in which the characters are horizontally arranged side by side and the character angle is 0 degrees, to serve as a data set;
dividing the data set into six parts, and respectively recording the six parts as first part data, second part data, third part data, fourth part data, fifth part data and sixth part data;
rotating each character of each text image in the first part of data counterclockwise by 0 degrees to obtain a first training data set; rotating each character of each text image in the second part of data counterclockwise by 90 degrees to obtain a second training data set; rotating each character of each text image in the third part of data counterclockwise by 180 degrees to obtain a third training data set; rotating each character of each text image in the fourth part of data counterclockwise by 270 degrees to obtain a fourth training data set; rotating each character of each text image in the fifth part of data counterclockwise by 45 degrees to obtain a fifth training data set; and rotating each character of each text image in the sixth part of data clockwise by 45 degrees to obtain a sixth training data set;
extracting character angle features of the first through sixth training data sets relative to the text images by using the feature layer of a ShuffleNet V2 network model to generate a feature map, and performing learning training based on the ShuffleNet V2 network model until it converges, so as to obtain the angle detection model. The ShuffleNet V2 network model is a neural network model.
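The data-set construction of step S300 can be sketched as a six-way split with one rotation class per part. The names (`ANGLE_CLASSES`, `build_training_sets`) and the use of -45 to denote the sixth part's 45-degree clockwise rotation are illustrative assumptions, not from the patent text:

```python
import numpy as np

# Label -> counterclockwise rotation applied when synthesizing training data
# (-45 stands for the 45-degree clockwise rotation of the sixth part).
ANGLE_CLASSES = [0, 90, 180, 270, 45, -45]

def build_training_sets(images):
    """Split upright text images into six equal parts and pair each part
    with the rotation class applied to it, mirroring step S300."""
    parts = np.array_split(images, len(ANGLE_CLASSES))
    return [(angle, list(part)) for angle, part in zip(ANGLE_CLASSES, parts)]
```

The six resulting (angle, images) pairs then feed the ShuffleNet V2 classifier, which learns to predict the angle label from the image.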
Further, in order to improve the accuracy of the angle detection model, the number of text images in the first through sixth training data sets is set to be the same, and negative samples may be added to each of the six training data sets.
With further reference to fig. 1, in step S500, the step of performing a text angle adjustment on the second text region of the rectangular frame according to the angle detected by the angle detection model to obtain a third text region includes:
step S510: if the angle of the characters in the second text region detected by the angle detection model is 0 degree, maintaining the angle of the characters in the second text region unchanged;
step S520: if the character angle in the second text region detected by the angle detection model is 90 degrees, rotating the second text region counterclockwise by 270 degrees;
step S530: if the character angle in the second text region detected by the angle detection model is 180 degrees, rotating the second text region by 180 degrees anticlockwise;
step S540: if the character angle in the second text region detected by the angle detection model is 270 degrees, rotating the second text region by 90 degrees anticlockwise;
step S550: and if the character angle in the second text region detected by the angle detection model is 45 degrees, rotating the second text region by 215 degrees anticlockwise.
It should be noted that an included angle of 0 degrees can be understood as follows: when a person views the characters from the usual visual angle, the characters are upright, with no angular deviation in the vertical direction. For example, the characters in the fourth column of text in fig. 2 form a 0-degree angle with the Y-axis direction; observed from the usual visual angle of human eyes, they are upright and have no angular deviation in the vertical direction.
In step S500, the character angle of the second text region of the rectangular frame is adjusted according to the angle detected by the angle detection model to obtain a third text region in which the characters are parallel to the Y axis, i.e., have no angular deviation in the vertical direction. The character angles in the third text region are thus unified into the orientation in which human eyes conventionally view text, which facilitates subsequent single-character segmentation and single-character recognition and reduces their difficulty. It can be understood that when characters appear at various angles, recognition inevitably becomes harder and less efficient, because the characters in commonly used character recognition libraries are upright.
The fourth column of texts in fig. 2 is a result of performing angle detection and character angle adjustment on the third column of texts by using the angle detection model.
In step S600, the step of performing single-character segmentation and single-character recognition on the characters in the third text region includes: segmenting all single characters in the third text region and obtaining the circumscribed rectangular frame of each single character by using the YOLOv3 model; and inputting the single characters one by one into a single-character recognition model for character recognition, in ascending order of the horizontal coordinate of the top-left vertex of each character's circumscribed rectangular frame.
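The reading-order rule of step S600 (sort by the x coordinate of each box's top-left vertex, then recognize one by one) can be sketched as follows; the box tuple layout and the `recognize` callback standing in for the single-character model are illustrative:

```python
def recognize_in_reading_order(boxes, recognize):
    """Sort detected single-character bounding boxes by the x coordinate of
    their top-left vertex (ascending) and feed each crop to the recognizer.
    `boxes` holds (x, y, w, h, crop) tuples; `recognize` stands in for the
    single-character model (ResNet50 in this embodiment)."""
    ordered = sorted(boxes, key=lambda b: b[0])
    return [recognize(b[4]) for b in ordered]
```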
In one embodiment of the application, the single-character recognition model is the ResNet50 learning model. The training data for the ResNet50 learning model uses the 6763 Chinese characters in the first-level and second-level character libraries of the character set GB 2312-80. In order to increase the diversity of the data set and improve the accuracy of the ResNet50 learning model, the brightness of at least some character images in the character set used for training is randomly changed to 70-130% of the original brightness; the contrast of at least some character images is randomly changed to 70-130% of the original contrast; the saturation of at least some character images is randomly changed to 70-130% of the original saturation; and the images with changed brightness, saturation and contrast are mixed into the original character set to generate new training data.
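The 70-130% augmentation above can be sketched with plain NumPy for the brightness and contrast cases (saturation jitter would additionally require a color-space conversion); the function name and seeded generator are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(image: np.ndarray, low: float = 0.7, high: float = 1.3) -> np.ndarray:
    """Randomly scale brightness and contrast to 70-130% of the original,
    as in the ResNet50 training-data augmentation described above."""
    img = image.astype(np.float32)
    img = img * rng.uniform(low, high)                 # brightness scaling
    mean = img.mean()
    img = (img - mean) * rng.uniform(low, high) + mean  # contrast scaling
    return np.clip(img, 0, 255).astype(np.uint8)
```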
In the text recognition method for a natural scene described above, text region detection is first performed on the text image to be recognized to obtain a first text region of a rectangular frame; the perspective-transformed first text region is rotated to obtain a second text region whose rectangular frame has its long side parallel to the X axis, i.e., a second text region in a horizontal rectangular frame; the trained angle detection model then detects the angle of the characters in the second text region, and the character angle of the second text region is adjusted according to the detected angle to obtain a third text region in which the included angle of the characters is 0 degrees, meaning the characters have no angular deviation from the Y axis and are unified into the orientation in which human eyes conventionally view text. This avoids the increased difficulty and reduced efficiency of subsequent character recognition that characters at varying angles would otherwise cause.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for recognizing a text in a natural scene in the foregoing embodiment is implemented.
In the present embodiment, the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Embodiments of the present application also provide a text recognition apparatus, which includes a memory and one or more processors, the memory and the processors being coupled. The memory is used to store computer program code comprising computer instructions which, when executed by the text recognition apparatus, cause the text recognition apparatus to execute the text recognition method for natural scenes of the embodiments.
In this embodiment, the processor may include one or more processing units; for example, it may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a Neural-network Processing Unit (NPU). The different processing units may be separate devices or may be integrated into one or more processors. The memory may be, but is not limited to, an electronic, magnetic, optical, or semiconductor system, apparatus, or device; in particular, but without limitation, a magnetic disk, hard disk, read-only memory, random-access memory, or erasable programmable read-only memory. The processor may be a central processing unit, but may also be another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. A text recognition method for a natural scene, comprising:
acquiring a text image to be recognized, and performing text region detection on the text image to be recognized to obtain a first text region of a rectangular frame;
performing perspective transformation on the first text region, and rotating the first text region after the perspective transformation to obtain a second text region, so that the long side of a rectangular frame of the second text region is parallel to the X axis;
training based on a deep learning model to obtain an angle detection model, detecting the angle of the characters in the second text region by using the angle detection model, and adjusting the character angle of the second text region of the rectangular frame according to the angle detected by the angle detection model to obtain a third text region, so that the included angle of the characters in the third text region is 0 degree;
performing single character segmentation and single character recognition on the characters in the third text region;
wherein, X-axis and Y-axis are mutually perpendicular to form an image coordinate system, and the character angle is the included angle between the characters and the Y-axis.
2. The method for recognizing text in natural scene according to claim 1, wherein the step of rotating the perspective-transformed first text region to obtain the second text region comprises:
judging whether the length ratio of the Y axis to the X axis of the rectangular frame of the first text area is more than 1.5;
if so, rotating the first text region of the rectangular frame by 90 degrees anticlockwise;
otherwise, the first text region of the rectangular box is not rotated (i.e., it is rotated counterclockwise by 0 degrees).
3. The natural scene text recognition method of claim 1, wherein the step of obtaining the angle detection model based on deep learning model training comprises:
capturing, from natural scenes, rectangular-box text images in which the characters are arranged horizontally side by side at a character angle of 0 degrees, as a data set;
dividing the data set into six parts, and respectively recording the six parts as first part data, second part data, third part data, fourth part data, fifth part data and sixth part data;
rotating each character of each text image in the first part of data counterclockwise by 0 degrees to obtain a first training data set; rotating each character of each text image in the second part of data counterclockwise by 90 degrees to obtain a second training data set; rotating each character of each text image in the third part of data counterclockwise by 180 degrees to obtain a third training data set; rotating each character of each text image in the fourth part of data counterclockwise by 270 degrees to obtain a fourth training data set; rotating each character of each text image in the fifth part of data counterclockwise by 45 degrees to obtain a fifth training data set; and rotating each character of each text image in the sixth part of data clockwise by 45 degrees (i.e., counterclockwise by minus 45 degrees) to obtain a sixth training data set;
extracting, with the feature layer of the ShuffleNetV2 network model, the character angle features of the first training data set, the second training data set, the third training data set, the fourth training data set, the fifth training data set, and the sixth training data set relative to the text images to generate feature maps, and performing learning training based on the ShuffleNetV2 network model until the ShuffleNetV2 network model converges, to obtain the angle detection model.
4. The method for recognizing text in natural scenes according to claim 3, wherein the number of text images in the first training data set, the second training data set, the third training data set, the fourth training data set, the fifth training data set, and the sixth training data set is set to be the same.
5. The method for recognizing texts in natural scenes according to claim 3, wherein the step of performing text angle adjustment on the second text region of the rectangular frame according to the angle detected by the angle detection model to obtain the third text region comprises:
if the angle of the characters in the second text region detected by the angle detection model is 0 degree, maintaining the angle of the characters in the second text region unchanged;
if the character angle in the second text region detected by the angle detection model is 90 degrees, rotating the second text region counterclockwise by 270 degrees;
if the character angle in the second text region detected by the angle detection model is 180 degrees, rotating the second text region by 180 degrees anticlockwise;
if the character angle in the second text region detected by the angle detection model is 270 degrees, rotating the second text region by 90 degrees anticlockwise;
and if the character angle in the second text region detected by the angle detection model is 45 degrees, rotating the second text region counterclockwise by 315 degrees.
6. The natural scene text recognition method of claim 1, wherein the step of performing text region detection on the text image to be recognized to obtain a first text region of a rectangular box comprises:
performing five successive convolution operations on the text image with a 3 × 3 convolution kernel, and cascade-fusing the results of the five convolution operations based on a feature pyramid network to obtain a feature map of the text image;
predicting the feature map by using a DBNet learning network to obtain a probability map about the text;
carrying out threshold operation on the probability map to obtain a segmentation result about the text;
and extracting the contour of the segmentation result and computing the circumscribed rectangular frame of the contour, the circumscribed rectangular frame delimiting the first text region.
7. The natural scene text recognition method of claim 1, wherein the step of performing single-character segmentation and single-character recognition on the characters in the third text region comprises:
segmenting all single characters in the third text region, together with the circumscribed rectangular frame of each single character, by using the YOLOv3 model;
and inputting the single characters one by one into the single character recognition model for character recognition, in ascending order of the horizontal coordinate of the top-left vertex of each single character's circumscribed rectangular frame.
8. The method for recognizing text of a natural scene as recited in claim 1, wherein the single character recognition model is a ResNet50 learning model.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the text recognition method of a natural scene according to any one of claims 1 to 8.
10. A text recognition apparatus comprising a memory and one or more processors, the memory and the processors being coupled; the memory for storing computer program code comprising computer instructions which, when executed by the text recognition apparatus, cause the text recognition apparatus to perform a method of text recognition of a natural scene as claimed in any one of claims 1 to 8.
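The left-to-right ordering step of claim 7 can be sketched as a one-line sort on the top-left x coordinate of each single character's circumscribed rectangle (the `(x, y, w, h)` box layout and function name are assumptions for illustration, not from the patent):

```python
def order_characters(boxes):
    """Sort single-character bounding rectangles by the horizontal coordinate
    of the top-left vertex, smallest first, so that characters are fed to the
    recognition model in left-to-right reading order."""
    return sorted(boxes, key=lambda box: box[0])  # box = (x, y, w, h)
```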
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111565107.7A CN114220108A (en) | 2021-12-20 | 2021-12-20 | Text recognition method, readable storage medium and text recognition device for natural scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111565107.7A CN114220108A (en) | 2021-12-20 | 2021-12-20 | Text recognition method, readable storage medium and text recognition device for natural scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114220108A true CN114220108A (en) | 2022-03-22 |
Family
ID=80704519
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111565107.7A Pending CN114220108A (en) | 2021-12-20 | 2021-12-20 | Text recognition method, readable storage medium and text recognition device for natural scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114220108A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113869306A (*) | 2020-06-30 | 2021-12-31 | 北京搜狗科技发展有限公司 | Text positioning method and device and electronic equipment |
CN115457559A (*) | 2022-08-19 | 2022-12-09 | 上海通办信息服务有限公司 | Method, device and equipment for intelligently correcting text and license pictures |
CN115457559B (*) | 2022-08-19 | 2024-01-16 | 上海通办信息服务有限公司 | Method, device and equipment for intelligently correcting texts and license pictures |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018103608A1 (en) | Text detection method, device and storage medium | |
JP3302147B2 (en) | Document image processing method | |
CN105868758B (en) | method and device for detecting text area in image and electronic equipment | |
US20180157927A1 (en) | Character Segmentation Method, Apparatus and Electronic Device | |
CN110032998B (en) | Method, system, device and storage medium for detecting characters of natural scene picture | |
CN108960229B (en) | Multidirectional character detection method and device | |
CN114220108A (en) | Text recognition method, readable storage medium and text recognition device for natural scene | |
JP7198350B2 (en) | CHARACTER DETECTION DEVICE, CHARACTER DETECTION METHOD AND CHARACTER DETECTION SYSTEM | |
JP2018519574A (en) | Text image processing method and apparatus | |
CN113011144B (en) | Form information acquisition method, device and server | |
CN111709956B (en) | Image processing method, device, electronic equipment and readable storage medium | |
CN111914698A (en) | Method and system for segmenting human body in image, electronic device and storage medium | |
CN111191649A (en) | Method and equipment for identifying bent multi-line text image | |
CN112200117A (en) | Form identification method and device | |
WO2021190155A1 (en) | Method and apparatus for identifying spaces in text lines, electronic device and storage medium | |
CN111104941B (en) | Image direction correction method and device and electronic equipment | |
CN114529773A (en) | Form identification method, system, terminal and medium based on structural unit | |
CN111881732B (en) | SVM (support vector machine) -based face quality evaluation method | |
JP7121132B2 (en) | Image processing method, apparatus and electronic equipment | |
CN111563505A (en) | Character detection method and device based on pixel segmentation and merging | |
CN116597466A (en) | Engineering drawing text detection and recognition method and system based on improved YOLOv5s | |
CN112749690B (en) | Text detection method and device, electronic equipment and storage medium | |
CN116777905B (en) | Intelligent industrial rotation detection method and system based on long tail distribution data | |
CN113449726A (en) | Character comparison and identification method and device | |
CN117496521A (en) | Method, system and device for extracting key information of table and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||