CN113269181A - Information processing apparatus, information processing method, and computer-readable recording medium - Google Patents
- Publication number: CN113269181A
- Application number: CN202010093279.8A
- Authority: CN (China)
- Prior art keywords: point, baseline, text line, image
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/635: Overlay text, e.g. embedded captions in a TV program
- G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods
- G06V30/10: Character recognition
Abstract
An information processing apparatus, an information processing method, and a computer-readable recording medium are disclosed. The information processing apparatus includes: a detection unit configured to detect a start point and an end point of each of at least one text line included in an image; a prediction unit configured to predict, for each text line, a first baseline point of the text line from an image block corresponding to the start point extracted from the image based on the start point of the text line, and predict an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing the path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and an obtaining unit configured to obtain a corrected image of each text line based on the start point, the plurality of baseline points, and the end point of the text line.
Description
Technical Field
The present disclosure relates to the field of information processing, and more particularly, to an information processing apparatus, an information processing method, and a computer-readable recording medium that correct tilt and/or curvature along the path trajectory of a text line.
Background
OCR (optical character recognition) technology is widely used in, for example, the postal, finance, insurance, and tax industries, improving efficiency in industry and daily life. Accurate, automatic text recognition results provide more information while saving labor. Existing high-performance recognition engines can typically process only single-line text string images, while actual document images often contain many lines of text. Therefore, a document image is usually segmented into text lines in advance before being processed. The traditional text line segmentation method is based on the pixel distribution of text lines: the whole text image is projected in a fixed direction, and segmentation points are determined from blank regions of the pixel values. This method can generally process only regular text line images and does not segment the image well when the handwritten document image contains bending, distortion, and the like.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In view of the above problems, it is an object of the present disclosure to provide an information processing apparatus and an information processing method capable of solving one or more disadvantages in the related art.
According to an aspect of the present disclosure, there is provided an information processing apparatus including: a detection unit configured to detect a start point and an end point of each of at least one text line included in an image; a prediction unit configured to predict, for each text line, a first baseline point of the text line from an image block corresponding to the start point extracted from the image based on the start point of the text line, and predict an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing the path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and an obtaining unit configured to obtain a corrected image of each text line based on the start point, the plurality of baseline points, and the end point of the text line.
According to another aspect of the present disclosure, there is provided an information processing method including: detecting a start point and an end point of each of at least one text line included in an image; predicting, for each text line, a first baseline point of the text line from an image block corresponding to the start point extracted from the image based on the start point of the text line, and predicting an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing the path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and obtaining a corrected image of each text line based on the start point, the plurality of baseline points, and the end point of the text line.
According to still another aspect of the present disclosure, there is provided a computer-readable recording medium having a program recorded thereon for causing a computer to execute the steps of: detecting a start point and an end point of each of at least one text line included in an image; predicting, for each text line, a first baseline point of the text line from an image block corresponding to the start point extracted from the image based on the start point of the text line, and predicting an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing the path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and obtaining a corrected image of each text line based on the start point, the plurality of baseline points, and the end point of the text line.
According to other aspects of the present disclosure, there is also provided computer program code and a computer program product for implementing the above-described method according to the present disclosure.
Additional aspects of the embodiments of the present disclosure are set forth in the description section that follows; the detailed description fully discloses preferred embodiments of the disclosure without imposing limitations thereon.
Drawings
The disclosure may be better understood by reference to the following detailed description taken in conjunction with the accompanying drawings, in which like or similar reference numerals designate like or similar components throughout the figures. The accompanying drawings, which are incorporated in and form a part of the specification, further illustrate preferred embodiments of the present disclosure and explain the principles and advantages of the present disclosure. Wherein:
fig. 1 is a block diagram showing a functional configuration example of an information processing apparatus according to an embodiment of the present disclosure;
fig. 2 is a diagram showing an example in which a detection unit detects start and end points of a text line according to an embodiment of the present disclosure;
fig. 3 is a diagram illustrating an example of image blocks extracted from an image corresponding to a start point according to an embodiment of the present disclosure;
fig. 4 is a diagram illustrating an example of a first baseline point according to an embodiment of the present disclosure;
FIG. 5 illustrates an example of an image block extracted from an image corresponding to a first baseline point in accordance with an embodiment of the disclosure;
FIG. 6 shows an example of path trajectories in a text line in accordance with an embodiment of the present disclosure;
fig. 7 shows an example of a process of obtaining a corrected image of text lines according to an embodiment of the present disclosure;
FIG. 8 illustrates an example of a location corresponding to a baseline point in accordance with an embodiment of the disclosure;
FIG. 9 illustrates an example of a corrected image of a line of text according to an embodiment of the disclosure;
FIG. 10 shows a schematic view of a polygon according to an embodiment of the present disclosure;
FIG. 11 shows an example of a first shape and a second shape in accordance with an embodiment of the present disclosure;
FIG. 12 shows an example of a final corrected image according to an embodiment of the disclosure;
fig. 13 is a flowchart showing an example of a flow of an information processing method according to an embodiment of the present disclosure; and
fig. 14 is a block diagram showing an example structure of a personal computer employable in the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structures and/or processing steps closely related to the scheme according to the present disclosure are shown in the drawings, and other details not so relevant to the present disclosure are omitted.
Embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings.
First, a functional block diagram of an information processing apparatus 100 of an embodiment of the present disclosure will be described with reference to fig. 1. Fig. 1 is a block diagram showing a functional configuration example of an information processing apparatus 100 according to an embodiment of the present disclosure. As shown in fig. 1, an information processing apparatus 100 according to an embodiment of the present disclosure includes a detection unit 102, a prediction unit 104, and an obtaining unit 106.
The detecting unit 102 may be configured to detect a start point and an end point of each of at least one text line included in the image.
By way of example, the lines of text may be lines of text in Chinese, Japanese, English, and so forth. As an example, the detection unit 102 may predict a Start-of-line (SOL) and an End-of-line (EOL) of each text line at the same time. As an example, the detection unit 102 may predict candidates for SOL and EOL and classify the candidates as SOL or EOL.
As an example, the detection unit 102 may be configured to sequentially input image patches extracted from an image to the second deep learning network and calculate a confidence that a point in the image patch is a start point or an end point of a text line; and, in a case where the confidence determines that the point in the image patch is a start point or an end point of a text line, classify the point in the image patch as a start point or an end point using the classification confidence output from the second deep learning network, thereby detecting the start point and the end point of each text line.
Fig. 2 is a diagram illustrating an example in which the detection unit 102 detects the start point and the end point of a text line according to an embodiment of the present disclosure. For convenience of description, the image shown in fig. 2 includes only one text line, however, it may be understood by those skilled in the art that if a plurality of text lines are included in the image, the image including each text line may be extracted from the image, and then the image including each text line is input to the second deep learning network, respectively, so that the start point and the end point of each text line may be detected.
As shown in fig. 2, the detection unit 102 divides an image into image blocks; hereinafter, the image is sometimes referred to as the original image. The size of the image blocks can be set by those skilled in the art according to the actual application scenario. As an example, the size of an image patch may be set to 16 × 16. As an example, the detection unit 102 may extract image patches from the image by moving a bounding box over the image.
As an example, taking an extracted image patch as its input, the second deep learning network may extract seven features for each pixel point in the image patch. Specifically, the second deep learning network may be used to extract (regress), for each pixel point in the extracted image patch, position information (including horizontal coordinate information and vertical coordinate information), scale information, rotation information, a confidence of being a start point or an end point of a text line, and classification confidences (including "confidence classified as a start point" and "confidence classified as an end point"). The rotation information is, for example, a rotation angle, and the confidence is, for example, a probability. The scale information and the rotation information are related to the degree of tilt of the starting point of the text line.
Note that, for convenience of description, the deep learning network involved in the detection of the start point and the end point by the detection unit 102 is referred to as a second deep learning network, and hereinafter, the deep learning network that predicts the position and rotation information of the baseline point is referred to as a first deep learning network. The "first" and "second" are used herein only to distinguish between different deep learning networks. As an example, the first deep learning network and the second deep learning network are both trained networks.
As an example, the second deep learning network may be a convolutional neural network CNN. As an example, the second deep learning network may be a truncated VGG-11 network.
As an example, the detection unit 102 may determine that a point in the image patch is a start point or an end point of a text line in a case where "the confidence as the start point or the end point of the text line" is greater than a predetermined threshold. As an example, the predetermined threshold may be determined according to an actual application scenario.
As an example, the detection unit 102 may detect a point having a maximum "confidence classified as a start point" value as a start point of a text line, and may detect a point having a maximum "confidence classified as an end point" value as an end point of the text line.
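As an illustrative sketch of the above selection logic (in Python; the 7-channel per-pixel layout, the field names, and the 0.5 threshold are assumptions, not fixed by the text):

```python
import numpy as np

def detect_sol_eol(patch_features, conf_threshold=0.5):
    """Classify candidate points of one patch as start (SOL) or end (EOL).

    patch_features: (H, W, 7) array of per-pixel regression outputs,
    assumed laid out as [x, y, scale, rotation, point_conf,
    sol_conf, eol_conf].
    """
    sols, eols = [], []
    # Keep only pixels whose confidence of being a start/end point of a
    # text line exceeds the predetermined threshold.
    ys, xs = np.where(patch_features[..., 4] > conf_threshold)
    for y, x in zip(ys, xs):
        fx, fy, scale, angle, _, sol_conf, eol_conf = patch_features[y, x]
        point = {"x": fx, "y": fy, "scale": scale, "angle": angle,
                 "sol_conf": sol_conf, "eol_conf": eol_conf}
        # Classify by the larger of the two classification confidences.
        (sols if sol_conf >= eol_conf else eols).append(point)
    return sols, eols
```

The candidate with the maximum "confidence classified as a start point" among the retained points can then be taken as the start point of the text line, and likewise for the end point.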
As is apparent from the above description, the detection unit 102 according to the embodiment of the present disclosure can accurately detect the start point and the end point of each text line.
The prediction unit 104 is configured to predict, for each text line, a first baseline point of the text line from an image block corresponding to the start point extracted from the image based on the start point of the text line, and predict an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing the path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2.
A baseline point may be a point in a single text line that represents the writing path of the text line. Baseline points are typically pixels on the characters in the text line and represent the positions of the characters. Once a series of baseline points of a text line are predicted, the writing path of the text line can be approximated, so that an image of the text line can be further obtained.
Since the detection unit 102 has obtained the position, scale, and rotation information of the start point via the second deep learning network, the position, scale, and rotation information of the first baseline point of the text line can be predicted from the image block corresponding to the start point extracted from the image based on the start point. In this way, the second baseline point of the text line may be predicted from the image block corresponding to the first baseline point extracted from the image based on the first baseline point, the third baseline point of the text line may be predicted from the image block corresponding to the second baseline point extracted from the image based on the second baseline point, and the above steps are repeated until the edge of the text line image is reached, whereby a plurality of baseline points representing the path trajectory of the text line may be predicted. It should be noted here that, since there is sometimes noise after the end point (for example, blur due to printing at the end of a text line, etc.), baseline points may be detected even after the end point.
The obtaining unit 106 is configured to obtain a corrected image of each text line based on the start point, the plurality of baseline points, and the end point of the text line.
As an example, the corrected image of each text line may be an image in which a tilt and/or a curvature on the path trajectory of the text line is corrected, so that the obtaining unit 106 may straighten and normalize the text line image curved at an arbitrary angle.
As is apparent from the above description, the information processing apparatus 100 according to the embodiment of the present disclosure can correct the inclination and/or the curvature on the path trajectory of the text line image, so that the text line image curved at an arbitrary angle can be straightened out to further recognize the text line.
As an example, the prediction unit 104 may be configured to extract an image block corresponding to a start point using a window unit corresponding to the start point based on information of the start point.
As an example, the window unit may be an image window matrix.
The transformation matrix of the starting point is defined as:

A = [[S0·cosθ0, -S0·sinθ0, x0], [S0·sinθ0, S0·cosθ0, y0], [0, 0, 1]] (expression 1)

In expression 1, S0 represents the scale of the starting point, θ0 represents the rotation angle of the starting point, and (x0, y0) represents the position information of the starting point. As described above, the information S0, θ0, and (x0, y0) of the starting point is obtained by the detection unit 102 via the second deep learning network when detecting the starting point.

As an example, the window unit W0 of the starting point can be obtained from the transformation matrix of the starting point:

W0 = A·WSOL (expression 2)

where WSOL is a predetermined base window matrix. Those skilled in the art may also conceive of other ways of obtaining the window unit W0, which are not described again here.
As an example, the prediction unit 104 may be configured to extract the image block corresponding to the starting point from the original image using the window unit W0. Those skilled in the art may also conceive of other methods of extracting the image block corresponding to the starting point, which are not described again here.
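As an illustrative sketch of expressions 1 and 2 (in Python; the 32 × 32 patch size and the assumption that a window matrix maps patch coordinates into image coordinates are illustrative, and cv2 is used only for the affine resampling):

```python
import numpy as np
import cv2  # OpenCV, used here only for affine resampling

def start_point_transform(x0, y0, s0, theta0):
    """Transformation matrix A of expression 1: a similarity transform in
    homogeneous coordinates built from the detected position, scale, and
    rotation of the starting point."""
    c, s = np.cos(theta0), np.sin(theta0)
    return np.array([[s0 * c, -s0 * s, x0],
                     [s0 * s,  s0 * c, y0],
                     [0.0,     0.0,    1.0]])

def extract_patch(image, window, patch_size=32):
    """Resample the patch_size x patch_size image block whose pose in the
    original image is given by the 3x3 window matrix."""
    # The window is assumed to map patch coordinates into image
    # coordinates, so it is applied as an inverse (dst -> src) map.
    return cv2.warpAffine(image, window[:2].astype(np.float32),
                          (patch_size, patch_size),
                          flags=cv2.WARP_INVERSE_MAP | cv2.INTER_LINEAR)
```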
Fig. 3 is a diagram illustrating an example of image blocks extracted from an image corresponding to a start point according to an embodiment of the present disclosure.
The box in fig. 3 represents the window unit W0 corresponding to the starting point, where the angle between the lower of the four line segments of the box and the horizontal rightward direction represents the rotation angle θ0 of the starting point; the image block extracted from the original image using the window unit W0 is the image block corresponding to the starting point. In fig. 3, W0 is shown with its left line segment passing through the starting point. However, this is merely an example and not a limitation. For example, the center of W0 may be the starting point, another line segment of W0 may pass through the starting point, or it suffices that the starting point lies inside W0.
As an example, the prediction unit 104 may be configured to: converting a window unit corresponding to the starting point based on the predicted information of the first baseline point to obtain a window unit corresponding to the first baseline point, and extracting an image block corresponding to the first baseline point by using the window unit corresponding to the first baseline point; converting the window unit corresponding to the nth baseline point based on the predicted information of the (N + 1) th baseline point to obtain a window unit corresponding to the (N + 1) th baseline point, and extracting an image block corresponding to the (N + 1) th baseline point by using the window unit corresponding to the (N + 1) th baseline point; and transforming a window unit corresponding to the baseline point immediately before the end point based on the information of the end point to obtain a window unit corresponding to the end point, and extracting an image block corresponding to the end point by using the window unit corresponding to the end point.
As an example, the information of the start point may include position information, scale information, and rotation information of the start point; the information of the (N+1)th baseline point includes position information, scale information, and rotation information of the (N+1)th baseline point; the information of the end point includes position information, scale information, and rotation information of the end point; the scale of all baseline points in each text line is equal to the scale of the start point; the position and rotation information of the first baseline point are predicted by the prediction unit 104 inputting the image block corresponding to the start point into the first deep learning network; and the position and rotation information of the (N+1)th baseline point are predicted by the prediction unit 104 inputting the image block corresponding to the Nth baseline point into the first deep learning network.
The scale of the baseline points in each text line may or may not be equal to the scale of the starting point. Hereinafter, for convenience of description, it is assumed that the scale of all baseline points in each text line is equal to the scale of the starting point. As hereinbefore, with S0 representing the scale of the starting point, the scales of all baseline points belonging to the same text line as the starting point may all be S0. In addition, θ1 represents the rotation angle of the first baseline point, and (x1, y1) represents the position information of the first baseline point.
As an example, the first deep learning network may be a convolutional neural network CNN. By way of example, a CNN may consist of seven convolutional layers of 3 x 3 kernel size, with feature vector dimensions of the first six layers being 64,128,256,256,512 and 512, respectively. Batch Normalization (BN) may be added after the fourth and fifth convolutional layers, pooling layers may be added after the first, second, fourth and sixth convolutional layers, and one full convolutional layer may be used for the final output of CNN.
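A sketch of such a CNN in PyTorch follows (assuming 32 × 32 input patches and 3 input channels; the width of the seventh convolutional layer and the form of the output head are assumptions, with the head here regressing the rotation angle and position of the next baseline point):

```python
import torch.nn as nn

class LineFollowerCNN(nn.Module):
    """Sketch of the seven-layer CNN described above; layer widths follow
    the text where given, everything else is an illustrative assumption."""
    def __init__(self, in_ch=3, out_dim=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # pool after conv 1
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # pool after conv 2
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),      # conv 3
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(),                    # BN after conv 4
            nn.MaxPool2d(2),                                   # pool after conv 4
            nn.Conv2d(256, 512, 3, padding=1),
            nn.BatchNorm2d(512), nn.ReLU(),                    # BN after conv 5
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # pool after conv 6
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),      # conv 7 (width assumed)
        )
        # "Full" convolution over the remaining 2x2 map yields one output
        # vector per patch, e.g. (theta, x, y) of the next baseline point.
        self.head = nn.Conv2d(512, out_dim, kernel_size=2)

    def forward(self, patch):
        return self.head(self.features(patch)).flatten(1)
```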
The prediction unit 104 may input the image block extracted from the image based on the starting point to the first deep learning network to predict θ1 and (x1, y1) of the first baseline point.
Fig. 4 is a diagram illustrating an example of the first baseline point according to an embodiment of the present disclosure.
P1 may be obtained based on θ1 and (x1, y1):

P1 = [[cosθ1, -sinθ1, x1], [sinθ1, cosθ1, y1], [0, 0, 1]] (expression 3)

Assuming that W1 represents the window unit corresponding to the first baseline point, W1 may be obtained by transforming W0 based on P1:

W1 = P1·W0 (expression 4)

Those skilled in the art may also conceive of other ways of obtaining the window unit W1, which are not described again here.
As an example, the prediction unit 104 may extract the image block corresponding to the first baseline point using W1.
Fig. 5 illustrates an example of an image block extracted from an image corresponding to a first baseline point according to an embodiment of the present disclosure.
The box in fig. 5 represents the window unit W1 corresponding to the first baseline point, where the angle between the lower of the four line segments of the box and the horizontal rightward direction represents the rotation angle θ1 of the first baseline point; the image block extracted from the original image using the window unit W1 is the image block corresponding to the first baseline point. In fig. 5, W1 is shown with its left line segment passing through the first baseline point. However, this is merely an example and not a limitation. For example, the center of W1 may be the first baseline point, another line segment of W1 may pass through the first baseline point, or it suffices that the first baseline point lies inside W1.
As an example, the prediction unit 104 may input the image block extracted from the image based on the Nth baseline point to the first deep learning network to predict the rotation angle θN+1 and position information (xN+1, yN+1) of the (N+1)th baseline point, thereby predicting the (N+1)th baseline point.
The above process is repeated until the edge of the text line image is reached, and a plurality of baseline points representing the path trajectory of the text line can be predicted.
FIG. 6 shows an example of path trajectories in a text line in accordance with an embodiment of the present disclosure.
In fig. 6, a line connecting the start point, the baseline point, and the end point included in the text line may represent a path trajectory of the text line.
PN+1 may be obtained based on θN+1 and (xN+1, yN+1):

PN+1 = [[cosθN+1, -sinθN+1, xN+1], [sinθN+1, cosθN+1, yN+1], [0, 0, 1]] (expression 5)

Assuming that WN+1 represents the window unit corresponding to the (N+1)th baseline point and WN represents the window unit corresponding to the Nth baseline point, WN+1 may be obtained by transforming WN based on PN+1:

WN+1 = PN+1·WN (expression 6)

By expression 6, WN+1 has a rotation of (θN+1 - θN) relative to WN, a shift of (yN+1 - yN) in the vertical position, and a shift of (xN+1 - xN) in the horizontal position.

Those skilled in the art may also conceive of other ways of obtaining the window unit WN+1, which are not described again here.

As an example, the prediction unit 104 may extract the image block corresponding to the (N+1)th baseline point using WN+1.
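A minimal sketch of this iterative following loop (building on the extract_patch helper above; the stopping test at the image edge and the reading of the point position from the translation column of the window are assumptions):

```python
import numpy as np

def follow_baseline(image, w0, lf_net, max_steps=100):
    """Follow a text line from its start window w0 (a 3x3 matrix).

    lf_net is assumed to map an image patch to (theta, x, y) of the
    next baseline point; max_steps merely bounds the loop.
    """
    window, points = w0, []
    h, w = image.shape[:2]
    for _ in range(max_steps):
        theta, x, y = lf_net(extract_patch(image, window))
        c, s = np.cos(theta), np.sin(theta)
        p = np.array([[c, -s, x],          # expression 5
                      [s,  c, y],
                      [0., 0., 1.]])
        window = p @ window                # expression 6: W_{N+1} = P_{N+1}·W_N
        cx, cy = window[0, 2], window[1, 2]
        if not (0 <= cx < w and 0 <= cy < h):
            break                          # reached the edge of the image
        points.append((cx, cy))
    return points, window
```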
As described above, the detection unit 102 may obtain the rotation information and the position information of the end point via the second deep learning network when detecting the end point, and with the above information and the window unit of the baseline point immediately before the end point, similarly to expression 6, a window unit corresponding to the end point may be calculated.
As an example, the prediction unit 104 may extract an image block corresponding to the end point using a window unit corresponding to the end point.
Fig. 7 shows an example of a process of obtaining a corrected image of text lines according to an embodiment of the present disclosure.
As shown in fig. 7, the prediction unit 104 extracts the image block corresponding to the Nth baseline point using WN. Then, the image block is input to the first deep learning network to predict the rotation angle θN+1 and position information (xN+1, yN+1) of the (N+1)th baseline point, and PN+1 is obtained based on θN+1 and (xN+1, yN+1). Next, WN is transformed based on PN+1 to obtain WN+1.
As an example, the obtaining unit 106 may be configured to, for each text line: obtain a first position corresponding to the start point based on the window unit corresponding to the start point and the window unit corresponding to the first baseline point, sample pixel values at the first position in the image, and map the sampled pixel values to a first mapped image block of a first predetermined size; obtain a second position corresponding to the Nth baseline point based on the window unit corresponding to the Nth baseline point and the window unit corresponding to the (N+1)th baseline point, sample pixel values at the second position in the image, and map the sampled pixel values to a second mapped image block of the first predetermined size corresponding to the Nth baseline point; obtain a third position corresponding to the end point based on the window unit corresponding to the end point and the window unit corresponding to the baseline point immediately before the end point, sample pixel values at the third position in the image, and map the sampled pixel values to a third mapped image block of the first predetermined size; and sequentially connect together the first mapped image block corresponding to the start point, the second mapped image blocks corresponding to the respective baseline points, and the third mapped image block corresponding to the end point, thereby forming a corrected image of the text line.
As an example, the window unit W0 corresponding to the start point may be inverse transformed using expression 7 to obtain the position pair (pu,0, pl,0), where position pu,0 is the position in the original image of the pixel located above the start point, and position pl,0 is the position in the original image of the pixel located below the start point.

In expression 7, position pu,0 is represented as (xu,0, yu,0), where xu,0 denotes the abscissa and yu,0 denotes the ordinate of the pixel located above the start point; position pl,0 is represented as (xl,0, yl,0), where xl,0 denotes the abscissa and yl,0 denotes the ordinate of the pixel located below the start point.
As an example, the person skilled in the art may determine the first predetermined size according to the actual application scenario. As an example, the first predetermined size may be 60 × 60.
As an example, the window unit WN corresponding to the Nth baseline point (where N = 1, 2, …, M, and M is a positive integer of 2 or more) may be inverse transformed using expression 8 to obtain the position pair (pu,N, pl,N), where position pu,N is the position in the original image of the pixel located above the Nth baseline point, and position pl,N is the position in the original image of the pixel located below the Nth baseline point.

In expression 8, position pu,N is represented as (xu,N, yu,N), where xu,N denotes the abscissa and yu,N denotes the ordinate of the pixel located above the Nth baseline point; position pl,N is represented as (xl,N, yl,N), where xl,N denotes the abscissa and yl,N denotes the ordinate of the pixel located below the Nth baseline point.
As an example, letting N in expression 8 be 1, the window unit W1 corresponding to the 1st baseline point is inverse transformed to obtain the position pair (pu,1, pl,1), where position pu,1 is the position in the original image of the pixel located above the 1st baseline point, and position pl,1 is the position in the original image of the pixel located below the 1st baseline point.
The obtaining unit 106 may interpolate pu,0, pl,0 and pu,1, pl,1 to obtain a coordinate matrix as the first position (the size of the coordinate matrix is, as an example, the first predetermined size), sample the pixel value at each coordinate point of the coordinate matrix in the original image, and map the sampled pixel values to a first mapped image block of the first predetermined size.
As an example, expression 8 can be applied to inverse transform the window unit WN+1 corresponding to the (N+1)th baseline point to obtain the position pair (pu,N+1, pl,N+1), where position pu,N+1 is the position in the original image of the pixel located above the (N+1)th baseline point, and position pl,N+1 is the position in the original image of the pixel located below the (N+1)th baseline point.

The obtaining unit 106 may interpolate pu,N, pl,N and pu,N+1, pl,N+1 to obtain a coordinate matrix as the above-described second position (the size of the coordinate matrix is, as an example, the above-described first predetermined size), sample the pixel value at each coordinate point of the coordinate matrix in the original image, and map the sampled pixel values to a second mapped image block of the first predetermined size. As an example, the coordinates in the above coordinate matrix may correspond to the pixel positions in the image block corresponding to the Nth baseline point extracted from the original image using WN; therefore, the obtaining unit 106 may map the image block corresponding to the Nth baseline point to the second mapped image block corresponding to the Nth baseline point.
Fig. 8 illustrates an example of positions corresponding to baseline points according to an embodiment of the disclosure. Fig. 8 shows the position pair (pu,N, pl,N) obtained by inverse transforming the window unit corresponding to the Nth baseline point and the position pair (pu,N+1, pl,N+1) obtained by inverse transforming the window unit corresponding to the (N+1)th baseline point; the image block in the figure is the image block corresponding to the Nth baseline point extracted from the original image using WN.
As an example, the obtaining unit 106 may map the image block corresponding to the Nth baseline point shown in fig. 8 to the second mapped image block by the above-described processing. Since the rotation angle and position information of the Nth baseline point and of the (N+1)th baseline point are used in the calculation of (pu,N, pl,N) and (pu,N+1, pl,N+1), the second mapped image block corresponding to the Nth baseline point can faithfully display the characters in the image block corresponding to the Nth baseline point, and the second mapped image block corresponding to the (N+1)th baseline point can faithfully display the characters in the image block corresponding to the (N+1)th baseline point, with the tilt and/or curvature in the image blocks corresponding to the Nth and (N+1)th baseline points corrected.
As an example, a window unit corresponding to the end point may be inversely transformed into a position pair corresponding to the end point, and a window unit corresponding to a baseline point immediately before the end point may be inversely transformed into a position pair corresponding to a baseline point immediately before the end point. The obtaining unit 106 may interpolate a coordinate matrix (the size of which is the first predetermined size described above) as the third position described above using the pair of positions corresponding to the end point and the pair of positions corresponding to the baseline point immediately before the end point, and sample a pixel value at each coordinate point in the coordinate matrix in the original image and map the sampled pixel value to the third mapped image block of the first predetermined size.
Similar to the description above in connection with fig. 8, the skilled person will understand that the first mapped image block may correct a tilt and/or curvature in the image block corresponding to the start point and that the third mapped image block may correct a tilt and/or curvature in the image block corresponding to the end point.
Note that, in fig. 7, for the sake of simplicity, only the second mapped image block corresponding to the nth baseline point and the second mapped image block corresponding to the N +1 th baseline point are shown connected together. However, as described above, the first mapped image block corresponding to the start point of the text line, the second mapped image block corresponding to each base line point of the text line, and the third mapped image block corresponding to the end point of the text line are sequentially connected together to form a corrected image of the text line. Assuming that the first predetermined size is 60 × 60 as described above, if the sum of the numbers of the start point, the baseline point, and the end point in one text line is s, the corrected image size of the text line is 60s (length) × 60 (width).
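A sketch of this sampling-and-mapping step follows (assuming a 2D grayscale image, bilinear interpolation between consecutive position pairs, and the 60 × 60 block size used above; producing one mapped block per pair of consecutive points is a simplification of the description):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def dewarp_segment(image, p_u0, p_l0, p_u1, p_l1, size=60):
    """Interpolate a size x size coordinate grid between the position
    pairs of two consecutive points and sample the original image at it.
    Positions are assumed to be (x, y) pairs."""
    t = np.linspace(0.0, 1.0, size)                    # along the baseline
    v = np.linspace(0.0, 1.0, size)[:, None]           # across the line
    top = np.outer(1 - t, p_u0) + np.outer(t, p_u1)    # (size, 2) upper edge
    bot = np.outer(1 - t, p_l0) + np.outer(t, p_l1)    # (size, 2) lower edge
    grid = (1 - v[..., None]) * top + v[..., None] * bot   # (size, size, 2)
    # map_coordinates expects (row, col) = (y, x) coordinate arrays.
    return map_coordinates(image, [grid[..., 1], grid[..., 0]], order=1)

def correct_text_line(image, pairs, size=60):
    """Concatenate the mapped blocks of consecutive position pairs into
    the corrected text line image; pairs is a list of (p_u, p_l)."""
    blocks = [dewarp_segment(image, *pairs[i], *pairs[i + 1], size=size)
              for i in range(len(pairs) - 1)]
    return np.concatenate(blocks, axis=1)              # 60 x 60s strip
```

Calling correct_text_line with the position pairs of the start point, the baseline points, and the end point in order yields the horizontally concatenated corrected strip.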
Fig. 9 illustrates an example of a corrected image of a text line according to an embodiment of the present disclosure. The image shown in fig. 9 is a corrected image of the text line whose path trajectory is shown in fig. 6. As can be seen in fig. 9, the corrected image according to the embodiment of the present disclosure corrects the skew and curvature on the path trajectory of its corresponding text line shown in fig. 6, thereby facilitating further recognition of the text line.
Although the text lines shown in the above examples are text lines arranged in the horizontal direction, it will be understood by those skilled in the art that the information processing apparatus 100 according to the embodiment of the present disclosure can also correct the inclination and curvature existing in the text lines arranged in the vertical direction, and will not be described in a repeated manner here.
As an example, the detection unit 102 may be configured to detect a plurality of start points for each text line, and the obtaining unit 106 may be configured to perform the following optimization process for each text line: for each baseline path that starts with one of the plurality of start points and includes the baseline points corresponding to that start point, selecting a predetermined number of baseline points counted from the start point; for each baseline path, connecting the start point and the selected baseline points end to end in sequence to obtain a polygon; deleting the baseline path of the start point with the lower confidence in a case where the overlap ratio between any two of the polygons is greater than a first predetermined threshold; and, in a case where more than one baseline path remains for the text line, retaining only the baseline path of the start point with the highest confidence.
As an example, the predetermined number may be determined according to an actual application scenario.
As an example, the baseline points corresponding to a start point refer to the plurality of baseline points predicted starting from that start point.
As an example, the polygon may be derived based on the position information of the start point and the selected baseline points. As an example, the start point and the selected baseline points are connected end to end in sequence to obtain the polygonal area.
Fig. 10 shows a schematic diagram of a polygon according to an embodiment of the present disclosure. As shown in fig. 10, assuming that three baseline points counted from the start point (a first baseline point, a second baseline point, and a third baseline point) are selected, the start point and the selected three baseline points are connected end to end in sequence, and a polygon can be obtained.
As an example, the confidence mentioned in the above optimization processing refers to "confidence classified as a starting point".
As an example, the above optimization processing may be performed using an NMS (non-maximum suppression) method.
Redundant baseline paths for lines of text may be removed by the optimization process described above.
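A sketch of this suppression step follows (using shapely for the polygon overlap; the IoU definition of the overlap ratio, the 0.5 threshold, and the field names are assumptions):

```python
from shapely.geometry import Polygon

def suppress_redundant_paths(paths, overlap_threshold=0.5):
    """NMS-style removal of redundant baseline paths.

    Each entry of paths is assumed to be a dict with the start point's
    "confidence" and the "vertices" (start point plus the selected
    baseline points, at least three in total) of its polygon.
    """
    # Process paths in decreasing order of start-point confidence.
    paths = sorted(paths, key=lambda p: p["confidence"], reverse=True)
    kept = []
    for path in paths:
        poly = Polygon(path["vertices"])
        redundant = False
        for other in kept:
            other_poly = Polygon(other["vertices"])
            inter = poly.intersection(other_poly).area
            union = poly.union(other_poly).area
            if union > 0 and inter / union > overlap_threshold:
                redundant = True    # lower-confidence duplicate: drop it
                break
        if not redundant:
            kept.append(path)
    return kept
```

Per the description above, if more than one path survives for a text line, only the most confident one (the first element of the sorted kept list) would be retained.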
As an example, the obtaining unit 106 may be configured to calculate, before performing the above-described optimization process, a first shape corresponding to each start point of the plurality of start points based on the position information and the scale information of that start point, and, in a case where the overlap ratio between any two of the first shapes is greater than a second predetermined threshold, delete the start point with the lower confidence of the two start points respectively corresponding to the two first shapes. Since some start points are deleted before the optimization process, the amount of calculation of the optimization process can be reduced.
As an example, the second predetermined threshold may be set according to the actual application scenario.
As an example, the first shape may be a circle having a start point as a center, or the first shape may be a square or a rectangle having a start point as a center, or the like. Other examples of the first shape will occur to those skilled in the art and will not be described here in detail.
As an example, the detection unit 102 may be configured to detect a plurality of end points for each text line, and the obtaining unit 106 may be configured to perform the following obtaining process for each text line: calculating a second shape corresponding to each end point of the plurality of end points based on the position information and the scale information of that end point; in a case where none of the baseline points on the retained baseline path falls within the second shape corresponding to any of the plurality of end points, leaving the baseline points on the retained baseline path intact; in a case where one of the baseline points on the retained baseline path falls within the second shapes respectively corresponding to at least some of the end points, pairing that baseline point with the end point having the highest confidence among the at least some end points, and deleting the baseline points on the retained baseline path that follow the paired end point; and removing the second mapped image blocks respectively corresponding to the deleted baseline points following the paired end point from the corrected image corresponding to the retained baseline path, thereby obtaining a final corrected image.
As described above, since there is sometimes noise after the end point (for example, blur due to printing or the like at the end of the text line), baseline points may be detected after the end point. By the above processing, the baseline points subsequent to the end point can be deleted, and the second mapped image blocks respectively corresponding to the deleted baseline points subsequent to the paired end point can be removed from the corrected image, thereby obtaining a more accurate final corrected image.
As an example, the second shape may be a circle centered at the end point, or the second shape may be a square or rectangle centered at the end point, or the like. Other examples of the second shape will occur to those skilled in the art and will not be described here in detail.
As an example, the confidence mentioned in the above obtaining processing refers to "confidence classified as an end point".
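A sketch of this pairing-and-truncation step follows (a circular second shape with radius equal to the end point's scale, and the field names, are assumptions):

```python
import numpy as np

def truncate_at_end_point(baseline_points, end_points):
    """Pair a retained baseline path with an end point and drop trailing
    noise.

    baseline_points: list of (x, y) points along the retained path.
    end_points: list of dicts with assumed keys "x", "y", "scale",
    and "eol_conf".
    """
    for i, bp in enumerate(baseline_points):
        # End points whose circular second shape contains this point.
        hits = [ep for ep in end_points
                if np.hypot(bp[0] - ep["x"], bp[1] - ep["y"]) <= ep["scale"]]
        if hits:
            best = max(hits, key=lambda ep: ep["eol_conf"])
            # Keep points up to the paired end point; drop the rest.
            return baseline_points[:i + 1], best
    return baseline_points, None   # no end point hit: keep all points
```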
Fig. 11 illustrates examples of a first shape and a second shape according to embodiments of the present disclosure. In fig. 11, the first shape and the second shape are shown in a substantially circular shape. Where the line on each text line in fig. 11 is obtained by connecting the start point, the plurality of baseline points, and the end point of the text line, the line representing the path trajectory of the text line.
As an example, the obtaining unit 106 may be configured to delete, in a case where an overlap ratio between any two of the second shapes is larger than a third predetermined threshold before performing the obtaining process, an end point having a lower confidence in two end points respectively corresponding to the two second shapes. Since the partial end point is deleted before the obtaining processing, the amount of calculation of the obtaining processing can be reduced.
As an example, the third predetermined threshold may be set according to an actual application scenario.
Fig. 12 shows an example of a final corrected image according to an embodiment of the present disclosure. The image shown in fig. 12 is the final corrected image of each text line shown in fig. 11. As can be seen from fig. 12, the final corrected image corrects the skew and/or curvature on the path trajectory of the text line shown in fig. 11, thereby straightening and normalizing the curved text line image shown in fig. 11.
In correspondence with the above-described information processing apparatus embodiments, the present disclosure also provides embodiments of the following information processing method.
Fig. 13 is a flowchart illustrating an example of a flow of an information processing method 1300 according to an embodiment of the present disclosure.
As shown in fig. 13, an information processing method 1300 according to an embodiment of the present disclosure includes a detection step S1302, a prediction step S1304, and an obtaining step S1306.
The information processing method 1300 starts in step S1301.
In the detection step S1302, a start point and an end point of each of at least one text line included in the image are detected.
As an example, in the detecting step S1302, image patches extracted from the image may be sequentially input to the second deep learning network, and a confidence that a point in the image patch is a start point or an end point of a text line may be calculated; and, in a case where the confidence determines that the point in the image patch is a start point or an end point of a text line, the point in the image patch may be classified as a start point or an end point using the classification confidence output from the second deep learning network, thereby detecting the start point and the end point of each text line.
As an example, the second deep learning network may be a convolutional neural network.
Specific examples of detecting the start point and the end point of each text line can be found in the corresponding parts of the above apparatus embodiments, such as the description about the detection unit 102 and fig. 2, and will not be repeated here.
In the detecting step S1302 according to the embodiment of the present disclosure, the start point and the end point of each text line can be accurately detected.
In the predicting step S1304, for each text line, a first baseline point of the text line is predicted from an image block corresponding to the start point extracted from the image based on the start point of the text line, and an (N+1)th baseline point of the text line is predicted from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing the path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2.
Specific examples of predicting a plurality of baseline points of a path trajectory used to represent a line of text can be found in the corresponding parts of the above apparatus embodiments, e.g., the description of the prediction unit 104, and will not be repeated here.
In the obtaining step S1306, a corrected image of each text line is obtained based on the start point, the plurality of baseline points, and the end point of the text line.
As an example, the corrected image of each text line may be an image in which the inclination and/or curvature on the path trajectory of the text line is corrected, so that the text line image curved at an arbitrary angle may be straightened and normalized.
The information processing method 1300 ends in step S1307.
As is apparent from the above description, the information processing method 1300 according to the embodiment of the present disclosure can correct the inclination and/or curvature on the path trajectory of the text line image, so that the text line image curved at an arbitrary angle can be straightened out to further recognize the text line.
As an example, in the prediction step S1304, an image block corresponding to the start point may be extracted using a window unit corresponding to the start point based on the information of the start point.
Specific examples of extracting the image block corresponding to the starting point can be found in the corresponding parts of the above device embodiments, for example, the descriptions about expressions 1-2 and fig. 3, and are not repeated here.
As an example, in the predicting step S1304, a window unit corresponding to the starting point may be transformed based on the predicted information of the first baseline point to obtain a window unit corresponding to the first baseline point, and an image block corresponding to the first baseline point may be extracted using the window unit corresponding to the first baseline point; converting the window unit corresponding to the nth baseline point based on the predicted information of the (N + 1) th baseline point to obtain a window unit corresponding to the (N + 1) th baseline point, and extracting an image block corresponding to the (N + 1) th baseline point by using the window unit corresponding to the (N + 1) th baseline point; and transforming a window unit corresponding to the baseline point immediately before the end point based on the information of the end point to obtain a window unit corresponding to the end point, and extracting an image block corresponding to the end point by using the window unit corresponding to the end point.
As an example, the information of the start point may include position information, scale information, and rotation information of the start point; the information of the (N+1)th baseline point may include position information, scale information, and rotation information of the (N+1)th baseline point; the information of the end point includes position information, scale information, and rotation information of the end point; the scale of all baseline points in each text line may be equal to the scale of the start point; the position and rotation information of the first baseline point may be predicted by inputting the image block corresponding to the start point to the first deep learning network in the predicting step S1304; and the position and rotation information of the (N+1)th baseline point may be predicted by inputting the image block corresponding to the Nth baseline point to the first deep learning network in the predicting step S1304.
As an example, the first deep learning network may be a convolutional neural network.
Specific examples of extracting the image block corresponding to the first baseline point can be found in the corresponding parts of the above apparatus embodiments, for example, the descriptions about expressions 3-4 and fig. 4-5, and will not be repeated here.
Specific examples of extracting the image block corresponding to the (N+1)th baseline point can be found in the corresponding parts of the above apparatus embodiments, for example, the descriptions about expressions 5-6 and fig. 7, and will not be repeated here.
The above process is repeated until the edge of the text line image is reached, and a plurality of baseline points representing the path trajectory of the text line can be predicted.
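The repeated transform-and-predict loop might take the following shape (a hedged sketch reusing extract_patch from the sketch above; predict_offsets stands in for the first deep learning network, and its output convention of relative offsets (dx, dy) and a rotation increment dθ in window coordinates is an assumption):

```python
import numpy as np

def trace_baseline(image, start, predict_offsets, max_points=100):
    """Predict baseline points one by one, transforming the window unit
    by each prediction until the window leaves the image."""
    h, w = image.shape[:2]
    x, y, scale, theta = start            # window unit of the start point
    windows = [(x, y, scale, theta)]
    for _ in range(max_points):
        patch = extract_patch(image, x, y, scale, theta)
        dx, dy, dtheta = predict_offsets(patch)   # network inference
        cos_t, sin_t = np.cos(theta), np.sin(theta)
        # The scale of every baseline point stays equal to the start point's.
        x += scale * (dx * cos_t - dy * sin_t)
        y += scale * (dx * sin_t + dy * cos_t)
        theta += dtheta
        if not (0.0 <= x < w and 0.0 <= y < h):   # image edge reached
            break
        windows.append((x, y, scale, theta))
    return windows
```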
As an example, in the obtaining step S1306, for each text line: a first position corresponding to the starting point is obtained based on the window unit corresponding to the starting point and the window unit corresponding to the first baseline point, pixel values at the first position in the image are sampled, and the sampled pixel values are mapped to a first mapped image block of a first predetermined size; a second position corresponding to the Nth baseline point is obtained based on the window unit corresponding to the Nth baseline point and the window unit corresponding to the (N+1)th baseline point, pixel values at the second position in the image are sampled, and the sampled pixel values are mapped to a second mapped image block of the first predetermined size corresponding to the Nth baseline point; a third position corresponding to the end point is obtained based on the window unit corresponding to the end point and the window unit corresponding to the baseline point immediately before the end point, pixel values at the third position in the image are sampled, and the sampled pixel values are mapped to a third mapped image block of the first predetermined size; and the first mapped image block corresponding to the starting point, the second mapped image blocks corresponding to the baseline points, and the third mapped image block corresponding to the end point are sequentially connected together, thereby forming the corrected image of the text line.
Specific examples of forming the corrected image of each text line may be found in the corresponding parts of the above apparatus embodiments, for example, the description about expressions 7 to 8 and fig. 7 to 8, and will not be repeated here.
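Under the assumption that each pair of neighbouring window units bounds a quadrilateral of the original image that is resampled to a block of the first predetermined size, the mapping and connection steps could look like this (an illustrative sketch; expressions 7-8 are not reproduced here, and the quadrilateral construction and block size are assumptions):

```python
import cv2
import numpy as np

BLOCK_H = BLOCK_W = 32   # hypothetical "first predetermined size"

def window_edge_points(x, y, scale, theta):
    """Midpoints of the top and bottom edges of a window unit."""
    ox, oy = -0.5 * scale * np.sin(theta), 0.5 * scale * np.cos(theta)
    return (x - ox, y - oy), (x + ox, y + oy)

def map_segment(image, win_a, win_b):
    """Sample the quadrilateral between two neighbouring window units
    and map it to a fixed-size image block."""
    (atx, aty), (abx, aby) = window_edge_points(*win_a)
    (btx, bty), (bbx, bby) = window_edge_points(*win_b)
    src = np.float32([[atx, aty], [btx, bty], [bbx, bby], [abx, aby]])
    dst = np.float32([[0, 0], [BLOCK_W, 0], [BLOCK_W, BLOCK_H], [0, BLOCK_H]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, H, (BLOCK_W, BLOCK_H))

def corrected_line_image(image, windows):
    """Sequentially connect the mapped blocks to form the corrected image."""
    return np.hstack([map_segment(image, a, b)
                      for a, b in zip(windows[:-1], windows[1:])])
```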
As an example, in the detecting step S1302, a plurality of starting points may be detected for each text line, and in the obtaining step S1306, the following optimization processing may be performed for each text line: for each baseline path, which starts with one of the plurality of starting points and includes the baseline points corresponding to that starting point, selecting a predetermined number of baseline points counted from the starting point; for each baseline path, sequentially connecting the starting point and the selected baseline points end to end to obtain a polygon; in the case where the overlap ratio between any two of the polygons is greater than a first predetermined threshold, deleting the baseline path of the starting point with the lower confidence; and in the case where more than one baseline path remains for the text line, retaining only the baseline path of the starting point with the highest confidence. Redundant baseline paths of a text line can be removed by the above optimization processing.
Specific examples of the optimization process can be found in the corresponding parts of the above apparatus embodiments, for example, the description about fig. 10, and are not repeated here.
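One way to realize this pruning is a non-maximum-suppression style pass over the candidate paths, sketched below under stated assumptions: shapely (not named by the patent) measures polygon overlap, the overlap ratio is taken as intersection area over the smaller polygon's area, and the path data layout is invented for illustration.

```python
from shapely.geometry import Polygon

def prune_baseline_paths(paths, n_points=5, overlap_thresh=0.5):
    """paths: list of (confidence, [(x, y), ...]) baseline paths of one
    text line, each beginning at a detected starting point."""
    kept = []                                    # (confidence, points, polygon)
    for conf, pts in sorted(paths, key=lambda p: -p[0]):
        # Connect the starting point and the selected baseline points;
        # buffering gives a possibly degenerate polygon a nonzero area.
        poly = Polygon(pts[: n_points + 1]).buffer(1.0)
        if all(poly.intersection(q).area / min(poly.area, q.area)
               <= overlap_thresh for _, _, q in kept):
            kept.append((conf, pts, poly))
    # If more than one path survives, retain only the most confident one.
    return max(kept, key=lambda k: k[0])[1]
```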
As an example, in the obtaining step S1306, before the above-described optimization processing is performed, a first shape corresponding to each of the plurality of starting points may be calculated based on the position information and scale information of each starting point, and in the case where the overlap ratio between any two of the first shapes is larger than a second predetermined threshold, the starting point with the lower confidence of the two starting points respectively corresponding to the two first shapes is deleted. Since some of the starting points are deleted before the optimization processing, the amount of calculation of the optimization processing can be reduced.
As an example, the first shape may be a circle having a start point as a center, or the first shape may be a square or a rectangle having a start point as a center, or the like. Other examples of the first shape will occur to those skilled in the art and will not be described here in detail.
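The pre-filtering could then be realized as follows (a minimal sketch assuming the first shape is a circle of radius scale/2 centered on each starting point, and reading the overlap ratio as intersection area over the smaller circle's area; both readings are assumptions):

```python
import numpy as np

def circle_overlap_ratio(c1, c2):
    """Intersection area of two circles (x, y, r), normalized by the
    area of the smaller circle."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = np.hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:                 # disjoint circles
        return 0.0
    if d <= abs(r1 - r2):            # smaller circle inside the larger one
        return 1.0
    # Standard lens area of two intersecting circles.
    a1 = r1**2 * np.arccos((d**2 + r1**2 - r2**2) / (2 * d * r1))
    a2 = r2**2 * np.arccos((d**2 + r2**2 - r1**2) / (2 * d * r2))
    a3 = 0.5 * np.sqrt((-d + r1 + r2) * (d + r1 - r2)
                       * (d - r1 + r2) * (d + r1 + r2))
    return (a1 + a2 - a3) / (np.pi * min(r1, r2)**2)

def dedup_start_points(points, thresh=0.5):
    """points: (confidence, x, y, scale); drop the less confident point
    of any pair whose first shapes overlap beyond the threshold."""
    kept = []
    for conf, x, y, s in sorted(points, reverse=True):
        circle = (x, y, s / 2.0)
        if all(circle_overlap_ratio(circle, c) <= thresh for _, c in kept):
            kept.append((conf, circle))
    return kept
```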
As an example, in the detecting step S1302, a plurality of end points may be detected for each text line, and in the obtaining step S1306, the following obtaining processing may be performed for each text line: calculating a second shape corresponding to each end point of the plurality of end points based on the position information and the scale information of each end point; in the case where none of the baseline points on the retained baseline path falls within the second shape corresponding to any of the plurality of end points, leaving the baseline points on the retained baseline path intact; in the case where one of the baseline points on the retained baseline path falls within the second shapes of one or more of the end points, pairing that baseline point with the end point having the highest confidence among those end points, and deleting the baseline points on the retained baseline path after the paired end point; and removing the second mapped image blocks respectively corresponding to the deleted baseline points after the paired end point from the corrected image corresponding to the retained baseline path, thereby obtaining a final corrected image.
Through this obtaining processing, the second mapped image blocks corresponding to the baseline points deleted after the paired end point can be removed from the corrected image, thereby obtaining a more accurate final corrected image.
As an example, the second shape may be a circle centered at the end point, or the second shape may be a square or rectangle centered at the end point, or the like. Other examples of the second shape will occur to those skilled in the art and will not be described here in detail.
As an example, in the obtaining step S1306, before the obtaining processing is performed, in the case where the overlap ratio between any two of the second shapes is larger than a third predetermined threshold, the end point with the lower confidence of the two end points respectively corresponding to the two second shapes may be deleted. Since some of the end points are deleted before the obtaining processing, the amount of calculation of the obtaining processing can be reduced.
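Putting the end-point handling together, the pairing and truncation could be sketched as below (names and data layout are illustrative assumptions; the second shape is again taken to be a circle of radius scale/2):

```python
def pair_and_truncate(baseline_pts, end_points):
    """baseline_pts: [(x, y), ...] of the retained baseline path;
    end_points: [(confidence, x, y, scale), ...] detected for the line.
    Returns the path truncated at the paired end point, or unchanged
    when no baseline point falls inside any second shape."""
    for i, (bx, by) in enumerate(baseline_pts):
        # End points whose second shape contains this baseline point.
        hits = [(conf, ex, ey) for conf, ex, ey, s in end_points
                if (bx - ex)**2 + (by - ey)**2 <= (s / 2.0)**2]
        if hits:
            # Pair with the most confident hit; baseline points after it
            # (and their second mapped image blocks) are discarded.
            return baseline_pts[: i + 1]
    return baseline_pts
```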
It should be noted that although the functional configuration of the information processing apparatus according to the embodiment of the present disclosure is described above, this is merely an example and not a limitation, and a person skilled in the art may modify the above embodiment according to the principle of the present disclosure, for example, addition, deletion, combination, or the like of functional blocks in the respective embodiments may be made, and such modifications fall within the scope of the present disclosure.
In addition, it should be further noted that the method embodiments herein correspond to the apparatus embodiments described above; therefore, for contents not described in detail in the method embodiments, reference may be made to the descriptions of the corresponding parts in the apparatus embodiments, and the description is not repeated here.
In addition, the present disclosure also provides a storage medium and a program product. The machine-executable instructions in the storage medium and the program product according to the embodiments of the present disclosure may be configured to perform the above-described information processing method; therefore, for contents not described in detail herein, reference may be made to the descriptions of the corresponding parts above, and the description is not repeated here.
Accordingly, storage media for carrying the above-described program products comprising machine-executable instructions are also included in the present disclosure. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
Further, it should be noted that the above series of processes and means may also be implemented by software and/or firmware. In the case of implementation by software and/or firmware, a program constituting the software is installed from a storage medium or a network to a computer having a dedicated hardware structure, such as the general-purpose personal computer 1400 shown in fig. 14, which is capable of executing various functions and the like when various programs are installed.
In fig. 14, a Central Processing Unit (CPU) 1401 performs various processes in accordance with a program stored in a Read Only Memory (ROM) 1402 or a program loaded from a storage portion 1408 into a Random Access Memory (RAM) 1403. The RAM 1403 also stores, as needed, data required when the CPU 1401 executes various processes.
The CPU 1401, the ROM 1402, and the RAM 1403 are connected to each other via a bus 1404. An input/output interface 1405 is also connected to the bus 1404.
The following components are connected to the input/output interface 1405: an input portion 1406 including a keyboard, a mouse, and the like; an output portion 1407 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like; a storage section 1408 including a hard disk or the like; and a communication portion 1409 including a network interface card such as a LAN card, a modem, and the like. The communication section 1409 performs communication processing via a network such as the internet.
A drive 1410 is also connected to the input/output interface 1405 as needed. A removable medium 1411 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is loaded into the drive 1410 as needed, so that a computer program read therefrom is installed into the storage section 1408 as needed.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1411.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1411 shown in fig. 14, which stores the program and is distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1411 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1402, a hard disk included in the storage section 1408, or the like, in which programs are stored and which is distributed to users together with the device containing it.
The preferred embodiments of the present disclosure are described above with reference to the drawings, but the present disclosure is of course not limited to the above examples. Various changes and modifications within the scope of the appended claims may be made by those skilled in the art, and it should be understood that these changes and modifications naturally will fall within the technical scope of the present disclosure.
For example, a plurality of functions included in one unit may be implemented by separate devices in the above embodiments. Alternatively, a plurality of functions implemented by a plurality of units in the above embodiments may be implemented by separate devices, respectively. In addition, one of the above functions may be implemented by a plurality of units. Needless to say, such a configuration is included in the technical scope of the present disclosure.
In this specification, the steps described in the flowcharts include not only the processing performed in time series in the described order but also the processing performed in parallel or individually without necessarily being performed in time series. Further, even in the steps processed in time series, needless to say, the order can be changed as appropriate.
In addition, the technique according to the present disclosure can also be configured as follows.
Supplementary note 1. An information processing apparatus, comprising:
a detection unit configured to detect a start point and an end point of each of at least one text line included in an image;
a prediction unit configured to predict, for each text line, a first baseline point of the text line from an image block corresponding to the start point extracted from the image based on the start point of the text line, and predict an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing a path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and
an obtaining unit configured to obtain a corrected image of each text line based on the start point, the plurality of baseline points, and the end point of the text line.
Supplementary note 2. The information processing apparatus according to supplementary note 1, wherein
the prediction unit is configured to extract an image block corresponding to the start point using a window unit corresponding to the start point based on the information of the start point.
Supplementary note 3. The information processing apparatus according to supplementary note 2, wherein
the prediction unit is configured to transform a window unit corresponding to the starting point based on the predicted information of the first baseline point to obtain a window unit corresponding to the first baseline point, and extract an image block corresponding to the first baseline point by using the window unit corresponding to the first baseline point;
transform the window unit corresponding to the Nth baseline point based on the predicted information of the (N+1)th baseline point to obtain a window unit corresponding to the (N+1)th baseline point, and extract an image block corresponding to the (N+1)th baseline point by using the window unit corresponding to the (N+1)th baseline point; and
transform the window unit corresponding to the baseline point immediately before the end point based on the information of the end point to obtain a window unit corresponding to the end point, and extract an image block corresponding to the end point by using the window unit corresponding to the end point.
Supplementary note 4. The information processing apparatus according to supplementary note 3, wherein
the information of the starting point comprises position information, scale information and rotation information of the starting point;
the information of the (N+1)th baseline point comprises position information, scale information, and rotation information of the (N+1)th baseline point;
the information of the end point comprises position information, scale information, and rotation information of the end point;
the scale of all baseline points in each text line is equal to the scale of the starting point;
the position and rotation information of the first baseline point are predicted by the prediction unit inputting the image block corresponding to the starting point into a first deep learning network; and
the position and rotation information of the (N+1)th baseline point are predicted by the prediction unit inputting the image block corresponding to the Nth baseline point into the first deep learning network.
Supplementary note 5. The information processing apparatus according to supplementary note 4, wherein the first deep learning network is a convolutional neural network.
Supplementary note 6. The information processing apparatus according to supplementary note 3, wherein the obtaining unit is configured to, for each text line:
obtain a first position corresponding to the starting point based on the window unit corresponding to the starting point and the window unit corresponding to the first baseline point, sample pixel values at the first position in the image, and map the sampled pixel values to a first mapped image block of a first predetermined size;
obtain a second position corresponding to the Nth baseline point based on the window unit corresponding to the Nth baseline point and the window unit corresponding to the (N+1)th baseline point, sample pixel values at the second position in the image, and map the sampled pixel values to a second mapped image block of the first predetermined size corresponding to the Nth baseline point;
obtain a third position corresponding to the end point based on the window unit corresponding to the end point and the window unit corresponding to the baseline point immediately before the end point, sample pixel values at the third position in the image, and map the sampled pixel values to a third mapped image block of the first predetermined size; and
sequentially connect the first mapped image block corresponding to the starting point, the second mapped image blocks corresponding to the baseline points, and the third mapped image block corresponding to the end point together, thereby forming the corrected image of the text line.
Supplementary note 7. The information processing apparatus according to supplementary note 6, wherein
the detection unit is configured to detect a plurality of starting points for each text line, and
The obtaining unit is configured to perform the following optimization process for each text line:
for each baseline path, which starts with one of the plurality of starting points and includes the baseline points corresponding to that starting point, selecting a predetermined number of baseline points counted from the starting point;
for each baseline path, sequentially connecting the starting point and the selected baseline points end to end to obtain a polygon;
deleting a baseline path of a starting point with lower confidence in the case that the overlap ratio between any two of the polygons is greater than a first predetermined threshold; and
in the event that more than one baseline path remains for the line of text, only the baseline path of the starting point with the highest confidence is retained.
Supplementary note 8. The information processing apparatus according to supplementary note 7, wherein
the obtaining unit is configured to calculate, before performing the optimization processing, a first shape corresponding to each of the plurality of start points based on the position information and scale information of each of the start points, and delete, in a case where an overlap ratio between any two first shapes of the first shapes is larger than a second predetermined threshold, a start point having a lower degree of confidence of two start points respectively corresponding to the two first shapes.
Supplementary note 9. The information processing apparatus according to supplementary note 7 or 8, wherein
the detection unit is configured to detect a plurality of end points for each text line, and
The obtaining unit is configured to perform, for each text line, the following obtaining processing:
calculating a second shape corresponding to each end point of the plurality of end points based on the position information and the scale information of each end point;
in the case where none of the baseline points on the retained baseline path falls within the second shape corresponding to any of the plurality of end points, leaving the baseline points on the retained baseline path intact;
in the case where one of the baseline points on the retained baseline path falls within the second shapes of one or more of the end points, pairing that baseline point with the end point having the highest confidence among those end points, and deleting the baseline points on the retained baseline path after the paired end point; and
removing second mapped image blocks respectively corresponding to the deleted baseline points after the paired end points from the corrected image corresponding to the retained baseline path, thereby obtaining a final corrected image.
Supplementary note 10. The information processing apparatus according to supplementary note 9, wherein
the obtaining unit is configured to delete, in a case where an overlap ratio between any two of the second shapes is larger than a third predetermined threshold before the obtaining process is performed, an end point having a lower confidence in two end points respectively corresponding to the two second shapes.
Supplementary note 11. The information processing apparatus according to supplementary note 1, wherein the detection unit is configured to:
sequentially input image blocks extracted from the image into a second deep learning network, and calculate a confidence indicating whether a point in each image block is a start point or an end point of a text line; and
in the case where it is determined from the confidence that the point in the image block is a start point or an end point of a text line, classify the point in the image block as the start point or the end point using a classification confidence output from the second deep learning network, thereby detecting the start point and the end point of each text line.
Supplementary note 12. The information processing apparatus according to supplementary note 11, wherein the second deep learning network is a convolutional neural network.
Supplementary note 13. The information processing apparatus according to supplementary note 1, wherein the corrected image of each text line is an image in which the inclination and/or curvature on the path trajectory of the text line is corrected.
Supplementary note 14. An information processing method, comprising:
detecting a start point and an end point of each text line in at least one text line included in an image;
for each text line, predicting a first baseline point of the text line from an image block corresponding to the starting point extracted from the image based on the starting point of the text line, and predicting an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing a path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and
obtaining a corrected image of each text line based on the starting point, the plurality of baseline points, and the end point of the text line.
Supplementary note 15. The information processing method according to supplementary note 14, wherein
in the predicting step, an image block corresponding to the start point is extracted using a window unit corresponding to the start point based on the information of the start point.
Supplementary note 16. The information processing method according to supplementary note 15, wherein, in the predicting step,
transforming a window unit corresponding to the starting point based on the predicted information of the first baseline point to obtain a window unit corresponding to the first baseline point, and extracting an image block corresponding to the first baseline point by using the window unit corresponding to the first baseline point;
transforming the window unit corresponding to the Nth baseline point based on the predicted information of the (N+1)th baseline point to obtain a window unit corresponding to the (N+1)th baseline point, and extracting an image block corresponding to the (N+1)th baseline point by using the window unit corresponding to the (N+1)th baseline point; and
transforming the window unit corresponding to the baseline point immediately before the end point based on the information of the end point to obtain a window unit corresponding to the end point, and extracting an image block corresponding to the end point by using the window unit corresponding to the end point.
Supplementary note 17. The information processing method according to supplementary note 16, wherein
the information of the starting point comprises position information, scale information and rotation information of the starting point;
the information of the (N+1)th baseline point comprises position information, scale information, and rotation information of the (N+1)th baseline point;
the information of the end point comprises position information, scale information, and rotation information of the end point;
the scale of all baseline points in each text line is equal to the scale of the starting point;
the position and rotation information of the first baseline point are predicted by inputting the image block corresponding to the starting point into a first deep learning network in the predicting step; and
the position and rotation information of the (N+1)th baseline point are predicted by inputting the image block corresponding to the Nth baseline point into the first deep learning network in the predicting step.
Supplementary note 18. The information processing method according to supplementary note 16, wherein, in the obtaining step, for each text line:
obtaining a first position corresponding to the starting point based on the window unit corresponding to the starting point and the window unit corresponding to the first baseline point, sampling pixel values at the first position in the image, and mapping the sampled pixel values to a first mapped image block of a first predetermined size;
obtaining a second position corresponding to the Nth baseline point based on the window unit corresponding to the Nth baseline point and the window unit corresponding to the (N+1)th baseline point, sampling pixel values at the second position in the image, and mapping the sampled pixel values to a second mapped image block of the first predetermined size corresponding to the Nth baseline point;
obtaining a third position corresponding to the end point based on the window unit corresponding to the end point and the window unit corresponding to the baseline point immediately before the end point, sampling a pixel value at the third position in the image, and mapping the sampled pixel value to a third mapped image block of the first predetermined size; and
sequentially connecting the first mapped image block corresponding to the starting point, the second mapped image blocks corresponding to the baseline points, and the third mapped image block corresponding to the end point together, thereby forming the corrected image of the text line.
Supplementary note 19. The information processing method according to supplementary note 18, wherein
in the detecting step, a plurality of starting points are detected for each text line, and
In the obtaining step, the following optimization process is performed for each text line:
for each baseline path, which starts with one of the plurality of starting points and includes the baseline points corresponding to that starting point, selecting a predetermined number of baseline points counted from the starting point;
for each baseline path, sequentially connecting the starting point and the selected baseline points end to end to obtain a polygon;
deleting a baseline path of a starting point with lower confidence in the case that the overlap ratio between any two of the polygons is greater than a first predetermined threshold; and
in the event that more than one baseline path remains for the line of text, only the baseline path of the starting point with the highest confidence is retained.
Supplementary note 20. A computer-readable recording medium having recorded thereon a program for causing a computer to execute the steps of:
detecting a start point and an end point of each text line in at least one text line included in an image;
for each text line, predicting a first baseline point of the text line from an image block corresponding to the starting point extracted from the image based on the starting point of the text line, and predicting an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing a path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and
obtaining a corrected image of each text line based on the starting point, the plurality of baseline points, and the end point of the text line.
Claims (10)
1. An information processing apparatus comprising:
a detection unit configured to detect a start point and an end point of each of at least one text line included in an image;
a prediction unit configured to predict, for each text line, a first baseline point of the text line from an image block corresponding to the start point extracted from the image based on the start point of the text line, and predict an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing a path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and
an obtaining unit configured to obtain a corrected image of each text line based on the start point, the plurality of baseline points, and the end point of the text line.
2. The information processing apparatus according to claim 1,
the prediction unit is configured to extract an image block corresponding to the start point using a window unit corresponding to the start point based on the information of the start point.
3. The information processing apparatus according to claim 2, wherein the prediction unit is configured to:
transform the window unit corresponding to the starting point based on the predicted information of the first baseline point to obtain a window unit corresponding to the first baseline point, and extract an image block corresponding to the first baseline point by using the window unit corresponding to the first baseline point;
transform the window unit corresponding to the Nth baseline point based on the predicted information of the (N+1)th baseline point to obtain a window unit corresponding to the (N+1)th baseline point, and extract an image block corresponding to the (N+1)th baseline point by using the window unit corresponding to the (N+1)th baseline point; and
transform the window unit corresponding to the baseline point immediately before the end point based on the information of the end point to obtain a window unit corresponding to the end point, and extract an image block corresponding to the end point by using the window unit corresponding to the end point.
4. The information processing apparatus according to claim 3,
the information of the starting point comprises position information, scale information and rotation information of the starting point;
the information of the (N+1)th baseline point comprises position information, scale information, and rotation information of the (N+1)th baseline point;
the information of the end point comprises position information, scale information, and rotation information of the end point;
the scale of all baseline points in each text line is equal to the scale of the starting point;
the position and rotation information of the first baseline point are predicted by the prediction unit inputting the image block corresponding to the starting point into a first deep learning network; and
the position and rotation information of the (N+1)th baseline point are predicted by the prediction unit inputting the image block corresponding to the Nth baseline point into the first deep learning network.
5. The information processing apparatus according to claim 3, wherein the obtaining unit is configured to, for each text line:
obtain a first position corresponding to the starting point based on the window unit corresponding to the starting point and the window unit corresponding to the first baseline point, sample pixel values at the first position in the image, and map the sampled pixel values to a first mapped image block of a first predetermined size;
obtain a second position corresponding to the Nth baseline point based on the window unit corresponding to the Nth baseline point and the window unit corresponding to the (N+1)th baseline point, sample pixel values at the second position in the image, and map the sampled pixel values to a second mapped image block of the first predetermined size corresponding to the Nth baseline point;
obtain a third position corresponding to the end point based on the window unit corresponding to the end point and the window unit corresponding to the baseline point immediately before the end point, sample pixel values at the third position in the image, and map the sampled pixel values to a third mapped image block of the first predetermined size; and
sequentially connect the first mapped image block corresponding to the starting point, the second mapped image blocks corresponding to the baseline points, and the third mapped image block corresponding to the end point together, thereby forming the corrected image of the text line.
6. The information processing apparatus according to claim 5,
the detection unit is configured to detect a plurality of starting points for each text line, and
The obtaining unit is configured to perform the following optimization process for each text line:
for each baseline path, which starts with one of the plurality of starting points and includes the baseline points corresponding to that starting point, selecting a predetermined number of baseline points counted from the starting point;
for each baseline path, sequentially connecting the starting point and the selected baseline points end to end to obtain a polygon;
deleting a baseline path of a starting point with lower confidence in the case that the overlap ratio between any two of the polygons is greater than a first predetermined threshold; and
in the event that more than one baseline path remains for the line of text, only the baseline path of the starting point with the highest confidence is retained.
7. The information processing apparatus according to claim 6,
the obtaining unit is configured to calculate, before performing the optimization processing, a first shape corresponding to each of the plurality of start points based on the position information and scale information of each of the start points, and delete, in a case where an overlap ratio between any two first shapes of the first shapes is larger than a second predetermined threshold, a start point having a lower degree of confidence of two start points respectively corresponding to the two first shapes.
8. The information processing apparatus according to claim 6 or 7,
the detection unit is configured to detect a plurality of end points for each text line, and
The obtaining unit is configured to perform, for each text line, the following obtaining processing:
calculating a second shape corresponding to each end point of the plurality of end points based on the position information and the scale information of each end point;
in the case where none of the baseline points on the retained baseline path falls within the second shape corresponding to any of the plurality of end points, leaving the baseline points on the retained baseline path intact;
in the case where one of the baseline points on the retained baseline path falls within the second shapes of one or more of the end points, pairing that baseline point with the end point having the highest confidence among those end points, and deleting the baseline points on the retained baseline path after the paired end point; and
removing second mapped image blocks respectively corresponding to the deleted baseline points after the paired end points from the corrected image corresponding to the retained baseline path, thereby obtaining a final corrected image.
9. An information processing method comprising:
detecting a start point and an end point of each text line in at least one text line included in an image;
for each text line, predicting a first baseline point of the text line from an image block corresponding to the starting point extracted from the image based on the starting point of the text line, and predicting an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing a path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and
obtaining a corrected image of each text line based on the starting point, the plurality of baseline points, and the end point of the text line.
10. A computer-readable recording medium having a program recorded thereon for causing a computer to execute the steps of:
detecting a start point and an end point of each text line in at least one text line included in an image;
for each text line, predicting a first baseline point of the text line from an image block corresponding to the starting point extracted from the image based on the starting point of the text line, and predicting an (N+1)th baseline point of the text line from an image block corresponding to the Nth baseline point extracted from the image based on the Nth baseline point, so as to predict a plurality of baseline points representing a path trajectory of the text line, where N = 1, 2, …, M, and M is a positive integer greater than or equal to 2; and
obtaining a corrected image of each text line based on the starting point, the plurality of baseline points, and the end point of the text line.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010093279.8A CN113269181A (en) | 2020-02-14 | 2020-02-14 | Information processing apparatus, information processing method, and computer-readable recording medium |
JP2021003680A JP2021128762A (en) | 2020-02-14 | 2021-01-13 | Information processing apparatus, information processing method, and computer-readable recording medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010093279.8A CN113269181A (en) | 2020-02-14 | 2020-02-14 | Information processing apparatus, information processing method, and computer-readable recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113269181A (en) | 2021-08-17
Family
ID=77227277
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010093279.8A Pending CN113269181A (en) | 2020-02-14 | 2020-02-14 | Information processing apparatus, information processing method, and computer-readable recording medium |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2021128762A (en) |
CN (1) | CN113269181A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5131053A (en) * | 1988-08-10 | 1992-07-14 | Caere Corporation | Optical character recognition method and apparatus |
US20120099791A1 (en) * | 2008-04-03 | 2012-04-26 | Olga Kacher | Straightening Out Distorted Text Lines of Images |
US20140002650A1 (en) * | 2012-06-28 | 2014-01-02 | GM Global Technology Operations LLC | Wide baseline binocular object matching method using minimal cost flow network |
US20140140635A1 (en) * | 2012-11-20 | 2014-05-22 | Hao Wu | Image rectification using text line tracks |
CN104504387A (en) * | 2014-12-16 | 2015-04-08 | 杭州华为数字技术有限公司 | Correcting method and device for text image |
CN107730511A (en) * | 2017-09-20 | 2018-02-23 | 北京工业大学 | A kind of Tibetan language historical document line of text cutting method based on baseline estimations |
- 2020-02-14: CN CN202010093279.8A patent/CN113269181A/en active Pending
- 2021-01-13: JP JP2021003680A patent/JP2021128762A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
JP2021128762A (en) | 2021-09-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |