CN115100663A - Method and device for estimating distribution situation of character height in document image - Google Patents

Method and device for estimating distribution situation of character height in document image

Info

Publication number
CN115100663A
Authority
CN
China
Prior art keywords
height
document image
image
characters
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210507208.7A
Other languages
Chinese (zh)
Inventor
熊永平
丁运运
黄思远
伍贵宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202210507208.7A priority Critical patent/CN115100663A/en
Publication of CN115100663A publication Critical patent/CN115100663A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/1463Orientation detection or correction, e.g. rotation of multiples of 90 degrees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Input (AREA)

Abstract

The invention provides a method and a device for estimating the distribution of character heights in a document image, wherein the method comprises the following steps: acquiring a first sample training set; training an initial network model based on the first sample training set to obtain a character height detection model; scaling the document image to be detected to different ratios and inputting the scaled images into the character height detection model to obtain character height recognition results corresponding to the document image to be detected at the different ratios; establishing, based on the acquired recognition results, a to-be-classified character height distribution map for the document image to be detected at each ratio; and inputting the to-be-classified character height distribution maps into a classifier model to obtain the optimal scaling ratio of the document image to be detected, and determining the character height distribution of the document image to be detected based on the character heights corresponding to the optimal scaling ratio. The method can accurately detect the distribution of character heights in a document image.

Description

Method and device for estimating distribution situation of character height in document image
Technical Field
The invention relates to the technical field of computer information, and in particular to a method and a device for estimating the distribution of character heights in a document image.
Background
At present, application scenarios such as reading electronic documents and OCR recognition face the problem of scaling an image to a visually suitable size. In particular, scaling the image to an appropriate size in the input preprocessing stage of OCR can significantly improve the accuracy of OCR recognition; and when a PDF page is read in a reader or a text web page is browsed in a browser, scaling the page image appropriately adjusts the text on the page to a size suitable for reading, which improves the user's reading experience.
With conventional approaches to zooming a PDF page or a text-type web page image, the user generally zooms the image containing the text being read or browsed to a desired size according to reading preference. Because existing methods cannot accurately identify the height distribution of characters in an image, it is difficult to ensure that the image is scaled at a good ratio during zooming; how to accurately detect the distribution of character heights in an image is therefore an urgent technical problem to be solved.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for estimating the distribution of character heights in a document image, so as to solve one or more problems in the prior art.
According to one aspect of the invention, the invention discloses a method for estimating the distribution situation of the height of characters in a document image, which comprises the following steps:
acquiring a first sample training set, wherein sample data in the first sample training set comprises text block images and text heights;
training an initial network model based on the first sample training set to obtain a word height detection model;
zooming the document image to be detected to different proportions and inputting the scaled document image to the character height detection model to obtain character height recognition results corresponding to the document image to be detected in different proportions;
establishing a to-be-classified character height distribution map of the to-be-detected document images in each proportion based on the acquired character height recognition results corresponding to the to-be-detected document images in different proportions; the horizontal axis of the height distribution diagram of the characters to be classified represents the actual character height, and the vertical axis represents the ratio of the number of the characters with corresponding character height to the total number of the characters;
and inputting the height distribution map of the characters to be classified into a classifier model, obtaining the optimal scaling of the document image to be detected, and determining the height distribution condition of the characters of the document image to be detected based on the height of the characters corresponding to the optimal scaling.
In some embodiments of the present invention, obtaining a first training set of samples comprises:
acquiring a document image, cutting the document image into a plurality of character block images, and marking the character height of each character block image;
and performing mosaic processing on the characters cut in the height direction in the character block images.
In some embodiments of the present invention, obtaining the first training set of samples further comprises:
randomly generating a document image, and adding an identification interference item in each generated region of the document image, wherein the identification interference item comprises at least one of a pure white background, Gaussian noise, crystals, salt and pepper noise and real environment information; and/or
And performing data enhancement and rotation on the text block image, and adjusting the brightness, the contrast, the saturation and the hue of the text block image.
In some embodiments of the invention, the method comprises:
constructing a height loss function, wherein the height loss function is as follows:
HLoss = -log(H1 / H2)
wherein HLoss represents the height loss, H1 = min(D1, D3) + min(D2, D4), H2 = D1 + D2 + D3 + D4 - H1, D1 is the predicted distance between the pixel point and the top of the text block image, D2 is the predicted distance between the pixel point and the bottom of the text block image, D3 is the distance between the labeled pixel point and the top of the text block image, and D4 is the distance between the labeled pixel point and the bottom of the text block image.
In some embodiments of the present invention, when obtaining the first sample training set includes performing data enhancement and rotation on the text block image, the method further comprises: constructing an angle loss function, wherein the angle loss function is angleLoss = 1 - cos(θ1 - θ2); where angleLoss represents the angle loss, θ1 is the predicted rotation angle value of the text block image, and θ2 is the labeled rotation angle value of the text block image.
In some embodiments of the present invention, before the text height distribution map is input into a classifier model to obtain an optimal scaling ratio of the document image to be detected and the character height distribution of the document image to be detected is determined based on the character heights corresponding to the optimal scaling ratio, the method further includes:
acquiring a second sample training set, wherein sample data in the second sample training set comprises a character height distribution map and a corresponding optimal scaling;
and training an initial classifier model based on the second sample training set to obtain the classifier model.
In some embodiments of the invention, the method further comprises:
constructing a text recognition loss function of
DiceLoss = 1 - 2|X ∩ Y| / (|X| + |Y|)
wherein DiceLoss represents the text recognition loss, X represents the probability that the labeled pixel points are text, and Y represents the probability that the predicted pixel points are text.
In some embodiments of the invention, the classifier model is a SVM classifier; and/or
Establishing a height distribution map of characters to be classified of the document images to be detected in each proportion based on the obtained character height recognition results corresponding to the document images to be detected in different proportions, wherein the height distribution map comprises the following steps:
counting the number of characters at each character height for the document image to be detected in each proportion;
calculating the ratio of the number of characters with the height of each character corresponding to the document image to be detected under each proportion to the total number of the characters;
and establishing a height distribution map of characters to be classified of the document images to be detected in each proportion by adopting a drawing tool based on the ratio.
According to another aspect of the present invention, there is also disclosed a system for estimating the distribution of text height in a document image, the system comprising a processor and a memory, the memory having stored therein computer instructions, the processor being configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system implementing the steps of the method according to any one of the above embodiments.
According to yet another aspect of the present invention, a computer-readable storage medium is also disclosed, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any of the embodiments above.
According to the method for estimating the distribution of character heights in a document image provided by the invention, a large number of text blocks cut from document images are constructed as the source of the network training set, the character heights in the document image to be detected are obtained through the character height detection model, and the optimal scaling ratio of the document image to be detected is then obtained through the classifier model. The height distribution of the characters in the image can thus be accurately detected, and an image containing text can be scaled to its optimal ratio for document reading or OCR recognition.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principle of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts of the drawings may be exaggerated, i.e., may be larger, relative to other components in an exemplary apparatus actually manufactured according to the present invention. In the drawings:
FIG. 1 is a flowchart illustrating a method for estimating distribution of text heights in a document image according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for estimating a distribution of heights of characters in a document image according to another embodiment of the present invention.
Fig. 3 is a schematic flow chart of the first sample training set according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a text block image according to an embodiment of the invention.
Fig. 5 is a schematic network structure diagram of a word height detection model according to an embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating a height of a text in a text block image according to an embodiment of the invention.
Fig. 7 is a diagram illustrating a height distribution of characters to be classified according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a text height line in a text block image according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps that are closely related to the scheme according to the present invention are shown in the drawings, and other details that are not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising/comprises/having" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
In order to solve the problem of scaling a document image to a proper size in scenarios such as OCR recognition and document image reading, the invention provides a method for estimating the distribution of character heights in a document image.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar parts, or the same or similar steps.
Fig. 1 is a flowchart illustrating a method for estimating distribution of heights of characters in a document image according to an embodiment of the present invention, where as shown in fig. 1, the method for estimating distribution of heights of characters in a document image at least includes steps S10 to S50.
Step S10: obtaining a first sample training set, wherein sample data in the first sample training set comprises text block images and text heights.
In this step, the first sample training set is used to train the initial network model. The first sample training set comprises a plurality of sample data, each of which includes at least a text block image and a character height; the character heights of different sample data may be partially the same or may differ. Fig. 4 shows a plurality of text block images, that is, training text blocks of suitable size obtained by cutting a document image. It should be understood that the number of text block images cut from a document image is not specifically limited and may be set according to the actual application scenario; illustratively, the document image may be cropped into a plurality of 128 x 128 text block images. Fig. 4 is only intended to illustrate the form in which characters appear in the text block images; the specific character content of each text block image is outside the scope of the present invention.
Illustratively, obtaining a first sample training set includes: acquiring a document image, cutting the document image into a plurality of text block images, and labeling the character height of each text block image; and performing mosaic processing on characters that are cut in the height direction in the text block images. In this embodiment, the content of the document image includes Chinese and English characters, upper- and lower-case letters, punctuation marks, and the like. Specifically, before the document image is obtained, the method further comprises randomly generating the document image, where the randomness of the generated document image lies in the Chinese and English characters, letter cases, character sizes, fonts and colors, and in randomly generated punctuation marks.
In the above embodiment, characters may be cut when the document image is cropped. Because the present invention only needs to detect font height, mosaic processing is required only when a character is cut along the vertical axis (the character height direction). To determine whether the height of a character in a text block image has been cut, the coordinate information of each text block image may first be obtained, and the coordinates of the text block images cut from the document image may be compared to determine whether two text block images overlap in the vertical-axis direction (the character height direction). If so, the characters in the corresponding text block image are judged to be cut; these characters are then removed from the text block image and covered with the background color. On the image, this appears as erasing the characters that are split along the vertical axis and covering them with the background.
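By way of illustration only, the following sketch shows one way this mosaic step could be implemented; it assumes the document generator also provides per-character bounding boxes in crop coordinates (the function name and box convention are illustrative, not part of the patent):

```python
import numpy as np

def mosaic_cut_characters(block, char_boxes, bg_color=(255, 255, 255)):
    """Erase characters that are cut in the height direction of a text block crop.

    block      : H x W x 3 uint8 text block image (numpy array)
    char_boxes : iterable of (x0, y0, x1, y1) character boxes in crop coordinates;
                 a box with y0 < 0 or y1 > H extends beyond the crop, i.e. the
                 character is cut along the vertical axis
    bg_color   : background colour used to cover the erased characters
    """
    h, w = block.shape[:2]
    for x0, y0, x1, y1 in char_boxes:
        if y0 < 0 or y1 > h:                                   # cut in the height direction
            cx0, cx1 = int(max(x0, 0)), int(min(x1, w))
            cy0, cy1 = int(max(y0, 0)), int(min(y1, h))
            block[cy0:cy1, cx0:cx1] = bg_color                 # cover with the background colour
    return block
```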
In addition, in order to increase the anti-interference capability of the first sample training set, recognition interference items are added to each region of the randomly generated document image, where each interference item comprises at least one of a pure white background, Gaussian noise, crystals, salt-and-pepper noise and real environment information; the real environment information is a real environment image captured directly by a camera. Referring to fig. 4, some of the text block images have a pure white background, and Gaussian noise has been added to some of them; the blank areas in the block images of fig. 4 show the state after mosaic processing, i.e., the characters cut in the height direction were covered with the background color when the document image was cropped.
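As an illustrative sketch (not part of the patent text), two of the interference items described above, Gaussian noise and salt-and-pepper noise, could be added to a generated image as follows:

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0):
    """Add zero-mean Gaussian noise to a uint8 image."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_pepper_noise(img, amount=0.01):
    """Flip a small fraction of pixels to pure black (pepper) or pure white (salt)."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < amount / 2] = 0
    noisy[mask > 1 - amount / 2] = 255
    return noisy
```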
Step S20: and training an initial network model based on the first sample training set to obtain a word height detection model.
This step is to train the initial network model through the first sample training set constructed in step S10. The initial network model may specifically include four convolutional layers.
Specifically, before the initial network model is trained, the method further comprises preprocessing sample data used for training, namely performing data enhancement and rotation on the text block image, and adjusting the brightness, the contrast, the saturation, the hue and the like of the text block image. In addition to the above, a height loss function is constructed, which is:
HLoss = -log(H1 / H2)
wherein HLoss represents the height loss, H1 = min(D1, D3) + min(D2, D4) is the overlapping height, and H2 = D1 + D2 + D3 + D4 - H1 is the union height; D1 is the predicted distance between the pixel point and the top of the text block image, D2 is the predicted distance between the pixel point and the bottom of the text block image, D3 is the distance between the labeled pixel point and the top of the text block image, and D4 is the distance between the labeled pixel point and the bottom of the text block image. D3 and D4 are calculated from the size of the text block image and the labeled character height. It should be understood that the pixel point here refers to a pixel at the top or bottom of a character in the text block image: given the size of the text block image, the distance between the pixel at the top of the character and the top boundary of the text block image and the distance between the pixel at the bottom of the character and the bottom boundary of the text block image determine the character height of the corresponding character.
After the text block image is subjected to data enhancement and rotation, an angle loss function is further constructed, wherein the angle loss function is angleLoss = 1 - cos(θ1 - θ2); where angleLoss represents the angle loss, θ1 is the predicted rotation angle value of the text block image, and θ2 is the labeled rotation angle value of the text block image.
Furthermore, a text recognition loss function is constructed in the step, and the text recognition loss function is
DiceLoss = 1 - 2|X ∩ Y| / (|X| + |Y|)
wherein DiceLoss represents the text recognition loss, X represents the probability that the labeled pixel points are text, and Y represents the probability that the predicted pixel points are text.
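The exact closed forms of HLoss and DiceLoss are reproduced as equation images in the published text; the sketch below assumes the 1-D IoU-style form -log(H1/H2) for the height loss and the soft Dice form for the text recognition loss, which is one reading consistent with the definitions above (PyTorch is used purely for illustration):

```python
import torch

def height_loss(d1_pred, d2_pred, d3_gt, d4_gt, eps=1e-6):
    """Height loss HLoss; assumed -log(H1 / H2) with H1 the overlap and H2 the union."""
    h1 = torch.min(d1_pred, d3_gt) + torch.min(d2_pred, d4_gt)   # overlapping height H1
    h2 = d1_pred + d2_pred + d3_gt + d4_gt - h1                  # union height H2
    return (-torch.log((h1 + eps) / (h2 + eps))).mean()

def angle_loss(theta_pred, theta_gt):
    """Angle loss: angleLoss = 1 - cos(theta1 - theta2)."""
    return (1.0 - torch.cos(theta_pred - theta_gt)).mean()

def dice_loss(y_pred, y_gt, eps=1e-6):
    """Text recognition loss; assumed soft Dice form 1 - 2|X ∩ Y| / (|X| + |Y|)."""
    inter = (y_pred * y_gt).sum()
    return 1.0 - (2.0 * inter + eps) / (y_pred.sum() + y_gt.sum() + eps)
```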
Fig. 5 is a schematic diagram of a network structure of a word height detection model according to an embodiment of the present invention, and in the embodiment shown in fig. 5, an initial network model is trained based on a first sample training set to obtain the word height detection model, which specifically includes the following steps:
(1) Obtain the sample data and preprocess it: perform data enhancement, adjust the height, rotate the text block image, crop the rotated text block image to a uniform size that matches the network input, and then transform the text block image, i.e., change the brightness, contrast, saturation, hue and the like of the image (an illustrative preprocessing sketch is given after step (7) below).
(2) Construct the height loss function. The height loss relates the predicted region to the actually labeled region. The merged region refers to the common part of the detected height extents of the text block image, such as the common part of the vertical lines in the height direction in fig. 8; the labeled region refers to the actual text region labeled when the text block image was generated. Since the invention is only concerned with text height information, the loss function is mainly the height loss, and its calculation can be reduced to one dimension, which makes convergence faster.
(3) Perform feature extraction on the text block image. Specifically, feature maps at 1/4, 1/8, 1/16 and 1/32 of the text block image size are extracted as features of different scales and are denoted f1, f2, f3 and f4, respectively.
(4) Merge the features. The deepest feature map f4 is upsampled by a factor of 2 and concatenated with f3; the feature dimensionality is then reduced by a 1x1 convolution and local information is fused by a 3x3 convolution. The result is again upsampled by a factor of 2 and concatenated with f2, and then in turn with f1, and the merged feature is passed to the output layer (a sketch of steps (3) to (5) is given after step (7) below).
(5) The output layer comprises a score map giving the probability that each pixel belongs to a text region, plus three feature channels: two channels represent the distances from the pixel position to the top and bottom boundaries of the text block image, and the third represents the rotation angle of the text block image.
(6) Perform locally-aware NMS (non-maximum suppression). Neighbouring output text regions are merged when their overlap exceeds the corresponding threshold (and left unmerged when it does not), using the confidence scores as weights for the weighted merge, to obtain the merged text region set; standard NMS is then performed on the merged set of text regions (a sketch is given after step (7) below).
(7) Store the trained word height detection model.
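By way of illustration only, the preprocessing of step (1) could be sketched as follows; the rotation range, crop size and jitter strengths are assumptions, not values taken from the patent:

```python
from torchvision import transforms

INPUT_SIZE = 128  # assumption: matches the 128 x 128 text block crops mentioned above

preprocess = transforms.Compose([
    transforms.RandomRotation(degrees=10),            # rotate the text block image
    transforms.CenterCrop(INPUT_SIZE),                # cut to a uniform input size
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.1),  # change brightness/contrast/saturation/hue
    transforms.ToTensor(),
])
```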
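For steps (3) to (5), an EAST-style sketch with a VGG16 backbone (VGG16 is mentioned in the training description further below) could look like the following; the stage boundaries, channel widths and head layout are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class WordHeightDetector(nn.Module):
    """Sketch of the word height detection network: VGG16 features at 1/4, 1/8,
    1/16 and 1/32 scale are merged top-down; the head outputs a 1-channel score
    map plus 3 geometry channels (top distance, bottom distance, rotation angle)."""

    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features
        self.stage1 = feats[:16]    # -> 1/4 resolution, 256 channels (f1)
        self.stage2 = feats[16:23]  # -> 1/8 resolution, 512 channels (f2)
        self.stage3 = feats[23:30]  # -> 1/16 resolution, 512 channels (f3)
        self.stage4 = feats[30:]    # -> 1/32 resolution, 512 channels (f4)

        def merge_block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True),              # 1x1: reduce channels
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))  # 3x3: fuse local info

        self.merge3 = merge_block(512 + 512, 128)
        self.merge2 = merge_block(128 + 512, 64)
        self.merge1 = merge_block(64 + 256, 32)
        self.score_head = nn.Conv2d(32, 1, 1)  # probability that a pixel is text
        self.geo_head = nn.Conv2d(32, 3, 1)    # top distance, bottom distance, angle

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        h = self.merge3(torch.cat([up(f4), f3], dim=1))
        h = self.merge2(torch.cat([up(h), f2], dim=1))
        h = self.merge1(torch.cat([up(h), f1], dim=1))
        return torch.sigmoid(self.score_head(h)), self.geo_head(h)
```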
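And for step (6), a minimal locally-aware NMS sketch over axis-aligned boxes of the form (x0, y0, x1, y1, score), assuming the detections arrive in row-major pixel order; the thresholds are placeholders:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1, score) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def weighted_merge(a, b):
    """Merge two boxes, weighting their coordinates by the confidence scores."""
    sa, sb = a[4], b[4]
    coords = (a[:4] * sa + b[:4] * sb) / (sa + sb)
    return np.concatenate([coords, [sa + sb]])

def locally_aware_nms(boxes, merge_thresh=0.5, nms_thresh=0.3):
    """First pass: merge consecutive overlapping boxes; second pass: standard NMS."""
    merged, prev = [], None
    for box in np.asarray(boxes, dtype=float):
        if prev is not None and iou(prev, box) > merge_thresh:
            prev = weighted_merge(prev, box)       # combine when above the threshold
        else:
            if prev is not None:
                merged.append(prev)
            prev = box.copy()
    if prev is not None:
        merged.append(prev)

    keep = []                                      # standard NMS on the merged set
    for box in sorted(merged, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= nms_thresh for k in keep):
            keep.append(box)
    return np.array(keep)
```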
Step S30: and zooming the document image to be detected to different proportions and inputting the scaled document image to the character height detection model to obtain character height recognition results corresponding to the document image to be detected in different proportions.
In this step, the heights of the characters in the document image to be detected are identified using the character height detection model trained in step S20. Specifically, the document image to be detected can first be scaled to different ratios as required, and the scaled images are then input into the character height detection model in turn to obtain the character height information corresponding to the document image at each ratio. In one embodiment, the document image to be detected is scaled to four ratios, for example 0.5 times, 1 time, 1.5 times and 2 times; step S30 then yields the character height information corresponding to the document image to be detected at these four ratios. It should be understood that scaling the document image to four ratios is only an example, and the character height detection may be performed on more or fewer scaled versions of the document image as required.
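As an illustrative sketch (the helper name and the use of OpenCV for resizing are assumptions, not part of the patent), running the detector at the four example ratios could look like:

```python
import cv2

SCALES = [0.5, 1.0, 1.5, 2.0]

def detect_heights_at_scales(image, detect_heights):
    """Run the character height detector on the document image at every ratio.

    detect_heights : assumed callable mapping an image to a list of detected
                     character heights (in pixels of the scaled image)
    Returns {scale: heights detected on the image scaled by that factor}.
    """
    results = {}
    for s in SCALES:
        scaled = cv2.resize(image, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
        results[s] = detect_heights(scaled)
    return results
```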
Step S40: establishing a to-be-classified character height distribution map of the to-be-detected document images in each proportion based on the acquired character height recognition results corresponding to the to-be-detected document images in different proportions; the horizontal axis of the height distribution diagram of the characters to be classified represents the actual character height, and the vertical axis represents the ratio of the number of the characters with the corresponding character height to the total number of the characters.
In the step, a character height distribution map to be classified is constructed according to a character height recognition result output by the height detection model, and specifically, the number of characters with each character height corresponding to the document image to be detected in each proportion can be counted firstly; calculating the ratio of the number of characters with the height of each character corresponding to the document image to be detected under each proportion to the total number of the characters; and establishing a to-be-classified character height distribution map of the to-be-detected document images in each proportion by adopting a drawing tool based on the ratio.
In this step, the real character height is obtained by dividing the character height value output by the character height detection model by the corresponding scaling ratio. The drawing tool may be the plt toolkit. If the document image to be detected is zoomed by 0.5 times, 1 time, 1.5 times and 2 times, the to-be-classified character height distribution map constructed with the plt toolkit is as shown in fig. 7. The horizontal axis represents the actual character height: because the image recognized by the character height detection model has been zoomed at a certain ratio, the character heights output by the model are likewise scaled, and the actual character height is therefore calculated from the detected height and the zoom ratio of the corresponding document image to be detected. The vertical axis represents the ratio of the number of characters with the corresponding height to the total number of characters.
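A minimal sketch of building one such distribution map with the plt toolkit (matplotlib), assuming a list of detected heights for a given zoom ratio:

```python
from collections import Counter
import matplotlib.pyplot as plt

def build_height_distribution(heights_px, scale):
    """Plot the to-be-classified character height distribution for one zoom ratio.

    heights_px : character heights detected on the scaled image (pixels)
    scale      : the zoom ratio, so height / scale gives the actual character height
    """
    actual = [round(h / scale) for h in heights_px]   # undo the scaling
    counts = Counter(actual)
    total = sum(counts.values())
    xs = sorted(counts)                               # actual character heights
    ys = [counts[x] / total for x in xs]              # ratio of characters at each height
    plt.bar(xs, ys, label=f"scale {scale}")
    plt.xlabel("actual character height (px)")
    plt.ylabel("ratio of characters")
    plt.legend()
    return xs, ys
```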
Step S50: and inputting the character height distribution map to be classified into a classifier model, obtaining the optimal scaling of the document image to be classified, and determining the character height distribution condition of the document image to be classified based on the character height corresponding to the optimal scaling.
The step is to select the most suitable scaling of the character block image to be recognized from the established character height distribution map to be classified through a classifier model. Before this step, the estimation method may further include the steps of: acquiring a second sample training set, wherein sample data in the second sample training set comprises a character height distribution map and a corresponding optimal scaling; and training an initial classifier model based on the second sample training set to obtain the classifier model.
The second sample training set includes a plurality of sample data, each of which includes a character height distribution map and the corresponding optimal scaling ratio; the distribution maps can be labeled to obtain their optimal scaling ratios, and in one embodiment 159 character height distribution maps are labeled to form the second sample training set. In addition, the classifier model may be an SVM model; specifically, an SVM classifier can be constructed based on the sklearn.svm package.
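A training sketch using sklearn.svm is shown below; the representation of a distribution map as a fixed-length vector of per-height ratios and the RBF kernel are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

def train_scale_classifier(histograms, optimal_scales):
    """Train the SVM that picks the optimal scaling ratio from a height distribution map.

    histograms     : (n_samples, n_height_bins) array of character-height ratios
    optimal_scales : per-sample optimal ratio labels, e.g. 0.5 / 1.0 / 1.5 / 2.0
    """
    clf = SVC(kernel="rbf")
    clf.fit(np.asarray(histograms), np.asarray(optimal_scales))
    return clf

def predict_optimal_scale(clf, histogram):
    """Return the optimal ratio for one to-be-classified height distribution map."""
    return clf.predict(np.asarray(histogram, dtype=float).reshape(1, -1))[0]
```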
When the optimal scaling ratio is obtained from the to-be-classified character height distribution maps through the trained classifier model, the distribution maps obtained in step S40 are first input into the SVM classifier model, which outputs the optimal ratio among 0.5 times, 1 time, 1.5 times and 2 times. The optimal ratio is the ratio at which the height detection result is most accurate; that is, the font height corresponding to this ratio is best suited for text reading and OCR recognition.
The above method is described below with reference to a specific example, however, it should be noted that this specific example is only for better illustration of the present application and should not be construed as an undue limitation on the present application. As shown in FIG. 3, in another embodiment, the method for estimating the distribution of the heights of the characters in the document image includes steps S101 to S105.
Step S101: and constructing a text word height detection sample for training.
This step generates text block image sample data with different text contents, different backgrounds and different font sizes. Specifically, as shown in fig. 1, document images with different text sizes, text colors, text contents (including punctuation marks, Chinese and English) and different backgrounds are generated in batch; the document images are then cut to obtain text block images of a suitable size for training, and mosaic processing is performed on the characters that are cut within the text blocks.
Step S102: training the word height detection model.
In this step, a word height detection model is trained based on the sample data in step S101 and saved.
Specifically, a text block image is obtained, its height is adjusted through data enhancement, and the image is rotated and cropped. The brightness, contrast, saturation and hue of the text block image are changed, and the height loss function is constructed. Feature extraction is then performed on the text block image; for example, feature maps can be extracted at four levels based on the VGG16 model, with sizes of 1/32, 1/16, 1/8 and 1/4 of the input image. Feature merging follows: the feature map from the previous stage is fed to an unpooling layer to enlarge it and is merged with the feature map of the current layer; the number of channels and the amount of computation are reduced through a conv1 x 1, and local information is fused through a conv3 x 3, finally producing the output of the merging stage. The output layer contains a probability map (the probability that each pixel belongs to a text region) and the model output information (3 channels in total, of which 2 channels represent the distances from the pixel position to the top and bottom boundaries of the text block image and the other channel represents the rotation angle of the text block image). Locally-aware non-maximum suppression is then performed: the output text regions are weighted and merged, using the confidence scores as weights, whenever the corresponding threshold is exceeded, giving a merged text region set on which the standard non-maximum suppression operation is carried out. Finally, the trained word height detection model is saved.
Step S103: and scaling the image documents to different proportions and inputting the scaled image documents into the trained height detection model.
Inputting the document images to be detected in different proportions into a trained character height detection model, and outputting character height information corresponding to the character blocks to be detected in different proportions through the character height detection model.
Step S104: and acquiring a word height distribution diagram corresponding to the to-be-detected document images in each proportion.
In this step, a statistical tool is used to count, for each character height, the proportion of characters with that height in the total number of characters, and a drawing tool is then used to build the character height distribution map.
Step S105: and deducing a proper scaling ratio through an SVM classifier.
Specifically, a data set of document image height distributions is constructed, in which each sample is the character height distribution of a document image at a different ratio and each is labeled with the most appropriate scaling ratio; this data set is the second sample data set in the above embodiment. The SVM classifier is trained and the trained SVM classifier is saved. The character height distribution maps of the document image to be detected at different ratios are then input into the SVM classifier to obtain the optimal scaling ratio.
From the above embodiments it can be seen that the method for estimating the distribution of character heights in a document image first builds a character height detection data set with an automated tool and then uses the constructed data set to train the character height detection model. In use, the document image to be detected is input into the trained character height detection model at different scaling ratios, the page character height information at each ratio is analysed to obtain the character height distribution of the document image at that ratio, the distributions at the different ratios are compared, the most appropriate scaling ratio is selected according to a certain rule, and the detected character heights are finally divided by the corresponding scaling ratio to obtain the character height distribution of the page. The accuracy of character height detection can be improved on this basis.
The document image to be detected that is input to the method includes images converted from common document types such as Word, PDF and OFD, images of traditional paper carriers such as books and newspapers, and images obtained from web pages by means of screenshots; the method therefore has a wide application range. In addition, the method can be applied to text reading scenarios, including browsers on electronic devices such as personal computers and mobile phones and readers of various electronic documents: when a document is read, estimating the character height distribution of the page image makes it convenient to scale the page to a size more suitable for reading, avoiding fonts that are too large or too small. Besides, the method discards the width information of the text and only attends to the height information, converting a two-dimensional output into a one-dimensional output to simplify the network while still obtaining the character height, thereby improving the convergence of the network.
Correspondingly, the invention also discloses a system for estimating the distribution of character heights in a document image, which comprises a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method of any one of the above embodiments.
In addition, the invention also discloses a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any of the above embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments can be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed at the same time.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for estimating the distribution situation of the heights of characters in a document image is characterized by comprising the following steps:
acquiring a first sample training set, wherein sample data in the first sample training set comprises text block images and text heights;
training an initial network model based on the first sample training set to obtain a word height detection model;
zooming the document image to be detected to different proportions and inputting the document image to be detected to the character height detection model to obtain character height identification results corresponding to the document image to be detected in different proportions;
establishing a to-be-classified character height distribution map of the to-be-detected document images in each proportion based on the acquired character height recognition results corresponding to the to-be-detected document images in different proportions; the horizontal axis of the character height distribution diagram to be classified represents the actual character height, and the vertical axis represents the ratio of the number of characters with corresponding character heights to the total number of characters;
and inputting the height distribution map of the characters to be classified into a classifier model, obtaining the optimal scaling of the document image to be detected, and determining the height distribution condition of the characters of the document image to be detected based on the height of the characters corresponding to the optimal scaling.
2. The method of claim 1, wherein obtaining a first training set of samples comprises:
acquiring a document image, cutting the document image into a plurality of character block images, and marking the character height of each character block image;
and performing mosaic processing on the characters cut in the height direction in the character block images.
3. The method of claim 2, wherein obtaining a first training set of samples further comprises:
randomly generating a document image, and adding an identification interference item in each generated region of the document image, wherein the identification interference item comprises at least one of a pure white background, Gaussian noise, crystals, salt and pepper noise and real environment information; and/or
And performing data enhancement and rotation on the text block image, and adjusting the brightness, the contrast, the saturation and the hue of the text block image.
4. The method according to claim 3, wherein the method comprises:
constructing a height loss function, wherein the height loss function is as follows:
HLoss = -log(H1 / H2)
wherein HLoss represents the height loss, H1 = min(D1, D3) + min(D2, D4), H2 = D1 + D2 + D3 + D4 - H1, D1 is the predicted distance between the pixel point and the top of the text block image, D2 is the predicted distance between the pixel point and the bottom of the text block image, D3 is the distance between the labeled pixel point and the top of the text block image, and D4 is the distance between the labeled pixel point and the bottom of the text block image.
5. The method of claim 4, wherein when obtaining the first sample training set includes performing data enhancement and rotation on the text block image, the method further comprises: constructing an angle loss function, wherein the angle loss function is angleLoss = 1 - cos(θ1 - θ2); where angleLoss represents the angle loss, θ1 is the predicted rotation angle value of the text block image, and θ2 is the labeled rotation angle value of the text block image.
6. The method according to claim 1, wherein the text height distribution map is input to a classifier model, an optimal scaling of the document image to be detected is obtained, and a text height distribution of the document image to be detected is determined based on a text height corresponding to the optimal scaling, and before that, the method further comprises:
acquiring a second sample training set, wherein sample data in the second sample training set comprises a character height distribution map and a corresponding optimal scaling;
and training an initial classifier model based on the second sample training set to obtain the classifier model.
7. The method of estimating distribution of text heights in a document image according to claim 5, further comprising:
constructing a text recognition loss function of
DiceLoss = 1 - 2|X ∩ Y| / (|X| + |Y|)
wherein DiceLoss represents the text recognition loss, X represents the probability that the labeled pixel points are text, and Y represents the probability that the predicted pixel points are text.
8. The method for estimating the distribution situation of the text heights in the document image according to any one of claims 1 to 7, wherein the classifier model is an SVM classifier; and/or
Establishing a height distribution map of characters to be classified of the document images to be detected in each proportion based on the obtained character height recognition results corresponding to the document images to be detected in different proportions, wherein the height distribution map comprises the following steps:
counting the number of characters at each character height for the document image to be detected in each proportion;
calculating the ratio of the number of characters with the height of each character corresponding to the document image to be detected under each proportion to the total number of the characters;
and establishing a height distribution map of characters to be classified of the document images to be detected in each proportion by adopting a drawing tool based on the ratio.
9. A system for estimating the distribution of the height of a text in a document image, the system comprising a processor and a memory, wherein the memory has stored therein computer instructions, the processor being configured to execute the computer instructions stored in the memory, and wherein the system, when executed by the processor, implements the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202210507208.7A 2022-05-11 2022-05-11 Method and device for estimating distribution situation of character height in document image Pending CN115100663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210507208.7A CN115100663A (en) 2022-05-11 2022-05-11 Method and device for estimating distribution situation of character height in document image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210507208.7A CN115100663A (en) 2022-05-11 2022-05-11 Method and device for estimating distribution situation of character height in document image

Publications (1)

Publication Number Publication Date
CN115100663A true CN115100663A (en) 2022-09-23

Family

ID=83287597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210507208.7A Pending CN115100663A (en) 2022-05-11 2022-05-11 Method and device for estimating distribution situation of character height in document image

Country Status (1)

Country Link
CN (1) CN115100663A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115330777A (en) * 2022-10-13 2022-11-11 浙江华是科技股份有限公司 Ship detection method and system for training picture scaling size

Similar Documents

Publication Publication Date Title
JP3904840B2 (en) Ruled line extraction device for extracting ruled lines from multi-valued images
US8634644B2 (en) System and method for identifying pictures in documents
US6411733B1 (en) Method and apparatus for separating document image object types
CN111275139B (en) Handwritten content removal method, handwritten content removal device, and storage medium
KR101606469B1 (en) Method for image analysis, especially for mobile stations
US8412705B2 (en) Image processing apparatus, image processing method, and computer-readable storage medium
CN110555433A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113486828B (en) Image processing method, device, equipment and storage medium
CN110942004A (en) Handwriting recognition method and device based on neural network model and electronic equipment
WO2022134771A1 (en) Table processing method and apparatus, and electronic device and storage medium
US9558433B2 (en) Image processing apparatus generating partially erased image data and supplementary data supplementing partially erased image data
US9552527B1 (en) Apparatus, method, and computer-readable storage medium for determining a rotation angle of text
CN115100663A (en) Method and device for estimating distribution situation of character height in document image
CN113592720B (en) Image scaling processing method, device, equipment and storage medium
CN114565927A (en) Table identification method and device, electronic equipment and storage medium
CN112364863B (en) Character positioning method and system for license document
CN114005127A (en) Image optical character recognition method based on deep learning, storage device and server
US7130085B2 (en) Half-tone dot elimination method and system thereof
JP2010074342A (en) Image processing apparatus, image forming apparatus, and program
JP2010191724A (en) Image processor and control program
CN111814778A (en) Text line region positioning method, layout analysis method and character recognition method
CN116030472A (en) Text coordinate determining method and device
CN111611986B (en) Method and system for extracting and identifying focus text based on finger interaction
CN113076952A (en) Method and device for automatically identifying and enhancing text
JP6863753B2 (en) Devices, methods and computer-readable storage media that determine the angle of rotation of text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination