
CN106951401B - Document text recognition method and device

Info

Publication number: CN106951401B
Application number: CN201710150271.9A
Authority: CN
Inventors: 徐佳宏; 朱吕亮
Original assignee: Shenzhen Ipanel TV Inc
Current assignee: Shenzhen Ipanel TV Inc
Priority date: 2017-03-14
Filing date: 2017-03-14
Publication date: 2020-03-20
Anticipated expiration: 2037-03-14
Also published as: CN106951401A

Abstract

The application discloses a document text recognition method and device. The method comprises the following steps: determining all elements in the page to be identified; traversing all elements in the page to be identified, and tentatively discarding them one by one; determining an actual discarded element; discarding the actual discarded element; calculating the average density of the remaining elements; judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold; if so, taking the remaining elements as text area elements; if not, returning to the step of traversing all elements in the page to be identified and tentatively discarding them one by one. The invention exploits the fact that text elements lie close to one another and therefore have a high average density, whereas non-text elements lie farther from the text elements, so that text and non-text elements taken together have a low overall average density. The text elements are thus obtained by discarding the non-text elements, with high recognition accuracy.

Description

Document text recognition method and device

Technical Field

The present application relates to the field of document processing technologies, and in particular, to a method and an apparatus for identifying a text of a document.

Background

A document is generally paginated, and in addition to the body text each page can have a header area, a footer area, and document annotation areas on the left and right sides.

When a document is displayed on a device with different resolutions, the document needs to be subjected to typesetting conversion according to the resolution of the device, that is, the document is converted into the document with the corresponding resolution according to the resolution of the target display device. The existing document conversion mode is to obtain the content of the original document, and then to re-typeset according to the target resolution to generate a new document. The prior method does not consider the difference of document content types, so the problem of disordered typesetting of the text content and other contents can occur after the typesetting is carried out again.

Therefore, accurate identification of the text region of the document is crucial to the accuracy of document typesetting conversion, and a document text identification scheme is urgently needed in the prior art.

Disclosure of Invention

In view of this, the present invention provides a method and an apparatus for identifying a text region of a document, so as to accurately identify the text region of the document and provide a basis for document typesetting conversion.

In order to achieve the purpose, the invention provides the following technical scheme:

a document text recognition method comprises the following steps:

determining all elements in the page to be identified;

traversing all elements in the page to be identified, and trying to discard the elements one by one;

determining an actual discarded element;

discarding the actual discarded element;

calculating the average density of the remaining elements;

judging whether the average density of the residual elements is greater than or equal to a preset average density threshold value or not;

if so, taking the residual elements as text area elements;

if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one.

Preferably, the determining the actual discarded element specifically includes:

calculating the density gain of each element after being discarded;

comparing the magnitude of all of the density gains;

and taking the element corresponding to the maximum value of the density gain as an actual discarding element.

Preferably, the calculating the density gain of each discarded element specifically includes:

when one element is discarded, calculating the average density of all the remaining elements;

and subtracting the average density before discarding the element from the average density of all the remaining elements to obtain the density gain after discarding the element.

Preferably, before the discarding the actual discarded element, the method further comprises: and calculating the average density of all elements in the page to be identified as the initial average density.

Preferably, the determining all elements in the page to be recognized specifically includes:

determining the areas occupied by all elements in the page to be identified;

and carrying out black blocking processing on the areas occupied by all the elements, and determining all the elements in the page to be identified.

The invention also provides a document text recognition device, comprising:

the element determining unit is used for determining all elements in the page to be identified;

the actual discarded element determining unit is used for traversing all elements in the page to be identified, trying to discard the elements one by one and determining the actual discarded elements;

an actual discarded element discarding unit configured to discard the actual discarded element;

an average density calculation unit for calculating an average density of the remaining elements;

and a judging unit, used for judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold; if so, the remaining elements are text area elements, and if not, the process returns to the step of traversing all the elements in the page to be identified and tentatively discarding them one by one.

Preferably, the actual discarded element determining unit includes:

the single element discarding unit is used for traversing all elements in the page to be identified and trying to discard the elements one by one;

a density gain calculation unit for calculating a density gain after each element is discarded;

the comparison unit is used for comparing the magnitude of all the density gains;

and the actual discarded element determining subunit determines the element corresponding to the maximum density gain as the actual discarded element.

Preferably, the average density calculation unit is further configured to calculate an initial average density of all elements in the page to be identified.

According to the above technical scheme, the document text recognition method provided by the invention comprises the following steps: determining all elements in the page to be identified; traversing all elements in the page to be identified, and tentatively discarding them one by one; determining an actual discarded element; discarding the actual discarded element; calculating the average density of the remaining elements; judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold; if so, taking the remaining elements as text area elements; if not, returning to the step of traversing all elements in the page to be identified and tentatively discarding them one by one. The invention exploits the fact that text elements lie close to one another and therefore have a high average density, whereas non-text elements lie farther from the text elements, so that text and non-text elements taken together have a low overall average density. The text elements are thus obtained by discarding the non-text elements, with high recognition accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1a is a display interface of an original document;

FIG. 1b is a display page rearranged according to the prior art;

FIG. 2a is the expected re-typeset page after a correct document conversion;

FIG. 2b is the expected page after a correct document conversion in which only the body text is re-typeset;

FIG. 3 is a flowchart of a document text recognition method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram showing the effect of blackening the area occupied by the elements;

FIG. 5 is a flowchart of a method for determining actual discarded elements according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating a region coordinate representation according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of region merging according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a structure of a page to be identified according to an embodiment of the present invention;

FIG. 9 is a diagram illustrating a structure of the page to be identified after discarding element number 18 in FIG. 8;

FIG. 10 is a structural diagram of the page to be identified in FIG. 8 after discarding elements No. 17 and No. 18;

FIG. 11 is a graph of average density versus iteration number according to an embodiment of the present invention;

FIG. 12 is a graph of density gain versus iteration number according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of a document text recognition apparatus according to an embodiment of the present invention;

FIG. 14 is a schematic structural diagram of an actual discarded element determining unit according to an embodiment of the present invention.

Detailed Description

As described in the background section, problems with document conversion often arise in the prior art. For example, electronic documents are typically laid out at A4 paper size when viewed on a computer. However, when a document must be displayed on another display device, the A4 paper size is not appropriate. On a cell phone screen, A4 is far too large; on a TV screen, a portrait-orientation A4 page does not fit the landscape screen. In such cases, the document must be converted into a document at the corresponding resolution according to the characteristics of the target display device.

However, in the prior art, re-typesetting can leave the body text mixed up with other content. Refer to FIG. 1a and FIG. 1b, where FIG. 1a shows the display interface of an original document and FIG. 1b shows the page displayed after re-typesetting according to the prior art. Comparison shows that, after re-typesetting, the body text is mixed up with the header and footer content. The header and footer content is wrongly re-typeset as body text, which is clearly not the intended result of document conversion; header and footer content appearing abruptly in the middle of the body text seriously disrupts reading.

Refer to FIG. 2a and FIG. 2b. FIG. 2a shows the expected page after a correct document conversion and re-typesetting, and FIG. 2b shows the expected page after a correct document conversion in which only the body text is re-typeset. FIG. 2a represents the ideal result: the conversion program automatically recognizes the header and footer content and the body content of the document, re-typesets only the body content, and still presents the header and footer as a header and footer at the converted resolution. FIG. 2b represents a compromise: the header and footer content is hidden, and only the re-typeset body text region is displayed.

The inventor finds that whether the document is the page shown in fig. 2a or the page shown in fig. 2b after the typesetting is completed, the text content needs to be accurately identified, so that the text region can be typeset again without being influenced by the header and footer content and other content, and the normally displayed text content typesetting page is obtained.

Based on this, the invention provides a method and a device for identifying a text area of a document, wherein the identification method comprises the following steps:

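determining all elements in the page to be identified;

traversing all elements in the page to be identified, and tentatively discarding them one by one;

determining an actual discarded element;

discarding the actual discarded element;

calculating the average density of the remaining elements;

judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold;

if so, taking the remaining elements as text area elements;

if not, returning to the step of traversing all elements in the page to be identified and tentatively discarding them one by one.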

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 3, fig. 3 is a flowchart of a document text recognition method according to an embodiment of the present invention.

As shown in fig. 3, the method includes:

step S101: determining all elements in the page to be identified;

Specifically, the documents referred to in the present invention include, but are not limited to, documents in Word, PDF, WPS, or HTML format. A document in any format has a corresponding editing or display program and is ultimately presented to a user for viewing. In the present application, the content of a document is converted for display into regions and their content, i.e., elements. In this embodiment, a document is composed of pages and a page is composed of elements; each page contains elements of various types, such as body text, headers, footers, annotations, and footnotes.

The text recognition method provided by the embodiment of the invention is general and does not depend on the specific content of the document: for this method, the content of an element is unimportant, and what matters is the relative position of the element areas. Therefore, in the embodiment of the invention, the areas occupied by all elements in the page to be identified are determined first; the areas occupied by all elements are then rendered as black blocks, thereby determining all elements in the page to be identified. Referring to FIG. 4, the areas occupied by elements in the document page are black and the areas without content are white, which determines all elements in the page to be identified.
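As a rough illustration of the black-block step, the sketch below paints each element's area onto a binary grid (1 for black, 0 for white). The function name `blacken`, the integer pixel grid, and the sample coordinates are assumptions made for illustration; the source does not specify how the rendering is implemented.

```python
from typing import List, Tuple

# Hypothetical element region: (x, y, w, h) in whole pixels, origin at top-left.
Region = Tuple[int, int, int, int]


def blacken(page_w: int, page_h: int, regions: List[Region]) -> List[List[int]]:
    """Return a page_h x page_w grid with 1 where an element lies, 0 elsewhere."""
    grid = [[0] * page_w for _ in range(page_h)]
    for x, y, w, h in regions:
        # Clip each rectangle to the page so out-of-range parts are ignored.
        for row in range(max(y, 0), min(y + h, page_h)):
            for col in range(max(x, 0), min(x + w, page_w)):
                grid[row][col] = 1
    return grid


# Made-up page: one thin header-like block and one taller body-like block.
grid = blacken(8, 6, [(1, 0, 6, 1), (1, 2, 6, 3)])
for row in grid:
    print("".join("#" if cell else "." for cell in row))
```

Only the positions and sizes of the blocks survive this step, which matches the statement that element content is irrelevant to the method.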

Step S102: traversing all elements in the page to be identified, and trying to discard the elements one by one;

It should be noted that, in this embodiment, "all elements in the page to be identified" does not refer only to the complete original set of elements; once one or more elements have been discarded, it refers to all of the remaining elements.

The tentative discarding proceeds as follows. The order in which elements are tentatively discarded is not limited in this embodiment. When there are n elements, they are tentatively discarded one by one: one element is tentatively discarded and the subsequent calculation is performed; the state before that element was discarded is then restored, another element is tentatively discarded, and the subsequent calculation is performed again, for a total of n tentative discards.

Step S103: determining an actual discarded element;

optionally, in the embodiment of the present invention, a specific process of determining an actual discarded element, as shown in fig. 5, includes:

step S31: calculating the density gain of each element after being discarded;

step S32: comparing the magnitude of all of the density gains;

step S33: and taking the element corresponding to the maximum value of the density gain as an actual discarding element.

The density gain in this embodiment refers to the increase in average density. Specifically, calculating the density gain after each element is discarded includes: when one element is discarded, calculating the average density of all the remaining elements; and subtracting the average density before discarding the element from the average density of all the remaining elements to obtain the density gain after discarding the element. It should be noted that, to calculate the density gain, this embodiment first calculates the average density of all elements in the page to be identified, which is used as the initial average density, that is, $D_{0,0}$.

Referring to fig. 6 and 7, fig. 6 shows the definition of the region, in this embodiment, the outer frame is the current page of the document, the inner frame is the region, the upper left corner of the current page of the document is used as the origin, and the representation forms of the region include two types:

the first expression: (x, y, w, h), namely, the coordinates at the upper left corner and the width and height;

the second expression: (x1, y1, x2, y2), i.e. in upper left and lower right coordinates;

it should be noted that the two expressions are equivalent, and the formula for converting from the first expression to the second expression is as follows:

x1 = x;

y1 = y;

x2 = x + w;

y2 = y + h.
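The conversion between the two expressions can be sketched as follows. The function names are hypothetical, and the inverse conversion is not stated in the source; it is simply derived by solving the four formulas above for w and h.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def xywh_to_corners(x: float, y: float, w: float, h: float) -> Box:
    """First expression (x, y, w, h) -> second expression (x1, y1, x2, y2)."""
    return (x, y, x + w, y + h)


def corners_to_xywh(x1: float, y1: float, x2: float, y2: float) -> Box:
    """Second expression -> first expression (derived inverse, an assumption)."""
    return (x1, y1, x2 - x1, y2 - y1)


corners = xywh_to_corners(10, 20, 30, 40)
print(corners)                    # (10, 20, 40, 60)
print(corners_to_xywh(*corners))  # (10, 20, 30, 40)
```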

In this embodiment, the density of a single element is defined as 1, ignoring differences in the content of the element itself. Ignoring content differences means that all elements are treated alike regardless of what they contain (for example, a short passage of text), and each is assigned a density of 1; this simplifies processing and increases processing speed. As shown in FIG. 7, assume in this embodiment that the regions where two elements are located are A and B. The density of each of regions A and B is 1, and the mass of each is its area w × h, i.e., A.w × A.h and B.w × B.h.

When the two regions are merged, the merged region is region C, and the average density is given by:

$$D = \frac{\text{total mass}}{\text{merged area}} = \frac{A.w \times A.h + B.w \times B.h}{C.w \times C.h}$$
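A minimal sketch of this formula follows. It assumes that the merged region C is the smallest axis-aligned rectangle enclosing A and B and that A and B do not overlap; FIG. 7 is not reproduced here, so both points are assumptions, as are the names `Region`, `merge`, and `merged_density` and the sample coordinates.

```python
from typing import NamedTuple


class Region(NamedTuple):
    x: float
    y: float
    w: float
    h: float


def merge(a: Region, b: Region) -> Region:
    """Smallest rectangle enclosing both regions (assumed meaning of region C)."""
    x1, y1 = min(a.x, b.x), min(a.y, b.y)
    x2 = max(a.x + a.w, b.x + b.w)
    y2 = max(a.y + a.h, b.y + b.h)
    return Region(x1, y1, x2 - x1, y2 - y1)


def merged_density(a: Region, b: Region) -> float:
    """D = (A.w*A.h + B.w*B.h) / (C.w*C.h), each element having density 1."""
    c = merge(a, b)
    return (a.w * a.h + b.w * b.h) / (c.w * c.h)


A = Region(0, 0, 100, 20)
B_near = Region(0, 30, 100, 20)   # small gap below A
B_far = Region(0, 180, 100, 20)   # large gap below A

print(merged_density(A, B_near))  # close spacing gives a density near 1
print(merged_density(A, B_far))   # wide spacing gives a much lower density
```

With these made-up regions, the closely spaced pair has a density of 0.8 and the widely spaced pair 0.2, which is the contrast the method relies on.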

In this embodiment, one element is tentatively discarded and one density gain is calculated; the state before the discard is then restored, another element is tentatively discarded, and the corresponding density gain is calculated, and so on for every element.

And comparing all the density gains, and selecting the element corresponding to the maximum value of the density gains as an actual discarded element to be discarded in the subsequent steps.
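Steps S31 to S33 can be sketched as below. The sketch assumes that the average density of a set of elements is the sum of their areas divided by the area of their enclosing rectangle, generalizing the two-region formula; the names `average_density` and `pick_discard` and the sample page are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[float, float, float, float]  # (x, y, w, h)


def average_density(regions: List[Region]) -> float:
    """Total element area divided by the area of the enclosing rectangle."""
    x1 = min(x for x, _, _, _ in regions)
    y1 = min(y for _, y, _, _ in regions)
    x2 = max(x + w for x, _, w, _ in regions)
    y2 = max(y + h for _, y, _, h in regions)
    mass = sum(w * h for _, _, w, h in regions)
    return mass / ((x2 - x1) * (y2 - y1))


def pick_discard(regions: List[Region]) -> Tuple[int, float]:
    """Tentatively discard each element; return (index, gain) of the best one.

    Requires at least two elements, since one must remain after a discard.
    """
    base = average_density(regions)
    gains = []
    for i in range(len(regions)):
        rest = regions[:i] + regions[i + 1:]       # state with element i removed
        gains.append(average_density(rest) - base)  # density gain for element i
    best = max(range(len(gains)), key=gains.__getitem__)
    return best, gains[best]


# Made-up page: a header far above three closely stacked body lines.
page = [(0, 0, 100, 10), (0, 100, 100, 20), (0, 120, 100, 20), (0, 140, 100, 20)]
index, gain = pick_discard(page)
print(index, gain)
```

Slicing builds a fresh list for each trial, so the original list is never modified and no explicit restore step is needed.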

Step S104: discarding the actual discarded element;

the actual discarded element in this step is the actual discarded element determined in step S103 described above.

Step S105: calculating the average density of the remaining elements;

It should be noted that, referring to FIG. 8 and FIG. 9, FIG. 8 shows a page to be identified that includes 18 elements, and FIG. 9 shows the page after element No. 18 is discarded. As can be seen from the figures, the total mass is reduced by the mass of element No. 18, while the total area is reduced by the area of element No. 18 plus the area of the border region adjoining it, so the average density of the remaining elements increases. When element No. 2 is discarded instead, the total mass is reduced by the mass of element No. 2 but the total area is not reduced, so the average density of the remaining elements decreases. In other words, discarding any element changes the average density of the remaining elements relative to the average density before discarding.

Step S106: judging whether the average density of the residual elements is greater than or equal to a preset average density threshold value or not;

if yes, the process proceeds to step S107: taking the residual elements as text area elements;

In this embodiment, because the distance between text elements is generally smaller than the distance between text elements and non-text elements, an average density threshold is preset for each page to be identified; when the average density of all the remaining elements is greater than or equal to this threshold, the remaining elements are considered to be text elements. The preset average density threshold is a value close to 1, which characterizes elements that are closely spaced and densely arranged.

In this embodiment, if the average density of the remaining elements is still smaller than the preset average density threshold after one actual discarded element has been discarded, non-text elements remain among the remaining elements. The next actual discarded element is then determined and discarded, and the average density of the remaining elements is calculated again; this repeats until the average density of the remaining elements is greater than or equal to the preset average density threshold, at which point the remaining elements are text elements.
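The loop formed by steps S102 to S107 might be sketched as follows. As in the flow of FIG. 3, an element is discarded first and the threshold is checked afterwards. The enclosing-rectangle density, the stop when a single element remains, the threshold value 0.95, and all names and coordinates are assumptions made for illustration.

```python
from typing import List, Tuple

Region = Tuple[float, float, float, float]  # (x, y, w, h)


def average_density(regions: List[Region]) -> float:
    """Total element area divided by the area of the enclosing rectangle."""
    x1 = min(x for x, _, _, _ in regions)
    y1 = min(y for _, y, _, _ in regions)
    x2 = max(x + w for x, _, w, _ in regions)
    y2 = max(y + h for _, y, _, h in regions)
    return sum(w * h for _, _, w, h in regions) / ((x2 - x1) * (y2 - y1))


def recognize_text_elements(regions: List[Region], threshold: float) -> List[Region]:
    """Discard the max-gain element repeatedly until the density reaches threshold."""
    remaining = list(regions)
    while len(remaining) > 1:  # assumed guard: never discard the last element
        base = average_density(remaining)
        gains = [
            average_density(remaining[:i] + remaining[i + 1:]) - base
            for i in range(len(remaining))
        ]
        drop = max(range(len(gains)), key=gains.__getitem__)
        del remaining[drop]                            # step S104
        if average_density(remaining) >= threshold:    # steps S105 and S106
            break
    return remaining


# Made-up page: header, three body lines, footer.
page = [
    (0, 0, 100, 10),
    (0, 100, 100, 20), (0, 120, 100, 20), (0, 140, 100, 20),
    (0, 280, 100, 10),
]
text = recognize_text_elements(page, threshold=0.95)
print(text)
```

On this page the footer is discarded first and the header second, after which the three body lines reach a density of 1 and are returned.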

FIG. 10 is a schematic diagram of all elements of the page to be identified after element No. 18 and element No. 17 have been discarded. Element No. 17 is an actual discarded element; it is determined in the same way as in step S103, which is not described again here. This embodiment is described using the schematic diagram only as an example and does not limit the actual situation.

The document text recognition method provided in this embodiment includes: determining all elements in the page to be identified; traversing all elements in the page to be identified, and tentatively discarding them one by one; determining an actual discarded element; discarding the actual discarded element; calculating the average density of the remaining elements; judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold; if so, taking the remaining elements as text area elements; if not, returning to the step of traversing all elements in the page to be identified and tentatively discarding them one by one. The method exploits the fact that text elements lie close to one another and therefore have a high average density, whereas non-text elements lie farther from the text elements, so that text and non-text elements taken together have a low overall average density. The text elements are thus obtained by discarding the non-text elements, with high recognition accuracy.

In addition, in the embodiment of the invention, in the process of trying to discard one by one and determining the actual discarded elements, a plurality of iteration processes can be carried out to obtain the data sequence of the actual discarded elements and the average density before discarding the elements each time.

Assuming that the page to be identified includes n elements, n iterations are required, and each iteration will obtain an actual discarded element, and the specific process is as follows:

traversing n elements in the page to be identified, performing abandoning attempt one by one, calculating the density gain of each abandoned element, and if the average density is reduced due to abandoning the element, recording the density gain as negative; if the average density increases as a result of discarding the element, the density gain is recorded as positive.

In the 1st iteration there are n elements; they are tentatively discarded one by one to obtain n density gains, and the element with the maximum density gain is selected as the actual discarded element $R_1$;

In the 2nd iteration there are n-1 elements; they are tentatively discarded one by one to obtain n-1 density gains, and the element with the maximum density gain is selected as the actual discarded element $R_2$;

...

In the (n-1)th iteration there are 2 elements; each is tentatively discarded in turn to obtain two density gains, and the element with the larger density gain is selected as the actual discarded element $R_{n-1}$;

In the nth iteration there is 1 element, which is the last actual discarded element, $R_n$.

From the above, the following rule is obtained: in the k-th iteration there are n-k+1 elements, and n-k+1 traversals are required. Tentatively discarding each of the n-k+1 elements in turn yields a density gain, denoted $\Delta D_{k,j}$, where k is the iteration number and j is the index of the tentatively discarded element. From the n-k+1 values $\Delta D_{k,j}$, the element m with the maximum value is selected as the actual discarded element, denoted $R_k$, and the average density of all elements before discarding is calculated and denoted $D_k$.

After all iterations are completed n times, two data sequences are obtained:

$R_1, R_2, \ldots, R_n$, that is, the actual discarded element of each iteration;

$D_1, D_2, \ldots, D_n$, that is, the average density before the element is discarded in each iteration.
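The n iterations and the two resulting sequences can be sketched as follows. The enclosing-rectangle density, the function names, and the sample page are assumptions; ties in density gain are broken by taking the first candidate, a choice the source does not address.

```python
from typing import List, Tuple

Region = Tuple[float, float, float, float]  # (x, y, w, h)


def average_density(regions: List[Region]) -> float:
    """Total element area divided by the area of the enclosing rectangle."""
    x1 = min(x for x, _, _, _ in regions)
    y1 = min(y for _, y, _, _ in regions)
    x2 = max(x + w for x, _, w, _ in regions)
    y2 = max(y + h for _, y, _, h in regions)
    return sum(w * h for _, _, w, h in regions) / ((x2 - x1) * (y2 - y1))


def discard_sequences(regions: List[Region]) -> Tuple[List[Region], List[float]]:
    """Return (R_1..R_n, D_1..D_n) for a page of n elements."""
    remaining = list(regions)
    discarded: List[Region] = []
    densities: List[float] = []
    while remaining:
        base = average_density(remaining)
        densities.append(base)                 # D_k: density before the discard
        if len(remaining) == 1:
            discarded.append(remaining.pop())  # nth iteration: last element
            break
        gains = [
            average_density(remaining[:i] + remaining[i + 1:]) - base
            for i in range(len(remaining))
        ]
        drop = max(range(len(gains)), key=gains.__getitem__)
        discarded.append(remaining.pop(drop))  # R_k
    return discarded, densities


# Made-up page: header, three body lines, footer.
page = [
    (0, 0, 100, 10),
    (0, 100, 100, 20), (0, 120, 100, 20), (0, 140, 100, 20),
    (0, 280, 100, 10),
]
R, D = discard_sequences(page)
print(R)
print(D)
```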

Take the 18 elements shown in FIG. 8 as an example. From the average density data sequence obtained after 18 iterations, the average density is plotted against the iteration number, as shown in FIG. 11, with the iteration number on the abscissa and the average density on the ordinate. Analysis of the graph shows that the average density jumps between the third and fourth iterations, so the elements discarded in the first three iterations are non-text elements. That is, by analyzing the jump point of the average density in the graph, the boundary between text elements and non-text elements can be found. This makes full use of two characteristics: text elements are close together and have a high average density, whereas text elements and non-text elements are farther apart and together have a low average density.

In this embodiment, a data sequence of density gain values may also be calculated from the average density values, and a graph of density gain against iteration number may be drawn, as shown in FIG. 12, where the abscissa is the iteration number and the ordinate is the density gain $\Delta D$. The density gain is $\Delta D_k = D_k - D_{k-1}$, with $\Delta D_1 = 0$. The smaller the value of $\Delta D$ in FIG. 12, the closer the element deleted in that iteration is to the element deleted in the previous iteration. Analysis of FIG. 12 shows that the jump point of the density gain occurs at $\Delta D_4$, indicating that the element deleted in the fourth iteration is far from the element deleted in the third iteration.

It should be noted that, within the text area, as the iterations proceed, more elements have been discarded and fewer remain, so discarding a single element has a growing influence on the average density; the $\Delta D$ curve for the text area therefore oscillates up and down with increasing amplitude, as shown for iterations 4 to 17 in FIG. 12.

Combining the two analyses, the conclusion drawn from FIG. 12 is the same as that drawn from FIG. 11: the elements deleted in the first three iterations do not belong to the text region, and the elements remaining after the first three iterations form the text region.
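The jump-point analysis could be automated along the following lines. The source reads the jump point from the plotted curves and gives no selection rule, so taking the largest $\Delta D_k$ as the jump is purely an assumption here; it may be unreliable on pages where the late oscillations in the text area grow large. The density values and element labels are invented for illustration and are not the data of FIG. 11 or FIG. 12.

```python
from typing import List, Sequence, Tuple


def density_gains(densities: Sequence[float]) -> List[float]:
    """Delta D_k = D_k - D_(k-1), with Delta D_1 defined as 0."""
    gains = [0.0]
    for k in range(1, len(densities)):
        gains.append(densities[k] - densities[k - 1])
    return gains


def split_at_jump(
    discarded: Sequence[str], densities: Sequence[float]
) -> Tuple[List[str], List[str]]:
    """Split the discard order into (non-text, text) at the largest gain.

    Assumption: the jump point is the iteration with the maximum Delta D.
    """
    gains = density_gains(densities)
    jump = max(range(len(gains)), key=gains.__getitem__)  # 0-based iteration index
    # Elements discarded before the jump iteration are treated as non-text.
    return list(discarded[:jump]), list(discarded[jump:])


# Invented sequences: three outlying elements, then three body elements.
order = ["n1", "n2", "n3", "t1", "t2", "t3"]
D = [0.20, 0.25, 0.31, 0.90, 0.92, 0.88]
non_text, text = split_at_jump(order, D)
print(non_text, text)
```

Here the largest gain falls at the fourth iteration, so the first three discarded elements are classified as non-text, mirroring the conclusion drawn from the figures.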

The following describes the document text recognition apparatus provided in the embodiment of the present invention, and the document text recognition apparatus described below and the document text recognition method described above may be referred to in correspondence with each other.

The embodiment of the invention also provides a document text recognition device, and referring to fig. 13, fig. 13 is a schematic structural diagram of the document text recognition device provided by the embodiment of the invention.

As shown in FIG. 13, the document text recognition apparatus includes:

an element determination unit 11 configured to determine all elements in the page to be recognized;

an actual discarded element determining unit 12, configured to traverse all elements in the page to be identified, and try to discard one by one to determine an actual discarded element;

an actual discarded element discarding unit 13 configured to discard the actual discarded element;

an average density calculation unit 14 for calculating an average density of the remaining elements;

and a judging unit 15, configured to judge whether the average density of the remaining elements is greater than or equal to a preset average density threshold; if so, the remaining elements are text area elements, and if not, the process returns to the step of traversing all the elements in the page to be identified and tentatively discarding them one by one.

Alternatively, as shown in fig. 14, the actual discarded element determining unit 12 includes:

a single element discarding unit 121, configured to traverse all elements in the page to be identified, and attempt to discard one by one;

a density gain calculation unit 122 for calculating a density gain after each element is discarded;

a comparing unit 123 for comparing the magnitude of all the density gains;

the actual discarded element determining subunit 124 determines the element corresponding to the maximum value of the density gain as the actual discarded element.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A document text recognition method is characterized by comprising the following steps:

determining all elements in the page to be identified;

traversing all elements in the page to be identified, and trying to discard the elements one by one; the discarding process comprises: randomly discarding one element and calculating the average density of the remaining elements;

determining actual discarded elements after all elements are discarded once;

discarding the actual discarded element;

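calculating the average density of the remaining elements;

judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold;

if so, taking the remaining elements as text area elements;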

if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one;

the determining all elements in the page to be identified specifically includes:

determining the areas occupied by all elements in the page to be identified;

2. The document text recognition method according to claim 1, wherein the determining actual discarded elements specifically comprises:

calculating the density gain of each element after being discarded;

comparing the magnitude of all of the density gains;

3. The document text recognition method according to claim 2, wherein the calculating the density gain after each element is discarded specifically comprises:

4. The document text recognition method according to claim 1, further comprising, before said discarding the actual discarded element: and calculating the average density of all elements in the page to be identified as the initial average density.

5. A document text recognition apparatus, comprising:

the actual discarded element determining unit is used for traversing all elements in the page to be identified, trying to discard the elements one by one and determining the actual discarded elements; the discarding process comprises: randomly discarding one element and calculating the average density of the remaining elements;

a real discarded element discarding unit, configured to discard the real discarded element after all elements have been discarded once;

a judging unit, configured to judge whether the average density of the remaining elements is greater than or equal to a preset average density threshold, if so, the remaining elements are text area elements, and if not, the step of traversing all the elements in the page to be identified and trying to discard the elements one by one is returned;

determining the areas occupied by all elements in the page to be identified;

6. The document text recognition apparatus according to claim 5, wherein the actual discarded element determination unit includes:

7. The document text recognition apparatus according to claim 5, wherein the average density calculation unit is further configured to calculate an initial average density of all elements in the page to be recognized.