CN106951401B - Document text recognition method and device - Google Patents

Info

Publication number
CN106951401B
Authority
CN
China
Prior art keywords
elements
page
average density
discarded
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710150271.9A
Other languages
Chinese (zh)
Other versions
CN106951401A (en)
Inventor
徐佳宏
朱吕亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ipanel TV Inc
Original Assignee
Shenzhen Ipanel TV Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Input (AREA)

Abstract

The application discloses a document text recognition method and a document text recognition device, wherein the method comprises the following steps: determining all elements in the page to be identified; traversing all elements in the page to be identified, and trying to discard the elements one by one; determining an actual discarded element; discarding the actual discarded element; calculating the average density of the remaining elements; judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold value; if so, taking the remaining elements as text area elements; if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one. The invention exploits the fact that text elements lie close together, so their average density is high, whereas non-text elements lie farther from the text elements, so the overall average density of text and non-text elements together is low. The text elements are therefore obtained by discarding the non-text elements, which yields high recognition accuracy for the text elements.

Description

Document text recognition method and device
Technical Field
The present application relates to the field of document processing technologies, and in particular, to a method and an apparatus for identifying a text of a document.
Background
A document is generally paginated, and in addition to the body text each page can have a header area, a footer area, and document annotation areas on the left and right sides.
When a document is displayed on a device with different resolutions, the document needs to be subjected to typesetting conversion according to the resolution of the device, that is, the document is converted into the document with the corresponding resolution according to the resolution of the target display device. The existing document conversion mode is to obtain the content of the original document, and then to re-typeset according to the target resolution to generate a new document. The prior method does not consider the difference of document content types, so the problem of disordered typesetting of the text content and other contents can occur after the typesetting is carried out again.
Therefore, accurate identification of the text region of the document is crucial to the accuracy of document typesetting conversion, and a document text identification scheme is urgently needed in the prior art.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for identifying a text region of a document, so as to accurately identify the text region of the document and provide a basis for document typesetting conversion.
In order to achieve the purpose, the invention provides the following technical scheme:
a document text recognition method comprises the following steps:
determining all elements in the page to be identified;
traversing all elements in the page to be identified, and trying to discard the elements one by one;
determining an actual discarded element;
discarding the actual discarded element;
calculating the average density of the remaining elements;
judging whether the average density of the residual elements is greater than or equal to a preset average density threshold value or not;
if so, taking the residual elements as text area elements;
if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one.
Preferably, the determining the actual discarded element specifically includes:
calculating the density gain of each element after being discarded;
comparing the magnitude of all of the density gains;
and taking the element corresponding to the maximum value of the density gain as an actual discarding element.
Preferably, the calculating the density gain of each discarded element specifically includes:
when one element is discarded, calculating the average density of all the remaining elements;
and subtracting the average density before discarding the element from the average density of all the remaining elements to obtain the density gain after discarding the element.
Preferably, before the discarding the actual discarded element, the method further comprises: and calculating the average density of all elements in the page to be identified as the initial average density.
Preferably, the determining all elements in the page to be recognized specifically includes:
determining the areas occupied by all elements in the page to be identified;
and carrying out black blocking processing on the areas occupied by all the elements, and determining all the elements in the page to be identified.
The invention also provides a document text recognition device, comprising:
the element determining unit is used for determining all elements in the page to be identified;
the actual discarded element determining unit is used for traversing all elements in the page to be identified, trying to discard the elements one by one and determining the actual discarded elements;
an actual discarded element discarding unit configured to discard the actual discarded element;
an average density calculation unit for calculating an average density of the remaining elements;
and the judging unit is used for judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold value; if so, the remaining elements are taken as text area elements, and if not, the process returns to the step of traversing all elements in the page to be identified and trying to discard the elements one by one.
Preferably, the actual discarded element determining unit includes:
the single element discarding unit is used for traversing all elements in the page to be identified and trying to discard the elements one by one;
a density gain calculation unit for calculating a density gain after each element is discarded;
the comparison unit is used for comparing the magnitude of all the density gains;
and the actual discarded element determining subunit determines the element corresponding to the maximum density gain as the actual discarded element.
Preferably, the average density calculation unit is further configured to calculate an initial average density of all elements in the page to be identified.
According to the technical scheme, the document text identification method provided by the invention comprises the following steps: determining all elements in the page to be identified; traversing all elements in the page to be identified, and trying to discard the elements one by one; determining an actual discarded element; discarding the actual discarded element; calculating the average density of the remaining elements; judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold value; if so, taking the remaining elements as text area elements; if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one. The invention exploits the fact that text elements lie close together, so their average density is high, whereas non-text elements lie farther from the text elements, so the overall average density of text and non-text elements together is low. The text elements are therefore obtained by discarding the non-text elements, which yields high recognition accuracy for the text elements.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1a is a display interface of an original document;
FIG. 1b is a display page rearranged according to the prior art;
FIG. 2a is a diagram of a rearranged prospective page after a normal document conversion;
FIG. 2b is an expected page of only text re-typeset after normal document conversion;
FIG. 3 is a flowchart of a document text recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the effect of blackening the area occupied by the elements;
FIG. 5 is a flowchart of a method for determining actual discarded elements according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a region coordinate representation according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of region merging according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a structure of a page to be identified according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating a structure of the page to be identified after discarding element number 18 in FIG. 8;
FIG. 10 is a structural diagram of the page to be identified in FIG. 8 after discarding elements No. 17 and No. 18;
FIG. 11 is a graph of average density versus iteration number according to an embodiment of the present invention;
FIG. 12 is a graph of density gain versus iteration number according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a document text recognition apparatus according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of an actual discarded element determining unit according to an embodiment of the present invention.
Detailed Description
As described in the background section, problems with document conversion often arise in the prior art. For example, electronic documents are typically laid out at A4 paper size when viewed on a computer. However, when a document needs to be displayed on another display device, the A4 paper size is not appropriate. On a cell phone screen, A4 is too large; on a TV screen, a portrait A4 page does not fit the landscape screen. In this case, it is necessary to convert the document into a document at the corresponding resolution according to the characteristics of the target display device.
However, in the prior art the text content and other content may become jumbled after re-typesetting. Specifically, refer to FIG. 1a and FIG. 1b, where FIG. 1a illustrates an original document display interface, and FIG. 1b illustrates a display page after re-typesetting according to the prior art. As the comparison shows, after re-typesetting the text content is mixed up with the header and footer content. The header and footer content is wrongly re-typeset as text content, which is obviously not the result expected of document conversion in practice, and header and footer content appearing abruptly in the middle of the text content seriously disrupts reading.
Referring to FIG. 2a and FIG. 2b, FIG. 2a illustrates the expected page after normal document conversion and re-typesetting, and FIG. 2b illustrates the expected page after normal document conversion in which only the text is re-typeset. FIG. 2a shows the optimal state after document conversion: the conversion program automatically recognizes the header and footer content and the body content of the document, only the body content is re-typeset, and the header and footer still appear as header and footer at the converted resolution. FIG. 2b shows a compromise display effect that directly masks the header and footer content and displays only the re-typeset text region.
The inventor finds that whether the document is the page shown in fig. 2a or the page shown in fig. 2b after the typesetting is completed, the text content needs to be accurately identified, so that the text region can be typeset again without being influenced by the header and footer content and other content, and the normally displayed text content typesetting page is obtained.
Based on this, the invention provides a method and a device for identifying a text area of a document, wherein the identification method comprises the following steps:
determining all elements in the page to be identified;
traversing all elements in the page to be identified, and trying to discard the elements one by one;
determining an actual discarded element;
discarding the actual discarded element;
calculating the average density of the remaining elements;
judging whether the average density of the residual elements is greater than or equal to a preset average density threshold value or not;
if so, taking the residual elements as text area elements;
if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 3, fig. 3 is a flowchart of a document text recognition method according to an embodiment of the present invention.
As shown in fig. 3, the method includes:
step S101: determining all elements in the page to be identified;
specifically, the document referred to in the present invention includes, but is not limited to, a Word, PDF, WPS, or HTML format document. Documents in any format have corresponding editing or displaying programs, and are finally presented to a user for viewing. In the present application, the content of a document is converted into regions and content, i.e., elements, for display. In the embodiment, the document is composed of pages, the pages are composed of elements, and each page comprises various types of elements, such as body text, a header, a footer, a comment, a footnote and the like.
The text recognition method provided by the embodiment of the invention is general and does not depend on the specific content of the text; that is, for this method the content of an element is unimportant, and the relative position between the element areas is what matters. Therefore, in the embodiment of the invention, the areas occupied by all the elements in the page to be identified are determined first; the areas occupied by all the elements are then blackened (filled as solid black blocks), thereby determining all the elements in the page to be identified. Referring to FIG. 4, the areas occupied by elements in the document page are black, and the areas without content are white, so that all the elements in the page to be identified are determined.
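The blackening step can be illustrated with a minimal Python sketch. It assumes each element is given as an integer (x, y, w, h) rectangle lying inside the page, and the function name `blacken` and the 0/1 mask representation are hypothetical choices for illustration, not part of the embodiment.

```python
def blacken(page_w, page_h, elements):
    """Return a page mask: 1 where an element occupies the cell (black), 0 elsewhere (white).

    Assumption: elements are integer (x, y, w, h) tuples inside the page,
    with the origin at the page's top-left corner.
    """
    mask = [[0] * page_w for _ in range(page_h)]
    for x, y, w, h in elements:
        for row in range(y, y + h):
            for col in range(x, x + w):
                mask[row][col] = 1
    return mask


# A 6 x 4 page with a single 2 x 2 element whose top-left corner is at (1, 1).
mask = blacken(6, 4, [(1, 1, 2, 2)])
black_cells = sum(sum(row) for row in mask)  # 4 cells are black
```

Only the occupied rectangles matter here; the content inside each element is never inspected.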
Step S102: traversing all elements in the page to be identified, and trying to discard the elements one by one;
it should be noted that, in this embodiment, "traversing all elements in the page to be identified" does not refer only to all the elements originally in the page; after one or more elements have been discarded, it refers to traversing all the remaining elements.
The process of attempting to discard is as follows. In this embodiment, the order in which elements are tried is not limited. When there are n elements, they are tried one by one: one element is tentatively discarded and the subsequent calculation is performed; the state before discarding that element is then restored, another element is tentatively discarded, and the subsequent calculation is performed again, for a total of n trial discards.
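The trial-discard traversal described above can be sketched in Python as follows. The generator name `trial_discards` is hypothetical; the sketch assumes the elements are held in a list and relies on building a new list for every trial, so that "restoring the state before discarding" happens implicitly because the original list is never modified.

```python
def trial_discards(elements):
    """Yield (index, remaining) once per element, n times in total.

    Each trial builds a fresh list without the tried element, so the
    original list is left untouched between trials.
    """
    for i in range(len(elements)):
        yield i, elements[:i] + elements[i + 1:]


trials = list(trial_discards(["a", "b", "c"]))
# Three trials: discard "a", then "b", then "c", each from the full list.
```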
Step S103: determining an actual discarded element;
optionally, in the embodiment of the present invention, a specific process of determining an actual discarded element, as shown in fig. 5, includes:
step S31: calculating the density gain of each element after being discarded;
step S32: comparing the magnitude of all of the density gains;
step S33: and taking the element corresponding to the maximum value of the density gain as an actual discarding element.
The density gain in this embodiment refers to the increase in average density. Calculating the density gain after each element is discarded specifically includes: when one element is discarded, calculating the average density of all the remaining elements; and subtracting the average density before discarding the element from the average density of all the remaining elements to obtain the density gain after discarding the element. It should be noted that, in the process of calculating the density gain, this embodiment first needs to calculate the average density of all elements in the page to be identified, which is used as the initial average density, that is, $D_{0,0}$.
Referring to fig. 6 and 7, fig. 6 shows the definition of the region, in this embodiment, the outer frame is the current page of the document, the inner frame is the region, the upper left corner of the current page of the document is used as the origin, and the representation forms of the region include two types:
the first expression: (x, y, w, h), namely, the coordinates at the upper left corner and the width and height;
the second expression: (x1, y1, x2, y2), i.e. in upper left and lower right coordinates;
it should be noted that the two expressions are equivalent, and the formula for converting from the first expression to the second expression is as follows:
x1=x;
y1=y;
x2=x+w;
y2=y+h.
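As a small illustration, the conversion formulas above can be written as a Python function. The function names are hypothetical, and the inverse conversion is not stated in the text; it is included only because it follows directly from the same four equations.

```python
def xywh_to_corners(x, y, w, h):
    """First expression (x, y, w, h) -> second expression (x1, y1, x2, y2)."""
    return (x, y, x + w, y + h)


def corners_to_xywh(x1, y1, x2, y2):
    """Inverse conversion, obtained by solving the same equations for w and h."""
    return (x1, y1, x2 - x1, y2 - y1)


corners = xywh_to_corners(10, 20, 30, 40)  # (10, 20, 40, 60)
```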
In this embodiment, for a single element, the density is defined as 1, ignoring differences in the content of the element itself. That is, however much or little content an element holds (for example, only a short run of characters), its density is taken to be 1, which simplifies the processing and speeds it up. As shown in FIG. 7, in this embodiment, it is assumed that the regions where the elements are located are A and B, the density of each of the regions A and B is 1, and the mass of each is its area w × h, i.e., A.w × A.h and B.w × B.h.
For the condition of merging the two regions, the merged region is a region C; the formula for the average density is as follows:
$$D = \frac{\text{total mass}}{\text{merged area}} = \frac{A.w \times A.h + B.w \times B.h}{C.w \times C.h}$$
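The average density formula can be sketched in Python for any number of elements. The sketch assumes that the merged region C is the smallest axis-aligned rectangle enclosing all elements, and that elements have positive area and do not overlap; these assumptions, and the names `bounding_box` and `average_density`, are illustrative rather than taken from the embodiment.

```python
def bounding_box(elements):
    """Smallest rectangle (x, y, w, h) enclosing all (x, y, w, h) elements.

    Assumption: this rectangle plays the role of the merged region C.
    """
    x1 = min(x for x, y, w, h in elements)
    y1 = min(y for x, y, w, h in elements)
    x2 = max(x + w for x, y, w, h in elements)
    y2 = max(y + h for x, y, w, h in elements)
    return (x1, y1, x2 - x1, y2 - y1)


def average_density(elements):
    """Total mass (sum of element areas, each with density 1) over the merged area."""
    total_mass = sum(w * h for _, _, w, h in elements)
    _, _, cw, ch = bounding_box(elements)
    return total_mass / (cw * ch)


A = (0, 0, 10, 10)
B = (20, 0, 10, 10)
d = average_density([A, B])  # 200 / 300, about 0.667
```

With a gap of 10 between A and B the merged region is 30 × 10, so one third of it is empty; moving B next to A would bring the average density to 1.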
In this embodiment, one element is tentatively discarded and the corresponding density gain is calculated; then, after the state before discarding that element is restored, another element is tentatively discarded and its density gain is calculated, and so on for every element.
And comparing all the density gains, and selecting the element corresponding to the maximum value of the density gains as an actual discarded element to be discarded in the subsequent steps.
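Steps S31 to S33 can be sketched in Python as below. The sketch assumes the merged area is the bounding rectangle of the elements, requires at least two elements, and resolves ties by taking the first maximum; the function names and the page coordinates are hypothetical.

```python
def average_density(elements):
    """Sum of element areas divided by the area of their bounding rectangle (assumed merged region)."""
    x1 = min(x for x, y, w, h in elements)
    y1 = min(y for x, y, w, h in elements)
    x2 = max(x + w for x, y, w, h in elements)
    y2 = max(y + h for x, y, w, h in elements)
    return sum(w * h for _, _, w, h in elements) / ((x2 - x1) * (y2 - y1))


def density_gain(elements, i):
    """Average density after discarding element i minus the average density before."""
    remaining = elements[:i] + elements[i + 1:]
    return average_density(remaining) - average_density(elements)


def pick_actual_discard(elements):
    """Return (index, gain) of the element whose trial discard gives the largest gain."""
    gains = [density_gain(elements, i) for i in range(len(elements))]
    best = max(range(len(elements)), key=gains.__getitem__)
    return best, gains[best]


# Hypothetical page: three stacked body lines and a small header far above them.
page = [(10, 30, 80, 10), (10, 40, 80, 10), (10, 50, 80, 10), (10, 0, 30, 5)]
best_index, best_gain = pick_actual_discard(page)  # the header, index 3
```

Discarding the header shrinks the bounding rectangle to the body lines alone, so it is the only trial with a positive gain in this example.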
Step S104: discarding the actual discarded element;
the actual discarded element in this step is the actual discarded element determined in step S103 described above.
Step S105: calculating the average density of the remaining elements;
it should be noted that, referring to FIG. 8 and FIG. 9, FIG. 8 is a page to be identified that includes 18 elements, and FIG. 9 is the page to be identified after element No. 18 is discarded. As can be seen from the figures, the total mass is reduced by the mass of element No. 18, and the total area is reduced by the area of element No. 18 plus the area of the border region around it, so the average density of the remaining elements increases. When element No. 2 is discarded instead, the average density of the remaining elements decreases, because the total mass is reduced by the mass of element No. 2 while the total area is not reduced. That is, whichever element is discarded, the average density of the remaining elements changes relative to the average density before discarding.
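The two cases can be checked numerically with a small Python sketch. The coordinates below are made up for illustration and are not the elements of FIG. 8; the sketch also assumes the total area is the bounding rectangle of the remaining elements.

```python
def average_density(elements):
    """Sum of element areas divided by the area of their bounding rectangle (assumed total area)."""
    x1 = min(x for x, y, w, h in elements)
    y1 = min(y for x, y, w, h in elements)
    x2 = max(x + w for x, y, w, h in elements)
    y2 = max(y + h for x, y, w, h in elements)
    return sum(w * h for _, _, w, h in elements) / ((x2 - x1) * (y2 - y1))


# Hypothetical page: three adjacent body blocks and a footer separated by a gap.
b1, b2, b3 = (0, 0, 100, 20), (0, 20, 100, 20), (0, 40, 100, 20)
footer = (0, 90, 100, 10)

before = average_density([b1, b2, b3, footer])   # 7000 / 10000 = 0.7
after_footer = average_density([b1, b2, b3])     # 6000 / 6000 = 1.0, area shrinks
after_interior = average_density([b1, b3, footer])  # 5000 / 10000 = 0.5, area unchanged
```

Discarding the outlying footer removes both its mass and the empty gap, so density rises; discarding the interior block removes mass only, so density falls.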
Step S106: judging whether the average density of the residual elements is greater than or equal to a preset average density threshold value or not;
if yes, the process proceeds to step S107: taking the residual elements as text area elements;
in this embodiment, because the distance between text elements is generally smaller than the distance between text elements and non-text elements, an average density threshold is preset for each page to be identified, and when the average density of all the remaining elements is greater than or equal to the average density threshold, the remaining elements are considered to be text elements. The preset average density threshold is a value close to 1, which indicates that the distance between elements is small and the arrangement is dense.
If not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one.
In this embodiment, if after one actual discarded element is discarded the average density of the remaining elements is smaller than the preset average density threshold, non-text elements are still present among the remaining elements. The next actual discarded element is then determined and discarded, and the average density of the remaining elements is calculated again, until the average density of the remaining elements is greater than or equal to the preset average density threshold; the remaining elements are then the text elements.
FIG. 10 is a schematic diagram of all elements of the page to be identified after discarding element No. 18 and element No. 17. Element No. 17 is an actual discarded element, and it is determined in the same way as in step S103, which is not described again here. The embodiment is described using this schematic diagram only as an example and does not limit the actual situation.
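The loop of steps S102 to S107 can be sketched in Python as follows. This is a sketch under stated assumptions, not a definitive implementation: the merged area is taken to be the bounding rectangle, the threshold value 0.9 is a hypothetical stand-in for "a value close to 1", and the guard that stops when one element is left is added here only to guarantee termination.

```python
def average_density(elements):
    """Sum of element areas divided by the area of their bounding rectangle (assumed merged region)."""
    x1 = min(x for x, y, w, h in elements)
    y1 = min(y for x, y, w, h in elements)
    x2 = max(x + w for x, y, w, h in elements)
    y2 = max(y + h for x, y, w, h in elements)
    return sum(w * h for _, _, w, h in elements) / ((x2 - x1) * (y2 - y1))


def find_text_elements(elements, threshold=0.9):
    """Repeatedly discard the element with the largest density gain until the threshold is met."""
    remaining = list(elements)
    while len(remaining) > 1:  # assumption: stop when a single element is left
        # All trials share the same "before" density, so the largest gain
        # is the trial with the largest resulting density.
        best = max(
            range(len(remaining)),
            key=lambda i: average_density(remaining[:i] + remaining[i + 1:]),
        )
        del remaining[best]
        if average_density(remaining) >= threshold:
            break
    return remaining


# Hypothetical page: header, three stacked body lines, footer.
header, footer = (10, 0, 30, 5), (60, 95, 30, 5)
body = [(10, 30, 80, 10), (10, 40, 80, 10), (10, 50, 80, 10)]
text = find_text_elements([header] + body + [footer])
```

On this page the footer is discarded first and the header second, after which the three body lines fill their bounding rectangle completely and the loop stops.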
The document text recognition method provided in the embodiment includes determining all elements in a page to be identified; traversing all elements in the page to be identified, and trying to discard the elements one by one; determining an actual discarded element; discarding the actual discarded element; calculating the average density of the remaining elements; judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold value; if so, taking the remaining elements as text area elements; if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one. The method exploits the fact that text elements lie close together, so their average density is high, whereas non-text elements lie farther from the text elements, so the overall average density of text and non-text elements together is low. The text elements are therefore obtained by discarding the non-text elements, and the text element identification accuracy is high.
In addition, in the embodiment of the invention, in the process of trying to discard one by one and determining the actual discarded elements, a plurality of iteration processes can be carried out to obtain the data sequence of the actual discarded elements and the average density before discarding the elements each time.
Assuming that the page to be identified includes n elements, n iterations are required, and each iteration will obtain an actual discarded element, and the specific process is as follows:
traversing n elements in the page to be identified, performing abandoning attempt one by one, calculating the density gain of each abandoned element, and if the average density is reduced due to abandoning the element, recording the density gain as negative; if the average density increases as a result of discarding the element, the density gain is recorded as positive.
In the 1st iteration, there are n elements; trial discards are carried out one by one to obtain n density gains, and the element with the maximum density gain is selected as the actual discarded element $R_1$.
In the 2nd iteration, there are n-1 elements; trial discards are carried out one by one to obtain n-1 density gains, and the element with the maximum density gain is selected as the actual discarded element $R_2$.
...
In the (n-1)th iteration, there are 2 elements; each is tentatively discarded in turn to obtain two density gains, and the element with the larger density gain is selected as the actual discarded element $R_{n-1}$.
In the nth iteration, there is only 1 element, which is the last actual discarded element, $R_n$.
From the above, the following rule is obtained: in the $k$-th iteration there are $n-k+1$ elements, and $n-k+1$ trial discards are required. Discarding the $n-k+1$ elements one at a time gives the density gains, denoted $\Delta D_{k,j}$, where $k$ is the iteration number and $j$ is the number of the tentatively discarded element. From the $n-k+1$ values $\Delta D_{k,j}$, the element $m$ with the maximum value is selected as the actual discarded element, recorded as $R_k$, and the average density of all elements before discarding is calculated and recorded as $D_k$.
After all iterations are completed n times, two data sequences are obtained:
$R_1, R_2, ..., R_n$, that is, the actual discarded element for each iteration;
$D_1, D_2, ..., D_n$, i.e. the average density before elements are discarded for each iteration.
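The n iterations and the two resulting sequences can be sketched in Python as below. The sketch assumes the merged area is the bounding rectangle of the elements and breaks ties by taking the first maximum; the function names and page coordinates are hypothetical.

```python
def average_density(elements):
    """Sum of element areas divided by the area of their bounding rectangle (assumed merged region)."""
    x1 = min(x for x, y, w, h in elements)
    y1 = min(y for x, y, w, h in elements)
    x2 = max(x + w for x, y, w, h in elements)
    y2 = max(y + h for x, y, w, h in elements)
    return sum(w * h for _, _, w, h in elements) / ((x2 - x1) * (y2 - y1))


def discard_sequence(elements):
    """Run all n iterations; return (R, D) as plain lists in iteration order."""
    remaining = list(elements)
    R, D = [], []
    while remaining:
        D.append(average_density(remaining))  # density before this iteration's discard
        if len(remaining) == 1:
            best = 0  # last iteration: the only element is discarded
        else:
            # Same baseline for every trial, so max gain == max resulting density.
            best = max(
                range(len(remaining)),
                key=lambda i: average_density(remaining[:i] + remaining[i + 1:]),
            )
        R.append(remaining.pop(best))
    return R, D


# Hypothetical page: header, three stacked body lines, footer.
header, footer = (10, 0, 30, 5), (60, 95, 30, 5)
body = [(10, 30, 80, 10), (10, 40, 80, 10), (10, 50, 80, 10)]
R, D = discard_sequence([header] + body + [footer])
```

Here the first two discarded elements are the footer and the header, and the density sequence rises from 0.3375 to 0.53125 and then stays at 1.0 once only body lines remain.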
Taking the 18 elements shown in FIG. 8 as an example, the average density data sequence is obtained after 18 iterations, and the average density is plotted against the iteration number, as shown in FIG. 11, with the iteration number on the abscissa and the average density value on the ordinate. Analysis of the graph shows that the average density of the elements jumps between the third and fourth iterations, so the elements discarded in the first three iterations are non-text elements. That is, by analyzing the jump point of the average density in the graph, the boundary between text elements and non-text elements can be found. This makes full use of two characteristics: the distance between text elements is small and their average density is large, while the distance between text elements and non-text elements is large and their combined average density is small.
In this embodiment, a density gain data sequence may also be calculated from the average density values, and a graph of density gain against iteration number is drawn, as shown in FIG. 12, where the abscissa is the iteration number and the ordinate is the density gain $\Delta D$. The density gain is $\Delta D_k = D_k - D_{k-1}$, with $\Delta D_1 = 0$. A smaller $\Delta D$ value in FIG. 12 indicates that the element deleted in the corresponding iteration is adjacent to the element deleted in the preceding iteration. Analysis of FIG. 12 shows that the jump point of the density gain occurs at $\Delta D_4$, indicating that the element deleted in the fourth iteration is relatively far from the element deleted in the third iteration.
It should be noted that, within the text area, as the iterations proceed the number of discarded elements increases and the number of remaining elements decreases, so discarding a single element has a growing influence on the average density. The $\Delta D$ curve of the text area therefore oscillates up and down with increasing amplitude, as shown for iterations 4 to 17 in FIG. 12.
The two judgment methods can also be combined. The conclusion obtained by analyzing FIG. 12 is the same as the conclusion obtained by analyzing FIG. 11: the elements deleted in the first three iterations do not belong to the text region, and the elements remaining after the first three iterations form the text region.
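The density gain sequence can be computed from the average density sequence with a few lines of Python. The density values below are made up for illustration and are not the data of FIG. 11. The text locates the jump point by inspecting the graph; taking the iteration with the largest gain, as `largest_jump_iteration` does, is only a simplifying assumption for this sketch and may be misled by the late oscillations of the curve.

```python
def density_gains(D):
    """Gain for iteration k is D[k] - D[k-1]; the first gain is defined as 0."""
    return [0.0] + [D[k] - D[k - 1] for k in range(1, len(D))]


def largest_jump_iteration(D):
    """1-based iteration number with the largest gain (hypothetical heuristic)."""
    gains = density_gains(D)
    return max(range(len(gains)), key=gains.__getitem__) + 1


# Hypothetical average densities before each of five iterations.
D = [0.3375, 0.53125, 1.0, 1.0, 1.0]
gains = density_gains(D)
k = largest_jump_iteration(D)  # 3
```

With $D_k$ defined as the density before the discard of iteration $k$, a jump at iteration $k$ suggests that the elements discarded in iterations 1 to $k-1$ are the non-text elements; in this example those are the first two.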
The following describes the document text recognition apparatus provided in the embodiment of the present invention, and the document text recognition apparatus described below and the document text recognition method described above may be referred to in correspondence with each other.
The embodiment of the invention also provides a document text recognition device, and referring to fig. 13, fig. 13 is a schematic structural diagram of the document text recognition device provided by the embodiment of the invention.
As shown in FIG. 13, the document text recognition apparatus includes:
an element determination unit 11 configured to determine all elements in the page to be recognized;
an actual discarded element determining unit 12, configured to traverse all elements in the page to be identified, and try to discard one by one to determine an actual discarded element;
an actual discarded element discarding unit 13 configured to discard the actual discarded element;
an average density calculation unit 14 for calculating an average density of the remaining elements;
a judging unit 15 configured to judge whether the average density of the remaining elements is greater than or equal to a preset average density threshold; if so, the remaining elements are text area elements; if not, processing returns to the step of traversing all elements in the page to be identified and trying to discard them one by one.
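The cooperation of units 11 to 15 can be sketched as a loop. The sketch below is illustrative only and rests on several assumptions not confirmed by this section: elements are modelled as non-overlapping axis-aligned rectangles `(x, y, width, height)`, the average density is taken to be the summed element area divided by the area of the bounding box enclosing the remaining elements, and the names `average_density` and `find_text_area`, the coordinates, and the threshold value are hypothetical.

```python
from typing import List, Tuple

# Hypothetical element model: (x, y, width, height) of a blackened block.
Rect = Tuple[float, float, float, float]


def average_density(rects: List[Rect]) -> float:
    """Assumed definition: summed element area / area of the enclosing box."""
    if not rects:
        return 0.0
    left = min(x for x, _, _, _ in rects)
    top = min(y for _, y, _, _ in rects)
    right = max(x + w for x, _, w, _ in rects)
    bottom = max(y + h for _, y, _, h in rects)
    box_area = (right - left) * (bottom - top)
    covered = sum(w * h for _, _, w, h in rects)
    return covered / box_area if box_area > 0 else 0.0


def find_text_area(elements: List[Rect], threshold: float) -> List[Rect]:
    """Discard one element per iteration until the density reaches the threshold."""
    remaining = list(elements)
    while len(remaining) > 1:
        # Trial pass: tentatively drop each element and record the density.
        trial = [
            average_density(remaining[:i] + remaining[i + 1:])
            for i in range(len(remaining))
        ]
        # The density before discarding is the same for every trial, so the
        # highest resulting density is also the highest density gain.
        best = max(range(len(remaining)), key=trial.__getitem__)
        del remaining[best]
        # Judge only after the actual discard, following the unit order above.
        if trial[best] >= threshold:
            break
    return remaining


# Hypothetical page: a navigation block, three text lines, and a footer block.
page = [
    (0, 0, 50, 20),
    (100, 100, 400, 20),
    (100, 125, 400, 20),
    (100, 150, 400, 20),
    (0, 900, 50, 20),
]
text_area = find_text_area(page, threshold=0.8)
```

With these sample coordinates the footer is discarded first and the navigation block second, after which the three tightly stacked text lines exceed the threshold and are returned.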
Optionally, as shown in fig. 14, the actual discarded element determining unit 12 includes:
a single element discarding unit 121, configured to traverse all elements in the page to be identified, and attempt to discard one by one;
a density gain calculation unit 122 for calculating a density gain after each element is discarded;
a comparing unit 123 for comparing the magnitude of all the density gains;
an actual discarded element determining subunit 124 configured to determine the element corresponding to the maximum value of the density gain as the actual discarded element.
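The selection performed by subunits 121 to 124 can be sketched independently of how density is measured. In the following illustration the function `pick_element_to_discard` and the toy one-dimensional density `toy_density` are hypothetical; only the selection rule (try each element, compute the gain, keep the maximum) follows the description above.

```python
from typing import Callable, List, Tuple


def pick_element_to_discard(
    elements: List[float],
    density: Callable[[List[float]], float],
) -> Tuple[int, float]:
    """Return (index, gain) of the element whose trial removal maximises the gain."""
    base = density(elements)
    best_index, best_gain = -1, float("-inf")
    for i in range(len(elements)):
        # Trial discard: leave out element i without modifying the input list.
        trial = elements[:i] + elements[i + 1:]
        gain = density(trial) - base
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


def toy_density(positions: List[float]) -> float:
    """Toy stand-in for average density: element count per unit of occupied span."""
    if not positions:
        return 0.0
    return len(positions) / (max(positions) - min(positions) + 1)


# Hypothetical 1-D layout: one outlier at 0 and a tight cluster at 10..13.
index, gain = pick_element_to_discard([0, 10, 11, 12, 13], toy_density)
```

Here the outlier at position 0 is selected, because removing it shrinks the occupied span the most and therefore yields the largest gain.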
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A document text recognition method is characterized by comprising the following steps:
determining all elements in the page to be identified;
traversing all elements in the page to be identified, and trying to discard the elements one by one; the discarding process comprises: randomly discarding one element and calculating the average density of the remaining elements;
determining actual discarded elements after all elements are discarded once;
discarding the actual discarded element;
calculating the average density of the remaining elements;
judging whether the average density of the remaining elements is greater than or equal to a preset average density threshold value or not;
if so, taking the remaining elements as text area elements;
if not, returning to the step of traversing all elements in the page to be identified and trying to discard one by one;
the determining all elements in the page to be identified specifically includes:
determining the areas occupied by all elements in the page to be identified;
and carrying out black blocking processing on the areas occupied by all the elements, and determining all the elements in the page to be identified.
2. The document text recognition method according to claim 1, wherein the determining actual discarded elements specifically comprises:
calculating the density gain of each element after being discarded;
comparing the magnitude of all of the density gains;
and taking the element corresponding to the maximum value of the density gain as an actual discarding element.
3. The document text recognition method according to claim 2, wherein the calculating the density gain after each element is discarded specifically comprises:
when one element is discarded, calculating the average density of all the remaining elements;
and subtracting the average density before discarding the element from the average density of all the remaining elements to obtain the density gain after discarding the element.
4. The document text recognition method according to claim 1, further comprising, before said discarding the actual discarded element: and calculating the average density of all elements in the page to be identified as the initial average density.
5. A document text recognition apparatus, comprising:
the element determining unit is used for determining all elements in the page to be identified;
the actual discarded element determining unit is used for traversing all elements in the page to be identified, trying to discard the elements one by one and determining the actual discarded elements; the discarding process comprises: randomly discarding one element and calculating the average density of the remaining elements;
an actual discarded element discarding unit, configured to discard the actual discarded element after all elements have been discarded once;
an average density calculation unit for calculating an average density of the remaining elements;
a judging unit, configured to judge whether the average density of the remaining elements is greater than or equal to a preset average density threshold; if so, the remaining elements are text area elements; if not, processing returns to the step of traversing all the elements in the page to be identified and trying to discard the elements one by one;
the determining all elements in the page to be identified specifically includes:
determining the areas occupied by all elements in the page to be identified;
and carrying out black blocking processing on the areas occupied by all the elements, and determining all the elements in the page to be identified.
6. The document text recognition apparatus according to claim 5, wherein the actual discarded element determination unit includes:
the single element discarding unit is used for traversing all elements in the page to be identified and trying to discard the elements one by one;
a density gain calculation unit for calculating a density gain after each element is discarded;
the comparison unit is used for comparing the magnitude of all the density gains;
and the actual discarded element determining subunit determines the element corresponding to the maximum density gain as the actual discarded element.
7. The document text recognition apparatus according to claim 5, wherein the average density calculation unit is further configured to calculate an initial average density of all elements in the page to be recognized.
CN201710150271.9A 2017-03-14 2017-03-14 Document text recognition method and device Active CN106951401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710150271.9A CN106951401B (en) 2017-03-14 2017-03-14 Document text recognition method and device

Publications (2)

Publication Number Publication Date
CN106951401A (en) 2017-07-14
CN106951401B (en) 2020-03-20

Family

ID=59467482

Country Status (1)

Country Link
CN (1) CN106951401B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant