CN113850258A

CN113850258A - Method, system, equipment and storage medium for extracting text line in document

Info

Publication number: CN113850258A
Application number: CN202111124778.XA
Authority: CN
Inventors: 杨恒; 阮仕海; 龙涛
Original assignee: Shenzhen Aimo Technology Co ltd
Current assignee: Shenzhen Aimo Technology Co ltd
Priority date: 2021-09-25
Filing date: 2021-09-25
Publication date: 2021-12-28

Abstract

The application discloses a method, a system, a device and a storage medium for extracting text lines in a document. The method comprises the following steps: acquiring a document image; determining a line mark text block of the first line of the document based on the document image; extracting the characters of the text line of the first line based on the line mark text block of the first line; after the characters of the text line of the first line are extracted, acquiring the line mark text block of the adjacent line; extracting the characters of the text line of the adjacent line based on the line mark text block of the adjacent line; and, each time the characters of the text line of the previous line have been extracted, acquiring the line mark text block of the next adjacent line and extracting its text line, until all text lines are extracted.

Description

Method, system, equipment and storage medium for extracting text line in document

Technical Field

The present application relates to the field of image recognition, and in particular, to a method, system, device, and storage medium for extracting text lines in a document.

Background

With the development of computer and communication technologies, the volume of information and data is growing rapidly, and users increasingly demand automated information processing, particularly document processing. In document processing, text lines must first be extracted; the structures and features within each text line are then analyzed so that the line can be segmented and recombined and useful information extracted. OCR (Optical Character Recognition) is the first step of document processing. However, current state-of-the-art OCR technology can generally provide only the positions and text contents of the text blocks in a document (a text block being a group of closely spaced characters on the same text line), and these text blocks are unordered, so a post-processing method is needed to sort and associate the text blocks in order to extract the text lines.

A common text line extraction method projects the image in the horizontal and vertical directions and extracts text lines by exploiting two features: characters are arranged in rows, and there is a relatively large gap between rows. However, such methods perform poorly when the document layout is complex or noise interference is present.

To solve these problems, methods based on graph theory, clustering, morphology and the like have been proposed in the industry. However, some of these methods have a high error rate and others are complicated to operate, which causes considerable inconvenience in practical applications.

Therefore, the prior art has yet to be improved.

Disclosure of Invention

The present application aims to solve the problem in the prior art that the process of extracting text lines from a document is complex and inefficient.

The technical purpose of the application is realized by the following technical scheme:

a method for extracting text lines in a document comprises the following steps:

acquiring a document image;

determining a line mark text block of a first line of the document based on the document image;

extracting characters of the text line of the head line based on the line mark text block of the head line;

after the words of the text line of the first line are extracted, acquiring line mark text blocks of adjacent lines;

extracting characters of the text lines of the adjacent lines based on the line mark text blocks of the adjacent lines;

and after the extraction of the characters of the text line of the previous line is finished, acquiring the line mark text block of the next adjacent line and extracting the text line again until all the text lines are extracted.

According to the above scheme, a line mark text block is determined for the document image and the text line containing it is extracted based on that line mark text block. After that extraction is completed, the line mark text block of the adjacent line is acquired and the text line of the adjacent line is extracted, and so on until all text lines are extracted. The operation is simple and convenient, and the recognition efficiency is high.

Optionally, in the method for extracting text lines in a document, the step of extracting the characters of the text line of the first line based on the line mark text block of the first line includes:

based on the line mark text block, acquiring the adjacent text block in the same line, and setting it as the current active text block;

continuing in the same direction, acquiring the text block adjacent to the current active text block in the same line and setting it as the new current active text block, repeating until no adjacent text block remains in that direction;

and returning to the line mark text block, acquiring the adjacent text block of the same line in the opposite direction, setting it as the current active text block, and successively setting adjacent text blocks as the current active text block until no adjacent text block remains in that direction.

According to the above scheme, when the text line of the first line is extracted, the adjacent text block is acquired from the determined line mark text block and set as the current active text block; the text block adjacent to the current active text block is then acquired in turn, so that the line is extended block by block until all text blocks in that direction have been extracted. The adjacent text block in the opposite direction is then acquired starting from the line mark text block, and the line is extended block by block in that direction as well. In this way, starting from the line mark text block, the whole text line is extracted; the operation is simple and the accuracy is high.

Optionally, in the method for extracting text lines in a document, the step of acquiring an adjacent text block in the same line includes:

casting a ray in the vertical direction from the line mark text block or the current active text block;

rotating the ray clockwise or counterclockwise;

determining the first text block that the ray passes through in the current line direction during its rotation, and taking that text block as the text block adjacent, in the same line, to the line mark text block or the current active text block.

According to the above scheme, a ray in the vertical direction of the line mark text block or the current active text block is rotated, and the first text block that the ray passes through in the line direction is determined to be the adjacent text block, which gives high recognition precision and reduces recognition errors.

Optionally, in the method for extracting text lines in a document, the step of acquiring the line mark text block of the adjacent line after the characters of the text line of the first line are extracted includes:

taking the previous line as a translation line, and translating it in the direction perpendicular to the text line direction of the previous line;

and if text blocks lie on the translation line during the translation, determining the first text block on the translation line to be the line mark text block of the adjacent line.

According to the above scheme, after the text line of the first line is extracted, the text line of the adjacent line is extracted, which requires its line mark text block to be determined. The previous line is translated in the direction perpendicular to its text line direction in order to locate the adjacent line; when text blocks lie on the translation line during the translation, the first text block on the translation line is determined to be the line mark text block of the adjacent line. In this way the line mark text block of the adjacent line is obtained accurately and quickly, the adjacent line is then extracted by the method described above, and all text lines are eventually extracted.

Optionally, in the method for extracting text lines in a document, the step of taking the previous line as a translation line and translating it in the direction perpendicular to the text line direction of the previous line includes:

acquiring a first text block and a last text block of a previous line;

and setting the connecting line of the first text block and the last text block as the text line direction of the previous line.

According to the scheme, the text line direction of the previous line is determined by acquiring the head text block and the tail text block of the previous line, so that the line direction of the adjacent line is convenient to determine.

Optionally, in the method for extracting text lines in a document, the step of extracting the characters of the text line of the first line based on the line mark text block of the first line further includes:

when an adjacent text block is acquired, if the height difference between the two text blocks, or their position difference in the vertical direction, exceeds a preset range, judging that the two text blocks are not in the same line and discarding the text block acquired later.

According to the above scheme, each newly acquired text block is compared with the preceding one; if the difference exceeds the preset range, the text block acquired later is discarded, which improves recognition precision and reduces recognition errors.

Optionally, in the method for extracting text lines in a document, the step of extracting the characters of the text line of the adjacent line based on the line mark text block of the adjacent line further includes:

judging whether the text line of the adjacent line has already been extracted; if so, acquiring the line mark text block of the next adjacent text line and judging again; and extracting the text line once a text line that has not yet been extracted is found.

According to the scheme, the adjacent lines are judged, and when the adjacent lines are not extracted, the text lines are extracted, so that repeated extraction is avoided, and the text extraction precision is improved.

On the other hand, the application also discloses a system for extracting text lines in a document, wherein the system comprises:

the image acquisition module is used for acquiring a document image;

the first line mark text block determining module is used for determining a line mark text block of a first line of the document based on the document image;

the first line text extraction module is used for extracting characters of the text line of the first line based on the line mark text block of the first line;

the adjacent line mark text block acquisition module is used for acquiring the line mark text blocks of the adjacent lines after the characters of the text line of the first line are extracted;

the adjacent line text line extraction module is used for extracting characters of the text lines of the adjacent lines based on the line mark text blocks of the adjacent lines;

after the characters of the text line of the previous line are extracted, the adjacent line mark text block acquisition module acquires the line mark text block of the next adjacent line again, and the adjacent line text line extraction module extracts the text line, until all text lines are extracted.

In another aspect, the present application further discloses an apparatus for extracting text lines in a document, which includes a memory and a processor, wherein the memory stores a computer program that can be loaded by the processor to execute the method for extracting text lines in a document as described above.

In another aspect, the present application further discloses a storage medium storing a computer program that can be loaded by a processor to execute the method for extracting text lines in a document as described above.

In summary, the present application discloses a method, a system, a device and a storage medium for extracting text lines in a document, wherein the method includes: acquiring a document image; determining a line mark text block of the first line of the document based on the document image; extracting the characters of the text line of the first line based on the line mark text block of the first line; after the characters of the text line of the first line are extracted, acquiring the line mark text block of the adjacent line; extracting the characters of the text line of the adjacent line based on the line mark text block of the adjacent line; and, each time the characters of the text line of the previous line have been extracted, acquiring the line mark text block of the next adjacent line and extracting its text line, until all text lines are extracted.

Drawings

FIG. 1 is a flow chart of the steps of a method for extracting lines of text in a document according to the present application.

Detailed Description

The present application is described in further detail below with reference to the attached drawings.

The embodiment of the application discloses a method for extracting text lines in a document. Referring to FIG. 1, which is a flow chart of the steps of the method, the method includes:

acquiring a document image;

In the embodiment of the application, the document image from which text is to be extracted is first acquired, for example by photographing or scanning. The document image is then input, and the line mark text block of the first line is determined based on the document image.

After the line mark text block of the first line is determined, the text line containing it can be extracted based on the line mark text block. Specifically, in this embodiment of the present application, the step of extracting the characters of the text line of the first line based on the line mark text block of the first line includes:

In the preferred embodiment of the present application, the line mark text block used when extracting the text line of the first line is the leftmost text block, although in a specific implementation it need not be. After the line mark text block is determined, the adjacent text block in the same line is acquired based on it and set as the current active text block. Here, "in the same line" means lying along the same determined horizontal direction as the line mark text block and adjacent to it. An adjacent text block may be adjacent on the left or on the right. If the left-adjacent text block is set as the current active text block, the text block adjacent on the left of the current active text block is acquired again and set as the new current active text block, and the text blocks on the left of the same line are extracted one after another until no text block remains on the left, which completes the left-side extraction. The process then returns to the line mark text block and extracts the text blocks on the right of the same line in the same manner: the right-adjacent text block is acquired and set as the current active text block, then its right-adjacent text block is acquired, and so on until the right-side extraction is completed. The text line of the whole line is thereby extracted.

In the preferred embodiment of the present application, because the line mark text block is the leftmost text block, no left-adjacent text block is found, and right-adjacent text blocks are acquired until the right-side extraction is completed. If the line mark text block is instead the rightmost one, all text blocks to its left are extracted. In general, depending on the position of the line mark text block, text blocks are first extracted on one side by following adjacent text blocks and then on the opposite side, until the whole line is extracted. In the embodiment of the present application, the text lines of the other lines are extracted with the same steps as the first line. The only difference is that a line mark text block must first be determined for each of them; once it is determined, the in-line extraction steps are identical and are therefore not described again here.

As mentioned above, the text blocks of every text line are extracted by repeatedly acquiring adjacent text blocks. Specifically, in the embodiment of the present application, the step of acquiring an adjacent text block in the same line includes:

rotating the ray clockwise or counterclockwise;

In the embodiment of the present application, an adjacent text block in the same line is found relative to the line mark text block or the current active text block. A ray is cast in the vertical direction from that block and rotated: counterclockwise to find the text block adjacent on the left, clockwise to find the text block adjacent on the right. The first text block that the ray passes through in the current line direction is the adjacent text block. The current line direction is taken to be substantially the same as the text line direction of the previous line, within an error range.

Specifically, in the embodiment of the present application, the step of acquiring the line mark text block of the adjacent line after the extraction of the text line of the first line is completed includes:

In the embodiment of the present application, the text line of an adjacent line is extracted with the same steps as the first line; the difference is that its line mark text block must first be determined. In a specific implementation, after the previous line is extracted, and because the direction of a line is substantially the same as that of the line before it, the text line just extracted is taken as a translation line and translated in the direction perpendicular to its text line direction. For example, if the line direction is exactly horizontal, the translation is performed in the vertical direction. During the translation, it is judged whether a text block lies on the translation line. When one does, the translation line has reached the adjacent line, and the first text block on the translation line, preferably its leftmost or rightmost text block, is determined to be the line mark text block of the adjacent line. After the line mark text block is determined, the text line is extracted with the same steps as for the first line. To extract the next adjacent line, the line mark text block of that line is acquired in the same way, and the process is repeated until all text lines are extracted.

As mentioned above, the previous line is used as a translation line and translated in the direction perpendicular to its text line direction. To obtain the text line direction of the previous line, in an embodiment of the present application, the step of taking the previous line as a translation line and translating it in the direction perpendicular to the text line direction of the previous line includes:

acquiring a first text block and a last text block of a previous line;

In the embodiment of the application, the text line direction of the previous line is determined by acquiring the head text block and the tail text block of the previous line, so that the line direction of the adjacent line is convenient to determine.

In a specific implementation of the present application, a document may contain erroneous characters, and the recognition process may be disturbed by interference factors such as stains or shadows introduced by photographing and scanning. Therefore, in an embodiment of the present application, the step of extracting the characters of the text line of the first line based on the line mark text block of the first line further includes:

In the embodiment of the application, to reduce recognition errors during text line extraction, two text blocks are treated as belonging to the same line when the difference between them is within the error range; when the difference falls outside the error range, the later-acquired text block is discarded, which preserves recognition accuracy.

As mentioned above, after the text line of one line is extracted, the text line of the adjacent line needs to be extracted. To avoid repeated extraction, in an embodiment of the present application, the step of extracting the characters of the text line of the adjacent line based on the line mark text block of the adjacent line further includes:

In the embodiment of the application, to avoid repeated extraction, after one line is extracted and the line mark text block of the adjacent line is acquired, it is first judged whether that next text line has already been extracted. If it has, that line is skipped and the line after it is considered instead; otherwise it is extracted. This avoids repeated extraction and improves text line extraction precision.

In the embodiments of the present application, the overall operation flow is illustrated as follows:

The current line direction is initialized to perfectly horizontal. The first line has no other line to serve as a reference, so its text line direction is assumed to be perfectly horizontal.

All unprocessed text blocks are projected along the current text line direction, and the first text block encountered from top to bottom is selected as the current line mark text block.
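The selection of the line mark text block can be sketched as follows. This is only an illustration under stated assumptions: image coordinates with y increasing downward, each text block reduced to its center point, and the line direction given as an angle `theta` in radians. The function name `pick_line_marker` and this representation are hypothetical and do not come from the application.

```python
import math


def pick_line_marker(centers, theta=0.0):
    """Return the index of the block met first when scanning top to bottom.

    centers: list of (x, y) block centers, y growing downward (assumed).
    theta:   current text line direction in radians (0.0 = horizontal).
    """
    # Unit normal of the line direction; for theta = 0 it points straight down.
    nx, ny = -math.sin(theta), math.cos(theta)
    # The smallest projection onto the normal belongs to the topmost block
    # as seen along the current line direction.
    return min(range(len(centers)),
               key=lambda i: nx * centers[i][0] + ny * centers[i][1])
```

With a horizontal direction this reduces to picking the smallest y; with a tilted direction a block that is lower on the page but further along the line can be met first.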

The projection operation amounts to rotating the document so that the current text line direction is horizontal. Starting from the current line mark text block, the line is then expanded to the left and to the right recursively. The two directions are handled in the same way, so only the leftward expansion is described as an example. First, the current line mark text block is set as the current active text block. From the remaining text blocks, the text block most likely to be in the same line as the current active text block and adjacent to it on the left is selected. If it satisfies the same-line and left-adjacent conditions, it is connected to the leftmost end of the current line and set as the current active text block. This operation is repeated until the conditions can no longer be met, at which point the leftward expansion of the current line is complete. After the leftward and rightward expansions are completed, the connection order of the text blocks is their order within the line.
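The left and right expansion can be sketched as a small loop. The text describes the expansion as recursive; the sketch below uses an equivalent iterative form. The names `expand_line`, `pick_neighbor` and `is_same_line` are hypothetical: the candidate selection and the same-line test are passed in as callbacks so that the sketch makes no assumption about how they are implemented.

```python
from collections import deque


def expand_line(marker, remaining, pick_neighbor, is_same_line):
    """Grow one text line outward from the line mark text block.

    marker:        id of the line mark text block
    remaining:     set of unprocessed block ids (modified in place)
    pick_neighbor: callable(active, remaining, side) -> candidate id or None
    is_same_line:  callable(active, candidate) -> bool
    Returns the block ids of the line in left-to-right order.
    """
    line = deque([marker])
    remaining.discard(marker)
    for side in ("left", "right"):
        active = marker  # each direction starts again from the marker
        while True:
            cand = pick_neighbor(active, remaining, side)
            if cand is None or not is_same_line(active, cand):
                break  # expansion in this direction is finished
            if side == "left":
                line.appendleft(cand)
            else:
                line.append(cand)
            remaining.discard(cand)
            active = cand  # the new block becomes the current active block
    return list(line)
```

Using a `deque` keeps the connection order equal to the in-line order, since blocks found on the left are prepended and blocks found on the right are appended.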

To select the text block that is most likely in the same line as the current active text block and adjacent to it on the left, a vertically upward ray is cast from the current active text block and rotated clockwise or counterclockwise, depending on the side being searched; the first text block the ray encounters is the selected candidate.
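A minimal sketch of the rotating-ray selection follows. Several details are assumptions made for illustration and are not stated in the text: boxes are axis-aligned `(x0, y0, x1, y1)` tuples with y increasing downward, the ray starts at the center of the active block, a box is approximated by its four corners, and the sweep is limited to half a turn so that a search on one side never returns a block on the other side. The function name `first_block_hit` is hypothetical.

```python
import math


def first_block_hit(active, candidates, clockwise):
    """Index of the first candidate box touched by the rotating ray, or None.

    The ray starts at the center of `active`, points straight up, and sweeps
    at most half a turn: clockwise searches the right side, counterclockwise
    searches the left side.
    """
    ox, oy = (active[0] + active[2]) / 2, (active[1] + active[3]) / 2
    best, best_angle = None, math.pi  # ignore anything past half a turn
    for i, (x0, y0, x1, y1) in enumerate(candidates):
        for px, py in ((x0, y0), (x1, y0), (x0, y1), (x1, y1)):
            # Mirror x for the counterclockwise sweep so one formula serves both.
            dx = px - ox if clockwise else ox - px
            # Angle swept from "up" to this corner, normalized to [0, 2*pi).
            angle = math.atan2(dx, oy - py) % (2 * math.pi)
            if angle < best_angle:
                best, best_angle = i, angle
    return best
```

The returned block is only a candidate; whether it really belongs to the current line is decided separately by the same-line conditions.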

In the embodiment of the present application, two text blocks satisfy the same-line and left-adjacent (or right-adjacent) conditions when the heights of the two text blocks differ by a factor of no more than 2.0 and their positions in the direction perpendicular to the text line differ by no more than 0.5 times the average height of the two text blocks.
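The two thresholds can be written as a short predicate. This sketch reads the height condition as a ratio of the larger height to the smaller one, which is an interpretation of the wording; the function name `in_same_line` and its parameters are hypothetical.

```python
def in_same_line(height_a, height_b, offset_a, offset_b,
                 max_height_ratio=2.0, max_offset_factor=0.5):
    """Check the same-line conditions for two text blocks.

    height_*: block heights (assumed positive).
    offset_*: block positions measured perpendicular to the text line.
    """
    # Condition 1: heights differ by a factor of at most max_height_ratio.
    ratio = max(height_a, height_b) / min(height_a, height_b)
    # Condition 2: perpendicular offset is at most max_offset_factor
    # times the average height of the two blocks.
    mean_height = (height_a + height_b) / 2
    return (ratio <= max_height_ratio
            and abs(offset_a - offset_b) <= max_offset_factor * mean_height)
```

For example, blocks of height 10 and 12 whose perpendicular positions differ by 2 pass both conditions, while a block of height 25 next to one of height 10 fails the height condition.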

The current text line direction is updated. Due to the deformation of the document and the perspective effect generated by shooting, the directions of all the text lines are inconsistent, but the directions of two adjacent lines are basically close, so that the direction of the next line can be approximated by the direction of the previous line, specifically, the direction of the connecting line of the leftmost text block and the rightmost text block of the current line is calculated.
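As a small illustration of the direction update, the direction of the line joining the two end blocks can be expressed as an angle. The sketch assumes each block is represented by its center point in image coordinates; the function name `line_direction` is hypothetical.

```python
import math


def line_direction(leftmost_center, rightmost_center):
    """Angle (radians) of the line joining the two end blocks of a text line.

    Centers are (x, y) points; 0.0 means perfectly horizontal. For a line
    with a single block both points coincide and the result is 0.0.
    """
    (x0, y0), (x1, y1) = leftmost_center, rightmost_center
    return math.atan2(y1 - y0, x1 - x0)
```

The resulting angle then serves as the approximate direction when the next line is searched for and extracted.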

The line extraction step and the step of updating the current text line direction are repeated until every text block has been assigned to a line.
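Putting the steps of this operation flow together, one possible end-to-end sketch is shown below. It is an illustration under assumptions, not the implementation of the application: blocks are axis-aligned boxes with y increasing downward, the ray starts at the block center and is tested against box corners, the sweep is limited to half a turn, and the height condition is read as a ratio. All names (`Block`, `to_line_frame`, `same_line`, `neighbor`, `extract_lines`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    @property
    def height(self):
        return self.y1 - self.y0  # assumed positive

    def corners(self):
        return [(self.x0, self.y0), (self.x1, self.y0),
                (self.x0, self.y1), (self.x1, self.y1)]


def to_line_frame(p, theta):
    """Rotate p by -theta so the current line direction becomes horizontal."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] + s * p[1], -s * p[0] + c * p[1])


def same_line(a, b, theta):
    """Height ratio <= 2.0 and perpendicular offset <= 0.5 * mean height."""
    ratio = max(a.height, b.height) / min(a.height, b.height)
    offset = abs(to_line_frame(a.center, theta)[1]
                 - to_line_frame(b.center, theta)[1])
    return ratio <= 2.0 and offset <= 0.5 * (a.height + b.height) / 2


def neighbor(active, remaining, theta, clockwise):
    """First block touched by an upward ray swept at most half a turn."""
    ou, ov = to_line_frame(active.center, theta)
    best, best_angle = None, math.pi
    for b in remaining:
        for p in b.corners():
            u, v = to_line_frame(p, theta)
            du = (u - ou) if clockwise else (ou - u)
            angle = math.atan2(du, ov - v) % (2 * math.pi)
            if angle < best_angle:
                best, best_angle = b, angle
    return best


def extract_lines(blocks):
    remaining = list(blocks)
    theta = 0.0  # the first line is assumed perfectly horizontal
    lines = []
    while remaining:
        # Line mark text block: first block met from top to bottom.
        marker = min(remaining, key=lambda b: to_line_frame(b.center, theta)[1])
        remaining.remove(marker)
        line = [marker]
        for clockwise in (False, True):  # expand left, then right
            active = marker
            while True:
                cand = neighbor(active, remaining, theta, clockwise)
                if cand is None or not same_line(active, cand, theta):
                    break
                remaining.remove(cand)
                if clockwise:
                    line.append(cand)
                else:
                    line.insert(0, cand)
                active = cand
        if len(line) > 1:  # update direction from the two end blocks
            (xa, ya), (xb, yb) = line[0].center, line[-1].center
            theta = math.atan2(yb - ya, xb - xa)
        lines.append(line)
    return lines
```

Because processed blocks are removed from `remaining` as soon as they join a line, no text line is extracted twice, and the loop ends once every block has been assigned.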

A second embodiment of the present application, based on the foregoing method, discloses a system for extracting a text line in a document, including:

the image acquisition module is used for acquiring a document image;

In the embodiment of the present application, each module of the system corresponds to a step of the method described above, and its function and specific operation flow are the same as those of the corresponding step, so they are not described again here.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above embodiments are preferred embodiments of the present application, and the scope of protection of the present application is not limited by them. Accordingly, all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.

Claims

1. A method for extracting text lines in a document is characterized by comprising the following steps:

acquiring a document image;

2. The method of claim 1, wherein the step of extracting text lines from the top line based on the line-marking text block of the top line comprises:

3. The method of claim 2, wherein the step of obtaining the text blocks of the same adjacent line comprises:

rotating the ray clockwise or counterclockwise;

and determining the first text block passed by the current line direction in the ray rotation process, and marking the text block as a line mark text block or a text block adjacent to the current active text block in the same line.

4. The method of claim 2, wherein the step of obtaining the line-marked text blocks of the adjacent lines after the text extraction of the text line of the top line is completed comprises:

5. The method of claim 4, wherein the previous line is used as a translation line, and the step of translating the previous line in a direction perpendicular to the direction of the text line of the previous line comprises:

acquiring a first text block and a last text block of a previous line;

6. The method of claim 1, wherein the step of extracting text lines from the top line based on the line-marking text block of the top line further comprises:

7. The method of extracting lines of text in a document according to claim 1, wherein the step of performing text extraction on adjacent lines of text lines based on adjacent lines of line-marking text blocks further comprises:

8. A system for extracting lines of text in a document, comprising:

the image acquisition module is used for acquiring a document image;

9. An apparatus for extracting lines of text in a document, comprising a memory and a processor, the memory storing a computer program that can be loaded by the processor to execute the method for extracting lines of text in a document according to any one of claims 1-7.

10. A storage medium storing a computer program that can be loaded by a processor to execute the method of extracting lines of text in a document according to any one of claims 1-7.