CN112115111A

CN112115111A - OCR-based document version management method and system

Info

Publication number: CN112115111A
Application number: CN201910536932.0A
Authority: CN
Inventors: 宋嘉琪; 张怀朋; 于航; 张智俊; 郭庆河
Original assignee: Shanghai Huairuo Intelligent Technology Co ltd
Current assignee: Shenzhen Yingshisheng Information Technology Co ltd
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2020-12-22

Abstract

The invention provides an OCR-based document version management method and system, relating to the fields of optical character recognition and natural language processing. The method comprises the following steps. Step 1: performing OCR character recognition on a picture document to obtain a plain text document. Step 2: performing text structure restoration on the plain text document. Step 3: comparing the plain text documents whose text structure has been restored to obtain a document comparison result. Step 4: post-processing the document comparison result and displaying it. The invention addresses two problems: existing document version management systems can only compare plain text documents and cannot compare picture documents; and when comparing text documents with complex structures, existing systems treat them uniformly as plain text documents, which reduces comparison precision to some extent and makes the result difficult to display.

Description

OCR-based document version management method and system

Technical Field

The invention relates to the fields of optical character recognition and natural language processing, and in particular to an OCR-based document version management method and system.

Background

Document version management builds on the application of computer vision (CV) to optical character recognition (OCR) and on natural language processing (NLP). Its main application scenarios include contract management, computer program code management, and changes to engineering plans or project requirements.

The core function of document version management is to compare the contents of different versions of a document, and the comparison process differs by application scenario. By comparison granularity, document comparison can be based on characters, words, or lines, and the three levels have different applications and implementations. For computer program code management, the code only needs to be compared line by line; for contract management, the contract contents need to be compared word by word; for documents that contain tables or other elements with their own structure, the comparison must be performed on the structured document. By carrier, documents are divided into plain text documents, such as Word and txt files, and picture documents, such as scanned PDFs and photos. For picture documents, text comparison can be performed after the characters have been recognized through OCR and the document structure has been restored.

However, the existing document version management system has the following defects:

(1) Existing document version management systems compare on the basis of plain text (editable) documents and cannot process picture documents.

(2) For text documents with complex structures (for example, tables, diagrams and other elements with their own structure), existing document version management systems do not consider the document's own structure and treat it uniformly as a plain text document for comparison. As a result, the comparison precision for structured text documents is reduced to some extent and the results are difficult to display.

Disclosure of Invention

In view of the above disadvantages of the prior art, the present invention aims to provide an OCR-based document version management method and system. It addresses the problem that existing document version management systems can only compare plain text documents and cannot compare picture documents, and the problem that, when comparing text documents with complex structures, existing systems treat them uniformly as plain text documents, which reduces comparison precision to some extent and makes the result difficult to display.

The invention provides a document version management method based on OCR, which comprises the following steps:

Step 1: performing OCR character recognition on the picture document to obtain a plain text document;

Step 2: performing text structure restoration on the plain text document;

Step 3: comparing the plain text documents whose text structure has been restored to obtain a document comparison result;

Step 4: post-processing the document comparison result and displaying the document comparison result.

Further, the OCR character recognition step is as follows:

Step 1.1: performing image angle correction and image noise reduction on the picture document, and converting the picture document into single-channel image data;

Step 1.2: loading an OCR character recognition model, inputting the single-channel image data into the OCR character recognition model for target detection, and, after table coordinates are obtained, dividing the picture document into a plurality of small pictures according to the table coordinates;

Step 1.3: loading an OCR character recognition model, and inputting the small pictures into the OCR character recognition model for character recognition to obtain character recognition data;

Step 1.4: filtering, sorting and merging the character recognition data to obtain a plain text document.
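Step 1.4 can be illustrated with a minimal sketch. The `Fragment` record, the `confidence` score, the `min_confidence` threshold and the half-line-height grouping rule are all assumptions made for illustration; the source does not specify how the recognition data are filtered, sorted or merged.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One recognized text box (hypothetical record layout)."""
    x: float           # left edge of the box
    y: float           # top edge of the box
    height: float      # box height, used to decide whether boxes share a line
    text: str
    confidence: float  # assumed recognition score in [0, 1]


def merge_fragments(fragments, min_confidence=0.5):
    """Filter, sort and merge OCR fragments into a plain text string."""
    # Filter: drop empty boxes and boxes below the assumed confidence threshold.
    kept = [f for f in fragments
            if f.text.strip() and f.confidence >= min_confidence]
    # Sort: top to bottom, then left to right.
    kept.sort(key=lambda f: (f.y, f.x))
    # Merge: boxes whose top edges are within half a line height share a line.
    lines = []
    for frag in kept:
        if lines and abs(frag.y - lines[-1][0].y) <= 0.5 * lines[-1][0].height:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return "\n".join(
        "".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


fragments = [
    Fragment(50, 11, 10, "world", 0.90),
    Fragment(0, 10, 10, "Hello ", 0.95),
    Fragment(0, 30, 10, "next", 0.80),
    Fragment(0, 50, 10, "??", 0.10),   # discarded by the confidence filter
]
plain_text = merge_fragments(fragments)
```

In this sketch the two boxes near y = 10 end up on one line and the low-confidence box is dropped; a real system would likely need a more robust line-grouping rule for skewed scans.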

Further, the text structure restoration comprises free text structure restoration and table detection.

Further, the free text structure restoration comprises the following steps:

Step 2.1: determining the start and end positions of the paragraphs of the plain text document according to its line spacing, line-head and line-tail features, and inserting line feed marks between paragraphs;

Step 2.2: detecting whether the plain text document contains a table of contents; if so, going to step 2.3, and if not, going to step 2.4;

Step 2.3: recognizing the content of the table of contents, locating chapter positions according to the table of contents, and restoring the chapter structure of the picture document according to the chapter positions;

Step 2.4: locating chapter positions according to the title and line spacing features of the plain text document, and restoring the chapter structure of the picture document according to the chapter positions.
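The following sketch illustrates steps 2.1 and 2.4 under simplifying assumptions: only line spacing is used to find paragraph boundaries (line-head and line-tail features are omitted), and chapter titles are found with a hypothetical `HEADING` pattern. The `gap_factor` threshold and both function names are illustrative and do not come from the source.

```python
import re
from statistics import median

# Hypothetical heading pattern: "1 Title", "1.2 Title", or headings such as "第一章".
HEADING = re.compile(r"^(\d+(\.\d+)*\s+\S|第[一二三四五六七八九十百]+[章节])")


def restore_paragraphs(lines, gap_factor=1.5):
    """lines: list of (y_top, text) pairs sorted from top to bottom.

    A blank line is inserted wherever the vertical gap to the previous line
    is clearly larger than the typical (median) gap.
    """
    if not lines:
        return ""
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    out = [lines[0][1]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > gap_factor * typical:
            out.append("")  # blank line marks a paragraph boundary
        out.append(text)
    return "\n".join(out)


def find_headings(text):
    """Return (line_index, line) for every line that looks like a chapter title."""
    return [(i, line) for i, line in enumerate(text.split("\n"))
            if HEADING.match(line)]


ocr_lines = [
    (0, "1 Scope"),
    (12, "This contract covers"),
    (24, "the delivery of goods."),
    (48, "2 Price"),          # larger gap before this line
    (60, "The price is fixed."),
]
restored = restore_paragraphs(ocr_lines)
headings = find_headings(restored)
```

Using the median gap as the reference keeps the rule independent of font size, but it is only one possible heuristic.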

Further, the table detection step is as follows:

Step 3.1: detecting and locating the intersection points of horizontal lines and vertical lines in the plain text document, and sorting the intersection points by priority along the x axis and the y axis of a rectangular coordinate system;

Step 3.2: traversing all intersection points of horizontal and vertical lines, and taking the current intersection point as the left intersection point of a candidate cell;

Step 3.3: judging, along the horizontal line on which the left intersection point lies, whether an intersection point exists to its right; if not, returning to step 3.2;

Step 3.4: judging, along the vertical line on which the right intersection point lies, whether an intersection point exists below it; if not, returning to step 3.2;

Step 3.5: judging, along the horizontal line on which the lower intersection point lies, whether an intersection point exists to its left; if not, returning to step 3.2;

Step 3.6: judging whether the upper-left intersection point and the lower-left intersection point lie on the same vertical line; if so, the candidate cell is established; otherwise the candidate cell is not established and the method returns to step 3.2.
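Steps 3.1 to 3.6 can be sketched as follows. The sketch assumes that the intersection points have already been extracted as exact `(x, y)` coordinates (y growing downward) and that the table is a regular grid; a practical implementation would need a coordinate tolerance and handling of merged cells, neither of which is described in the source. The function name `detect_cells` is illustrative.

```python
def detect_cells(points):
    """Find table cells from line intersection points.

    points: iterable of (x, y) intersections of horizontal and vertical lines.
    Returns cells as (left, top, right, bottom) tuples.
    """
    # Step 3.1: sort the intersections, here by y first and then by x.
    pts = sorted(set(points), key=lambda p: (p[1], p[0]))
    cells = []
    # Step 3.2: treat each intersection as the upper-left corner of a candidate.
    for x0, y0 in pts:
        # Step 3.3: nearest intersection to the right on the same horizontal line.
        right = [x for x, y in pts if y == y0 and x > x0]
        if not right:
            continue
        x1 = min(right)
        # Step 3.4: nearest intersection below on that vertical line.
        below = [y for x, y in pts if x == x1 and y > y0]
        if not below:
            continue
        y1 = min(below)
        # Step 3.5: nearest intersection to the left on the lower horizontal line.
        left = [x for x, y in pts if y == y1 and x < x1]
        if not left:
            continue
        # Step 3.6: the lower-left corner must share a vertical line
        # with the upper-left corner.
        if max(left) == x0:
            cells.append((x0, y0, x1, y1))
    return cells


# A 2 x 2 table drawn with three horizontal and three vertical lines.
grid = [(x, y) for y in (0, 5, 10) for x in (0, 10, 20)]
cells = detect_cells(grid)
```

The repeated list scans make this quadratic in the number of intersections, which is acceptable for a sketch; indexing points by row and column would be the obvious refinement.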

Further, the comparison of the plain text documents after text structure restoration comprises the following steps:

Step 4.1: judging whether the picture document and the plain text document are empty documents; if so, reporting an exception and ending the comparison;

Step 4.2: judging whether the picture document and the plain text document are equal; if so, defining the state of the plain text document and ending the comparison;

Step 4.3: finding the longest identical prefix and the longest identical suffix of the picture document and the plain text document, and defining the state of the longest identical prefix and the longest identical suffix;

Step 4.4: removing the longest identical prefix and the longest identical suffix from the picture document and the plain text document, and finding the longest identical subset of the remainder;

Step 4.5: dividing each of the picture document and the plain text document into a prefix, the subset and a suffix, with the longest identical subset as the boundary; taking the two prefixes as input and repeating steps 4.1 to 4.5; then taking the two suffixes as input and repeating steps 4.1 to 4.5;

Step 4.6: if the length of any input is less than or equal to 1, the comparison is finished.
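The recursion in steps 4.1 to 4.6 can be sketched as below. This is one possible reading of the steps, not the patented implementation: the "longest identical subset" is interpreted as the longest common contiguous substring and is found with `difflib.SequenceMatcher.find_longest_match` from the Python standard library; the state names "deletion" and "addition", the function names, and the choice to raise `ValueError` when either input is empty are assumptions.

```python
from difflib import SequenceMatcher


def _common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def _diff(a, b):
    # Step 4.2: identical inputs need no further work.
    if a == b:
        return [("no change", a)] if a else []
    # Step 4.3: longest identical prefix, then suffix of what remains.
    p = _common_prefix_len(a, b)
    s = _common_prefix_len(a[p:][::-1], b[p:][::-1])
    head = [("no change", a[:p])] if p else []
    tail = [("no change", a[len(a) - s:])] if s else []
    # Step 4.4: strip the prefix and suffix before searching the middle.
    return head + _diff_core(a[p:len(a) - s], b[p:len(b) - s]) + tail


def _diff_core(a, b):
    # Step 4.6: stop once either side has at most one character.
    if len(a) <= 1 or len(b) <= 1:
        return ([("deletion", a)] if a else []) + ([("addition", b)] if b else [])
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    if m.size == 0:
        return [("deletion", a), ("addition", b)]
    # Step 4.5: recurse on the parts before and after the common block.
    return (_diff(a[:m.a], b[:m.b])
            + [("no change", a[m.a:m.a + m.size])]
            + _diff(a[m.a + m.size:], b[m.b + m.size:]))


def compare(old, new):
    """Character-level comparison returning a list of (state, text) pairs."""
    # Step 4.1: empty documents are reported as an exception (assumed behavior).
    if not old or not new:
        raise ValueError("cannot compare an empty document")
    return _diff(old, new)


result = compare("The price is 100 USD.", "The price is 120 USD.")
```

Concatenating every piece except the additions reproduces the old text, and concatenating every piece except the deletions reproduces the new text, which is a convenient sanity check for any variant of this routine.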

Further, the result post-processing is to perform four-corner coding correction, mapping table verification and special symbol verification processing on the document comparison result.

A document version management system based on OCR comprises an OCR text recognition module, a text structure reduction module, a comparison processing module, a result post-processing module and a result display module;

the OCR text recognition module: performing OCR character recognition on the picture document to obtain a plain text document;

the text structure restoration module: performing structure restoration on the plain text document, restoring the chapter and paragraph structures of the picture document, and performing table detection;

the comparison module: comparing the plain text documents whose text structure has been restored, and formatting all differences between the picture document and the plain text document as an array;

the result post-processing module: performing four-corner coding correction, mapping table verification and special symbol verification on the picture document and the plain text document;

the result display module: displaying the document comparison result of the picture document and the plain text document on a page.
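The way the five modules feed one another can be sketched as a simple composition of callables. The function `build_pipeline` and its parameter names are hypothetical; the source names the modules but does not define their interfaces.

```python
def build_pipeline(recognize, restore, compare, postprocess, display):
    """Chain the five modules; each argument is a callable supplied by the caller."""
    def run(old_image, new_image):
        # OCR text recognition followed by text structure restoration, per document.
        old_text = restore(recognize(old_image))
        new_text = restore(recognize(new_image))
        # Comparison, result post-processing, then result display.
        return display(postprocess(compare(old_text, new_text)))
    return run


# Stand-in modules, used only to show the data flow.
run = build_pipeline(
    recognize=str.upper,    # placeholder for the OCR model
    restore=str.strip,      # placeholder for structure restoration
    compare=lambda a, b: [("no change", a)] if a == b
    else [("deletion", a), ("addition", b)],
    postprocess=lambda diffs: diffs,
    display=lambda diffs: diffs,
)
outcome = run(" a ", " a ")
```

Passing the modules in as parameters keeps them loosely coupled, so any one of them can be replaced or tested in isolation.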

As described above, the document version management method and system based on OCR of the invention have the following advantages:

1. To address the problem that current document version management systems cannot process picture documents, the invention compares picture documents on the basis of an OCR character recognition model. Combined with an OCR character recognition model trained on large-scale data, it can perform OCR character recognition with high precision and thereby provide the plain text documents to be compared.

2. Through free text structure restoration and table detection, the invention restores the paragraph and table-of-contents structures of the plain text documents recognized by OCR and performs table detection on text documents with complex structures, thereby improving document comparison precision.

3. The invention coordinates the functional modules so that they cooperate with one another while each retains the ability to operate independently, thereby achieving low coupling and improving the working efficiency of the system.

Drawings

FIG. 1 is a flowchart showing a document version management method disclosed in an embodiment of the present invention.

FIG. 2 is a flow chart of OCR text recognition disclosed in the embodiments of the present invention.

FIG. 3 is a flow chart illustrating the recovery of the free-text structure disclosed in the embodiment of the present invention.

FIG. 4 is a flow chart of table detection disclosed in an embodiment of the present invention.

FIG. 5 is a flow chart showing document comparison disclosed in an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

As shown in FIG. 1, the present invention provides an OCR-based document version management method, comprising the steps of:

As shown in FIG. 2, the OCR character recognition steps are:

OCR character recognition is the key to processing documents of any format and any structure. In particular, for picture documents such as photos and scanned documents, a plain text document must first be obtained through OCR character recognition before the subsequent comparison can be performed.

Step 2: performing text structure restoration on the plain text document, wherein the text structure restoration comprises free text structure restoration and table detection;

As shown in FIG. 3, the steps of the free text structure restoration are:

Step 2.1: determining the start and end positions of the paragraphs of the plain text document according to its line spacing, line-head and line-tail features, and inserting line feed marks between paragraphs to restore the paragraphs;

For documents that have their own structure, such as tables and diagrams, the plain text document obtained after OCR character recognition has no chapter or paragraph structure, so the free text structure must be restored in order to obtain a more accurate comparison result.

As shown in FIG. 4, the table detection steps are:

On the one hand, table detection can serve as a basis for judging whether the picture document is tilted or distorted; on the other hand, table detection is also very important for OCR-based information extraction, where it can greatly improve recognition efficiency and the extraction rate.

As shown in FIG. 5, the steps of comparing the plain text documents after text structure restoration are:

Step 4.2: judging whether the picture document and the plain text document are equal; if so, defining the state of the plain text document as 'no change' and ending the comparison;

Step 4.3: finding the longest identical prefix and the longest identical suffix of the picture document and the plain text document, and defining the state of the longest identical prefix and the longest identical suffix as 'no change';

Step 4.6: if the length of any input is less than or equal to 1, the comparison is finished;

Step 4.7: if the comparison times out, outputting all differences found so far;

Step 4.8: integrating all differences in order, the difference result being in units of characters;

Step 4.9: if the comparison is to be performed in units of lines, splicing the character-level comparison results with line feeds as demarcation points, combining all differences between two line feed characters into one text, and defining the state of that text as 'modification'.
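The line-level merging of step 4.9 can be sketched as follows. The input format, a list of `(state, text)` pairs with the states 'no change', 'deletion' and 'addition', is an assumption, as is the simplification that line feeds occur only inside 'no change' runs; the function name `merge_by_line` is illustrative.

```python
def merge_by_line(char_diffs):
    """Merge character-level differences into line-level results.

    char_diffs: list of (state, text) pairs; line feeds are assumed to occur
    only inside 'no change' runs.
    Returns a list of (state, old_line, new_line) tuples.
    """
    results = []
    old_buf, new_buf, dirty = "", "", False

    def flush():
        nonlocal old_buf, new_buf, dirty
        if old_buf or new_buf:
            state = "modification" if dirty else "no change"
            results.append((state, old_buf, new_buf))
        old_buf, new_buf, dirty = "", "", False

    for state, text in char_diffs:
        if state == "no change":
            for i, part in enumerate(text.split("\n")):
                if i > 0:
                    flush()  # a line feed closes the current line
                old_buf += part
                new_buf += part
        elif state == "deletion":
            old_buf += text
            dirty = True
        else:  # "addition"
            new_buf += text
            dirty = True
    flush()
    return results


char_level = [
    ("no change", "line one\nprice 1"),
    ("deletion", "0"),
    ("addition", "2"),
    ("no change", "0\nend"),
]
line_level = merge_by_line(char_level)
```

Every line that contains at least one deletion or addition is reported once as a 'modification', together with its old and new text; empty lines are skipped in this sketch.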

The plain text documents whose text structure has been restored are compared using the comparison algorithm, and all differences between the picture document and the plain text document are formatted as an array. The formatted difference result comprises all longest identical subsets of the documents and the state of each, which is one of deletion, addition, modification or no change; the longest identical subsets are in units of characters.

Specifically, the result post-processing is to perform four-corner coding correction, mapping table verification and special symbol verification processing on the document comparison result.

OCR character recognition is implemented with a deep-learning OCR character recognition model. The model is probabilistic and makes errors, so the comparison result of the plain text documents may contain erroneous content. Result post-processing is therefore required to remove erroneous content in the document comparison result caused by OCR character recognition errors.

Four-corner coding is an encoding scheme specific to Chinese characters. Four-corner coding correction compares the four-corner codes of the differing content in the picture document and the plain text document; if the code similarity is higher than a certain threshold, the document comparison result is corrected, removing erroneous content caused by OCR character recognition errors.
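A minimal sketch of the threshold test follows. The similarity measure (fraction of matching code positions), the default `threshold` of 0.75 and both function names are assumptions; the source only states that a similarity above "a certain threshold" triggers correction. The code table in the usage example uses placeholder symbols and made-up codes, not real four-corner values.

```python
def code_similarity(code_a, code_b):
    """Fraction of positions at which two equal-length codes agree."""
    if not code_a or len(code_a) != len(code_b):
        return 0.0
    return sum(x == y for x, y in zip(code_a, code_b)) / len(code_a)


def is_probable_ocr_error(old_char, new_char, codes, threshold=0.75):
    """Decide whether a one-character difference looks like OCR noise.

    codes: mapping from a character to its four-corner code string.
    Characters missing from the table are never treated as OCR noise.
    """
    a, b = codes.get(old_char), codes.get(new_char)
    if a is None or b is None:
        return False
    return code_similarity(a, b) > threshold


# Placeholder symbols with made-up codes, for illustration only.
demo_codes = {"A": "10001", "B": "10002", "C": "76543"}
similar = is_probable_ocr_error("A", "B", demo_codes)
different = is_probable_ocr_error("A", "C", demo_codes)
```

Characters with similar shapes tend to have similar four-corner codes, which is why a high code similarity is taken as a hint that the difference comes from misrecognition rather than from a real edit.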

Because the OCR character recognition model also fits a language model, the plain text document recognized by OCR may contain ambiguous characters or words that are not readily noticeable. The plain text document is corrected by mapping table verification, removing erroneous content in the document comparison result caused by OCR character recognition errors.
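Mapping table verification could look like the following sketch. The `CONFUSABLE` table, its entries and the function names are hypothetical; the source does not disclose the contents or format of the mapping table.

```python
# Hypothetical table: each easily confused glyph maps to one canonical form.
CONFUSABLE = {"0": "O", "o": "O", "1": "l", "I": "l", "|": "l"}


def canonical(text, table=CONFUSABLE):
    """Replace every character listed in the table by its canonical form."""
    return "".join(table.get(ch, ch) for ch in text)


def is_mapping_table_match(old, new, table=CONFUSABLE):
    """True if two differing strings become equal after canonicalization."""
    return old != new and canonical(old, table) == canonical(new, table)


noise = is_mapping_table_match("N0. 1", "NO. l")
real_change = is_mapping_table_match("100", "200")
```

A table like this treats "0" and "O" as interchangeable everywhere, so in practice it would probably need to be restricted in contexts such as monetary amounts, where a digit change is meaningful.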

For special characters that carry no key information, such as punctuation marks, spaces and line feeds, the plain text document is corrected by special symbol verification, removing erroneous content in the document comparison result caused by OCR character recognition errors.
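One way to sketch special symbol verification is to check whether two differing fragments become identical once whitespace and punctuation are ignored. The use of Unicode NFKC normalization (which folds full-width punctuation such as "，" into its ASCII form) and the function names are assumptions; the source does not describe the verification rule.

```python
import unicodedata


def strip_non_key(text):
    """Drop whitespace and punctuation after NFKC normalization."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def is_symbol_only_difference(old, new):
    """True if two differing strings differ only in spaces or punctuation."""
    return old != new and strip_non_key(old) == strip_non_key(new)


width_only = is_symbol_only_difference("甲方，乙方", "甲方, 乙方")
content_change = is_symbol_only_difference("total 10", "total 20")
```

Because all punctuation is discarded, this sketch would also ignore a dropped decimal point; a fuller rule set would likely protect symbols that appear inside numbers.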

To give the document comparison result a better visual experience, the differences between the compared files are displayed directly and clearly on a document comparison display page.

The invention also provides a document version management system based on OCR, which is realized based on the method and comprises an OCR text recognition module, a text structure reduction module, a comparison processing module, a result post-processing module and a result display module;

The OCR character recognition module and the text structure reduction module are the basis for endowing the system with the version management capability of processing documents with any format and any structure, and respectively provide the processing capability of the picture documents and the complex structure text documents. The output of the text structure restoration module is a data source for subsequent document comparison.

The comparison module is the core processing module of the system. It supports comparison by line or by character according to the comparison requirements of different document versions, and is efficient enough to meet real-time requirements.

The result post-processing module is needed because an OCR character recognition engine has been introduced: to keep the system running smoothly, it checks and corrects the output of the OCR character recognition model.

The result display module feeds back the final comparison result output by the result post-processing module to the user through a direct and clear page.

In conclusion, the invention addresses two problems: existing document version management systems can only compare plain text documents and cannot compare picture documents; and when comparing text documents with complex structures, existing systems treat them uniformly as plain text documents, which reduces comparison precision to some extent and makes the result difficult to display. The invention therefore effectively overcomes these defects in the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. An OCR-based document version management method, comprising the steps of:

Step 2: performing text structure restoration on the plain text document;

2. An OCR based document version management method according to claim 1, wherein the OCR character recognition step is:

3. An OCR-based document version management method according to claim 1, characterized in that: the text structure restoration comprises free text structure restoration and table detection.

4. An OCR-based document version management method according to claim 3, wherein the step of free-text structure restoration is:

5. An OCR-based document version management method according to claim 3, wherein the table detecting step is:

6. An OCR-based document version management method according to claim 1, wherein the step of comparing the plain text document after the text structure is restored is:

7. An OCR-based document version management method according to claim 1, characterized in that: the result post-processing is to perform four-corner coding correction, mapping table verification and special symbol verification on the document comparison result.

8. An OCR-based document version management system, characterized by: the system comprises an OCR text recognition module, a text structure restoration module, a comparison processing module, a result post-processing module and a result display module;