CN112115111A - OCR-based document version management method and system - Google Patents

Publication number
CN112115111A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910536932.0A
Other languages
Chinese (zh)
Inventor
宋嘉琪
张怀朋
于航
张智俊
郭庆河
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yingshisheng Information Technology Co ltd
Original Assignee
Shanghai Huairuo Intelligent Technology Co ltd
Application filed by Shanghai Huairuo Intelligent Technology Co ltd
Priority to CN201910536932.0A
Publication of CN112115111A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1873 Versioning file systems, temporal file systems, e.g. file system supporting different historic versions of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides an OCR-based document version management method and system, relating to the fields of optical character recognition and natural language processing. The method comprises the following steps. Step 1: performing OCR character recognition on a picture document to obtain a plain text document. Step 2: restoring the text structure of the plain text document. Step 3: comparing the plain text documents whose text structure has been restored to obtain a document comparison result. Step 4: post-processing the document comparison result and displaying it. The invention addresses two problems of existing document version management systems: they can compare only plain text documents and cannot compare picture documents; and when comparing text documents with complex structures, they treat the documents uniformly as plain text, which reduces the comparison precision to some extent and makes the result difficult to display.

Description

OCR-based document version management method and system
Technical Field
The invention relates to the fields of optical character recognition and natural language processing, and in particular to an OCR-based document version management method and system.
Background
Document version management is based on the application of Computer Vision (CV) in Optical Character Recognition (OCR) and on Natural Language Processing (NLP). Its main application scenarios include contract management, computer program code management, and changes to engineering plans or project requirements.
The core function of document version management is to compare the contents of different versions of a document, and the comparison process differs by application scenario. By comparison granularity, document comparison can be based on characters, words, or lines, and the three levels have different applications and implementations. For computer program code management, the code only needs to be compared line by line; for contract management, the contract contents need to be compared word by word; for documents that contain tables or other content with its own structure, the comparison must be made on the basis of the structured document. By carrier, documents are divided into plain text documents, such as Word and txt files, and picture documents, such as scanned PDFs and photographs; for picture documents, text comparison can be performed only after the characters have been recognized by OCR and the document structure has been restored.
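To make the three granularities concrete, the following minimal sketch compares the same pair of strings by character, by word and by line using Python's standard `difflib.SequenceMatcher`. The helper name `changed_units` and the whitespace tokenizer are assumptions made for illustration; they are not part of the described method.

```python
import difflib


def changed_units(old, new, granularity):
    """Return the non-equal opcodes when comparing at the given granularity."""
    if granularity == "char":
        a, b = list(old), list(new)
    elif granularity == "word":
        a, b = old.split(), new.split()
    elif granularity == "line":
        a, b = old.splitlines(), new.splitlines()
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


old = "pay 100 yuan\nwithin 30 days"
new = "pay 150 yuan\nwithin 30 days"
for level in ("char", "word", "line"):
    print(level, changed_units(old, new, level))
```

The same one-digit edit is reported as a single character, a single word, or a whole line, which is why the appropriate granularity depends on the application scenario.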
However, existing document version management systems have the following defects:
(1) Existing document version management systems compare plain text (editable) documents and cannot process picture documents.
(2) For text documents with complex structure (for example, tables, diagrams and other content with its own structure), existing document version management systems do not take the document's own structure into account and treat the document uniformly as plain text for comparison, which reduces the comparison precision for structured text documents to some extent and makes the result difficult to display.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention aims to provide an OCR-based document version management method and system that solve two problems: existing document version management systems can compare only plain text documents and cannot compare picture documents; and when comparing text documents with complex structures, they treat the documents uniformly as plain text, which reduces the comparison precision to some extent and makes the result difficult to present.
The invention provides an OCR-based document version management method, comprising the following steps:
Step 1: performing OCR character recognition on a picture document to obtain a plain text document;
Step 2: restoring the text structure of the plain text document;
Step 3: comparing the plain text documents whose text structure has been restored to obtain a document comparison result;
Step 4: post-processing the document comparison result and displaying the document comparison result.
Further, the OCR character recognition steps are as follows:
Step 1.1: performing image angle correction and image noise reduction on the picture document, and converting the picture document into single-channel image data;
Step 1.2: loading an OCR character recognition model, inputting the single-channel image data into the model for target detection, and, after table coordinates are obtained, dividing the picture document into a plurality of small pictures according to the table coordinates;
Step 1.3: loading an OCR character recognition model, and inputting the small pictures into the model for character recognition to obtain character recognition data;
Step 1.4: filtering, sorting and merging the character recognition data to obtain a plain text document.
Further, the text structure restoration comprises free text structure restoration and table detection.
Further, the steps of the free text structure restoration are as follows:
Step 2.1: determining the start and end positions of the paragraphs of the plain text document from its line-spacing, line-start and line-end features, and inserting line-break markers between paragraphs;
Step 2.2: detecting whether the plain text document contains a table of contents; if so, going to step 2.3; if not, going to step 2.4;
Step 2.3: recognizing the content of the table of contents, locating chapter positions according to the table of contents, and restoring the chapter structure of the picture document according to the chapter positions;
Step 2.4: locating chapter positions according to the title and line-spacing features of the plain text document, and restoring the chapter structure of the picture document according to the chapter positions.
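Step 2.1 can be illustrated with a minimal sketch. It assumes each recognized line arrives as a `(text, top, bottom, left)` tuple in reading order, and it starts a new paragraph when the vertical gap is much larger than the median gap or when the line is indented. The tuple layout, the thresholds `gap_factor` and `indent`, and the omission of the line-end feature are all assumptions of this sketch.

```python
def restore_paragraphs(lines, gap_factor=1.5, indent=2.0):
    """Join OCR lines into paragraphs separated by line-break markers.

    lines: list of (text, top, bottom, left) tuples in reading order.
    """
    if not lines:
        return ""
    # Vertical gap between each line and the one before it.
    gaps = [cur[1] - prev[2] for prev, cur in zip(lines, lines[1:])]
    typical = sorted(gaps)[len(gaps) // 2] if gaps else 0.0  # median gap
    margin = min(line[3] for line in lines)                  # left text margin
    paragraphs = [[lines[0][0]]]
    for (text, _top, _bottom, left), gap in zip(lines[1:], gaps):
        wide_gap = typical > 0 and gap > gap_factor * typical
        indented = left - margin >= indent
        if wide_gap or indented:
            paragraphs.append([text])      # this line opens a new paragraph
        else:
            paragraphs[-1].append(text)    # this line continues the paragraph
    return "\n".join(" ".join(p) for p in paragraphs)
```

For scripts written without spaces between words, such as Chinese, the lines of a paragraph would be joined with an empty string instead of a space.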
Further, the table detection steps are as follows:
Step 3.1: detecting and locating the intersections of horizontal and vertical lines in the plain text document, and sorting the intersections by priority along the x axis and y axis of a rectangular coordinate system;
Step 3.2: traversing all intersections of horizontal and vertical lines, taking the current intersection as the upper-left intersection of a candidate cell;
Step 3.3: determining whether an intersection exists to the right on the horizontal line through the upper-left intersection; if not, going to step 3.2;
Step 3.4: determining whether an intersection exists below on the vertical line through the right intersection; if not, going to step 3.2;
Step 3.5: determining whether an intersection exists to the left on the horizontal line through the lower intersection; if not, going to step 3.2;
Step 3.6: determining whether the upper-left intersection and the lower-left intersection lie on the same vertical line; if so, the candidate cell is established; otherwise it is not established; then going to step 3.2.
Further, the steps of comparing the plain text documents whose text structure has been restored are as follows:
Step 4.1: determining whether the picture document and the plain text document are empty documents; if so, reporting an exception and ending the comparison;
Step 4.2: determining whether the picture document and the plain text document are equal; if so, defining the state of the plain text document and ending the comparison;
Step 4.3: finding the longest identical prefix and the longest identical suffix of the picture document and the plain text document, and defining the state of the longest identical prefix and the longest identical suffix;
Step 4.4: removing the longest identical prefix and the longest identical suffix from the picture document and the plain text document, and searching for the longest identical subset;
Step 4.5: taking the longest identical subset as the boundary, dividing each of the picture document and the plain text document into a prefix, the subset and a suffix; repeating steps 4.1 to 4.5 with the two prefixes as input, and again with the two suffixes as input;
Step 4.6: ending the comparison if the length of any input is less than or equal to 1.
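The recursive comparison of steps 4.1 to 4.6 can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the "longest identical subset" is interpreted as the longest common contiguous substring and is found with the standard library's `SequenceMatcher.find_longest_match`; a missing (`None`) input stands in for the empty-document exception of step 4.1; the recursion stops when no common substring remains instead of at length 1; and the optional `deadline` argument and the state names are choices of this sketch.

```python
import time
from difflib import SequenceMatcher


def common_prefix_len(a, b):
    """Length of the longest identical prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def diff(a, b, deadline=None):
    """Return (state, text) pairs; state is 'no change', 'deletion' or 'addition'."""
    if a is None or b is None:                         # step 4.1
        raise ValueError("missing input document")
    if a == b:                                         # step 4.2
        return [("no change", a)] if a else []
    p = common_prefix_len(a, b)                        # step 4.3
    s = common_prefix_len(a[p:][::-1], b[p:][::-1])    # suffix of the remainder
    a_mid, b_mid = a[p:len(a) - s], b[p:len(b) - s]    # step 4.4
    out = [("no change", a[:p])] if p else []
    out += _split_on_longest_match(a_mid, b_mid, deadline)
    if s:
        out.append(("no change", a[len(a) - s:]))
    return out


def _split_on_longest_match(a, b, deadline):
    timed_out = deadline is not None and time.monotonic() > deadline
    match = None
    if a and b and not timed_out:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        match = matcher.find_longest_match(0, len(a), 0, len(b))
    if match is None or match.size == 0:               # nothing left to split on
        return ([("deletion", a)] if a else []) + ([("addition", b)] if b else [])
    # step 4.5: recurse on the parts before and after the common substring
    left = diff(a[:match.a], b[:match.b], deadline)
    right = diff(a[match.a + match.size:], b[match.b + match.size:], deadline)
    return left + [("no change", a[match.a:match.a + match.size])] + right
```

Concatenating every piece that is not an addition reproduces the first input, and concatenating every piece that is not a deletion reproduces the second, which is a convenient sanity check for any implementation of this scheme.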
Further, the result post-processing performs four-corner coding correction, mapping table verification and special symbol verification on the document comparison result.
An OCR-based document version management system comprises an OCR text recognition module, a text structure restoration module, a comparison module, a result post-processing module and a result display module;
the OCR text recognition module: performs OCR character recognition on a picture document to obtain a plain text document;
the text structure restoration module: restores the structure of the plain text document, restoring the chapter and paragraph structure of the picture document, and performs table detection;
the comparison module: compares the plain text documents whose text structure has been restored, and formats all differences between the picture document and the plain text document as an array;
the result post-processing module: performs four-corner coding correction, mapping table verification and special symbol verification on the picture document and the plain text document;
the result display module: displays the document comparison result of the picture document and the plain text document on a page.
As described above, the OCR-based document version management method and system of the invention have the following advantages:
1. To address the inability of current document version management systems to process picture documents, the invention compares picture documents by means of an OCR character recognition model. Combined with training of the OCR character recognition model on large-scale data, OCR character recognition can be performed with high precision, providing the plain text documents to be compared.
2. Through free text structure restoration and table detection, the invention restores the paragraph and table-of-contents structure of the plain text documents produced by OCR character recognition and performs table detection on text documents with complex structures, thereby improving the document comparison precision.
3. The invention coordinates the functional modules so that they cooperate with one another while each retains the ability to operate independently, thereby achieving low coupling and improving the working efficiency of the system.
Drawings
FIG. 1 is a flowchart showing a document version management method disclosed in an embodiment of the present invention.
FIG. 2 is a flow chart of OCR text recognition disclosed in the embodiments of the present invention.
FIG. 3 is a flow chart illustrating the recovery of the free-text structure disclosed in the embodiment of the present invention.
FIG. 4 is a flow chart of table detection disclosed in an embodiment of the present invention.
FIG. 5 is a flow chart showing document comparison disclosed in an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
As shown in FIG. 1, the present invention provides an OCR-based document version management method, comprising the following steps:
Step 1: performing OCR character recognition on a picture document to obtain a plain text document;
As shown in FIG. 2, the OCR character recognition steps are:
Step 1.1: performing image angle correction and image noise reduction on the picture document, and converting the picture document into single-channel image data;
Step 1.2: loading an OCR character recognition model, inputting the single-channel image data into the model for target detection, and, after table coordinates are obtained, dividing the picture document into a plurality of small pictures according to the table coordinates;
Step 1.3: loading an OCR character recognition model, and inputting the small pictures into the model for character recognition to obtain character recognition data;
Step 1.4: filtering, sorting and merging the character recognition data to obtain a plain text document.
OCR character recognition is the key to processing documents of any format and any structure. In particular, for picture documents such as photographs and scanned documents, a plain text document must first be obtained through OCR character recognition before the subsequent comparison can be performed.
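Step 1.4 (filtering, sorting and merging) can be sketched as follows. The `Recognized` record, the confidence threshold `min_score` and the rule that two boxes share a line when their top edges differ by at most half a box height are assumptions of this sketch; the source does not specify the filtering or sorting criteria.

```python
from dataclasses import dataclass


@dataclass
class Recognized:
    text: str
    x: float       # left edge of the text box
    y: float       # top edge of the text box
    height: float  # height of the text box
    score: float   # recognizer confidence in [0, 1]


def merge_to_plain_text(items, min_score=0.5):
    """Filter low-confidence boxes, sort into reading order, and join into lines."""
    kept = [it for it in items if it.score >= min_score and it.text.strip()]
    kept.sort(key=lambda it: (it.y, it.x))
    lines = []  # each entry is the list of boxes on one visual line
    for it in kept:
        # Same visual line if the top edge is close to that of the line's first box.
        if lines and abs(it.y - lines[-1][0].y) <= 0.5 * lines[-1][0].height:
            lines[-1].append(it)
        else:
            lines.append([it])
    return "\n".join(
        " ".join(box.text for box in sorted(line, key=lambda box: box.x))
        for line in lines
    )
```

The boxes on one line are re-sorted by their left edge because a slightly skewed scan can place a box further right at a smaller `y` than its left neighbour.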
Step 2: restoring the text structure of the plain text document, wherein the text structure restoration comprises free text structure restoration and table detection;
As shown in FIG. 3, the steps of the free text structure restoration are:
Step 2.1: determining the start and end positions of the paragraphs of the plain text document from its line-spacing, line-start and line-end features, and inserting line-break markers between paragraphs, thereby restoring the paragraphs;
Step 2.2: detecting whether the plain text document contains a table of contents; if so, going to step 2.3; if not, going to step 2.4;
Step 2.3: recognizing the content of the table of contents, locating chapter positions according to the table of contents, and restoring the chapter structure of the picture document according to the chapter positions;
Step 2.4: locating chapter positions according to the title and line-spacing features of the plain text document, and restoring the chapter structure of the picture document according to the chapter positions.
For documents that have their own structure, such as tables and diagrams, the plain text document obtained by OCR character recognition carries no chapter or paragraph structure, so the free text structure must be restored in order to obtain a more accurate comparison result.
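Steps 2.2 to 2.4 can be sketched with two regular expressions. Both patterns are hypothetical: `TOC_ENTRY` assumes that a table-of-contents line consists of a title, dot leaders and a page number, and `HEADING` assumes numbered headings such as "2.1 Terms" or "Chapter 4 Payment". A real system would also use the line-spacing features mentioned in step 2.4, which this text-only sketch ignores.

```python
import re

# Hypothetical heading pattern: "1 Title", "2.3 Title" or "Chapter 4 Title".
HEADING = re.compile(r"^(?:Chapter\s+\d+|\d+(?:\.\d+)*)\s+\S")
# Hypothetical table-of-contents entry: title, dot leaders, page number.
TOC_ENTRY = re.compile(r"^(?P<title>.+?)\s*\.{3,}\s*\d+$")


def locate_chapters(lines):
    """Return (title, line_index) pairs marking where each chapter begins."""
    toc = [(i, m.group("title").strip())
           for i, line in enumerate(lines)
           if (m := TOC_ENTRY.match(line.strip()))]
    if toc:
        # Step 2.3: a table of contents exists, so look its titles up in the body.
        body_start = toc[-1][0] + 1
        found = []
        for _, title in toc:
            for i in range(body_start, len(lines)):
                if lines[i].strip() == title:
                    found.append((title, i))
                    break
        return found
    # Step 2.4: no table of contents, so fall back to heading-like lines.
    return [(line.strip(), i) for i, line in enumerate(lines)
            if HEADING.match(line.strip())]
```

The heading pattern would also match body lines that happen to start with a number, which is one reason the described method combines title features with line spacing.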
As shown in FIG. 4, the table detection steps are:
Step 3.1: detecting and locating the intersections of horizontal and vertical lines in the plain text document, and sorting the intersections by priority along the x axis and y axis of a rectangular coordinate system;
Step 3.2: traversing all intersections of horizontal and vertical lines, taking the current intersection as the upper-left intersection of a candidate cell;
Step 3.3: determining whether an intersection exists to the right on the horizontal line through the upper-left intersection; if not, going to step 3.2;
Step 3.4: determining whether an intersection exists below on the vertical line through the right intersection; if not, going to step 3.2;
Step 3.5: determining whether an intersection exists to the left on the horizontal line through the lower intersection; if not, going to step 3.2;
Step 3.6: determining whether the upper-left intersection and the lower-left intersection lie on the same vertical line; if so, the candidate cell is established; otherwise it is not established; then going to step 3.2.
Table detection can serve as a basis for judging the tilt and distortion of a picture document. It is also important for OCR-based information extraction, where it can greatly improve recognition efficiency and the extraction rate.
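The cell search of steps 3.1 to 3.6 can be sketched over a set of intersection coordinates. The sketch assumes that the y axis grows downward, that coordinates have already been snapped so that points on the same ruling line share an exact x or y value, and that neighbouring intersections on a shared row or column are joined by a line; none of these details are stated in the source.

```python
def detect_cells(points):
    """Return table cells as (left, top, right, bottom) tuples.

    points: iterable of (x, y) intersections of horizontal and vertical lines.
    """
    pts = sorted(set(points), key=lambda p: (p[1], p[0]))   # step 3.1: by y, then x
    cells = []
    for x0, y0 in pts:                                      # step 3.2: upper-left candidate
        right = [x for x, y in pts if y == y0 and x > x0]   # step 3.3: look right
        if not right:
            continue
        x1 = min(right)
        below = [y for x, y in pts if x == x1 and y > y0]   # step 3.4: look down
        if not below:
            continue
        y1 = min(below)
        # Steps 3.5 and 3.6: the lower-left corner must exist on the
        # vertical line through the upper-left corner.
        if (x0, y1) in pts:
            cells.append((x0, y0, x1, y1))
    return cells
```

With real scans the equality tests would be replaced by comparisons within a pixel tolerance.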
Step 3: comparing the plain text documents whose text structure has been restored to obtain a document comparison result;
As shown in FIG. 5, the steps of comparing the plain text documents whose text structure has been restored are:
Step 4.1: determining whether the picture document and the plain text document are empty documents; if so, reporting an exception and ending the comparison;
Step 4.2: determining whether the picture document and the plain text document are equal; if so, defining the state of the plain text document as "no change" and ending the comparison;
Step 4.3: finding the longest identical prefix and the longest identical suffix of the picture document and the plain text document, and defining the state of the longest identical prefix and the longest identical suffix as "no change";
Step 4.4: removing the longest identical prefix and the longest identical suffix from the picture document and the plain text document, and searching for the longest identical subset;
Step 4.5: taking the longest identical subset as the boundary, dividing each of the picture document and the plain text document into a prefix, the subset and a suffix; repeating steps 4.1 to 4.5 with the two prefixes as input, and again with the two suffixes as input;
Step 4.6: ending the comparison if the length of any input is less than or equal to 1;
Step 4.7: if the comparison times out, outputting all differences found so far;
Step 4.8: integrating all differences in order, the difference result being in units of characters;
Step 4.9: if comparison in units of lines is required, splicing the character comparison results and, taking line breaks as demarcation points, merging all differences between two line-break characters into one text whose state is defined as "modification".
The plain text documents whose text structure has been restored are compared using the comparison algorithm, and all differences between the picture document and the plain text document are formatted as an array. The formatted difference result contains every longest identical subset of the documents together with its state, which is one of deletion, addition, modification, or no change; the longest identical subsets are measured in characters.
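The line-level merging of step 4.9 can be sketched as follows. The input format, a list of `(state, text)` pairs with the states "no change", "deletion" and "addition", is an assumption of this sketch, as is the simplification that a line break inside a deleted or added piece is kept within the merged "modification" text instead of starting a new row.

```python
def to_line_diff(pieces):
    """Merge character-level pieces into (state, old_line, new_line) rows."""
    rows = []
    old_buf, new_buf, changed = "", "", False

    def flush():
        nonlocal old_buf, new_buf, changed
        if old_buf or new_buf:
            state = "modification" if changed else "no change"
            rows.append((state, old_buf, new_buf))
        old_buf, new_buf, changed = "", "", False

    for state, text in pieces:
        if state == "no change":
            parts = text.split("\n")
            for i, part in enumerate(parts):
                old_buf += part
                new_buf += part
                if i < len(parts) - 1:   # an unchanged line break closes the line
                    flush()
        else:
            changed = True
            if state == "deletion":
                old_buf += text
            else:
                new_buf += text
    flush()
    return rows
```

Every character-level difference that falls between two unchanged line breaks ends up in a single row marked "modification", which matches the line-oriented view used for program code.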
Step 4: post-processing the document comparison result and displaying the document comparison result.
Specifically, the result post-processing performs four-corner coding correction, mapping table verification and special symbol verification on the document comparison result.
OCR character recognition is implemented with a deep-learning OCR character recognition model. The model is probabilistic and makes errors, so the comparison result of the plain text documents may contain erroneous content. Result post-processing is therefore required to keep OCR character recognition errors from appearing as erroneous content in the document comparison result.
Four-corner coding is an encoding scheme specific to Chinese characters. Four-corner coding correction compares the picture document and the plain text document at the erroneous content; if the code similarity is higher than a certain threshold, the document comparison result is corrected, removing erroneous content caused by OCR character recognition errors.
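A minimal sketch of the threshold test follows. The code table `FOUR_CORNER` holds a handful of illustrative placeholder entries that have not been checked against an authoritative table, and the position-wise similarity measure and the threshold of 0.8 are assumptions; the source specifies neither the similarity function nor the threshold value.

```python
# Illustrative placeholder codes only; a real system would load a complete,
# verified four-corner code table.
FOUR_CORNER = {"未": "50900", "末": "50900", "天": "10430", "夫": "50030"}


def code_similarity(c1, c2, table=FOUR_CORNER):
    """Fraction of matching digit positions between two characters' codes."""
    a, b = table.get(c1), table.get(c2)
    if a is None or b is None or len(a) != len(b):
        return 0.0  # unknown characters are never treated as similar
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_probable_ocr_confusion(deleted, inserted, threshold=0.8):
    """True when a deleted/added pair looks like the same text misread by OCR."""
    if not deleted or len(deleted) != len(inserted):
        return False
    return all(code_similarity(d, i) >= threshold
               for d, i in zip(deleted, inserted))
```

A difference flagged by `is_probable_ocr_confusion` would be dropped from the comparison result or marked for review, depending on how conservative the system needs to be.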
Because the OCR character recognition model also fits a language model, a plain text document recognized by OCR may contain latent, ambiguous characters or words. The plain text document is corrected by mapping table verification, removing erroneous content in the document comparison result caused by OCR character recognition errors.
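Mapping table verification can be sketched as a lookup of known misreadings. The entries in `CONFUSION_MAP` are hypothetical examples invented for this sketch; the source does not describe the contents or the format of the mapping table.

```python
# Hypothetical mapping table: known OCR misreadings -> intended text.
CONFUSION_MAP = {"c0ntract": "contract", "rnodel": "model", "l00": "100"}


def apply_mapping_table(text, table=CONFUSION_MAP):
    """Replace every known misreading in text with its intended form."""
    for wrong, right in table.items():
        text = text.replace(wrong, right)
    return text


def differs_after_mapping(old, new, table=CONFUSION_MAP):
    """True when two fragments still differ after both have been corrected."""
    return apply_mapping_table(old, table) != apply_mapping_table(new, table)
```

A reported difference for which `differs_after_mapping` returns `False` is attributable to a known misreading and can be removed from the comparison result.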
For special characters that do not carry key information, such as punctuation marks, spaces and line breaks, the plain text document is corrected by special symbol verification, removing erroneous content in the document comparison result caused by OCR character recognition errors.
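Special symbol verification can be sketched by normalizing both fragments before deciding whether a difference is real. The sketch uses the standard `unicodedata` module: NFKC normalization folds full-width forms to their ASCII equivalents, and characters in the Unicode punctuation categories, together with whitespace, are dropped. Treating every punctuation mark as non-key information is an assumption of this sketch.

```python
import unicodedata


def normalize_symbols(text):
    """Fold full-width forms, then drop whitespace and punctuation."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in folded
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def is_symbol_only_difference(old, new):
    """True when old and new differ only in punctuation or whitespace."""
    return old != new and normalize_symbols(old) == normalize_symbols(new)
```

In documents where punctuation is significant, for example a decimal point in an amount, such characters would have to be excluded from the filter.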
To give the document comparison result a better visual presentation, the differences between the compared files are displayed directly and clearly on a document comparison page.
The invention also provides an OCR-based document version management system, implemented on the basis of the above method, comprising an OCR text recognition module, a text structure restoration module, a comparison module, a result post-processing module and a result display module;
the OCR text recognition module: performs OCR character recognition on a picture document to obtain a plain text document;
the text structure restoration module: restores the structure of the plain text document, restoring the chapter and paragraph structure of the picture document, and performs table detection;
the comparison module: compares the plain text documents whose text structure has been restored, and formats all differences between the picture document and the plain text document as an array;
the result post-processing module: performs four-corner coding correction, mapping table verification and special symbol verification on the picture document and the plain text document;
the result display module: displays the document comparison result of the picture document and the plain text document on a page.
The OCR text recognition module and the text structure restoration module are the basis that gives the system the ability to manage versions of documents of any format and any structure; they provide the ability to process picture documents and text documents with complex structures, respectively. The output of the text structure restoration module is the data source for the subsequent document comparison.
The comparison module is the core processing module of the system. It supports comparison by line or by character, depending on the comparison requirements of the different document versions, and is efficient enough to meet real-time requirements.
The result post-processing module exists because an OCR character recognition engine is introduced: it checks and corrects the output of the OCR character recognition model to keep the system running smoothly.
The result display module feeds the final comparison result output by the result post-processing back to the user through a direct and clear page.
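The module chain described above can be sketched as five interchangeable callables wired together by a small coordinator. The class name `VersionManager`, the field names and the `(state, text)` difference format are assumptions of this sketch; it only illustrates how low coupling between the modules might be expressed in code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Diff = List[Tuple[str, str]]  # (state, text) pairs


@dataclass
class VersionManager:
    recognize: Callable[[bytes], str]     # OCR text recognition module
    restore: Callable[[str], str]         # text structure restoration module
    compare: Callable[[str, str], Diff]   # comparison module
    postprocess: Callable[[Diff], Diff]   # result post-processing module
    display: Callable[[Diff], str]        # result display module

    def run(self, old_image: bytes, new_image: bytes) -> str:
        """Pass two picture documents through all five modules in order."""
        old_text = self.restore(self.recognize(old_image))
        new_text = self.restore(self.recognize(new_image))
        return self.display(self.postprocess(self.compare(old_text, new_text)))
```

Because each module is an ordinary callable, any one of them can be replaced or tested in isolation, for example by substituting a stub for the OCR step.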
In conclusion, the invention solves two problems of existing document version management systems: they can compare only plain text documents and cannot compare picture documents; and when comparing text documents with complex structures, they treat the documents uniformly as plain text, which reduces the comparison precision to some extent and makes the result difficult to display. The invention therefore effectively overcomes these defects of the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (8)

1. An OCR-based document version management method, comprising the steps of:
Step 1: performing OCR character recognition on a picture document to obtain a plain text document;
Step 2: restoring the text structure of the plain text document;
Step 3: comparing the plain text documents whose text structure has been restored to obtain a document comparison result;
Step 4: post-processing the document comparison result and displaying the document comparison result.
2. The OCR-based document version management method according to claim 1, wherein the OCR character recognition steps are:
Step 1.1: performing image angle correction and image noise reduction on the picture document, and converting the picture document into single-channel image data;
Step 1.2: loading an OCR character recognition model, inputting the single-channel image data into the model for target detection, and, after table coordinates are obtained, dividing the picture document into a plurality of small pictures according to the table coordinates;
Step 1.3: loading an OCR character recognition model, and inputting the small pictures into the model for character recognition to obtain character recognition data;
Step 1.4: filtering, sorting and merging the character recognition data to obtain a plain text document.
3. An OCR-based document version management method according to claim 1, characterized in that: the text structure reduction comprises free text structure reduction and table detection.
4. The OCR-based document version management method according to claim 3, wherein the free-text structure restoration comprises the following steps:
step 2.1: determining the start and end positions of the paragraphs of the plain text document according to the line spacing, line-start and line-end features of the plain text document, and inserting line-feed marks between the paragraphs;
step 2.2: detecting whether the plain text document contains a table of contents; if so, going to step 2.3, and if not, going to step 2.4;
step 2.3: recognizing the content of the table of contents, locating chapter positions according to the table of contents, and restoring the chapter structure of the picture-type document according to the chapter positions;
step 2.4: locating chapter positions according to the title and line spacing features of the plain text document, and restoring the chapter structure of the picture-type document according to the chapter positions.
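The paragraph detection of step 2.1 can be sketched as follows. The claim names three cues (line spacing, line start, line end) but gives no thresholds or data layout, so the `Line` record and every numeric default here are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical per-line layout record; the fields are assumptions."""
    text: str
    left: float       # x position where the line starts
    right: float      # x position where the line ends
    gap_above: float  # vertical distance to the previous line


def restore_paragraphs(lines, page_left=0.0, page_right=100.0,
                       indent=2.0, short=10.0, max_gap=1.5):
    """Join OCR lines into paragraphs separated by line feeds."""
    paragraphs = []
    for i, line in enumerate(lines):
        prev = lines[i - 1] if i else None
        starts_new = (
            prev is None
            or line.left - page_left >= indent    # line-start cue: indented
            or page_right - prev.right >= short   # line-end cue: previous line ended early
            or line.gap_above > max_gap           # line-spacing cue: extra vertical gap
        )
        if starts_new:
            paragraphs.append(line.text)
        else:
            paragraphs[-1] += line.text
    return "\n".join(paragraphs)
```

Chapter location (steps 2.2 to 2.4) is not covered by this sketch; it would build on the same line features plus table-of-contents or title matching.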
5. The OCR-based document version management method according to claim 3, wherein the table detection comprises the following steps:
step 3.1: detecting and locating the intersection points of horizontal lines and vertical lines in the plain text document, and sorting the intersection points by priority along the x axis and the y axis of a rectangular coordinate system;
step 3.2: traversing all the intersection points of the horizontal lines and vertical lines, and taking the current intersection point as the upper-left intersection point of a candidate cell;
step 3.3: determining whether an intersection point exists to the right on the horizontal line on which the upper-left intersection point lies, and if not, returning to step 3.2;
step 3.4: determining whether an intersection point exists below on the vertical line on which the right intersection point lies, and if not, returning to step 3.2;
step 3.5: determining whether an intersection point exists to the left on the horizontal line on which the lower intersection point lies, and if not, returning to step 3.2;
step 3.6: determining whether the upper-left intersection point and the lower-left intersection point lie on the same vertical line; if so, the candidate cell is established; otherwise, the candidate cell is not established; and returning to step 3.2.
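The cell search of steps 3.1 to 3.6 can be sketched over a plain set of intersection coordinates. This is a simplified illustration: it assumes the intersections are already detected as exact `(x, y)` pairs, that the ruling lines are continuous between neighbouring intersections, and that y grows downward. None of these assumptions is stated in the claim.

```python
def find_cells(points):
    """Return cells as (left, top, right, bottom) tuples.

    `points` is an iterable of (x, y) intersection coordinates.
    """
    # Step 3.1: order the intersections, here by y first and then by x.
    pts = sorted(set(points), key=lambda p: (p[1], p[0]))
    cells = []
    # Step 3.2: try every intersection as the upper-left corner.
    for x0, y0 in pts:
        # Step 3.3: nearest intersection to the right on the same horizontal line.
        rights = [x for x, y in pts if y == y0 and x > x0]
        if not rights:
            continue
        x1 = min(rights)
        # Step 3.4: nearest intersection below on that vertical line.
        belows = [y for x, y in pts if x == x1 and y > y0]
        if not belows:
            continue
        y1 = min(belows)
        # Step 3.5: nearest intersection to the left on the lower horizontal line.
        lefts = [x for x, y in pts if y == y1 and x < x1]
        if not lefts:
            continue
        x2 = max(lefts)
        # Step 3.6: the lower-left corner must share the upper-left corner's vertical line.
        if x2 == x0:
            cells.append((x0, y0, x1, y1))
    return cells
```

For a single row of two cells, `find_cells([(0, 0), (10, 0), (20, 0), (0, 5), (10, 5), (20, 5)])` yields the two rectangles `(0, 0, 10, 5)` and `(10, 0, 20, 5)`. Coordinates from a real scan would need a tolerance instead of exact equality.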
6. The OCR-based document version management method according to claim 1, wherein comparing the plain text documents whose text structure has been restored comprises the following steps:
step 4.1: determining whether the picture-type document and the plain text document are stored as empty documents; if so, reporting an exception and ending the comparison;
step 4.2: determining whether the picture-type document and the plain text document are equal; if so, defining the state of the plain text document and ending the comparison;
step 4.3: finding the longest identical prefix and the longest identical suffix of the picture-type document and the plain text document, and defining the states of the longest identical prefix and the longest identical suffix;
step 4.4: removing the longest identical prefix and the longest identical suffix from the picture-type document and the plain text document, and finding the longest identical subset;
step 4.5: dividing each of the picture-type document and the plain text document into a prefix, the subset and a suffix, with the longest identical subset as the boundary; repeating steps 4.1 to 4.5 with the prefixes of the picture-type document and the plain text document as input, and repeating steps 4.1 to 4.5 with the suffixes of the picture-type document and the plain text document as input;
step 4.6: ending the comparison if the length of any input is less than or equal to 1.
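The recursive comparison of claim 6 can be sketched on two strings. This is an illustrative reading rather than the patented implementation: the "longest identical subset" is interpreted here as the longest common contiguous substring and is found with the standard library's `difflib.SequenceMatcher.find_longest_match`; the tuple labels `equal`, `delete` and `insert` are names chosen for the example; and the recursion stops when either side is empty or no common substring remains, which differs slightly from the length test in step 4.6.

```python
from difflib import SequenceMatcher


def compare_documents(a, b):
    """Return a list of (state, text) tuples describing how a becomes b."""
    if not a and not b:
        # Step 4.1: two empty documents are reported as an exception.
        raise ValueError("both documents are empty")
    return _diff(a, b)


def _diff(a, b):
    if not a and not b:
        return []
    if a == b:                      # step 4.2
        return [("equal", a)]
    if not a:
        return [("insert", b)]
    if not b:
        return [("delete", a)]
    # Step 4.3: longest identical prefix and suffix.
    limit = min(len(a), len(b))
    p = 0
    while p < limit and a[p] == b[p]:
        p += 1
    s = 0
    while s < limit - p and a[-1 - s] == b[-1 - s]:
        s += 1
    if p or s:
        head = [("equal", a[:p])] if p else []
        tail = [("equal", a[len(a) - s:])] if s else []
        return head + _diff(a[p:len(a) - s], b[p:len(b) - s]) + tail
    # Step 4.4: longest identical substring of what remains.
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    if m.size == 0:
        return [("delete", a), ("insert", b)]
    # Step 4.5: recurse on the parts before and after the shared substring.
    return (_diff(a[:m.a], b[:m.b])
            + [("equal", a[m.a:m.a + m.size])]
            + _diff(a[m.a + m.size:], b[m.b + m.size:]))
```

For example, `compare_documents("abc", "abXc")` returns `[("equal", "ab"), ("insert", "X"), ("equal", "c")]`. Concatenating the `equal` and `delete` parts always reproduces the first input, and concatenating the `equal` and `insert` parts reproduces the second.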
7. The OCR-based document version management method according to claim 1, wherein the result post-processing comprises performing four-corner code correction, mapping table verification and special symbol verification on the document comparison result.
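One way to picture the post-processing of claim 7 is as a filter that decides whether a reported difference is merely OCR noise. The sketch below is an assumption-laden illustration: the `CONFUSABLE` table is a tiny hypothetical stand-in for the claim's mapping table, special symbol verification is approximated by Unicode NFKC normalization (which folds full-width punctuation into its half-width form), and four-corner code lookup is not implemented because the claim gives no code table.

```python
import unicodedata

# Hypothetical mapping table of visually similar character pairs.
CONFUSABLE = {
    frozenset(("0", "O")),
    frozenset(("1", "l")),
    frozenset(("己", "已")),
}


def normalize_symbols(text):
    """Fold full-width and compatibility symbols into a canonical form."""
    return unicodedata.normalize("NFKC", text)


def is_ocr_noise(deleted, inserted):
    """Return True if a deleted/inserted pair differs only by
    symbol width or by characters listed as confusable."""
    d, i = normalize_symbols(deleted), normalize_symbols(inserted)
    if d == i:
        return True
    if len(d) != len(i):
        return False
    return all(x == y or frozenset((x, y)) in CONFUSABLE
               for x, y in zip(d, i))
```

A difference flagged by `is_ocr_noise` could then be suppressed or marked as low priority before display; a four-corner code comparison would play a similar role for Chinese characters of similar shape.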
8. An OCR-based document version management system, characterized in that the system comprises an OCR text recognition module, a text structure restoration module, a comparison processing module, a result post-processing module and a result display module;
the OCR text recognition module: performs OCR character recognition on a picture-type document to obtain a plain text document;
the text structure restoration module: performs structure restoration on the plain text document, restores the chapter and paragraph structure of the picture-type document, and performs table detection;
the comparison processing module: compares the plain text documents whose text structure has been restored, and formats all differences between the picture-type document and the plain text document as an array;
the result post-processing module: performs four-corner code correction, mapping table verification and special symbol verification on the picture-type document and the plain text document;
the result display module: displays the document comparison result of the picture-type document and the plain text document on a page.
CN201910536932.0A 2019-06-20 2019-06-20 OCR-based document version management method and system Pending CN112115111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910536932.0A CN112115111A (en) 2019-06-20 2019-06-20 OCR-based document version management method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910536932.0A CN112115111A (en) 2019-06-20 2019-06-20 OCR-based document version management method and system

Publications (1)

Publication Number Publication Date
CN112115111A true CN112115111A (en) 2020-12-22

Family

ID=73796748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910536932.0A Pending CN112115111A (en) 2019-06-20 2019-06-20 OCR-based document version management method and system

Country Status (1)

Country Link
CN (1) CN112115111A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800719A (en) * 2020-12-28 2021-05-14 北京思题科技有限公司 Electronic document structuring method
CN113111864A (en) * 2021-05-13 2021-07-13 上海巽联信息科技有限公司 Intelligent table extraction algorithm based on multiple modes
CN113704214A (en) * 2021-08-27 2021-11-26 北京市律典通科技有限公司 Electronic file type conversion method and device and computer equipment
CN114021543A (en) * 2022-01-05 2022-02-08 杭州实在智能科技有限公司 Document comparison analysis method and system based on table structure analysis
US11854287B2 (en) 2021-11-23 2023-12-26 International Business Machines Corporation Visual mode image comparison

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101676930A (en) * 2008-09-17 2010-03-24 北大方正集团有限公司 Method and device for recognizing table cells in scanned image
CN102567300A (en) * 2011-12-29 2012-07-11 方正国际软件有限公司 Picture document processing method and device
CN105718554A (en) * 2016-01-19 2016-06-29 深圳市天朗时代科技有限公司 Document collaboration conversion method and system
CN106250830A (en) * 2016-07-22 2016-12-21 浙江大学 Digital book structured analysis processing method
CN107025460A (en) * 2016-08-17 2017-08-08 广州市力融计算机技术有限公司 The system and method for improving contract management level and efficiency
CN107451582A (en) * 2017-07-13 2017-12-08 安徽声讯信息技术有限公司 A kind of graphics context identifying system and its recognition methods
CN108038426A (en) * 2017-11-29 2018-05-15 阿博茨德(北京)科技有限公司 The method and device of chart-information in a kind of extraction document
CN108734089A (en) * 2018-04-02 2018-11-02 腾讯科技(深圳)有限公司 Identify method, apparatus, equipment and the storage medium of table content in picture file
CN109190092A (en) * 2018-08-15 2019-01-11 深圳平安综合金融服务有限公司上海分公司 The consistency checking method of separate sources file
CN109344355A (en) * 2018-09-26 2019-02-15 北京因特睿软件有限公司 Automatic returning detection and Block- matching adaptive approach and device for Web evolution
WO2019041526A1 (en) * 2017-08-31 2019-03-07 平安科技(深圳)有限公司 Method of extracting chart in document, electronic device and computer-readable storage medium
CN109446487A (en) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 A kind of method and device parsing portable document format document table
CN109543525A (en) * 2018-10-18 2019-03-29 成都中科信息技术有限公司 A kind of table extracting method of form of general use image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
卞静潇: "复杂版面文档图像表格与图的提取及分析", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Similar Documents

Publication Publication Date Title
CN112115111A (en) OCR-based document version management method and system
US7730050B2 (en) Information retrieval apparatus
US9384172B2 (en) Multi-level list detection engine
US6721451B1 (en) Apparatus and method for reading a document image
JP4071328B2 (en) Document image processing apparatus and method
JP5402099B2 (en) Information processing system, information processing apparatus, information processing method, and program
US8838657B1 (en) Document fingerprints using block encoding of text
US20090317003A1 (en) Correcting segmentation errors in ocr
JP6122800B2 (en) Electronic device, character string display method, and character string display program
US20220222292A1 (en) Method and system for ideogram character analysis
CN105302626B (en) Analytic method of XPS (XPS) structured data
CN110263792B (en) Image recognizing and reading and data processing method, intelligent pen, system and storage medium
KR20150099936A (en) Method and apparatus for applying an alternate font for maintaining document layout
JP5380040B2 (en) Document processing device
CN111310426A (en) Form format recovery method and device based on OCR and storage medium
US7406201B2 (en) Correcting segmentation errors in OCR
US9323726B1 (en) Optimizing a glyph-based file
US20130121583A1 (en) Handwritten character recognition based on frequency variations in characters
US8526744B2 (en) Document processing apparatus and computer readable medium
WO2021143058A1 (en) Image-based information comparison method, apparatus, electronic device, and computer-readable storage medium
CN105677718A (en) Character retrieval method and apparatus
JP4470913B2 (en) Character string search device and program
JPWO2016170691A1 (en) Input processing program, input processing apparatus, input processing method, character specifying program, character specifying apparatus, and character specifying method
CN116860747A (en) Training sample generation method and device, electronic equipment and storage medium
US20180032244A1 (en) Input control device, input control method, character correction device, and character correction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20230427
Address after: Room 3701, Building T2, Shenye Shangcheng (South District), No. 5001 Huanggang Road, Lianhua Yicun Community, Huafu Street, Futian District, Shenzhen City, Guangdong Province, 518035
Applicant after: Shenzhen yingshisheng Information Technology Co.,Ltd.
Address before: Room 823, 2 / F, 148 Lane 999, XINER Road, Baoshan District, Shanghai
Applicant before: Shanghai Huairuo Intelligent Technology Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20201222