CN115965988A - Text comparison method and device based on optical character recognition - Google Patents

Text comparison method and device based on optical character recognition

Info

Publication number
CN115965988A
CN115965988A (application CN202211725806.8A)
Authority
CN
China
Prior art keywords: text, comparison, object list, result, coordinate information
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202211725806.8A
Other languages
Chinese (zh)
Inventor
涂洪健
王巍
程海波
李捷
罗兴奕
徐柯文
王慧
厉超
蔡苹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pudong Development Bank Co Ltd filed Critical Shanghai Pudong Development Bank Co Ltd
Priority to CN202211725806.8A
Publication of CN115965988A

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a text comparison method and device based on optical character recognition. In one embodiment, the results of performing optical character recognition on pictures of a target file may be obtained, and the recognition results classified so that text and tables are compared separately. After classification, the recognition results of each category are combined according to their coordinate information to obtain a text line object list containing the text content and a table row object list containing the table content; both lists are put through character-format processing to obtain comparison strings matching the format expected by the comparison algorithm, and the strings are compared by the algorithm to obtain the comparison result. In this way, the text content of different pages can be integrated into the text line object list and the table content of different pages into the table row object list, characters are compared sequentially line by line, single-page comparison is unnecessary, and both the efficiency and the precision of file comparison are improved.

Description

Text comparison method and device based on optical character recognition
Technical Field
The present disclosure relates to the field of text comparison, and in particular, to a text comparison method based on optical character recognition, a computer device, a computer-readable storage medium, and a computer program product.
Background
Nowadays information changes rapidly, and enterprises, schools, research institutions, and individuals handle and use ever more documents in their production and daily activities. In practice, a document is often modified many times, and its file format may change at different auditing nodes. Even slight differences in important files can have serious consequences, especially since many files are converted between text documents and other carrier forms such as electronic images, paper manuscripts, and scanned files. Limited by the accuracy of technologies such as OCR (Optical Character Recognition), missing or abnormal content is difficult to avoid during such conversion. It is therefore necessary to compare the text content of different versions of a file.
Existing text comparison approaches include manual comparison, but manual comparison is time-consuming and labor-intensive, and its accuracy cannot be guaranteed. Alternatively, a document in picture form can be converted into character text by OCR and then compared with a comparison algorithm. However, this method is only suitable for single-page comparison: when the documents to be compared have different numbers of pages, the comparison precision drops sharply and the comparison of subsequent pages is seriously affected. The same problem arises when pages are out of order, because the pages being compared no longer correspond.
Disclosure of Invention
In view of the above technical problem, a text comparison method based on optical character recognition, a computer device, a computer-readable storage medium, and a computer program product are provided. The technical scheme of the disclosure is as follows:
according to an aspect of the embodiments of the present disclosure, there is provided a text comparison method based on optical character recognition, including:
acquiring a recognition result obtained by performing optical character recognition on a target file, and coordinate information corresponding to the recognition result;
classifying the recognition results to obtain text recognition results and table recognition results;
combining the text recognition results according to the coordinate information to generate a text line object list;
combining the table identification results according to the coordinate information to generate a table row object list;
respectively carrying out character format processing on the text line object list and the table line object list to obtain text pre-comparison characters and table pre-comparison characters;
respectively comparing the text pre-comparison characters with the table pre-comparison characters through a preset character comparison algorithm to obtain a text comparison result and a table comparison result; the text comparison result is used for representing text difference information in the target file; the table comparison result is used for representing table difference information in the target file.
In one embodiment, the combining the text recognition results according to the coordinate information to generate a text line object list includes:
arranging the text recognition results of the same line into a line in sequence according to the coordinate information to obtain a plurality of text line objects;
acquiring the line number of the text recognition result according to the coordinate information;
and combining the text line objects according to the line numbers to obtain a text line object list.
In one embodiment, after obtaining the text comparison result and the table comparison result, the method further includes:
inquiring the text difference information in the text comparison result;
acquiring an index line number of the text difference information in the text line object list according to the text comparison result, and determining a text line object corresponding to the index line number;
determining a text recognition result corresponding to the text difference information according to the text line object, and obtaining corresponding first coordinate information;
and recording the first coordinate information to the text comparison result.
In one embodiment, the combining the table identification results according to the coordinate information to generate a table row object list includes:
restoring the form identification result according to the coordinate information to obtain a restored form;
merging cells in the same row in the reduction table to obtain a plurality of table row objects;
and combining the plurality of table row objects according to the coordinate information to obtain a table row object list.
In one embodiment, after obtaining the text comparison result and the table comparison result, the method further includes:
inquiring the table difference information in the table comparison result;
according to the table comparison result, a table row number of the table difference information in the table row object list is obtained, and a table row object corresponding to the table row number is determined;
determining a table identification result corresponding to the table difference information according to the table row object, and obtaining corresponding second coordinate information;
and recording the second coordinate information to the table comparison result.
In one embodiment, after obtaining the text comparison result and the table comparison result, the method further includes:
merging the target files to enable texts and/or tables with corresponding relations in different files to be displayed in the same preset area;
determining the position information of the text difference information and/or the table difference information in the target file according to the coordinate information;
adding the text difference information and/or the table difference information to the preset area according to the position information;
and generating a visual difference report file according to the content displayed in the preset area.
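The disclosure does not prescribe a format for the visual difference report file. As one hedged sketch, Python's standard difflib.HtmlDiff can merge two files' line strings into a single side-by-side HTML table (one "preset area" in which corresponding lines of both files are displayed) with differences highlighted; the file name and line contents below are illustrative assumptions:

```python
import difflib

# Illustrative line strings from two versions of the target file.
file_a = ["Party A: Shanghai Example Co.", "Amount: 100 yuan"]
file_b = ["Party A: Shanghai Example Co.", "Amount: 200 yuan"]

# HtmlDiff lays both versions out side by side in one HTML table,
# highlighting the differing characters in place.
html = difflib.HtmlDiff().make_file(
    file_a, file_b, fromdesc="version 1", todesc="version 2"
)

with open("diff_report.html", "w", encoding="utf-8") as f:
    f.write(html)
```

Opening diff_report.html in a browser shows both versions aligned row by row, which matches the "same preset area" display the embodiment describes.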
According to another aspect of the embodiments of the present disclosure, there is provided a text comparison apparatus based on optical character recognition, including:
the recognition result acquisition module is used for acquiring a recognition result obtained by performing optical character recognition on the target file and coordinate information corresponding to the recognition result;
the classification module is used for classifying the recognition results to obtain text recognition results and form recognition results;
the text processing module is used for combining the text recognition results according to the coordinate information to generate a text line object list;
the table processing module is used for combining the table identification results according to the coordinate information to generate a table row object list;
the format processing module is used for respectively carrying out character format processing on the text line object list and the table line object list to obtain text pre-comparison characters and table pre-comparison characters;
the character comparison module is used for respectively comparing the text pre-comparison characters with the table pre-comparison characters through a preset character comparison algorithm to obtain a text comparison result and a table comparison result; the text comparison result is used for representing text difference information in the target file; the table comparison result is used for representing table difference information in the target file.
According to another aspect of the embodiments of the present disclosure, there is also provided a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
According to another aspect of the embodiments of the present disclosure, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
According to another aspect of the embodiments of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above method.
According to the technical scheme, the results of performing optical character recognition on pictures of the target file may be obtained, and the recognition results classified so that text and tables are compared separately. After classification, the recognition results of each category are combined according to their coordinate information to obtain a text line object list containing the text content and a table row object list containing the table content; both lists are put through character-format processing to obtain comparison strings matching the format expected by the comparison algorithm, and the strings are compared by the algorithm to obtain the comparison result. In this way, the text content of different pages can be integrated into the text line object list and the table content of different pages into the table row object list, characters are compared sequentially line by line, single-page comparison is unnecessary, and both the efficiency and the precision of file comparison are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present specification and the technical solutions of the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in the specification, and those skilled in the art can obtain other drawings from them without inventive labor.
FIG. 1 is a flow diagram of a method for text alignment based on optical character recognition, according to an embodiment;
FIG. 2 is a schematic flow diagram that illustrates the generation of a text line object list based on text recognition results, under an embodiment;
FIG. 3 is a flow diagram that illustrates locating text difference information, in one embodiment;
FIG. 4 is a flow diagram that illustrates the generation of a table row object list based on table identification in one embodiment;
FIG. 5 is a diagram illustrating locating table difference information, in one embodiment;
FIG. 6 is a schematic diagram of generating a visual discrepancy report file in one embodiment;
FIG. 7 is a block diagram of an apparatus for comparing text based on optical character recognition according to an embodiment;
FIG. 8 is a diagram illustrating an internal architecture of a computer device, according to one embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in processes, methods, articles, or apparatus that include the recited elements is not excluded. For example, if the terms first, second, etc. are used to denote names, they do not denote any particular order.
As used herein, the terms "vertical," "horizontal," "left," "right," "upper," "lower," "front," "rear," "circumferential," "direction of travel," and the like are based on the orientations and positional relationships illustrated in the drawings and are intended to facilitate the description of the invention and to simplify the description, but do not indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the invention.
Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the terms "and/or" and "at least one of …" include any and all combinations of one or more of the associated listed items. It should be noted that the connections described in this disclosure may be direct connections through interfaces or pins between devices, connections through wires, or wireless (communication) connections.
Nowadays it is often unavoidable to convert documents into picture formats: many documents must be printed, signed, and stamped, then scanned page by page into pictures, and finally the pages are integrated into a picture-format file such as a TIF (Tag Image File Format, also written TIFF) file or a PDF (Portable Document Format) file. However, when two files in picture format are compared, existing methods can only compare page by page. That approach suits files with the same number of pages, comparing them one by one by page number, but its efficiency and precision are low; once the page counts differ or pages are out of order, the comparison precision drops sharply and the result is severely affected. These problems are exacerbated when more than two documents are compared, because the probability of differing or misordered pages increases.
In view of the above problem, according to an aspect of the embodiments of the present disclosure, as shown in fig. 1, there is provided a text comparison method based on optical character recognition, including:
step S202, obtaining an identification result obtained by carrying out optical character identification on the target file and coordinate information corresponding to the identification result.
The target files are the files to be compared, and there are at least two of them. For example, in some embodiments the target files may be two files in TIF format. The recognition results are the textual contents recognized from the target file by OCR technology, and each recognition result may include words and/or characters.
Specifically, the target file may be split into pictures by page and then subjected to OCR. Each page picture may yield a plurality of recognition results, and the position of each recognition result within the picture is recorded as its coordinate information. For example, the picture may be given a planar coordinate system with some point as the origin, and coordinates assigned according to the positional relationship between the recognition result and the origin. In some embodiments, the coordinate information may further include the page number of the picture corresponding to the recognition result. In some other embodiments, if the target file is a PDF file, it may be converted page by page into pictures in another format before OCR is performed.
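As a minimal sketch of the data this step produces, each recognition result can be paired with its coordinate information. The field names below (text, page, and a pixel bounding box) are illustrative assumptions, not the patent's actual data model:

```python
from dataclasses import dataclass

@dataclass
class RecognitionResult:
    """One OCR recognition result plus its coordinate information."""
    text: str  # the recognized words and/or characters
    page: int  # page number of the source picture
    x: int     # left edge of the bounding box, in pixels
    y: int     # top edge of the bounding box, in pixels
    w: int     # bounding-box width
    h: int     # bounding-box height

# Two fragments recognized on page 1 of a target file (made-up values).
results = [
    RecognitionResult("Hello", page=1, x=10, y=20, w=50, h=12),
    RecognitionResult("world", page=1, x=70, y=21, w=48, h=12),
]
```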
And step S204, classifying the recognition results to obtain a text recognition result and a form recognition result.
Specifically, the recognition results may be classified according to the type of content they were recognized from in the picture: results corresponding to plain-text content are classified as text recognition results, and results corresponding to table content are classified as table recognition results. It should be noted that a recognition result is one or more words and/or characters obtained by OCR conversion of the picture, and a table recognition result is obtained by performing OCR on the words inside table content in the picture.
And step S206, combining the text recognition results according to the coordinate information to generate a text line object list.
Specifically, the text recognition results may be sorted and combined according to their coordinate information: the results belonging to the same line are arranged in order into a line of characters, each line of characters serves as a text line object, and the text line objects are then arranged top to bottom in their order in the picture to obtain the text line object list. In some other embodiments, corresponding line numbers and/or page numbers may also be added to the text line object list.
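A hedged Python sketch of this combination step, assuming each recognition result is a dict with "text", "page", "x", "y" keys (an illustrative shape) and that a small vertical tolerance decides whether two results sit on the same line:

```python
def build_text_line_objects(results, y_tol=5):
    """Group OCR recognition results into text line objects by their
    coordinates. y_tol is an assumed pixel tolerance for treating two
    results as lying on the same line."""
    lines = []
    for r in sorted(results, key=lambda r: (r["page"], r["y"])):
        for line in lines:
            if line["page"] == r["page"] and abs(line["y"] - r["y"]) <= y_tol:
                line["items"].append(r)
                break
        else:
            lines.append({"page": r["page"], "y": r["y"], "items": [r]})
    for line in lines:  # order the results of each line left to right
        line["items"].sort(key=lambda r: r["x"])
    return lines

fragments = [  # out-of-order recognition results (made-up values)
    {"text": "world", "page": 1, "x": 70, "y": 21},
    {"text": "Hello", "page": 1, "x": 10, "y": 20},
    {"text": "Bye",   "page": 1, "x": 10, "y": 60},
]
lines = build_text_line_objects(fragments)
# lines[0] holds "Hello" then "world"; lines[1] holds "Bye"
```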
And step S208, combining the table identification results according to the coordinate information to generate a table row object list.
Specifically, the table recognition results may be sorted and combined according to their coordinate information. Taking each table in the picture as a unit, the recognition results from the same row of the table are arranged in order into a line of characters, and each line of characters serves as a table row object. The table row objects are then arranged top to bottom in row order within each table, and all table row objects are combined in the order the tables appear in the full text to obtain the table row object list.
Step S210, respectively performing character format processing on the text line object list and the table line object list to obtain text pre-comparison characters and table pre-comparison characters.
Specifically, the two lists may be processed separately: the text of each line is spliced and converted to obtain a line-string list, arranged by line and in string format. The line-string list obtained from the text line object list constitutes the text pre-comparison characters, and the list obtained from the table row object list constitutes the table pre-comparison characters. In some embodiments, punctuation marks and the like may also be deleted according to the comparison requirements. It should be noted that the text line object list and table row object list obtained in steps S206 and S208 are not character strings and cannot be fed directly to the comparison algorithm; after splicing and conversion, character strings ordered by line are obtained.
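A minimal sketch of this character-format processing, assuming each line object is simply a list of recognized fragments; note that string.punctuation covers ASCII punctuation only, so full-width Chinese punctuation would need an extended translation table:

```python
import string

def to_pre_compare_strings(line_objects, drop_punct=True):
    """Splice each line object (a list of recognized fragments) into
    one string, yielding a per-line string list ready for a
    string-based comparison algorithm. Punctuation removal is the
    optional cleanup the embodiment mentions."""
    table = str.maketrans("", "", string.punctuation)
    out = []
    for fragments in line_objects:
        s = "".join(fragments)
        if drop_punct:
            s = s.translate(table)
        out.append(s)
    return out

line_objects = [["Hello,", "world!"], ["Amount:", "100"]]
pre_compare = to_pre_compare_strings(line_objects)
# pre_compare -> ["Helloworld", "Amount100"]
```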
Step S212, comparing the text pre-comparison characters with the table pre-comparison characters through a preset character comparison algorithm to obtain a text comparison result and a table comparison result; the text comparison result is used for representing text difference information in the target file; the table comparison result is used for representing table difference information in the target file.
Specifically, the text pre-comparison characters and the table pre-comparison characters may be used as input and compared by a preset character comparison algorithm to obtain a text comparison result for the text characters and a table comparison result for the table characters. The preset character comparison algorithm may be difflib (a text comparison module in the Python standard library), an edit distance algorithm, or another algorithm suitable for comparing characters or texts. The text comparison result contains the difference information of the text characters in the two files, and the table comparison result contains the difference information of the table characters; the difference information identifies the portions of the two files whose text content differs.
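Since the text names difflib as one possible preset character comparison algorithm, here is a minimal sketch of the line-level comparison using difflib.SequenceMatcher (the sample lines are invented):

```python
import difflib

def compare_lines(lines_a, lines_b):
    """Compare two per-line string lists and return the non-equal
    opcodes, i.e. the (tag, i1, i2, j1, j2) spans where they differ."""
    sm = difflib.SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    return [op for op in sm.get_opcodes() if op[0] != "equal"]

a = ["Total amount 100", "Signed by A"]
b = ["Total amount 200", "Signed by A"]
diffs = compare_lines(a, b)
# diffs -> [('replace', 0, 1, 0, 1)]: line 0 differs between the files
```

Because the inputs are whole lines rather than pages, the comparison is insensitive to page boundaries, which is the point of the line-object lists.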
According to the technical scheme, the results of performing optical character recognition on pictures of the target file may be obtained, and the recognition results classified so that text and tables are compared separately. After classification, the recognition results of each category are combined according to their coordinate information to obtain a text line object list containing the text content and a table row object list containing the table content; both lists are put through character-format processing to obtain comparison strings matching the format expected by the comparison algorithm, and the strings are compared by the algorithm to obtain the comparison result. In this way, the text content of different pages can be integrated into the text line object list and the table content of different pages into the table row object list, characters are compared sequentially line by line, single-page comparison is unnecessary, and both the efficiency and the precision of file comparison are improved.
In an embodiment, as shown in fig. 2, the combining the text recognition results according to the coordinate information to generate a text line object list includes:
step S2062, arranging the text recognition results of the same line into a line in sequence according to the coordinate information to obtain a plurality of text line objects.
Specifically, the position of each text recognition result in the picture may be determined from its coordinate information, and the results in the same line arranged and combined in order of those positions. Each line yields one text line object, and combining the lines separately yields as many text line objects as there are lines. Each text line object contains the line's text recognition results sorted left to right or right to left.
Step S2064, obtaining the line number of the text recognition result according to the coordinate information.
The line number indicates which line of the picture the text recognition result is located on.
Specifically, the line number of the text recognition result may be obtained according to the corresponding relationship between the coordinate information and the picture. In some embodiments, only the line number of any one of the text recognition results of the same line may be acquired.
Step S2066, combining the text line objects according to the line number to obtain a text line object list.
Specifically, a plurality of text line objects may be spliced together in line-number order to form a text line object list. The text line object list stores the text content of the target file.
In the above embodiment, the text recognition results may be combined into text line objects according to the coordinate information, and the text line objects further combined according to their line numbers to obtain the text line object list. The line characters of different pages can thus be compared through the text line object list, improving comparison efficiency. In addition, because comparison proceeds by line number, page-order errors cannot occur, which improves comparison precision.
In one embodiment, as shown in fig. 3, after obtaining the text comparison result and the table comparison result, the method further includes:
step S214, inquiring the text difference information in the text comparison result.
The text difference information may be difference information in which text contents in the two target files to be compared are inconsistent.
Specifically, after the comparison is completed, the difference information of the text content in the target file may be queried according to the comparison result.
Step S216, obtaining an index line number of the text difference information in the text line object list according to the text comparison result, and determining a text line object corresponding to the index line number.
And the index line number is the line number of the difference information obtained by comparison in the text line object list.
Specifically, after the text difference information is found, a line number of the content corresponding to the difference information in the text line object list may be obtained, and the text line object corresponding to the difference information may be determined according to the line number. In some other implementations, the text difference information may correspond to a plurality of text line objects.
Step S218, determining a text recognition result corresponding to the text difference information according to the text line object, and obtaining corresponding first coordinate information.
There may be multiple text recognition results corresponding to the text difference information.
Specifically, a text recognition result corresponding to the difference information may be determined in the text line object, and coordinate information of the text recognition result may be acquired.
Step S220, recording the first coordinate information to the text comparison result.
Specifically, the coordinate information of the text recognition result corresponding to the difference information may be recorded together with the corresponding difference information in the text comparison result. In some other embodiments, the coordinate information may be converted into a page number and a line number of the difference information in the target file, and the page number and the line number may be recorded together with the corresponding difference information.
In the above embodiment, after the comparison is completed, the difference information of the text comparison may be queried, and then the corresponding text line object is determined according to the difference information, and the coordinate information of the difference information is determined according to the text recognition result in the text line object. In this way, the coordinates of the difference information can be recorded together with the difference information, so that the position of the difference information in the target file can be quickly determined, and the difference information can be labeled in the target file.
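The back-tracing just described can be sketched as follows, assuming the comparison was done with difflib and that coords_a[i] records the coordinate information of line i of the first file; the (page, y) tuples are an illustrative shape, not the patent's actual record format:

```python
import difflib

def locate_differences(lines_a, lines_b, coords_a):
    """Find the differing lines of file A and attach the coordinate
    info recorded for each index line number, so each difference can
    later be annotated at its position in the source picture."""
    sm = difflib.SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    located = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        for i in range(i1, i2):  # index line numbers in file A
            located.append({"line": i, "text": lines_a[i],
                            "coord": coords_a[i]})
    return located

lines_a = ["clause one", "amount 100"]
lines_b = ["clause one", "amount 200"]
coords_a = [(1, 10), (1, 30)]  # assumed (page, y) per line
found = locate_differences(lines_a, lines_b, coords_a)
# found -> [{"line": 1, "text": "amount 100", "coord": (1, 30)}]
```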
In an embodiment, as shown in fig. 4, the combining the table identification results according to the coordinate information to generate a table row object list includes:
Step S2082, restoring the table identification result according to the coordinate information to obtain a restored table.
Specifically, according to the coordinate information of the table identification result, the original table in the target file picture can be reconstructed and restored by using the table identification result, so that a restored table which is the same as the table in the picture in terms of the text content and the arrangement sequence is obtained.
Step S2084, merging the cells in the same row of the restored table to obtain a plurality of table row objects.
Specifically, the cells in the same row of the restored table may be merged, the merged cells of each row form one table row object, and multiple rows of cells form multiple table row objects.
Step S2086, combining the plurality of table row objects according to the coordinate information, and obtaining a table row object list.
Specifically, the table row objects on the same page may be spliced, in the order indicated by the coordinate information, into a page list, and the page lists of different pages may then be spliced, in page-number order, into the table row object list. The table row object list contains the table text content of the target file.
In the above embodiment, the table recognition results may be combined into the table row object according to the coordinate information, and the table row object may be further combined according to the coordinate and the page number to obtain the table row object list. Therefore, the table characters of different pages can be compared through the table row object list, the comparison efficiency is improved, page errors can be avoided, and the comparison accuracy is improved.
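Steps S2082 through S2086 can be sketched as follows. The cell structure (a dict with page number, row y-coordinate, column x-coordinate, and text) and the grouping key are illustrative assumptions; a real table restorer would use the full bounding boxes from OCR.

```python
# Hypothetical sketch of steps S2082-S2086: grouping recognized cells into
# table row objects and combining rows across pages by page number and
# vertical coordinate.

def build_table_row_object_list(cells):
    """cells: dicts like {"page": 1, "row_y": 120, "col_x": 30, "text": "..."}."""
    # Steps S2082/S2084: group cells of the same page and row.
    rows = {}
    for cell in cells:
        rows.setdefault((cell["page"], cell["row_y"]), []).append(cell)
    row_objects = []
    for (page, row_y), row_cells in rows.items():
        row_cells.sort(key=lambda c: c["col_x"])          # left-to-right order
        row_objects.append({"page": page, "y": row_y,
                            "cells": [c["text"] for c in row_cells]})
    # Step S2086: combine rows page by page, top to bottom, into one list.
    row_objects.sort(key=lambda r: (r["page"], r["y"]))
    return row_objects


cells = [{"page": 2, "row_y": 10, "col_x": 5, "text": "B1"},
         {"page": 1, "row_y": 50, "col_x": 60, "text": "A2"},
         {"page": 1, "row_y": 50, "col_x": 5, "text": "A1"}]
table_rows = build_table_row_object_list(cells)
print([r["cells"] for r in table_rows])  # [['A1', 'A2'], ['B1']]
```

Because the final sort key is (page, y), rows from different pages end up in one continuous list, which is what allows cross-page table comparison.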
In one embodiment, as shown in fig. 5, after obtaining the text comparison result and the table comparison result, the method further includes:
step S222, querying the table difference information in the table comparison result.
The table difference information may be difference information of inconsistent table contents in the two target files to be compared.
Specifically, after the comparison is completed, the difference information of the table content in the target file may be queried according to the comparison result.
Step S224, obtaining a table row number of the table difference information in the table row object list according to the table comparison result, and determining a table row object corresponding to the table row number.
And the table row number is the row number of the table difference information obtained by comparison in the table row object list.
Specifically, after the table difference information is found, a row number of the content corresponding to the difference information in the table row object list may be obtained, and the table row object corresponding to the difference information may be determined according to the row number. In some other implementations, the table difference information may correspond to multiple table row objects.
Step S226, determining a table identification result corresponding to the table difference information according to the table row object, and obtaining corresponding second coordinate information.
There may be multiple table identification results corresponding to the table difference information.
Specifically, a table identification result corresponding to the difference information may be determined in the table row object, and coordinate information of the table identification result may be acquired.
Step S228, recording the second coordinate information to the table comparison result.
Specifically, the coordinate information of the table identification result corresponding to the difference information may be recorded in the table comparison result together with the corresponding difference information. In some other embodiments, the coordinate information may be converted into a page number and a line number of the difference information in the target file, and the page number and line number may be recorded together with the corresponding difference information.
In the above embodiment, after the comparison is completed, the difference information of the table comparison may be queried, the corresponding table row object may be determined according to the difference information, and the coordinate information of the difference information may be determined according to the table identification result in the table row object. In this way, the coordinates of the difference information and the difference information can be recorded together, so that the position of the table difference information in the target file can be conveniently and rapidly determined, and the difference information can be labeled in the target file.
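The coordinate-to-page/line conversion mentioned above can be sketched in a few lines. The bounding-box layout and the fixed page and line heights below are assumptions for illustration; in practice these would come from the rendered page images.

```python
# Sketch of converting a recognition result's coordinate information into the
# page number and line number recorded next to the difference information.
# PAGE/LINE heights are hypothetical constants, not values from the patent.

LINE_HEIGHT = 25     # assumed average text-line height in pixels


def coords_to_page_line(bbox):
    """bbox = (page, x0, y0, x1, y1) from OCR; returns (page, line_number)."""
    page, _, y0, _, _ = bbox
    return page, y0 // LINE_HEIGHT + 1   # lines are numbered from 1


print(coords_to_page_line((3, 10, 120, 200, 140)))  # (3, 5)
```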
In one embodiment, after obtaining the text comparison result and the table comparison result, as shown in fig. 6, the method further includes:
step S302, merging the target files, and displaying texts and/or tables with corresponding relations in different files in the same preset area.
Specifically, the two target files may be displayed in the same area, aligned with each other page by page starting from the first page.
Step S304, determining the position information of the text difference information and/or the table difference information in the target file according to the coordinate information.
Specifically, the coordinate information may be obtained through the identification result corresponding to the difference information, and the position information such as the page number and the line number of the difference information in the target file may be determined according to the coordinate information.
Step S306, adding the text difference information and/or the table difference information to the preset area according to the position information.
Specifically, the position of the difference information in the target file may be located from the position information, and the difference information may be added at a specified position in the preset area. For example, the difference information may be added in the form of an annotation at the corresponding location in the target file, or noted in the preset area to the left or right of the target file. Optionally, the difference information may also be indicated using highlighting and/or bold fonts. In some other embodiments, the difference information may be framed and the modification type of the differing text may be displayed.
Step S308, generating a visual difference report file according to the content displayed in the preset area.
Specifically, a Python program or script can be used to generate a visual difference report file from the content in the preset area. The visual difference report file can be an offline file, with a corresponding download address provided.
In the above embodiment, the target files for comparison may be integrated in the same area, the difference information may be added to the designated position of the area according to the coordinate information, and a visual difference report available for downloading may be generated by using a tool such as a script. Therefore, the difference information in different target files can be more intuitively reflected through the visual report, off-line downloading is provided, and the comparison target files can be checked at any time.
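A minimal sketch of steps S302 through S308 is shown below: the two files are merged into one side-by-side view and written out as an offline, visual difference report. The HTML table layout and the highlight color are illustrative choices, not mandated by the method.

```python
# Sketch of steps S302-S308: merging the two compared files into one view and
# emitting an offline visual difference report file. Layout is an assumption.

import html


def generate_diff_report(left_lines, right_lines, diffs, path="diff_report.html"):
    """diffs: set of line indexes that differ between the two files."""
    rows = []
    for i, (left, right) in enumerate(zip(left_lines, right_lines)):
        style = ' style="background:#ffd"' if i in diffs else ""  # highlight diffs
        rows.append(f"<tr{style}><td>{html.escape(left)}</td>"
                    f"<td>{html.escape(right)}</td></tr>")
    page = ("<html><body><table border='1'>"
            "<tr><th>File A</th><th>File B</th></tr>"
            + "".join(rows) + "</table></body></html>")
    with open(path, "w", encoding="utf-8") as f:   # offline file for download
        f.write(page)
    return path


report = generate_diff_report(["Amount: 100", "Date: 2022"],
                              ["Amount: 200", "Date: 2022"], {0})
print(report)  # diff_report.html
```

The returned path is what a download address would point at in the deployed system.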
It should be understood that although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with sub-steps or stages of other steps.
According to another aspect of the embodiments of the present disclosure, as shown in fig. 7, there is also provided a text comparison apparatus based on optical character recognition, including:
a recognition result obtaining module 502, configured to obtain a recognition result obtained by performing optical character recognition on a target file and coordinate information corresponding to the recognition result;
a classification module 504, configured to classify the recognition result to obtain a text recognition result and a table recognition result;
the text processing module 506 is configured to combine the text recognition results according to the coordinate information to generate a text line object list;
a table processing module 508, configured to combine the table identification results according to the coordinate information to generate a table row object list;
a format processing module 510, configured to perform word format processing on the text row object list and the table row object list respectively to obtain text pre-comparison words and table pre-comparison words;
the character comparison module 512 is configured to compare the text pre-comparison characters with the table pre-comparison characters respectively through a preset character comparison algorithm to obtain a text comparison result and a table comparison result; the text comparison result is used for representing text difference information in the target file; the table comparison result is used for representing table difference information in the target file.
For the specific limitations of the above comparison apparatus, reference may be made to the limitations of the above comparison method, which are not repeated herein. Following the comparison method, further modules (a first module, a second module, and so on) may be added to the comparison apparatus to implement the steps of the corresponding method embodiments. Each module in the comparison apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
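The cooperation of the modules in fig. 7 can be sketched as one Python class, with each method standing in for one module. The class name, the dict-based recognition records, and the use of difflib as the "preset character comparison algorithm" are all illustrative assumptions.

```python
# Hypothetical wiring of modules 502-512 as a plain class. difflib stands in
# for the preset character comparison algorithm; the patent does not name one.

import difflib


class TextComparator:
    def classify(self, recognitions):            # classification module 504
        texts = [r for r in recognitions if r["kind"] == "text"]
        tables = [r for r in recognitions if r["kind"] == "table"]
        return texts, tables

    def to_lines(self, items):                   # text/table processing modules 506/508
        # combine recognition results into line order by vertical coordinate
        return [i["text"] for i in sorted(items, key=lambda i: i["y"])]

    def compare(self, a_lines, b_lines):         # character comparison module 512
        return list(difflib.unified_diff(a_lines, b_lines, lineterm=""))


a = [{"kind": "text", "y": 1, "text": "hello"}]
b = [{"kind": "text", "y": 1, "text": "hullo"}]
comparator = TextComparator()
diff = comparator.compare(comparator.to_lines(comparator.classify(a)[0]),
                          comparator.to_lines(comparator.classify(b)[0]))
print("-hello" in diff and "+hullo" in diff)  # True
```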
In one embodiment, there is also provided a text comparison system, the system comprising:
the comparison task receiving module can be used for receiving the comparison task and putting the comparison task into a comparison task queue, and can also be used for returning to a task receiving state;
the comparison task processing module comprises a process pool, and the process pool can be used for reading comparison task information from the queue and executing the comparison task;
the file processing module can be used for splitting the TIF file into pictures; the method can also be used for converting the PDF file into a picture according to pages;
the OCR module can be used for performing OCR recognition on the picture to obtain a recognition result;
the text comparison device, comprising all the component modules of the above comparison apparatus, which can implement the steps of the corresponding method embodiments.
The system can be developed in Python, Java, or other programming languages; for example, the comparison system can be built with the Flask framework in Python.
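The task-queue flow of the system (receive a task, return to the receiving state, process the task from a worker pool) can be sketched with the standard library alone; the text suggests a Flask HTTP layer, which is omitted here so the sketch stays self-contained. All names and the pool behavior are assumptions.

```python
# Stdlib sketch of the comparison system's task flow: the receiving module
# enqueues tasks and stays ready for the next one, while a pool worker drains
# the queue. Real file splitting, OCR, and comparison are stubbed out.

import queue
import threading

task_queue = queue.Queue()   # comparison task queue
results = {}

def receive_task(task_id, payload):      # comparison task receiving module
    task_queue.put((task_id, payload))
    return "accepted"                    # immediately back to the receiving state

def worker():                            # one worker in the processing pool
    while True:
        task_id, payload = task_queue.get()
        # file splitting (TIF/PDF -> pictures), OCR, and text/table
        # comparison would run here; this stub just echoes the payload.
        results[task_id] = f"compared:{payload}"
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

receive_task("t1", "fileA-vs-fileB")
task_queue.join()                        # wait until the queued task is processed
print(results["t1"])  # compared:fileA-vs-fileB
```

In a Flask deployment, `receive_task` would be the body of a POST route and the worker pool would be a process pool rather than a thread.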
According to another aspect of the embodiments of the present disclosure, there is provided a computer device, which may be a terminal, and an internal structure diagram of the terminal may be as shown in fig. 8. The computer device may include a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement the above-mentioned comparison method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
According to another aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, performs the steps in the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; the non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., without limitation.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but should not be construed as limiting the scope of the application. Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations that follow the general principles of the disclosure, including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. The specification and examples are to be considered exemplary only, with the true scope and spirit of the disclosure indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described and illustrated in the drawings, and that various modifications and changes may be made without departing from the scope thereof.

Claims (10)

1. A text comparison method based on optical character recognition is characterized by comprising the following steps:
acquiring an identification result obtained by carrying out optical character identification on a target file and coordinate information corresponding to the identification result;
classifying the recognition results to obtain text recognition results and form recognition results;
combining the text recognition results according to the coordinate information to generate a text line object list;
combining the table identification results according to the coordinate information to generate a table row object list;
respectively carrying out character format processing on the text line object list and the table line object list to obtain text pre-comparison characters and table pre-comparison characters;
respectively comparing the text pre-comparison characters with the table pre-comparison characters through a preset character comparison algorithm to obtain a text comparison result and a table comparison result; the text comparison result is used for representing text difference information in the target file; the table comparison result is used for representing table difference information in the target file.
2. The method of claim 1, wherein the combining the text recognition results according to the coordinate information to generate a text line object list comprises:
arranging the text recognition results of the same line into a line in sequence according to the coordinate information to obtain a plurality of text line objects;
acquiring the line number of the text recognition result according to the coordinate information;
and combining the text line objects according to the line numbers to obtain a text line object list.
3. The method of claim 2, further comprising, after obtaining the text comparison result and the table comparison result:
inquiring the text difference information in the text comparison result;
acquiring an index line number of the text difference information in the text line object list according to the text comparison result, and determining a text line object corresponding to the index line number;
determining a text recognition result corresponding to the text difference information according to the text line object, and obtaining corresponding first coordinate information;
and recording the first coordinate information to the text comparison result.
4. The method of claim 1, wherein combining the table identification results according to the coordinate information to generate a table row object list comprises:
restoring the table identification result according to the coordinate information to obtain a restored table;
merging cells in the same row in the reduction table to obtain a plurality of table row objects;
and combining the plurality of table row objects according to the coordinate information to obtain a table row object list.
5. The method of claim 4, further comprising, after obtaining the text comparison result and the table comparison result:
inquiring the table difference information in the table comparison result;
according to the table comparison result, a table row number of the table difference information in the table row object list is obtained, and a table row object corresponding to the table row number is determined;
determining a table identification result corresponding to the table difference information according to the table row object, and obtaining corresponding second coordinate information;
and recording the second coordinate information to the table comparison result.
6. The method of claim 1, further comprising, after obtaining the text comparison result and the table comparison result:
merging the target files to enable texts and/or tables with corresponding relations in different files to be displayed in the same preset area;
determining the position information of the text difference information and/or the table difference information in the target file according to the coordinate information;
adding the text difference information and/or the table difference information to the preset area according to the position information;
and generating a visual difference report file according to the content displayed in the preset area.
7. A text comparison device based on optical character recognition is characterized by comprising:
the identification result acquisition module is used for acquiring an identification result obtained by carrying out optical character identification on the target file and coordinate information corresponding to the identification result;
the classification module is used for classifying the recognition results to obtain text recognition results and form recognition results;
the text processing module is used for combining the text recognition results according to the coordinate information to generate a text line object list;
the table processing module is used for combining the table identification results according to the coordinate information to generate a table row object list;
the format processing module is used for respectively carrying out character format processing on the text line object list and the table line object list to obtain text pre-comparison characters and table pre-comparison characters;
the character comparison module is used for respectively comparing the text pre-comparison characters with the table pre-comparison characters through a preset character comparison algorithm to obtain a text comparison result and a table comparison result; the text comparison result is used for representing text difference information in the target file; the table comparison result is used for representing table difference information in the target file.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202211725806.8A 2022-12-30 2022-12-30 Text comparison method and device based on optical character recognition Pending CN115965988A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211725806.8A CN115965988A (en) 2022-12-30 2022-12-30 Text comparison method and device based on optical character recognition


Publications (1)

Publication Number Publication Date
CN115965988A true CN115965988A (en) 2023-04-14

Family

ID=87358492



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination