CN116384344A - Document conversion method, device and storage medium - Google Patents

Publication number
CN116384344A
Authority
CN
China
Prior art keywords
document
target
converted
ocr
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310347142.4A
Other languages
Chinese (zh)
Inventor
周小玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Lingyu Information Technology Co ltd
Original Assignee
Wuhan Lingyu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Lingyu Information Technology Co ltd
Priority to CN202310347142.4A
Publication of CN116384344A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a document conversion method, a document conversion device and a storage medium, and relates to the technical field of computers. The document conversion method comprises the following steps: obtaining a document to be converted, such as a PDF document; marking the document to be converted and generating a mark file for it, wherein the marks in the file are a text mark, a table mark and an OCR mark; processing the document to be converted according to the marks to obtain a marked document corresponding to it; and converting the marked document to obtain a target file in the specified format. No other format conversion operations need to be executed, which improves the compatibility of the generated document and the efficiency of document conversion. The content of the file becomes editable, and both the fidelity with which information is restored and the convenience of acquiring it are improved. The method applies to a wide range of devices, converts efficiently, offers good compatibility and has practical application value.

Description

Document conversion method, device and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a document conversion method, a document conversion device and a storage medium.
Background
In daily life and work, PDF documents are widely used. When editing and producing PDF documents, users often need to refer to and modify picture material laid out in PDF style, and may therefore need to perform a format conversion operation, such as converting a winter sheet of a PDF document layout into a PDF document. In the current conversion mode, the PDF document layout cannot be extracted.
In the current conversion mode, tables that appear as pictures in a PDF document layout cannot be extracted, and the target file obtained by restoring the picture-form table content of a PPT document layout reproduces the original poorly.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a document conversion method, a device and a storage medium, which can effectively convert various documents and promote the convenience and the high efficiency of conversion among various file formats.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
In a first aspect, a document conversion method is provided, the method including: acquiring a target document; and marking the target document to determine its content marks, the marks being classified as a text mark, a table mark and an OCR mark.
It can be seen that, in the embodiment of the present application, the corresponding table document is obtained by marking the content of the document and analyzing it, that is, any type of document can complete the marking process, and then the obtained table document is converted to obtain the target document.
The content of the target document is scanned, and the mark classification of the content is determined, to obtain the scanned document.
Processing according to the scanned document to obtain the table document corresponding to the cache document includes: if the mark is a text mark, performing text parsing on the target document; if the mark is a table mark, performing structured table parsing on the target document; if the mark is an OCR mark, performing OCR recognition on the target document; and generating the target document from the parsed content, to obtain the converted target document.
The marking unit is further configured to: if structured parsing with the table parsing tool does not yield the scanned document corresponding to the target document, perform structured OCR recognition with the OCR parsing tool to obtain the scanned document corresponding to the target document; and if structured OCR recognition with the OCR parsing tool does not yield the scanned document corresponding to the target document, perform text parsing with the text parsing tool to obtain the scanned document corresponding to the target document.
Before structured recognition of the target document with the OCR parsing tool, the marking unit is further configured to: determine the target classification of the target document to be the OCR type, completing the reclassification and grading of the target document. Before text parsing of the target document with the third parsing tool, the marking unit is further configured to: determine the target classification of the target document to be the text type, completing the reclassification and grading of the target document.
Processing the target document according to the target classification to obtain the scanned document corresponding to the target document includes: performing table processing on the target document according to the target classification to obtain the scanned document corresponding to the target document; and performing OCR (optical character recognition) processing on the target document to obtain a verification scanned document corresponding to the target document. The device further comprises a verification unit configured to: verify correctness by comparing the character string message converted from the verification scanned document with the character string message converted from the table document.
Verifying the correctness of the character string message converted from the scanned document against the character string message converted from the verification table document includes: verifying the numeric content of the character string message converted from the scanned document, to determine that the character type of the numeric content is correct; and verifying the key items of the character string message converted from the scanned document, to determine that the numeric content contained in each key item matches the text content contained in that key item.
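The verification described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each converted character string message has already been parsed into a mapping from key item to value; the names `is_numeric`, `to_number` and `verify` and the number format accepted are hypothetical and do not come from the patent.

```python
import re

# Assumed number format: optional sign, optional thousands separators, optional decimals.
NUMERIC = re.compile(r"^-?\d{1,3}(,\d{3})*(\.\d+)?$|^-?\d+(\.\d+)?$")


def is_numeric(value: str) -> bool:
    """Check that a converted value has a valid numeric character type."""
    return bool(NUMERIC.match(value.strip()))


def to_number(value: str) -> float:
    """Convert a value that already passed is_numeric into a float."""
    return float(value.strip().replace(",", ""))


def verify(table_msg: dict[str, str], ocr_msg: dict[str, str], key_items: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for item in key_items:
        if item not in table_msg:
            problems.append(f"missing key item: {item}")
            continue
        value = table_msg[item]
        if not is_numeric(value):
            problems.append(f"non-numeric value for {item}: {value!r}")
            continue
        # Cross-check against the OCR-derived message when it has a usable value.
        ocr_value = ocr_msg.get(item)
        if ocr_value is not None and is_numeric(ocr_value):
            if to_number(ocr_value) != to_number(value):
                problems.append(f"mismatch for {item}: {value!r} vs {ocr_value!r}")
    return problems
```

A caller would treat a non-empty result as a failed verification and could, for instance, route the document to manual review.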
In another aspect, there is provided a document conversion apparatus including:
an acquisition unit configured to acquire a converted document;
the description unit is used for scanning the converted document to obtain a scanned document of the converted document;
the marking unit is used for marking the converted document and determining the mark type of the target document, wherein the mark types are a text mark, a table mark and an OCR mark;
and the conversion unit is used for converting the table document to obtain a target document corresponding to the converted document.
In one possible example, a content scan is performed on a target document, and a tag classification of the target document content is determined to obtain a scanned document.
In one possible example, processing according to the scanned document to obtain the table document corresponding to the cache document includes: if the mark is a text mark, performing text parsing on the target document; if the mark is a table mark, performing structured table parsing on the target document; if the mark is a picture mark, performing picture parsing on the target document; if the mark is an OCR mark, performing OCR recognition on the target document; and generating the target document from the parsed content, to obtain the converted target document.
In a third aspect, an electronic device is provided, the device comprising: the processor, the memory and the communication interface are connected with each other and complete the communication work among each other;
the memory stores executable program codes, and the communication interface is used for wireless communication;
the processor is configured to retrieve executable program code stored on the memory and perform some or all of the steps as described in any of the methods of the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a computer program operable to cause a computer to perform some or all of the steps described in any of the methods of the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
In a fifth aspect, embodiments of the present application provide a document parsing system, including the electronic device described in the third aspect, and may further include other devices for interacting with the electronic device.
Advantageous effects
The invention provides a document conversion method, a document conversion device and a storage medium. Compared with the prior art, the method has the following beneficial effects:
A document to be converted, such as a PDF document, is obtained; the document to be converted is marked and a mark file is generated for it, wherein the marks in the file are text marks, table marks and OCR marks; the document to be converted is processed according to the marks to obtain a marked document corresponding to it; and the marked document is converted to obtain a target file in the specified format. No other format conversion operations need to be executed, which improves the compatibility of the generated document and the efficiency of document conversion. The content of the file becomes editable, and both the fidelity with which information is restored and the convenience of acquiring it are improved. The method applies to a wide range of devices, converts efficiently, offers good compatibility and has practical application value.
Drawings
FIG. 1 is a schematic diagram of a document conversion system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a document conversion method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for classifying objects of a PDF document into form types according to an embodiment of the invention;
FIG. 4 is a schematic diagram illustrating a process of classifying objects of another PDF document into form types according to an embodiment of the invention;
FIG. 5 is a flowchart of a specific process according to the present invention;
FIG. 6 is a schematic diagram of a document conversion device of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiments of the present application provide a document format conversion method, a device, computer equipment and a storage medium. Specifically, the document format conversion method of the embodiments may be performed by a computer device, which may be a terminal, a server or a similar device. The terminal may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart television or the like.
The embodiment of the application is applied to various scenes such as artificial intelligence, computer vision, image recognition and the like.
First, terms or terminology appearing in describing embodiments of the present application are explained as follows:
Artificial intelligence: artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Since its birth, its theory and technology have steadily matured and its fields of application have kept expanding; it is conceivable that the technological products brought by artificial intelligence will in future be "containers" of human intelligence. Artificial intelligence can simulate the information processes of human consciousness and thinking.
Computer vision: computer vision is the science of how to make a machine "see". More specifically, a camera and a computer are used in place of human eyes to identify, track and measure targets, and further graphic processing is performed so that the computer produces images better suited to observation by human eyes or to transmission to an instrument for detection. As a scientific discipline, computer vision studies the related theory and technology in an attempt to build artificial intelligence systems that can obtain "information" from images or multidimensional data. The information referred to here is information as defined by Shannon, which can be used to help make a "decision". Because perception can be seen as the extraction of information from sensory signals, computer vision can also be seen as the science of how to make an artificial system "perceive" from images or multidimensional data.
Machine learning: machine learning is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance.
Image recognition: image recognition is a technique in which a computer processes, analyzes and understands images in order to recognize targets and objects in various modes; it is a practical application of deep learning algorithms.
OCR: OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method. That is, the characters in a paper document are optically converted into a black-and-white dot-matrix image file, and recognition software converts the characters in the image into a text format for further editing and processing by word processing software.
Referring to fig. 1, fig. 1 is a schematic diagram of a document conversion system provided in an embodiment of the present application, and as shown in fig. 1, the system includes a document analysis server and a document conversion server, wherein a user inputs a plurality of documents to the document analysis server, and after the documents are analyzed by the document analysis server, a plurality of table documents are obtained, and after the plurality of table documents are input to the document conversion server, a target document is obtained by conversion. The document analysis server and the document conversion server may be independent servers or servers with combined functions.
Referring to fig. 2, fig. 2 is a flowchart of a document conversion method provided in an embodiment of the present application, which is applied to the above document conversion system, and specifically, the method includes the following steps:
101. A document to be converted is acquired.
102. The content of the converted document is marked, and the content marks corresponding to the converted document are determined, the marks being text marks, table marks, picture marks, file attachment marks and OCR marks.
The content of the target document is scanned, and the mark classification of the content is determined, to obtain the scanned document.
Processing according to the scanned document to obtain the table document corresponding to the converted document includes: if the mark is a text mark, performing text parsing on the target document; if the mark is a table mark, performing structured table parsing on the target document; if the mark is a picture mark, performing picture parsing on the target document; if the mark is an OCR mark, performing OCR recognition on the target document; and generating the target document from the parsed content, to obtain the converted target document.
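The per-mark branching can be sketched as a small dispatch table. This is only an illustration: the `Mark` enum, the placeholder parser functions and the shape of the input (a list of marked content blocks) are assumptions, and a real system would call actual text, table, picture and OCR parsers in their place.

```python
from enum import Enum
from typing import Callable


class Mark(Enum):
    TEXT = "text"
    TABLE = "table"
    PICTURE = "picture"
    OCR = "ocr"


# Placeholder parsers (hypothetical); each stands in for a real parsing tool.
def parse_text(content: str) -> str:
    return f"text:{content}"


def parse_table(content: str) -> str:
    return f"table:{content}"


def parse_picture(content: str) -> str:
    return f"picture:{content}"


def run_ocr(content: str) -> str:
    return f"ocr:{content}"


PARSERS: dict[Mark, Callable[[str], str]] = {
    Mark.TEXT: parse_text,
    Mark.TABLE: parse_table,
    Mark.PICTURE: parse_picture,
    Mark.OCR: run_ocr,
}


def convert(blocks: list[tuple[Mark, str]]) -> list[str]:
    """Parse each marked block with the parser registered for its mark."""
    return [PARSERS[mark](content) for mark, content in blocks]
```

Keeping the mapping in one dictionary means a further mark type (for example the file attachment mark) could be supported by registering one more entry.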
As described above, a document may contain various kinds of content, including signatures, pictures, tables and fonts, and different content may call for different parsing methods. For example, if the document is picture content, optical character recognition (OCR) must be performed on the PDF document so that the picture content can be converted into computer text. If the PDF is non-image content, that is, the PDF document was generated by a machine rather than by scanning, its content can be obtained by table extraction. Alternatively, the document may have no structural features, in which case content is extracted from it by text recognition. Documents can thus be classified into OCR types, table types and text types.
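One way such a three-way classification might be approximated is with simple page statistics. The sketch below is an assumption-laden illustration, not the patent's classifier: the `PageStats` fields and both thresholds are invented for the example, and gathering the statistics from a real PDF is left out.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    extractable_chars: int   # characters available in the PDF text layer
    image_area_ratio: float  # fraction of the page covered by images, 0..1
    ruling_lines: int        # horizontal/vertical rule segments found on the page


def classify(stats: PageStats) -> str:
    """Return 'ocr', 'table' or 'text' using assumed heuristic thresholds."""
    # No text layer and a mostly-image page suggests a scanned document.
    if stats.extractable_chars == 0 and stats.image_area_ratio > 0.5:
        return "ocr"
    # Several rule segments suggest a machine-generated table.
    if stats.ruling_lines >= 4:
        return "table"
    # Otherwise treat the page as unstructured text.
    return "text"
```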
Different mark classifications call for different parsing logic. For an OCR-type document, for example, printed characters in the paper document are first optically converted into a black-and-white dot-matrix image file, and recognition software then converts the characters in the image into a text format. Therefore, once the target classification corresponding to the target document has been determined, the target document can be parsed in the manner corresponding to that classification to obtain the table document. The target document is uniformly processed into a table document because, in the scenarios addressed by the embodiments of the application, most of the document data consists of table data, and this data is converted into a table document, for example an Excel document, rather than into a text document such as a Word or txt document. On the one hand this reduces the difficulty of parsing the document; on the other hand the uniform converted format improves the efficiency of further converting the document into a character string message.
Since the PDF document itself is unstructured data, structured parsing of the PDF document is a process of converting unstructured data into structured data. Take the case where the data content of the PDF document is an amortization report:
A typical financial statement includes an asset liability statement and a profit statement. For small and medium-sized enterprises, several financial accounting criteria need to be supported, and the criterion a given report follows can be identified from content such as its header, form number and key subjects. The statements currently supported are as follows:
a. general enterprise criteria asset liability statement;
b. small enterprise criteria asset liability statement;
c. enterprise criteria asset liability statement;
d. general enterprise criteria profit statement;
e. small enterprise criteria profit statement;
f. enterprise criteria profit statement;
g. small enterprise annual report.
For a business report to be processed, the PDF document processing stage must first match the financial report to its accounting criterion, so that the business content to be processed is clear.
Structuring is based on a template, which is a table document, for example an Excel template; each accounting criterion corresponds to the specific accounting objects contained in its Excel template.
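Matching a report to its accounting criterion and template could, for instance, be done by keyword rules over the header and key subjects. The rule table, keyword strings and template file names below are hypothetical placeholders chosen for the sketch; the patent does not specify how the match is made.

```python
from typing import Optional

# Hypothetical rules: (keywords that must all appear, template file name).
RULES = [
    (("small enterprise", "asset liability"), "small_enterprise_asset_liability.xlsx"),
    (("small enterprise", "profit"), "small_enterprise_profit.xlsx"),
    (("general enterprise", "asset liability"), "general_enterprise_asset_liability.xlsx"),
    (("general enterprise", "profit"), "general_enterprise_profit.xlsx"),
]


def pick_template(header: str, key_subjects: list[str]) -> Optional[str]:
    """Return the first template whose keywords all occur in the header or key subjects."""
    haystack = " ".join([header, *key_subjects]).lower()
    for keywords, template in RULES:
        if all(keyword in haystack for keyword in keywords):
            return template
    return None  # No criterion recognized; the caller decides how to proceed.
```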
When the target classification of the target document is determined to be the table type, the specific processing procedure is as follows:
A. First, structured parsing is performed. This is the key to processing the business data: the business data is presented in table form, so it must first be determined that the PDF document can be converted into a document in table form.
B. The diversity of table forms is handled. Tables are divided into simple row-and-column tables and various merged tables; during structuring, merged tables need to be treated differently.
Merged tables are divided into:
B1. Merged table headers. Referring to FIG. 3, which is a schematic diagram of a process of classifying objects of a PDF document into the table type: as shown in (a) in FIG. 3, the document includes a table whose header occupies two rows; structuring this table yields the structured table shown in (b) in FIG. 3, in which the two rows occupied by the header have been merged. Further, a specific header template is set for the structured table, and corresponding table frame lines can be added to the two merged rows, as shown in Table 1.
TABLE 1 (provided as image BDA0004160267800000091 in the original publication)
B2. Merged rows and columns. Referring to FIG. 4, which is a schematic diagram of a process of classifying objects of another PDF document into the table type: as shown in (a) in FIG. 4, the PDF document includes a table in which many rows and columns are merged. After structuring, the general agency report and branch agency report entries are redundant information and need to be removed or replaced after conversion; misplaced data needs to be spliced together according to serial numbers. This business logic can be handled by a dedicated method.
C. If data in table form can be obtained through structured parsing, the key row and column data of the table are determined by means such as template matching and table boundary determination, and the table document is then obtained.
D. If data in table form cannot be obtained through structured parsing, the target classification of the target document may be the OCR type. An OCR recognition tool can be used for structured recognition, and if recognition succeeds, the process of step B is carried out to obtain the table document.
E. If data in table form cannot be obtained through the OCR recognition tool, text parsing can be performed: the text is read line by line, and the table structure is restored according to the key subjects in the text. This approach is error-prone and easily confused; it is typically applied to individual key subjects, and the context must also be compared to prevent errors in the extracted content. It is therefore considered only as a last resort for obtaining the table document.
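The two merged-table cases under step B can be sketched as follows. Both helpers are illustrative assumptions: `merge_header_rows` treats a blank top-row cell as the continuation of the merged cell to its left, and `splice_by_serial` assumes that fragments of one misplaced entry carry the same serial number and that redundant entries can be recognized by their label. Neither rule is stated in the patent.

```python
def merge_header_rows(top: list[str], bottom: list[str]) -> list[str]:
    """Flatten a two-row header into one row of column names."""
    merged, current = [], ""
    for upper, lower in zip(top, bottom):
        if upper.strip():
            current = upper.strip()  # a new (possibly merged) top-level label starts
        lower = lower.strip()
        if current and lower:
            merged.append(f"{current} / {lower}")
        else:
            merged.append(current or lower)
    return merged


def splice_by_serial(fragments: list[tuple[str, str]], redundant: set[str]) -> dict[str, str]:
    """Join text fragments that share a serial number, skipping redundant labels."""
    spliced: dict[str, str] = {}
    for serial, text in fragments:
        text = text.strip()
        if text in redundant:
            continue  # e.g. general agency / branch agency report entries
        spliced[serial] = f"{spliced.get(serial, '')} {text}".strip()
    return spliced
```

For example, a header with top row `["Item", "Amount", ""]` and bottom row `["", "Opening", "Closing"]` flattens to `["Item", "Amount / Opening", "Amount / Closing"]`.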
Steps A to E are the processing flow for obtaining the table document when the target classification of the target document is the table type. If the target classification of the target document is determined to be the OCR type, processing can start from step D; if it is determined to be the text type, processing can start from step E. In either case the table document corresponding to the target document is obtained.
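The entry point and fallback order described here can be sketched as a short loop. The parser callables are supplied by the caller and are hypothetical stand-ins for the table parsing, OCR and text parsing tools; each is assumed to return `None` (or an empty table) when it fails.

```python
from typing import Callable, Optional

Table = list[list[str]]
Parser = Callable[[bytes], Optional[Table]]

# Fallback order: structured table parsing, then OCR, then line-by-line text parsing.
ORDER = ["table", "ocr", "text"]


def to_table(doc: bytes, classification: str, parsers: dict[str, Parser]) -> Optional[Table]:
    """Start at the step matching the classification and fall through on failure."""
    start = ORDER.index(classification)
    for kind in ORDER[start:]:
        table = parsers[kind](doc)
        if table:
            return table
    return None  # every applicable parser failed
```

A document classified as the table type therefore tries all three parsers in turn, while one classified as the text type goes straight to text parsing.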
Therefore, in the embodiment of the application, for target documents with different classifications, different processing methods are adopted to obtain structured data in the target documents, or the unstructured data is structured to obtain corresponding table documents, so that the processing procedures of PDF documents with different classifications can be compatible, and meanwhile, the conversion efficiency of the PDF documents is improved.
103. And converting the table document to obtain the character string message corresponding to the target document.
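The patent does not specify the format of the character string message. As a minimal sketch, assuming a compact JSON encoding and a table already split into a header row and data rows, the conversion might look like this (the function name `table_to_message` is hypothetical):

```python
import json


def table_to_message(header: list[str], rows: list[list[str]]) -> str:
    """Serialize table rows as a compact JSON string, one object per row."""
    records = [dict(zip(header, row)) for row in rows]
    # ensure_ascii=False keeps non-ASCII subject names readable in the message.
    return json.dumps(records, ensure_ascii=False, separators=(",", ":"))
```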
An acquisition unit configured to acquire a converted document;
the description unit is used for scanning the converted document to obtain a scanned document of the converted document;
the marking unit is used for marking the converted document and determining the mark type of the target document, wherein the mark types are a text mark, a table mark, a picture mark, an attachment mark and an OCR mark;
and the conversion unit is used for converting the table document to obtain a target document corresponding to the converted document.
And the acquisition unit is used for acquiring the target document.
It can be seen that the device described in the embodiments of the present application processes according to the scanned document to obtain the table document corresponding to the converted document, including: if the mark is a text mark, performing text parsing on the target document; if the mark is a table mark, performing structured table parsing on the target document; if the mark is a picture mark, performing picture parsing on the target document; if the mark is an OCR mark, performing OCR recognition on the target document; and generating the target document from the parsed content, to obtain the converted target document.
Optionally, processing the target document according to the target classification to obtain the table document corresponding to the target document includes: if the mark is a text mark, performing text parsing on the target document; if the mark is a table mark, performing structured table parsing on the target document; if the mark is a picture mark, performing picture parsing on the target document; if the mark is an OCR mark, performing OCR recognition on the target document; and generating the target document from the parsed content, to obtain the converted target document.
Specifically, the embodiments of the present application may divide functional units of the data acquisition device according to the above method examples, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application, as shown in fig. 7: the device comprises a processor, a memory and a communication interface, wherein the processor, the memory and the communication interface are connected with each other and complete the communication work among each other;
the memory stores executable program codes, and the communication interface is used for wireless communication;
the processor is configured to retrieve executable program code stored on the memory, perform part or all of the steps of any of the data acquisition methods described in the method embodiments above, where the computer includes an electronic terminal device.
The memory may be volatile memory such as dynamic random access memory DRAM, or nonvolatile memory such as a mechanical hard disk. The memory is configured to store a set of executable program code, and the processor is configured to invoke the executable program code stored in the memory, and to execute instructions comprising:
acquiring a target document, wherein the target document is a portable document format PDF document; classifying and grading the target document, determining target classification corresponding to the target document, wherein the target classification is a text type, a form type or an optical character recognition OCR type; processing the target document according to the target classification to obtain a table document corresponding to the target document; and converting the table document to obtain the character string message corresponding to the target document.
It can be seen that the device described in the embodiments of the present application processes the converted document according to the scanned document to obtain a table document corresponding to the converted document, including: if the mark is a text mark, performing text parsing on the target document; if the mark is a table mark, performing table structural parsing on the target document; if the mark is a picture mark, performing picture parsing on the target document; if the mark is an OCR mark, performing OCR recognition on the target document; and generating a target document from the parsed content to obtain the converted target document.
Optionally, in processing the target document according to the target classification to obtain a table document corresponding to the target document, the processor is configured to invoke the executable program code stored in the memory to: if the mark is a table mark, perform table structural parsing on the target document; if the mark is an OCR mark, perform OCR recognition on the target document; and generate a target document from the parsed content to obtain the converted target document.
Optionally, the processor is further configured to invoke the executable program code stored in the memory to: if the mark is a table mark, perform table structural parsing on the target document; if the mark is an OCR mark, perform OCR recognition on the target document; and generate a target document from the parsed content to obtain the converted target document.
Optionally, before structural recognition is performed on the target document by the OCR tool, the processor is configured to invoke the executable program code stored in the memory to: if the mark is a table mark, perform table structural parsing on the target document; if the mark is an OCR mark, perform OCR recognition on the target document; and generate a target document from the parsed content to obtain the converted target document.
Optionally, in processing the target document according to the target classification to obtain a table document corresponding to the target document, the processor invokes the executable program code stored in the memory and is configured to: if the mark is a table mark, perform table structural parsing on the target document; if the mark is an OCR mark, perform OCR recognition on the target document; and generate a target document from the parsed content to obtain the converted target document.
The target classification corresponding to the target document may be determined to be the OCR type if a signature is identified in the PDF document or the PDF document is composed of pictures, and may be determined to be the table type if the tables and text contained in the PDF document can be extracted directly.
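The classification rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the `PdfFeatures` flags, the `TargetClass` enum and the `classify` function are hypothetical names that do not appear in the application, and the rule for the text type (extractable text without tables) is an assumption, since the description states only the OCR and table conditions.

```python
from dataclasses import dataclass
from enum import Enum


class TargetClass(Enum):
    TEXT = "text"
    TABLE = "table"
    OCR = "ocr"


@dataclass
class PdfFeatures:
    # Hypothetical flags; a real system would derive them by inspecting the PDF.
    has_signature: bool
    image_only: bool
    has_extractable_table: bool
    has_extractable_text: bool


def classify(features: PdfFeatures) -> TargetClass:
    # A signature or a picture-only PDF leads to the OCR type, as described.
    if features.has_signature or features.image_only:
        return TargetClass.OCR
    # Directly extractable tables and text indicate the table type.
    if features.has_extractable_table and features.has_extractable_text:
        return TargetClass.TABLE
    # Assumption: extractable text without tables is treated as the text type.
    if features.has_extractable_text:
        return TargetClass.TEXT
    # Assumption: nothing extractable, so fall back to OCR.
    return TargetClass.OCR
```

Checking the signature and picture conditions first means a signed PDF is sent to OCR even when its tables could be extracted, which matches the order in which the description states the conditions.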
Different pages of the same PDF document may satisfy different classifications; for example, some pages are of the table type and other pages are of the OCR type. In this case the PDF document can be split by page so that each resulting sub-document satisfies only one target classification, which improves processing efficiency and the success rate when the PDF document is subsequently processed according to the target classification.
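A minimal sketch of the page splitting described above, assuming each page has already been given a classification label. The function name `split_by_classification` is hypothetical, and grouping consecutive pages (rather than collecting all pages of one type) is an assumption made here to preserve page order.

```python
from itertools import groupby


def split_by_classification(page_classes):
    """Group consecutive pages with the same classification into sub-documents.

    page_classes: list of labels, one per page, in page order.
    Returns a list of (classification, [1-based page numbers]) tuples.
    """
    runs = []
    indexed = enumerate(page_classes, start=1)
    # groupby only merges adjacent items, so page order is preserved.
    for cls, group in groupby(indexed, key=lambda item: item[1]):
        runs.append((cls, [page for page, _ in group]))
    return runs


# Pages 1-2 are table pages, page 3 is a scanned page, page 4 is a table page.
print(split_by_classification(["table", "table", "ocr", "table"]))
# prints [('table', [1, 2]), ('ocr', [3]), ('table', [4])]
```

Each returned run can then be written out as its own sub-document and handed to the parser that matches its classification.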
Processing the converted document according to the target mark to obtain a scanned document corresponding to the converted document includes: if the target classification of the target document is the table type, performing structural parsing on the target document with a table parsing tool to obtain the scanned document corresponding to the target document; if the target mark of the converted document is the OCR type, performing OCR structural recognition on the converted document with an OCR parsing tool to obtain the scanned document corresponding to the target document; and if the target mark of the converted document is the text type, performing text parsing on the target document with a text parsing tool to obtain the table document corresponding to the target document.
The method further comprises: if structural parsing with the table parsing tool does not yield the scanned document corresponding to the target document, performing OCR structural recognition with the OCR parsing tool to obtain the scanned document corresponding to the target document; and if OCR structural recognition with the OCR parsing tool does not yield the scanned document corresponding to the target document, performing text parsing with the text parsing tool to obtain the scanned document corresponding to the target document.
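The fallback order described above (table parsing, then OCR, then text parsing) can be sketched as a simple chain. The parser functions below are hypothetical stand-ins that return canned values, not real tools; treating a raised exception as "no document obtained" is also an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# A parser takes raw PDF bytes and returns parsed content, or None on failure.
Parser = Callable[[bytes], Optional[dict]]


def parse_with_fallback(
    pdf_bytes: bytes, parsers: Sequence[Tuple[str, Parser]]
) -> Optional[Tuple[str, dict]]:
    """Try each parser in order and return the first non-empty result."""
    for name, parser in parsers:
        try:
            result = parser(pdf_bytes)
        except Exception:
            # Assumption: a parser error counts as "document not obtained".
            result = None
        if result:
            return name, result
    return None


# Hypothetical stand-in parsers used only to exercise the chain.
def table_parser(pdf_bytes: bytes) -> Optional[dict]:
    return None  # simulate a failed structural parse


def ocr_parser(pdf_bytes: bytes) -> Optional[dict]:
    return {"rows": [["amount", "100"]]}


def text_parser(pdf_bytes: bytes) -> Optional[dict]:
    return {"text": "amount 100"}


CHAIN = [("table", table_parser), ("ocr", ocr_parser), ("text", text_parser)]
```

With these stand-ins, `parse_with_fallback(b"%PDF", CHAIN)` returns the OCR result because the table parser yields nothing. A document first classified as the OCR type could start from the second entry by passing `CHAIN[1:]`.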
Embodiments of the present application provide a computer program product comprising a computer program operable to cause a computer to perform part or all of the steps of any one of the document conversion methods recited in the method embodiments described above. The computer program product may be a software installation package.
It should be noted that, for simplicity of description, any embodiment of the document conversion method described above is shown as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may take place in other order or simultaneously, depending on the application. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments and that the acts referred to are not necessarily required in the present application.
Although the present application has been described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the figures, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various method embodiments of any of the document conversion methods described above may be performed by a program that instructs associated hardware, and that the program may be stored in a computer-readable memory, which may include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disk.
It should be appreciated that the processor of an embodiment of the present application may be an integrated circuit chip having signal processing capabilities.
In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in a processor or by instructions in software form. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The foregoing has described the embodiments of the present application in detail, and specific examples have been used herein to illustrate the principles and implementations of the document conversion method and apparatus. The description of the embodiments is only intended to aid understanding of the method and core ideas of the present application. Meanwhile, those skilled in the art may, following the ideas of the document conversion method and apparatus of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, hardware products, and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
A target classification of the table type for the converted document means that the target document is a machine-generated PDF document.
Meanwhile, in the scenario of the embodiments of the present application, the service data in the PDF document must satisfy certain structuring requirements, so the target classification can be determined to be the table type, and the scanned document is then obtained by structural parsing.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Anything not described in detail in this specification is well known to those skilled in the art.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A document conversion method, device and storage medium, characterized in that the method specifically comprises the following steps:
S1, acquiring a document for format conversion;
S2, marking the converted document and determining the mark type of the converted document, wherein the mark type is a text mark, a table mark or an OCR mark;
S3, processing the target document according to the target mark to obtain a target file corresponding to the target document;
S4, converting the target file to obtain the converted target document.
2. The document conversion method, apparatus and storage medium according to claim 1, wherein: according to the user configuration, the path of the document to be converted is acquired from the configuration so as to acquire the document file to be converted, and a target file path, a target file format and a target file name are acquired;
processing the target document according to the mark of the table file to obtain a table document corresponding to the target document includes:
scanning the content of the converted document and generating a scanned document of the converted document, wherein the scanned document comprises at least a document title, a document type, a scan mark type and a scan capacity; and
marking the converted document according to the scanned document to generate a table document.
3. The document conversion method, apparatus and storage medium according to claim 2, wherein scanning the content of the converted file to obtain the scanned document includes:
reading the converted document, scanning the converted document, and grouping the document content by type to obtain a plurality of document content groups; and
performing functional scanning on each of the plurality of document content groups to obtain a scanned file.
4. The document conversion method, apparatus and storage medium according to claim 2, wherein obtaining the table document corresponding to the target document according to the scanned file includes:
if the functional group of the scanned file is text, performing text parsing on the target document with a text parsing tool to obtain the scanned document corresponding to the target document;
if the functional group of the scanned file is table, performing structural parsing on the target document with a table parsing tool to obtain the scanned document corresponding to the target document;
and if the functional group of the scanned file is OCR, performing OCR structural recognition on the target document with an OCR parsing tool to obtain the scanned document corresponding to the target document.
5. The document conversion method, apparatus and storage medium according to claim 2, further comprising:
if structural parsing with the table parsing tool does not yield the table document corresponding to the target document, performing OCR structural recognition with the OCR parsing tool to obtain the table document corresponding to the target document;
and if OCR structural recognition with the OCR parsing tool does not yield the table document corresponding to the target document, performing text parsing with the text parsing tool to obtain the scanned document corresponding to the target document.
6. The document conversion method, apparatus and storage medium according to claim 3, wherein: before OCR structural recognition is performed on the target document with the OCR parsing tool, the method further includes: determining the target classification of the target document to be the OCR type, thereby completing the reclassification and grading of the target document; and before text parsing is performed on the target document with the text parsing tool, the method further includes: determining the target classification of the target document to be the text type, thereby completing the reclassification and grading of the target document.
7. The method, apparatus and storage medium for converting documents according to any one of claims 1 to 4, wherein processing the converted document according to the target classification to obtain a scanned document corresponding to the converted document comprises:
performing table processing on the target document according to the target classification to obtain a table document corresponding to the target document; and performing OCR (optical character recognition) processing on the target document to obtain a verification table document corresponding to the target document;
the method further comprises: verifying the correctness of the character string message converted from the scanned document against the character string message converted from the verification table document.
8. The method, apparatus and storage medium for converting documents according to claim 5, wherein verifying the correctness of the character string message converted from the table document against the character string message converted from the verification scanned document comprises:
verifying the numeric content of the character string message converted from the table document, and determining the correctness of the character type of the numeric content;
and verifying the key items of the character string message converted from the table document, and determining that the numeric content included in the key items matches the text content included in the key items.
9. A document conversion apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a converted document;
a description unit configured to scan the converted document to obtain a scanned document of the converted document;
and a marking unit configured to mark the converted document and determine the mark type of the target document, wherein the mark type is a text mark, a table mark, a picture mark, an attachment mark or an OCR mark.
10. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-5.
CN202310347142.4A 2023-03-30 2023-03-30 Document conversion method, device and storage medium Pending CN116384344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310347142.4A CN116384344A (en) 2023-03-30 2023-03-30 Document conversion method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310347142.4A CN116384344A (en) 2023-03-30 2023-03-30 Document conversion method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116384344A (en) 2023-07-04

Family

ID=86980187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310347142.4A Pending CN116384344A (en) 2023-03-30 2023-03-30 Document conversion method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116384344A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556078A (en) * 2024-01-11 2024-02-13 北京极致车网科技有限公司 Visual vehicle registration certificate file management method and device and electronic equipment
CN117556078B (en) * 2024-01-11 2024-03-29 北京极致车网科技有限公司 Visual vehicle registration certificate file management method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination