CN116484833A

CN116484833A - Document analysis method and device

Info

Publication number: CN116484833A
Application number: CN202310301944.1A
Authority: CN
Inventors: 简仁贤
Original assignee: Emotibot Technologies Ltd
Current assignee: Emotibot Technologies Ltd
Priority date: 2023-03-24
Filing date: 2023-03-24
Publication date: 2023-07-25

Abstract

The application relates to a document analysis method and a document analysis device in the technical field of computers. The method comprises: obtaining a document to be analyzed; classifying the document to be analyzed to obtain a document analysis mode; performing information extraction on the document to be analyzed based on the document analysis mode to obtain document analysis information and document characteristic information corresponding to the document analysis information; and determining a document analysis result based on the document analysis information and the document characteristic information. Because the document analysis mode is obtained by classifying the document to be analyzed and is then used for information extraction, different types of documents to be analyzed can be analyzed effectively, which solves the problem in conventional document analysis technology that a specific document analysis model can only analyze a specific type of document.

Description

Document analysis method and device

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for analyzing a document.

Background

Currently, natural language processing (NLP) technology is a major branch of artificial intelligence applications and mainly concerns the processing and interpretation of text data by machine. However, most documents that are routinely processed are not text that a machine can process directly. Documents such as pictures, PDFs, PPTs, and Word documents are human-readable in nature and require a series of processing steps before they can be converted into a machine-readable data structure.

The existing document analysis technology generally utilizes a specific document analysis model to perform document identification analysis in combination with an OCR technology, so as to identify text contents of a document such as PDF and the like, and realize the conversion of the document into text data. However, the existing document analysis technology has a narrow application range, a specific document analysis model can only identify a specific document, analysis processing cannot be performed on different types of documents, and effective analysis of the different types of documents cannot be achieved.

Disclosure of Invention

The application provides a document analysis method and a document analysis device, which are used for effectively analyzing different types of documents to be analyzed, and solve the problems of the existing document analysis technology that a specific document analysis model is used for document analysis.

In a first aspect, the present application provides a document parsing method, including:

acquiring a document to be analyzed;

classifying according to the document to be analyzed to obtain a document analysis mode;

based on the document analysis mode, carrying out information extraction processing on the document to be analyzed to obtain document analysis information and document characteristic information corresponding to the document analysis information;

and determining a document analysis result based on the document analysis information and the document characteristic information.

Optionally, the classifying processing according to the document to be analyzed to obtain a document analysis mode corresponding to the document to be analyzed includes:

carrying out document identification on the document to be analyzed through a preset classifier to obtain document type information of the document to be analyzed;

acquiring a preset analysis configuration file;

and searching a document analysis flow matched with the document type information in the analysis configuration file to obtain a document analysis mode corresponding to the document to be analyzed.

Optionally, the performing information extraction processing on the document to be parsed based on the document parsing mode to obtain document parsing information and document feature information corresponding to the document parsing information includes:

determining the document state of the document to be analyzed;

if the document state is a document encryption state, decrypting the document to be analyzed to obtain document decryption information;

and based on the document analysis mode, carrying out information extraction processing on the document to be analyzed by combining the document decryption information to obtain the document analysis information and the document characteristic information.

Optionally, performing information extraction processing on the document to be parsed to obtain the document parsing information and the document feature information, including:

extracting pages of the document to be analyzed to obtain a page list to be analyzed, wherein the page list to be analyzed comprises at least one page to be analyzed;

aiming at the page to be analyzed, analyzing by utilizing the document analysis mode to obtain chart analysis information and text analysis information of the page to be analyzed;

the chart analysis information and/or the text analysis information are used as the document analysis information;

and extracting features of the page to be analyzed according to the document analysis information to obtain document feature information, wherein the document feature information comprises chart feature information corresponding to the chart analysis information and text feature information corresponding to the text analysis information.

Optionally, the analyzing the page to be analyzed by using the document analysis mode to obtain chart analysis information and text analysis information of the page to be analyzed includes:

when the document analysis mode comprises a chart analysis mode, chart analysis is carried out on the page to be analyzed by utilizing the chart analysis mode, so that chart analysis information is obtained;

and when the document analysis mode comprises a text analysis mode, performing text analysis on the page to be analyzed by using the text analysis mode to obtain text analysis information.

Optionally, the performing chart analysis on the page to be analyzed by using the chart analysis mode to obtain chart analysis information includes:

carrying out chart identification on the page to be analyzed through a preset algorithm to obtain a chart identification result;

if the chart identification result is a chart identification success result, a preset chart analyzer is utilized to analyze the chart of the page to be analyzed, and chart information is obtained;

carrying out logic structure analysis on the chart information to obtain logic structure information;

and generating chart analysis information based on the chart information and the logic structure information.

Optionally, the determining a document parsing result based on the document parsing information and the document feature information includes:

aiming at the document analysis information, reading analysis is carried out by utilizing the document characteristic information to obtain reading sequence information;

carrying out structural clustering on the document analysis information according to a preset structural algorithm by utilizing the document characteristic information to obtain document structural information;

integrating and outputting the document analysis information according to the reading sequence information and combining the document structure information to obtain target document analysis information;

and generating the document analysis result based on the target document analysis information.

Optionally, the reading analysis is performed on the document analysis information by using the document feature information to obtain reading sequence information, including:

aiming at the document analysis information, carrying out paragraph processing by utilizing the coordinate information in the document characteristic information to obtain paragraph merging information;

and aiming at the paragraph merging information, sequencing the reading sequence of the document analysis information to obtain reading sequence information.

Optionally, the generating the document parsing result based on the target document parsing information includes:

determining the coding format of the target document analysis information;

if the coding format is a special coding format, coding correction is carried out on the target document analysis information according to a preset coding correction format, and a document analysis result is obtained;

if the coding format is not a special coding format, the target document analysis information is directly used as the document analysis result.

In a second aspect, the present application provides a document parsing apparatus, including:

the document to be analyzed obtaining module is used for obtaining the document to be analyzed;

the classification processing module is used for classifying the document to be analyzed through a preset classifier to obtain a document analysis mode corresponding to the document to be analyzed;

the feature extraction processing module is used for carrying out feature extraction processing on the document to be analyzed by utilizing the document analysis mode to obtain document analysis information and document feature information corresponding to the document analysis information;

and the document analysis result determining module is used for determining a document analysis result based on the document analysis information and the document characteristic information.

In summary, in the embodiments of the application, a document to be analyzed is obtained; the document is classified to obtain a document analysis mode; information extraction is performed on the document based on the document analysis mode to obtain document analysis information and the corresponding document characteristic information; and a document analysis result is determined based on the document analysis information and the document characteristic information. Because the document analysis mode is obtained by classifying the document to be analyzed and is then used for information extraction, different types of documents to be analyzed can be analyzed effectively, which solves the problem in conventional document analysis technology that a specific document analysis model can only analyze a specific type of document.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a document parsing method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of steps of a document parsing method according to an alternative embodiment of the present application;

FIG. 3 is a YAML file configuration diagram provided in an alternative embodiment of the present application;

FIG. 4 is a diagram of a domainParser file configuration provided in an alternative embodiment of the present application;

FIG. 5 is a flow chart of document parsing provided in an alternative embodiment of the present application;

FIG. 6 is a diagram illustrating an analytical example of a chart provided in an alternative embodiment of the present application;

FIG. 7 is a block diagram of a document parsing apparatus according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, embodiments of the present application. All other embodiments that a person of ordinary skill in the art can obtain from the embodiments herein without inventive effort fall within the scope of the present application.

In the related art, existing document analysis technology, such as PDF analysis technology, can only be configured with partial extraction capability and analyzes PDF documents one by one. It has no generalization capability and cannot be extended or secondarily developed. Document data is easily lost during analysis; for example, the color and font format of characters in the document may be lost. In addition, reading sequence analysis cannot be performed on the text after the document is analyzed, so the readability of the resulting text is low.

To solve these problems, the embodiments of the present application provide a document parsing method and apparatus. A document to be parsed is acquired; the document is classified to obtain a document parsing mode; information extraction is performed on the document based on the document parsing mode to obtain document parsing information and the corresponding document feature information; and a document parsing result is determined based on the document parsing information and the document feature information. Documents to be parsed in different formats are classified to obtain the document parsing mode corresponding to each, and that mode is then used to extract information from the document. This achieves effective parsing of different types of documents, solves the problem in conventional document parsing technology that a specific document parsing model can only parse a specific type of document, improves the generalization capability of the document parsing function, effectively avoids the loss of document data, and improves the readability of the parsed document.

For the purpose of facilitating an understanding of the embodiments of the present application, reference will now be made to the drawings and specific examples, which are not intended to limit the embodiments of the present application.

FIG. 1 is a schematic flow chart of a document parsing method according to an embodiment of the present application. As shown in FIG. 1, the document parsing method provided in the embodiment of the present application may specifically include the following steps:

Step 110, obtaining a document to be parsed.

Specifically, the document to be parsed in this embodiment may be in any of several document formats, including but not limited to PDF, picture PDF, PPT, and Word (for example, Word in doc format and Word in docx format).

Step 120, performing classification processing according to the document to be analyzed to obtain a document analysis mode.

Specifically, the document parsing mode may include a chart parsing mode and a text parsing mode, which is not limited in the embodiment of the present application. Documents to be parsed in different document formats can correspond to different document parsing modes. In this embodiment, after the document to be parsed is obtained, it may be classified. For example, the suffix name of the document to be parsed may be obtained through an algorithm or a document classifier, the document format may be determined from the suffix name, and the corresponding document parsing mode may then be determined from the document format.

In a specific implementation, the embodiment of the application may preset a corresponding document analysis mode for each document format. For example, when the document to be analyzed is a PDF document, the document analysis mode may be a PDF analysis mode; when the document to be analyzed is a Word document (such as a DOC document or a DOCX document), the document analysis mode may be a Word document analysis mode; and when the document to be analyzed is a PPT document, the document analysis mode may be a PPT analysis mode. The embodiment of the application does not limit the document analysis mode. With document analysis modes preset for the different formats, after the document to be analyzed is acquired, its document format is determined from its suffix name and the corresponding document analysis mode is obtained for that format, so that the chart information, text information, and characteristic information in the document can be obtained by analyzing the document according to that mode. When a new document format is added, document analysis can be realized simply by setting a corresponding document analysis mode. This improves the capability for extension and secondary development and solves the problem that existing document analysis technology cannot be extended or secondarily developed.
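The suffix-based selection described above can be sketched as a small lookup table. This is a minimal illustration only: the names `PARSE_MODES` and `select_parse_mode` and the mode labels are assumptions made for the example and do not appear in the application.

```python
from pathlib import Path

# Hypothetical registry mapping a file suffix to a parse-mode label.
# The labels are illustrative; the application does not name them.
PARSE_MODES = {
    ".pdf": "pdf_parse",
    ".doc": "word_parse",
    ".docx": "word_parse",
    ".ppt": "ppt_parse",
}


def select_parse_mode(filename: str) -> str:
    """Return the parse mode registered for the file's suffix."""
    suffix = Path(filename).suffix.lower()
    try:
        return PARSE_MODES[suffix]
    except KeyError:
        raise ValueError(f"no parse mode registered for {suffix!r}") from None


# Supporting a new format only requires registering one more entry.
PARSE_MODES[".pptx"] = "ppt_parse"
```

Under this sketch, adding a format does not touch the selection logic, which is the extensibility property the paragraph above describes.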

In actual processing, different documents may or may not contain charts, and for a document to be analyzed containing charts, information such as charts in the document to be analyzed can be analyzed in a chart analysis mode, and text information in the document to be analyzed can be analyzed in a text analysis mode. For example, PDF documents and PPT documents mostly contain charts, so in this embodiment, the chart-containing documents may be parsed by a chart parsing method to obtain chart information in the document to be parsed.

Step 130, based on the document analysis mode, extracting information from the document to be analyzed to obtain document analysis information and document characteristic information corresponding to the document analysis information.

Specifically, the document parsing information may include text parsing information and/or chart parsing information, which is not limited in the embodiments of the present application. The text parsing information may include text information, for example, characters and the font formats and font colors corresponding to the characters, and the embodiment of the present application does not limit the text parsing information. The chart parsing information may include table data of a table and/or picture information of a picture, which is not limited in the embodiment of the present application. The document feature information may include text feature information corresponding to the text parsing information and chart feature information corresponding to the chart parsing information, which is not limited in the embodiment of the present application. The text feature information may include features of each character in the text parsing information, such as the serial number of the character and the coordinates of the character. The chart feature information may include a table feature corresponding to the table and a picture feature corresponding to the picture. For example, the picture feature may include a storage link of the picture, the format of the picture, a hash value, length and width features, picture coordinates, and information around the picture (such as a picture title or a picture footer); the table feature may include the table type, information about the table (e.g., table footer, unit, title), and table coordinates, to which embodiments of the present application are not limited.
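One way to picture the feature information listed above is as a set of plain records. The following is a sketch under assumptions: the class and field names are invented for illustration, and the bounding-box convention `(x0, y0, x1, y1)` is an assumption, since the application only says that coordinates are recorded.

```python
from dataclasses import dataclass, asdict
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


@dataclass
class CharFeature:
    """Per-character text feature: serial number, coordinates, style."""
    index: int
    char: str
    bbox: BBox
    font: str = ""
    color: str = ""


@dataclass
class PictureFeature:
    """Picture feature: storage link, format, hash, size, surroundings."""
    link: str
    fmt: str
    hash_value: str
    width: int
    height: int
    bbox: BBox
    title: str = ""
    footer: str = ""


@dataclass
class TableFeature:
    """Table feature: type, coordinates, and surrounding information."""
    table_type: str
    bbox: BBox
    title: str = ""
    unit: str = ""
    footer: str = ""


# asdict() turns a record into a dict that can be serialized as JSON.
example = asdict(CharFeature(index=0, char="A", bbox=(10.0, 10.0, 16.0, 20.0)))
```

Keeping font and color on each character record is what allows the later stages to preserve styling that plain text extraction would drop.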

In this embodiment, information extraction may be performed on the document to be analyzed based on the document analysis mode. For example, chart information and text information may be analyzed from the document to obtain text analysis information and chart analysis information as the document analysis information, and feature extraction may then be performed for that document analysis information: for the text analysis information, text features are extracted from the document to obtain the corresponding text feature information; for the chart analysis information, chart features are extracted from the document to obtain the corresponding chart feature information. By extracting both document analysis information and document characteristic information in this way, document analysis and feature extraction are realized together, data such as text colors and text fonts are not lost, the integrity of the document data is ensured, and the problem that document data is lost during analysis in conventional document analysis technology is solved.

Step 140, determining a document analysis result based on the document analysis information and the document characteristic information.

Specifically, the document analysis result may include target document analysis information, which may be the analyzed document data. The data format of the target document analysis information may be the JavaScript Object Notation (JSON) data exchange format, that is, the target document analysis information may be analyzed document data in JSON format; it may of course also be data in another format.

In this embodiment, the document analysis information may be processed based on the document feature information. For example, for the text analysis information, characters belonging to the same paragraph may be merged based on the text feature information to obtain paragraph information. Reading order analysis may then be performed on the paragraph information and the chart analysis information based on the text feature information and the chart feature information, to determine the reading order of the text and charts in the document and obtain reading order information. The document analysis information is then integrated and output according to the reading order information, and an analysis file in JSON format is obtained as the target document analysis information, giving the document analysis result. This realizes effective analysis of different types of documents to be analyzed and solves the problem in conventional document analysis technology that a specific document analysis model can only analyze a specific type of document.
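The paragraph merging, reading order, and JSON output described above can be sketched as follows. The sketch makes several assumptions that the application does not state: text arrives as lines with a bounding box `(x0, y0, x1, y1)`, the y axis grows downward, the page has a single column, and two lines belong to the same paragraph when the vertical gap between them is at most `max_gap`. All function names are illustrative.

```python
import json


def merge_lines(lines, max_gap=4.0):
    """Merge text lines into paragraphs when the vertical gap is small."""
    paragraphs = []
    for line in sorted(lines, key=lambda l: (l["bbox"][1], l["bbox"][0])):
        last = paragraphs[-1] if paragraphs else None
        if last is not None and line["bbox"][1] - last["bbox"][3] <= max_gap:
            # Same paragraph: append the text and grow the bounding box.
            x0, y0, x1, y1 = last["bbox"]
            lx0, _, lx1, ly1 = line["bbox"]
            last["text"] += " " + line["text"]
            last["bbox"] = (min(x0, lx0), y0, max(x1, lx1), max(y1, ly1))
        else:
            paragraphs.append(
                {"type": "paragraph", "text": line["text"], "bbox": tuple(line["bbox"])}
            )
    return paragraphs


def reading_order(blocks):
    """Order blocks top to bottom, then left to right (single column)."""
    return sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))


def to_json(lines, charts):
    """Merge, order, number, and serialize text and chart blocks."""
    blocks = reading_order(merge_lines(lines) + [dict(c) for c in charts])
    for order, block in enumerate(blocks):
        block["order"] = order
    return json.dumps({"blocks": blocks}, ensure_ascii=False)
```

A multi-column layout would need a more careful ordering rule than the simple sort used here; the sketch only shows how coordinates drive both merging and ordering.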

Therefore, in the embodiment of the application, a document to be analyzed is obtained; the document is classified to obtain a document analysis mode; information extraction is performed on the document based on the document analysis mode to obtain document analysis information and the corresponding document characteristic information; and a document analysis result is determined based on the document analysis information and the document characteristic information. Because the document analysis mode is obtained by classifying the document to be analyzed and is then used for information extraction, different types of documents to be analyzed can be analyzed effectively, which solves the problem in conventional document analysis technology that a specific document analysis model can only analyze a specific type of document.

Referring to fig. 2, a schematic step flow diagram of a document parsing method according to an alternative embodiment of the present application is shown. The document analysis method specifically comprises the following steps:

Step 210, obtaining a document to be parsed.

As an example, a document parsing framework may be constructed in advance to perform document acquisition, document classification, document parsing, and the like on a document to be parsed; for example, a Pyparser framework may be built in the Python programming language as the document parsing framework. The Pyparser framework may provide an interface for inputting and obtaining files. After the Pyparser framework starts, a configuration file and a document to be parsed may be passed into it through the interface. The configuration file may be a YAML (YAML Ain't Markup Language) file, which describes the processes that must be executed to convert the document to be parsed into a target parsed document (e.g., a JSON document) and how the document is output as the target parsed document after that series of processes. For example, referring to FIG. 3, which is a YAML file configuration diagram provided in an embodiment of the present application, the Pyparser framework may be configured with a DocParser, and the DocParser may include one or more YAML files that describe the flow of document parsing and conversion. A typical DocParser YAML file may include multiple pipes. Each pipe implements a certain operation on the document, such as extracting features of the document or parsing elements of the document. All pipes conform to the definition of the pipe interface, that is, they implement a process method that takes the document as input and outputs a JSON object. A pipe may have various parameters, which may be defined in the YAML file. The pipe methods may include, but are not limited to: a decryption method (DecryptPDF), a page extraction method (PDFPageSplit), a picture conversion method (ExtractPDFImage), a picture text parsing method (pdfminextraction), a table parsing method (camelottablemaster), a table logic result parsing method (tablepowerparamser), a paragraph merging method (pdfparagraph blockProposer), a reading order parsing method (BlockOrder), an identification adding method (addlockid), a document structure parsing method (DocTreeParser), and an encoding method (charencodeceFixer).
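The pipe interface described above, in which every pipe implements a process method and takes its parameters from the configuration, can be sketched as follows. This is not the Pyparser implementation: the classes `PageSplit` and `CountPages`, the `REGISTRY` table, and the configuration layout are assumptions for illustration, and the configuration is written as a Python list standing in for what a YAML loader might return.

```python
from abc import ABC, abstractmethod


class Pipe(ABC):
    """Assumed pipe interface: parameters in, process(document) out."""

    def __init__(self, **params):
        self.params = params

    @abstractmethod
    def process(self, document: dict) -> dict:
        """Take the document, return a JSON-serializable object."""


class PageSplit(Pipe):
    """Hypothetical pipe: split raw text into pages on a separator."""

    def process(self, document: dict) -> dict:
        separator = self.params.get("separator", "\f")
        document["pages"] = document["raw"].split(separator)
        return document


class CountPages(Pipe):
    """Hypothetical pipe: record how many pages were found."""

    def process(self, document: dict) -> dict:
        document["page_count"] = len(document["pages"])
        return document


# Maps the pipe names used in the configuration to their classes.
REGISTRY = {"PageSplit": PageSplit, "CountPages": CountPages}


def build_pipeline(config):
    """Instantiate pipes in order from a list of {name, params} entries."""
    return [REGISTRY[step["name"]](**step.get("params", {})) for step in config]


def run_pipeline(pipes, document: dict) -> dict:
    """Feed the output of each pipe into the next one."""
    for pipe in pipes:
        document = pipe.process(document)
    return document
```

Because each pipe only depends on the shared interface, the order and parameters of the flow can be changed in the configuration without editing any pipe.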

In actual processing, the Pyparser framework may further be provided with a document downloader. When the input received through the interface is a download link for the document to be parsed, the Pyparser framework may use the document downloader to download the document to be parsed from that link.

Step 220, carrying out document identification on the document to be analyzed through a preset classifier to obtain document type information of the document to be analyzed.

Specifically, the document type information may include a document format, such as PDF format, PPT format, doc format, docx format, image format, PPTX format, Excel format, Markdown format, HTML format, email format, XMind format, and other formats, which are not limited in the embodiments of the present application.

In this embodiment, a classifier for classifying the document may be preset, and the document format of the document to be parsed is identified by the classifier, for example, the classifier may identify the suffix name of the document to be parsed, and determine the document format of the document to be parsed according to the suffix name, so as to be used as the document type information of the document to be parsed.

As an example, referring to FIG. 4, in order to classify and recognize documents on the basis of the DocParser, a domainParser may be constructed, which may also be in YAML format. A trigger (Trigger) is added to the domainParser as the classifier for document classification, to determine which document parsing mode is used to parse the document to be parsed. The Trigger may provide a document input interface to obtain the document to be parsed and output a Boolean value, that is, True or False, according to the document, thereby recognizing and classifying it. Three types of Trigger may be implemented: one determines the type of the document to be parsed from its file name, one determines the type from its suffix, and one automatically recognizes whether the document to be parsed is a scanned PDF. This example is not limited thereto.
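The three kinds of Trigger described above can be sketched as callables that return a Boolean. The class names, the `route` helper, and in particular the scanned-PDF heuristic (a PDF whose extracted text is empty) are assumptions for illustration; the application does not say how a scanned PDF is recognized, and the text extractor is passed in as a stand-in function.

```python
from pathlib import Path
from typing import Callable, Optional


class FilenameTrigger:
    """Fires when the file name contains a keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword.lower()

    def __call__(self, path: str) -> bool:
        return self.keyword in Path(path).name.lower()


class SuffixTrigger:
    """Fires when the file suffix is one of the given suffixes."""

    def __init__(self, *suffixes: str):
        self.suffixes = {s.lower() for s in suffixes}

    def __call__(self, path: str) -> bool:
        return Path(path).suffix.lower() in self.suffixes


class ScannedPdfTrigger:
    """Fires for a PDF with no extractable text (assumed heuristic)."""

    def __init__(self, extract_text: Callable[[str], str]):
        self.extract_text = extract_text  # stand-in for a real extractor

    def __call__(self, path: str) -> bool:
        is_pdf = Path(path).suffix.lower() == ".pdf"
        return is_pdf and not self.extract_text(path).strip()


def route(path: str, rules) -> Optional[str]:
    """Return the parser name of the first rule whose trigger fires."""
    for trigger, parser_name in rules:
        if trigger(path):
            return parser_name
    return None
```

Rules are checked in order, so a more specific trigger, such as the scanned-PDF check, should be listed before a general suffix rule.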

Step 230, obtaining a preset parsing configuration file.

Specifically, the parsing configuration file may include a document parsing flow corresponding to each document format, for example, the parsing configuration file may be a YAML file, which is not limited in the embodiment of the present application.

In this embodiment, after the classifier identifies the document type of the document to be parsed, a preset parsing configuration file may be obtained, so that the document parsing mode of the document to be parsed may be determined by parsing the configuration file.

And step 240, searching a document analysis flow matched with the document type information in the analysis configuration file to obtain a document analysis mode corresponding to the document to be analyzed.

In this embodiment, after determining the document type of the document to be parsed, the classifier may obtain all document types in the parsing configuration file and the document parsing flows corresponding to them. The classifier may then query the parsing configuration file for the document type of the document to be parsed. If the document type and a corresponding document parsing flow exist in the parsing configuration file, that document parsing flow may be used as the document parsing mode of the document to be parsed.

As an example, referring to FIG. 5, the Pyparser framework may obtain an input YAML configuration and a document, take the document as the document to be parsed, identify the document format of the document to be parsed with a document classifier, and determine the parsing flow of the document to be parsed from the YAML configuration file according to the document format.

Step 250, determining the document state of the document to be parsed.

Specifically, the document state may include a document encryption state and a document unencrypted state, which is not limited by the embodiments of the present application. In this embodiment, the document to be parsed may be an encrypted document. Therefore, when the document to be parsed is parsed, its document state may be obtained first, and whether the document is encrypted is determined according to the document state: if the document state is the document encryption state, the document to be parsed is determined to be encrypted; if the document state is the document unencrypted state, the document to be parsed is determined not to be encrypted.

Step 260, if the document state is the document encryption state, performing decryption processing on the document to be parsed to obtain document decryption information.

Specifically, the document decryption information may include a decryption ciphertext of the document to be processed, which is not limited in the embodiment of the present application.

Specifically, when the document state is the document encryption state, the embodiment of the application can determine that the document to be processed is encrypted. The encrypted document to be processed can then be decrypted through a preset decryption method, and the decryption ciphertext of the document to be processed is determined and used as the document decryption information, so that the document to be processed can be decrypted by using the document decryption information to obtain a plaintext document to be processed. When the document state is the document unencrypted state, the document to be processed is determined to be unencrypted, and no decryption is needed.

As an example, referring to fig. 5, taking an encrypted PDF document as the document to be processed, the DecryptPDF method in the Pipeline may inspect the PDF document to determine whether it is encrypted, and if so, the DecryptPDF method may decrypt it. Some PDF documents are protected with a password and can be decrypted by the DecryptPDF method. Some PDFs are in an encrypted state but have an empty password, and can also be decrypted by the DecryptPDF method.

Step 270, based on the document parsing mode, performing information extraction processing on the document to be parsed in combination with the document decryption information to obtain the document parsing information and the document feature information.

In this embodiment, for a document to be processed, the document state of which is a document encryption state, the document to be parsed may be decrypted using the document decryption information to obtain a clear text document to be parsed, and then information extraction processing may be performed on the document to be parsed based on the document parsing manner to obtain document parsing information and document feature information; and for the document to be processed, the document state of which is the document unencrypted state, the information extraction processing can be directly carried out on the document to be analyzed based on the document analysis mode, so as to obtain the document analysis information and the document characteristic information.

Optionally, the information extraction processing is performed on the document to be parsed to obtain the document parsing information and the document feature information, which may include the following substeps:

Sub-step 2701, performing page extraction on the document to be parsed to obtain a list of pages to be parsed, where the list of pages to be parsed includes at least one page to be parsed.

In a specific implementation, the embodiment can paginate the document to be analyzed, dividing it into one or more pages to be analyzed, and form the page list to be analyzed from these pages, so that analysis and feature extraction can be performed on each page.

As an example, referring to fig. 5, a PDF document may be split by the PDF fpagesplit method in the Pipeline: each single-page PDF is extracted from the PDF document and then saved to an object storage server (Minio).

Sub-step 2702, analyzing the page to be analyzed by using the document analysis mode to obtain chart analysis information and text analysis information of the page to be analyzed.

In a specific implementation, for each page to be analyzed, the embodiment can analyze the page by using the document analysis mode. For example, whether the page to be analyzed contains a chart and/or a table can be identified by a preset chart analysis method; if it is determined that the page contains a chart and/or a table, chart analysis can be performed on the page to obtain chart analysis information, and text analysis can be performed on the text in the page to obtain text analysis information.

In actual processing, to make the page to be analyzed easier to analyze, a data structure convenient for computer processing may be predefined. The data structure may be a Doc object, which is not limited in the embodiments of the present application. The Doc object can be converted into a JSON object, and a JSON file is finally output. After a document to be parsed, such as a PDF, has been parsed into a JSON file, a user can convert the JSON file into the Doc document when it needs to be read, which is convenient for reading, and the Doc document can be converted into a JSON file as required, which is convenient for storage. The outermost layer of the JSON object contains parser (Parser) information and the Doc object.

The Doc object represents the parsing result of a document and is divided into five levels from largest to smallest: Doc, page (page), block (block), character group (charGroup), and character (char).

The Doc object may include a list of page objects and the meta information (meta) of the Doc object, such as the document name and the document download path.

The page object may include a list of block objects and the meta information of the page object, such as the picture information, size, and width of a given page of the document.

The block object may be a text block (TextBlock), picture block (ImageBlock), table block (TableBlock), chart block (ChartBlock), or the like, and represents a paragraph, text block, or picture block of the document. Each block may have attributes specific to its type (for example, a table block has table data and a picture block has picture information) as well as attributes common to all blocks, such as a sequence number (id), text information (text), a meta object, and a list of charGroup objects. The meta information of a block may include the coordinates of the block and the document structure tree (doctree) attribute of the block.

The charGroup object is a smaller set of consecutive characters, such as a word in English, a line of characters in Chinese, or the characters in a table cell. It has a text attribute, a meta attribute, and a list of char objects.

The char object is the smallest, character-level data. It has a text attribute and a meta attribute, and the meta attribute includes information such as the character coordinates, the character size, whether the character is bold, and whether it is a superscript or subscript.

The meta information at all five levels is extensible, and the feature information of the corresponding level is stored in it. Because each level is defined, the information at every level can be parsed when a document is parsed, which effectively solves the problem that document data is easily lost during document parsing, and gives the method generalization capability, strong extensibility, and secondary development capability.
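The five-level hierarchy and its JSON form can be sketched with plain dataclasses. This is a minimal sketch under stated assumptions: the field names (`kind`, `char_groups`, and so on), the `doc_to_json` helper, and the keys `"parser"` and `"doc"` of the outer JSON object are chosen for illustration; the application only specifies that the outermost layer holds parser information and the Doc object.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Char:
    text: str
    meta: dict = field(default_factory=dict)  # e.g. coordinates, size, bold


@dataclass
class CharGroup:
    text: str
    chars: list = field(default_factory=list)  # list of Char
    meta: dict = field(default_factory=dict)


@dataclass
class Block:
    id: int
    text: str
    kind: str = "TextBlock"  # or ImageBlock, TableBlock, ChartBlock
    char_groups: list = field(default_factory=list)  # list of CharGroup
    meta: dict = field(default_factory=dict)  # e.g. coordinates, doctree


@dataclass
class Page:
    blocks: list = field(default_factory=list)  # list of Block
    meta: dict = field(default_factory=dict)  # e.g. page size


@dataclass
class Doc:
    pages: list = field(default_factory=list)  # list of Page
    meta: dict = field(default_factory=dict)  # e.g. document name


def doc_to_json(doc: Doc, parser_info: dict) -> str:
    """Serialize parser information and the Doc object as one JSON object."""
    return json.dumps({"parser": parser_info, "doc": asdict(doc)}, ensure_ascii=False)


# Build a one-character document and serialize it.
char = Char("H", {"pos": [10, 10, 6, 12], "bold": True})
group = CharGroup("H", [char])
block = Block(0, "H", "TextBlock", [group], {"pos": [10, 10, 6, 12]})
doc = Doc([Page([block], {"width": 595})], {"name": "sample.pdf"})
payload = doc_to_json(doc, {"name": "pyparser"})
```

Keeping `meta` as an open dictionary at every level mirrors the extensibility described above: a new feature can be stored without changing the class definitions.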

In an optional embodiment, analyzing the page to be analyzed by using the document analysis mode to obtain the chart analysis information and text analysis information of the page may specifically include: when the document analysis mode includes a chart analysis mode, performing chart analysis on the page to be analyzed by using the chart analysis mode to obtain the chart analysis information; and when the document analysis mode includes a text analysis mode, performing text analysis on the page to be analyzed by using the text analysis mode to obtain the text analysis information.

In an optional embodiment, performing chart analysis on the page to be analyzed by using the chart analysis mode to obtain the chart analysis information may specifically include: performing chart identification on the page to be analyzed through a preset algorithm to obtain a chart identification result; if the chart identification result indicates that a chart was successfully identified, analyzing the chart of the page to be analyzed by using a preset chart analyzer to obtain chart information; performing logic structure analysis on the chart information to obtain logic structure information; and generating the chart analysis information based on the chart information and the logic structure information.

In a specific implementation, after the chart is analyzed, a two-dimensional table containing merging information of table cells can be obtained to be used as chart information, then logic structure analysis can be performed on the chart information, which cells in the table are titles (Label) and which cells are data (Entry) are identified, logic structure information is obtained, and then the chart information and the logic structure information are used as chart analysis information.

Sub-step 2703, taking the chart analysis information and/or the text analysis information as the document analysis information.

In this embodiment, after chart analysis and text analysis are performed on the page to be analyzed, the analyzed chart analysis information and/or the analyzed text analysis information may be used as document analysis information.

Sub-step 2704, performing feature extraction on the page to be parsed with respect to the document parsing information to obtain document feature information.

The document characteristic information comprises chart characteristic information corresponding to the chart analysis information and text characteristic information corresponding to the text analysis information.

In a specific implementation, the embodiment can extract features from the page to be analyzed according to the document analysis information. For example, for the text analysis information in the document analysis information, the character coordinates, the character size, whether a character is bold, whether it is a superscript or subscript, and the like can be extracted from the page to be analyzed as the text feature information; for the chart analysis information, table data, picture information, coordinates, and the like can be extracted from the page to be analyzed as the chart feature information.

In actual processing, the embodiment of the application may define the features (Feature) to be extracted in advance. Features refer to the features at the five levels of doc, page, and the block, charGroup, and char under a page. A developer may add custom features according to the Feature protocol by implementing only two methods, from_json and to_json, where from_json defines how to convert a JSON string into a Feature object, and to_json defines how to convert the Feature object into a JSON object. Features mainly include, but are not limited to, the following.

Title (Title) of the document: one of the simplest features, which may be the file name of the document.

String number feature (text): a relatively complex feature whose attributes include prefix, suffix, number type, and value. For example, for the subtitle text "3.1.2.feature interface", the prefix of the feature is "3.1.", the value is "2", the suffix is ".", and the number type is a hierarchical number (level digit). The number types also include Chinese numerals, Roman numerals, English letter sequences, circled numbers, and the like, and can be extended with custom types. The feature also implements a from_text method for automatically identifying the feature from text, and an interface for automatic training and clustering, so that a user can conveniently define a group of texts with similar features.

Document structure tree feature (DocTree): a document structure typically includes a hierarchy that starts from the main title and continues through first-level, second-level, and third-level titles, and so on. Of course, not all documents carry such format definitions. For plain text without format definitions, the DocTree needs to be identified from the text by an algorithm, and the specific algorithm for calculating the DocTree can be implemented by a Pipeline.

Coordinate feature (Pos): may appear in the meta of the page, block, charGroup, and char levels. It consists of four integer values x, y, w, and h, representing a box in a coordinate system whose origin is the upper left corner: the upper left corner of the box is (x, y), its width is w, and its height is h. On this basis, the Pos object also implements a series of methods, including calculating the area (area) of a box, the relation between two boxes (overlaps, contains, etc.), the minimum bounding rectangle (bounding_rect) of a list of Pos objects, and the horizontal or vertical projection (projection) of a list of Pos objects.

Picture feature (ImageUrl): includes image_url, image_type, image_hash, width, height, and the like.

Character OCR confidence feature (OcrProb): has only one numerical attribute, which represents the confidence of the OCR algorithm in recognizing the character.

Character superscript/subscript feature: has only one enumeration-type attribute, which indicates whether the character is a superscript, a subscript, or in the normal position.

Table feature (tablestructures): classifies each cell of a table as either the Entry type or the Label type (or no type, such as a blank cell in the upper left corner). Entry represents table data, and Label is a description of that data. Labels may appear in the top rows of the table or in its left columns, and there may be multiple rows of such Labels. The specific table feature calculation is implemented by a Pipeline.

Side information feature (SideInfo): information around a table or picture, including the footnotes, units, and title of a table and the title, footnotes, and the like of a picture.

Cross-page statistical feature (cross page feature): the meta of the doc level currently includes the minimum font size of the full text (min_font_size), the most common font size (most_common_font_size), and the line spacing (line_space). These are used in many places. For example, when PDF paragraphs are merged across pages, the line spacing serves as a reference for the top margin of the next page; when body text is identified, it can be checked whether the font size is the most common font size of the full text; and when superscripts and subscripts are judged, characters with small font sizes can be screened out.
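As an illustration of the Feature protocol, the coordinate feature (Pos) can be sketched as a small class that implements from_json and to_json together with the geometric helpers mentioned above. The method bodies, the choice of half-open boxes for the overlap test, and the module-level `bounding_rect` function are assumptions made for this sketch; the projection method is omitted for brevity.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Pos:
    """A box with upper left corner (x, y), width w and height h."""
    x: int
    y: int
    w: int
    h: int

    @classmethod
    def from_json(cls, text: str) -> "Pos":
        """Build a Pos from a JSON string."""
        d = json.loads(text)
        return cls(d["x"], d["y"], d["w"], d["h"])

    def to_json(self) -> dict:
        """Convert the Pos into a JSON-serializable object."""
        return {"x": self.x, "y": self.y, "w": self.w, "h": self.h}

    def area(self) -> int:
        return self.w * self.h

    def overlaps(self, other: "Pos") -> bool:
        """True when the two boxes share some interior area."""
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)

    def contains(self, other: "Pos") -> bool:
        """True when other lies entirely inside this box."""
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)


def bounding_rect(boxes: list) -> Pos:
    """Smallest box that encloses every Pos in a non-empty list."""
    x0 = min(b.x for b in boxes)
    y0 = min(b.y for b in boxes)
    x1 = max(b.x + b.w for b in boxes)
    y1 = max(b.y + b.h for b in boxes)
    return Pos(x0, y0, x1 - x0, y1 - y0)
```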

As an example, referring to fig. 5, for a picture in a PDF document, the PDF may be converted into pictures by the extrapdfimage method in the Pipeline; the converted picture is uploaded to Minio for storage, and the picture link (ImageUrl) for retrieving the picture is obtained as the picture analysis information in the chart analysis information.

For text characters and the like in a PDF document, text recognition can be performed on the PDF page by page through the PDFMINerExacte method in the Pipeline to obtain text analysis information. For example, the PDF is parsed with PDFMiner (a third-party PDF toolkit), which parses the pictures and text of a PDF well. The char text and coordinates of each page of the PDF document can be obtained through the PDFMINerExacte method; the char text is used as the text analysis information, and the char coordinates are used as the text feature information.

For a table in the PDF document, table recognition may be performed by the camelottablemaster method in the Pipeline, as shown in fig. 6, in which the PDF is parsed with Camelot (a third-party PDF table parser). For speed, a classification algorithm based on frame lines and blocks first performs a preliminary table screening, and only pages that may contain a table are then parsed with Camelot. Because calling Camelot directly gives poor results when a page contains multiple tables, the Pipeline may additionally call a preset image-based table region recognition algorithm; after the table regions are recognized, Camelot is called to parse the table in each table region, yielding a two-dimensional table that contains the merging information of the table cells. This two-dimensional table is used as the chart information: cells with the same id in the two-dimensional table represent merged cells, and a dictionary mapping each id to the cell content is used as the chart feature information.

The logical structure of the table can then be parsed with the tablearchitecture Parser method in the Pipeline, which is not limited to PDF but applies to all doc objects with table content. Its core function is to distinguish which cells in the table are descriptive titles (Label) and which cells are data (Entry). One parameter can be configured in the tablearchitecture Parser method: whether to execute in fast mode (fast_mode). When this value is true, a built-in rule-based algorithm is invoked, which simply classifies cells according to whether they are merged, and columns or rows containing merged cells automatically become Label. When fast_mode is false, the table cell classification algorithm developed by our algorithm team is called to obtain more accurate results. After the tablearchitecture Parser method parses the logical structure of the table to obtain the logic structure information, the chart analysis information can be generated from the chart information and the logic structure information, thereby achieving reasonable parsing of the chart.
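The fast-mode rule can be illustrated on the two-dimensional table of cell ids, where equal ids mark merged cells. The interpretation used here is an assumption, since the text does not spell out the rule in detail: a row becomes Label when it contains a cell merged across columns, and a column becomes Label when it contains a cell merged across rows. The function name is hypothetical.

```python
def classify_cells_fast(grid):
    """Label/Entry classification from merge information only.

    grid is a non-empty rectangular list of rows of cell ids; equal ids in
    adjacent positions denote one merged cell. Assumed rule: a row with a
    horizontally merged cell, or a column with a vertically merged cell,
    is a Label row or column; every other cell is an Entry.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    label_rows = {
        r for r in range(n_rows)
        if any(grid[r][c] == grid[r][c + 1] for c in range(n_cols - 1))
    }
    label_cols = {
        c for c in range(n_cols)
        if any(grid[r][c] == grid[r + 1][c] for r in range(n_rows - 1))
    }
    return [
        ["Label" if r in label_rows or c in label_cols else "Entry"
         for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

For a table whose first row is a single title cell merged across three columns, such as `[[0, 0, 0], [1, 2, 3], [4, 5, 6]]`, the first row is classified as Label and the remaining cells as Entry. A rule this simple cannot find unmerged header rows, which is consistent with the text offering a more accurate algorithm when fast_mode is false.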

Step 280, performing reading analysis on the document analysis information by using the document feature information to obtain reading order information.

Specifically, the reading order information may include a text reading order and a chart reading order, which is not limited in the embodiment of the present application. The embodiment can perform reading analysis on the document analysis information by using the document feature information. For example, for the text analysis information, the document feature information can be used to merge the characters in the text analysis information into paragraphs, and then to calculate the reading order of the merged paragraphs, giving the text reading order; for the chart analysis information, the chart feature information can be used to analyze the reading order of the charts, giving the chart reading order. The reading order information may then be derived from the text reading order and the chart reading order.

In an optional embodiment, performing reading analysis on the document analysis information by using the document feature information to obtain the reading order information may specifically include: performing paragraph processing on the document analysis information by using the coordinate information in the document feature information to obtain paragraph merging information; and sorting the document analysis information by reading order according to the paragraph merging information to obtain the reading order information.

As an example, referring to fig. 5, paragraphs may be merged according to text block coordinates by the PDFParagraphBlockProposer method in the Pipeline. The method works for any document that has text coordinates, so scanned pictures and even PPT documents may also use it. In the merging process, nearby characters are first merged into lines, and the page is then divided into regions; for example, if a picture or table sits in the middle of a page, the page is divided into an upper region and a lower region. Within each region, the left and right boundary coordinates of the text lines are counted, and paragraphs are then merged according to characteristics such as first-line indentation and last-line alignment.
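The last stage of the merging process, grouping the text lines of one region into paragraphs, can be sketched as follows. The function name, the input shape (text plus left and right x coordinates per line), and the tolerance value are assumptions; the sketch starts a new paragraph when a line is indented relative to the region's left boundary or when the previous line stops short of the right boundary.

```python
def merge_paragraphs(lines, tol=2):
    """Group the text lines of one region into paragraphs.

    lines: list of (text, x_left, x_right) tuples ordered top to bottom.
    tol:   coordinate tolerance when comparing against region boundaries.
    """
    if not lines:
        return []
    # Region boundaries are taken from the extreme line coordinates.
    left = min(x_left for _, x_left, _ in lines)
    right = max(x_right for _, _, x_right in lines)

    paragraphs = [[lines[0][0]]]
    for prev, cur in zip(lines, lines[1:]):
        indented = cur[1] - left > tol       # first-line indentation
        prev_short = right - prev[2] > tol   # previous line ended early
        if indented or prev_short:
            paragraphs.append([cur[0]])
        else:
            paragraphs[-1].append(cur[0])
    return [" ".join(p) for p in paragraphs]


sample_lines = [
    ("First line", 20, 100),
    ("continues here", 0, 100),
    ("and ends.", 0, 60),
    ("Second para", 20, 100),
    ("ends.", 0, 40),
]
```

Joining with a space suits English text; for Chinese text the lines would be concatenated without a separator.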

In actual processing, a unique id attribute can be added to each block, such as a chart block or a text block, through the AddBlockId method in the Pipeline. The AddBlockId method can be applied to the parsing flows of all document types, which reflects the reusability of the Pipeline. After the id attributes are added, the Block order method in the Pipeline can be used to calculate the reading order of the blocks from their coordinates. In theory, the reading order of blocks in a document with complex typesetting can be complex; the method may return the blocks in order from left to right and from top to bottom, treating blocks as aligned when their coordinates agree within a certain error range, to obtain the chart reading order and the text reading order as the reading order information. By performing reading analysis on the document analysis information, the reading order of the document analysis information is handled reasonably and the readability of the document is improved.
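A minimal sketch of the two steps, assuming each block is a dictionary with `x` and `y` coordinates of its upper left corner: ids are assigned first, then blocks whose y coordinates agree within a tolerance are treated as one row and read left to right, with rows read top to bottom. The function names and the tolerance are illustrative, and real layouts with multiple columns would need more than this rule.

```python
def add_block_ids(blocks):
    """Assign a unique, sequential id to every block (in place)."""
    for i, block in enumerate(blocks):
        block["id"] = i
    return blocks


def reading_order(blocks, tol=5):
    """Return block ids top to bottom, and left to right within a row.

    Blocks whose y coordinate is within tol of the first block of the
    current row are considered aligned and belong to that row.
    """
    rows = []
    for block in sorted(blocks, key=lambda b: b["y"]):
        if rows and abs(block["y"] - rows[-1][0]["y"]) <= tol:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b["id"] for row in rows for b in sorted(row, key=lambda b: b["x"])]


sample_blocks = add_block_ids([
    {"x": 300, "y": 12},   # right-hand block, slightly lower than its neighbour
    {"x": 10, "y": 10},    # left-hand block on the same visual row
    {"x": 10, "y": 200},   # block further down the page
])
```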

Step 290, performing structure clustering on the document analysis information according to a preset structure algorithm by using the document feature information to obtain document structure information.

In a specific implementation, the embodiment can utilize document feature information to perform structural clustering on document analysis information according to a preset structural algorithm, for example, text in the same line in text analysis information is determined through the document feature information, feature extraction is performed on each line of text, then a clustering algorithm is used for grouping similar texts into a group, the same group of texts is scored by utilizing the preset structural algorithm, and hierarchical division is performed according to scores to obtain the document structural information.

As an example, referring to fig. 5, the document structure tree may be parsed by the DocTreeParser method in the Pipeline. Features are extracted from each line of text through the text features, and texts of similar types are then grouped together by a clustering algorithm. The DocTree algorithm (i.e., the structure algorithm) then scores all text groups among the text blocks of the full text, and the group with the highest score becomes the first level. The groups within the text blocks between two first-level entries are then scored to find the second level of the DocTree, and the process recurses until no valid text group remains between two levels, yielding the document structure information.
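The recursion can be sketched as follows, assuming that clustering and scoring have already been done: each line carries the name of its text group (or None for body text), and each group has a score. The function name, the input shape, and the output node layout are hypothetical, and lines that appear before the first title of a level are simply skipped in this simplified version.

```python
def build_doctree(lines, scores):
    """Recursively build a title tree from grouped, scored lines.

    lines:  list of (text, group) tuples in reading order; group is None
            for lines that belong to no title group.
    scores: dict mapping group name to its score (assumed precomputed).
    """
    groups = {group for _, group in lines if group is not None}
    if not groups:
        return []  # no valid text group left: recursion ends
    top = max(groups, key=lambda g: scores[g])  # highest score is this level

    nodes, current = [], None
    for text, group in lines:
        if group == top:
            current = {"title": text, "body": []}
            nodes.append(current)
        elif current is not None:
            current["body"].append((text, group))

    # Lines between two titles of this level form the next level down.
    for node in nodes:
        node["children"] = build_doctree(node.pop("body"), scores)
    return nodes


sample_doc_lines = [
    ("1 Intro", "h1"), ("text", None), ("1.1 Scope", "h2"), ("text", None),
    ("2 Method", "h1"), ("2.1 Steps", "h2"), ("2.2 Output", "h2"),
]
sample_scores = {"h1": 0.9, "h2": 0.5}
```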

Step 300, integrating and outputting the document analysis information according to the reading order information in combination with the document structure information to obtain target document analysis information.

Specifically, the embodiment can sequentially sort the chart analysis information and the text analysis information in the document analysis information according to the reading sequence by using the reading sequence information, then integrate the text analysis information according to the hierarchy by using the document structure information, and output the sorted and integrated document according to the target format, for example, can output the document according to the JSON format, so as to obtain the target document analysis information.

Step 310, generating the document analysis result based on the target document analysis information.

In a specific implementation, after the target document analysis information is obtained, the target document analysis information can be used as a document analysis result. Therefore, the method and the device realize effective analysis of different types of documents to be analyzed, and solve the problems existing in the prior document analysis technology that a specific document analysis model is used for document analysis.

In actual processing, the document to be processed may contain special encodings, in which case the obtained target document analysis information may contain them as well. To prevent the special encodings from affecting the readability of the target document analysis information, encoding correction may be performed.

In an optional embodiment, the generating the document parsing result according to the target document parsing information may specifically include: determining the coding format of the target document analysis information; if the coding format is a special coding format, coding correction is carried out on the target document analysis information according to a preset coding correction format, and a document analysis result is obtained; if the coding format is not a special coding format, the target document analysis information is directly used as the document analysis result.

For example, referring to fig. 5, the encoding of the document to be processed may be identified by the charencodifixer method in the Pipeline to determine whether the document has a special encoding; when it does, the special encoding may be corrected to Unicode. The encoding correction may be performed during document parsing or after the target document parsing information is obtained, which is not limited in this example.
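The application does not define what counts as a special encoding or which correction format is used. As one possible illustration, the sketch below treats Unicode compatibility characters (such as ligatures and full-width letters, which are common in text extracted from PDFs) as special and corrects them with NFKC normalization from the Python standard library. Both function names and the choice of NFKC are assumptions for this example.

```python
import unicodedata


def has_special_encoding(text: str) -> bool:
    """Assumed test: text changes under NFKC normalization."""
    return unicodedata.normalize("NFKC", text) != text


def fix_encoding(text: str) -> str:
    """Correct special characters; leave ordinary text untouched."""
    if has_special_encoding(text):
        return unicodedata.normalize("NFKC", text)
    return text
```

For instance, the single ligature character U+FB01 in "\ufb01le" is replaced by the two letters "fi", which makes the extracted text searchable.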

In summary, in the embodiment of the application, the acquired document to be parsed is identified by the preset classifier to obtain its document type information, and the document parsing flow matching the document type information is then looked up in the acquired preset parsing configuration file to obtain the document parsing mode corresponding to the document to be parsed. When the document state of the document to be parsed is the document encryption state, the document is decrypted to obtain document decryption information, and information extraction processing is performed on the document based on the document parsing mode in combination with the document decryption information to obtain the document parsing information and the document feature information. Reading analysis is performed on the document parsing information by using the document feature information to obtain reading order information, and structure clustering is performed on the document parsing information according to the preset structure algorithm to obtain document structure information. The document parsing information is then integrated and output according to the reading order information in combination with the document structure information to obtain the target document parsing information, and the document parsing result is generated based on the target document parsing information. This realizes effective parsing of different types of documents to be parsed, solves the problems of existing document parsing technology in which a specific document parsing model is used for document parsing, improves the readability of the document parsing information, and prevents document data from being lost during parsing.

It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the embodiments.

As shown in fig. 7, the embodiment of the present application further provides a document parsing apparatus 700, including:

the to-be-parsed document acquisition module 710 is configured to acquire a to-be-parsed document;

the classification processing module 720 is configured to perform classification processing on the document to be parsed by using a preset classifier, so as to obtain a document parsing mode corresponding to the document to be parsed;

the feature extraction processing module 730 is configured to perform feature extraction processing on the document to be parsed by using the document parsing manner, so as to obtain document parsing information and document feature information corresponding to the document parsing information;

the document parsing result determining module 740 is configured to determine a document parsing result based on the document parsing information and the document feature information.

Optionally, the classification processing module 720 includes:

the document identification sub-module is used for carrying out document identification on the document to be analyzed through a preset classifier to obtain document type information of the document to be analyzed;

the analysis configuration file acquisition sub-module is used for acquiring a preset analysis configuration file;

and the searching sub-module is used for searching the document analysis flow matched with the document type information in the analysis configuration file to obtain a document analysis mode corresponding to the document to be analyzed.

Optionally, the feature extraction processing module 730 includes:

a document state determining sub-module, configured to determine a document state of the document to be parsed;

the decryption processing sub-module is used for decrypting the document to be analyzed when the document state is a document encryption state, so as to obtain document decryption information;

and the information extraction processing sub-module is used for carrying out information extraction processing on the document to be analyzed by combining the document decryption information based on the document analysis mode to obtain the document analysis information and the document characteristic information.

Optionally, the information extraction processing sub-module includes:

the page extraction unit is used for extracting the page of the document to be analyzed to obtain a page list to be analyzed, wherein the page list to be analyzed comprises at least one page to be analyzed;

the analysis unit is used for analyzing the page to be analyzed by utilizing the document analysis mode to obtain chart analysis information and text analysis information of the page to be analyzed;

a document analysis information determining unit, configured to use the chart analysis information and/or the text analysis information as the document analysis information;

and the feature extraction unit is used for extracting features of the page to be analyzed according to the document analysis information to obtain document feature information, wherein the document feature information comprises chart feature information corresponding to the chart analysis information and character feature information corresponding to the character analysis information.

Optionally, the parsing unit includes:

the chart analysis subunit is used for carrying out chart analysis on the page to be analyzed by utilizing the chart analysis mode to obtain chart analysis information when the document analysis mode comprises the chart analysis mode;

and the text analysis subunit is used for carrying out text analysis on the page to be analyzed by utilizing the text analysis mode to obtain text analysis information when the document analysis mode comprises the text analysis mode.
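As a sketch of how the two subunits might be selected, the example below models the document analysis mode as a set of flags, so that one page can be handled by the chart analysis mode, the text analysis mode, or both. The `ParseMode` enum and the dictionary layout of a page are assumptions made for illustration.

```python
from enum import Flag, auto


class ParseMode(Flag):
    CHART = auto()
    TEXT = auto()


def parse_page(page: dict, mode: ParseMode) -> dict:
    """Run only the parsers that the document analysis mode includes."""
    result = {}
    if ParseMode.CHART in mode:
        # Chart analysis subunit (here: copy pre-detected chart entries).
        result["chart"] = list(page.get("charts", []))
    if ParseMode.TEXT in mode:
        # Text analysis subunit (here: copy pre-detected text lines).
        result["text"] = list(page.get("lines", []))
    return result
```

Using flags keeps "chart only", "text only" and "chart and text" as values of one type, which matches the "and/or" wording used for the document analysis information.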

Optionally, the chart analysis subunit is specifically configured to perform chart identification on the page to be analyzed through a preset algorithm, so as to obtain a chart identification result; if the chart identification result is a chart identification success result, a preset chart analyzer is utilized to analyze the chart of the page to be analyzed, and chart information is obtained; carrying out logic structure analysis on the chart information to obtain logic structure information; and generating chart analysis information based on the chart information and the logic structure information.
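The four stages of the chart analysis subunit (identification, parsing, logical structure analysis, result generation) can be sketched as follows. The application does not specify the preset algorithm or the preset chart analyzer, so this example assumes, purely for illustration, that a chart is a run of text lines delimited by `|`; all function names are hypothetical.

```python
def recognize_chart(page_lines):
    # Assumed "preset algorithm": a line belongs to a chart if it contains "|".
    return [line for line in page_lines if "|" in line]


def parse_chart(rows):
    # Assumed "preset chart analyzer": split each row into trimmed cells.
    return [[cell.strip() for cell in row.strip("|").split("|")] for row in rows]


def analyze_structure(cells):
    # Logical structure: treat the first row as the header of the chart.
    return {"header": cells[0], "n_rows": len(cells) - 1, "n_cols": len(cells[0])}


def chart_parse(page_lines):
    rows = recognize_chart(page_lines)
    if not rows:
        return None  # chart identification did not succeed
    cells = parse_chart(rows)
    return {"cells": cells, "structure": analyze_structure(cells)}
```

Returning `None` when identification fails mirrors the condition that the chart analyzer runs only on a successful chart identification result.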

Optionally, the feature extraction processing module 730 includes:

the reading analysis sub-module is used for carrying out reading analysis on the document analysis information by utilizing the document characteristic information to obtain reading sequence information;

the structure clustering sub-module is used for carrying out structure clustering on the document analysis information according to a preset structure algorithm by utilizing the document characteristic information to obtain document structure information;

the integration output sub-module is used for integrating and outputting the document analysis information according to the reading sequence information and the document structure information to obtain target document analysis information;

and the document analysis result generation sub-module is used for generating the document analysis result based on the target document analysis information.
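A minimal sketch of structure clustering followed by integrated output is given below. The preset structure algorithm is not described in this passage, so the example assumes font size as the only feature and labels any block larger than the most common size as a heading; the block layout and function names are hypothetical.

```python
from collections import Counter


def cluster_structure(blocks):
    # Assumption: the most frequent font size is body text; larger sizes are headings.
    body_size = Counter(b["font_size"] for b in blocks).most_common(1)[0][0]
    return ["heading" if b["font_size"] > body_size else "body" for b in blocks]


def integrate(blocks, order, labels):
    # Emit blocks in reading order, each tagged with its structural role.
    return [{"role": labels[i], "text": blocks[i]["text"]} for i in order]
```

Here `order` stands for the reading sequence information (a list of block indices) and `labels` for the document structure information; the integrated list corresponds to the target document analysis information.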

Optionally, the reading analysis sub-module includes:

the paragraph processing unit is used for performing paragraph processing on the document analysis information by using the coordinate information in the document characteristic information, to obtain paragraph merging information;

and the reading sequence ordering unit is used for sorting the document analysis information into reading order based on the paragraph merging information, to obtain reading sequence information.
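Coordinate-based paragraph merging and reading-order sorting can be sketched as below. The example assumes each text line carries a top-left coordinate and a height, with `y` increasing downward, and merges a line into the previous paragraph when the vertical gap is at most half the previous line's height. The `Line` type and the `max_gap` threshold are illustrative assumptions, not values from the application.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x: float
    y: float       # top coordinate, increasing downward
    height: float


def merge_paragraphs(lines, max_gap=0.5):
    # Sort top-to-bottom, then left-to-right (a simple single-column assumption).
    ordered = sorted(lines, key=lambda l: (l.y, l.x))
    paragraphs = []
    for line in ordered:
        if paragraphs:
            prev = paragraphs[-1][-1]
            gap = line.y - (prev.y + prev.height)
            if gap <= max_gap * prev.height:
                paragraphs[-1].append(line)  # close enough: same paragraph
                continue
        paragraphs.append([line])            # otherwise start a new paragraph
    return paragraphs


def reading_order(paragraphs):
    # Join each merged paragraph into one string, preserving the sorted order.
    return [" ".join(l.text for l in p) for p in paragraphs]
```

A multi-column layout would need column detection before this sort; the sketch covers only the single-column case.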

Optionally, the document parsing result generating sub-module includes:

an encoding format determining unit for determining an encoding format of the target document parsing information;

the coding correction unit is used for carrying out coding correction on the target document analysis information according to a preset coding correction format when the coding format is a special coding format, so as to obtain a document analysis result;

and the document analysis result determining unit is used for directly taking the target document analysis information as the document analysis result when the coding format is not a special coding format.
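The branch between coding correction and direct output can be sketched as follows. The passage does not define the special coding format or the preset coding correction format, so this example assumes that "special" means the text contains Unicode compatibility characters (such as ligatures or full-width letters) and that correction means NFKC normalization with Python's standard `unicodedata` module.

```python
import unicodedata


def is_special_encoding(text: str) -> bool:
    # Assumption: text is "special" if NFKC normalization would change it.
    return unicodedata.normalize("NFKC", text) != text


def finalize_result(text: str) -> str:
    if is_special_encoding(text):
        # Coding correction branch: normalize compatibility characters.
        return unicodedata.normalize("NFKC", text)
    # Otherwise the target parsing information is used directly as the result.
    return text
```

For instance, text extracted from a PDF may contain the single ligature character U+FB01 where a reader expects the two letters "fi"; normalization restores the searchable form.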

It should be noted that the document analysis device provided in the embodiment of the present application may perform the document analysis method provided in any embodiment of the present application, and has the corresponding functions and beneficial effects of performing the document analysis method.

In a specific implementation, the document analysis device can be integrated in a piece of equipment, so that the equipment can perform classification processing and information extraction processing on the document to be analyzed to obtain document analysis information and document characteristic information, and determine a document analysis result based on the document analysis information and the document characteristic information. The equipment thus serves as an electronic device that realizes effective analysis of different types of documents to be analyzed. The electronic device may be formed by two or more physical entities or by one physical entity; for example, the electronic device may be a personal computer (PC), a computer, a server, or the like, which is not particularly limited in the embodiment of the present application.

As shown in fig. 8, an embodiment of the present application provides an electronic device, including a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 communicate with each other through the communication bus 114; the memory 113 is configured to store a computer program; and the processor 111 is configured to implement the steps of the document parsing method provided in any one of the foregoing method embodiments when executing the program stored in the memory 113. By way of example, the document parsing method may include the following steps: acquiring a document to be analyzed; classifying according to the document to be analyzed to obtain a document analysis mode; based on the document analysis mode, carrying out information extraction processing on the document to be analyzed to obtain document analysis information and document characteristic information corresponding to the document analysis information; and determining a document analysis result based on the document analysis information and the document characteristic information.
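The four method steps executed by the processor can be sketched end to end as follows. The sketch assumes classification by file suffix against a preset mapping and a trivial line-based extractor; `PARSE_CONFIG`, the mode names and the function names are hypothetical and only illustrate the flow of data between steps.

```python
from pathlib import Path

# Hypothetical preset configuration: file suffix -> document analysis mode.
PARSE_CONFIG = {".pdf": "pdf", ".docx": "word", ".png": "image"}


def classify(path: str) -> str:
    # Step 2: classify the document to obtain its analysis mode.
    return PARSE_CONFIG.get(Path(path).suffix.lower(), "plain_text")


def extract(content: str, mode: str):
    # Step 3: extract parsing information and matching feature information.
    parsed = [line for line in content.splitlines() if line.strip()]
    features = [{"index": i, "length": len(line), "mode": mode}
                for i, line in enumerate(parsed)]
    return parsed, features


def parse_document(path: str, content: str) -> dict:
    # Step 1 (acquisition) is represented by the path and content arguments.
    mode = classify(path)
    parsed, features = extract(content, mode)
    # Step 4: combine parsing and feature information into the result.
    return {"mode": mode,
            "blocks": [{"text": t, **f} for t, f in zip(parsed, features)]}
```

Keeping classification separate from extraction means a new document type only needs a new entry in the mapping and a matching extractor, which is the benefit the method attributes to selecting an analysis mode per document.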

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the document parsing method provided by any one of the method embodiments described above.

It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing is merely a specific embodiment of the application to enable one skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A document parsing method, comprising:

acquiring a document to be analyzed;

2. The method of claim 1, wherein the classifying according to the document to be parsed to obtain a document parsing method corresponding to the document to be parsed includes:

acquiring a preset analysis configuration file;

3. The method according to claim 1, wherein the performing information extraction processing on the document to be parsed based on the document parsing method to obtain document parsing information and document feature information corresponding to the document parsing information includes:

determining the document state of the document to be analyzed;

4. A method according to any one of claims 1 to 3, wherein performing information extraction processing on the document to be parsed to obtain the document parsing information and the document feature information, comprises:

5. The method of claim 4, wherein the parsing the page to be parsed by using the document parsing method to obtain chart parsing information and text parsing information of the page to be parsed comprises:

6. The method of claim 5, wherein the performing the chart parsing on the page to be parsed by using the chart parsing method to obtain chart parsing information includes:

7. The method of claim 1, wherein the determining a document parsing result based on the document parsing information and the document feature information comprises:

8. The method of claim 5, wherein the reading analysis is performed on the document analysis information by using the document feature information to obtain reading order information, and the method comprises:

9. The method of claim 5, wherein generating the document parsing result based on the target document parsing information comprises:

determining the coding format of the target document analysis information;

10. A document parsing apparatus, comprising: