CN117859122A - AI-enhanced audit platform including techniques for automated document processing - Google Patents


Info

Publication number
CN117859122A
Authority
CN
China
Prior art keywords
document
data
information
document package
signature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280057933.7A
Other languages
Chinese (zh)
Inventor
李中生
W·程
M·J·弗拉维尔
L·M·霍尔马克
N·A·利佐特
A·S·饶
K·M·梁
朱迪
T·德利莱
M·J·P·拉米瑞兹
万缘
R·R·辛格
V·班萨尔
S·荷达
A·辛格
S·S·赞吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pwc Product Sales Co ltd
Original Assignee
Pwc Product Sales Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pwc Product Sales Co ltd filed Critical Pwc Product Sales Co ltd
Priority claimed from PCT/US2022/073290 (WO2023279045A1)
Publication of CN117859122A


Abstract

Systems and methods for automated document processing for AI-enhanced audit platforms are provided. A system for determining the composition of a document package extracts substantive content information and metadata information from the document package and generates, based on the extracted information, output data representing the composition of the document package. A system for verifying a signature in a document extracts data representing the spatial location of each signature, generates a confidence level for each signature, and determines whether a signature criterion is satisfied based on the locations and the confidence levels. A system for extracting information from documents applies a set of data conversion processing steps to a plurality of received documents to generate structured data, and then applies a set of knowledge-based modeling processing steps to the structured data to generate output data extracted from the plurality of electronic documents.

Description

AI-enhanced audit platform including techniques for automated document processing
Cross Reference to Related Applications
This application claims the benefit of U.S. Provisional Application No. 63/217,119, filed June 30, 2021; U.S. Provisional Application No. 63/217,123, filed June 30, 2021; U.S. Provisional Application No. 63/217,127, filed June 30, 2021; U.S. Provisional Application No. 63/217,131, filed June 30, 2021; and U.S. Provisional Application No. 63/217,134, filed June 30, 2021, the entire contents of each of which are incorporated herein by reference.
Technical Field
The present invention relates generally to document processing, and more particularly to an AI-enhanced audit platform including techniques for automating document processing.
Background
AI-enhanced audit platforms benefit from automated document processing techniques including automated document classification and clustering, automated signature detection and verification, and automated information extraction from PDF documents and other document formats.
Disclosure of Invention
Known techniques for document classification do not adequately leverage context data to guide document classification, particularly in the context of an audit process. As described herein, the context data available in the audit process can be effectively and efficiently leveraged to improve the accuracy and efficiency of document classification and clustering used in AI-enhanced audit platforms.
In some embodiments, a system for automated document processing may be configured to perform automated document classification (e.g., classifying documents according to different document types) and/or document bundling. As described herein, as part of the audit review process, the system may apply a set of AI methods to leverage context data in conjunction with a multi-page document classification ML model to accurately determine the composition of a document package (document bundle), such as a document package received by an AI-enhanced audit platform.
Document processing often requires, for example for assurance purposes, verifying that a signature (or initials) appears in a particular area of a document or is associated with a particular topic in the document. There may be more than one portion, more than one topic, and/or more than one signature in a single document or document package. Known techniques for signature detection require manual review and verification, which is inefficient and inaccurate and does not allow large-scale processing of documents.
In some embodiments, a system for automated document processing may be configured to perform automated signature detection, including applying an AI model that learns where a signature may appear on a given document type. During document ingestion and processing, the system may verify that the document being processed actually has a signature at an expected/desired location within the document. The systems and methods provided herein may be used to automatically process documents to determine whether the documents provide evidence bearing the required and sufficient signatures to meet assurance criteria for shipment of goods, receipt of goods, contractual agreements, and the like.
Documents stored in PDF format, image format, and other formats may contain a large amount of information, and extracting the information may be an important component of AI-driven assurance processes and other tasks performed by AI-enhanced audit platforms. For example, AI-driven assurance processes may rely on automated extraction of data stored in PDFs so that invoices and/or other information (e.g., evidence information) may be fully considered, properly understood, and applied as part of the audit process. Efficient processing of documents may enable the audit process to take into account all available evidence (e.g., document) data in detail, rather than simply a small sample thereof.
In some embodiments, the document processing and information extraction systems described herein leverage a unique combination of (a) natural language processing using semantic and lexical analysis with (b) weak labeling based on fuzzy matching and deep learning based on text and computer vision. A combined model configured to extract information from PDFs may provide an ensemble of NLP, text, and computer vision models.
In some embodiments, a first system is provided for determining a composition of a document package, the first system comprising one or more processors configured to cause the first system to: receive first input data comprising a document package; extract, from the document package, first information including substantive content of one or more documents of the document package; extract, from the document package, second information including metadata associated with one or more documents of the document package; and generate, based on the first information and the second information, output data representing the composition of the document package.
In some embodiments of the first system, the output data representing the composition of the document package represents one or more delineations between page boundaries in the document package.
In some embodiments of the first system, generating the output data is further based on information obtained from an ERP system of an entity associated with the document package.
In some embodiments of the first system, the metadata includes one or more of the following: file name, file extension, file creator, file date, and information about an automated process flow for acquiring data.
In some embodiments of the first system, extracting the first information includes applying embedded object type detection.
In some embodiments of the first system, generating the output data includes applying a page similarity assessment model to a plurality of pages of the document package.
In some embodiments, a first non-transitory computer-readable storage medium is provided that stores instructions for determining a composition of a document package, the instructions configured to be executed by one or more processors of a system to cause the system to: receive first input data comprising a document package; extract, from the document package, first information including substantive content of one or more documents of the document package; extract, from the document package, second information including metadata associated with one or more documents of the document package; and generate, based on the first information and the second information, output data representing the composition of the document package.

In some embodiments, a first method is provided for determining a composition of a document package, wherein the first method is performed by a system comprising one or more processors, the first method comprising: receiving first input data comprising a document package; extracting, from the document package, first information including substantive content of one or more documents of the document package; extracting, from the document package, second information including metadata associated with one or more documents of the document package; and generating, based on the first information and the second information, output data representing the composition of the document package.
In some embodiments, a second system is provided for verifying a signature in a document, the second system comprising one or more processors configured to cause the second system to: receive an electronic document comprising one or more signatures; apply one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and determine whether the electronic document satisfies a set of signature criteria based on the data representing the spatial locations and the confidence levels.
In some embodiments of the second system, the one or more signature extraction models include a first signature extraction model configured to identify a signature regardless of spatial location.
In some embodiments of the second system, the one or more signature extraction models include a second signature extraction model configured to identify a signature based on a spatial location within the document.
In some embodiments of the second system, applying the second signature extraction model includes: determining a predicted spatial location within the electronic document based on one or more of a structure, a format, and a type of the electronic document; and extracting the signature from the predicted spatial location.
In some embodiments of the second system, determining whether the electronic document meets a set of signature criteria includes determining whether the signature appears at a desired spatial location in the electronic document.
In some embodiments of the second system, determining whether the electronic document satisfies the set of signature criteria includes determining that the confidence level exceeds a predefined threshold.
In some embodiments of the second system, determining whether the electronic document meets a set of signature criteria includes determining whether the signature appears, in the electronic document, within spatial proximity of context data extracted from the electronic document.
In some embodiments of the second system, determining whether the electronic document meets the set of signature criteria includes generating an association score that indicates a level of association between a signature extracted from the electronic document and context data extracted from the electronic document.
In some embodiments of the second system, the system is configured to determine a set of signature criteria based at least in part on context data extracted from the electronic document, wherein the context data is indicative of one or more of: document type, document structure, and document format.
In some embodiments, a second non-transitory computer-readable storage medium is provided that stores instructions for verifying a signature in a document, the instructions configured to be executed by one or more processors of a system to cause the system to: receive an electronic document comprising one or more signatures; apply one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and determine whether the electronic document satisfies a set of signature criteria based on the data representing the spatial locations and the confidence levels.

In some embodiments, a second method is provided for verifying a signature in a document, wherein the second method is performed by a system comprising one or more processors, the second method comprising: receiving an electronic document comprising one or more signatures; applying one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and determining whether the electronic document satisfies a set of signature criteria based on the data representing the spatial locations and the confidence levels.
In some embodiments, a third system is provided for extracting information from a document, the third system comprising one or more processors configured to cause the third system to: receive a dataset comprising a plurality of electronic documents; apply a set of data conversion processing steps to the plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; apply a set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises: receiving user input indicative of a plurality of data tags for the structured data; and applying a knowledge-based deep learning model based on the structured data and the plurality of data tags; and generate output data extracted from the plurality of electronic documents.
In some embodiments of the third system, applying the set of data conversion processing steps includes applying an automated orientation correction processing step prior to applying the one or more deep learning based OCR models.
In some embodiments of the third system, applying the set of data conversion processing steps includes applying a denoising function prior to applying the one or more deep learning based OCR models.
In some embodiments of the third system, applying one or more deep learning based OCR models includes: applying a text detection model; and applying a text recognition model.
In some embodiments of the third system, applying the set of data conversion processing steps includes applying image-level feature engineering processing steps to generate structured data after applying the one or more deep learning based OCR models.
In some embodiments of the third system, applying the set of data conversion processing steps includes applying a post-processing method that uses a lexicon to resolve structural relationships between words.
In some embodiments of the third system, applying the set of knowledge-based modeling processing steps includes, prior to receiving the user input indicative of the plurality of data tags, applying one or more feature engineering processing steps to the structured data to generate feature data.
In some embodiments of the third system, applying one or more feature engineering processing steps includes predicting a phrase based on a lexicon.
In some embodiments of the third system, applying the set of knowledge-based modeling processing steps includes receiving user input specifying user-defined feature engineering.
In some embodiments of the third system, applying the set of knowledge-based modeling processing steps includes applying fuzzy matching, wherein the system is configured to automatically label documents on a word-by-word basis, treating partial matches as sufficient for labeling purposes.
In some embodiments of the third system, applying the set of knowledge-based modeling processing steps includes automatically correcting one or more text recognition errors during the training process.
In some embodiments of the third system, the knowledge-based deep learning model includes a loss function configured to accelerate convergence of the knowledge-based deep learning model.
In some embodiments of the third system, the knowledge-based deep learning model includes one or more embedding layers using Natural Language Processing (NLP) such that the model learns both content information and related location information.
In some embodiments of the third system, the knowledge-based deep learning model is trained using an adaptive feed method.
In some embodiments of the third system, the knowledge-based deep learning model includes an input layer that applies merged embedding and feature engineering.
In some embodiments of the third system, the knowledge-based deep learning model includes an input layer configured for varying batch sizes.
In some embodiments of the third system, the knowledge-based deep learning model includes an input layer that applies a sliding window.
In some embodiments of the third system, the knowledge-based deep learning model includes one or more fully connected dense layers disposed between the input layer and the prediction layer.
In some embodiments of the third system, the knowledge-based deep learning model includes a predictive layer that generates one or more metrics for presentation to the user.
In some embodiments, a third non-transitory computer-readable storage medium is provided that stores instructions for extracting information from a document, the instructions configured to be executed by one or more processors of a system to cause the system to: receive a dataset comprising a plurality of electronic documents; apply a set of data conversion processing steps to the plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; apply a set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises: receiving user input indicative of a plurality of data tags for the structured data; and applying a knowledge-based deep learning model based on the structured data and the plurality of data tags; and generate output data extracted from the plurality of electronic documents.

In some embodiments, a third method is provided for extracting information from a document, wherein the third method is performed by a system comprising one or more processors, the third method comprising: receiving a dataset comprising a plurality of electronic documents; applying a set of data conversion processing steps to the plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; applying a set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises: receiving user input indicative of a plurality of data tags for the structured data; and applying a knowledge-based deep learning model based on the structured data and the plurality of data tags; and generating output data extracted from the plurality of electronic documents.
In some embodiments, a fourth system is provided for determining the composition of a document package, the fourth system comprising one or more processors configured to cause the fourth system to: receive data comprising a document package; extract, from the document package, first information including substantive content of one or more documents of the document package; extract, from the document package, second information including metadata associated with one or more documents of the document package; and generate, based on the first information and the second information, output data representing the composition of the document package.

In some embodiments, a fourth non-transitory computer-readable storage medium is provided that stores instructions for determining the composition of a document package, the instructions configured to be executed by one or more processors of a system to cause the system to: receive data comprising a document package; extract, from the document package, first information including substantive content of one or more documents of the document package; extract, from the document package, second information including metadata associated with one or more documents of the document package; and generate, based on the first information and the second information, output data representing the composition of the document package.

In some embodiments, a fourth method is provided for determining the composition of a document package, wherein the fourth method is performed by a system comprising one or more processors, the fourth method comprising: receiving data comprising a document package; extracting, from the document package, first information including substantive content of one or more documents of the document package; extracting, from the document package, second information including metadata associated with one or more documents of the document package; and generating, based on the first information and the second information, output data representing the composition of the document package.
In some embodiments, a fifth system is provided for verifying a signature in a document, the fifth system comprising one or more processors configured to cause the fifth system to: receive an electronic document comprising one or more signatures; apply one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and determine whether the electronic document satisfies a set of signature criteria based on the data representing the spatial locations and the confidence levels.

In some embodiments, a fifth non-transitory computer-readable storage medium is provided that stores instructions for verifying a signature in a document, the instructions configured to be executed by one or more processors of a system to cause the system to: receive an electronic document comprising one or more signatures; apply one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and determine whether the electronic document satisfies a set of signature criteria based on the data representing the spatial locations and the confidence levels.

In some embodiments, a fifth method is provided for verifying a signature in a document, wherein the fifth method is performed by a system comprising one or more processors, the fifth method comprising: receiving an electronic document comprising one or more signatures; applying one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and determining whether the electronic document satisfies a set of signature criteria based on the data representing the spatial locations and the confidence levels.
In some embodiments, a sixth system is provided for extracting information from a document, the sixth system comprising one or more processors configured to cause the sixth system to: receive a dataset comprising a plurality of electronic documents; apply a set of data conversion processing steps to the plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; apply a set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises: receiving user input indicative of a plurality of data tags for the structured data; and applying a knowledge-based deep learning model trained based on the structured data and the plurality of data tags indicated by the one or more user inputs; and generate output data extracted from the plurality of electronic documents by the deep learning model.

In some embodiments, a sixth non-transitory computer-readable storage medium is provided that stores instructions for extracting information from a document, the instructions configured to be executed by one or more processors of a system to cause the system to: receive a dataset comprising a plurality of electronic documents; apply a set of data conversion processing steps to the plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; apply a set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises applying a knowledge-based deep learning model trained based on the structured data and a plurality of data tags indicated by one or more user inputs; and generate output data extracted from the plurality of electronic documents by the deep learning model.

In some embodiments, a sixth method is provided for extracting information from a document, wherein the sixth method is performed by a system comprising one or more processors, the sixth method comprising: receiving a dataset comprising a plurality of electronic documents; applying a set of data conversion processing steps to the plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; applying a set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises applying a knowledge-based deep learning model trained based on the structured data and a plurality of data tags indicated by one or more user inputs; and generating output data extracted from the plurality of electronic documents by the deep learning model.
In some embodiments, any one or more of the features, characteristics, or aspects of any one or more of the systems, methods, or non-transitory computer-readable storage media described above may be combined with each other, in whole or in part, and/or with any one or more (in whole or in part) of any other embodiment or feature, characteristic, or aspect of the disclosure herein.
Drawings
Various embodiments are described with reference to the accompanying drawings, in which:
FIG. 1 illustrates an exemplary architecture of a text deep learning model in accordance with some embodiments.
FIG. 2 illustrates an exemplary architecture of a visual deep learning model, according to some embodiments.
FIG. 3 illustrates a schematic diagram of a two-part pipeline for knowledge-based information extraction from a format-rich digital document collection, in accordance with some embodiments.
FIG. 4 illustrates a sample of an ICDAR13 image according to some embodiments.
FIG. 5 illustrates a sample of ICDAR2015 images according to some embodiments.
FIG. 6 illustrates a comparison of text models according to some embodiments.
FIG. 7 illustrates a comparison between deep OCR and an OCR engine in accordance with some embodiments.
FIG. 8 illustrates a schematic diagram of a two-part pipeline for knowledge-based information extraction from a format-rich digital document set, in accordance with some embodiments.
FIGS. 9-18 illustrate images of PDF documents processed by the techniques disclosed herein, according to some embodiments.

FIG. 19 illustrates output generated by the techniques disclosed herein, in accordance with some embodiments.

FIG. 20 illustrates labeling of a CSV file according to some embodiments.
FIG. 21 illustrates an example image that may be used as a basis for feature engineering in accordance with some embodiments.
FIG. 22 illustrates an architecture of a named entity recognition model, according to some embodiments.
FIG. 23 illustrates output data from a named entity recognition model, according to some embodiments.
FIG. 24 illustrates the results of processing PDFs using the NER model in accordance with some embodiments.

FIG. 25 illustrates an application of the NER model to a complete sentence according to some embodiments.

FIG. 26 depicts a computer according to some embodiments.
Detailed Description
Disclosed herein are systems and methods for providing an AI-enhanced audit platform, including techniques for automating document processing. As described below, automated document processing that may be performed by the AI-enhanced audit platform may include one or more of: automated classification (and clustering) of documents, automated signature detection within documents, and weak learning AI/ML processing techniques for extracting information from documents.
As described herein, a system for providing an AI-enhanced audit platform may be configured to receive one or more documents as input data and perform automated processing of the input documents. The document may be received as structured or unstructured electronic data received from one or more data sources, and the system may subject the received document to one or more document processing techniques to identify information content within the document, extract the information content from the document, and generate, store, and leverage data generated by the document processing techniques. As explained herein, in some embodiments, the document processing techniques may include the application of one or more machine learning models.
Document classification and clustering
Known techniques for document classification do not adequately leverage context data to guide document classification, particularly in the context of an audit process. As described herein, the context data available in the audit process can be effectively and efficiently leveraged to improve the accuracy and efficiency of document classification and clustering used in AI-enhanced audit platforms.
In some embodiments, a system for automated document processing may be configured to perform automated document classification (e.g., classifying documents according to different document types) and/or document bundling. As described herein, as part of the audit review process, the system may apply a set of AI methods to leverage context data in conjunction with a multi-page document classification ML model to accurately determine the composition of a document package, such as a document package received by an AI-enhanced audit platform.
The system may be configured to receive data representing one or more documents and apply one or more AI methods to the received data to identify and extract information from the documents and classify and/or cluster the documents. The AI methods may be configured to perform analysis based on substantive document content (e.g., characters, text, and/or images in a document), based on metadata stored as part of or in association with the document, and/or based on contextual data associated with the document.
In some embodiments, metadata stored as part of or in association with the document may include data such as document format data, document portion data, page number data, font data, document layout data, document creator data, document creation time data, document title data, and/or any other suitable metadata that may pertain to all or part of a document package. In some embodiments, the metadata may include one or more of the following: information obtained from file names of one or more documents, information obtained from file extensions of one or more documents, and information obtained from file metadata (e.g., creator, date, etc.) of one or more documents.
In some embodiments, the external context data may include one or more of the following: information about one or more automation processes used in obtaining document data (and/or context data) from one or more systems (e.g., from an Enterprise Resource Planning (ERP) system or database); information about one or more requests to which the document responds; information about the party or parties from which the document was requested and/or to which the document pertains; and information about the manner in which the document was provided (e.g., the communication medium).
In some embodiments, the context data may include information regarding one or more processes, protocols, and/or standards to which the document pertains. For example, the context data may indicate information about a series of steps in a predefined process (e.g., a business process) or a series of document types in a set of predefined document types. In determining boundaries between documents, one or more data processing models applied by the system may be configured to identify document types in a predefined set of document types (e.g., identify boundaries between documents in a package) and/or identify documents related to steps in a predefined process. In some embodiments, the data processing operations may be configured to identify document types (e.g., identify boundaries between documents in a package) according to a predefined order of steps and/or a predefined order of document types indicated by the context data. (Any of the data processing operations referenced herein may include the application of one or more models trained through machine learning.)
The context data may be received by the system from any one or more suitable data sources, may be indicated by one or more user inputs detected by the system, and/or may be inferred by one or more data processing models of the system. Fully exploiting the context data may provide a bridge for the system to introduce a priori knowledge and understand documents within the environment in which the document data (e.g., unstructured data) is provided.
The system may be configured to apply one or more data processing algorithms, models, and/or machine learning models (including, for example, a series of machine learning techniques) to identify document types for a document package, a single document, and/or a single-page document. In some embodiments, the system (e.g., one or more machine learning models) may be configured to delineate document type boundaries within a document package in order to identify boundaries between individual documents within the document package. The identification of document type boundaries within the document package may be based on one or more of the following: a determination of a document type for a page within a document package, a determination of a similarity (e.g., a similarity score) between two or more pages within a document package, and/or a detection and evaluation of one or more embedded objects within a document (including determining a similarity (e.g., a similarity score) between two or more embedded objects within a document). The system may be configured to detect transitions within the document package (e.g., based on changes in one or more of document content, document type, document metadata, document format, and/or embedded object characteristics within the document package) and to classify different portions of the document (and identify document boundaries within the document package) based on the transitions.
The system may be configured to support information integrity purposes during the auditing process.
In some embodiments, the system may receive data comprising a document package and may extract document content information and/or metadata information from the received data. In some embodiments, the system may extract context information from the received document data. In some embodiments, the system may receive context information from one or more additional data sources (e.g., separate from the data source from which the document data was received) and may associate the received context information with the document package data. In some embodiments, extracting document content information includes applying embedded object type detection.
The system may then use the document content information, metadata extracted from the document, and/or the context information to generate output data representing the composition of the document package, where the output information may indicate one or more document types of the document package, a plurality of document types within the document package, and/or information about boundaries (e.g., page breaks) between different documents within the document package. In some embodiments, generating the output data includes applying a page similarity assessment model to a plurality of pages of the document package.
In some embodiments, generating output data includes applying one or more data processing operations to model the state of a document package being processed. In some embodiments, the document package may be modeled using a finite state model. In some embodiments, the model of the document package may be used to leverage the computed likelihood that subsequent pages in the document package are part of the same document (e.g., same category, same type) as the current page. For example, the model may leverage contextual data about how documents are typically arranged (e.g., pages from different documents are typically not randomly interleaved with each other, but are typically arranged as sequential portions of a document package).
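By way of illustration, the following is a minimal sketch (not the patented implementation) of how a simple two-state sequence model of this kind might segment a document package, assuming an upstream page-similarity model supplies, for each adjacent page pair, the probability that both pages belong to the same document:

```python
# Each page either CONTINUES the current document or STARTS a new one.
# `same_doc_probs[i]` is an assumed upstream model output: the probability
# that page i+1 belongs to the same document as page i.

def segment_bundle(same_doc_probs, threshold=0.5):
    """Return lists of page indices, one list per inferred document."""
    documents, current = [], [0]
    for i, p_same in enumerate(same_doc_probs):
        if p_same >= threshold:
            current.append(i + 1)   # page i+1 continues the current document
        else:
            documents.append(current)
            current = [i + 1]       # page i+1 starts a new document
    documents.append(current)
    return documents

# Example: a 5-page bundle with a likely boundary after the second page.
print(segment_bundle([0.9, 0.1, 0.8, 0.7]))  # [[0, 1], [2, 3, 4]]
```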
In some embodiments, generating output data includes applying one or more data processing operations to analyze the presence or absence of one or more embedded objects within the document. For example, the system may apply one or more rules and/or models regarding whether certain document types are associated with certain embedded object types. For example, an embedded signature object may be associated with certain document types and thus may be recognized by the system and used to identify the associated document types.
In some embodiments, the system may apply a page similarity model as part of the document understanding pipeline. In some embodiments, the page similarity model may be the first step applied in the document understanding pipeline. In some embodiments, a page similarity model (e.g., random forest) may determine whether two pages belong to the same document. This may be useful because multiple documents may be bundled into a single PDF file before being provided to the system. The page similarity model may include one or more of the following: random forest classification of image features (e.g., low-level image features such as Oriented FAST and Rotated BRIEF (ORB)), the Structural Similarity (SSIM) index, and image histograms compared using different distance metrics such as correlation, chi-square, intersection, Hellinger, etc.
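A minimal sketch of such page-pair feature extraction follows, using standard OpenCV, scikit-image, and scikit-learn calls; the specific feature choices, histogram bin count, and forest size are illustrative assumptions, not the platform's actual configuration:

```python
# Assumes grayscale page images of equal size.
import cv2
import numpy as np
from skimage.metrics import structural_similarity
from sklearn.ensemble import RandomForestClassifier

def page_pair_features(img_a, img_b):
    # ORB keypoint matching: number of descriptor matches as a similarity cue.
    orb = cv2.ORB_create()
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    n_matches = (len(matcher.match(des_a, des_b))
                 if des_a is not None and des_b is not None else 0)

    # Structural Similarity (SSIM) index between the two pages.
    ssim = structural_similarity(img_a, img_b)

    # Histogram distances under several metrics (correlation, chi-square,
    # intersection, Hellinger).
    h_a = cv2.calcHist([img_a], [0], None, [64], [0, 256])
    h_b = cv2.calcHist([img_b], [0], None, [64], [0, 256])
    hist_feats = [cv2.compareHist(h_a, h_b, m) for m in
                  (cv2.HISTCMP_CORREL, cv2.HISTCMP_CHISQR,
                   cv2.HISTCMP_INTERSECT, cv2.HISTCMP_HELLINGER)]
    return np.array([n_matches, ssim, *hist_feats])

# The random forest is then trained on labeled page pairs (1 = same document).
clf = RandomForestClassifier(n_estimators=100)
# clf.fit(np.stack([page_pair_features(a, b) for a, b in pairs]), labels)
```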
In some embodiments, the system may apply a text and low-level feature model (TFIDF + VGG16 + SVM). The text and low-level feature model may include two parts: a page similarity module and a page classification module. In some embodiments, the page similarity module of the text and low-level feature model may share any one or more features in common with the page similarity models described above. In some embodiments, the page classification module may be configured to classify one or more pages (e.g., a first page) of the document package using a Support Vector Machine (SVM) classifier, with text features determined by TFIDF and visual features determined by the VGG16 model.
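The following sketch illustrates, under stated assumptions (per-page OCR text and 224x224 RGB page images are available; feature sizes are arbitrary), one way TFIDF text features and VGG16 visual features might be combined for SVM page classification:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")  # 512-d features
tfidf = TfidfVectorizer(max_features=2000)

def visual_features(images):
    """images: (n, 224, 224, 3) float array of page renderings."""
    return vgg.predict(preprocess_input(images.copy()))

def fit_page_classifier(page_texts, page_images, labels):
    # Concatenate TFIDF text features with VGG16 visual features per page.
    text_feats = tfidf.fit_transform(page_texts).toarray()
    feats = np.hstack([text_feats, visual_features(page_images)])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(feats, labels)
    return clf
```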
In some embodiments, the system may apply a text deep learning model (e.g., embeddings + 1D-CNN). In some embodiments, the text deep learning model may classify the document using embeddings of text extracted from the image. More specifically, the text may be tokenized and embedded using Word2Vec, and the embeddings may then be passed through a shallow CNN for classification. According to some embodiments, the architecture is shown in FIG. 1.
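A minimal Keras sketch of such a text model follows; the vocabulary size, sequence length, and layer sizes are illustrative, and the embedding layer stands in for pretrained Word2Vec vectors (which could be supplied as initial weights):

```python
from tensorflow.keras import layers, models

def build_text_cnn(vocab_size=20000, seq_len=500, embed_dim=100, n_classes=10):
    return models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, embed_dim),   # optionally init from Word2Vec
        layers.Conv1D(128, 5, activation="relu"),  # shallow 1D-CNN over tokens
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_text_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```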
In some embodiments, the system may apply a visual deep learning model (e.g., VGG19 transfer learning). FIG. 2 illustrates an exemplary architecture of a visual deep learning model, according to some embodiments. The visual deep learning model may be configured to identify visual features using the VGG19 deep convolutional neural network architecture shown in FIG. 2. The model may be loaded with weights trained using, for example, ImageNet, and the last two layers of the model may be trained.
In some embodiments, the system may apply a Siamese model (e.g., embeddings & 1D-CNN + VGG19 transfer learning). The Siamese model may combine text and visual features to perform Siamese deep learning classification. Features from the two models described above may be concatenated and passed through dense layers for classification.
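The sketch below illustrates this combined setup under illustrative assumptions: a VGG19 backbone with all but its last two layers frozen (the transfer-learning configuration described for the visual model above), concatenated with a text-feature input and passed through dense layers; the input shapes and number of classes are placeholders:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

vgg = VGG19(weights="imagenet", include_top=False, pooling="avg")
for layer in vgg.layers[:-2]:
    layer.trainable = False            # train only the last two layers

img_in = layers.Input(shape=(224, 224, 3))
txt_in = layers.Input(shape=(128,))    # e.g., pooled features from the text CNN
x = layers.Concatenate()([vgg(img_in), txt_in])  # fuse visual + text features
x = layers.Dense(256, activation="relu")(x)
out = layers.Dense(10, activation="softmax")(x)  # 10 document classes, assumed
combined = models.Model([img_in, txt_in], out)
```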
In some embodiments, the system may apply a document clustering model. The document clustering model may select different sample data sets from the large data sets for model training purposes.
Table 1 below shows, by way of example, the performance of various models. The test data used to generate the result data in Table 1 includes data from the same client whose data was used to train the models. The pilot data includes data from clients on which the models were not trained. Thus, the pilot data results may be a better indication of the models' performance on unseen data.
TABLE 1
In some embodiments, the system may automatically leverage the output data generated as described herein for one or more further functionalities provided by the AI-enhanced audit platform. For example, the system may automatically generate and store individual document files for each individual document identified within the document package. In some embodiments, the system may separately leverage individual documents identified within the document package as individual pieces of evidence in one or more audit evaluations, including AI-enhanced audit processes that use document data in order to perform one or more vouching processes, arbitration processes, recommendation generation processes, information integrity processes, and/or data integrity processes.
Signature detection
Document processing often requires, for example for assurance purposes, verifying that a signature (or initials) appears in a particular area of a document or is associated with a particular topic in the document. There may be more than one portion, more than one topic, and/or more than one signature in a single document or document package. Known techniques for signature detection require manual review and verification, which is inefficient and inaccurate and does not allow large-scale processing of documents.
In some embodiments, a system for automated document processing may be configured to perform automated signature detection, including applying an AI model that learns where a signature may appear on a given document type. During document ingestion and processing, the system may then verify that the document being processed actually has a signature at the expected/desired location within the document. The systems and methods provided herein may be used to automatically process documents to determine whether the documents provide evidence bearing the required and sufficient signatures to meet assurance criteria for shipment of goods, receipt of goods, contractual agreements, and the like.
As explained herein, the system may receive one or more input documents to be processed for signature detection and/or automated vouching analysis. The system may apply one or more AI models to detect information about the document type, document structure, and/or document format of the received document. In some embodiments, the determination of the document type may be based at least in part on the identification of one or more signatures within the document. For example, a single signature, a corresponding pair of signatures, the absence of a signature, a certain type of signature, and/or the presence of a signature on certain pages and/or in certain portions may be associated with certain document types by one or more rules or models, and the system may leverage the rules/models to identify the document types.
In some embodiments, once the system has generated information representing the document type, document structure, and/or document format for the document to be analyzed, the system may determine one or more signature requirement criteria for the document to be analyzed. Signature requirement criteria may be determined based on document type, document structure, and/or document format. In some embodiments, the system may use one or more machine learning models trained on signed documents of various types, structures, and/or formats to determine signature requirement criteria for the various document types, document structures, and/or document formats. In some embodiments, the system may determine the signature requirement criteria based on one or more predefined signature criteria rules.
In some embodiments, the determined signature criteria may include one or more of the following: the location of the signature, the portion of the document to which the signature corresponds, the content of the document to which the signature corresponds, the type of signature (e.g., handwriting, electronic signature, abbreviation, etc.), the order of the signatures, and/or the date of the signatures.
Once the system determines the signature criteria for a document, the system may evaluate the document to determine whether those one or more signature criteria are met. For example, the system may apply one or more signature detection models to extract signature information from the document, where the extracted information may indicate signature presence, signature identity, signature location, association of the signature with a document portion and/or with document content, and/or signature type. (In some embodiments, the signature detection model may be applied before and/or after document type detection is performed and before and/or after signature criteria for the document are determined. For example, where signature detection is used to determine the document type, a signature detection model may already have been applied prior to determining the signature criteria for the document.)
In some embodiments, the one or more signature detection models may include one or more context-free signature detection models that have been trained to detect signed and unsigned content regardless of location within the document. In some embodiments, the one or more signature detection models may include one or more context-aware signature detection models that take context into account in determining whether and where signatures are detected.
In some embodiments, the system may be configured such that, for each signature detected within the document, the system generates (a) a spatial location within the document where the signature was detected and (b) a confidence level for the detected signature. In some embodiments, the generated confidence level may indicate a degree of confidence that the signature was detected and/or a degree of confidence regarding the location at which the signature was detected. In some embodiments, the system may be configured such that, for each signature detected within the document, the system generates (c) data indicative of one or more characteristics of the signature (e.g., signature quality, signature type, signature identity, signature date, signing order, etc.) and optionally respective confidence values associated with one or more of the characteristics.
The system may compare the extracted signature information to the determined signature criteria and may generate one or more outputs indicating whether one or more signature criteria for the document are satisfied. In some embodiments, the system may indicate that a signature criterion is met, that a signature criterion is not met, or that a determination as to whether a signature criterion is met cannot be made. In some embodiments, the output indicating whether a signature criterion is met may include one or more confidence scores indicating the confidence levels of one or more conclusions.
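As an illustration only, the following sketch shows one way such a comparison might be implemented for a spatial-location criterion with a confidence threshold; the data structures are hypothetical, not the platform's schema:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float          # normalized page coordinates of the bounding box
    y: float
    w: float
    h: float
    confidence: float

@dataclass
class Criterion:
    region: tuple           # (x0, y0, x1, y1) expected signature region
    min_confidence: float

def meets_criterion(detections, criterion):
    """True if any detection falls inside the expected region with enough confidence."""
    x0, y0, x1, y1 = criterion.region
    for d in detections:
        cx, cy = d.x + d.w / 2, d.y + d.h / 2   # detection center
        in_region = x0 <= cx <= x1 and y0 <= cy <= y1
        if in_region and d.confidence >= criterion.min_confidence:
            return True
    return False
```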
In some embodiments, evaluating whether the signature meets the signature criteria for the document may be based at least in part on associating signature-context data (where the context data may be associated with and/or extracted from the document) with one or more signatures within the document. For example, the system may associate signature-context data (such as information about a portion of the document, an identity of one or more parties associated with the document, a spatial location within the document, etc.) with one or more detected signatures. In some embodiments, the detected signature may be associated with the signature-context data from the document based on the spatial proximity of the signature location to the location from which the context data was extracted. In some embodiments, the association between the signature and the signature-context data may be quantified by an association score (e.g., indicating a confidence level of the association). In some embodiments, the system may then evaluate compliance of the document with one or more signature criteria based on the determined associations and/or the determined association scores.
In some embodiments, the selection of one or more signatures for evaluating compliance with signature criteria may be based on one or both of: (a) a confidence score for identifying the signature and/or the signature information itself, and (b) an association score for the association of the identified signature with the document context (e.g., based on spatial proximity in the document). In some embodiments, the evaluation of compliance with a signature criterion may be based on one or both of the confidence score and the association score. In some embodiments, an overall ranking may be based on both the confidence score and the association score.
The associations established by the system between signatures and signature-context data may be one-to-one, one-to-many, many-to-one, or many-to-many. In some embodiments, the system may rank the associations between a signature and various signature-context data (or between signature-context data and various signatures), and an association score may be assigned to each association. In some embodiments, the system may select the highest-ranked association and may evaluate compliance with a signature criterion based on the signature-context association indicated by the highest-ranked association (and/or based on the association score of the highest-ranked association).
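A minimal sketch of proximity-based association scoring and ranking follows; the inverse-distance score is an illustrative assumption rather than the patented formula:

```python
import math

def association_score(sig_center, context_center):
    dist = math.dist(sig_center, context_center)   # Euclidean distance
    return 1.0 / (1.0 + dist)                      # closer pairs score higher

def rank_associations(signatures, contexts):
    """signatures/contexts: dicts of name -> (x, y) center in page coordinates."""
    pairs = [(s, c, association_score(sp, cp))
             for s, sp in signatures.items()
             for c, cp in contexts.items()]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

ranked = rank_associations({"sig_1": (0.7, 0.9)},
                           {"SHIPPER": (0.3, 0.9), "CARRIER": (0.72, 0.88)})
# The highest-ranked pair associates sig_1 with the nearby CARRIER block.
```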
In some embodiments, the signatures may be ranked by a signature confidence score for detecting/identifying the signature, an association score, and/or an overall (e.g., combined) confidence score based on the foregoing scores (and optionally other factors). In some embodiments, the selection of the signature for evaluation and/or the evaluation of the signature for compliance with the signature criterion may itself be based on any one or more of the following: signature confidence, association score, and/or overall (e.g., combined) confidence score.
Signature detection example
A custom pipeline was developed using a YOLO model that fully leverages transfer learning. The pipeline is configured to receive a PDF document, detect pages within the PDF document that contain signatures, and generate output data indicating the page number and a confidence score for each detected signature.
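A sketch of this pipeline's overall shape is shown below, assuming pdf2image for page rendering and the Ultralytics YOLO API for inference; "signature_yolo.pt" is a hypothetical fine-tuned weights file:

```python
from pdf2image import convert_from_path
from ultralytics import YOLO

model = YOLO("signature_yolo.pt")   # hypothetical signature-detection weights

def detect_signature_pages(pdf_path):
    """Return page numbers and confidence scores for detected signatures."""
    results = []
    for page_num, page in enumerate(convert_from_path(pdf_path), start=1):
        for box in model(page)[0].boxes:    # run detection on the page image
            results.append({"page": page_num, "confidence": float(box.conf)})
    return results
```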
A connected-component labeling (CCL) analysis method was developed as follows (a code sketch of steps 1-3 follows the list):

Step 1 - Detect the designated boxes using contour detection with parameters (bottom half of the page; 30% minimum height and width for the contour)

Step 2 - Identify the box types using Tesseract OCR (keywords "SHIPPER" and "CARRIER")

Step 3 - Perform CCL analysis on each box to extract larger connected components (e.g., signatures and handwritten text)

Step 4 - Generate an output by superimposing only the bounding-box output over a blank page

Step 5 - Obtain Abbyy ground truth for the boxes by parsing the XML file for bounding-box details

Step 6 - Check accuracy by computing the ground-truth bounding box IoU between the input image and the output image
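The following sketch illustrates steps 1-3 under illustrative assumptions about the thresholds (page region, contour size, and component area), using OpenCV and pytesseract:

```python
import cv2
import pytesseract

def find_signature_components(page_img):
    gray = cv2.cvtColor(page_img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Step 1: contours in the bottom half of the page with minimum size
    # (the 30% height/width thresholds here are illustrative).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    boxes = [(x, y, bw, bh) for x, y, bw, bh in boxes
             if y > h / 2 and bw > 0.3 * w and bh > 0.05 * h]

    hits = []
    for x, y, bw, bh in boxes:
        roi = binary[y:y + bh, x:x + bw]
        # Step 2: identify the box type via OCR keywords.
        text = pytesseract.image_to_string(gray[y:y + bh, x:x + bw]).upper()
        if "SHIPPER" in text or "CARRIER" in text:
            # Step 3: connected-component labeling; keep larger components
            # (candidate signatures / handwritten text).
            n, _, stats, _ = cv2.connectedComponentsWithStats(roi)
            large = [i for i in range(1, n)
                     if stats[i, cv2.CC_STAT_AREA] > 500]
            hits.append({"box": (x, y, bw, bh), "n_components": len(large)})
    return hits
```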
Weak learning for AI enhanced assurance
Documents stored in PDF format, image format, and other formats may contain a large amount of information, and extracting the information may be an important component of AI-driven assurance processes and other tasks performed by AI-enhanced audit platforms. For example, AI-driven assurance processes may rely on automated extraction of data stored in PDFs so that invoices and/or other information (e.g., evidence information) may be fully considered, properly understood, and applied as part of the audit process. Efficient processing of documents may enable the audit process to take into account all available evidence (e.g., document) data in detail, rather than simply a small sample thereof.
For format-rich PDFs, existing solutions include creating a "knowledge base construction" system. However, this solution relies on the underlying structure of the PDF and cannot be used for scanned PDF documents where the underlying structure is unknown. For scanned PDF documents, existing solutions include Optical Character Recognition (OCR) and NLP. However, these solutions rely on templating of the PDF and surrounding text, and they cannot handle many different formats and/or visual relationships. Using known techniques, automatically extracting information from electronic formats such as PDF and image formats is inefficient, inaccurate, and time-consuming. The known alternative, manual inspection, is costly and inefficient. Known automated information extraction solutions use pre-trained models to extract text from data, but they require annotated PDFs to train computer vision models that can extract such information. Creating these annotations to train the model is itself an expensive activity.
Known solutions include Fonduer and OCR-assisted methods. Fonduer's pipeline depends strongly on parsing PDF into HTML: a perfect conversion preserves as much information as possible, which is what makes Fonduer effective. However, Fonduer's applicability is limited because little software fully supports this conversion process. As for OCR-assisted methods, an OCR engine such as Abbyy handles well-scanned document sets well. Abbyy can extract information from the document, but the user still needs to make additional efforts to extract the entities actually needed. NLP and other AI methods that use semantic information among all extracted words to improve extraction of the target entities are typically used to achieve this goal. Since these solutions do not take structural information into account, they are not robust enough for noisy documents with complex underlying structures.
The systems and methods described herein may address one or more of the above-described drawbacks of existing solutions.
Disclosed herein are systems and methods for automated information extraction that may address one or more of the above-described drawbacks. In some embodiments, the document processing and information extraction systems described herein leverage a unique combination of (a) natural language processing using semantic and lexical analysis with (b) weak labeling based on fuzzy matching and deep learning based on text and computer vision. A combined model configured to extract information from PDFs may provide an ensemble of NLP, text, and computer vision techniques. The systems and methods described herein may provide accurate and efficient information extraction from PDF documents and from evidence data provided in other formats, may overcome one or more of the above-described drawbacks of known solutions, and may overcome the document cold-start problem (no annotation data exists, and creating annotations is expensive). Information that may be accurately and efficiently extracted by the techniques disclosed herein includes, for example, invoice amounts, numbers, institution names, committees, and the like.
With respect to the task of creating annotations, in some embodiments, the ground truth data from which annotations may be created may exist in one or more data sources, such as in an ERP database or system. However, the ground truth data may exist in a format (e.g., a normalized format) that does not perfectly (e.g., word-for-word) match the content in the documents to be processed by the system. This further complicates the task of creating annotations. The systems and methods disclosed herein overcome this challenge by applying weak labels (fuzzy matching), where entities in a ground truth data source (e.g., an ERP system) need only partially match entities in a processed document (e.g., a PDF); the system generates labels based on the partial matches, and the model can learn from those labels.
Described below are some embodiments of systems and methods for knowledge-based information extraction from a rich-format digital document collection. Although the following description is made primarily with reference to PDF documents, the techniques described herein may also be applied to web pages, business reports, product specifications, scientific literature, and any other suitable document type. As described below, the system may process the input document/data into an image, and thus any input data that is formatted as (or can be converted to) an image may be suitable.
The systems and methods described herein may provide a pipeline for knowledge-based information extraction from a rich-format digital document collection, where the pipeline includes two parts: first, a document conversion portion, and second, a knowledge modeling portion. FIG. 3 depicts a schematic diagram of a training process for a two-part pipeline 300 for knowledge-based information extraction from a format-rich digital document set, in accordance with some embodiments. The model may include a set of hand-engineered-feature-based NLP models and computer vision models that improve accuracy over time through a self-learning and verification mechanism. Described herein are features of such a pipeline for knowledge-based information extraction from a rich-format digital document set, according to some embodiments.
As shown in fig. 3, pipeline 300 may include a data conversion portion 310 and a knowledge-based modeling portion 330.
In the first portion 310 of the two-part pipeline 300, the system may convert the PDF into a database. For this process, one or more deep learning models (e.g., deep OCR) may be applied; the models may include a text detection model and a text recognition model. The models may be used to extract words (e.g., each word) rather than relying on an OCR engine. This may enable greater capability and more robust performance in extracting information from both clean and noisy documents. Due to the capacity constraints of OCR, it is not guaranteed that all information in a document can be detected. Thus, the systems and methods described herein may combine computer vision with deep OCR, referred to as a "canvas," which may automatically supplement the information missing from deep OCR without human interaction. After the conversion, a specific post-processing method can be applied, wherein the post-processing method introduces morphological operations to better resolve structural relationships between words. For example, dilation and erosion using custom kernels may be used to determine whether nearby words belong to the same phrase or paragraph (a minimal sketch of this morphological grouping follows).
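A minimal sketch of the morphological post-processing described above, assuming OpenCV and a binarized page in which text pixels are white (255); the kernel sizes are illustrative assumptions tuned per document set.

```python
# Minimal sketch: dilate horizontally so that words belonging to one phrase
# merge into a single blob, then return each blob's bounding box.
import cv2
import numpy as np

def group_words(binary_page: np.ndarray, kx: int = 25, ky: int = 3):
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kx, ky))
    dilated = cv2.dilate(binary_page, kernel, iterations=1)
    # Erode with a small kernel to trim the blobs back toward the ink.
    eroded = cv2.erode(dilated, cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    contours, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]  # one (x, y, w, h) per phrase
```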
In some embodiments, steps 312-322, described below, may be applied in the first portion 310 of the two-part pipeline 300.
At block 312, in some embodiments, the system may receive input data including one or more documents to be processed. In some embodiments, the input data may include PDF data, image data, and/or document data in any other suitable format. The input data may be received from any suitable data source, such as one or more databases, data stores, network sources, and the like. The input data may be received according to a predefined schedule, as part of an inbound network transmission, as part of a crawling operation, in response to a user request, and/or as part of a manual data upload. The received data may be stored locally and/or remotely after receipt.
At block 314, in some embodiments, the system may apply one or more automated orientation correction data processing operations to the received data in order to correct/normalize the orientation of pages in the document.
At block 316, in some embodiments, the system may apply one or more denoising data processing operations to the orientation corrected data. In some embodiments, the one or more denoising operations may include a data normalization operation. In some embodiments, one or more denoising operations may be selected based on user input, system settings, an identity of one or more parties associated with the document being processed, an industry of one or more parties associated with the document being processed, and/or a document type of one or more of the documents being processed (e.g., as determined automatically by the system).
At block 318, in some embodiments, the system may apply one or more deep learning based text detection and recognition operations. In some embodiments, the operations may include flexible OCR operations. In some embodiments, the text detected and identified at block 318 may include all character data that may be identified within the processed data. In some embodiments, the identified character data may be stored in association with metadata indicating the spatial location of each identified character within the document in which it is identified.
At block 320, in some embodiments, one or more image-level feature engineering processes may be applied to the data generated at block 318 to select features to be used to generate feature data. During the training process, block 320 may be applied to determine which features to use to train the model. During subsequent application of the model, after training is completed, block 320 may simply entail extracting features previously identified by the feature engineering process during training, and using those extracted features to generate feature data to be processed and analyzed by the trained model. The feature data generated at block 320 may include text data, such as character data, word data, sentence data, paragraph data, and section data. The feature data generated at block 320 may include location data (e.g., indicating a spatial location within the page) associated with any text data. The feature data generated at block 320 may include document structure data indicating the portions (e.g., pages, sections, chapters, etc.) of the document associated with any of the text data. The feature data generated at block 320 may include text trait data, such as data indicating a font, style, size, and/or orientation associated with any text data.
At block 322, in some embodiments, the system may store the data generated at block 320 (e.g., word-level tags with location information and other features) in any suitable format (e.g., in a CSV format). The data may be stored locally and/or remotely in any suitable computer storage system.
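A minimal sketch of this storage step, assuming pandas; the column names mirror the output fields described later in this document, and the sample rows are invented for illustration.

```python
# Minimal sketch of block 322: persist word-level tokens with location
# features to CSV for downstream knowledge-based modeling.
import pandas as pd

rows = [
    {"slug": "invoice_001", "page": 1, "x0": 102, "y0": 88, "x1": 219, "y1": 110,
     "token": "INVOICE", "line_label": 0, "word_group_label": 0},
    {"slug": "invoice_001", "page": 1, "x0": 230, "y0": 88, "x1": 310, "y1": 110,
     "token": "#12345", "line_label": 0, "word_group_label": 0},
]
pd.DataFrame(rows).to_csv("word_level_features.csv", index=False)
```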
In the second portion 330 of the two-part pipeline 300, the following steps may be applied. Semantic, document, structural, and/or morphological information may be used as input, alone and/or together. The method may include weakly supervised learning, where the labels for the documents do not need to be perfectly correct. This approach may be robust in handling incorrect label information. Users may only need to provide their domain knowledge, and the system may use fuzzy matching to automatically label documents word by word. Based on this weak labeling method, the system may correct some errors in text recognition during the training process. Through efficient design of the model, the systems described herein provide a powerful ability to extract information from previously unseen documents in the same field.
In some embodiments, steps 332-338, described below, may be applied in the second portion 330 of the two-part pipeline 300.
At block 332, in some embodiments, the system may access stored data generated by the first portion 310 of the pipeline 300. In some embodiments, the accessed data may be the same data (or a portion thereof, and/or data based thereon) as stored at block 322.
At block 334, in some embodiments, the system may apply one or more feature engineering processes to the data generated at block 332 in order to select features to be used to generate the feature data. The feature engineering process may select features such as characters, words (e.g., having more than one character), length of words, surrounding environment (e.g., near a boundary (which may be from a table)), and so forth. During the training process, block 334 may be applied to determine which features to use to train the model. During subsequent application of the model, after training is completed, block 334 may simply entail extracting features previously identified by the feature engineering process during training, and using these extracted features to generate feature data to be processed and analyzed by the trained model.
At block 336, in some embodiments, the system may apply the tags and perform user-defined feature engineering to select features to be used to generate the feature data. During the training process, block 336 may be applied to determine which tags to apply to train the model and which features to use to train the model. During subsequent application of the model, after training is completed, block 336 may simply entail extracting features previously identified by the feature engineering process during training, and using those extracted features to generate feature data to be processed and analyzed by the trained model.
In applying the labels, the system may utilize domain knowledge, e.g., relying on one or more domain knowledge sources, such as a dictionary or a third-party data source. Domain knowledge may include known patterns that associate certain content types (e.g., page numbers) with certain spatial locations (e.g., top of page or bottom of page). During training, the system can apply all labels (e.g., to characters, words, etc.) even when confidence in the accuracy of the labels is less than 100%. In performing the labeling during training, the system may seek to achieve high recall (e.g., covering as much of the target entity as possible) and high precision (e.g., mislabeling as little as possible).
In performing user-defined feature engineering during training, the system may apply one or more feature engineering processes that leverage user input to select features for generating feature data based on user domain knowledge. Fully exploiting user domain knowledge to select features for generating feature data for training may improve model quality and may improve model performance during implementation. The system may receive one or more user inputs indicating one or more of: part title, customer name, customer address, date, billing address, shipping address, etc.
At block 338, in some embodiments, the system may generate, configure, and/or apply a knowledge-based deep learning model using the feature data generated at blocks 320, 334, and/or 336. During training, the system may generate and configure a model based on the features selected for training. During application, the system may apply the trained model to generate output data indicative of information extracted from the input data (e.g., input documents) being analyzed, classification of the input data, and/or confidence levels associated with the model output. The knowledge-based deep learning model may be a deep learning model trained using feature data generated based on features selected at blocks 320, 334, and/or 336 during training. The deep learning model(s) may generate output data indicative of one or more identified pieces of content of the input document, optionally together with an associated confidence score. The deep learning model(s) may generate output data classifying the input document into one or more classifications, optionally together with an associated confidence score. The output data may, for example, indicate original tokens (e.g., locations, words), basic features, and/or user-defined features.
By applying deep learning based text detection and text recognition instead of (or in addition to) an OCR engine, the systems and methods disclosed herein may be more flexibly applicable in different scenarios, and they may provide more control and customizable output for text recognition and detection.
In some embodiments, labels generated from the trained model may be used for further training of the model.
In some embodiments, the systems and methods disclosed herein may apply one or more of classical syntactic, semantic, and/or lexical analysis of documents to extract templates and weak labels.
In some embodiments, the systems and methods disclosed herein may include custom loss functions that may accelerate model convergence.
In some embodiments, the systems and methods disclosed herein may include one or more custom layers that leverage NLP embedding to allow models to learn both content information and related location information.
In some embodiments, the systems and methods disclosed herein may leverage morphological information as part of feature engineering to improve the performance of phrase prediction.
In some embodiments, the systems and methods disclosed herein may include one or more adaptive feed methods for model training (e.g., feeding the model 10 PDFs with different formats in a single step).
Regarding deep learning based OCR (deep OCR), three methods can be used: text detection, text recognition, and end-to-end combination of both.
If the goal is to find information in an image, text detection may be used to determine which portions of the image are likely to be text, and a recognition model may then be used to determine the content of those portions. Using two deep learning models slows down the pipeline but makes the intermediate output more customizable. Alternatively, an end-to-end solution may be used to directly identify what the text is and where it is. This is a single deep learning model, so its inference speed is faster than a pipeline using two deep learning models (a minimal sketch of the two-stage pipeline follows).
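A minimal sketch of the two-stage pipeline, under the assumption that the detection and recognition models are provided as callables; the function names are hypothetical and not prescribed herein.

```python
import numpy as np

def two_stage_ocr(page: np.ndarray, detect, recognize):
    """detect: page -> list of (x0, y0, x1, y1) boxes (e.g., a CRAFT-style
    detector); recognize: image crop -> str (e.g., a CRNN-style recognizer).
    Both are assumed wrappers around pre-trained models."""
    results = []
    for (x0, y0, x1, y1) in detect(page):
        crop = page[y0:y1, x0:x1]  # crop each detected text region
        results.append({"bbox": (x0, y0, x1, y1), "token": recognize(crop)})
    return results
```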
In some embodiments, the steps applied by the pipeline may be as follows. As a first step, OCR supplementation and line labeling may be applied as part of OCR feature engineering. This may include performing initial text detection, performing missing-value supplementation, and detecting lines.
As a second step, phrase segmentation, cluster segmentation, and structure segmentation may be applied as part of OCR feature engineering.
As a third step, as part of the OCR feature engineering, structural OCR feature engineering may be performed. Word-level features may include word coordinates, word height (font size), word size, count-up/count-down features, and/or line labels. Phrase-level features may include phrase coordinates, word counts, string/digit counts, total blanks, and/or word cluster labels. Word-cluster-level features may include word cluster coordinates, counts of words, counts of phrases, total blanks, and/or line counts. Word-structure-level features may include word structure coordinates, counts of words/phrases, counts of word clusters, total blanks, and/or line counts. The relevant coordinates and other structural information may be used to generate an output such as a CSV output. The use of morphological information as part of the feature engineering may improve the performance of phrase prediction.
As a fourth step, weak labels for knowledge model training may be applied as part of entity extraction.
In some embodiments, the model architecture may utilize semantic information, structural information, and/or morphological information. The model may include a custom network comprising an input layer, a body portion, and a prediction portion. The input layer may include (a) merged embeddings and feature engineering and/or (b) variable batch sizes and sliding windows. The body portion may include (a) fully dense layers and/or (b) a custom loss. The prediction portion may include custom metrics to be monitored. Having a custom layer with NLP embeddings may allow the model to learn both content information and related location information. The model may apply a sliding window from left to right. The model may take full advantage of structure-aware training.

With respect to deep learning based computer vision, deep OCR targets more inconsistent and noisy scenes, rather than the specific situations (such as well-scanned or printed document sets) addressed by common OCR engines. The three main data sets used for training and testing are ICDAR13, ICDAR15, and ICDAR17. The scenes in these images are mostly scene text. Some samples of ICDAR13 images are shown in fig. 4.
Some samples of ICDAR2015 images are shown in fig. 5.
The combined solution (text detection + text recognition) is slower than either of the two individual solutions described above, but can be flexibly tailored thanks to its separate architecture. The end-to-end solution is faster but has relatively lower performance. Details are shown in Table 2 below, which compares the highest scores.
                  ICDAR2013      ICDAR2015
End-to-end        0.8477 (F1)    0.6533 (F1)
Text detection    0.952 (F1)     0.869 (F1)
Text recognition  0.95 (Acc)     0.933 (Acc)

TABLE 2
Among the models of the first solution, the model from Clova was chosen as the base model. As shown in fig. 6, the performance of the model from Clova is competitive, and its predictions are more flexible than those of other text detection models.
For most scanned images, the OCR engine (ABBYY) outputs phrases. An example is shown in fig. 7, which compares deep OCR with an OCR engine.
FIG. 8 depicts a schematic diagram of a two-part pipeline for knowledge-based information extraction from a rich-format digital document set, in accordance with some embodiments. In some embodiments, the pipeline shown in fig. 8 may share any one or more features in common with the pipeline 300 shown in fig. 3 above and/or with any other embodiment described herein. Described herein are features of such a pipeline for knowledge-based information extraction from a rich-format digital document set, according to some embodiments.
FIG. 9 shows a first page of a PDF document that may be used as an input to the pipeline of FIG. 8.
Fig. 10 shows the result of an input PDF page subjected to a denoising operation and then binarized into an image.
After denoising and binarization into images, a text detection model may be applied. FIG. 11 illustrates the bounding boxes applied by a text detection model, which may be defined at the word level. In applying the text detection model, the function may be detection_net. The following customizations may be available when it is applied (a hedged invocation sketch follows the parameter list):
trained_model: pre-trained model for text detection
text_threshold: confidence threshold for detecting text
low_text: lower-bound text score
link_threshold: link confidence threshold
cuda: use CUDA for inference (default: true)
canvas_size: maximum image size for inference
mag_ratio: image magnification ratio
poly: enable polygon-type results
show_time: show the processing time
test_folder: folder path for input images
refine: use the link refiner for sentence-level datasets
refiner_model: pre-trained refiner model
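A hedged sketch of invoking a CRAFT-style detection script with the parameters listed above; the script name and checkpoint file names are assumptions, and the threshold values shown are common defaults rather than values specified herein.

```python
# Minimal sketch: invoke the detection script as a subprocess, passing the
# customization flags from the list above.
import subprocess

subprocess.run([
    "python", "test.py",                       # assumed script name
    "--trained_model", "craft_mlt_25k.pth",    # assumed checkpoint name
    "--text_threshold", "0.7",
    "--low_text", "0.4",
    "--link_threshold", "0.4",
    "--cuda", "True",
    "--canvas_size", "1280",
    "--mag_ratio", "1.5",
    "--test_folder", "./input_pages/",
    "--refine",
    "--refiner_model", "craft_refiner_CTW1500.pth",  # assumed checkpoint name
], check=True)
```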
After the text detection model is applied, missing information may be supplemented, for example using "Canvas". Fig. 12 shows how the detected text may be covered by a white box.
Fig. 13 shows how lines, broken lines, and other noise are removed to preserve only missing information (shown as white tiles in fig. 13).
Fig. 14 shows how the identified white tiles indicating missing information are supplemented into the text detection result, with additional bounding boxes (as compared to the bounding boxes in fig. 11) showing the supplemental information.
The system may then analyze the different orientations and sizes of the detected "blobs" (e.g., phrases, paragraphs, sections, etc.) based on morphological analysis. As shown by the additional bounding boxes in fig. 15 (as compared to the bounding boxes in figs. 11 and 14), horizontal "blobs" may be identified as phrases.
As shown by the additional bounding boxes in FIG. 16 (as compared to the bounding boxes in FIGS. 11 and 14), larger "blobs" (e.g., as compared to other bounding boxes on the same page, other bounding boxes in the same document package, and/or based on prior training of the system) may be identified as paragraphs and/or sections separated by significant distances.
As shown by the additional bounding boxes in FIG. 17 (as compared to the bounding boxes in FIGS. 11 and 14), the largest "blob" (e.g., as compared to other bounding boxes on the same page, other bounding boxes in the same document package, and/or based on prior training of the system) may be identified as indicative of a structural segmentation of the document. In FIG. 17, in some embodiments, the additional bounding boxes correspond to blobs that may indicate structural segmentation of the document. The identification of paragraphs as shown in fig. 16 and the identification of structure segmentation as shown in fig. 17 may be performed using different "blob" size thresholds. In some embodiments, information about structural segmentation of documents may be used for a feed network.
Then, in some embodiments, the system may sort the words from left to right and may determine line labels, for example as shown in fig. 18. Regarding how structural information can affect performance, different-scale "blobs" can be used not only for feature engineering but also for inference correction. Thus, the model may use, for example, phrase-level "blobs" and line information to locate the entity at which a predicted word is located (a minimal sketch of line-label assignment follows).
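A minimal sketch of line-label assignment by sorting word boxes top-to-bottom and left-to-right; the vertical-gap heuristic is an illustrative assumption rather than a rule prescribed herein.

```python
def assign_line_labels(words, y_tol_ratio=0.5):
    """words: list of dicts with x0, y0, x1, y1. Adds a line_label key."""
    words = sorted(words, key=lambda w: (w["y0"], w["x0"]))
    line, prev = 0, None
    for w in words:
        if prev is not None:
            # Start a new line when the vertical gap exceeds a fraction of
            # the previous word's height.
            if w["y0"] - prev["y0"] > y_tol_ratio * (prev["y1"] - prev["y0"]):
                line += 1
        w["line_label"] = line
        prev = w
    return words
```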
The system may then apply one or more text recognition algorithms on each bounding box and may generate output data therefrom, such as by generating a CSV file with line and phrase information for each token, as shown in FIG. 19.
The data included in the output (as shown in fig. 19) may include one or more of the following:
slug: the name of the document
page: the page to which the token belongs
x0, y0: coordinates of the upper-left corner of the bounding box
x1, y1: coordinates of the lower-right corner of the bounding box
rel_x0, rel_y0, rel_x1, rel_y1: relative coordinates; coordinates adjusted according to the size of the document
token: the word recognized by the text recognition algorithm
line_label: the line of the document on which the token is located
word_group_label: generated from horizontal blobs (used to identify phrases)
The system may then label the data set with domain knowledge information received, for example, via one or more user inputs from a user of the system. This may be referred to as a "weak labeling function." The system may create annotations that may be verified by one or more users and may be used to train a computer vision model that can extract information, allowing bootstrapping with little or no labeled data.
For example, a user may want to extract committee names from documents, and the user's domain knowledge may include that "JONES FOR SENATE" is a committee name. After the user enters this information into the solution, the system may scan all training documents and label the words recognized by deep OCR. For example, the deep OCR output for a document may be "For instance, the 'JONES FOR SENTE' is a committee name" (note the recognition error in "SENTE," which fuzzy matching can tolerate). The CSV document may then be labeled as shown in FIG. 20.
In this example, the solution correctly labels the committee name (in the line that includes the words "JONES," "FOR," and "SENATE") and also incorrectly labels the first instance of the word "FOR" as part of the committee name. This demonstrates that the weak labeling function can generate some false labels in the label column. With weak labels, the true signal may be diluted by noise from incorrect labels; if recall is high enough, this noise does not significantly impair model performance. Thus, the weak labeling function, with appropriate configuration as described herein, may be used in the systems described herein to label data (a minimal sketch of such a labeler follows).
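A minimal sketch of a fuzzy-matching weak labeling function, assuming the rapidfuzz library; the similarity threshold and the fixed-size token window are illustrative assumptions.

```python
from rapidfuzz import fuzz

def weak_label(tokens, known_entity, threshold=85):
    """tokens: ordered list of OCR tokens for one document.
    Returns a 0/1 label per token: 1 if the token falls inside a window
    that fuzzily matches the known entity (e.g., 'JONES FOR SENATE')."""
    labels = [0] * len(tokens)
    n = len(known_entity.split())
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        if fuzz.ratio(window.upper(), known_entity.upper()) >= threshold:
            for j in range(i, i + n):
                labels[j] = 1
    return labels

# 'SENTE' still matches 'SENATE' closely enough to be labeled.
print(weak_label(["For", "instance,", "the", "JONES", "FOR", "SENTE", "is",
                  "a", "committee", "name"], "JONES FOR SENATE"))
```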
The system may be configured to perform feature engineering, for example as follows for the example image shown in fig. 21 (a minimal computation sketch follows this list):
length_words: the number of characters in this token; here, 16 characters
word_size: the area of the bounding box; (x1 - x0) × (y1 - y0)
related_location: the order of the word, counting from the upper-left corner to the lower-right corner
num_upper_char: the number of uppercase characters; here, 14 uppercase letters
title_word: whether the word is in title case
relative_word_size: the word size relative to the maximum word size on the same page
max_word_size: the maximum word size on a page
max_word_size_page: the page's maximum word size relative to the maximum word size in the document
num_words_in_page: the number of words on the page
x0_min: the minimum value of x0 on the page
x1_max: the maximum value of x1 on the page
y0_min: the minimum value of y0 on the page
y1_max: the maximum value of y1 on the page
line_label_max: the number of lines on the page
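A minimal sketch computing several of the features above from word records carrying x0, y0, x1, y1, and token fields; the ratio interpretation of relative_word_size is an assumption.

```python
def engineer_features(words):
    """Adds per-word feature fields; feature names follow the list above."""
    max_word_size = max((w["x1"] - w["x0"]) * (w["y1"] - w["y0"]) for w in words)
    ordered = sorted(words, key=lambda w: (w["y0"], w["x0"]))  # reading order
    for order, w in enumerate(ordered):
        w["length_words"] = len(w["token"])
        w["word_size"] = (w["x1"] - w["x0"]) * (w["y1"] - w["y0"])
        w["related_location"] = order
        w["num_upper_char"] = sum(c.isupper() for c in w["token"])
        w["title_word"] = w["token"].istitle()
        w["relative_word_size"] = w["word_size"] / max_word_size
        w["max_word_size"] = max_word_size
    return words
```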
The system may apply one or more knowledge models. The system may apply an activation function, such as a self-regularized non-monotonic neural activation function (e.g., Mish). The derivative of such an activation function is smoother than that of ReLU, which may increase the rate of convergence.
To ensure that the distribution of labels is stable, the model may be trained using varying batch sizes, which may ensure that the model is trained on the same number of documents in each batch. This may reduce the risk of exploding gradients (a hedged batching sketch follows).
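A minimal sketch of such document-aligned variable batching; the grouping rule (a fixed number of documents per batch, with batch size in rows varying) is an illustrative assumption.

```python
from collections import defaultdict

def document_batches(rows, docs_per_batch=2):
    """rows: dicts that include a 'slug' (document id) field. Yields batches
    containing all word rows of exactly docs_per_batch documents, so the
    row-level batch size varies while the document count stays constant."""
    by_doc = defaultdict(list)
    for r in rows:
        by_doc[r["slug"]].append(r)
    slugs = list(by_doc)
    for i in range(0, len(slugs), docs_per_batch):
        batch = []
        for slug in slugs[i:i + docs_per_batch]:
            batch.extend(by_doc[slug])
        yield batch
```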
In some embodiments, the embedding layer of the network may be constructed with fastText word embeddings. This approach may improve the convergence rate and accuracy of the model (a hedged sketch follows).
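A hedged sketch of an embedding layer initialized from pre-trained fastText vectors, using PyTorch; the framework choice, the vocabulary handling, and the 300-dimension default are assumptions rather than details specified herein.

```python
import numpy as np
import torch
import torch.nn as nn

def build_embedding(vocab, fasttext_model, dim=300):
    """vocab: token -> row index. fasttext_model: object exposing
    get_word_vector (as in the fastText Python package)."""
    weights = np.zeros((len(vocab), dim), dtype=np.float32)
    for token, idx in vocab.items():
        weights[idx] = fasttext_model.get_word_vector(token)
    # freeze=False allows fine-tuning the embeddings during training.
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```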
In some embodiments, the systems and methods described herein may provide superior performance compared to Named Entity Recognition (NER) models.
Named entity recognition is a subtask of information extraction, aimed at locating named entities mentioned in unstructured text and classifying them into predefined categories such as person name, organization, location, medical code, time expression, quantity, monetary value, percentage, etc. FIG. 22 illustrates an architecture of a named entity recognition model, according to some embodiments.
Tools for NER, including spaCy, StanfordNLP, and BERT, are trained on large numbers of documents. However, the main part of the documents used for training consists of paragraphs, not phrases. This means that pre-trained NER models may not be suitable for handling all document types, such as documents in rich formats.
FIG. 23 illustrates output data from a named entity recognition model, according to some embodiments.
The NER model was applied to the same test data described above, and the named entities were annotated with different bounding boxes. The results are shown in fig. 24, where the different bounding box types (which may, e.g., be displayed in different colors by the display system and may correspond to different stored metadata associated with the bounding boxes) may correspond to the following meanings:
·“CARDINAL”
·“ORG”
·“DATE”
·“LANGUAGE”
·“GPE”
·“PRODUCT”
·“PERSON”
·“TARGET ENTITY” (ground truth)
It was observed that none of the ground truth entities were detected by the NER model; "Smart Media Group" and "SPENC-Spence for Governor" are an institution name and a committee name, respectively.
However, when the NER model is used on a complete sentence like "SMART MEDIA GROUP advertises in KSHB-TV," the NER model correctly identifies "SMART MEDIA GROUP" as an organization, as shown in fig. 25.
Thus, for documents with paragraph structures, the NER model may be a good solution. However, for documents that are rich in format and in which paragraphs are not a major part of the document, the NER model may have only limited applicability, and other systems and methods described herein may provide improvements and advantages.
Computer
Fig. 26 illustrates an example of a computer according to some embodiments. Computer 2600 can be a component of a system for providing an AI-enhanced audit platform, including techniques for providing AI interpretability for processing data through multiple layers. In some embodiments, computer 2600 may perform any one or more of the methods described herein.
Computer 2600 may be a host computer connected to a network. The computer 2600 may be a client computer or a server. As shown in fig. 26, computer 2600 may be any suitable type of microprocessor-based device, such as a personal computer, workstation, server, or handheld computing device (such as a telephone or tablet). The computer may include, for example, one or more of a processor 2610, an input device 2620, an output device 2630, a storage 2640, and a communication device 2660. Input devices 2620 and output devices 2630 may correspond to those described above and may be connected to or integrated with a computer.
Input device 2620 may be any suitable device that provides input, such as a touch screen or monitor, keyboard, mouse, or voice recognition device. Output device 2630 may be any suitable device that provides output, such as a touch screen, monitor, printer, disk drive, or speaker.
Storage 2640 may be any suitable device that provides storage, such as electrical, magnetic, or optical memory, including Random Access Memory (RAM), cache, hard disk drive, CD-ROM drive, tape drive, or removable storage disk. Communication device 2660 may include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or card. The components of the computer may be connected in any suitable manner, such as via a physical bus or wirelessly. Storage 2640 may be a non-transitory computer-readable storage medium including one or more programs that, when executed by one or more processors (such as processor 2610), cause the one or more processors to perform the methods described herein.
The software 2650, which may be stored in the storage 2640 and executed by the processor 2610, may include, for example, programming (e.g., as described above, as embodied in a system, computer, server, and/or device) that implements the functionality of the present disclosure. In some embodiments, software 2650 may include a combination of servers such as an application server and a database server.
The software 2650 may also be stored and/or transmitted within any computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch and execute the instructions associated with the software from the instruction execution system, apparatus, or device. In the context of this disclosure, a computer-readable storage medium may be any medium, such as storage 2640, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.
Software 2650 may also be propagated within any transmission medium used by or in connection with an instruction execution system, apparatus, or device, such as those described above, from which instructions associated with the software are retrieved and executed. In the context of this disclosure, a transmission medium may be any medium that can communicate, propagate, or transport programming for use by or in connection with an instruction execution system, apparatus, or device. Transmission media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation media.
Computer 2600 may be connected to a network, which may be any suitable type of interconnected communication system. The network may implement any suitable communication protocol and may be secured by any suitable security protocol. The network may include any suitably arranged network link, such as a wireless network connection, T1 or T3 line, wired network, DSL, or telephone line, that enables transmission and reception of network signals.
Computer 2600 may implement any operating system suitable for operating on a network. Software 2650 may be written in any suitable programming language, such as C, C++, Java, or Python. In various embodiments, for example, application software implementing the functionality of the present disclosure may be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service.
The following is a list of examples:
Embodiment 1, a system for determining a composition of a document package, the system comprising one or more processors configured to cause the system to:
receiving data comprising a document package;
extracting, from the document package, first information including substantial content of one or more documents of the document package;
extracting second information from the document package including metadata associated with one or more documents of the document package; and
generating, based on the first information and the second information, output data representing the composition of the document package.
Embodiment 2, the system of embodiment 1, wherein the output data representing the composition of the document package represents one or more delineations between page boundaries in the document package.
Embodiment 3, the system of embodiments 1-2, wherein generating the output data is further based on contextual information received from a data source separate from the document package.
Embodiment 4, the system of embodiment 3, wherein the contextual information comprises ERP data received from an ERP system of the entity associated with the document bundle.
Embodiment 5, the system of embodiments 3-4, wherein the context information includes data specifying a predefined set of events associated with a process associated with the document package.
Embodiment 6, the system of embodiments 3-5, wherein the context information includes data characterizing the request, wherein the data comprising the document package is received by the system in response to the request.
Embodiment 7, the system of embodiments 3-6, wherein the context information includes data characterizing an automated process for acquiring the data.
Embodiment 8, the system of embodiments 1-7, wherein the metadata comprises one or more of: file name, file extension, file creator, and file date.
Embodiment 9, the system of embodiments 1-8, wherein extracting the first information comprises applying embedded object type detection.
Embodiment 10, the system of embodiments 1-9, wherein generating the output data comprises applying a page similarity assessment model to a plurality of pages of the document package.
Embodiment 11, the system of embodiments 1-10, wherein generating the output data comprises applying finite state modeling data processing operations to the document package.
Embodiment 12, a non-transitory computer-readable storage medium storing instructions for determining a composition of a document bundle, the instructions configured to be executed by one or more processors of a system to cause the system to:
Receiving data comprising a document package;
extracting, from the document package, first information including substantial content of one or more documents of the document package;
extracting second information from the document package including metadata associated with one or more documents of the document package; and
generating, based on the first information and the second information, output data representing the composition of the document package.
Embodiment 13, a method for determining a composition of a document package, wherein the method is performed by a system comprising one or more processors, the method comprising:
receiving data comprising a document package;
extracting, from the document package, first information including substantial content of one or more documents of the document package;
extracting second information from the document package including metadata associated with one or more documents of the document package; and
generating, based on the first information and the second information, output data representing the composition of the document package.
Embodiment 14, a system for verifying a signature in a document, the system comprising one or more processors configured to cause the system to:
receiving an electronic document comprising one or more signatures;
applying one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and
determining, based on the data representing the spatial location and the confidence level, whether the electronic document satisfies a set of signature criteria.
Embodiment 15, the system of embodiment 14, wherein the one or more signature extraction models comprise a first signature extraction model configured to identify a signature regardless of spatial location.
Embodiment 16, the system of embodiments 14-15, wherein the one or more signature extraction models includes a second signature extraction model configured to identify a signature based on a spatial location within the document.
Embodiment 17, the system of embodiment 16, wherein applying the second signature extraction model comprises:
determining a predicted spatial location within the electronic document based on one or more of a structure, a format, and a type of the electronic document; and
a signature is extracted from the predicted spatial location.
Embodiment 18, the system of embodiments 14-17, wherein determining whether the electronic document meets a set of signature criteria comprises determining whether the signature appears at a desired spatial location in the electronic document.
Embodiment 19, the system of embodiments 14-18, wherein determining whether the electronic document meets the set of signature criteria includes determining whether a confidence level exceeds a predefined threshold.
Embodiment 20, the system of embodiments 14-19, wherein determining whether the electronic document meets the set of signature criteria comprises determining whether the signature appears, in the electronic document, within spatial proximity of context data extracted from the electronic document.
Embodiment 21, the system of embodiments 14-20, wherein determining whether the electronic document meets the set of signature criteria includes generating an association score that indicates a level of association between a signature extracted from the electronic document and signature-context data generated based on the electronic document.
Embodiment 22, the system of embodiments 14-21, wherein the system is configured to determine the set of signature criteria based at least in part on context data, wherein the context data indicates one or more of: document type, document structure, and document format.
Embodiment 23, the system of embodiments 14-22, wherein the system is configured to determine the set of signature criteria based at least in part on the one or more signatures detected in the document.
Embodiment 24, a non-transitory computer-readable storage medium storing instructions for verifying a signature in a document, the instructions configured to be executed by one or more processors of a system to cause the system to:
Receiving an electronic document comprising one or more signatures;
applying one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and
determining, based on the data representing the spatial location and the confidence level, whether the electronic document satisfies a set of signature criteria.
Embodiment 25, a method for verifying a signature in a document, wherein the method is performed by a system comprising one or more processors, the method comprising:
receiving an electronic document comprising one or more signatures;
applying one or more signature extraction models to the electronic document to generate, for each of the one or more signatures in the electronic document, data representing a spatial location of the respective signature and a confidence level of the respective signature; and
determining, based on the data representing the spatial location and the confidence level, whether the electronic document satisfies a set of signature criteria.
Embodiment 26, a system for extracting information from a document, the system comprising one or more processors configured to cause the system to:
Receiving a dataset comprising a plurality of electronic documents;
applying a set of data conversion processing steps to a plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; and
applying the set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises:
applying a knowledge-based deep learning model trained based on the structured data and a plurality of data tags indicated by one or more user inputs; and
generating output data extracted from the plurality of electronic documents by the deep learning model.
Embodiment 27, the system of embodiment 26, wherein applying the set of data conversion processing steps includes applying an automated orientation correction processing step prior to applying the one or more deep learning based OCR models.
Embodiment 28, the system of embodiments 26-27, wherein applying the set of data conversion processing steps includes applying a denoising function prior to applying the one or more deep learning-based OCR models.
Embodiment 29, the system of embodiments 26-28, wherein applying the one or more deep learning based OCR models comprises:
applying a text detection model; and
a text recognition model is applied.
Embodiment 30, the system of embodiments 26-29, wherein applying the set of data conversion processing steps includes, after applying the one or more deep learning based OCR models, generating structured data based on image level feature engineering steps.
Embodiment 31, the system of embodiments 26-30, wherein applying the set of data conversion processing steps includes applying a post-processing method that uses morphological operations to resolve structural relationships between words.
Embodiment 32, the system of embodiments 26-31, wherein applying the set of knowledge-based modeling processing steps includes generating structured data based on one or more feature engineering processing steps prior to receiving user input indicative of the plurality of data tags.
Embodiment 33, the system of embodiment 32, wherein the one or more feature engineering processing steps include predicting phrases based on morphological information.
Embodiment 34, the system of embodiments 26-33, wherein applying the set of knowledge-based modeling processing steps includes applying a model based on user training for user-defined feature engineering.
Embodiment 35, the system of embodiments 26-34, wherein applying the set of knowledge-based modeling processing steps includes applying fuzzy matching, wherein the system is configured to consider a partial match sufficient for labeling purposes, to automatically label documents on a word-by-word basis.
Embodiment 36, the system of embodiments 26-35, wherein applying the set of knowledge-based modeling processing steps includes automatically correcting one or more text recognition errors during a training process.
Embodiment 37, the system of embodiments 26-36, wherein the knowledge-based deep learning model includes a loss function configured to accelerate convergence of the knowledge-based deep learning model.
Embodiment 38, the system of embodiments 26-37, wherein the knowledge-based deep learning model includes one or more layers using Natural Language Processing (NLP) embeddings such that the model learns both content information and related location information.
Embodiment 39, the system of embodiments 26-38, wherein the knowledge-based deep learning model is trained using an adaptive feed method.
Embodiment 40, the system of embodiments 26-39, wherein the knowledge-based deep learning model includes an input layer that applies merged embeddings.
Embodiment 41, the system of embodiments 26-40, wherein the knowledge-based deep learning model includes an input layer configured to vary a batch size.
Embodiment 42, the system of embodiments 26-41, wherein the knowledge-based deep learning model includes an input layer that applies a sliding window.
Embodiment 43, the system of embodiments 26-42, wherein the knowledge-based deep learning model includes one or more full dense layers disposed between the input layer and the prediction layer.
Embodiment 44, the system of embodiments 26-43, wherein the knowledge-based deep learning model includes a predictive layer that generates one or more metrics for presentation to the user.
Embodiment 45, a non-transitory computer-readable storage medium storing instructions for extracting information from a document, the instructions configured to be executed by one or more processors of a system to cause the system to:
receiving a dataset comprising a plurality of electronic documents;
applying a set of data conversion processing steps to a plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; and
Applying the set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises:
applying a knowledge-based deep learning model trained based on the structured data and a plurality of data tags indicated by one or more user inputs; and
generating output data extracted from the plurality of electronic documents by the deep learning model.
Embodiment 46, a method for extracting information from a document, wherein the method is performed by a system comprising one or more processors, the method comprising:
receiving a dataset comprising a plurality of electronic documents;
applying a set of data conversion processing steps to a plurality of electronic documents to generate a processed dataset comprising structured data generated based on the plurality of electronic documents, wherein applying the set of data conversion processing steps comprises applying one or more deep learning based Optical Character Recognition (OCR) models; and
applying the set of knowledge-based modeling processing steps to the structured data, wherein applying the set of knowledge-based modeling processing steps comprises:
applying a knowledge-based deep learning model trained based on the structured data and a plurality of data tags indicated by one or more user inputs; and
generating output data extracted from the plurality of electronic documents by the deep learning model.
This application is related to the U.S. patent application filed on June 30, 2022, attorney docket no. 13574-20068.00, entitled "AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR AUTOMATED ASSESSMENT OF VOUCHING EVIDENCE".
This application is related to the U.S. patent application filed on June 30, 2022, attorney docket no. 13574-20069.00, entitled "AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR AUTOMATED ADJUDICATION OF COMMERCIAL SUBSTANCE, RELATED PARTIES, AND COLLECTABILITY".
This application is related to the U.S. patent application filed on June 30, 2022, attorney docket no. 13574-20070.00, entitled "AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR APPLYING A COMPOSABLE ASSURANCE INTEGRITY FRAMEWORK".
This application is related to the U.S. patent application filed on June 30, 2022, attorney docket no. 13574-20072.00, entitled "AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR PROVIDING AI-EXPLAINABILITY FOR PROCESSING DATA THROUGH MULTIPLE LAYERS".

Claims (13)

1. A system for determining composition of a document package, the system comprising one or more processors configured to cause the system to:
receiving data comprising a document package;
extracting first information including substantial content of one or more documents of the document package from the document package;
extracting second information from the document package including metadata associated with one or more documents of the document package; and
generating, based on the first information and the second information, output data representing a composition of the document package.
2. The system of claim 1, wherein the output data representing the composition of the document package represents one or more delineations between page boundaries in the document package.
3. The system of any of claims 1-2, wherein generating output data is further based on contextual information received from a data source separate from the document package.
4. The system of claim 3, wherein the contextual information includes ERP data received from an ERP system of an entity associated with the document package.
5. The system of any of claims 3-4, wherein the contextual information includes data specifying a predefined set of events associated with a process associated with the document package.
6. The system of any of claims 3-5, wherein the context information includes data characterizing a request, wherein the data comprising a document package is received by the system in response to the request.
7. The system of any of claims 3-6, wherein the contextual information includes data characterizing an automated process flow for acquiring the data.
8. The system of any of claims 1-7, wherein the metadata comprises one or more of: file name, file extension, file creator, and file date.
9. The system of any of claims 1-8, wherein extracting the first information includes applying embedded object type detection.
10. The system of any of claims 1-9, wherein generating output data includes applying a page similarity assessment model to a plurality of pages of the document package.
11. The system of any of claims 1-10, wherein generating output data includes applying finite state modeling data processing operations to the document package.
12. A non-transitory computer-readable storage medium storing instructions for determining composition of a document package, the instructions configured to be executed by one or more processors of a system to cause the system to:
Receiving data comprising a document package;
extracting first information including substantial content of one or more documents of the document package from the document package;
extracting second information from the document package including metadata associated with one or more documents of the document package; and
generating, based on the first information and the second information, output data representing a composition of the document package.
13. A method for determining a composition of a document package, wherein the method is performed by a system comprising one or more processors, the method comprising:
receiving data comprising a document package;
extracting first information including substantial content of one or more documents of the document package from the document package;
extracting second information from the document package including metadata associated with one or more documents of the document package; and
generating, based on the first information and the second information, output data representing a composition of the document package.
CN202280057933.7A 2021-06-30 2022-06-30 AI-enhanced audit platform including techniques for automated document processing Pending CN117859122A (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US202163217134P 2021-06-30 2021-06-30
US63/217,134 2021-06-30
US63/217,131 2021-06-30
US63/217,123 2021-06-30
US63/217,119 2021-06-30
US63/217,127 2021-06-30
PCT/US2022/073290 WO2023279045A1 (en) 2021-06-30 2022-06-30 Ai-augmented auditing platform including techniques for automated document processing

Publications (1)

Publication Number Publication Date
CN117859122A true CN117859122A (en) 2024-04-09

Family

ID=90254927

Family Applications (4)

Application Number Title Priority Date Filing Date
CN202280053275.4A Pending CN117751362A (en) 2021-06-30 2022-06-30 AI-enhanced audit platform including techniques for applying combinable assurance integrity frameworks
CN202280057802.9A Pending CN117882041A (en) 2021-06-30 2022-06-30 AI enhanced audit platform including techniques for providing AI interpretability through multiple layers of processed data
CN202280057790.XA Pending CN117882081A (en) 2021-06-30 2022-06-30 AI enhanced audit platform including techniques for automatically evaluating evidence of a checklist
CN202280057933.7A Pending CN117859122A (en) 2021-06-30 2022-06-30 AI-enhanced audit platform including techniques for automated document processing

Family Applications Before (3)

Application Number Title Priority Date Filing Date
CN202280053275.4A Pending CN117751362A (en) 2021-06-30 2022-06-30 AI-enhanced audit platform including techniques for applying combinable assurance integrity frameworks
CN202280057802.9A Pending CN117882041A (en) 2021-06-30 2022-06-30 AI enhanced audit platform including techniques for providing AI interpretability through multiple layers of processed data
CN202280057790.XA Pending CN117882081A (en) 2021-06-30 2022-06-30 AI enhanced audit platform including techniques for automatically evaluating evidence of a checklist

Country Status (1)

Country Link
CN (4) CN117751362A (en)

Also Published As

Publication number Publication date
CN117751362A (en) 2024-03-22
CN117882081A (en) 2024-04-12
CN117882041A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
US20230004604A1 (en) Ai-augmented auditing platform including techniques for automated document processing
US11314969B2 (en) Semantic page segmentation of vector graphics documents
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
US20210201013A1 (en) Contract lifecycle management
US10049096B2 (en) System and method of template creation for a data extraction tool
US11416531B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
US11954139B2 (en) Deep document processing with self-supervised learning
CN110033018B (en) Graph similarity judging method and device and computer readable storage medium
US20220004878A1 (en) Systems and methods for synthetic document and data generation
CN111680490A (en) Cross-modal document processing method and device and electronic equipment
US20210117667A1 (en) Document structure identification using post-processing error correction
US11393233B2 (en) System for information extraction from form-like documents
CN115917613A (en) Semantic representation of text in a document
WO2014064803A1 (en) Document processing program, document processing device, document processing system, and document processing method
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
CN112464927B (en) Information extraction method, device and system
US20230325601A1 (en) System and method for intelligent generation of privilege logs
CN112424784A (en) Systems, methods, and computer-readable media for improved table identification using neural networks
US20230134218A1 (en) Continuous learning for document processing and analysis
US11880798B2 (en) Determining section conformity and providing recommendations
CN117859122A (en) AI-enhanced audit platform including techniques for automated document processing
EP3640861A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
Wijesinghe et al. Computer representation of Venn and Euler diagrams
US20230140546A1 (en) Randomizing character corrections in a machine learning classification system
US20230368556A1 (en) Character-based representation learning for table data extraction using artificial intelligence techniques

Legal Events

Date Code Title Description
PB01 Publication