CN114616572A - Cross-document intelligent writing and processing assistant - Google Patents


Info

Publication number
CN114616572A
Authority
CN
China
Prior art keywords: blocks, documents, document, computer, implemented method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080064610.1A
Other languages
Chinese (zh)
Inventor
A·贝根
S·德罗斯
T·贾弗里
L·马蒂
M·帕尔默
J·保利
C·帕夫洛普卢
E·普里科尤
S·萨兰吉
M·萨威基
M·什哈德赫
M·塔隆
B·托普拉尼
Z·R·瓦迪亚
D·沃特森
E·怀特
J·Y·范
K·古普塔
A·M·黄
刘占麟
J·G·帕利亚卡拉
吴肇锋
张越
周小荃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Docugami
Original Assignee
Docugami
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Docugami filed Critical Docugami
Priority claimed from PCT/US2020/043606 (published as WO2021055102A1)
Publication of CN114616572A


Classifications

    • G06F 40/169 Annotation, e.g. comment data or footnotes
    • G06F 16/2457 Query processing with adaptation to user needs
    • G06F 16/248 Presentation of query results
    • G06F 16/93 Document management systems
    • G06F 40/106 Display of layout of documents; Previewing
    • G06F 40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G06F 40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G06F 40/186 Templates
    • G06F 40/205 Parsing
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 40/30 Semantic analysis
    • G06N 20/00 Machine learning
    • G06N 3/045 Combinations of networks
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06N 5/04 Inference or reasoning models
    • G06V 30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G06V 30/418 Document matching, e.g. of document images
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Abstract

Machine learning, artificial intelligence, and other computer-implemented methods are used to identify various semantically significant blocks in a document, automatically mark them with appropriate data types and semantic roles, and use this enhanced information to assist authors and support downstream processes. A block's location, data type, and semantic role can often be determined automatically from what is referred to herein as "context": the combination of the block's formatting, structure, and content; that of adjacent or nearby content; global patterns of occurrence within the document; and the similarity of all of these across documents (primarily, but not exclusively, among documents in the same document set). Similarity is not limited to exact or fuzzy comparison of character strings or attributes; it may also include similarity of natural-language grammatical structures, similarity as measured by ML (machine learning) techniques such as word, block, and other embeddings, and similarity to the data types and semantic roles of previously identified blocks.

Description

Cross-document intelligent writing and processing assistant
Cross Reference to Related Applications
This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Serial No. 62/900,793, "Cross-Document Intelligent Automation and Processing Assistant," filed September 16, 2019. The subject matter of the foregoing is incorporated by reference herein in its entirety.
Technical Field
The present disclosure relates generally to methods and apparatus for AI-driven, self-supervised creation of hierarchically semantically tagged documents and/or for assisting in the authoring and processing of such documents.
Background
Many businesses create multiple documents that are very similar, even though they are customized each time. For example, an insurance office may generate many recommendations for a particular kind of insurance, but each recommendation must be tailored to the needs of a particular customer. These documents may be considered to be of the same "type" because they have similar textual (and possibly image) content (reflecting similar purposes and topics), similar selection and arrangement of large units (such as segments), and often even similar geometric layout and formatting characteristics.
Some types of documents are widely known and used, but many are not. Many are specific to a particular business, market, or application, and new types are created for new situations. Users, who may be referred to as "authors" or "editors," typically create a new document of a particular type (sometimes referred to as a "target document") by copying an earlier document of the same type and then modifying it as needed (e.g., by manually editing or replacing certain pieces of content).
In current practice, word processors typically identify blocks only where formatting must be applied: for example, titles, footnotes, and numbers may be explicitly marked for special formatting, but a name, address, or date is rarely explicitly indicated. Even when a block is identified, it is typically associated only with formatting effects (such as margins, fonts, etc.), which are useful information but provide no direct indication of its data type or semantic role. Similarly, word processors often represent containment hierarchy only visually: there is usually no explicit identification of the nested sections themselves, only distinctively formatted headings.
When creating a new document of the same general kind as the previous document, in many cases most of the work is text editing, replacing, removing or inserting certain blocks, taking care not to confuse blocks with different semantic roles (such as exchanging buyer and seller addresses). This often requires manual intervention because authoring systems are often unaware of these blocks, particularly their data types or semantic roles, and thus do not provide assistance very efficiently.
In some simple cases, "forms" and "templates" can provide explicit locations for filling in the contents of particular blocks. However, forms typically address only the simple case where substantially all required blocks can be enumerated in advance, and where there are few large, repeatable, or highly structured blocks. Creating forms also requires skill, is difficult to adjust to changing conditions, and does not actively assist the author.
Drawings
This patent or application document contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the office upon request and payment of the necessary fee.
Embodiments of the present disclosure have other advantages and features which will become more readily apparent from the following detailed description and appended claims, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a block diagram of one implementation of a system and process for creating documents of hierarchical semantic tags using machine learning and artificial intelligence.
FIG. 2 is a screenshot of a dashboard illustrating a process of tracking different sets of documents by the system of FIG. 1.
FIG. 3 is a screenshot of a user interface for receiving feedback from a user.
FIG. 4 is a screenshot of an integration with other software applications.
FIG. 5 is a block diagram of one embodiment of a computer system that may be used with the present invention.
Detailed Description
SUMMARY
A group of documents determined to be of the same type constitutes a "document set" or "document cluster". For example, an insurance company's proposals for a certain kind of insurance for a certain class of customers may be considered to be of the same type and form a document set. Proposals by the same company for different kinds of insurance, or for customers it considers different, may be considered different types belonging to different document sets. Rental agreements, clinical notes for a certain patient, sales proposals, calendars, meeting summaries, etc., are other potential document types, as are subtypes that share unique patterns of content, structure, and/or layout.
Creating and editing a new target document within a document set often involves editing or replacing "semantically significant blocks": specific parts of a document, usually but not necessarily contiguous spans of text, that have particular data types and semantic roles and that are meaningful and significant to business or other processes.
These blocks have various data types, which are finer-grained here than the atomic data types in many computer systems. For example, a given block may represent not merely a character string, but a person or organization name; a date; a duration (not the same thing as a date); or a monetary amount. Larger blocks may include lists of drugs or other substances, itineraries, procedures to be followed, packets of information (such as medical prescriptions), and myriad other things.
In addition, blocks may have semantic roles relative to the document in which they appear. For example, a person's name may be the "tenant" in a rental agreement, the "seller" in a sales proposal, or the "agent" of another person. A date may indicate the beginning or end of some responsibility or activity. A dollar amount may be a periodic payment amount, or a fine or bonus associated with certain conditions, and so on. Such semantic roles are important for the correct use of the information in a block. The name of a semantic role is referred to as a "semantic role tag," or simply a "tag."
A block is typically represented as a package including its location, data type, semantic role, and/or other data/metadata. Locations are typically represented as start and end points, which may be expressed in several ways, such as inserted markers, or byte, character, or token offsets (either global to the document or relative to a given ID, marker, or other object). Semantic roles are represented by tags or other identifiers. Blocks may be of any size, and some blocks may contain other blocks as "sub-blocks". Blocks may contain not only text but also non-text data, such as images or other media, and "structures" (such as tables, lists, sections, etc.).
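For concreteness, a block "package" of this kind might be represented as in the following sketch. The field names, offsets, and values are illustrative assumptions, not the actual schema described herein:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Block:
    # Offsets are character positions into the extracted text stream;
    # every field name here is illustrative, not this disclosure's schema.
    start: int
    end: int
    data_type: Optional[str] = None       # e.g. "person_name", "date", "money"
    semantic_role: Optional[str] = None   # e.g. "tenant", "effective_date"
    confidence: float = 1.0
    children: List["Block"] = field(default_factory=list)  # nested sub-blocks

# A date block playing the role "lease start date" in a rental agreement
lease_start = Block(start=812, end=822, data_type="date",
                    semantic_role="lease_start_date", confidence=0.93)
```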
The techniques disclosed herein use machine learning, artificial intelligence, and other computer-implemented methods to identify various semantically significant blocks in documents, automatically assign them appropriate data types and semantic roles, and use this enhanced information to assist authors and support downstream processes. The location, data type, and semantic role of a block can often be determined automatically from what is referred to herein as "context": the combination of the block's formatting, structure, and content; that of adjacent or nearby content; global patterns of occurrence within the document; and the similarity of all of these across documents (primarily, but not exclusively, between documents in the same document set). "Nearby content" includes content that is close horizontally, such as preceding or following in text reading order, and content that is close vertically, such as within the same container structures (lists, sections, and the like), along with their respective identifiers, titles, levels, etc. Similarity is not limited to exact or fuzzy comparison of character strings or attributes; it may also include similarity of natural-language grammatical structures, similarity as measured by ML (machine learning) techniques such as word, block, and other embeddings, and similarity to the data types and semantic roles of previously identified blocks.
For example, a person or organization name can often be identified as having a semantic role (such as "seller") because the document says so, often in sentence(s) of some human language, though frequently via larger context as well. As another example, one or more words may be easily identified as representing a data type, such as "drug name," yet context is needed to determine that it has the semantic role of an allergy rather than a prescription. Important evidence of a semantic role is often not in the same sentence at all, but is expressed in various other ways, such as the block appearing within a larger block (such as a "Known Allergies" section). The flexibility and variety of syntax and document structure (to say nothing of misspellings, transcription errors, etc.) makes even identifying data types nontrivial, and identifying semantic roles, particularly those whose scope exceeds a single sentence, is very difficult.
A given semantic role may relate a block to the entire document or to other blocks. For example, the departure time of a flight is tied to a particular "leg" of the itinerary and is only indirectly related to the other "legs". Typically, the hierarchy of blocks appropriately groups such items together, such as co-locating them within a section, table portion, or the like.
In more detail, examples of semantically significant blocks include: the name, address, and other characteristics of a particular party to a contract; the prescribed drugs and procedures in a medical record; the requirements (or exceptions) in a real estate proposal; the dates and flight numbers in an itinerary; and so on. These can all be considered semantic roles of blocks. There are also larger blocks with various types and roles, such as whole sections and subsections. These are often inserted or removed as a whole, perhaps with minor variations in contained blocks. Blocks may be hierarchical; that is, a larger "containing" block may contain other "sub-blocks," to any number of levels.
A block is typically a contiguous series of words in a document, such as "John Doe". However, a block may cover only part of the text: "John Doe's house" includes a name, but the name ends at the apostrophe, in the middle of the text. Blocks may even be non-contiguous, such as the same name in "John (also known as 'Bill') Doe". Typesetting may also make a block discontinuous; for example, a page break may occur in the middle of a block (perhaps with headers, footers, or footnotes, which may be ignored for some purposes), or a block may be interrupted by figures, tables, charts, sidebars, or other displays; and so on.
The actual location and context of a block may also be important: a block is not merely an isolated string, which may appear multiple times with different (or no) semantic roles in different instances. More modern systems typically support inline or stand-off markers, sometimes referred to as "annotations," that can persistently associate various tags and other information with spans of text. For example, HTML provides tags ("div", "ol", etc.) for manually marking the boundaries of common structural blocks, and several broad types or roles ("acronym", "kbd", "dfn", "cite", etc.) for (usually) smaller blocks. Other XML schemas provide many other kinds of markup, and word processors allow somewhat similar markup via "styles".
Some blocks may represent what is commonly referred to as a "field". These are often small blocks and often appear in many or all documents in a given set with similar context and layout, but often have different textual content in each document. They may also appear multiple times in a single document, with the same or very similar content. Such a block may be referred to as a "field block". In template-based systems, they are often manually discovered and treated as "fields," but here they are discovered through the pattern of their context and appearance within and across documents, and are assigned data types and semantic roles in much the same way as other blocks. They may or may not represent named entities such as personal names, addresses, dates, and the like.
Another general type of block may be referred to as a "structure" or "structural" block. Such blocks are typically large and often contain many other blocks (some of which may themselves be structural blocks). They frequently have a "title" or "heading" that provides a name, number, description, and/or other information about the structural block. Examples of structural blocks include chapters, sections, tables, figures, sidebars, and many others. The type and semantic role of structural blocks is often important in determining the types and semantic roles of nearby or contained blocks.
Not only the data type but also the specific semantic role is important for correctly authoring and utilizing a document: whether a name represents the buyer or the seller, or the patient or the doctor; whether a given date is the intended start or end, or the departure or arrival time of a flight; whether a number specifies principal, interest, dose, temperature, a fine, or something else. For larger blocks, roles include things such as "limitation of liability" statements, "governing law" provisions, "definitions," and countless others. The semantic role of a block is often specific to a particular domain or transaction and can be considered among the most important features of a document. In many kinds of documents, blocks with specific data types and semantic roles are required, or at least very common, and when blocks correspond across different documents they are called "counterparts". Counterpart blocks may appear in similar order and patterns, particularly in documents from the same author or organization, which typically belong to the same document set. Counterpart blocks have the same or very similar roles and typically have similar contexts and/or formatting. Thus, the distribution of block data types and semantic roles provides valuable information for distinguishing document types and for identifying counterpart blocks in other documents.
Many counterpart pieces have similar content, but others do not. For example, the same party (semantic role) in different documents is often different individuals, albeit in a very similar context and usage pattern. This may be particularly common, but not exclusive, to "field blocks".
Once discovered, the hierarchical semantic blocks in a business document, with their data types and semantic roles, can be used in downstream business processes. For example, a back-office database can correctly record a new mortgage if it is given the particular party names, particular dates, and numbers such as term and interest rate. For such uses in particular, semantic roles are of the utmost importance: placing the correct data type in the wrong database field (such as exchanging the names or addresses of seller and buyer) is a serious problem, especially when moving information to downstream databases, processes, or reports.
Several features and benefits
The techniques described herein may have various features and benefits, including any of the following.
Some implementations may provide an easier, more efficient, and more accurate way to generate documents with hierarchically organized blocks that carry semantic tags useful to business processes. This may be accomplished using various techniques to identify such blocks of different sizes, discover the data types and semantic roles they play in the document, and learn their usage patterns, characteristic contexts, and so forth. The learning may result from analysis of the content, structure, and formatting of current and previous documents; from feedback from authors and editors; and from comparison of multiple documents, particularly documents in the same document set. With this knowledge, the system can provide valuable assistance to the user, for example making it easier to create higher-quality new documents and to extract the desired information for downstream use, such as in other software applications, back-end databases, exported reports, compliance checks, and the like. Such learning can be accomplished with unsupervised and self-supervised learning techniques that do not require large amounts of pre-labeled or pre-analyzed data, but instead infer patterns from unlabeled or minimally labeled data.
Some implementations may enable a computer to help authors avoid many such errors by discovering and using patterns within and across business documents, thereby assisting the authoring process and reducing the time required to achieve a given level of quality.
Today, typical document systems do not identify blocks at all, let alone their data types or semantic roles. This increases authors' and editors' time and expense, as well as the time and expense of importing data from documents into back-end databases, dashboards, or other downstream business processes. For example, it is common to manually find data in contracts (block by block) and copy it into spreadsheets or data-entry forms.
Some implementations may help mark such hierarchical semantic blocks during the authoring process and represent them explicitly, making them easy for people and/or computers to extract and saving time and overhead in connecting to various other business processes.
Current techniques often fail to take full advantage of the similarities between multiple documents created by the same author or group, and/or documents of the same type (represented here by membership in a particular document set), to more reliably identify blocks in a new document or to flag potentially significant differences for attention. Explicit rules, such as requiring a section titled "Severability," cover only similarities that are easily noticed and described by analysts; they are static and often brittle (e.g., missing rephrasings or reorganizations, or failing to respond to changed conditions); and they soon become outdated. Small companies often lack the resources to develop more responsive techniques, and often have too few documents to justify the overhead. On the other hand, smaller companies often have less diverse documents, which are more amenable to automated analysis as described herein.
Some implementations may use extracted information about blocks and their patterns of content, context, layout, and usage across documents to assist authors in creating new documents. Examples include suggesting at least the following: specific content to be changed, reformatted, or moved; terms that are missing from the new document although they are typically present in similar documents (referred to as "missing" or "possibly overlooked" blocks or content); terms that are present even though they are not generally present in similar documents (referred to as "unusual" blocks or content); changes such as exchanging the names or roles of different parties at a particular location; and so on.
Some implementations may accept and retain user feedback, such as when the user indicates that blocks are marked with an incorrect scope, data type, or semantic role; that they are not of interest; or that they were not marked at all. Some implementations may use specific user modifications to improve machine learning and neural models, and also to remember not to repeat earlier suggestions the user has rejected (even if additional learning fails to prevent specific erroneous instances). In particular, some implementations may avoid requiring a large number of review steps or corrections; this is aided by few-shot learning techniques and by carefully selecting the required feedback so as to minimize the amount of user action needed. Some current techniques learn something very specific, for example when a user tells a spell checker to add a word to its dictionary. However, that involves only a mechanically memorized list, not iterative training or fine-tuning of models that determine complex later behavior, and therefore does not fully exploit capabilities such as those described herein.
Some implementations may use a small number of user modifications to learn and improve their behavior while avoiding annoying the user with repeated suggestions when reapplying an improved, but still imperfect, model.
Many businesses record specific information obtained from documents in various databases that support their processes. For example, companies that own many rental properties often use back-end systems to help manage not only tenants' payments, but also specific information derived from their rental agreements, such as approved pets, previous damage for which the tenant is not responsible, or other information. Car or tool rental companies, mortgage lenders, healthcare providers, municipalities, and other organizations use other information. Many goods and services come in many combinations and configurations, and supervisors review statistics regarding their acceptance, composition, pricing, and other factors. Business information systems typically provide analysis, consistency or compliance checking, reporting, and/or support for other business processes, all of which can be facilitated by using block information as described herein.
Typically, the blocks and the information they provide are spread throughout prose text, manually extracted, and manually entered into a spreadsheet, database, or other system. Manual work has been required because important blocks can be expressed in myriad different ways, owing to the flexibility of the natural human languages in which agreements, e-mails, and the like are written, and to similarly variable typesetting and presentation conventions. The underlying conversational content of such documents is also often dispersed across a variety of sources, including e-mail, conversation transcripts, slide presentations, and the like. This information may also be useful, but is typically handled manually. Some systems may treat such information sources as documents, thereby achieving the same benefits already described.
Some implementations may provide a means for a computer to begin executing a particular document once it has been transformed into hierarchically semantically tagged form as described herein. By combining a document's hierarchical tag structure with a tool that provides vector-semantic representations of text, certain blocks can be identified as requiring certain actions. For example, a contract may specify funds transfers, notifications, or other actions, as well as the conditions that enable or trigger them. These can be identified and used to begin executing the contract.
Some implementations may provide a simple way to review and summarize information from a document set in an interface such as a "dashboard," and to move the identified information into a client's back-end database or similar system, making business data flow more efficient and less costly and enhancing quality assurance, consistency, and reporting. Once blocks are semantically labeled, it becomes easier to generate a summary report over a document set containing the counterpart blocks. Some implementations may provide a very simple way for a user to create such a report by simply clicking on one or more instances of the blocks to be included, then locating and extracting those blocks across all documents in the collection by role or context. Some implementations may also assist the user in finding documents that lack expected counterpart blocks, and in amending them to include or identify such blocks, or confirming that they properly do not include them.
In another aspect, performance for a given group (such as a company or department) may be improved by incorporating information such as block semantic roles, occurrence patterns, and other characteristics of their documents, as well as their user feedback, into the learning process of the system, and using the resulting improved model to enhance and/or examine future documents. However, many customers do not wish to share such information with other customers, and many customers have restrictive privacy requirements. On the other hand, general information and learning derived from public, non-confidential sources can be freely used and shared.
Some implementations may provide the benefits of feedback and learning while maintaining each customer's data and any model information derived therefrom, both separate and private to each customer, while still sharing general learning based on non-confidential public data. Keeping those data processes separate ensures that information is not "leaked" from one client to another, even statistically.
Introduction to example implementations
The following is a description of an example system; see FIG. 1. The system relates generally to methods and apparatus for AI-driven unsupervised creation of hierarchically semantically tagged documents and/or for assisting in the authoring and processing of such documents. This includes processes such as authoring, structuring, annotating, modifying, reviewing, extracting data from documents, and/or using such data in downstream business processes. More specifically, it focuses on documents that are similar to previous documents, discovering a detailed hierarchical document structure, made up of many semantically meaningful blocks associated with their roles, by using mainly unsupervised and self-supervised machine learning techniques across a document set (including relatively small sets); and it focuses on using such highly enhanced documents in business processes.
The operation of this example system uses the following processes, which are described in more detail in the following sections. This is merely an example. Other implementations may use different combinations of steps, including omitting steps, adding other steps, and changing the order of some steps. They may also use different implementations of the steps listed below, including different combinations of the techniques described below for each step. In FIG. 1, the step is preceded by an "S", so step 1 below is labeled "S01", and so on.
1) Ingestion: the user's document set is brought into the data store 110.
2) Organization: documents are classified by type into document sets, such as rental versus sales agreements, or medical records versus current clinical records.
3) Visual extraction: linear text stream(s) are extracted from each document based at least on its content and visual layout, including limited information about the distinct text and other regions, their start and end positions, formats, and content. The extracted data may be organized as "visual lines" or "visual boxes" (also referred to as "superlines" or "visual" blocks), such as paragraphs distinguished by geometric layout.
4) Structuring: titles, list items, and other broad categories of structural blocks in the document are identified.
5) Re-nesting: the nesting relationships of sections and lists, and the text range of each section and list, are determined.
6) Topic chunking: the topical content of each document is analyzed, and blocks for regions of similar topic are generated (topic-level blocks).
7) Topic labeling:
i) Candidate data types and semantic role labels are generated for each title in the corpus using embeddings and clustering.
ii) Candidate data types and semantic role labels are generated for blocks using key-phrase extraction techniques.
8) Block labeling: data type and semantic role candidates are identified and assigned to other blocks throughout the document using a variety of methods, such as neural networks, word and character embeddings, parsing and pattern matching, regular expressions, similarity measures, and/or other methods. Of particular interest in certain embodiments are:
i) parsing and pattern matching over the structures obtained.
ii) using question-answering techniques to associate blocks with the particular semantic roles they play in the document.
iii) combining XPath tree matching with word-embedding techniques to match patterns in structural and syntactic trees, even where wording and vocabulary vary widely.
9) Named entity recognition (NER): data types are identified and assigned to blocks detected as named entities throughout the document.
10) Role labeling and extracted-type labeling: semantic role labels are assigned to blocks, such as a name representing the "seller" party to a contract, or a drug mentioned as an allergy rather than as a prescription.
11) Anomalies: identifying semantic roles that normally exist in documents of the document set under consideration but are absent from the current document (and vice versa).
12) Arbitration: adjudicating and/or selecting among alternative scopes for blocks, data types, and semantic role labels, resulting in a well-formed structure that is easy to express in a format such as XML.
13) DGML: an enhanced version of the document is created that contains explicit identification of block locations, data types, and semantic role labels, and may also contain additional information, such as the confidence level for each identified block, the data types expected in similar blocks (such as dates, date ranges, personal names, etc.), and so on. The enhanced version is created using an XML-based markup language called DGML.
14) Feedback: the enhanced version is displayed to the user(s), showing selected blocks (and the potential locations of possibly overlooked segments), and the user's choices to confirm, reject, or make other changes are gathered. Users are free to choose their own reading and reviewing order. Feedback may also be applied to any other determination made by the system, such as the organization of documents into document sets described in step (2).
i) For possibly overlooked blocks, precedent examples from other documents are provided, which may be reviewed and/or copied into the current document as desired, and automatically customized by applying the target document's values for smaller nested blocks.
15) Feedback response: the user's responses to these interactions are tracked, and this information is used to fine-tune the model 120 and prevent later repetition of the same or similar errors.
16) Downstream export: blocks are selected by type and/or role and used to generate reports over the document set, and/or exported to downstream systems that add functionality (such as a back-end contract database, a regulatory compliance checker, a management report generator, etc.).
FIG. 2 is a screenshot showing a dashboard that tracks the processing of different document sets (One through Seven) through the process described above. In the dashboard, the process is divided into the following phases:
Uploaded
Preprocessing
Review chunks
Review blocks
Ready for use
Color coding shows the degree of completion: green phases are complete, red phases are in progress, and black phases have not yet started.
Each of the above listed steps is described in more detail below.
Further description of example implementations
The numbering here reflects the general order of analysis for this particular example. However, not every step depends on every previous step, and thus, many elements may be reordered or parallelized in other implementations. Elements may also be shifted or even repeated in order to exchange additional information with other elements, or elements may operate independently, such as in a separate process or machine.
1) Ingestion
The system accepts typical word-processor documents (such as MS Word) and typeset documents (such as PDF or .png files). In each case, visually contiguous regions, such as titles, paragraphs, table cells, tables, images, and the like, are identified and represented as blocks, using a combination of their relative positions, surrounding whitespace, fonts, typographic properties, and so on. These features are selected partly by the designer and partly learned through image and pattern analysis of a large number of documents. OCR is applied to incoming documents that do not yet have machine-readable textual content.
Those blocks are submitted to subsequent modules in the system along with the selected layout information.
2) Organization
Users need not organize the documents they bring into the system. The system uses a clustering method operating on textual content, layout information, and structural information already detected (such as the identification of some headings) to group documents into "sets" of a particular type, such as rental agreements versus sales agreements. The particular document sets found may be reviewed with the user and named automatically or by the user. Once established, these document sets facilitate later machine learning and reasoning about format, content, semantic roles, and differences therein. For example, the system may find that almost all documents in a given set have a particular section with three particular sub-blocks having specific roles and the person-name data type, one of which repeats in five different sections. Such patterns are used to help identify similar (and dissimilar) portions of other documents, suggest reviews or changes to the user, and provide example text for reuse in other documents in the same (or possibly a different) set.
Clustering documents into document sets may use features from document structure (order and containment relationships among blocks of different sizes, data types, and roles) and layout, as well as textual content. Once some blocks and/or roles have been identified in at least some documents, this information can also be used to improve the clustering, either by full re-clustering or by minor adjustments. For example, if the content of blocks with the same role is ignored, such as the names and addresses of sellers and buyers, similar documents may become nearly or even exactly identical; or the system can check whether the patterns of occurrence of different blocks match, e.g., one name (say, the seller's) appearing in certain locations and another (say, the buyer's) appearing in certain other locations.
The system retains both the original organization of uploaded files into directories (if any) and its own organization of their document collections. Thus, the user can view both organizations, and the learning algorithm can use both organizations as information. For example, some users name documents according to various conventions, and/or organize documents by client, document category, or other characteristic, which is almost always useful for understanding the relationship between patterns of similarity (such as having common block locations and roles) and documents.
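A minimal sketch of the clustering idea, using text features alone; the system described above also folds in layout and structural features and refines clusters as blocks and roles are learned, and the library and parameter choices here are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

def cluster_into_document_sets(texts, distance_threshold=1.2):
    # TF-IDF vectors stand in for the richer text+layout+structure features
    vectors = TfidfVectorizer(max_features=5000).fit_transform(texts)
    clusterer = AgglomerativeClustering(n_clusters=None,
                                        distance_threshold=distance_threshold)
    return clusterer.fit_predict(vectors.toarray())  # one set ID per document
```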
3) Visual extraction
i) Region finding
The system uses heuristics and machine learning to identify regions in the document based on geometric patterns. For example, in many documents, meaningful blocks have special layouts, such as signature boxes, digests, definition lists, tables, and the like. Such patterns may be automatically learned by considering geometric and/or typographical features, uniqueness or rarity, and/or correspondence within or across documents, particularly within the same set of documents.
The choice of method depends on the format of the incoming document. For example, word-processor documents typically provide explicit information about paragraph boundaries, but PDFs or scanned pages require the system to assemble paragraphs from visual lines, or even to analyze whitespace widths to assign characters to visual lines (as in multi-column documents).
ii) Signature finding
The system creates signatures (also called "digests") for document parts and uses these signatures to identify and classify additional "interesting" blocks and find their boundaries. Signatures are based not only on textual content but also on various aspects of context, and may ignore the content of smaller contained blocks (e.g., field blocks whose content varies among counterpart blocks).
A signature may even use a pixel representation of the block. The typeset bitmap image is divided into tiles, preferably about 24 pixels square (adjusted for scan resolution), and the tiles are clustered. Processing these tiles, including their neighbor relations, with autoencoders and neural networks reveals similar visual events, such as boundaries between text and rules, edges and corners of text blocks, and even indentation changes and substantial font/style changes. A further neural network then uses the clusters to collectively identify similar typeset objects that frequently indicate or characterize important blocks.
The methods herein may use unsupervised methods to generate document-chunk embeddings based on the pixels and characters in a chunk, its size, its location in the document, and so on. (As previously mentioned, an image may also be a block.) Clustering and comparison techniques can then be applied to these embeddings for many downstream tasks.
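The tile-embedding step might look like the following sketch, assuming PyTorch and 24x24-pixel tiles; the architecture shown is an illustrative assumption, not the networks described above:

```python
import torch
import torch.nn as nn

class TileAutoencoder(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(24 * 24, 128),
                                 nn.ReLU(), nn.Linear(128, dim))
        self.dec = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 24 * 24))

    def forward(self, x):
        z = self.enc(x)                                # tile embedding
        return self.dec(z).view(-1, 1, 24, 24), z

model = TileAutoencoder()
tiles = torch.rand(64, 1, 24, 24)                      # a batch of page tiles
recon, embeddings = model(tiles)
loss = nn.functional.mse_loss(recon, tiles)            # unsupervised objective
# `embeddings` would then be clustered to find recurring visual events.
```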
iii) Extraction
This aspect takes a typeset document (e.g., a PDF or scanned printed page) and transforms the identified character images ("glyphs") into a stream of text representing the correct document order of the glyphs. (The stream may also contain drawing or image objects as appropriate, and there may be multiple streams, such as footnotes or headers, that have no typical position in reading order.) In some documents, the reading order is not entirely explicit. One well-known example is that at any given point there is typically no indication that multi-column typesetting is in effect, so the first "line" extends only halfway (or less) across the page rather than all the way. There are many other cases in which text order may be complex or non-obvious. For example, some typesetting programs draw each character separately, making text boundaries less evident. Table elements, sidebars, figures, footnotes, and other display content may have no distinct place in the text order. Some text, such as that in headers and footers (and end-of-line hyphens), may need no position in the text order at all. Many formats provide no explicit indication that something belongs to such a special category.
The system solves this task by combining visual information about glyphs (location, style, etc.) with deep neural networks that understand the characteristics of the written language used in the document, to create a text stream. In addition, it detects many basic text boundaries, such as lines, blocks, columns, images, inline font changes, and header/footer objects.
iv) Representation
After extracting the text sequence and some hypothesized structural blocks, the system creates a representation of the document (called "DGML" in one example) that includes those and information about the visual characteristics (font, color, size, etc.). The representation of the blocks, including information such as their location, type and role, is called "annotations". Natural Language Processing (NLP) and Deep Neural Networks (DNN) may then use this combined data. The deep neural network incorporates this visual information to assist in structuring the document into a hierarchy to represent the document structure, including blocks such as title/body, list/list items, and the like.
Enough information may be included so that later aspects may build an editable word processor document that closely resembles the original source. This may be included in DGML or similar representations along with other structure, content and block information. In many cases, portions of a document having distinct formatting and layout are also useful blocks. However, formatting characteristics that are inconsistent with otherwise required blocks (and vice versa) may still be represented via special block types, via isolated annotations, or via other methods.
4) Structuring
The structuring pipeline converts flat text files into a hierarchy in which sections, subsections, and other portions of the document form an ordered hierarchy of content-based objects, a structure well known to those skilled in the art. The conversion is accomplished using unsupervised machine learning techniques and proceeds in several stages:
i) Hyperlining
This involves segmenting the text into "superlines," which are larger groupings than visual lines and correspond to more meaningful logical units (as opposed to merely visual ones), such as paragraphs, titles, and so forth. This is preferably done using a pre-trained neural network that considers features such as the "word shape" of tokens (especially leading and trailing tokens), typographic information such as font and spacing characteristics, and similar features. Some superlines may also have been provided by earlier steps (depending on the format of the input document).
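"Word shape" here is the usual NLP notion; a minimal sketch of one such feature follows (the actual feature set described above is much richer):

```python
import re

def word_shape(token: str) -> str:
    # Map capitals to X, lowercase to x, digits to d, keep punctuation
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1{2,}", r"\1\1", shape)  # squeeze runs of 3+ chars

print(word_shape("Section"), word_shape("3.1(a)"))   # -> Xxx d.d(x)
```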
ii) Document language model
This preferably uses a document language model that includes information about textual content, formatting, and any structure found so far, rather than a purely text-based language model. This enables better detection of blocks and their hierarchy (such as title/body, list/list items, etc.), because meaningful blocks and their patterns of occurrence are learned from formatted pages.
This creates a representation of the document that includes both textual content and visual characteristics (geometry, font, color, size, etc.). Deep neural networks and NLP processes then utilize such information in the task of structuring a document into a hierarchy of blocks with data types and semantic role labels by finding ranges and/or boundaries of various sized blocks representing the document structure. At this stage, the blocks found are mainly titles, sections, lists and items, tables, graphs and other relatively large units.
iii) Superline clustering
This uses an autoencoder to cluster superlines across a document set based on word-shape structure, assigning each superline to a cluster of superlines with similar layout, beginning and ending content, and other characteristics, where each cluster is identified by a "cluster ID" (not to be confused with the creation or identification of document sets).
iv) Inline headers
A special case of particular interest is the "inline header": a header for a block (which sometimes provides the block's semantic role) that does not sit on its own separate visual line(s), but on the same line as the start of the subsequent text. Typically, inline headers differ in formatting, such as bold, underlining, a different font, a trailing colon, or other effects. Separate heuristic and neural algorithms identify these blocks.
v) Few-shot structure learning
Despite the advanced structuring methods described above, the resulting structure is expected to have some imperfections, or may not match the user's prior expectations. As described in steps (14)-(15), few-shot structure learning is responsible for creating a machine learning model that relies on user-provided feedback. The model is then used to generate a structure that combines the user's feedback with the structure already produced by the system (perhaps iteratively enhanced by previous feedback).
The main principle applied here derives from machine translation (MT), in which one sequence is converted into another. In this case, a sequence describing the superlines is converted into another sequence that also contains start/end markers encoding the hierarchy (an illustrative encoding appears after the list below).
The process proceeds in stages:
(a) First, the machine translation model is pre-trained using a publicly available data set.
(b) A "scheduler" (see the "Feedback response" section) filters user feedback.
(c) A new structure file is generated from the user feedback, and a fine-tuning machine-translation data set is produced.
(d) The pre-trained model is further trained using few-shot learning principles.
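The source/target encoding might look like the following sketch; the marker tokens and cluster IDs are illustrative assumptions, not the actual vocabulary used:

```python
# Flat sequence of superline cluster IDs (H1/H2 = heading clusters, P = body)
source = ["H1", "P", "H2", "P", "P", "H2", "P"]
# "Translated" sequence: the same IDs plus start/end markers encoding nesting
target = ["<sec>", "H1", "P",
          "<sec>", "H2", "P", "P", "</sec>",
          "<sec>", "H2", "P", "</sec>",
          "</sec>"]
# A seq2seq model pre-trained on public data and fine-tuned on a few
# feedback-derived pairs (few-shot) learns this mapping.
```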
5) Re-nesting
This aspect uses a "corpus re-nesting" algorithm that iteratively creates a nested structure using a pushdown automaton, given the flattened list of cluster IDs preferably produced by the superline clustering step. By comparing the signatures of adjacent superlines, the system can determine whether a given title or list item belongs at a deeper, equal, or shallower nesting level. This allows reconstruction of the multiple nested hierarchies of many documents (such as chapters, sections, subsections, clauses, lists, etc.); a minimal sketch follows the feature list below.
Features considered in re-nesting include: the "shape" of the tokens in the superline (in the NLP sense), especially the first and last; the punctuation (if any) of a particular category ending the previous line; capitalization; formatting information such as preceding whitespace, indentation, bolding, and underlining; the presence and form of an enumeration string at the beginning of the line (e.g., a pattern like "iv (a)(1)" or "iv)"), or a particular bullet or other typographic character; the value of the enumerator; the presence, level, and value of previous enumerators of the same kind; and so on.
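A minimal sketch of the pushdown re-nesting idea, assuming each line's nesting level has already been inferred from its cluster signature (the level inputs and dict-based tree are illustrative):

```python
def renest(lines):
    """lines: [(level, title), ...] in document order; returns a section tree."""
    root = {"title": None, "children": []}
    stack = [(0, root)]                       # (level, node) pairs; root = level 0
    for level, title in lines:
        node = {"title": title, "children": []}
        while stack[-1][0] >= level:          # close equal or deeper sections
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root

tree = renest([(1, "1. Terms"), (2, "1.1 Rent"), (2, "1.2 Pets"),
               (1, "2. Signatures")])
```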
6) Topic chunking
This aspect applies lexical statistics and other learning techniques to successive document chunks to find the locations of topic shifts. This improves the identification of large-block boundaries, such as for an entire section on a given topic, because a section (at whatever level) typically has more uniform topic, vocabulary, and style within itself than across its boundaries with adjacent sections.
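A rough TextTiling-style sketch of this idea (an assumed method; the text above specifies only "lexical statistics"): boundaries fall where the vocabulary overlap between adjacent chunks dips below a threshold.

```python
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_boundaries(chunks, threshold=0.15):
    # Bag-of-words per chunk; a boundary is proposed at each similarity dip
    bags = [Counter(c.lower().split()) for c in chunks]
    return [i + 1 for i in range(len(bags) - 1)
            if cosine(bags[i], bags[i + 1]) < threshold]
```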
7) Topic labeling
i) Title labeling
This step, shown in FIG. 1, proceeds as follows for the titles in the corpus (a condensed sketch appears after the list):
• Create a numeric representation, called an "embedding," for each title.
• Cluster the titles based at least on those embeddings.
• Filter out "bad" clusters based at least on criteria such as density, number of elements, and similarity level.
• Propagate the most common semantic role label in each remaining cluster to all titles in the cluster.
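A condensed sketch of that loop; `title_vectors` is assumed to come from some sentence-embedding model and `seed_labels` from already-labeled titles, and the clusterer choice is illustrative (noise points stand in for "bad" clusters):

```python
from collections import Counter
import hdbscan

def propagate_title_labels(title_vectors, seed_labels):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=3).fit(title_vectors)
    labels = dict(seed_labels)                 # title index -> role label
    for cid in set(clusterer.labels_):
        if cid == -1:                          # noise: treated as a "bad" cluster
            continue
        members = [i for i, c in enumerate(clusterer.labels_) if c == cid]
        known = [labels[i] for i in members if i in labels]
        if known:                              # propagate most common label
            top = Counter(known).most_common(1)[0][0]
            for i in members:
                labels.setdefault(i, top)
    return labels
```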
ii) Key-phrase labeling
For each block, this step uses a set of key-phrase extraction techniques (such as rule-based linguistic techniques, ML, statistical, Bayesian, etc.) to generate candidate semantic role labels from the text.
8) Block labeling
i) Grammar
This aspect of the system begins with linguistic analysis of the text: natural language processing tasks including part-of-speech tagging, dependency parsing, constituency parsing, and the like. The system then applies tree-matching mechanisms from another domain to locate grammatical and other structures within the trees or tree-like structures found via NLP. These include document-structuring methods such as tree grammars and tree pattern matching, as exemplified by tools such as XPath and GATE.
Using such patterns to identify grammatical phenomena in a sentence enables the system to extract semantic role labels from the text itself, which are then used to annotate nearby blocks. For example, a search pattern may be constructed that, based on the constituent structure of the sentence, matches "The terms of our agreement are the following:" (as well as other sentences with similar grammatical structures); the system then extracts the noun phrase (in this example, "terms") and uses it as a semantic role label for one or more blocks in the content that follows the sentence and contains such "terms".
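The following is a hedged sketch of this idea using spaCy's dependency parse (our substitution for illustration; the text describes XPath- and GATE-style tree matching rather than any specific library). It looks for copular sentences ending in a colon and proposes the subject noun as a candidate label; actual parses may vary by model.

```python
# Sketch: extract candidate semantic role labels from sentences shaped like
# "The <NOUN> of ... are the following:".
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_labels(text):
    labels = []
    for sent in nlp(text).sents:
        root = sent.root
        if root.lemma_ != "be":                    # copular sentences only
            continue
        subjects = [t for t in sent
                    if t.dep_ == "nsubj" and t.head == root]
        if subjects and sent.text.rstrip().endswith(":"):
            labels.append(subjects[0].lemma_)      # e.g., "term"
    return labels

print(candidate_labels("The terms of our agreement are the following:"))
```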
ii)Question-answer
Question answering techniques, including BERT for question answering, are specifically tailored to identify semantic role labels for candidate blocks (e.g., date, name, dollar amount). By contrast, most conventional question answering models aim to answer questions such as "what is the effective date?". The system instead trains the model to answer questions such as "what is July 8, 2018?", with the intent of predicting "effective date" or "effective date for X", where X represents another block in the text (not just "date", which is a data type and not a semantic role).
The system also discovers synthetic questions that, when answered, may point to relevant information in the text. This provides the ability to automatically generate the questions used by question answering.
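A minimal sketch of the "inverted" question formulation follows; the record layout is an illustrative assumption, and training an actual QA model on such pairs is out of scope here.

```python
# Sketch: build inverted QA training pairs that ask which semantic role a
# known value plays, rather than asking for the value of a known role.

def inverted_qa_pairs(blocks, context):
    """blocks: [(value, semantic_role)]; context: the surrounding text."""
    return [{
        "question": f"What is {value}?",   # e.g., "What is July 8, 2018?"
        "context": context,
        "answer": role,                    # e.g., "effective date"
    } for value, role in blocks]

pairs = inverted_qa_pairs(
    [("July 8, 2018", "effective date"), ("$999", "monthly rent")],
    "This lease, effective July 8, 2018, requires rent of $999 per month.")
```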
iii)XPath-like rules integrated with embedding
Here, tools of the kind discussed under "Grammar" are integrated with tools that provide vector-semantic representations of text, such as word2vec, char2vec, and many related methods. The system enables analysts to express and query patterns that include both: structural information (which may include block data expressed in XML- or DOM-compatible form), which XPath and similar tools handle well; and fuzzy or "semantic" similarity information, which vector models handle well.
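One way such a hybrid query could look is sketched below, assuming blocks are serialized as XML and an `embed(text) -> vector` helper exists (both assumptions for illustration): XPath supplies the structural filter, and embedding similarity re-ranks the survivors.

```python
# Sketch: structural filtering via XPath, fuzzy re-ranking via embeddings.
import numpy as np
from lxml import etree

def semantic_xpath(tree, xpath, query, embed, top_k=5):
    nodes = tree.xpath(xpath)              # e.g., "//section//block"
    q = embed(query)
    scored = []
    for n in nodes:
        v = embed("".join(n.itertext()))   # text content of the node
        sim = float(np.dot(q, v) /
                    (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        scored.append((sim, n))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [n for _, n in scored[:top_k]]
```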
9)NER (unlabeled blocks)
Existing techniques can identify some blocks by data type, such as personal or company names, addresses, etc. (this is referred to as "named entity recognition" or "NER"). However, NER falls far short of identifying the semantic roles those entities play in a document. Current techniques also fail to identify larger blocks, such as entire clauses or segments, or groups of blocks that together form a meaningful or useful larger block.
This aspect of the system detects blocks of interest, but does not necessarily also assign roles to them. There are many methods and tools for identifying named entities in text. The system uses a variety of methods, examples of which are listed below. These innovations are largely unsupervised:
i) Established NER methods
ii) Expected text
The "expected words in a normal English context" model is built by training an n-gram language model on extensive general text (such as Wikipedia). When examining a particular document, the system provides a means to identify n-grams that do not conform to the general model and therefore tend to be specific to the document being processed.
iv)TF-IDF
This is a TF-IDF-based approach ("term frequency-inverse document frequency"), used in conjunction with label propagation and contextual semantic labeling.
v)Sequence clustering
Small text or character sequences, such as n-grams, are extracted and clustered using contextual embeddings (e.g., embeddings from BERT). The expected result is that n-grams sharing semantic meaning will cluster together. The cost of combinatorial explosion is addressed by using heuristics (including heuristics over syntax trees) to filter out some of the n-grams before clustering. Various clustering algorithms may be applied; in this example, the hdbscan algorithm achieves efficient clustering while assigning random noise to a "none" cluster.
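A minimal sketch follows, assuming per-n-gram contextual embeddings (e.g., pooled BERT vectors) have already been computed; `min_cluster_size` is an illustrative parameter. hdbscan labels noise points -1, which maps to the "none" cluster mentioned above.

```python
# Sketch: cluster contextual embeddings of n-grams with HDBSCAN.
import hdbscan
import numpy as np

def cluster_ngrams(ngram_texts, ngram_vectors, min_cluster_size=5):
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)\
                    .fit_predict(np.asarray(ngram_vectors))
    clusters = {}
    for text, lab in zip(ngram_texts, labels):
        if lab == -1:
            continue                       # random noise -> "none" cluster
        clusters.setdefault(lab, []).append(text)
    return clusters
```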
vi)Few-shot NER
The system uses few-shot learning techniques to generalize from a small number of labeled instances (e.g., selective user feedback) to more widely applicable rules or adjustments of learned parameters. This greatly reduces the amount of user feedback that must be requested, and improves the performance of the system more quickly.
10)Extractive labeling
This aspect of the system finds semantic role labels for blocks that appear directly in the sentence(s) surrounding the block. Meaningful blocks often have their roles stated in some form by the context. For example:
Jane Doe ("Seller"), residing at....
A rent of $999 must be paid before the end of each month.
i)Contextual semantic labeling (CSL)
This process uses a neural network operating on previously established structures, including sentence parses, to learn which parts of the text may serve as semantic role labels for various blocks. Many blocks may already have labels with differing sources and confidence levels, but this step provides additional supporting or opposing evidence for those labels, as well as new labels. Some of the patterns here involve syntax. For example, in "Jane Doe shall pay rent of $1000 before the last business day of each month", the head verb reveals the role of the monetary amount: it is the rent to be paid. Other patterns are learned automatically by supervised and/or unsupervised methods using the features of structure, segmentation, labeling, and content available in the context. Formatting (such as brackets and table layout), key phrases and words, and other properties also provide features for the neural network.
Useful information often resides in containing blocks, such as segments or sub-segments, or in their headings. For example, whether a given drug is relevant as a prescription or as an allergen may only be detectable by looking at the heading of the containing segment (this is another example of why detecting the correct hierarchical nesting of segments is important). Many other clues exist that can be learned through machine learning techniques and applied to discover the applicable roles of various blocks. Similarity across documents may also be used, particularly for documents in the same document set, to associate semantic roles that are discoverable for similar contexts but may not be discoverable from an isolated document.
ii)Label propagation
This process normalizes the labels across similar text blocks in a document corpus. It applies both to labels extracted from context and to labels available from previous steps. The algorithm clusters the blocks based on their embeddings using agglomerative clustering, ranks the candidate labels for each cluster of blocks using a weighted PageRank algorithm (with the frequency/confidence of the labels as initial node weights), and determines how similar the labels are to each other using co-occurrence and embedding similarity. It then assigns labels to blocks based on their cluster-level scores and on how similar (in content, embedding, structure, data type, semantic role, and/or context) the block being labeled is to the blocks from which the label was derived. The agglomerative clustering and PageRank algorithms are applied to propagate labels across similar contexts and to make labels more consistent across a set of documents.
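A simplified sketch of this step follows. The block embeddings, per-block candidate-label confidences, and the label-similarity function are assumed inputs from earlier steps, and the clustering threshold is illustrative; a recent scikit-learn and networkx are assumed.

```python
# Sketch: agglomerative clustering over block embeddings, then weighted
# PageRank over a label-similarity graph inside each cluster.
import networkx as nx
from sklearn.cluster import AgglomerativeClustering

def propagate_labels(block_vecs, block_labels, label_similarity,
                     distance_threshold=0.5):
    """block_labels: one {label: confidence} dict per block."""
    cids = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold,
        metric="cosine", linkage="average").fit_predict(block_vecs)
    assigned = {}
    for cid in set(cids):
        members = [i for i, c in enumerate(cids) if c == cid]
        weights = {}                       # label frequency/confidence
        for i in members:
            for lab, conf in block_labels[i].items():
                weights[lab] = weights.get(lab, 0.0) + conf
        if not weights:
            continue
        G = nx.Graph()
        G.add_nodes_from(weights)
        labs = list(weights)
        for x, a in enumerate(labs):       # edges weighted by similarity
            for b in labs[x + 1:]:
                G.add_edge(a, b, weight=label_similarity(a, b))
        total = sum(weights.values())
        ranks = nx.pagerank(
            G, personalization={l: w / total for l, w in weights.items()})
        best = max(ranks, key=ranks.get)   # cluster-level winner
        for i in members:
            assigned[i] = best
    return assigned
```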
11)Anomalies
This aspect of the system examines multiple documents within a document set, such as those produced in step (2), and identifies blocks that appear in the current document but do not typically have counterpart blocks in other documents of the same document set, or vice versa. The counterpart blocks need not have exactly the same content, structure, formatting, context, data type, and semantic role, which may vary from one document to another; rather, they are identified as being substantially similar to other identified blocks in those respects.
When a new document includes blocks that are not normally present in other documents of the same document set, the user may be queried for some or all of them to confirm that they are actually intended. In this example system, such a query would be more general when the problematic block is common in the new document and the document on which it is based (if any), but rare in other documents.
When the new document lacks counterpart blocks that typically exist in other documents of the same document set, or even in certain related external sources (e.g., house style manuals, compliance requirements, etc.), examples of some or all such blocks, with content taken from other documents, are suggested to the user. Such suggestions may be ranked for the user by factors such as frequency of use, being most typical (the centroid) of the available alternatives, or having a high probability of co-occurring with other blocks present in the new document. The suggested blocks may be automatically updated, for example, by replacing the names, dates, and other sub-blocks specific to the document from which the example was taken with values taken from the new document.
Further, the selection of blocks suggested for addition or deletion may usefully depend on the practices of different authors, editors, or other workers. For example, if the current author's documents frequently differ from another author's documents in a particular way, this may indicate that the difference is a considered choice rather than an error. On the other hand, if all authors working under the same supervisor do something one way, but the current author differs, this may indicate the need for closer review, at least when first noticed.
Modeling of anomalies takes into account structure, block data types, and semantic roles, as well as context, content, and formatting. For example, patterns of which data types and semantic roles of blocks appear inside, adjacent to, or otherwise near other blocks are modeled. Violations of an established pattern may be classified as anomalies and presented to the user like any other anomaly.
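A toy version of the frequency side of this analysis is sketched below; the 0.8/0.1 cutoffs are illustrative assumptions, and the real system also weighs structure, context, and authorship as described above.

```python
# Sketch: flag roles common in the document set but missing from the current
# document, and roles present in the document but rare in the set.

def role_anomalies(doc_roles, docset_roles, common=0.8, rare=0.1):
    """doc_roles: set of semantic roles in the current document.
    docset_roles: list of role sets, one per document in the set."""
    n = len(docset_roles)
    freq = {}
    for roles in docset_roles:
        for r in roles:
            freq[r] = freq.get(r, 0) + 1 / n
    missing = [r for r, f in freq.items()
               if f >= common and r not in doc_roles]
    extra = [r for r in doc_roles if freq.get(r, 0.0) <= rare]
    return missing, extra
```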
12)Arbitration
Many of the previous steps create and/or manipulate blocks of the document, which are defined as (typically but not necessarily contiguous) extents of characters, tokens, and/or non-textual objects within the linear sequence(s) generated in step (3).
The blocks under consideration at any point may be represented by "in-line" meta-information, such as markers, or by a "stand-alone" representation that refers to locations in the text via various pointers. In this example, the stand-alone representation is used for most processing, but the in-line representation is used for some purposes, such as communication with external tools, which often prefer it. These and other representations are functionally interchangeable, and the choice between them may be determined by performance, convenience, and other considerations.
The representation of blocks includes information about which steps or implementations created them, how certain they are (their "confidence level"), and their specific data types and/or semantic role labels. Redundant, uncertain, conflicting, or partially overlapping blocks may occur frequently, and are referred to herein as "non-optimal". For example, two or more different processes may attach semantic role labels to the same span of text (or nearly the same span, e.g., one including "Dr." before a name and one not). Blocks may be nested, sometimes deeply, but may also overlap arbitrarily (that is, each of the overlapping blocks contains some content that is also in the other, and some content that is not). Throughout the above steps, the system may maintain a representation capable of expressing a large number of annotations, including overlapping or co-located annotations.
Such non-optimal blocks are generally undesirable, at least when the document is presented to the user. In addition, many prior NLP tools prefer non-overlapping structures, as do many document tools and methods familiar to those skilled in the art, such as XML, JSON, SQL, and other representation systems. The more constrained structures that are generally preferred are often referred to as "hierarchical" or "well-structured", and avoid partially overlapping blocks.
This aspect of the system modifies the set of blocks to be strictly hierarchical, avoiding non-optimal blocks. This can be achieved in several ways. First, a block may be deleted entirely (the block itself, that is; the document content it identifies is not deleted). Second, a block's range may be modified (e.g., by including or excluding one or more characters or tokens at either end) to prevent overlap with other block(s). Third, blocks may be determined to be redundant and merged. Fourth, blocks may be found to be contradictory (e.g., if one tool considers "Essex" a region and another a person), and a selection is made.
The process includes means for: quickly finding partial and/or complete overlaps; comparing blocks by type, role, and confidence; and resolving non-optimal situations by modifying the blocks and their associated data. Selecting which blocks to modify, merge, or delete takes into account a number of factors, such as: confidence levels; prior probabilities given the blocks' data types, semantic roles, and content; hypernym/hyponym relationships among semantic role labels; conditional probabilities of occurrence in a given context; the number, roles, and distribution of other blocks in the current and other similar documents; the priority of the current process; customer feedback on similar situations; and/or other methods.
The modifications may also change block confidence levels. For example, aspects of the system may apply similar or identical semantic role labels to the same or nearly the same portions of the document. In this case, the labels are typically merged and the resulting block is assigned a higher confidence than the individual blocks it subsumes. In other cases, a selection is made between contradictory block assignments, but the selected block may end up with a reduced confidence level to reflect that there was some level of conflicting evidence.
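One arbitration pass of the kind described can be sketched as follows; the merge rule and confidence arithmetic are illustrative assumptions, and the real selection weighs the many factors listed above.

```python
# Sketch: merge co-located duplicate blocks (boosting confidence), then drop
# lower-confidence partial overlaps; nesting remains allowed.

def arbitrate(blocks):
    """blocks: dicts with 'start', 'end', 'label', 'confidence'."""
    merged = {}
    for b in blocks:                       # merge co-located duplicates
        key = (b["start"], b["end"], b["label"])
        if key in merged:
            merged[key]["confidence"] = min(
                0.99, merged[key]["confidence"] + 0.5 * b["confidence"])
        else:
            merged[key] = dict(b)
    result = []
    for b in sorted(merged.values(), key=lambda x: -x["confidence"]):
        partial = any(
            b["start"] < k["end"] and k["start"] < b["end"]        # intersect
            and not (k["start"] <= b["start"] and b["end"] <= k["end"])
            and not (b["start"] <= k["start"] and k["end"] <= b["end"])
            for k in result)               # neither contains the other
        if not partial:
            result.append(b)
    return result
```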
This process improves the quality and consistency of block identification and labeling, enables the information to interoperate with a wide variety of tools, and enables the results to be analyzed more easily and reliably. The operations just described may be applied at any time(s), not just last. For example, if a previous step uses an external tool for a certain subtask, it might request a reduction to well-structuredness. The removed or modified blocks may instead be "suspended", meaning that they no longer affect processing but may be reintroduced as required; this enables the use of tools that do not support overlap without having to recreate the previous work from scratch later, and increases processing flexibility and speed.
In one approach, all overlapping and/or all non-optimal blocks are resolved before generating the document that is shown to the user, so that the results can be easily encoded into a hierarchical format, such as the XML format used by many modern word processing programs and other tools. However, it is also possible (even in XML) to maintain multiple, possibly overlapping, alternatives at a particular location for potential later resolution, such as through user feedback or improved algorithm learning.
13)DGML (DocuGami markup language)
This enhanced version of the document represents the document's structure, formatting, content, and identified blocks, and may identify which steps of the process identified which blocks, and with what level of confidence. Some embodiments use XML as the syntax for this representation, although a wide variety of representations may contain substantially the same information, such as other XML schemas, JSON, various databases, custom text or binary formats, and so forth.
In this step, the document and the information about the blocks found in it are converted (or "serialized") into an XML form that can easily be passed on to other processes, most notably a front-end user interface for feedback, editing, and review; and into formats usable by "dashboard" applications that provide summary, statistics, and compliance information to other users (such as group managers, quality control personnel, etc.).
DGML, the Docugami markup language, is a special XML schema for this purpose that holds all of the described information in one package. Most previous schemas can handle structure, content, and sometimes layout, but do not annotate "blocks" in the abstract sense described herein. Many previous schemas also do not provide a general mechanism by which blocks can be automatically detected and represented on the fly, particularly with confidence levels and provenance information.
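The actual DGML schema is not reproduced in this text; the following is only a sketch of how blocks, confidences, and provenance could be serialized to XML, with invented element and attribute names.

```python
# Sketch: serialize stand-alone block annotations (pointer-based) into a
# hypothetical DGML-like XML form; all names here are illustrative.
import xml.etree.ElementTree as ET

def to_dgml(doc_text, blocks):
    root = ET.Element("document")
    body = ET.SubElement(root, "body")
    body.text = doc_text
    anns = ET.SubElement(root, "blocks")
    for b in blocks:                  # stand-alone representation: pointers
        ET.SubElement(anns, "block", {
            "start": str(b["start"]), "end": str(b["end"]),
            "role": b["label"],
            "confidence": f"{b['confidence']:.2f}",
            "source": b["source"],    # provenance: which step found the block
        })
    return ET.tostring(root, encoding="unicode")

print(to_dgml('Jane Doe ("Seller") ...',
              [{"start": 0, "end": 8, "label": "seller name",
                "confidence": 0.92, "source": "CSL"}]))
```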
For the file formats of some word processing programs and other tools, it is also possible to "tunnel" the information by representing it in a form that is transparent to the format. For example, if the tool supports embedded annotations or metadata, "invisible" text, ignorable properties, or other similar features, the information described herein may be hidden within them, allowing the resulting document to be used and possibly modified in the tool; the tunneled information is still available when the document is returned to the system.
14)Feedback mode front end
The extensive annotation and analysis attached to the document and its found blocks by the methods already described make it feasible to guide the user through editing a sample, template, or previous document to produce a new document that is similar but tailored to the current needs. For example, the system will typically identify: the parties to, and property involved in, contracts; the drugs or conditions listed under the history, current findings, and other specific segments of clinical records; relevant dates; and so on. By also examining other documents of the same document set, the system learns which things are uncommon, common, or necessary, and can therefore make more useful recommendations to the user about what to review and/or update. For example, nearly every contract in a document set may have an effective date, but its value may differ in every contract. Similarly, the participants change, but the categories of participants are much more consistent.
i)Unguided feedback
In interaction with the user, the system first requests feedback on blocks found (or possibly not found) in several documents. The first few documents presented for feedback will be the "cluster centroids" of the document set. The last few documents will be "outliers" in the document set.
ii)Guided feedback
Thereafter, the system guides the user in providing feedback by showing the user selected portions of documents and asking about current or potential tags for them, their scope, and so on.
a. "Tags of interest" are determined by a PageRank-based algorithm together with syntactic and structural models. Among those tags, a set of low-confidence instances is selected for review.
b. When there are no more low-confidence tags in the current document, the same process may be repeated for additional documents. In some embodiments, the models are continuously updated according to the feedback provided by the user. However, feedback may also be accumulated and applied later, in batch, and/or offline. In turn, adjustments to the models may affect the selection of blocks and tags subsequently presented for feedback, and may trigger re-analysis of some documents.
c. The system solicits feedback on fields and structure blocks using essentially the same mechanism. In one approach, all block detectors provide confidence estimates, which may be used along with other information to select candidates for feedback.
Feedback may be requested in separate passes for smaller blocks versus larger blocks, fields versus structural blocks, or in other orders. Referring to FIG. 3, this is an example user interface for user feedback. It displays some or all of the blocks and allows the user to select a particular block to examine, see the assigned type and/or role, and optionally alternatives. The user may move block boundaries, select or edit tags, and so forth. Preferably, the user may also request that a particular change (such as to a tag) be applied to all corresponding blocks or blocks of the same type.
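The candidate-selection logic behind guided feedback can be sketched as a simple form of uncertainty sampling; the ranking rule and the 0.7 cutoff are illustrative assumptions.

```python
# Sketch: among "tags of interest" (importance score assumed, e.g., from a
# PageRank-based step), pick the lowest-confidence instances for review.

def feedback_candidates(blocks, tag_importance, k=10, max_conf=0.7):
    """blocks: dicts with 'label' and 'confidence';
    tag_importance: label -> importance score."""
    pool = [b for b in blocks
            if b["confidence"] < max_conf and b["label"] in tag_importance]
    pool.sort(key=lambda b: (-tag_importance[b["label"]], b["confidence"]))
    return pool[:k]
```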
15)Feedback response
i) Queued queries. This is a method that allows the system to query both private and public data based on user feedback, typically from multiple users. Selected examples are semantically and syntactically similar to previous failure cases, which increases the value of the feedback.
ii) Scheduler. The scheduler is a method for linking user feedback on the combined output of multiple ML models and non-ML algorithms back to the particular learning model 120 that can learn from that feedback.
The system allows a model to be refined from user feedback on its own output and from user feedback on the output of other learned and non-learned models. This is accomplished by using the feedback as incremental (also referred to as "fine-tuning") training data for the various numerical and neural models described. After feedback is used to improve the models, not only the specific documents but all documents in the document set, or even all of the user's documents, may be re-evaluated. Thus, feedback on each document may improve block identification, role assignment, and structure discovery for all documents, and thereby improve user assistance. This retraining is represented in FIG. 1 by the dashed connecting line from step (15) to step (3).
The documents and all associated information facilitate learning about and analyzing sets of documents (particularly but not exclusively within a particular document set), and thus improve performance on future documents. For example, once a new block is added to one or more documents in a document set, that block may be used in, and suggested for, future documents (or revisions of older documents). At some point, the absence of a recently introduced block role, or the presence of a block role that has fallen out of use, may itself become anomalous. This point may be selected by the user, either proactively or in response to a feedback question, or automatically based on the usage profile of the corresponding block over time. For example, if few or no documents in a document set authored before a certain time have a block of a given role and/or context (e.g., an "exclusions" segment), but most or all later-authored documents have that block, the lack of a corresponding block in a new document may be anomalous and may usefully be flagged to the user as such.
16)Downstream communication
After the document(s) are annotated with block information such as described above, selected information is converted to the particular formats required by external business information systems (such as databases, analysis tools, etc.) and passed to those systems either directly or through automated and/or manual review steps. For example, the name and address of a particular party may be copied to the correct fields in a database, which could not be done automatically if they were identified only as a "name" and an "address" per se. Referring to FIG. 4, this is an example of integration with a downstream software application. In this example, blocks representing terms that the parties are expected to agree to have been extracted and passed to a downstream application similar to DocuSign to be filled out and signed.
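A minimal sketch of such a hand-off follows; the role-to-column mapping, table layout, and role names are illustrative assumptions (here targeting SQLite for self-containment).

```python
# Sketch: copy blocks with known semantic roles into the fields of an
# external system.
import sqlite3

ROLE_TO_COLUMN = {"seller name": "name", "seller address": "address"}

def export_parties(conn, blocks):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parties (name TEXT, address TEXT)")
    row = {col: None for col in ROLE_TO_COLUMN.values()}
    for b in blocks:                       # route each block by its role
        col = ROLE_TO_COLUMN.get(b["label"])
        if col:
            row[col] = b["text"]
    conn.execute("INSERT INTO parties (name, address) VALUES (?, ?)",
                 (row["name"], row["address"]))
    conn.commit()

export_parties(sqlite3.connect(":memory:"),
               [{"label": "seller name", "text": "Jane Doe"},
                {"label": "seller address", "text": "1 Main St"}])
```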
FIG. 5 is a block diagram of one embodiment of a computer system 510 that may be used with the present invention. The above steps may be implemented by software executing on such a computer system. Computer system 510 typically includes at least one computer or processor 514 that communicates with peripheral devices via bus subsystem 512. In general, a computer may include, or a processor may be, any of: a microprocessor, a graphics processing unit, or a digital signal processor, and their electronic processing equivalents, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA). These peripheral devices may include a storage subsystem 524 (which includes a memory subsystem 526 and a file storage subsystem 528), a user interface input device 522, a user interface output device 520, and a network interface subsystem 516. The input and output devices allow a user to interact with computer system 510.
The computer system may be a server computer, a client computer, a workstation, a mainframe, a Personal Computer (PC), a tablet, a rack-mounted "blade," or any data processing machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine.
The computer system typically includes an operating system such as Microsoft Windows, Sun Microsystems' Solaris, Apple's MacOS, Linux, or Unix. The computer system may also typically include a basic input/output system (BIOS) and processor firmware. The operating system, BIOS, and firmware are used by the processor to control the subsystems and interfaces connected to the processor. Typical processors compatible with these operating systems include the Pentium and Itanium from Intel, the Opteron and Athlon from AMD (Advanced Micro Devices), and ARM processors from ARM Holdings.
The innovations, embodiments and/or examples of the claimed invention are not limited to traditional computer applications nor to programmable devices running them. For example, the claimed innovations, embodiments and/or examples may include optical computers, quantum computers, analog computers and the like. The computer system may be a multi-processor or multi-core system and may be implemented using or in a distributed or remote system. The term "processor" is used herein in its broadest sense to include single processors and multi-core or multi-processor arrays, including graphics processing units, digital signal processors, digital processors, and combinations of these devices. Moreover, although only a single computer system or a single machine may be illustrated, the use of such terms in the singular shall also refer to any collection of computer systems or machines that individually or jointly execute instructions to perform any one or more of the operations discussed herein. Due to the ever-changing nature of computers and networks, the description of computer system 510 depicted in FIG. 5 is intended only as an example for purposes of illustrating the preferred embodiments. Many other configurations of computer system 510 are possible having more or fewer components than the computer system depicted in fig. 5.
Network interface subsystem 516 provides an interface to an external network (which includes an interface to communication network 518) and is coupled to corresponding interface devices in other computer systems or machines via communication network 518. Communication network 518 may include a number of interconnected computer systems, machines, and communication links. These communication links may be wired links, optical links, wireless links, or any other device for communicating information. The communication network 518 may be any suitable computer network, such as a wide area network (such as the internet), and/or a local area network (such as an ethernet). The communication network may be wired and/or wireless, and may use encryption and decryption methods, such as may be used with a virtual private network. A communication network uses one or more communication interfaces that can receive data from and transmit data to other systems. Examples of communication interfaces typically include an ethernet card, a modem (e.g., telephone, satellite, cable, or ISDN), (asynchronous) Digital Subscriber Line (DSL) unit, firewire interface, USB interface, etc. One or more communication protocols may be used, such as HTTP, TCP/IP, RTP/RTSP, IPX, and/or UDP.
The user interface input devices 522 may include alphanumeric keyboards, keypads, pointing devices (such as mice, trackballs, touch pads, styluses, or graphic pads), scanners, touch screens incorporated into displays, audio input devices (such as voice recognition systems or microphones), eye gaze recognition, brain wave pattern recognition, and other types of input devices. Such devices may be connected to a computer system by wired or wireless means. In general, use of the term "input device" is intended to include all possible types and manners of entering information into computer system 510 or onto communication network 518. User interface input devices typically allow a user to select objects, icons, text, etc. that appear on some type of user interface output device (e.g., a display subsystem).
User interface output device 520 may include a display subsystem, a printer, or a non-visual display (such as an audio output device). The display subsystem may include a flat panel device, such as a Liquid Crystal Display (LCD), a projection device, or some other device for creating a visible image, such as a virtual reality system. The display subsystem may also provide non-visual displays, such as via an audio output or tactile output (e.g., vibration) device. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from computer system 510 to a user or to another machine or computer system.
Memory subsystem 526 typically includes a number of memories including a main Random Access Memory (RAM)530 (or other volatile storage device) for storing instructions and data during program execution, and a Read Only Memory (ROM)532 in which fixed instructions are stored. File storage subsystem 528 provides persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, flash memory, or removable media cartridges. Databases and modules implementing the functionality of certain embodiments may be stored by file storage subsystem 528.
Bus subsystem 512 provides a means for the various components and subsystems of computer system 510 to communicate with one another as intended. Although bus subsystem 512 is shown schematically as a single bus, alternative embodiments of the bus subsystem may use multiple buses. For example, a RAM-based main memory may communicate directly with a file storage system using a Direct Memory Access (DMA) system.
Although the specific embodiments contain many specific details, these should not be construed as limiting the scope of the invention, but as merely illustrating different examples. It should be understood that the scope of the present disclosure includes other embodiments not discussed in detail above. It will be apparent to those skilled in the art that various other modifications, changes, and variations can be made in the arrangement, operation, and details of the methods and apparatus disclosed herein without departing from the spirit and scope as defined in the appended claims. Accordingly, the scope of the invention should be determined by the appended claims and their legal equivalents.

Claims (100)

1. A method implemented on a computer system executing instructions for analyzing and annotating a document, the method comprising:
accessing a document set comprising a plurality of documents;
automatically identifying blocks within individual documents of the set of documents by: (a) based on the content, layout, and context in the individual documents; and (b) a schema based on the content, layout, and context of the documents across the set of documents; and
annotating documents in the document set based on an analysis of the blocks identified from documents within the document set.
2. The computer-implemented method of claim 1, further comprising:
the set of documents is assembled by clustering the documents into the set of documents based on similarity of content and/or layout.
3. The computer-implemented method of claim 1, wherein automatically identifying blocks within individual documents in the set of documents is further (c) based on identifying semantic roles within the individual documents; and (d) based on identifying counterpart blocks in different documents in the set of documents, wherein the counterpart blocks play the same semantic role in the different documents.
4. The computer-implemented method of claim 3, wherein identifying corresponding chunks in different documents comprises:
content that is different in different documents but that appears within a substantially similar context within the different documents is identified.
5. The computer-implemented method of claim 3, wherein identifying corresponding chunks in different documents comprises:
substantially identical content in different documents is identified.
6. The computer-implemented method of claim 1, further comprising:
annotating some of the identified blocks with metadata describing the blocks, wherein identifying corresponding blocks in different documents is based on similarity of the metadata.
7. The computer-implemented method of claim 1, wherein identifying blocks based on patterns across the documents in the set of documents comprises:
identifying, in an individual document, blocks that typically appear in the documents of the document set but appear to be missing from the individual document.
8. The computer-implemented method of claim 1, wherein the identified block comprises:
a field block containing content within the document suitable for use as a field in a document template; and
a structure block containing content including a structure within the layout of the document.
9. The computer-implemented method of claim 8, wherein some of the field blocks are hierarchical and contain other blocks as sub-blocks.
10. The computer-implemented method of claim 1, wherein some of the identified blocks contain content that describes semantic roles played by other blocks.
11. The computer-implemented method of claim 1, further comprising:
annotating some of the identified blocks with data types of the blocks and semantic roles of the blocks.
12. The computer-implemented method of claim 1, wherein identifying blocks based on layout comprises:
grouping line-oriented text into structural blocks, wherein the grouping is based on token shape, first and last tokens, formatting properties, and/or punctuation.
13. The computer-implemented method of claim 1, wherein identifying blocks based on layout comprises:
using machine learning inference trained on images of typeset pages to identify the spatial boundaries of structural blocks.
14. The computer-implemented method of claim 1, wherein identifying blocks based on layout comprises:
identifying spatial boundaries of structural blocks using artificial-intelligence-based visual recognition of geometric patterns in the layout.
15. The computer-implemented method of claim 1, wherein identifying blocks based on layout comprises:
identifying structural blocks based on the layout of non-textual structural features, wherein the non-textual structural features include at least one of: a drawing, a table, a sidebar, a footnote, a header, or a footer.
16. The computer-implemented method of claim 1, wherein identifying blocks based on content comprises:
the blocks are identified using AI techniques for topic estimation.
17. The computer-implemented method of claim 1, wherein identifying blocks based on content comprises:
few-shot named entity recognition techniques are used to identify blocks within the document set.
18. The computer-implemented method of claim 1, further comprising:
receiving user corrections for erroneously identified blocks; and
the step of automatically identifying blocks is improved in response to the user correction.
19. A non-transitory computer-readable storage medium storing executable computer program instructions for analyzing and improving documents, the instructions being executable by a computer system and causing the computer system to perform a method, the method comprising:
accessing a document set comprising a plurality of documents;
automatically identifying blocks within individual documents of the set of documents by: (a) based on the content, layout, and context in the individual documents; and (b) a schema based on the content, layout, and context of the documents across the set of documents; and
annotating documents in the document set based on an analysis of the blocks identified from documents within the document set.
20. A computer system for analyzing and refining documents, the computer system comprising:
a storage medium for receiving and storing a document set comprising a plurality of documents; and
a processor system having access to the storage medium and executing an application for analyzing and annotating documents, wherein the processor system executes the application:
automatically identifying blocks within individual documents of the set of documents by: (a) based on the content, layout, and context in the individual documents; and (b) a schema based on the content, layout, and context of the documents across the set of documents; and
annotating documents in the document set based on an analysis of the blocks identified from documents within the document set.
21. A method implemented on a computer system executing instructions for analyzing and improving a document, the method comprising:
accessing a document set comprising a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set;
automatically assigning semantic role labels to a plurality of the blocks, wherein the semantic role labels describe the semantic roles played by the blocks, and wherein automatically assigning semantic role labels to the blocks (a) comprises determining semantic roles for the blocks using machine learning and/or natural language processing methods, and (b) is also based on blocks in different documents identified as playing the same semantic role within their respective documents; and
using the block and its semantic role tag in further processing of documents in the document set.
22. The computer-implemented method of claim 21, wherein the plurality of documents in the set of documents are all of the same document type.
23. The computer-implemented method of claim 21, wherein the block in the document set comprises:
field blocks containing content within the document suitable for use as fields in a document template, wherein some of the field blocks are hierarchical and contain other blocks as sub-blocks; and
a structure block containing content that includes a structure within the layout of the document.
24. The computer-implemented method of claim 21, wherein the set of documents contains legal documents; and the semantic roles include (a) roles played by participants in the legal document, and (b) roles played by date, time period, or other temporal expression.
25. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
automatically extracting some of the semantic role labels from the block; and
assigning the extracted semantic role labels to blocks.
26. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
automatically extracting semantic role labels from blocks using machine learning by: (a) based on content, layout, and context in the individual documents; (b) a mode based on content, layout, and context across the documents in the document set; and (c) a block-based data type; and
assigning the extracted semantic role labels to blocks.
27. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
automatically extracting some of the semantic role labels using an autoencoder machine learning technique; and
assigning the extracted semantic role labels to blocks.
28. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
automatically extracting candidate semantic role labels from the block;
refining the candidate semantic role labels using machine learning; and
assigning the extracted semantic role labels to blocks.
29. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
automatically extracting some of the semantic role labels from blocks based on similarity in content, layout, and/or context of blocks from different documents in the document set; and
assigning the extracted semantic role labels to blocks.
30. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
assigning the candidate semantic role labels to the blocks;
grouping blocks into clusters based on similarities in the semantic roles played by the blocks;
normalizing the candidate semantic role labels among the blocks in a cluster; and
assigning the normalized semantic role labels to blocks.
31. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
assigning the candidate semantic role labels to the blocks;
grouping the blocks into block clusters based on the block sizes and similarity of text embedding;
grouping the candidate semantic role labels into label clusters based on similarity of text embedding of the candidate semantic role labels;
normalizing the candidate semantic role labels based on the block clusters and the label clusters; and
assigning the normalized semantic role labels to blocks.
32. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises:
assigning a candidate semantic role tag to a block that includes a segment of a document, wherein the candidate semantic role tag is based on a title of the segment;
grouping the blocks into clusters based on similarity of content in the segments;
normalizing the candidate semantic role labels by selecting the most common candidate semantic role label as the semantic role label for all blocks in the cluster; and
assigning the normalized semantic role labels to blocks.
33. The computer-implemented method of claim 21, wherein the semantic role labels are selected from a set of predetermined semantic role labels.
34. The computer-implemented method of claim 21, wherein the semantic role labels comprise: tags identified by a software application for further processing documents of the document set.
35. The computer-implemented method of claim 21, wherein automatically assigning semantic role labels to blocks comprises at least one of:
(a) determining semantic roles of blocks based on other blocks in the vicinity or based on containing blocks that contain the blocks using machine learning, or (b) determining semantic roles of blocks based on syntactic structures of nearby blocks using natural language processing methods.
36. The computer-implemented method of claim 21, wherein some of the blocks are named entity references, such blocks are tagged with semantic role labels for the semantic roles those blocks play in the document, and such blocks are also tagged with the data types of the blocks.
37. The computer-implemented method of claim 21, wherein some of the blocks are multi-paragraph structures in the document, and such blocks are tagged with semantic role labels for the semantic roles those blocks play in the document.
38. The computer-implemented method of claim 21, further comprising:
estimating a confidence level of the semantic role labels that are automatically assigned;
based on the estimated confidence level, presenting some assignments to a user for confirmation;
receiving user feedback for the automatically assigned semantic role labels; and
improving the machine learning and/or the natural language processing method in response to the user feedback.
39. A non-transitory computer-readable storage medium storing executable computer program instructions for analyzing and improving documents, the instructions being executable by a computer system and causing the computer system to perform a method, the method comprising:
accessing a document set comprising a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set;
automatically assigning semantic role labels to a plurality of the blocks, wherein the semantic role labels describe the semantic roles played by the blocks, and wherein automatically assigning semantic role labels to the blocks (a) comprises determining semantic roles for the blocks using machine learning and/or natural language processing methods, and (b) is also based on blocks in different documents identified as playing the same semantic role within their respective documents; and
making the block and its semantic role label available for further processing of documents in the document set.
40. A computer system for analyzing and refining documents, the computer system comprising:
a storage medium to receive and store a document set comprising a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set; and
a processor system having access to the storage medium and executing an application for analyzing and improving a document, wherein the processor system executes the application:
automatically assigning semantic role labels to a plurality of the blocks, wherein the semantic role labels describe the semantic roles played by the blocks, and wherein automatically assigning semantic role labels to the blocks (a) comprises determining semantic roles for the blocks using machine learning and/or natural language processing methods, and (b) is also based on blocks in different documents identified as playing the same semantic role within their respective documents; and
making the block and its semantic role label available for further processing of documents in the document set.
41. A method implemented on a computer system executing instructions for processing a document, the method comprising:
processing a document set containing a plurality of documents to identify blocks in the documents and generate corresponding annotations, comprising the stages of:
processing an image of the document to identify a visual block comprising a visually distinct region of the image of the document; and generating a first annotation specifying a spacing and formatting of the visual block;
processing the visual block and the first annotation to identify a structural block containing content from a structure within the visual block; and generating a second annotation specifying the layout of the structure block;
processing the structure block and the second annotation to identify a topic-level block based on a grouping by topic of content in the structure block; and generating a third annotation specifying a subject of the subject level block; and
processing the subject-level block and the third annotation to identify a field block containing content suitable for use as a field in a document template; and generating a fourth annotation specifying the field of the field block;
generating a representation of the processed document in a format that includes the field blocks and at least some of the other blocks identified from the document and corresponding annotations for the blocks; and
the representation in the format is made available to any of a plurality of software applications in a downstream process.
42. The computer-implemented method of claim 41, wherein the representation of the document that is processed includes all of the blocks identified when processing the document and all of the corresponding annotations generated when processing the document.
43. The computer-implemented method of claim 41, wherein each of the stages of processing the document uses machine learning, artificial intelligence, and/or natural language processing.
44. The computer-implemented method of claim 41, wherein each of the stages of processing the document identifies blocks with less than 100% confidence.
45. The computer-implemented method of claim 44, wherein the representation of the document that is processed further comprises: assigning annotations of confidence levels to the identification of blocks.
46. The computer-implemented method of claim 44, further comprising:
receiving user corrections for erroneously identified blocks; and
improving the stages of automatically identifying blocks in response to the user corrections.
47. The computer-implemented method of claim 41, wherein the stages of processing visual blocks, processing structural blocks, and processing subject-level blocks are performed recursively for visual blocks contained within other visual blocks.
48. The computer-implemented method of claim 41, wherein the processed representation of the document further comprises annotations for the data types and the semantic role labels for a plurality of the blocks, wherein the semantic role labels describe semantic roles played by the blocks.
49. The computer-implemented method of claim 41, wherein some higher-level blocks contain other lower-level blocks as sub-blocks, and the representation of the document being processed further includes annotations specifying that lower-level blocks are contained in higher-level blocks.
50. The computer-implemented method of claim 41, wherein some blocks have hierarchical relationships and the representation of the document that is processed further includes annotations that specify hierarchical relationships between blocks.
51. The computer-implemented method of claim 41, wherein the block in the representation of the document that is processed comprises: a plurality of segments, titles, lists, items, logos, and/or named entities at a plurality of different levels.
52. The computer-implemented method of claim 41, wherein the plurality of documents in the set of documents are all of the same document type.
53. The computer-implemented method of claim 41, further comprising:
the set of documents is assembled by clustering the documents into the set of documents based on similarity of content and/or layout.
54. The computer-implemented method of claim 41, wherein the processed representation of the document is in XML format.
55. The computer-implemented method of claim 41, wherein the processed representation of the document further comprises: annotation of the location of the block using digital signatures.
56. The computer-implemented method of claim 41, wherein the document has an original layout and the representation of the document that is processed contains sufficient information to reconstruct the document having the original layout.
57. The computer-implemented method of claim 41, wherein the plurality of software applications includes a software application having a user interface for a user to create, edit, and/or review the representation of the document that is processed.
58. The computer-implemented method of claim 41, wherein the format is a standardized, published format.
59. A non-transitory computer-readable storage medium storing executable computer program instructions for processing a document, the instructions being executable by a computer system and causing the computer system to perform a method, the method comprising:
processing a document set containing a plurality of documents to identify blocks in the documents and generate corresponding annotations, comprising the stages of:
processing an image of the document to identify a visual block comprising a visually distinct region of the image of the document; and generating a first annotation specifying a spacing and formatting of the visual block;
processing the visual block and the first annotation to identify a structural block containing content from a structure within the visual block; and generating a second annotation specifying the layout of the structure block;
processing the structure block and the second annotation to identify a topic-level block based on a grouping by topic of content in the structure block; and generating a third annotation specifying a subject of the subject level block; and
processing the subject-level block and the third annotation to identify a field block that contains content suitable for use as a field in a document template; and generating a fourth annotation specifying the field of the field block;
generating a representation of the document processed in a format that includes the field blocks and at least some of the other blocks identified from the document and corresponding annotations for the blocks; and
the representation in the format is made available to any of a plurality of software applications in a downstream process.
60. A computer system for processing a document, the computer system comprising:
a storage medium for receiving and storing a document set comprising a plurality of documents; and
a processor system having access to the storage medium and executing an application for processing a document, wherein the processor system executes the application:
processing the plurality of documents to identify blocks in the documents and generate corresponding annotations, comprising the stages of:
processing an image of the document to identify a visual block comprising a visually distinct region of the image of the document; and generating a first annotation specifying a spacing and formatting of the visual block;
processing the visual block and the first annotation to identify a structural block containing content from a structure within the visual block; and generating a second annotation specifying the layout of the structure block;
processing the structure block and the second annotation to identify a topic-level block based on a grouping by topic of content in the structure block; and generating a third annotation specifying a subject of the subject-level block; and
processing the subject-level block and the third annotation to identify a field block that contains content suitable for use as a field in a document template; and generating a fourth annotation specifying the field of the field block;
generating a representation of the processed document in a format that includes the field blocks and at least some of the other blocks identified from the document and corresponding annotations for the blocks; and
the representation in the format is made available to any of a plurality of software applications in a downstream process.
61. A method implemented on a computer system executing instructions for assisting a user in developing a target document belonging to a set of documents, the method comprising:
accessing a document set containing a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set, and further includes data types and semantic role labels for some of the blocks, wherein the semantic role labels describe semantic roles that the blocks play within their respective documents;
deriving a pattern of occurrence of (a) a semantic role played across blocks of the document in the set of documents; and (b) counterpart blocks in different documents across the set of documents, wherein counterpart blocks play the same semantic role in different documents;
providing a user interface for developing a target document belonging to the set of documents; and
generating a suggestion to develop the target document based on the derived appearance pattern across the set of documents, and displaying the suggestion within the user interface.
62. The computer-implemented method of claim 61, wherein deriving the occurrence patterns comprises: machine learning and/or artificial intelligence is used to derive the appearance patterns.
63. The computer-implemented method of claim 61, wherein the plurality of documents in the set of documents are all of the same document type.
64. The computer-implemented method of claim 61, further comprising:
deriving a pattern of occurrence of blocks within individual documents of the set of documents, wherein automatically generating a suggestion is further based on such derived pattern.
65. The computer-implemented method of claim 61, wherein the block in the set of documents comprises:
field blocks containing content within the document suitable for use as fields in a document template, wherein some of the field blocks are hierarchical and contain other blocks as sub-blocks;
structure blocks containing content from structures within the layout of the document, and the semantic role labels comprise semantic role labels for some of the structure blocks; and
blocks containing images or video.
66. The computer-implemented method of claim 61, further comprising:
comparing blocks in the target document to the derived occurrence patterns of semantic roles and/or counterpart blocks across the set of documents, wherein some suggestions are automatically generated based on the comparison.
67. The computer-implemented method of claim 61, further comprising:
identifying anomalies in the occurrence of semantic roles in the target document as compared to the derived occurrence patterns of semantic roles and/or counterpart blocks across the set of documents; wherein at least one suggestion is automatically generated based on the identified anomaly.
68. The computer-implemented method of claim 67, wherein:
the identified anomalies include a semantic role that is missing from the target document but commonly appears in the set of documents; and
the automatically generated suggestions include a suggestion to add content for the missing semantic role.
69. The computer-implemented method of claim 67, wherein:
the identified anomalies include an additional semantic role that appears in the target document but does not commonly appear in the set of documents; and
the automatically generated suggestions include a suggestion to remove or modify the corresponding block for the additional semantic role.
70. The computer-implemented method of claim 67, wherein:
the identified anomalies include a semantic role that occurs in the target document and also commonly occurs in the set of documents, but whose block content in the target document is inconsistent with the content of the counterpart blocks in the set of documents; and
the automatically generated suggestions include a suggestion to remove or modify the inconsistent content in the target document.
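Claims 68-70 enumerate three anomaly types, and all three can be detected against the derived patterns. A crude sketch, reusing the document representation above, with illustrative thresholds for "commonly appears" and a majority vote standing in for whatever consistency test an implementation would actually use:

from collections import Counter

def find_anomalies(target, doc_set, common=0.8, rare=0.2):
    n = len(doc_set)
    freq = Counter(r for doc in doc_set
                   for r in {b["role"] for b in doc["blocks"]})
    target_roles = {b["role"]: b["content"] for b in target["blocks"]}
    anomalies = []
    # Claim 68: a role common in the set but missing from the target.
    for role, count in freq.items():
        if count / n >= common and role not in target_roles:
            anomalies.append(("missing", role, f"add content for '{role}'"))
    for role, content in target_roles.items():
        if freq[role] / n <= rare:
            # Claim 69: a role in the target that rarely occurs in the set.
            anomalies.append(("extra", role, f"remove or modify the '{role}' block"))
        else:
            # Claim 70: the role occurs on both sides, but the target's
            # content deviates from what the counterpart blocks agree on.
            values = Counter(b["content"] for doc in doc_set
                             for b in doc["blocks"] if b["role"] == role)
            if values and content not in values and \
                    values.most_common(1)[0][1] / n >= common:
                anomalies.append(("inconsistent", role,
                                  f"align the '{role}' block with the set"))
    return anomalies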
71. The computer-implemented method of claim 61, wherein:
the identified blocks include field blocks containing content within the document suitable for use as fields in a document template;
the derived patterns include that, for one of the field blocks, the counterpart blocks all contain substantially the same content; and
at least one automatically generated suggestion includes populating that field block in the target document with the same content.
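Claim 71's special case, where every counterpart field block carries substantially the same content, reduces to a fill-in suggestion. A sketch that approximates "substantially the same" by exact match after whitespace normalization (a deliberate simplification):

from collections import Counter, defaultdict

def suggest_field_fill(target, doc_set, min_agreement=0.9):
    # Tally the normalized content of every field block per role, then
    # offer the consensus value for any role the target has not filled.
    values = defaultdict(Counter)
    for doc in doc_set:
        for b in doc["blocks"]:
            if b.get("kind") == "field":
                values[b["role"]][" ".join(b["content"].split())] += 1
    filled = {b["role"] for b in target["blocks"] if b.get("kind") == "field"}
    suggestions = {}
    for role, counts in values.items():
        value, agreeing = counts.most_common(1)[0]
        if role not in filled and agreeing / len(doc_set) >= min_agreement:
            suggestions[role] = value
    return suggestions  # e.g. {"governing_law": "State of Washington"}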
72. The computer-implemented method of claim 61, wherein displaying a suggestion to the user comprises: displaying some suggestions with options for the user to accept or reject the suggestions.
73. The computer-implemented method of claim 61, further comprising at least one of:
(a) in response to the user accepting an individual suggestion for the target document, repeating the same suggestion for a second target document that exhibits the same pattern that caused the accepted suggestion to be generated for the target document; and
(b) in response to the user rejecting an individual suggestion for the target document, not repeating the same suggestion for a third target document that exhibits the same pattern that caused the rejected suggestion to be generated for the target document.
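The accept/reject behavior of claim 73 amounts to remembering decisions keyed by the pattern that triggered each suggestion. A minimal sketch (the pattern key and its granularity are assumptions; the claim does not prescribe them):

class SuggestionMemory:
    def __init__(self):
        self.decisions = {}  # pattern key -> True (accepted) / False (rejected)

    def record(self, pattern_key, accepted):
        # Called when the user accepts or rejects an individual suggestion.
        self.decisions[pattern_key] = accepted

    def should_offer(self, pattern_key):
        # Repeat suggestions for patterns the user accepted, suppress those
        # the user rejected, and offer patterns not yet decided either way.
        return self.decisions.get(pattern_key, True)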
74. The computer-implemented method of claim 61, further comprising:
automatically applying some suggestions, wherein displaying suggestions to the user comprises: displaying the automatically applied suggestions with an option for the user to confirm them.
75. The computer-implemented method of claim 61, wherein displaying the suggestion includes: displaying the suggestions within the user interface in an order ranked by confidence in the suggestions.
76. The computer-implemented method of claim 61, further comprising:
automatically generating additional suggestions for the target document based on patterns within the target document itself and/or based on patterns in documents outside of the set of documents; wherein the suggestions based on patterns within the target document, based on patterns within the set of documents, and based on patterns in documents outside the set of documents are displayed within the user interface with different priorities.
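Claims 75-76 order suggestions by confidence and give different display priorities to the three pattern sources. One way to combine the two, with illustrative priority values (the claims require only that the priorities differ):

SOURCE_PRIORITY = {"target_document": 0, "document_set": 1, "outside_set": 2}

def order_suggestions(suggestions):
    # Each suggestion is assumed to carry a "source" from the keys above
    # and a "confidence" in [0, 1]; sort by source priority first, then
    # by descending confidence within each source.
    return sorted(suggestions,
                  key=lambda s: (SOURCE_PRIORITY[s["source"]], -s["confidence"]))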
77. The computer-implemented method of claim 61, wherein the automatically generated suggestions additionally depend on the user and/or an affiliation of the user.
78. The computer-implemented method of claim 61, wherein at least one of: (a) the target document is an existing document edited by the user, and the automatically generated suggestions include suggestions for editing the existing document; and (b) the target document is a new document created by the user, and the automatically generated suggestions include suggestions for creating the new document.
79. A non-transitory computer-readable storage medium storing executable computer program instructions for assisting a user in developing a target document belonging to a document set, the instructions being executable by a computer system and causing the computer system to perform a method comprising:
accessing a document set containing a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set, and further includes data types and semantic role labels for some of the blocks, wherein the semantic role labels describe semantic roles that the blocks play within their respective documents;
deriving patterns of occurrence of (a) semantic roles played by blocks across the documents in the set of documents; and (b) counterpart blocks across different documents in the set of documents, wherein counterpart blocks play the same semantic role in different documents;
providing a user interface for developing a target document belonging to the set of documents; and
automatically generating a suggestion to develop the target document based on the derived occurrence patterns across the set of documents, and displaying the suggestion within the user interface.
80. A computer system for assisting a user in developing a target document belonging to a set of documents, the computer system comprising:
a storage medium to receive and store a document set containing a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set, and further includes data types and semantic role labels for some of the blocks, wherein the semantic role labels describe the semantic role that the block plays within its respective document; and
a processor system having access to the storage medium and executing an application for developing the target document, wherein the processor system executes the application to:
deriving patterns of occurrence of (a) semantic roles played by blocks across the documents in the set of documents; and (b) counterpart blocks across different documents in the set of documents, wherein counterpart blocks play the same semantic role in different documents;
providing a user interface for developing a target document belonging to the set of documents; and
automatically generating a suggestion to develop the target document based on the derived occurrence patterns across the set of documents, and displaying the suggestion within the user interface.
81. A method implemented on a computer system executing instructions for assisting a user in reviewing a document set, the method comprising:
accessing a document set containing a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set, and further includes data types and semantic role labels for some of the blocks, wherein the semantic role labels describe semantic roles that the blocks play within their respective documents;
deriving patterns of occurrence of (a) semantic roles played by blocks across the documents in the set of documents; and (b) counterpart blocks across different documents in the set of documents, wherein counterpart blocks play the same semantic role in different documents;
automatically developing information about the content in one or more documents in the set of documents based on the derived occurrence patterns across the set of documents; and making the information available to downstream processes.
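One concrete form the developed information of claim 81 can take (elaborated in claims 84-89) is a grid of counterpart content, one row per document and one column per semantic role. A sketch, again using the illustrative document representation, that also exports the grid in a format a downstream spreadsheet or reporting application could ingest:

import csv
import sys

def counterpart_grid(doc_set, roles):
    # One row per document; empty cells mark absent counterpart blocks.
    rows = []
    for doc in doc_set:
        by_role = {b["role"]: b["content"] for b in doc["blocks"]}
        rows.append([doc["id"]] + [by_role.get(r, "") for r in roles])
    return rows

def export_csv(doc_set, roles, out=sys.stdout):
    # CSV is one plausible interchange format; the claims leave the
    # downstream format open.
    writer = csv.writer(out)
    writer.writerow(["document"] + roles)
    writer.writerows(counterpart_grid(doc_set, roles))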
82. The computer-implemented method of claim 81, wherein deriving the occurrence patterns comprises: using machine learning and/or artificial intelligence to derive the occurrence patterns.
83. The computer-implemented method of claim 81, wherein the blocks in the set of documents comprise:
field blocks containing content within the document suitable for use as fields in a document template, wherein some of the field blocks are hierarchical and contain other blocks as sub-blocks;
structure blocks containing content from structures within the layout of the document, and the semantic role labels comprise semantic role labels for some of the structure blocks; and
blocks containing images or video.
84. The computer-implemented method of claim 81, wherein the information includes content extracted from individual documents from the set of documents.
85. The computer-implemented method of claim 84, wherein the extracted content includes one or more snippets from blocks of the individual document, the snippets being counterparts to blocks in other documents.
86. The computer-implemented method of claim 81, wherein the information includes an indication of the presence or absence of a particular block in an individual document, and the particular block is a counterpart to blocks present in other documents from the set of documents.
87. The computer-implemented method of claim 81, wherein the information comprises a summary of the individual documents.
88. The computer-implemented method of claim 81, wherein the information includes content extracted from a plurality of documents in the set of documents.
89. The computer-implemented method of claim 88, wherein the information includes blocks extracted from a plurality of documents in the set of documents, and the information is organized according to which blocks are counterparts.
90. The computer-implemented method of claim 88, wherein the information comprises an indication of an occurrence of an anomaly across counterpart blocks of the documents in the set of documents.
91. The computer-implemented method of claim 90, wherein the anomaly comprises an absence of a counterpart block in an individual document, and the information is made available in a format that facilitates navigation to the individual document in which the counterpart block is absent.
92. The computer-implemented method of claim 90, wherein the anomalies include an absence of counterpart blocks in individual documents, and the information is made available in a format that summarizes the absence of counterpart blocks in the individual documents.
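Claims 91-92 call for the absence of counterpart blocks to be both summarized and navigable. A sketch of a report record that serves both uses, with the document ids doubling as navigation targets (how a user interface resolves them into links is left open by the claims):

def missing_counterpart_report(doc_set, role):
    missing = [doc["id"] for doc in doc_set
               if role not in {b["role"] for b in doc["blocks"]}]
    return {
        "role": role,
        "missing_count": len(missing),   # the summary of claim 92
        "documents": missing,            # navigation targets per claim 91
    }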
93. The computer-implemented method of claim 88, wherein the downstream process is implemented by a software application and the information is made available in a format suitable for use by the software application.
94. The computer-implemented method of claim 93, wherein the information further comprises:
a description of one or more processes executable by the software application for retrieving the content extracted from the plurality of documents in the set of documents.
95. The computer-implemented method of claim 88, wherein the downstream process includes verifying compliance of the content in blocks that play a semantic role for which the document is subject to predefined requirements or policies.
96. The computer-implemented method of claim 81, wherein the downstream process comprises generating a report in a human understandable format.
97. The computer-implemented method of claim 96, further comprising:
receiving a user selection of a block in one of the documents in the set of documents;
wherein in response to the user selection, the report includes the presence or absence of a counterpart block of the user-selected block.
98. The computer-implemented method of claim 97, further comprising:
in response to the report indicating that some counterpart blocks are missing, receiving a user selection of one of the missing counterpart blocks, and in response to the user selection, updating the report to add the missing counterpart block.
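Claims 97-98 describe an interactive loop: the user selects a block, the report shows where its counterparts are present or absent, and selecting a gap updates the report by adding the missing block. A sketch of the two steps, with the inserted content left as a caller-supplied default (an assumption; the claims do not say where the added content comes from):

def counterpart_presence(doc_set, selected_role):
    # Per claim 97: map each document to whether a counterpart of the
    # user-selected block (same semantic role) is present in it.
    return {doc["id"]: selected_role in {b["role"] for b in doc["blocks"]}
            for doc in doc_set}

def add_missing_counterpart(doc, selected_role, default_content=""):
    # Per claim 98: insert the missing counterpart block so the next
    # report reflects it.
    doc["blocks"].append({"role": selected_role, "content": default_content})
    return doc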
99. A non-transitory computer-readable storage medium storing executable computer program instructions for assisting a user in reviewing a document set, the instructions being executable by a computer system and causing the computer system to perform a method comprising:
accessing a document set containing a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set, and further includes data types and semantic role labels for some of the blocks, wherein the semantic role labels describe semantic roles that the blocks play within their respective documents;
deriving patterns of occurrence of (a) semantic roles played by blocks across the documents in the set of documents; and (b) counterpart blocks across different documents in the set of documents, wherein counterpart blocks play the same semantic role in different documents;
automatically developing information about the content in one or more documents in the set of documents based on the derived occurrence patterns across the set of documents; and making the information available to downstream processes.
100. A computer system for assisting a user in reviewing a set of documents, the computer system comprising:
a storage medium to receive and store a document set containing a plurality of documents, wherein the document set further identifies blocks within individual documents of the document set, and further includes data types and semantic role labels for some of the blocks, wherein the semantic role labels describe the semantic role that the block plays within its respective document; and
a processor system having access to the storage medium and executing an application for assisting a user in reviewing a document set, wherein the processor system executes the application to:
deriving patterns of occurrence of (a) semantic roles played by blocks across the documents in the set of documents; and (b) counterpart blocks across different documents in the set of documents, wherein counterpart blocks play the same semantic role in different documents;
automatically developing information about the content in one or more documents in the set of documents based on the derived occurrence patterns across the set of documents; and making the information available to downstream processes.
CN202080064610.1A 2019-09-16 2020-07-24 Cross-document intelligent writing and processing assistant Pending CN114616572A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201962900793P 2019-09-16 2019-09-16
US62/900,793 2019-09-16
PCT/US2020/043606 WO2021055102A1 (en) 2019-09-16 2020-07-24 Cross-document intelligent authoring and processing assistant

Publications (1)

Publication Number Publication Date
CN114616572A 2022-06-10

Family

ID=74867926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080064610.1A Pending CN114616572A (en) 2019-09-16 2020-07-24 Cross-document intelligent writing and processing assistant

Country Status (6)

Country Link
US (6) US11392763B2 (en)
EP (1) EP4028961A4 (en)
JP (1) JP2022547750A (en)
KR (1) KR20220059526A (en)
CN (1) CN114616572A (en)
CA (1) CA3150535A1 (en)

Also Published As

Publication number Publication date
KR20220059526A (en) 2022-05-10
US20210081411A1 (en) 2021-03-18
CA3150535A1 (en) 2021-03-25
US11392763B2 (en) 2022-07-19
EP4028961A4 (en) 2023-10-18
US11960832B2 (en) 2024-04-16
US11816428B2 (en) 2023-11-14
US20210081608A1 (en) 2021-03-18
JP2022547750A (en) 2022-11-15
US11507740B2 (en) 2022-11-22
US20220245335A1 (en) 2022-08-04
US11822880B2 (en) 2023-11-21
EP4028961A1 (en) 2022-07-20
US20210081613A1 (en) 2021-03-18
US20210081601A1 (en) 2021-03-18
US11514238B2 (en) 2022-11-29
US20210081602A1 (en) 2021-03-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination