WO2022211982A1 - Systems and methods for generating dialog trees - Google Patents

Systems and methods for generating dialog trees

Info

Publication number
WO2022211982A1
WO2022211982A1 PCT/US2022/019215
Authority
WO
WIPO (PCT)
Prior art keywords
dialog
raw
blocks
decision tree
tree generation
Prior art date
Application number
PCT/US2022/019215
Other languages
French (fr)
Inventor
Achraf Abdelmoneim Tawfik Mahmoud CHALABI
Michael Zaki Adel Zaki FARAG
Sameh Hany Ahmed ABDULRAHIM
Mohamed Hussein Mohamed HUSSEIN
Ahmed Salaheldin Mohamed Ahmed MOHAMED
Eslam Kamal ABDELREHEEM
Omar Mohamed Hamed ABOUELKHIR
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/334,543 external-priority patent/US20220318497A1/en
Application filed by Microsoft Technology Licensing, Llc filed Critical Microsoft Technology Licensing, Llc
Publication of WO2022211982A1 publication Critical patent/WO2022211982A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems

Definitions

  • Dialog systems are interactive question answering systems that access information from structured databases to answer questions from customers.
  • Customers may interact with virtual agents or bots when interfacing with these dialog systems, in which a decision tree is traversed as the customer is asked a series of focused questions toward a final answer to the customer's question.
  • Human operators are employed to develop such dialog systems manually, which requires significant time, cost, and effort.
  • a dialog tree generation system including a processor, and a memory storing instructions that, when executed by the processor, cause the system to receive documents, parse the documents into raw blocks, extract visual design elements from the raw blocks, generate a content structure from the extracted visual design elements, generate at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy, and output the dialog decision tree.
  • FIG. 1A is a general schematic diagram illustrating a dialog tree generation system according to an example embodiment of the subject disclosure.
  • FIG. 1B is a general schematic diagram illustrating the tools that can be used by the document parser, page layout extractor, content structure extractor, and augmentor of the dialog tree generation system of FIG. 1A.
  • FIG. 2A is an illustration of an exemplary document parsing output of the document parser of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 2B is an illustration of an exemplary page layout extraction output of the page layout extractor of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 2C is an illustration of an exemplary content structure extraction output of the content structure extractor of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 2D is an illustration of an exemplary augmentation output of the augmentor of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 2E is an illustration of an exemplary dialog decision tree outputted by the dialog extractor of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 3 is an illustration of an exemplary second document structure model and an exemplary second dialog decision tree outputted by the dialog extractor of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 4 is an illustration of an exemplary third document structure model and an exemplary third dialog decision tree outputted by the dialog extractor of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 5 is an illustration of an exemplary fourth document structure model and an exemplary fourth dialog decision tree outputted by the dialog extractor of the dialog tree generation system of FIGS. 1A and 1B.
  • FIG. 6 is a flowchart illustrating a method for generating a dialog decision tree according to a first example embodiment of the subject disclosure.
  • FIG. 7 is a flowchart illustrating another method for generating a dialog decision tree according to a second example embodiment of the subject disclosure.
  • FIG. 8 is a flowchart illustrating yet another method for generating a dialog decision tree according to a third example embodiment of the subject disclosure.
  • FIG. 9 is a schematic diagram illustrating an exemplary computing system that can be used to implement the dialog tree generation system of FIGS. 1A and 1B.

DETAILED DESCRIPTION
  • In FIG. 1A, a dialog tree generation system 10 for use in extracting dialog information from documents is depicted.
  • the dialog tree generation system 10 comprises a dialog extraction computing device 12 including a processor 14, volatile memory 16, an input/output module 18, and non-volatile memory 24 storing an application 26 including a document parser 30, a page layout extractor 34, a content structure extractor 38, an augmentor 42, and a dialog extractor 48.
  • a bus 20 can operatively couple the processor 14, the input/output module 18, and the volatile memory 16 to the non-volatile memory 24.
  • While the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48 are depicted as hosted (i.e., executed) at one computing device 12, it will be appreciated that the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48 can alternatively be hosted across a plurality of computing devices to which the computing device 12 is communicatively coupled via a network 22.
  • a client computing device 52 can be provided, which is operatively coupled to the computing device 12.
  • the network 22 can take the form of a local area network (LAN), wide area network (WAN), wired network, wireless network, personal area network, or a combination thereof, and can include the Internet.
  • the computing device 12 comprises a processor 14 and a non-volatile memory 24 configured to store the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48.
  • Non-volatile memory 24 is memory that retains stored instructions and data even in the absence of externally applied power, such as FLASH memory, a hard disk, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), etc.
  • the instructions include one or more programs, including the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48, and data used by such programs sufficient to perform the operations described herein.
  • the instructions cause the processor 14 to execute the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48.
  • the processor 14 is a microprocessor that includes one or more of a central processing unit (CPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), a system on chip (SOC), a field-programmable gate array (FPGA), a logic circuit, or other suitable type of microprocessor configured to perform the functions recited herein.
  • the system 10 further includes volatile memory 16 such as random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), etc., which temporarily stores data only for so long as power is applied during execution of programs.
  • the client computing device 52 can execute an application client 26A to send input documents 28 to the computing device 12 as user input 56, and subsequently receive a dialog decision tree 50 from the computing device 12 as output.
  • the dialog decision tree 50 can be associated with a virtual assistant 51 which is a program that is installed on the computing device 12.
  • the virtual assistant 51 can also be referred to as a virtual agent, a chatter bot, a chatbot, a digital personal assistant, or an automated online assistant, for example.
  • the computing device 12 executes the virtual assistant 51.
  • the dialog decision tree 50 is executed by the virtual assistant 51 to thereby present an interactive chat dialog to a user that proceeds according to the dialog decision tree 50.
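As an illustration of this traversal, the following sketch walks a small tree one user choice at a time; the node class, method names, and example labels (modeled loosely on the FIG. 2E example) are assumptions, not the patent's implementation.

```python
# Hypothetical sketch of a virtual assistant walking a dialog decision tree:
# each node holds a prompt and a mapping from user choices to child nodes.

class DialogNode:
    def __init__(self, prompt, children=None):
        self.prompt = prompt              # question (inner node) or answer (leaf)
        self.children = children or {}    # option label -> child DialogNode

    def is_leaf(self):
        return not self.children

def advance(node, user_choice):
    """Move one level down the tree; stay in place on unrecognized input."""
    return node.children.get(user_choice, node)

# A tiny tree shaped like the FIG. 2E example
root = DialogNode("What do you need help with?", {
    "installing the print cartridge": DialogNode("Which part?", {
        "installing print head": DialogNode("Steps for installing the print head."),
    }),
    "default settings of access point": DialogNode("Here are the default settings."),
})

node = advance(root, "installing the print cartridge")
node = advance(node, "installing print head")
```

The chat proceeds by repeating `advance` until a leaf is reached, at which point the leaf's prompt is the final answer presented to the user.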
  • the input documents can include PDF files, HTML files, PowerPoint files, Word documents, and OCR (optical character recognition) documents, for example.
  • the application client 26A can be coupled to a graphical user interface 54 of the client computing device 52 to display graphical output 58 based on the dialog decision tree 50 outputted from the dialog extractor 48.
  • the document parser 30 receives input documents 28 as input, parses the input documents into raw blocks 32, and outputs the raw blocks 32, which can include raw text blocks, raw image blocks, and/or raw shape blocks.
  • a raw text block is a body of text that is grouped together on a page of the input documents 28
  • a raw image block is an area on a page of the input documents 28 where a raw image is located
  • a raw shape block is an area on a page of the input documents 28 where a raw shape is located.
  • These raw blocks can be shaped like a square or a rectangle.
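A raw block of this kind could be modeled, for example, as a rectangular region on a page; the class and field names below are illustrative assumptions rather than the patent's data model.

```python
# Illustrative data model for raw blocks: each block is an axis-aligned
# rectangle on a page, tagged as text, image, or shape.

from dataclasses import dataclass

@dataclass
class RawBlock:
    kind: str       # "text", "image", or "shape"
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""  # populated only for raw text blocks

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)

blocks = [
    RawBlock("text", 1, 50, 700, 550, 730, text="Setting up printer"),
    RawBlock("image", 1, 50, 400, 300, 650),
]
text_blocks = [b for b in blocks if b.kind == "text"]
```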
  • the page layout extractor 34 receives the raw blocks 32 as input, processes the raw blocks 32 to extract visual design elements, and outputs a document layout model 36 comprising the visual design elements.
  • the content structure extractor 38 receives the document layout model 36 as input, processes the document layout model 36 by analyzing the visual design elements in the document layout model 36, generates a document structure model 40 comprising a content structure from the extracted visual design elements, and outputs the document structure model 40.
  • the augmentor 42 receives the document structure model 40 as input, annotates the document structure model 40, and outputs the document structure model 40 annotated with entities 44 and synonyms 46.
  • the dialog extractor 48 receives the document structure model 40 annotated with entities 44 and synonyms 46 as input, processes the document structure model 40 annotated with entities 44 and synonyms 46, generates a dialog decision tree 50 based on the extracted content structure, and outputs the dialog decision tree 50 to the application client 26A to a location accessible by a virtual assistant 51.
  • the dialog decision tree may include a plurality of nodes organized into a hierarchy.
  • In FIG. 1B, the processing tools that the document parser 30, the page layout extractor 34, the content structure extractor 38, and the augmentor 42 can use are depicted.
  • the document parser 30 can use a clustering algorithm 30a and/or a rules-based algorithm 30b to cluster recognized text lines into raw text blocks and recognized vectors into raw shape blocks, and to recognize raw images.
  • the page layout extractor 34 can use a clustering algorithm 34a, a rules-based algorithm 34b, and/or recursive x-y cut 34c to extract page layout elements.
  • the page layout extractor 34 can use recursive x-y cut 34c to extract and bound columns.
  • Recursive x-y cut 34c can also segment a page into separate zones by projecting filled pixels onto the x- and y-axes of the page in a recursive way and splitting a given zone into multiple zones based on the energy distribution of the resulting histogram.
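The recursive x-y cut idea can be sketched roughly as follows; the binary page raster, the zero-energy gap criterion, and the depth limit are simplifying assumptions for the sketch.

```python
# Sketch of recursive x-y cut: project filled cells of a binary page raster
# onto one axis, split the zone at runs of empty projection bins, then
# recurse on the other axis.

def projection(grid, rows, cols, axis):
    # axis 0: sum each row within the zone; axis 1: sum each column
    if axis == 0:
        return [sum(grid[r][c] for c in range(*cols)) for r in range(*rows)]
    return [sum(grid[r][c] for r in range(*rows)) for c in range(*cols)]

def split_at_gaps(profile, offset):
    """Return (start, end) index ranges of nonzero runs in the profile."""
    zones, start = [], None
    for i, v in enumerate(profile):
        if v and start is None:
            start = i
        elif not v and start is not None:
            zones.append((offset + start, offset + i))
            start = None
    if start is not None:
        zones.append((offset + start, offset + len(profile)))
    return zones

def xy_cut(grid, rows, cols, axis=0, depth=2):
    profile = projection(grid, rows, cols, axis)
    offset = rows[0] if axis == 0 else cols[0]
    runs = split_at_gaps(profile, offset)
    if depth == 0 or len(runs) <= 1:
        return [(rows, cols)]
    zones = []
    for a, b in runs:
        sub_rows, sub_cols = ((a, b), cols) if axis == 0 else (rows, (a, b))
        zones.extend(xy_cut(grid, sub_rows, sub_cols, axis=1 - axis, depth=depth - 1))
    return zones

# Two filled bands separated by white space split into two zones
grid = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
zones = xy_cut(grid, (0, 6), (0, 4))
```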
  • the page layout extractor 34 can use a clustering algorithm 34a to extract tables, clustering lines into separate bins (i.e., clusters of lines) to identify the number of different tables in a page and processing the bins to identify the boundaries of each of the tables, or by segmenting table columns/rows based on alignment for borderless tables.
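The line-binning step can be sketched as grouping the vertical positions of detected horizontal rules, starting a new bin whenever the gap exceeds a threshold; the threshold value here is an assumption.

```python
# Illustrative line binning: sort horizontal rule positions and start a new
# bin at each large vertical gap. Each bin approximates one table's extent.

def bin_lines(y_positions, gap=50):
    bins = []
    for y in sorted(y_positions):
        if bins and y - bins[-1][-1] <= gap:
            bins[-1].append(y)      # close to previous rule: same table
        else:
            bins.append([y])        # large gap: a new table begins
    return bins

# Two tables: rules around y=100-140 and y=400-430
rules = [100, 120, 140, 400, 415, 430]
tables = bin_lines(rules)
boundaries = [(b[0], b[-1]) for b in tables]
```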
  • the content structure extractor 38 can use a clustering algorithm 38a, conditional random field (CRF) 38b, TF-IDF (term frequency-inverse document frequency) 38c, an ensemble classifier 38d, and/or a rules-based algorithm 38e to determine a document structure model 40.
  • As the clustering algorithm 38a, agglomerative clustering can be used to identify the headings and their hierarchy.
  • the clustering algorithm 38a can classify generated clusters as headings or content based on average member length and noise ratio, for example.
  • Raw text blocks can be clustered and arranged in a hierarchy based on visual design elements which can comprise style and geometry properties, including font and typographical properties such as the typeface, letterforms, font size, font family, style, color, orientation, length, contrast, position, spacing, dimension, and others.
  • Other visual design elements, such as paragraph alignment, line lengths, column widths, and graphic shapes, can also be used to cluster and arrange the document into a hierarchy. For example, decisive contrast between disparate visual design elements can be used to identify hierarchical relationships within the document.
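A minimal sketch of this style-based clustering, using font size as a stand-in for the richer feature set named above; the tolerance and the content-versus-heading rule based on average member length are illustrative assumptions.

```python
# Group text blocks by font size (1-D single-linkage clustering), then rank
# groups to infer heading levels; long average text marks body content.

def cluster_by_font_size(blocks, tol=1.0):
    clusters = []
    for blk in sorted(blocks, key=lambda b: -b["size"]):
        if clusters and clusters[-1][-1]["size"] - blk["size"] <= tol:
            clusters[-1].append(blk)
        else:
            clusters.append([blk])
    return clusters

def label_levels(clusters, content_len=120):
    """Largest font = level 1; clusters with long average text are content."""
    labeled = []
    for level, cl in enumerate(clusters, start=1):
        avg_len = sum(len(b["text"]) for b in cl) / len(cl)
        role = "content" if avg_len > content_len else f"heading-{level}"
        labeled.append((role, cl))
    return labeled

blocks = [
    {"size": 18.0, "text": "Setting up printer"},
    {"size": 14.0, "text": "Installing the print cartridge"},
    {"size": 10.0, "text": "Lift the cover and insert the cartridge. " * 5},
]
levels = label_levels(cluster_by_font_size(blocks))
```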
  • an ensemble classifier 38d including a conditional random field 38da, decision tree 38db, and support vector machine 38dc can also be used by the content structure extractor 38 to extract headings.
  • the content structure extractor 38 can also use TF-IDF 38c to extract headings.
  • TF-IDF 38c on the boldness style can be used to detect inline headings.
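One possible reading of this idea, sketched below as an assumption rather than the patent's exact method: score each bold span by the rarity of its terms across the document's paragraphs, and flag high-scoring spans as inline headings.

```python
# TF-IDF-style scoring of bold spans: terms that are rare across the
# document's paragraphs get high inverse-document-frequency weight, so a
# distinctive bold phrase scores higher than ordinary bolded words.

import math

def idf(term, docs):
    n = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + n)) + 1

def span_score(span, docs):
    terms = span.lower().split()
    return sum(idf(t, docs) for t in terms) / len(terms)

paragraphs = [
    "note the tray must be empty before printing",
    "warning do not touch the print head",
    "the tray holds up to 100 sheets of paper the tray is removable",
]
bold_spans = ["Warning", "the"]
scores = {s: span_score(s, paragraphs) for s in bold_spans}
inline_headings = [s for s in bold_spans if scores[s] > 1.0]
```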
  • the content structure extractor 38 can also use a conditional random field 38b to extract titles.
  • the augmentor 42 can use a conditional random field 42a and/or a rules-based algorithm 42b to augment each node of the headings tree with metadata including entities and synonyms.
  • Entities can be one of three classes: named entities, generic entities (key phrases), and action entities.
  • In FIG. 2A, an example of the raw blocks 32 outputted by the document parser 30 is depicted.
  • pages of a printer manual are inputted into the document parser 30, which processes the pages to extract raw text blocks, raw shape blocks, and raw image blocks using algorithms which can include a clustering algorithm 30a and/or a rules-based algorithm 30b, for example.
  • the document parser 30 has extracted raw text blocks, raw shape blocks (lines), and a raw image which will be processed further by the page layout extractor 34, the content structure extractor 38, and augmentor 42.
  • Raw text blocks can be identified based on style and geometry properties, which can include font family, font size, and font color.
  • Raw shape blocks can be identified based on style and geometry properties that include vector properties, which in turn can include lines, rectangles, arcs, paths, fills, strokes, colors, and positions.
  • Raw image blocks can be identified based on style and geometry properties, which can include positions, dimensions, encodings, and color spaces.
  • In FIG. 2B, a document layout model 36 outputted by the page layout extractor 34 is depicted.
  • the raw blocks 32 outputted by the document parser 30 are inputted into the page layout extractor 34, which processes the raw blocks 32 to extract page layout elements and output a document layout model 36, which can include paragraphs, lists, columns, watermarks, charts, tables, images, captions, table-of-contents, and indices.
  • the page layout extractor 34 has extracted a table, two lists, four paragraphs, a header, a column, and an image caption, which will be further processed by the content structure extractor 38.
  • the algorithms used to process the raw blocks 32 can include a clustering algorithm 34a, a rules-based algorithm 34b, and/or recursive x-y cut 34c, for example.
  • the document structure model 40 is a hierarchical graph data structure in which nodes are organized in a traversable tree, each node representing a dialog.
  • Each heading may include one or more subheadings, and each subheading in turn may include one or more sub-subheadings.
  • Such headings, subheadings, and sub-subheadings generally correspond to main topics, subtopics, and sub-subtopics, respectively, where the sub-subtopic is a topic within a subtopic, and a subtopic is a topic within a main topic.
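Such a heading tree could be represented and traversed as follows; the node class is an illustrative assumption, and the example mirrors the document structure discussed with FIG. 3.

```python
# Sketch of the traversable heading tree: each node carries a heading and
# optional content, and walk() yields the outline in document order.

class StructureNode:
    def __init__(self, heading, content="", children=None):
        self.heading = heading
        self.content = content
        self.children = children or []

    def walk(self, depth=0):
        """Yield (depth, heading) pairs in document order."""
        yield depth, self.heading
        for child in self.children:
            yield from child.walk(depth + 1)

doc = StructureNode("Printing photos", children=[
    StructureNode("Printing photos on normal paper", children=[
        StructureNode("Windows instructions"),
        StructureNode("iOS instructions"),
    ]),
    StructureNode("Printing photos on photo paper"),
])
outline = list(doc.walk())
```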
  • the document layout model 36 outputted by the page layout extractor 34 is inputted into the content structure extractor 38, which processes the document layout model 36 to determine and output a document structure model 40, which can include titles, headers, footers, headings, heading trees, and content blocks.
  • the content structure extractor 38 has determined a document structure model 40 including three levels, three headings, and four content blocks.
  • the algorithms used to determine the document structure model 40 can include a clustering algorithm 38a, a conditional random field 38b, a TF-IDF 38c, an ensemble classifier 38d, and/or a rules-based algorithm 38e, for example.
  • Entities 44 can be one of three classes: named entities, generic entities (key phrases), and action entities.
  • the document structure model 40 outputted by the content structure extractor 38 is inputted into the augmentor 42, which annotates the document structure model 40 with entities 44 and synonyms 46.
  • the augmentor 42 has identified five action entities defining actions ("press", "plug", "turn on", "open", and "open"), three generic entities defining key phrases ("access point", "print cartridge", "print head"), and three sets of synonyms ("WAP" and "AP" for "access point", "ink cartridge" for "print cartridge", and "printing head", "inkjet head", and "printer head" for "print head").
  • the algorithms used to annotate the document structure model 40 can include a conditional random field 42a and/or a rules-based algorithm 42b, for example.
  • the rules-based algorithm 42b can identify verbs in the imperative mood as action verbs that can be classified as action entities ("press", "plug", "turn on", "open").
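This imperative-verb rule can be sketched as follows; the verb list and the particle-joining heuristic (so that "turn on" is kept as one entity) are illustrative assumptions.

```python
# Rules-based sketch: a step that begins with a base-form verb is treated
# as imperative, and that verb (plus an "on"/"off" particle, if present)
# is tagged as an action entity.

BASE_VERBS = {"press", "plug", "turn", "open", "lift", "insert"}

def action_entities(steps):
    found = []
    for step in steps:
        words = step.lower().split()
        if words and words[0] in BASE_VERBS:
            if len(words) > 1 and words[1] in {"on", "off"}:
                found.append(" ".join(words[:2]))  # e.g. "turn on"
            else:
                found.append(words[0])
    return found

steps = ["Press the power button", "Turn on the access point", "The light blinks"]
entities = action_entities(steps)
```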
  • In FIG. 2E, an example of a dialog decision tree 50 outputted by the dialog extractor 48 is depicted.
  • the dialog extractor 48 infers a dialog decision tree 50 based on the document structure model 40 annotated with entities 44 and synonyms 46.
  • the decision tree 50 includes a hierarchy with three levels, where the "setting up printer" element in the first level branches into the "installing the print cartridge" and "default settings of access point" elements in the second level, and the "installing the print cartridge" element branches into the "installing print head", "installing color ink tanks", and "installing b/w ink tank" elements in the third level.
  • Dialogs can come in three types: entity-based, binary-type, or a mixture of entity-based and binary-type.
  • Figs. 3-5 show examples of these dialogs corresponding to various exemplary document structures.
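The three dialog types can be illustrated with a small classifier over branching factors. The heuristic, drawn from the discussion of Figs. 3-5 (two branches at a node suggests a binary-type dialog, more than two suggests an entity-based one), is an assumption for this sketch.

```python
# Classify a dialog tree's type from the branching factor at each
# non-leaf node: all-binary, all-entity-based, or a mixture of the two.

def dialog_type(branching_factors):
    kinds = set()
    for n in branching_factors:
        if n == 2:
            kinds.add("binary")
        elif n > 2:
            kinds.add("entity-based")
    if len(kinds) == 2:
        return "mixture"
    return kinds.pop() if kinds else "none"

# FIG. 5's shape: level 1 branches in two, level 2 branches in four
kind = dialog_type([2, 4])
```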
  • In FIG. 3, an example of a second document structure model 140 and a second dialog decision tree 150 with three levels is depicted.
  • a printer manual is the input document which is processed by the document parser to generate raw blocks, which are processed by the page layout extractor to generate a document layout model, which is processed by the content structure extractor to generate a document structure model, which is annotated with entities and synonyms by the augmentor.
  • In this document structure model 140, there is one heading ("Printing photos"), two subheadings ("Printing photos on normal paper", "Printing photos on photo paper") branching from the heading, and four sub-subheadings (two "Windows instructions", two "iOS instructions") branching from the subheadings.
  • a decision tree 150 is inferred based on the structure of this document structure model 140, so that the heading and each of the sub-headings and sub-subheadings are translated into nodes in the decision tree 150, each node representing a dialog.
  • the main topic inferred from the heading is "printing photos".
  • the subtopics inferred from the subheadings are "normal paper" and "photo paper", as the augmentor identifies "printing" as the common intent between the two subheadings and "normal paper" and "photo paper" as different entities.
  • the inferred sub-subtopics are "Windows" and "iOS", as the augmentor identifies the paper type as the common intent between the sub-subheadings for each subheading and "Windows" and "iOS" as different entities.
  • the dialogs are binary type, as there are two topics that branch from each node in levels 1 and 2.
  • In FIG. 4, an example of a third document structure model 240 and a third dialog decision tree 250 with two levels is depicted.
  • a sound device manual is the input document which is processed by the document parser to generate raw blocks, which are processed by the page layout extractor to generate a document layout model, which is processed by the content structure extractor to generate a document structure model, which is annotated with entities and synonyms by the augmentor.
  • In this document structure model 240, there is one heading ("Sound troubleshooting") and two subheadings ("no sound", "poor sound").
  • the main topic inferred from the heading is "sound troubleshooting".
  • the subtopics inferred from the subheadings are "no sound" and "poor sound", as the augmentor identifies "sound troubleshooting" as the common intent between the two subheadings.
  • the dialogs are binary type, as there are two topics that branch from each node in levels 1 and 2.
  • Webpages from a password-protected website are the input documents which are processed by the document parser to generate raw blocks, which are processed by the page layout extractor to generate a document layout model, which is processed by the content structure extractor to generate a document structure model, which is annotated with entities and synonyms by the augmentor.
  • In this document structure model 340, there is one heading ("can't remember your password"), two subheadings ("view password", "reset password"), and four sub-subheadings ("Safari", "Chrome", "Thunderbird", "Internet Explorer").
  • the main topic inferred from the heading is "can't remember your password”.
  • the subtopics inferred from the subheadings are "view password" and "reset password", as the augmentor identifies "can't remember password" as the common intent between the two subheadings.
  • the sub-subtopics inferred from the subheading "view password" are "Safari", "Chrome", "Thunderbird", and "Internet Explorer", as the augmentor identifies "view password" as the common intent among the four different sub-subheadings.
  • the dialog in level 1 is considered binary-type, as there are two topics that branch from the node in level 1. There are two factors impacting the given situation: the first factor is related to the action the user is willing to take, and this factor can be either "viewing the forgotten password" or "resetting the password".
  • the dialog in level 2 is considered entity-based, as there are more than two topics that branch from the node in level 2 - the steps that the user takes depend on the browser type, for which there are four possible values in this example: "Safari", "Chrome", "Thunderbird", and "Internet Explorer".
  • In FIG. 6, a flowchart is illustrated of a first method 400 for extracting dialog information from documents.
  • the following description of method 400 is provided with reference to the software and hardware components described above, and can be implemented on such hardware and software. It will be appreciated that method 400 also can be performed using other suitable hardware and software components.
  • At step 402, input documents are received.
  • At step 404, the input documents are parsed into raw blocks.
  • At step 406, visual design elements are extracted from the raw blocks.
  • At step 408, a content structure is generated from the extracted visual design elements.
  • The content structure is annotated with entities.
  • At step 412, a dialog decision tree is generated based on the annotated content structure.
  • At step 414, the dialog decision tree is outputted.
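The steps above can be sketched as a linear pipeline; every stage function here is a toy placeholder standing in for the corresponding component of FIG. 1A, not the patent's implementation.

```python
# Pipeline sketch of method 400: parse -> extract visual elements ->
# build content structure -> annotate -> build dialog tree.

def parse(documents):                    # parse documents into raw blocks
    return [{"doc": d, "blocks": [d]} for d in documents]

def extract_visual_elements(raw):        # extract visual design elements
    return [{"style": "heading", "text": b} for r in raw for b in r["blocks"]]

def build_content_structure(elems):      # generate a content structure
    return {"headings": [e["text"] for e in elems]}

def annotate(structure):                 # annotate with entities
    structure["entities"] = []
    return structure

def build_dialog_tree(structure):        # generate the dialog decision tree
    return {"root": structure["headings"][0] if structure["headings"] else None}

def generate_dialog_tree(documents):
    raw = parse(documents)
    elems = extract_visual_elements(raw)
    structure = annotate(build_content_structure(elems))
    return build_dialog_tree(structure)

tree = generate_dialog_tree(["Printer manual"])
```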
  • In FIG. 7, a flowchart is illustrated of a second method 500 for extracting dialog information from documents.
  • the following description of method 500 is provided with reference to the software and hardware components described above, and can be implemented on such hardware and software. It will be appreciated that method 500 also can be performed using other suitable hardware and software components.
  • a document is inputted into the processor.
  • the input document is parsed.
  • raw blocks and shapes are generated.
  • vector images are extracted.
  • preprocessing is performed.
  • noisy blocks are tagged.
  • tables are detected.
  • charts are detected.
  • noisy blocks are tagged.
  • lines are constructed.
  • page zoning is performed.
  • diagrams are detected.
  • indices are detected.
  • table-of-contents are detected.
  • the document is classified as a FAQ (frequently-asked questions).
  • headers and footers are detected.
  • At step 534, bullet characters are detected.
  • At step 536, tables are detected.
  • At step 538, global list patterns are identified.
  • At step 540, paragraphs are constructed.
  • At step 542, captions are detected.
  • At step 544, lines are extracted.
  • At step 546, explicit table-of-contents headings are mapped.
  • At step 548, titles are extracted.
  • At step 550, a document tree is constructed.
  • At step 552, questions and answers are extracted.
  • At step 554, a knowledge tree is built.
  • the knowledge tree is augmented with entities and synonyms.
  • the augmented knowledge tree is outputted.
  • In FIG. 8, a flowchart is illustrated of a third method 600 for extracting dialog information from documents.
  • the following description of method 600 is provided with reference to the software and hardware components described above, and can be implemented on such hardware and software. It will be appreciated that method 600 also can be performed using other suitable hardware and software components.
  • a document is inputted into the processor.
  • the input document is parsed.
  • raw blocks are generated.
  • vector images are detected.
  • tables are detected.
  • charts are detected.
  • lines are constructed.
  • page zoning is performed.
  • diagrams are detected.
  • table-of-contents and indices are detected.
  • tables are detected.
  • paragraphs are constructed.
  • captions are detected.
  • lists are extracted.
  • titles are extracted.
  • a document tree is constructed.
  • questions and answers are extracted.
  • a knowledge tree is built.
  • the knowledge tree is augmented with entities and synonyms.
  • the augmented knowledge tree is outputted.
  • FIG. 9 schematically shows a non-limiting embodiment of a computing system 700 that can enact one or more of the processes described above.
  • Computing system 700 is shown in simplified form.
  • Computing system 700 can embody the computing device 12 or client computing device 52 described above.
  • Computing system 700 can take the form of one or more personal computers, server computers, tablet computers, home-entertainment computers, network computing devices, gaming devices, mobile computing devices, mobile communication devices (e.g., smart phone), and/or other computing devices, and wearable computing devices such as smart wristwatches and head mounted augmented reality devices.
  • Computing system 700 includes a logic processor 702, volatile memory 704, and a non-volatile storage device 706.
  • Computing system 700 can optionally include a display subsystem 708, input subsystem 710, communication subsystem 712, and/or other components not shown in earlier Figures.
  • Logic processor 702 includes one or more physical devices configured to execute instructions.
  • the logic processor can be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions can be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.
  • the logic processor can include one or more physical processors (hardware) configured to execute software instructions. Additionally or alternatively, the logic processor can include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. Processors of the logic processor 702 can be single-core or multi-core, and the instructions executed thereon can be configured for sequential, parallel, and/or distributed processing. Individual components of the logic processor optionally can be distributed among two or more separate devices, which can be remotely located and/or configured for coordinated processing. Aspects of the logic processor can be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration. It will be understood that, in such a case, these virtualized aspects can be run on different physical logic processors of various different machines.
  • Non-volatile storage device 706 includes one or more physical devices configured to hold instructions executable by the logic processors to implement the methods and processes described herein. When such methods and processes are implemented, the state of non-volatile storage device 706 can be transformed — e.g., to hold different data.
  • Non-volatile storage device 706 can include physical devices that are removable and/or built in.
  • Non-volatile storage device 706 can include optical memory (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory (e.g., ROM, EPROM, EEPROM, FLASH memory, etc.), and/or magnetic memory (e.g., hard-disk drive, floppy-disk drive, tape drive, MRAM, etc.), or other mass storage device technology.
  • Non-volatile storage device 706 can include nonvolatile, dynamic, static, read/write, read-only, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. It will be appreciated that non-volatile storage device 706 is configured to hold instructions even when power is cut to the non-volatile storage device 706.
  • Volatile memory 704 can include physical devices that include random access memory. Volatile memory 704 is typically utilized by logic processor 702 to temporarily store information during processing of software instructions. It will be appreciated that volatile memory 704 typically does not continue to store instructions when power is cut to the volatile memory 704.
  • logic processor 702, volatile memory 704, and non-volatile storage device 706 can be integrated together into one or more hardware-logic components.
  • hardware-logic components can include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC / ASICs), program- and application-specific standard products (PSSP / ASSPs), system-on-a-chip (SOC) devices, and complex programmable logic devices (CPLDs), for example.
  • module can be used to describe an aspect of computing system 700 typically implemented in software by a processor to perform a particular function using portions of volatile memory, which function involves transformative processing that specially configures the processor to perform the function.
  • a module, program, or engine can be instantiated via logic processor 702 executing instructions held by non-volatile storage device 706, using portions of volatile memory 704. It will be understood that different modules, programs, and/or engines can be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine can be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc.
  • module can encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
  • display subsystem 708 can be used to present a visual representation of data held by non-volatile storage device 706.
  • the visual representation can take the form of a graphical user interface (GUI).
  • the state of display subsystem 708 can likewise be transformed to visually represent changes in the underlying data.
  • Display subsystem 708 can include one or more display devices utilizing virtually any type of technology. Such display devices can be combined with logic processor 702, volatile memory 704, and/or non-volatile storage device 706 in a shared enclosure, or such display devices can be peripheral display devices.
  • input subsystem 710 can comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, or game controller.
  • the input subsystem can comprise or interface with selected natural user input (NUI) componentry.
  • Such componentry can be integrated or peripheral, and the transduction and/or processing of input actions can be handled on- or off-board.
  • Example NUI componentry can include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; as well as electric-field sensing componentry for assessing brain activity; and/or any other suitable sensor.
  • communication subsystem 712 can be configured to communicatively couple various computing devices described herein with each other, and with other devices.
  • Communication subsystem 712 can include wired and/or wireless communication devices compatible with one or more different communication protocols.
  • the communication subsystem can be configured for communication via a wireless telephone network, or a wired or wireless local- or wide-area network, such as an HDMI over Wi-Fi connection.
  • the communication subsystem can allow computing system 700 to send and/or receive messages to and/or from other devices via a network such as the Internet.
  • One aspect provides a dialog tree generation system comprising a processor, and a memory storing instructions that, when executed by the processor, cause the system to receive documents; parse the documents into raw blocks; extract visual design elements from the raw blocks; generate a content structure from the extracted visual design elements; generate at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and output the dialog decision tree.
  • the extracted visual design elements may include style and geometry properties of the raw blocks.
  • the raw blocks may comprise raw text blocks, raw image blocks, and raw shape blocks; the style and geometry properties of the raw text blocks may include color spaces, font properties, and text dimensions; the style and geometry properties of the raw shape blocks may include vector properties; and the style and geometry properties of the raw image blocks may include positions, dimensions, encodings, and color spaces.
  • the extracted visual design elements may comprise a page layout including at least one of paragraphs, lists, columns, charts, tables, captions, table of contents, or indices.
  • the tables may be determined by clustering lines of the raw blocks into separate bins and detecting boundaries of the bins or segmenting table columns/rows based on alignment for borderless tables.
  • the outputted dialog decision tree may be executed by a virtual assistant program to thereby present an interactive chat dialog to a user that proceeds according to the dialog decision tree.
  • the content structure may include titles, headers, footers, headings, and heading trees.
  • the content structure may be a hierarchical graph data structure in which nodes are organized in a traversable tree.
  • the content structure may be annotated with action entities defining actions and generic entities defining key phrases.
  • Another aspect provides a dialog tree generation method comprising receiving documents; parsing the documents into raw blocks; extracting visual design elements from the raw blocks; generating a content structure from the extracted visual design elements; generating at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and outputting the dialog decision tree.
  • the extracted visual design elements may include style and geometry properties of the raw blocks.
  • the raw blocks may comprise raw text blocks, raw image blocks, and raw shape blocks;
  • the extracted visual design elements may include style and geometry properties of the raw blocks;
  • the style and geometry properties of the raw text blocks may include color spaces, font properties, and text dimensions;
  • the style and geometry properties of the raw shape blocks may include vector properties;
  • the style and geometry properties of the raw image blocks may include positions, dimensions, encodings, and color spaces.
  • the extracted visual design elements may include a page layout.
  • the page layout may include at least one of paragraphs, lists, columns, charts, tables, captions, table of contents, or indices.
  • the tables may be determined by clustering lines of the raw blocks into separate bins and detecting boundaries of the bins or segmenting table columns/rows based on alignment for borderless tables.
  • the content structure may include titles, headers, footers, headings, and heading trees.
  • the content structure may be a hierarchical graph data structure in which nodes are organized in a traversable tree.
  • the content structure may be annotated with action entities defining actions and generic entities defining key phrases.
  • Another aspect provides a computing system comprising a processor, and a memory storing instructions that, when executed by the processor, cause the system to implement a virtual assistant configured to execute a dialog decision tree to thereby present an interactive chat dialog to a user that proceeds according to the dialog decision tree, the dialog decision tree being generated by a dialog tree generation software system, configured to receive documents; parse the documents into raw text blocks, raw image blocks, and raw shape blocks; process the raw text blocks, raw image blocks, and raw shape blocks to generate a document layout model; process the document layout model to generate a document structure model; annotate the document structure with entities and synonyms; generate at least the dialog decision tree based on the annotated document structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and output the dialog decision tree to a location accessible by the virtual assistant.
  • visual design elements of the document layout model may be analyzed to generate the document structure model comprising a hierarchical graph data structure in which nodes are organized in a traversable tree, each node representing a dialog.
  • the configurations and/or approaches described herein are exemplary in nature, and these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible.
  • the specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.

Abstract

A dialog tree generation system is provided, including a processor, and a memory storing instructions that, when executed by the processor, cause the system to receive documents, parse the documents into raw blocks, extract visual design elements from the raw blocks, generate a content structure from the extracted visual design elements, generate at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy, and output the dialog decision tree.

Description

SYSTEMS AND METHODS FOR GENERATING DIALOG TREES
BACKGROUND
[0001] Interactive question answering systems are widely used to provide customer support for products and services. Dialog systems are interactive question answering systems that access information from structured databases to answer questions from customers. Customers may interact with virtual agents or bots when interfacing with these dialog systems, in which a decision tree is traversed as the customer is asked a series of focused questions toward a final answer to the customer's question. Conventionally, human operators are employed to develop such dialog systems manually, which requires significant time, cost, and effort.
SUMMARY
[0002] In view of the above, a dialog tree generation system is provided, including a processor, and a memory storing instructions that, when executed by the processor, cause the system to receive documents, parse the documents into raw blocks, extract visual design elements from the raw blocks, generate a content structure from the extracted visual design elements, generate at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy, and output the dialog decision tree.
[0003] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] Fig. 1A is a general schematic diagram illustrating a dialog tree generation system according to an example embodiment of the subject disclosure.
[0005] Fig. 1B is a general schematic diagram illustrating the tools that can be used by the document parser, page layout extractor, content structure extractor, and augmentor of the dialog tree generation system of Fig. 1A.
[0006] Fig. 2A is an illustration of an exemplary document parsing output of the document parser of the dialog tree generation system of Figs. 1A and 1B.
[0007] Fig. 2B is an illustration of an exemplary page layout extraction output of the page layout extractor of the dialog tree generation system of Figs. 1A and 1B.
[0008] Fig. 2C is an illustration of an exemplary content structure extraction output of the content structure extractor of the dialog tree generation system of Figs. 1A and 1B.
[0009] Fig. 2D is an illustration of an exemplary augmentation output of the augmentor of the dialog tree generation system of Figs. 1A and 1B.
[0010] Fig. 2E is an illustration of an exemplary dialog decision tree outputted by the dialog extractor of the dialog tree generation system of Figs. 1A and 1B.
[0011] Fig. 3 is an illustration of an exemplary second document structure model and an exemplary second dialog decision tree outputted by the dialog extractor of the dialog tree generation system of Figs. 1A and 1B.
[0012] Fig. 4 is an illustration of an exemplary third document structure model and an exemplary third dialog decision tree outputted by the dialog extractor of the dialog tree generation system of Figs. 1A and 1B.
[0013] Fig. 5 is an illustration of an exemplary fourth document structure model and an exemplary fourth dialog decision tree outputted by the dialog extractor of the dialog tree generation system of Figs. 1A and 1B.
[0014] Fig. 6 is a flowchart illustrating a method for generating a dialog decision tree according to a first example embodiment of the subject disclosure.
[0015] Fig. 7 is a flowchart illustrating another method for generating a dialog decision tree according to a second example embodiment of the subject disclosure.
[0016] Fig. 8 is a flowchart illustrating yet another method for generating a dialog decision tree according to a third example embodiment of the subject disclosure.
[0017] Fig. 9 is a schematic diagram illustrating an exemplary computing system that can be used to implement the dialog tree generation system of Figs. 1A and 1B.
DETAILED DESCRIPTION
[0018] In view of the above issues, automated systems and methods are provided to generate highly precise dialog systems that require minimal editing by human operators. Referring to Fig. 1A, a dialog tree generation system 10 is provided for use in extracting dialog information from documents. The dialog tree generation system 10 comprises a dialog extraction computing device 12 including a processor 14, volatile memory 16, an input/output module 18, and non-volatile memory 24 storing an application 26 including a document parser 30, a page layout extractor 34, a content structure extractor 38, an augmentor 42, and a dialog extractor 48.
[0019] A bus 20 can operatively couple the processor 14, the input/output module 18, and the volatile memory 16 to the non-volatile memory 24. Although the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48 are depicted as hosted (i.e., executed) at one computing device 12, it will be appreciated that these components can alternatively be hosted across a plurality of computing devices to which the computing device 12 is communicatively coupled via a network 22.
[0020] As one example of one such other computing device, a client computing device 52 can be provided, which is operatively coupled to the computing device 12. In some examples, the network 22 can take the form of a local area network (LAN), wide area network (WAN), wired network, wireless network, personal area network, or a combination thereof, and can include the Internet.
[0021] The computing device 12 comprises a processor 14 and a non-volatile memory 24 configured to store the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48. Non-volatile memory 24 is memory that retains stored instructions and data even in the absence of externally applied power, such as FLASH memory, a hard disk, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), etc. The instructions include one or more programs, including the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48, and data used by such programs sufficient to perform the operations described herein. In response to execution by the processor 14, the instructions cause the processor 14 to execute the document parser 30, the page layout extractor 34, the content structure extractor 38, the augmentor 42, and the dialog extractor 48.
[0022] The processor 14 is a microprocessor that includes one or more of a central processing unit (CPU), a graphical processing unit (GPU), an application specific integrated circuit (ASIC), a system on chip (SOC), a field-programmable gate array (FPGA), a logic circuit, or other suitable type of microprocessor configured to perform the functions recited herein. The system 10 further includes volatile memory 16 such as random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), etc., which temporarily stores data only for so long as power is applied during execution of programs.
[0023] In one example, the client computing device 52 can execute an application client 26A to send input documents 28 to the computing device 12 as user input 56, and subsequently receive a dialog decision tree 50 from the computing device 12 as output. The dialog decision tree 50 can be associated with a virtual assistant 51 which is a program that is installed on the computing device 12. The virtual assistant 51 can also be referred to as a virtual agent, a chatter bot, a chatbot, a digital personal assistant, or an automated online assistant, for example. The computing device 12 executes the virtual assistant 51. The dialog decision tree 50 is executed by the virtual assistant 51 to thereby present an interactive chat dialog to a user that proceeds according to the dialog decision tree 50. The input documents can include PDF files, HTML files, PowerPoint files, Word documents, and OCR (optical character recognition) documents, for example. The application client 26A can be coupled to a graphical user interface 54 of the client computing device 52 to display graphical output 58 based on the dialog decision tree 50 outputted from the dialog extractor 48.
[0024] In this example, the document parser 30 receives input documents 28 as input, parses the input documents into raw blocks 32, and outputs the raw blocks 32, which can include raw text blocks, raw image blocks, and/or raw shape blocks. A raw text block is a body of text that is grouped together on a page of the input documents 28, a raw image block is an area on a page of the input documents 28 where a raw image is located, and a raw shape block is an area on a page of the input documents 28 where a raw shape is located. These raw blocks can be shaped like a square or a rectangle. The page layout extractor 34 receives the raw blocks 32 as input, processes the raw blocks 32 to extract visual design elements, and outputs a document layout model 36 comprising the visual design elements. The content structure extractor 38 receives the document layout model 36 as input, processes the document layout model 36 by analyzing the visual design elements in the document layout model 36, generates a document structure model 40 comprising a content structure from the extracted visual design elements, and outputs the document structure model 40. The augmentor 42 receives the document structure model 40 as input, annotates the document structure model 40, and outputs the document structure model 40 annotated with entities 44 and synonyms 46. The dialog extractor 48 receives the document structure model 40 annotated with entities 44 and synonyms 46 as input, processes it, generates a dialog decision tree 50 based on the extracted content structure, and outputs the dialog decision tree 50 to the application client 26A, a location accessible by the virtual assistant 51. As described below, the dialog decision tree may include a plurality of nodes organized into a hierarchy.
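The dataflow among the five stages above can be sketched as a chain of functions. This is a minimal illustrative sketch only: all function names and the simplified dictionary shapes are assumptions, and each stage body is a stand-in for the actual algorithms (clustering, x-y cut, CRF, etc.) described elsewhere in this disclosure.

```python
# Illustrative sketch of the five-stage dialog tree generation pipeline.
# Each stage body is a stand-in; only the dataflow matters here.

def parse_documents(documents):
    # Document parser 30: split each page into raw text/image/shape blocks.
    return [{"type": "text", "text": d} for d in documents]

def extract_layout(raw_blocks):
    # Page layout extractor 34: group raw blocks into layout elements.
    return {"paragraphs": raw_blocks, "tables": [], "lists": []}

def extract_structure(layout_model):
    # Content structure extractor 38: build a heading hierarchy.
    return {"heading": "root", "children": [],
            "content": layout_model["paragraphs"]}

def augment(structure_model):
    # Augmentor 42: annotate nodes with entities and synonyms.
    structure_model["entities"] = []
    structure_model["synonyms"] = {}
    return structure_model

def extract_dialog_tree(annotated_model):
    # Dialog extractor 48: translate the heading tree into dialog nodes.
    return {"dialog": annotated_model["heading"],
            "children": annotated_model["children"]}

def generate_dialog_tree(documents):
    return extract_dialog_tree(augment(extract_structure(
        extract_layout(parse_documents(documents)))))

tree = generate_dialog_tree(["How to set up the printer."])
print(tree["dialog"])  # root
```

Each stage consumes only its predecessor's output, which is why the stages can also be hosted on separate networked computing devices as noted above.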
[0025] Referring to Fig. 1B, the processing tools that the document parser 30, the page layout extractor 34, the content structure extractor 38, and the augmentor 42 can use are depicted.
[0026] The document parser 30 can use a clustering algorithm 30a and/or a rules-based algorithm 30b to cluster recognized text lines into raw text blocks, cluster recognized vectors into raw shape blocks, and recognize raw images.
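One possible concrete reading of the rules-based line clustering is sketched below: vertically adjacent recognized text lines are merged into the same raw text block whenever the gap between them falls under a threshold. The data shape and gap threshold are illustrative assumptions, not the disclosed implementation.

```python
# Hypothetical rules-based grouping of recognized text lines into raw
# text blocks: lines whose vertical gap is below max_gap join one block.

def cluster_lines_into_blocks(lines, max_gap=6.0):
    """lines: list of (y_top, text) tuples; returns lists of line texts."""
    blocks, current = [], []
    prev_y = None
    for y, text in sorted(lines):
        if prev_y is not None and y - prev_y > max_gap:
            blocks.append(current)   # gap too large: start a new block
            current = []
        current.append(text)
        prev_y = y
    if current:
        blocks.append(current)
    return blocks

lines = [(10, "Setting up"), (14, "the printer"), (40, "Step 1: unpack")]
print(cluster_lines_into_blocks(lines))
# [['Setting up', 'the printer'], ['Step 1: unpack']]
```

A real parser would additionally compare font properties and horizontal extents before merging, but the gap rule captures the basic idea.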
[0027] The page layout extractor 34 can use a clustering algorithm 34a, a rules-based algorithm 34b, and/or recursive x-y cut 34c to extract page layout elements. For example, the page layout extractor 34 can use recursive x-y cut 34c to extract and bound columns. Recursive x-y cut 34c can also segment a page into separate zones by projecting filled pixels onto the x-y axes of the page in a recursive way and splitting a given zone into multiple zones based on the energy distribution of the projection histogram. The page layout extractor 34 can use a clustering algorithm 34a to extract tables, clustering lines into separate bins (i.e., clusters of lines) to identify the number of different tables in a page and processing the bins to identify the boundaries of each of the tables, or, for borderless tables, segmenting table columns/rows based on alignment.
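A minimal recursive x-y cut can be sketched as follows. The binary page raster and the rule of splitting at any empty projection bin are simplifying assumptions; production systems typically threshold gap widths and pixel energies instead.

```python
# Minimal recursive x-y cut over a binary page raster (lists of 0/1 rows):
# project filled pixels onto one axis, split at empty projection runs,
# and recurse on the other axis until a zone is indivisible.

def split_zones(page, axis):
    """Return (start, end) runs where the projection onto the given axis
    (0 = rows, 1 = columns) is non-zero."""
    if axis == 0:
        proj = [sum(row) for row in page]
    else:
        proj = [sum(col) for col in zip(*page)]
    zones, start = [], None
    for i, v in enumerate(proj + [0]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            zones.append((start, i))
            start = None
    return zones

def xy_cut(page, axis=0):
    zones = split_zones(page, axis)
    if len(zones) == 1:
        other = split_zones(page, 1 - axis)
        if len(other) <= 1:
            return [page]          # no gap on either axis: a leaf zone
        zones, axis = other, 1 - axis
    result = []
    for a, b in zones:
        sub = page[a:b] if axis == 0 else [row[a:b] for row in page]
        result.extend(xy_cut(sub, 1 - axis))
    return result

page = [
    [1, 1, 0, 0, 1, 1],   # two columns of text...
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],   # ...an empty band...
    [1, 1, 1, 1, 1, 1],   # ...and a full-width block
]
print(len(xy_cut(page)))  # 3 zones
```

Cropping each zone to a contiguous run before recursing guarantees termination, since a cropped zone has no remaining gap along the axis it was cut on.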
[0028] The content structure extractor 38 can use a clustering algorithm 38a, conditional random field (CRF) 38b, TF-IDF (term frequency-inverse document frequency) 38c, an ensemble classifier 38d, and/or a rules-based algorithm 38e to determine a document structure model 40. As a clustering algorithm 38a, agglomerative clustering can be used to identify the headings and their hierarchy. The clustering algorithm 38a can classify generated clusters as headings or content based on average member length and noise ratio, for example. Raw text blocks can be clustered and arranged in a hierarchy based on visual design elements, which can comprise style and geometry properties, including font and typographical properties such as the typeface, letterforms, font size, font family, style, color, orientation, length, contrast, position, spacing, and dimension. Other visual design elements, such as paragraph alignment, line lengths, column widths, and graphic shapes, can also be used to cluster and arrange the document into a hierarchy. For example, decisive contrast between disparate visual design elements can be used to identify hierarchical relationships within the document. In some embodiments, an ensemble classifier 38d including a conditional random field 38da, decision tree 38db, and support vector machine 38dc can also be used by the content structure extractor 38 to extract headings.
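The heading/content classification described above can be illustrated with a deliberately reduced sketch: font size stands in for the full style-and-geometry feature vector, a single-pass merge stands in for agglomerative clustering, and clusters whose members are short on average are labeled headings. The feature choice, tolerance, and length threshold are all assumptions for illustration.

```python
# Hypothetical sketch: group text blocks by font size and label clusters
# with short average member length as headings, the rest as content.

def cluster_blocks(blocks, size_tol=1.0):
    """blocks: list of (font_size, text). Merges blocks whose font size is
    within size_tol of the previous cluster's first member."""
    clusters = []
    for size, text in sorted(blocks, reverse=True):
        if clusters and abs(clusters[-1][0][0] - size) <= size_tol:
            clusters[-1].append((size, text))
        else:
            clusters.append([(size, text)])
    return clusters

def label_clusters(clusters, max_heading_len=40):
    labeled = []
    for members in clusters:
        avg_len = sum(len(t) for _, t in members) / len(members)
        kind = "heading" if avg_len <= max_heading_len else "content"
        labeled.append((kind, members))
    return labeled

blocks = [(18.0, "Setting up the printer"),
          (12.0, "Unpack the printer and remove all protective tape " * 2),
          (14.0, "Installing the print cartridge")]
for kind, members in label_clusters(cluster_blocks(blocks)):
    print(kind, [t for _, t in members])
```

The relative ordering of the heading clusters by font size is what yields the heading hierarchy (larger sizes become higher levels) in this simplified picture.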
[0029] The content structure extractor 38 can also use TF-IDF 38c to extract headings. For example, TF-IDF 38c on the boldness style can be used to detect inline headings. The content structure extractor 38 can also use a conditional random field 38b to extract titles.
[0030] The augmentor 42 can use a conditional random field 42a and/or a rules-based algorithm 42b to augment each node of the headings tree with metadata including entities and synonyms. Entities can be one of three classes: named entities, generic entities (key phrases), and action entities.
[0031] Referring to Figs. 2A-E, an example use case illustrating aspects of the present disclosure will now be presented. Referring to Fig. 2A, an example of raw blocks 32 outputted by the document parser 30 is depicted. In this example, pages of a printer manual are inputted into the document parser 30, which processes the pages to extract raw text blocks, raw shape blocks, and raw image blocks using algorithms which can include a clustering algorithm 30a and/or a rules-based algorithm 30b, for example. In this example, the document parser 30 has extracted raw text blocks, raw shape blocks (lines), and a raw image, which will be processed further by the page layout extractor 34, the content structure extractor 38, and the augmentor 42. Raw text blocks can be identified based on style and geometry properties, which can include font family, font size, and font color. Raw shape blocks can be identified based on style and geometry properties that include vector properties, which in turn can include lines, rectangles, arcs, paths, fills, strokes, colors, and positions. Raw image blocks can be identified based on style and geometry properties, which can include positions, dimensions, encodings, and color spaces.
[0032] Referring to Fig. 2B, an example of a document layout model 36 outputted by the page layout extractor 34 is depicted. In this example, the raw blocks 32 outputted by the document parser 30 are inputted into the page layout extractor 34, which processes the raw blocks 32 to extract page layout elements and output a document layout model 36, which can include paragraphs, lists, columns, watermarks, charts, tables, images, captions, table-of-contents, and indices. In this example, the page layout extractor 34 has extracted a table, two lists, four paragraphs, a header, a column, and an image caption, which will be further processed by the content structure extractor 38. The algorithms used to process the raw blocks 32 can include a clustering algorithm 34a, a rules-based algorithm 34b, and/or recursive x-y cut 34c, for example.
[0033] Referring to Fig. 2C, an example of a document structure model 40 outputted by the content structure extractor 38 is depicted. The document structure model 40 is a hierarchical graph data structure in which nodes are organized in a traversable tree, each node representing a dialog. Each heading may include one or more subheadings, and each subheading in turn may include one or more sub-subheadings. Such headings, subheadings, and sub-subheadings generally correspond to main topics, subtopics, and sub-subtopics, respectively, where the sub-subtopic is a topic within a subtopic, and a subtopic is a topic within a main topic.
[0034] In this example, the document layout model 36 outputted by the page layout extractor 34 is inputted into the content structure extractor 38, which processes the document layout model 36 to determine and output a document structure model 40, which can include titles, headers, footers, headings, heading trees, and content blocks. The content structure extractor 38 has determined a document structure model 40 including three levels, three headings, and four content blocks. The algorithms used to process the document structure model 40 can include a clustering algorithm 38a, a conditional random field 38b, a TF-IDF 38c, an ensemble classifier 38d, and/or a rules-based algorithm 38e, for example.
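The traversable heading tree described above can be represented with a simple node type, built here from a flat outline of (level, heading) pairs taken from the Fig. 3 example. The `DocNode` name, the outline input format, and the stack-based builder are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Minimal stand-in for the hierarchical document structure model: each
# node holds a heading and its child nodes, forming a traversable tree.

@dataclass
class DocNode:
    heading: str
    children: list = field(default_factory=list)

def build_tree(outline):
    """outline: (level, heading) pairs in reading order; level 1 = main
    topic, 2 = subtopic, 3 = sub-subtopic."""
    root = DocNode("document")
    stack = [(0, root)]
    for level, heading in outline:
        while stack[-1][0] >= level:   # pop back up to this level's parent
            stack.pop()
        node = DocNode(heading)
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root

outline = [(1, "Printing photos"),
           (2, "Printing photos on normal paper"),
           (3, "Windows instructions"),
           (3, "iOS instructions"),
           (2, "Printing photos on photo paper")]
root = build_tree(outline)
print([c.heading for c in root.children[0].children])
# ['Printing photos on normal paper', 'Printing photos on photo paper']
```

Because every node carries an ordered list of children, a depth-first walk over this structure visits main topics, subtopics, and sub-subtopics in reading order, which is the traversal the dialog extractor relies on.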
[0035] Referring to Fig. 2D, an example of a document structure model 40 annotated with entities 44 and synonyms 46 and outputted by the augmentor 42 is depicted. Entities 44 can be one of three classes: named entities, generic entities (key phrases), and action entities. In this example, the document structure model 40 outputted by the content structure extractor 38 is inputted into the augmentor 42, which annotates the document structure model 40 with entities 44 and synonyms 46. The augmentor 42 has identified five action entities defining actions ("press", "plug", "turn on", "open", and "open"), three generic entities defining key phrases ("access point", "print cartridge", "print head"), and three sets of synonyms (WAP and AP for "access point", ink cartridge for "print cartridge", printing head, inkjet head, and printer head for "print head"). The algorithms used to annotate the document structure model 40 can include a conditional random field 42a and/or a rules-based algorithm 42b, for example. For example, the rules-based algorithm 42b can identify verbs in the imperative mood as action verbs that can be classified as action entities (press, plug, turn on, open).
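The imperative-verb rule can be sketched as follows. Treating a sentence-initial verb from a small lexicon as an imperative is a deliberately crude stand-in for the disclosed rules-based algorithm 42b; the lexicon and sentence splitting are assumptions.

```python
# Hypothetical rules-based tagger: a sentence-initial verb from a small,
# assumed lexicon is treated as an imperative and tagged as an action
# entity, as in "Press the power button."

ACTION_VERBS = {"press", "plug", "turn", "open", "install", "remove"}

def action_entities(step_text):
    entities = []
    for sentence in step_text.split("."):
        words = sentence.strip().lower().split()
        if words and words[0] in ACTION_VERBS:
            entities.append(words[0])
    return entities

print(action_entities("Press the power button. Plug in the cable."))
# ['press', 'plug']
```

A production tagger would use part-of-speech information (e.g., from the CRF 42a) rather than a fixed verb list, so that unseen imperatives are still caught.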
[0036] Referring to Fig. 2E, an example of dialog decision tree 50 outputted by the dialog extractor 48 is depicted. The dialog extractor 48 infers a dialog decision tree 50 based on the annotated document structure model 40 with entities 44 and synonyms 46. In this example, the decision tree 50 includes a hierarchy with three levels, where the "setting up printer" element in the first level branches into the "installing the print cartridge" and the "default settings of access point" elements in the second level, and the "installing the print cartridge" element branches into the "installing print head", "installing color ink tanks", and "installing b/w ink tank" elements in the third level.
[0037] Dialogs can come in three types: entity based, binary type, or a mixture of entity based and binary type. Figs. 3-5 show examples of these dialogs corresponding to various exemplary document structures.
[0038] Referring to Fig. 3, an example of a second document structure model 140 and a second dialog decision tree 150 with three levels is depicted. A printer manual is the input document, which is processed by the document parser to generate raw blocks, which are processed by the page layout extractor to generate a document layout model, which is processed by the content structure extractor to generate a document structure model, which is annotated with entities and synonyms by the augmentor. In this document structure model 140, there is one heading ("Printing photos"), two subheadings ("Printing photos on normal paper", "Printing photos on photo paper") branching from the heading, and four sub-subheadings (two "Windows instructions", two "iOS instructions") branching from the subheadings. A decision tree 150 is inferred based on the structure of this document structure model 140, so that the heading and each of the subheadings and sub-subheadings are translated into nodes in the decision tree 150, each node representing a dialog. In this example, the main topic inferred from the heading is "printing photos". The subtopics inferred from the subheadings are "normal paper" and "photo paper", as the augmentor identifies "printing" as the common intent between the two subheadings and "normal paper" and "photo paper" as different entities. For each subheading, the inferred sub-subtopics are "Windows" and "iOS", as the augmentor identifies the paper type as the common intent between the sub-subheadings for each subheading and "Windows" and "iOS" as different entities. For levels 1 and 2, the dialogs are binary type, as there are two topics that branch from each node in levels 1 and 2.
[0039] Referring to Fig. 4, an example of a third document structure model 240 and a third dialog decision tree 250 with two levels is depicted. A sound device manual is the input document, which is processed by the document parser to generate raw blocks, which are processed by the page layout extractor to generate a document layout model, which is processed by the content structure extractor to generate a document structure model, which is annotated with entities and synonyms by the augmentor. In this document structure model 240, there is one heading ("Sound troubleshooting") and two subheadings ("no sound", "poor sound"). In this example, the main topic inferred from the heading is "sound troubleshooting". The subtopics inferred from the subheadings are "no sound" and "poor sound", as the augmentor identifies "sound troubleshooting" as the common intent between the two subheadings. For levels 1 and 2, the dialogs are binary type, as there are two topics that branch from each node in levels 1 and 2.
[0040] Referring to Fig. 5, an example of a fourth document structure model 340 and a fourth dialog decision tree 350 with three levels is depicted. Webpages from a password-protected website are the input documents, which are processed by the document parser to generate raw blocks, which are processed by the page layout extractor to generate a document layout model, which is processed by the content structure extractor to generate a document structure model, which is annotated with entities and synonyms by the augmentor. In this document structure model 340, there is one heading ("can't remember your password"), two subheadings ("view password", "reset password"), and four sub-subheadings ("Safari", "Chrome", "Thunderbird", "Internet Explorer"). In this example, the main topic inferred from the heading is "can't remember your password".
The subtopics inferred from the subheadings are "view password" and "reset password", as the augmentor identifies "can't remember password" as the common intent between the two subheadings. The sub-subtopics inferred from the subheading "view password" are "Safari", "Chrome", "Thunderbird", and "Internet Explorer", as the augmentor identifies "view password" as the common intent among the four different sub-subheadings. The dialog in level 1 is considered binary type, as there are two topics that branch from the node in level 1. There are two factors impacting the given situation: the first factor relates to the action the user is willing to take, which can be either "viewing the forgotten password" or "resetting the password". The dialog in level 2 is considered entity-based, as there are more than two topics that branch from the node in level 2. The second factor is the browser type, which determines the steps the user takes, and for which there are four possible values in this example: "Safari", "Chrome", "Thunderbird", and "Internet Explorer".
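The dialog-typing rule applied in these examples, binary when exactly two topics branch from a node and entity-based when more than two do, can be expressed as a small helper. The function name and the "terminal" label for leaf nodes are illustrative assumptions.

```python
# Hypothetical sketch of the dialog-typing rule: classify a node by the
# number of topics branching from it.

def dialog_type(branch_topics):
    if len(branch_topics) == 2:
        return "binary"         # two branches, e.g. view vs. reset password
    elif len(branch_topics) > 2:
        return "entity-based"   # keyed on a distinguishing entity, e.g. browser
    return "terminal"           # leaf node: no further branching

print(dialog_type(["view password", "reset password"]))  # prints "binary"
print(dialog_type(["Safari", "Chrome", "Thunderbird", "Internet Explorer"]))  # prints "entity-based"
```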
[0041] Referring to Fig. 6, a flowchart is illustrated of a first method 400 for extracting dialog information from documents. The following description of method 400 is provided with reference to the software and hardware components described above, and can be implemented on such hardware and software. It will be appreciated that method 400 also can be performed using other suitable hardware and software components.
[0042] At step 402, input documents are received. At step 404, the input documents are parsed into raw blocks. At step 406, visual design elements are extracted from the raw blocks. At step 408, a content structure is generated from the extracted visual design elements. At step 410, the content structure is annotated with entities. At step 412, a dialog decision tree is generated based on the annotated content structure. At step 414, the dialog decision tree is outputted.
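The step sequence of method 400 can be viewed as a simple data-flow pipeline. The following sketch is a hypothetical rendering of that flow; the stage functions are placeholders standing in for the parser, extractors, augmentor, and tree generator, not actual implementations.

```python
# Minimal sketch of method 400's data flow, with each stage injected as a
# callable so the sequencing itself is explicit.

def extract_dialog_tree(documents,
                        parse, extract_visuals, build_structure,
                        annotate, generate_tree):
    raw_blocks = parse(documents)              # step 404: parse into raw blocks
    visuals = extract_visuals(raw_blocks)      # step 406: visual design elements
    structure = build_structure(visuals)       # step 408: content structure
    annotated = annotate(structure)            # step 410: annotate with entities
    return generate_tree(annotated)            # steps 412-414: tree generation/output

# Identity-like stand-ins just to demonstrate that data flows stage to stage:
tree = extract_dialog_tree(["manual.pdf"], list, tuple, list, tuple, list)
```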
[0043] Referring to Fig. 7, a flowchart is illustrated of a second method 500 for extracting dialog information from documents. The following description of method 500 is provided with reference to the software and hardware components described above, and can be implemented on such hardware and software. It will be appreciated that method 500 also can be performed using other suitable hardware and software components.
[0044] At step 502, a document is inputted into the processor. At step 504, the input document is parsed. At step 506, raw blocks and shapes are generated. At step 508, vector images are extracted. At step 510, preprocessing is performed. At step 512, noisy blocks are tagged. At step 514, tables are detected. At step 516, charts are detected. At step 518, noisy blocks are tagged. At step 520, lines are constructed. At step 522, page zoning is performed. At step 524, diagrams are detected. At step 526, indices are detected. At step 528, table-of-contents are detected. At step 530, the document is classified as a FAQ (frequently-asked questions). At step 532, headers and footers are detected. At step 534, bullet characters are detected. At step 536, tables are detected. At step 538, global list patterns are identified. At step 540, paragraphs are constructed. At step 542, captions are detected. At step 544, lines are extracted. At step 546, explicit table-of-contents headings are mapped. At step 548, titles are extracted. At step 550, a document tree is constructed. At step 552, questions and answers are extracted. At step 554, a knowledge tree is built. At step 556, the knowledge tree is augmented with entities and synonyms. At step 558, the augmented knowledge tree is outputted.
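The long stage list of method 500 can be organized as an ordered pipeline run over a shared parse state. The sketch below uses a condensed, illustrative subset of the stage names; the registry of identity functions is a stand-in for real implementations and is an assumption of this example.

```python
# Hypothetical sketch: method 500's stages as an ordered pipeline, where each
# stage receives and returns the evolving parse state.

STAGES = [
    "tag_noisy_blocks", "detect_tables", "detect_charts",
    "construct_lines", "zone_pages", "detect_diagrams",
    "detect_toc_and_indices", "construct_paragraphs",
    "detect_captions", "extract_titles", "build_document_tree",
    "extract_qa", "build_knowledge_tree", "augment_with_entities",
]

def run_pipeline(state, stages, registry):
    for name in stages:
        state = registry[name](state)   # each stage transforms the state
    return state

# A real registry would map stage names to the detectors and constructors
# described above; identity functions here only demonstrate the sequencing.
registry = {name: (lambda s: s) for name in STAGES}
result = run_pipeline({"doc": "manual.pdf"}, STAGES, registry)
```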
[0045] Referring to Fig. 8, a flowchart is illustrated of a third method 600 for extracting dialog information from documents. The following description of method 600 is provided with reference to the software and hardware components described above, and can be implemented on such hardware and software. It will be appreciated that method 600 also can be performed using other suitable hardware and software components.
[0046] At step 602, a document is inputted into the processor. At step 604, the input document is parsed. At step 606, raw blocks are generated. At step 608, vector images are detected. At step 610, tables are detected. At step 612, charts are detected. At step 614, lines are constructed. At step 616, page zoning is performed. At step 618, diagrams are detected. At step 620, table-of-contents and indices are detected. At step 622, tables are detected. At step 624, paragraphs are constructed. At step 626, captions are detected. At step 628, lists are extracted. At step 630, titles are extracted. At step 632, a document tree is constructed. At step 634, questions and answers are extracted. At step 636, a knowledge tree is built. At step 638, the knowledge tree is augmented with entities and synonyms. At step 640, the augmented knowledge tree is outputted.

[0047] The above-described systems and methods can be used to implement a dialog extractor that can extract dialog information to output a highly precise dialog decision tree which requires minimal editing by a human operator. Accordingly, the process for generating dialog decision trees for interactive question answering systems can be automated to increase coverage and efficiency compared to tedious and time-consuming conventional methods. Coverage and efficiency can be increased since the above-described systems and methods can cover more documents in less time and at less expense than using human-generated dialogs.
[0048] Fig. 9 schematically shows a non-limiting embodiment of a computing system 700 that can enact one or more of the processes described above. Computing system 700 is shown in simplified form. Computing system 700 can embody the computing device 12 or client computing device 52 described above. Computing system 700 can take the form of one or more personal computers, server computers, tablet computers, home-entertainment computers, network computing devices, gaming devices, mobile computing devices, mobile communication devices (e.g., smart phone), and/or other computing devices, including wearable computing devices such as smart wristwatches and head-mounted augmented reality devices.
[0049] Computing system 700 includes a logic processor 702, volatile memory 704, and a non-volatile storage device 706. Computing system 700 can optionally include a display subsystem 708, input subsystem 710, communication subsystem 712, and/or other components not shown in earlier Figures.
[0050] Logic processor 702 includes one or more physical devices configured to execute instructions. For example, the logic processor can be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions can be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.
[0051] The logic processor can include one or more physical processors (hardware) configured to execute software instructions. Additionally or alternatively, the logic processor can include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. Processors of the logic processor 702 can be single-core or multi-core, and the instructions executed thereon can be configured for sequential, parallel, and/or distributed processing. Individual components of the logic processor optionally can be distributed among two or more separate devices, which can be remotely located and/or configured for coordinated processing. Aspects of the logic processor can be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration. In such a case, it will be understood that these virtualized aspects can be run on different physical logic processors of various different machines.
[0052] Non-volatile storage device 706 includes one or more physical devices configured to hold instructions executable by the logic processors to implement the methods and processes described herein. When such methods and processes are implemented, the state of non-volatile storage device 706 can be transformed (e.g., to hold different data).

[0053] Non-volatile storage device 706 can include physical devices that are removable and/or built in. Non-volatile storage device 706 can include optical memory (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory (e.g., ROM, EPROM, EEPROM, FLASH memory, etc.), and/or magnetic memory (e.g., hard-disk drive, floppy-disk drive, tape drive, MRAM, etc.), or other mass storage device technology. Non-volatile storage device 706 can include nonvolatile, dynamic, static, read/write, read-only, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. It will be appreciated that non-volatile storage device 706 is configured to hold instructions even when power is cut to the non-volatile storage device 706.
[0054] Volatile memory 704 can include physical devices that include random access memory. Volatile memory 704 is typically utilized by logic processor 702 to temporarily store information during processing of software instructions. It will be appreciated that volatile memory 704 typically does not continue to store instructions when power is cut to the volatile memory 704.
[0055] Aspects of logic processor 702, volatile memory 704, and non-volatile storage device 706 can be integrated together into one or more hardware-logic components. Such hardware-logic components can include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC / ASICs), program- and application-specific standard products (PSSP / ASSPs), system-on-a-chip (SOC) devices, and complex programmable logic devices (CPLDs), for example.
[0056] The terms “module,” “program,” and “engine” can be used to describe an aspect of computing system 700 typically implemented in software by a processor to perform a particular function using portions of volatile memory, which function involves transformative processing that specially configures the processor to perform the function. Thus, a module, program, or engine can be instantiated via logic processor 702 executing instructions held by non-volatile storage device 706, using portions of volatile memory 704. It will be understood that different modules, programs, and/or engines can be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine can be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” can encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
[0057] When included, display subsystem 708 can be used to present a visual representation of data held by non-volatile storage device 706. The visual representation can take the form of a graphical user interface (GUI). As the herein described methods and processes change the data held by the non-volatile storage device, and thus transform the state of the non-volatile storage device, the state of display subsystem 708 can likewise be transformed to visually represent changes in the underlying data. Display subsystem 708 can include one or more display devices utilizing virtually any type of technology. Such display devices can be combined with logic processor 702, volatile memory 704, and/or non-volatile storage device 706 in a shared enclosure, or such display devices can be peripheral display devices.
[0058] When included, input subsystem 710 can comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem can comprise or interface with selected natural user input (NUI) componentry. Such componentry can be integrated or peripheral, and the transduction and/or processing of input actions can be handled on- or off-board. Example NUI componentry can include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; as well as electric-field sensing componentry for assessing brain activity; and/or any other suitable sensor.
[0059] When included, communication subsystem 712 can be configured to communicatively couple various computing devices described herein with each other, and with other devices. Communication subsystem 712 can include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem can be configured for communication via a wireless telephone network, or a wired or wireless local- or wide-area network, such as a HDMI over Wi-Fi connection. In some embodiments, the communication subsystem can allow computing system 700 to send and/or receive messages to and/or from other devices via a network such as the Internet.
[0060] It will be appreciated that “and/or” as used herein refers to the logical disjunction operation, and thus A and/or B has the following truth table.
A      B      A and/or B
True   True   True
True   False  True
False  True   True
False  False  False
[0061] Further, it will be appreciated that the terms "includes," "including," "has," "contains," variants thereof, and other similar words used in either the detailed description or the claims are intended to be inclusive in a manner similar to the term "comprising" as an open transition word without precluding any additional or other elements.
[0062] The following paragraphs provide additional support for the claims of the subject application. One aspect provides a dialog tree generation system comprising a processor, and a memory storing instructions that, when executed by the processor, cause the system to receive documents; parse the documents into raw blocks; extract visual design elements from the raw blocks; generate a content structure from the extracted visual design elements; generate at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and output the dialog decision tree. In this aspect, additionally or alternatively, the extracted visual design elements may include style and geometry properties of the raw blocks. In this aspect, additionally or alternatively, the raw blocks may comprise raw text blocks, raw image blocks, and raw shape blocks; the style and geometry properties of the raw text blocks may include color spaces, font properties, and text dimensions; the style and geometry properties of the raw shape blocks may include vector properties; and the style and geometry properties of the raw image blocks may include positions, dimensions, encodings, and color spaces. In this aspect, additionally or alternatively, the extracted visual design elements may comprise a page layout including at least one of paragraphs, lists, columns, charts, tables, captions, table of contents, or indices. In this aspect, additionally or alternatively, the tables may be determined by clustering lines of the raw blocks into separate bins and detecting boundaries of the bins or segmenting table columns/rows based on alignment for borderless tables. In this aspect, additionally or alternatively, the outputted dialog decision tree may be executed by a virtual assistant program to thereby present an interactive chat dialog to a user that proceeds according to the dialog decision tree. 
In this aspect, additionally or alternatively, the content structure may include titles, headers, footers, headings, and heading trees. In this aspect, additionally or alternatively, the content structure may be a hierarchical graph data structure in which nodes are organized in a traversable tree. In this aspect, additionally or alternatively, the content structure may be annotated with action entities defining actions and generic entities defining key phrases.
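The table-detection idea stated above, clustering lines of the raw blocks into separate bins and detecting bin boundaries, can be sketched for the row-binning half as follows. The tolerance value and line representation are illustrative assumptions; column segmentation by alignment (for borderless tables) is not shown.

```python
# Hedged sketch: cluster text lines by vertical position into row bins, so
# that the gaps between bins suggest table row boundaries.

def cluster_rows(lines, y_tolerance=3):
    """Group (y, text) lines whose y-coordinates fall within a tolerance."""
    rows = []
    for y, text in sorted(lines):
        if rows and abs(rows[-1][0][0] - y) <= y_tolerance:
            rows[-1].append((y, text))   # same bin as the previous line
        else:
            rows.append([(y, text)])     # start a new row bin
    return rows

# Toy input: two visual rows, each split across two text lines.
lines = [(10, "Model"), (11, "X100"), (30, "Price"), (31, "$99")]
rows = cluster_rows(lines)
print(len(rows))  # prints 2
```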
[0063] Another aspect provides a dialog tree generation method comprising receiving documents; parsing the documents into raw blocks; extracting visual design elements from the raw blocks; generating a content structure from the extracted visual design elements; generating at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and outputting the dialog decision tree. In this aspect, additionally or alternatively, the extracted visual design elements may include style and geometry properties of the raw blocks. In this aspect, additionally or alternatively, the raw blocks may comprise raw text blocks, raw image blocks, and raw shape blocks; the extracted visual design elements may include style and geometry properties of the raw blocks; the style and geometry properties of the raw text blocks may include color spaces, font properties, and text dimensions; the style and geometry properties of the raw shape blocks may include vector properties; and the style and geometry properties of the raw image blocks may include positions, dimensions, encodings, and color spaces. In this aspect, additionally or alternatively, the extracted visual design elements may include a page layout. In this aspect, additionally or alternatively, the page layout may include at least one of paragraphs, lists, columns, charts, tables, captions, table of contents, or indices. In this aspect, additionally or alternatively, the tables may be determined by clustering lines of the raw blocks into separate bins and detecting boundaries of the bins or segmenting table columns/rows based on alignment for borderless tables. In this aspect, additionally or alternatively, the content structure may include titles, headers, footers, headings, and heading trees. 
In this aspect, additionally or alternatively, the content structure may be a hierarchical graph data structure in which nodes are organized in a traversable tree. In this aspect, additionally or alternatively, the content structure may be annotated with action entities defining actions and generic entities defining key phrases.
[0064] Another aspect provides a computing system comprising a processor, and a memory storing instructions that, when executed by the processor, cause the system to implement a virtual assistant configured to execute a dialog decision tree to thereby present an interactive chat dialog to a user that proceeds according to the dialog decision tree, the dialog decision tree being generated by a dialog tree generation software system, configured to receive documents; parse the documents into raw text blocks, raw image blocks, and raw shape blocks; process the raw text blocks, raw image blocks, and raw shape blocks to generate a document layout model; process the document layout model to generate a document structure model; annotate the document structure with entities and synonyms; generate at least the dialog decision tree based on the annotated document structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and output the dialog decision tree to a location accessible by the virtual assistant. In this aspect, additionally or alternatively, visual design elements of the document layout model may be analyzed to generate the document structure model comprising a hierarchical graph data structure in which nodes are organized in a traversable tree, each node representing a dialog. [0065] It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.
[0066] The subject matter of the present disclosure includes all novel and non- obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

Claims

1. A dialog tree generation system comprising: a processor, and a memory storing instructions that, when executed by the processor, cause the system to: receive documents; parse the documents into raw blocks; extract visual design elements from the raw blocks; generate a content structure from the extracted visual design elements; generate at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and output the dialog decision tree.
2. The dialog tree generation system of claim 1, wherein the extracted visual design elements include style and geometry properties of the raw blocks.
3. The dialog tree generation system of claim 2, wherein the raw blocks comprise raw text blocks, raw image blocks, and raw shape blocks; the style and geometry properties of the raw text blocks include color spaces, font properties, and text dimensions; the style and geometry properties of the raw shape blocks include vector properties; and the style and geometry properties of the raw image blocks include positions, dimensions, encodings, and color spaces.
4. The dialog tree generation system of claim 1, wherein the extracted visual design elements comprise a page layout including at least one of paragraphs, lists, columns, charts, tables, captions, table of contents, or indices.
5. The dialog tree generation system of claim 4, wherein the tables are determined by clustering lines of the raw blocks into separate bins and detecting boundaries of the bins or segmenting table columns/rows based on alignment for borderless tables.
6. The dialog tree generation system of claim 1, wherein the outputted dialog decision tree is executed by a virtual assistant program to thereby present an interactive chat dialog to a user that proceeds according to the dialog decision tree.
7. The dialog tree generation system of claim 1, wherein the content structure includes titles, headers, footers, headings, and heading trees.
8. The dialog tree generation system of claim 1, wherein the content structure is a hierarchical graph data structure in which nodes are organized in a traversable tree.
9. The dialog tree generation system of claim 1, wherein the content structure is annotated with action entities defining actions and generic entities defining key phrases.
10. A dialog tree generation method comprising: receiving documents; parsing the documents into raw blocks; extracting visual design elements from the raw blocks; generating a content structure from the extracted visual design elements; generating at least a dialog decision tree based on the extracted content structure, the dialog decision tree comprising a plurality of nodes organized into a hierarchy; and outputting the dialog decision tree.
11. The dialog tree generation method of claim 10, wherein the extracted visual design elements include style and geometry properties of the raw blocks.
12. The dialog tree generation method of claim 10, wherein the raw blocks comprise raw text blocks, raw image blocks, and raw shape blocks; the extracted visual design elements include style and geometry properties of the raw blocks; the style and geometry properties of the raw text blocks include color spaces, font properties, and text dimensions; the style and geometry properties of the raw shape blocks include vector properties; and the style and geometry properties of the raw image blocks include positions, dimensions, encodings, and color spaces.
13. The dialog tree generation method of claim 10, wherein the extracted visual design elements include a page layout.
14. The dialog tree generation method of claim 13, wherein the page layout includes at least one of paragraphs, lists, columns, charts, tables, captions, table of contents, or indices.
15. The dialog tree generation method of claim 14, wherein the tables are determined by clustering lines of the raw blocks into separate bins and detecting boundaries of the bins or segmenting table columns/rows based on alignment for borderless tables.
PCT/US2022/019215 2021-03-30 2022-03-08 Systems and methods for generating dialog trees WO2022211982A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163168117P 2021-03-30 2021-03-30
US63/168,117 2021-03-30
US17/334,543 2021-05-28
US17/334,543 US20220318497A1 (en) 2021-03-30 2021-05-28 Systems and methods for generating dialog trees

Publications (1)

Publication Number Publication Date
WO2022211982A1 true WO2022211982A1 (en) 2022-10-06

Family

ID=80937121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/019215 WO2022211982A1 (en) 2021-03-30 2022-03-08 Systems and methods for generating dialog trees

Country Status (1)

Country Link
WO (1) WO2022211982A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190132264A1 (en) * 2017-10-30 2019-05-02 International Business Machines Corporation Generation of a chatbot interface for an application programming interface
