CN115004175A - System, device and method for document query - Google Patents

System, device and method for document query

Info

Publication number
CN115004175A
CN115004175A (application CN202080094631.8A)
Authority
CN
China
Prior art keywords
documents
paragraphs
document
data
computer
Prior art date
Legal status
Pending
Application number
CN202080094631.8A
Other languages
Chinese (zh)
Inventor
J-P·多戴尔
Z·黄
X·马
R·M·纳拉帕蒂
K·拉贾戈帕兰
M·萨伊尼
S·森古普塔
S·K·辛格
D·索利奥斯
A·苏塔尼亚
D·王
Z·王
B·项
P·许
Y·袁
J·L·凯兹曼
N·库纳拉
A·M·格兰特
C·诺盖拉多斯桑托斯
P·吴
Current Assignee
Amazon Technologies Inc
Original Assignee
Amazon Technologies Inc
Priority date
Priority claimed from US16/698,080 external-priority patent/US11475067B2/en
Priority claimed from US16/697,964 external-priority patent/US11314819B2/en
Priority claimed from US16/698,027 external-priority patent/US20210158209A1/en
Priority claimed from US16/697,979 external-priority patent/US11526557B2/en
Priority claimed from US16/697,948 external-priority patent/US11366855B2/en
Application filed by Amazon Technologies Inc filed Critical Amazon Technologies Inc
Publication of CN115004175A publication Critical patent/CN115004175A/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/338: Presentation of query results
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/9032: Query formulation
    • G06F16/90332: Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Techniques for searching documents are described. An exemplary method comprises: receiving a document search query; querying at least one index based on the document search query to identify matching data; acquiring the identified matching data; determining one or more of a top-ranked paragraph and a top-ranked document from a set of documents based on one or more calls to one or more machine learning models, the calls based at least on the acquired matching data and the document search query; and returning one or more items from the top-ranked paragraph and the proper subset of documents.

Description

System, device and method for document query
Background
Enterprises generate more data than ever before. Attempting to find relevant data within all of the generated data is a difficult task. Traditional search solutions rely on keyword-based document analysis to find specific terms in the data, a general approach that is inherently limited because finer-grained content cannot be "understood".
Drawings
Various embodiments according to the present disclosure will be described with reference to the accompanying drawings, in which:
FIG. 1 illustrates an embodiment of an enterprise search service.
FIG. 2 illustrates an embodiment of an enterprise search service for providing inference functionality.
FIG. 3 illustrates an embodiment of an enterprise search service for providing inference functionality.
FIG. 4 illustrates an embodiment of a method for performing inference (document search).
FIG. 5 illustrates an embodiment of an improved display of results of inference queries.
FIG. 6 illustrates an embodiment of a method for performing an improved display of results of an inference query.
FIG. 7 illustrates an embodiment of an enterprise search service 102 for providing ingestion functionality.
FIG. 8 illustrates an embodiment of a method for performing ingestion of one or more documents.
Fig. 9 illustrates an embodiment of an exemplary reserved field for use with ingestion.
FIG. 10 illustrates an embodiment of a graphical user interface for updating/adding/removing reserved fields for ingestion.
FIG. 11 illustrates an embodiment of a model building system.
FIG. 12 illustrates an embodiment of a method for model management.
Fig. 13 illustrates an embodiment of a graphical user interface for use in active learning of questions and answers for training a machine learning model.
FIG. 14 illustrates an embodiment of a graphical user interface for use in active learning of document rankings for training a machine learning model.
Fig. 15 illustrates an embodiment of a method of active learning for training a machine learning model.
FIG. 16 illustrates an embodiment of a method for training and using a question generation model.
FIG. 17 illustrates a first exemplary set of candidate questions generated by a question generation model trained on known question-answer pairs.
FIG. 18 illustrates a second exemplary set of candidate questions generated by a question generation model trained on known question-answer pairs.
FIG. 19 illustrates an embodiment of a method for training a question generation model.
Fig. 20 illustrates an exemplary provider network environment, according to some embodiments.
Fig. 21 is a block diagram of an exemplary provider network that provides storage services and hardware virtualization services to customers, according to some embodiments.
Fig. 22 is a block diagram illustrating an exemplary computer system that may be used in some embodiments.
Detailed Description
The present disclosure relates to methods, devices, systems, and non-transitory computer-readable storage media for indexing and searching text-based documents using machine learning. Documents are obtained, text is extracted from the documents and indexed, and so on, so that the documents may be searched using term-based or question-based queries. These text-based documents, including Frequently Asked Questions (FAQs), are searched according to a user query for one or more top-ranked (most relevant) documents, one or more top-ranked paragraphs (where a paragraph is a finite number of consecutive lines extracted from a given document), and/or one or more top-ranked FAQs.
Detailed herein are embodiments of an enterprise search service that enable users to intuitively search unstructured data using natural language. The service returns specific and personalized answers to questions, providing the end user with an experience closer to interacting with a human expert.
With keyword-based document analysis, it is difficult to determine any kind of context for the content. Embodiments detailed herein access an internally or externally hosted corpus of documents and index those documents. The index helps to provide context for a document and gives unstructured documents the appearance of a "structure". In some cases, the set of reserved fields used for indexing provides a more uniform context for tags in documents. Thus, embodiments of the enterprise search service described below allow both factoid and non-factoid (e.g., how, what, why) questions to be answered by extracting relevant information from a corpus of documents. Factoid questions (e.g., "what is the latest version of software X") can often be answered in a few words. In some embodiments, the enterprise search service allows short questions that can be answered in a few lines, such as those found in frequently-asked-question documents, to be answered (e.g., "what is the difference between the IP default gateway, the IP default network, and the IP route 0.0.0.0/0 commands?"). In some embodiments, the enterprise search service allows descriptive questions to be answered by identifying an entire relevant document, where the answer is the entire document. For example, "what is the Brazilian CLI?"
Another drawback of some search systems is how content related to the search results is displayed to the user. While some search results boldface particular words or phrases in the results, this does little to help the user identify how much of a result is the "correct" answer. Detailed herein are embodiments that further emphasize the "correct" answer based on the confidence of one or more machine learning models that the "correct" answer has been found. Answers that are not "correct" are either not emphasized or emphasized in a different manner.
Many businesses use log analytics or have use cases like customer service, search traffic reports, and FAQs that may benefit from the embodiments detailed herein. The detailed embodiments enable these enterprises to build more intelligent enterprise search applications that securely cover a wider range of sources and provide powerful natural language understanding capabilities, in a fraction of the time and with a fraction of the complexity required to implement their own search solutions.
FIG. 1 illustrates an embodiment of an enterprise search service. The enterprise search service 102 allows one or more machine learning models to be used to query or search an enterprise's documents and/or proper subsets thereof. Details of various aspects of the enterprise search service 102 are discussed below. Documents and/or proper subsets thereof are ingested prior to any such query. In some embodiments, the enterprise search service 102 provides the ability to take documents from data sources 105 internal to the provider network 100 and data sources 106 external to the provider network 100 (e.g., stored at third-party locations, local storage, etc.).
Ingestion service 130 allows documents to be ingested into enterprise search service 102. Documents may be pulled from the data source (e.g., in response to a request) and/or pushed from the data source (e.g., synchronized when a document is added or changed). Ingestion service 130 may also obtain Access Control Lists (ACLs) associated with the documents. The ACL may be used to determine whether the search results are allowed to be provided.
To obtain documents from the data sources 105 or 106, the ingestion service is coupled to a connector service 180 that provides multiple connectors to connect to different data sources and receive data from those sources (as a push or pull) according to the appropriate protocol for the particular data source. It is noted that different data sources may use different transmission protocols, storage protocols, encryption protocols, etc.
The data connector of the connector service 180 is configured using the control plane 170. The control plane 170 contains the workflow for resource management of the enterprise search service 102. The control plane 170 may also be used to configure the model build pipeline 160 that builds specific models, vocabularies, and embeddings for hosting in the model hosting service 110 and for answering queries. It is noted that in some embodiments, model management service 150 may be used to refresh a given model.
Ingestion service 130 also extracts text from each document, pre-processes the extracted text (e.g., tokenizes, normalizes, and/or removes noise), and invokes an indexing service to generate index entries for the text and cause the document (or a subset thereof) to be stored. Indexing service 140 indexes documents that have been acquired by ingestion service 130 into one or more indexes 107. An index is a data structure that organizes data into multiple fields. Each document or subset of a document (e.g., a paragraph) is identified by a unique identifier. In some embodiments, the index comprises a plurality of JSON documents.
In some embodiments, the index is an inverted index that lists each unique term that appears in any document and identifies all documents in which each term appears. An index may be considered an optimized set of documents, and each document is a set of fields that are key-value pairs containing data. Each index field has a dedicated optimized data structure. For example, text fields are stored in the inverted index, while numeric and geographic fields are stored in the BKD tree.
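As an illustrative aside (not part of the disclosed embodiments; all names below are hypothetical), a minimal Python sketch of such an inverted index might look like the following:

    from collections import defaultdict

    def build_inverted_index(documents):
        """Map each unique term to the set of identifiers of documents containing it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    docs = {
        "doc-1": "the latest version of software X is 4.2",
        "doc-2": "software X installation guide",
    }
    index = build_inverted_index(docs)
    print(index["software"])  # {'doc-1', 'doc-2'} (set order may vary)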
The indexing service 140 may be schema-less, meaning that documents may be indexed without explicitly specifying how to handle each of the different fields that may occur in the documents. When dynamic mapping is enabled, the indexing service 140 automatically detects and adds new fields to the index. However, as noted below, a schema of reserved fields may be used to map detected data to data types. The reserved fields allow distinguishing full-text string fields from exact-value string fields, performing language-specific text analysis, optimizing fields for partial matching, and/or using data types that are not automatically detected.
Once a set of documents has been indexed, the set of documents may be queried via inference service 120. Inference service 120 processes search queries from end users by performing query understanding (query classification and enrichment), invoking the indexing service 140 to obtain a set of relevant documents for the query, retrieving that set of relevant documents, and invoking one or more models of the model hosting service 110 to derive search results for the given query.
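Purely as a sketch of that flow (the function names below are hypothetical stand-ins for the indexing service, the document storage service, and the hosted models, not the actual interfaces):

    # Hypothetical stand-ins for the services the inference service coordinates.
    def index_search(query, top_k):             # indexing service: return candidate document ids
        return ["doc-1", "doc-2"]

    def fetch_documents(doc_ids):               # document storage service: return text per id
        return {doc_id: f"text of {doc_id}" for doc_id in doc_ids}

    def rank_documents(query, documents):       # document/paragraph ranking model
        return sorted(documents)

    def extract_answer(query, ranked_docs):     # question/answer (reading comprehension) model
        return f"answer drawn from {ranked_docs[0]}"

    def answer_query(query):
        """Sketch of the flow: retrieve candidates, fetch their text, re-rank, extract an answer."""
        candidate_ids = index_search(query, top_k=1000)
        candidates = fetch_documents(candidate_ids)
        ranked = rank_documents(query, list(candidates))
        return {"top_documents": ranked[:10], "answer": extract_answer(query, ranked)}

    print(answer_query("What is the latest version of software X?"))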
Examples of models used by the inference service 120 and running in the model hosting service 110 include, but are not limited to: question/answer (e.g., reading comprehension) models that extract answers from paragraphs, document/paragraph ranking models that rank documents in order of relevance with respect to a query, and FAQ matching models that attempt to identify the correct answer for a given question from a given FAQ document.
The front end 104 of the enterprise search service 102 is coupled to one or more search service components 103 to provide a way for external communication (e.g., from an edge device 108, etc.) with the enterprise search service 102. For example, through the front end 104, a user may communicate with the ingestion service 130 to configure and initiate ingestion of one or more documents, provide queries to be served by the inference service 120, and so forth.
As shown, in some embodiments, the enterprise search service 102 is a service provided by the provider network 100. Provider network 100 (or "cloud" provider network) provides users with the ability to utilize one or more of various types of computing-related resources, such as: computing resources (e.g., executing Virtual Machine (VM) instances and/or containers, executing batch jobs, executing code without provisioning servers), data/storage resources (e.g., object storage, block-level storage, data archive storage, databases and database tables, etc.), network-related resources (e.g., configuring virtual networks (including groups of computing resources), Content Delivery Networks (CDNs), Domain Name Services (DNS)), application resources (e.g., databases, application build/deployment services), access policies or rules, identity policies or rules, machine images, routers and other data processing resources, etc. These and other computing resources may be provided as services, such as a hardware virtualization service that can execute compute instances, a storage service that can store data objects, and the like. A user (or "customer") of provider network 100 may utilize one or more user accounts associated with a customer account, though these terms may be used somewhat interchangeably depending on the context of use. Users may interact with provider network 100 over one or more intermediate networks 101 (e.g., the internet) via one or more interfaces, such as by using Application Programming Interface (API) calls, via a console implemented as a website or application, and so forth. The one or more interfaces may be part of, or serve as a front end to, a control plane (e.g., control plane 170) of provider network 100, which includes "back-end" services that support and enable services that may be more directly provided to customers.
For example, a cloud provider network (or simply "cloud") generally refers to a large pool of accessible virtualized computing resources (such as computing, storage, and networking resources, applications, and services). The cloud may provide convenient, on-demand network access to a shared pool of configurable computing resources that can be programmatically provisioned and released in response to customer commands. These resources may be dynamically provisioned and reconfigured to accommodate variable load. Cloud computing can thus be viewed both as applications delivered as services over a publicly accessible network (e.g., the internet, a cellular communication network) and as the hardware and software in cloud provider data centers that provide those services.
The cloud provider network may be formed as a plurality of regions, where a region may be a geographic area in which the cloud provider clusters data centers. Each region may include a plurality (e.g., two or more) of Availability Zones (AZs) connected to one another via a dedicated high-speed network (e.g., a fiber optic communication connection). An AZ may provide an isolated fault domain, including one or more data center facilities with separate power, separate networking, and separate cooling from the data center facilities in another AZ. Preferably, the AZs within a region are sufficiently distant from each other that the same natural disaster (or other fault-causing event) should not affect or take offline more than one AZ at the same time. Customers may connect to an AZ of the cloud provider network via a publicly accessible network (e.g., the internet, a cellular communication network).
To provide these and other computing resource services, provider network 100 typically relies on virtualization technology. For example, virtualization techniques may be used to provide a user with the ability to control or utilize a compute instance (e.g., a VM that uses a guest operating system (O/S) that operates using a hypervisor that may or may not further run on top of an underlying host O/S; a container that may or may not operate in a VM; an instance that may execute on "bare metal" hardware without an underlying hypervisor), where one or more compute instances may be implemented using a single electronic device. Thus, a user may directly utilize a compute instance hosted by a provider network (e.g., provided by a hardware virtualization service) to perform various computing tasks. Additionally or alternatively, a user may indirectly utilize a computing instance by submitting code to be executed by a provider network (e.g., via an on-demand code execution service), which in turn utilizes the computing instance to execute the code, typically without the user having any control or knowledge of the underlying computing instance involved.
The circles with numbers inside represent exemplary actions that can be used to perform inference (query). At circle 1, an inference request is sent by the edge device 108 to the enterprise search service 102. The front end 104 invokes the inference service 120, which begins processing requests at circle 2.
Processing of the request includes accessing one or more indexes 107 via the indexing service 140 at circle 3 to obtain identifiers of the set of documents to be analyzed, accessing the identified set of documents (or text thereof) from the document store 109 at circle 4, and providing the documents (or text thereof) and the query to one or more machine learning models in the model hosting service 110 at circle 5 to determine one or more of a top document, a top paragraph, and/or a top FAQ.
The results of the determination of the one or more machine learning models are provided to the requestor (subject to any restrictions) at circle 6. Providing the results may also include using an enhanced display.
FIG. 2 illustrates an embodiment of an enterprise search service 102 for providing inference functionality. In particular, the illustrated aspects can be used to respond to a search query over a corpus of documents. Front end 104 receives the search request (or query) and provides the request to inference coordinator 220 of inference service 120.
In some embodiments, the query is submitted as an Application Programming Interface (API) call. In some embodiments, the default response to such a query includes a related paragraph, a matching FAQ, and a related document. The query may contain one or more fields that indicate how the search is performed and/or what is returned. The one or more fields include, for example, one or more of: an attribute filter field that enables filtering searches based on document attributes; an excluded document attributes field indicating which attributes are excluded from the response; a configuration field that defines which document attributes are to be computed; an include document attributes field indicating document attributes to be included in the response; an index identifier field indicating one or more indexes to be searched; a page number field indicating which page of results to return; a page size field indicating the size of each page of results to be returned; a query result type configuration field that sets the type of result (e.g., FAQ, paragraph, document); a query text field comprising a text string to be searched; and a user context field that identifies the end user making the query so that it can be determined whether the query results should be filtered based on the user (e.g., an access control list may indicate that the user is not allowed to view certain content, such as a general employee searching for the health records of other employees).
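For illustration only, a query request of this shape might look like the following; the field names mirror the fields just listed but are hypothetical rather than a published API schema:

    # Hypothetical query request body; field names are illustrative only.
    query_request = {
        "IndexId": "enterprise-docs-index",                         # index identifier field
        "QueryText": "How do I reset my VPN token?",                # query text field
        "QueryResultTypeConfig": ["DOCUMENT", "PARAGRAPH", "FAQ"],  # types of results to return
        "AttributeFilter": {"department": "IT"},                    # attribute filter field
        "PageNumber": 1,                                            # page number field
        "PageSize": 10,                                             # page size field
        "UserContext": {"userId": "employee-123"},                  # used for ACL-based filtering
    }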
Inference coordinator 220 coordinates various services to perform inference using the query. In some embodiments, inference coordinator 220 includes a state machine or algorithm that defines the actions to be taken. In some embodiments, inference coordinator 220 performs query classification and enrichment (or is coupled to a component that performs this operation). For example, in some embodiments, key phrases, entities, syntax, topics, and/or classifications are extracted. In some embodiments, a classifier machine learning model determines what type of question is being asked. Factoid and non-factoid questions may be handled differently with respect to which models are used to determine the top result and how the result is displayed.
Inference coordinator 220 is coupled to indexing service 140 and utilizes indexing service 140 to access one or more indexes 107 to obtain matching document identifiers for queries. The index 107 includes a FAQ index 107A, a question/answer index 107B, and a document/paragraph index 107C. In some cases, inference coordinator 220 provides an indication of what index to use. In some embodiments, metadata 210 provides the physical location of index 107 for use by indexing service 140.
Inference coordinator 220 receives the results (e.g., document identifiers) of the various index queries and retrieves one or more documents for use by one or more machine learning models (e.g., FAQ model 212C, question/answer model 212B, and document/paragraph ranking model 212A) hosted by the model hosting service 110. Inference coordinator 220 retrieves the identified documents (e.g., entire documents, paragraphs, or FAQs) from the text/document store 109 using the document storage service 208. The retrieved documents are then provided to one or more of the models 212A-C of the model hosting service 110, along with aspects of the query, to identify one or more of: one or more top-ranked documents, one or more top-ranked paragraphs, and/or one or more top-ranked FAQs. Note that the models 212A-C provide confidence scores for their outputs. It is further noted that the document storage service 208 stores document artifacts that will be used at inference time to extract answers to a given query.
FIG. 3 illustrates an embodiment of an enterprise search service 102 for providing inference functionality. Inference coordinator 220 receives a query 300. This is shown at circle 1. The inference coordinator 220 triggers queries against one or more of the indexes 107.
In some embodiments, a query is triggered against the document index 107A and the paragraph index 107B (shown at circle 2). Identifiers of the set of "top" documents (e.g., the top 1,000 documents) and "top" paragraphs (e.g., 5,000 paragraphs) are provided from the indexing service 140 back to the inference coordinator 220. The associated documents and paragraphs are retrieved (shown at circle 3) and then sent to the document/paragraph ranking model 212A.
The document/paragraph ranking model 212A analyzes and re-ranks the top documents based on relevance scores and, for a top subset (e.g., 100) of the ranked documents, determines a set number (e.g., 3) of paragraphs for each document in that subset (shown at circle 4). In some embodiments, the top documents are analyzed and re-ranked using a deep cross network (DCN) over document features. Further, in some embodiments, a Bidirectional Encoder Representations from Transformers (BERT) model takes the highest-ranked subset of documents, finds paragraphs, and outputs a relevance score. The relevance scores from the DCN and BERT are combined to get the final re-ranking of the top documents. In some embodiments, when the data consists of plain text documents without metadata fields, the DCN may be bypassed and the top 100 documents may be re-ranked directly using BERT. Note that the output of the document/paragraph ranking model 212A is a collection of top-ranked documents 304 and/or top-ranked paragraphs. In some embodiments, the top-ranked paragraphs are found using the union of the top-ranked documents and the indexed paragraphs.
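A minimal sketch of combining the two relevance scores is shown below; the equal weighting and the score scale are assumptions, since the combination function is not specified here:

    def combine_scores(dcn_scores, bert_scores, weight=0.5):
        """Blend per-document relevance scores from two models and return a re-ranked id list.

        dcn_scores / bert_scores: dicts mapping document id -> relevance score.
        weight: assumed blending factor; the actual combination is not specified.
        """
        combined = {
            doc_id: weight * dcn_scores.get(doc_id, 0.0)
                    + (1 - weight) * bert_scores.get(doc_id, 0.0)
            for doc_id in set(dcn_scores) | set(bert_scores)
        }
        return sorted(combined, key=combined.get, reverse=True)

    print(combine_scores({"d1": 0.9, "d2": 0.4}, {"d1": 0.7, "d2": 0.8}))  # ['d1', 'd2']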
The question and answer model 212B is used to determine a set of one or more top paragraphs for a query. A query is triggered at circle 5 against the paragraph index 107B to find the highest-ranked paragraphs (e.g., 100), which are retrieved and sent to the document/paragraph ranking model 212A for analysis and re-ranking. In particular, in some embodiments, the BERT model receives and re-ranks the top paragraphs and sends the first few (e.g., 5) to the question and answer model 212B at circle 6. In some embodiments, the question and answer model 212B is also BERT based. The question and answer model 212B analyzes these paragraphs and outputs a top paragraph 306 with multiple candidate answers, which are sometimes highlighted. In some embodiments, the top paragraph is emphasized when its confidence score exceeds a first threshold. In some embodiments, when aspects of the top paragraph have confidence scores that exceed a second, more stringent threshold, those aspects are highlighted as the best answer, while aspects with lower confidence scores are emphasized in other ways (e.g., bold).
The FAQ model 212C is used to determine a set of one or more top FAQs for the query. A query is triggered at circle 7 against the FAQ question index 107C and the top set of matching questions is sent from the text/document store 109 to the FAQ model 212C. The FAQ model 212C reorders the top set of questions and returns the most relevant questions and their answers 308. In some embodiments, FAQ model 212C is a BERT-based model.
FIG. 4 illustrates an embodiment of a method for performing inference (document search). Some or all of the operations (or other processes described herein, or variations and/or combinations thereof) are performed under control of one or more computer systems configured with executable instructions and implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, implemented by hardware, or combinations thereof. The code is stored on a computer-readable storage medium in the form of, for example, a computer program comprising instructions executable by one or more processors. The computer readable storage medium is non-transitory. In some embodiments, one or more (or all) of the operations are performed by components of other figures, such as under the control of an inference coordinator 220 that invokes the indexing service 140 to retrieve document identifiers, retrieves identified documents using the document storage service 208, and invokes one or more ML models in the model hosting service 110 to analyze the retrieved documents.
At 401, a search query is received at the front end. The search query includes a question to be answered. In some embodiments, the search query includes an indication of what type of answer is desired (e.g., a list of documents, paragraphs, and/or FAQs). For example, a prefix (or suffix) may be used, such as "paragraph: <question text>". Alternatively, a selection from a list of potential result types may be used. Examples of search query API calls have been detailed above.
A search query is executed at 402 to generate one or more results. Documents that match the search query are identified by querying one or more indexes at 403. For example, in some embodiments, the index of documents is queried for a "top" matching document set at 405, the index of paragraphs is queried for a "top" matching paragraph set at 406, and/or the index of FAQ is queried for a "top" FAQ set at 407. It is noted that these indices may be independent of each other or combined in any manner. As noted, the inference coordinator may cause these one or more queries to occur. In some embodiments, the query is formed such that a request is made to "match" terms of the question. The match query returns documents that match the provided text, number, date, or boolean value. Matching queries may limit the number of results, the number of words of the question to use, and so on.
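As a rough illustration of such a term-match retrieval step (a simplification that assumes the inverted index described earlier; this is not the actual index query API):

    from collections import Counter

    def match_query(inverted_index, question, max_results=1000, max_terms=10):
        """Hypothetical term-match retrieval: rank documents by how many question terms they contain."""
        terms = question.lower().split()[:max_terms]      # limit the number of question words used
        hits = Counter()
        for term in terms:
            for doc_id in inverted_index.get(term, ()):   # inverted index: term -> document ids
                hits[doc_id] += 1
        return [doc_id for doc_id, _ in hits.most_common(max_results)]

    index = {"software": {"doc-1", "doc-2"}, "version": {"doc-1"}}
    print(match_query(index, "latest version of software X"))  # ['doc-1', 'doc-2']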
At 409, the identified documents, paragraphs, and/or FAQs are retrieved based on the matching data. As discussed, the inference coordinator may cause these one or more acquisitions to occur. The documents, paragraphs, and/or FAQs may be stored in separate locations or stored together. In addition, the documents, paragraphs, and/or FAQs may have been preprocessed to make subsequent analysis easier. In some embodiments, the entire document is obtained. In some embodiments, only text extracted from a document is obtained.
At 411, one or more of the top-ranked paragraphs, top-ranked documents, and top-ranked FAQs are determined from the retrieved documents, paragraphs, and/or FAQs based on one or more calls to one or more machine learning models using the search query. Several operations may occur in this action. Note that the models generate confidence scores for their results.
In some implementations, a first machine learning model is used at 413 to determine a proper subset of the identified (acquired) set of documents. For example, in some embodiments, the acquired documents are re-ranked according to relevance score using a first model (e.g., a DCN model), and then a second model (e.g., BERT based) looks at some highest number of those re-ranked documents (e.g., 100) and uses the top paragraphs retrieved for those top documents to determine a relevance score for each document. The relevance scores from the first and second models are combined to generate a set of top-ranked documents. In other embodiments, only the re-ranking using the first model is performed.
In some embodiments, at 417, based on the query and the retrieved paragraphs, a second machine learning model is used to identify a proper subset of the set of identified (and retrieved) paragraphs. The proper subset is a re-ranking of the paragraphs. This may be the same BERT-based model detailed at 413. The re-ranked subset is provided to a third model (along with aspects of the query), which determines the top paragraph from the re-ranked subset at 419. In some embodiments, the third model is a BERT-based model.
In some embodiments, at 421, a proper subset of the identified (and acquired) set of FAQs is determined from the acquired FAQs and the query using a fourth machine learning model. The proper subset includes the top-ranked FAQs. In some embodiments, the fourth machine learning model is BERT based.
One or more of the top-ranked paragraphs, top-ranked documents, and top-ranked FAQs are returned at 423. Returning may include displaying the results. In some embodiments, the returned content is constrained by an access control list and/or a confidence score threshold. For example, if sharing a top-ranked document with the searching user is not allowed based on the access control list, either no content is returned, or lower-ranked documents are returned, etc. In some embodiments, an improved display of results is used. Note that in some embodiments, the returned results are sent back through the front end and/or the inference coordinator.
At 425, in some embodiments, feedback is received regarding one or more of the items returned in the top-ranked paragraphs, top-ranked documents, and top-ranked FAQs. This feedback can be used to adjust the models.
FIG. 5 illustrates an embodiment of an improved display of results of inference queries. As shown, a Graphical User Interface (GUI) 500 allows a user to input a query using a query input mechanism 504 (e.g., an input box). In some embodiments, the user may further define the data set using the data set indicator/selector 502. For example, the user may define that HR documents are to be queried, or that HR FAQs are to be queried, etc. In some embodiments, documents, paragraphs, and FAQs are all queried by default.
The GUI 500 provides an indication of the number of results 506 returned, along with the results 508, 518 themselves. In some embodiments, an answer 505 to the question asked is extracted and displayed. Note that each index type (e.g., document, paragraph, and FAQ) may show one result. In this example, the first result 508 displays text 510 that includes highlighting of particularly relevant aspects of the result. In particular, the text "results" has been highlighted within the document text. The highlighted text is the top-ranked text (or at least the top-ranked text that the user is allowed to see) whose result exceeds one or more confidence score thresholds. In some embodiments, the highlighting is shown using font changes, and in some embodiments, the text is highlighted in color (such as when a yellow background is used for the text portion). The first result 508 also includes a location 512 of the result (e.g., a document location) and a means 514 for providing feedback, for example, as feedback input 1106 in FIG. 11.
The second result 518 displays text 520 in which relevant aspects of the result are less strongly emphasized. In particular, the text "results" has been emphasized within the document text, but less conspicuously than the highlighted text. The emphasized text is the top-ranked text (or at least the top-ranked text that the user is allowed to see) whose result exceeds one or more confidence score thresholds (but not the thresholds required for highlighting). The emphasis may be bold, italic, underlining, a changed font size, etc. The second result 518 also includes a location 522 of the result (e.g., a document location) and a means 524 for providing feedback, for example, as feedback input 1106 in FIG. 11.
The feedback input may be in the form of an API request that includes one or more parameters, such as one or more of: click feedback items (indicating that a search result has been selected), an identifier of the index that was queried, an identifier of the query itself, and a relevance indication such as approval or disapproval.
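An illustrative shape for such a feedback request is sketched below; the parameter names are hypothetical and follow the list above rather than a published schema:

    # Hypothetical feedback request body; parameter names are illustrative only.
    feedback_request = {
        "IndexId": "enterprise-docs-index",   # identifier of the index that was queried
        "QueryId": "q-7f3a",                  # identifier of the query itself
        "ClickFeedbackItems": [               # results the user clicked
            {"ResultId": "r-001", "ClickTime": "2020-11-20T10:15:00Z"},
        ],
        "RelevanceFeedbackItems": [           # explicit approval / disapproval of results
            {"ResultId": "r-002", "RelevanceValue": "NOT_RELEVANT"},
        ],
    }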
FIG. 6 illustrates an embodiment of a method for performing an improved display of results of an inference query. Some or all of the operations (or other processes described herein, or variations and/or combinations thereof) are performed under control of one or more computer systems configured with executable instructions and implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, implemented by hardware, or combinations thereof. The code is stored on a computer-readable storage medium in the form of, for example, a computer program comprising instructions executable by one or more processors. The computer readable storage medium is non-transitory. In some embodiments, one or more (or all) of the operations are performed by components of other figures.
A search query is received at 601. For example, a search query is received at the front end 104 and passed to the inference coordinator 220. Examples of search queries have been detailed above.
The search query is executed at 603 to generate one or more results. Embodiments of such execution have been detailed above (e.g., at least with respect to FIG. 4). The results may include, but are not limited to: text from top-ranked documents, top-ranked paragraphs, top-ranked FAQs, etc. In some embodiments, the execution of the search query includes using one or more ML models.
One or more results are displayed at 605. Note that in some embodiments, the returned results are sent back for display by the front end 104 and/or the inference coordinator 220. Typically, one of these components determines what may be displayed and/or whether to emphasize certain aspects of the results, e.g., by applying access control lists, etc.
In this example, it is assumed that the results are allowed to be displayed, but how the results are displayed varies depending on the confidence of the underlying model in its analysis. At 607, it is determined whether an aspect of the result exceeds a first confidence threshold. For example, the confidence score from the one or more ML models indicates how likely it is that the result is correct; when the confidence score is low, the result may not be particularly good. When the first threshold is not met, then either the result is not displayed at 609 or the result is de-emphasized when it is displayed.
When the first threshold is met, the results are emphasized when displayed. The type of emphasis may differ depending on whether an aspect of the result exceeds a second confidence threshold at 611 that is greater than the first threshold. When the second threshold is not met, then the aspect is emphasized using the first type of emphasis at 613. Examples of the first type of emphasis include, but are not limited to: bold, underline, change font size, and italics. When the second threshold is met, then the aspect is emphasized using a second type of emphasis at 615. The second type of emphasis is more prominent than the first type of emphasis and may include highlighting, bolding, underlining, changing font size, italicizing, or a combination thereof. Note that the emphasis of the first and second types is different.
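A minimal sketch of this two-threshold display decision follows; the threshold values and the emphasis labels are assumptions for illustration:

    def choose_emphasis(confidence, first_threshold=0.5, second_threshold=0.8):
        """Decide how a result aspect is displayed based on model confidence.

        Below the first threshold the aspect is hidden or de-emphasized; between the
        two thresholds it gets the first (lighter) type of emphasis; above the stricter
        second threshold it gets the second (more prominent) type of emphasis.
        """
        if confidence < first_threshold:
            return "hide or de-emphasize"
        if confidence < second_threshold:
            return "first type of emphasis (e.g., bold)"
        return "second type of emphasis (e.g., highlight)"

    print(choose_emphasis(0.65))  # first type of emphasis (e.g., bold)
    print(choose_emphasis(0.92))  # second type of emphasis (e.g., highlight)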
FIG. 7 illustrates an embodiment of an enterprise search service 102 for providing ingestion functionality. The front end 104 receives ingestion requests, index creation requests, etc., and passes those requests to the ingestion service 130. Ingestion service 130 performs document validation on documents retrieved from the data sources 105/106. In some embodiments, ingestion service 130 coordinates various services to perform index creation, index updates, and the other ingestion tasks detailed below. In other embodiments, the ingestion service places the documents to be processed in a queue 735 that includes an extraction and pre-processing pipeline. An ingestion request asks for a set of documents to be acquired, after which the documents are acquired, indexed, preprocessed, stored, etc.
As shown, ingestion service 130 is coupled to a plurality of services. The connector service 180 receives (as a push or a pull) documents from a data source 105/106, whose physical location may be provided by the metadata 210. The indexing service 124 pulls documents and/or text (which may be pre-processed) from the queue 735 and creates or updates the indexes 107 associated with the documents (including paragraphs and FAQs). Metadata 210 may provide the physical location of those indexes 107. The document storage service 208 also pulls documents from the queue 735 and stores the documents (and their chunks) in the text/document store 109.
Before an index can be updated, it needs to be created. In some embodiments, a create index API call is received by the front end 104, which calls the indexing service 124 to generate an index of the indexes 107. The create index request includes one or more fields that inform the behavior of the indexing service 124, such as a field for the index description, a field for the index name, a field for a role that grants permissions for logs and metrics, a field identifying an encryption key, and so on.
When an index has been created, it can be updated. Such updates may be single or batch updates that result in text and unstructured text being ingested into the index, custom attributes being added to documents (if needed), access control lists being added to documents added to the index, text being stored, and text being preprocessed (and stored), and so forth. In some embodiments, the update request includes one or more fields that inform the behavior of the indexing service 124, the document storage service 208, the queue 735 (including the extraction and pre-processing pipeline), and the connector service 180, such as a field for the location of one or more documents, a field for the documents themselves, a field for the index name, a field for a role that grants permissions for logs and metrics, and the like. The extraction and pre-processing pipeline extracts text from the documents and pre-processes it (e.g., tokenizes it, etc.). In some embodiments, the extraction and pre-processing pipeline uses a sliding window to break the extracted text (e.g., tokens) into overlapping paragraphs. The overlapping paragraphs are then indexed and/or stored.
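A minimal sketch of the sliding-window paragraph extraction is shown below; the window and stride sizes are assumptions, since the description does not specify them:

    def sliding_window_passages(tokens, window=200, stride=100):
        """Break a token sequence into overlapping passages.

        window: number of tokens per passage.
        stride: step between passage starts; a stride smaller than the window gives overlap.
        """
        passages = []
        for start in range(0, len(tokens), stride):
            passages.append(tokens[start:start + window])
            if start + window >= len(tokens):   # the last window reached the end of the text
                break
        return passages

    tokens = "text extracted from a long document is split into overlapping passages".split()
    print(sliding_window_passages(tokens, window=6, stride=3))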
FIG. 8 illustrates an embodiment of a method for performing ingestion of one or more documents. Some or all of the operations (or other processes described herein, or variations and/or combinations thereof) are performed under control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or by a combination thereof. The code is stored on a computer-readable storage medium in the form of, for example, a computer program comprising instructions executable by one or more processors. The computer readable storage medium is non-transitory. In some embodiments, one or more (or all) of the operations are performed by components of other figures.
At 801, an ingestion request is received. For example, the front end 104 receives the ingestion request.
One or more documents are obtained from one or more data sources according to the request at 803. For example, when the request indicates a particular bucket, documents are collected from that bucket. Note that the acquisition may simply be pulling one or more documents from the ingestion request itself.
In some embodiments, the document is obtained at 805 by crawling one or more data sources. In some embodiments, such crawling of documents is performed by the connector service 180 and may include collecting from internal and/or external sources.
In some embodiments, an ACL is obtained at 807 for the obtained one or more documents. As described above, ACLs can be used to determine which results of the performed inference the user can see. The ACLs can be stored with the document or pointed to by the owner of the document.
Text is extracted from the acquired one or more documents and preprocessed at 809. Metadata may also be extracted. For example, text may be extracted from a document that includes non-text content, such as an image. The pre-processing of the extracted text includes one or more of tokenizing, normalizing, and/or removing noise. This is performed by the extraction and pre-processing pipeline of queue 735 for each acquired document. Note that the extracted text may include paragraphs.
The extracted text and the preprocessed text are stored at 811. This may be performed by the document storage service 208, which places the extracted text and pre-processed text into the text/document store 109. The text and/or pre-processed text is used during inference.
At 813, an index entry is generated for the extracted text. In some embodiments, the index entry includes a pointer to the ACL. Generation of the index entry includes mapping the tags of the documents into fields of the index entry. In some embodiments, the mapping utilizes reserved fields. A reserved field is a "default" field, allowing standardization across multiple different accounts. Such standardization may help in training the models used for inference, because it should be easier to develop a training data corpus when generic labels are used (rather than training with different labels for each user account). For example, the reserved field "title" allows user account 1 and user account 2 to use the same tag in their documents. In some embodiments, existing tags are mapped to reserved fields. The mapping may be automatic or according to an administrator-provided mapping.
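An illustrative administrator-provided mapping might look like the following sketch; the account-specific tag names on the left are hypothetical, and the reserved field names follow the examples mentioned in this description:

    # Hypothetical mapping from existing account-specific document tags to reserved fields.
    tag_to_reserved_field = {
        "doc_heading": "title",          # user account 1 labels titles as "doc_heading"
        "headline": "title",             # user account 2 labels titles as "headline"
        "content": "body",
        "last_changed": "modified date",
    }

    def normalize_tags(document_fields):
        """Rename a document's tags to reserved field names where a mapping exists."""
        return {tag_to_reserved_field.get(tag, tag): value
                for tag, value in document_fields.items()}

    print(normalize_tags({"headline": "Expense policy", "content": "Travel expenses are ..."}))
    # {'title': 'Expense policy', 'body': 'Travel expenses are ...'}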
In some embodiments, the underlying captured document is stored at 815.
Fig. 9 illustrates an embodiment of exemplary reserved fields for use with ingestion. Documents may use tags to indicate what a piece of text is. For example, a tag may indicate that a piece of text is a title. As described above, index entries include fields for text content, and these fields correspond to tags. In the enterprise search service 102 described herein, a set of "reserved" tags may be used for the index entry fields. These reserved tags allow text to be marked in a generic manner across documents, users, etc.
Examples of the names 901 of the "reserved" fields and their corresponding data types 903 are shown. In some embodiments, the body and heading are defaults. Although these fields are "reserved," in some embodiments they are updatable. For example, if the name "modified date" is not a name anyone uses, it may be altered to reflect the use case. In addition, new "reserved" fields may be added as needed or desired. Note that in some embodiments, the use of the "reserved" fields may be overridden.
Fig. 10 illustrates an embodiment of a graphical user interface for updating/adding/removing reserved fields for ingestion. As shown, GUI 1000 allows a user to adjust the reserved fields, including adding fields, removing fields, and updating fields.
The GUI 1000 includes a reserved field search mechanism 1004 (e.g., an input box). In some embodiments, the user may further define the data set using the data set indicator/selector 1002. For example, a user may define that an HR document has a particular set of reserved fields, while a financial document uses a different set.
For each reserved field, the display includes a reserved field name 1006, a mapped-to name field 1008, a data type field 1012, and an indication of whether or not the reserved field is in use. The reserved field name 1006 is the tag used by the indexing service 140. The mapped-to name field 1008 allows a user to provide the indexing service 140 with a mapping from tags in existing documents to the reserved field. Fields 1006, 1008, and 1012 are editable, and applying the update field function 1016 commits the changes.
When the add field function 1018 is used, a new reserved field entry is added, allowing the user to add a reserved field name, a mapped-to name field, a data type field, and an indication of whether or not the reserved field is in use. This may be performed using one or more GUIs (not shown).
Fields can be removed by using the use field 1014 and then applying the remove field function 1014.
FIG. 11 illustrates an embodiment of a model building system. The model building system may be used to build and refresh models. The depicted model building system includes a front end 104 coupled to ingestion service 130, a model building pipeline 160 coupled to ingestion service 130, a metrics aggregator 1102, a control plane 170, and a model storage service 1104. The depicted model building system includes the metrics aggregator 1102 coupled to the control plane 170, and the model storage service 1104 coupled to the model management service 150. The depicted model management service is coupled to a model hosting service 110, which can host one or any combination of the following: a document/paragraph ranking model 212A, a question/answer model 212B, and a FAQ model 212C.
In one embodiment, ingestion service 130 receives one or more documents to ingest and sends a report of ingestion metrics (e.g., metrics indicating the number of documents, the index size of the document corpus, indexing failures, etc.) to the metrics aggregator 1102. The metrics aggregator 1102 polls whether the document corpus has changed enough (e.g., exceeded a threshold) to trigger a model build. When a model build is triggered, an indication is sent to the control plane 170 so that the control plane causes the model build pipeline 160 to build a model (e.g., a machine learning model), which is then saved by the model storage service 1104. The model building system may also include a training data generation service 1108, for example, to create training data 1110 from the user's data. Training data 1110 can be used by the model build pipeline 160 when creating models and/or by the model management service 150 when refreshing models.
The model (e.g., at or after initial use) may have its functionality improved through further training. The training may be based at least in part on feedback input 1106, e.g., feedback provided by a user. In certain embodiments, the model management service 150 pulls (e.g., from the model storage service 1104) and refreshes (e.g., based on feedback input 1106) the model. Refreshing the model may include utilizing feedback (e.g., from feedback input 1106 or other feedback) in the next training iteration of the model. The next version of the model formed from the next training iteration may then be used by saving the model to the model hosting service 110, e.g., where the updated model is one or any combination of: the document/paragraph ranking model 212A, the question/answer model 212B, and the FAQ model 212C. A model refresh may be triggered (or a proper subset of data displayed for a user to label during active learning) when a confidence value (e.g., score) of the proper subset of data (e.g., answers and/or documents) returned by the model for a search query is below a confidence threshold. Additionally or alternatively, a model refresh may be triggered (or a proper subset of data displayed for labeling during active learning by a user) in response to the difference between a first confidence score, for a first portion of the proper subset of data (e.g., the candidate answer or candidate document with the highest score) with respect to its relevance to the search query, and a second confidence score, for a second portion of the proper subset of data (e.g., the candidate answer or candidate document with the second-highest score) with respect to its relevance to the search query, exceeding a confidence difference threshold. A proper subset of data (e.g., answers and/or documents) may be selected for presentation to a user (e.g., for labeling by the user during active learning) based on the confidence values of the proper subset of data returned by the model for the search query.
Feedback input 1106 may include click data (e.g., how many times a provided link was selected by the user) and/or customer-annotated data (e.g., as discussed below with reference to FIGS. 13 and 14).
The model may take as input a search query for searching the ingested data (e.g., a user's documents) and output a best answer from a plurality of answers in the data and/or a best document from a plurality of documents in the data. Active learning may be used to train the model, where a user indicates the desired output (e.g., answers or documents) for a given input (e.g., a search query). However, rather than requiring the user to indicate, from the entirety (or substantially the entirety) of the data, which answers and/or documents are most important to the search query (e.g., question), certain embodiments herein present the user with a proper subset of the data from which to indicate which answers and/or documents are most important to the search query. Thus, these embodiments allow a user to perform labeling of data (e.g., labeling it as important enough for the next training iteration) without overwhelming the user with non-informative documents.
In one embodiment, active learning is applied to suggest a user query (or queries) and a particular subset of candidate documents and/or candidate answers to a user (e.g., a human) for labeling based on the current performance of the model. Thus, the suggested subset of candidate documents and/or candidate answers provides more value in improving the machine learning model than a randomly sampled set. Different approaches to active learning are possible. One example is to check the difference between the confidence scores of the top candidates; if the difference is less than a threshold, the model needs refinement for such queries. Another example is diversity sampling, in order to obtain greater data coverage. Next, a user (e.g., a customer) may mark the relevance of the documents and the candidate answers (e.g., by selecting an interface element of a graphical user interface). In certain embodiments, when a threshold amount of annotation data (e.g., 1000 samples) has been received, the machine learning model is retrained using the annotation data (e.g., and the accuracy of the retrained model is evaluated on a held-out set of the user data (e.g., the ingested data)). In one embodiment, if the improvement exceeds a certain threshold, the previous version of the model is replaced with the retrained model. The above process may be repeated according to a predetermined schedule.
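A minimal sketch of the confidence-margin check described above follows; the threshold value is an assumption:

    def needs_labeling(candidate_scores, margin_threshold=0.1):
        """Flag a query for active-learning annotation when the model cannot clearly
        separate its top candidates, i.e., the gap between the two highest confidence
        scores is below the threshold."""
        if len(candidate_scores) < 2:
            return False
        top_two = sorted(candidate_scores, reverse=True)[:2]
        return (top_two[0] - top_two[1]) < margin_threshold

    print(needs_labeling([0.81, 0.79, 0.40]))  # True: the top two candidates are nearly tied
    print(needs_labeling([0.95, 0.30]))        # False: the model is confident in its top candidate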
FIG. 12 illustrates an embodiment of a method for model management, such as implemented by the model management service 150. The depicted model management includes: (optionally) receiving a search query 1201; (optionally) performing a search on the user's data using a machine learning model for the search query to generate results 1203; and (optionally) providing the results of the search to the user 1205. Active learning 1207 includes: generating confidence scores (e.g., by the machine learning model) from the results of the search 1209; selecting a proper subset of the data based at least in part on the confidence scores for the proper subset of data 1211; displaying the proper subset of data to the user 1213; receiving an indication from the user of one or more portions of the proper subset of data to use 1215 in a next training iteration of the machine learning model for the search query; performing the next training iteration of the machine learning model with the one or more portions of the proper subset of data 1217; and (optionally) replacing the previous version of the machine learning model for the search query with the next version generated from the next training iteration when an accuracy score of the next version exceeds an accuracy score of the previous version 1219. After active learning 1207 is performed, another search query 1221 may optionally be received, the other search query may then be performed 1223 on the user's data using the machine learning model trained with the one or more portions of the proper subset of data, and the results 1225 of the other search may be provided to the user.
FIG. 13 illustrates an embodiment of a graphical user interface 1300 for use in active learning of questions and answers for training a machine learning model. The depicted graphical user interface 1300 includes a field 1302 that can be customized with text indicating the action the user is to take (e.g., "please select the following answers that are relevant to the indicated query"), and a field 1304 to be populated with the query for which the user is to label answers for relevance. Optionally, a plurality of candidate answers 1306 may be indicated.
The graphical user interface 1300 includes a plurality of entries 1308A-B and each entry includes a feedback input 1310A-B, respectively. Although two entries are shown, any number of entries may be used (e.g., where "X" is any positive integer). In the depicted embodiment, a query 1304 is provided for which active learning is to be performed, as well as a plurality of candidate answers, e.g., as discussed herein. For example, the candidate answer includes paragraph 1312, where the answer is highlighted (e.g., bold, underlined, labeled in a different color, etc.), and also includes paragraphs of surrounding text (e.g., providing context for the user to read and understand the answer and its possible relevance to the query). A link 1314 to the source document may also be included.
As depicted, the example query 1304 is "How much is package pickup?". Candidate answer 1 1308A includes paragraph 1312A with the statement "This option is free of charge. We will pick up your return at your chosen address.". The user may consider candidate answer 1 to be relevant (e.g., the most important answer) and mark the feedback input 1310A (shown as a checkbox as an example). Candidate answer 2 1308B includes paragraph 1312B with the statement "If you select pickup and the return is not due to our mistake, you will be charged XX.XX dollars for the pickup convenience." (where XX.XX is the actual value). The user may consider candidate answer 2 to be relevant (e.g., relevant independently of candidate answer 1) and mark the feedback input 1310B (shown as a checkbox as an example). Highlighting may be added to the results provided by the model, and surrounding words may also be provided (e.g., sentences before and/or after the results). Feedback input 1310 may be another interface element such as, but not limited to, a check box, a button, a drop-down menu, etc.
The user may click on the submit interface element 1316 to cause a feedback input to be sent, for example, as feedback input 1106 in FIG. 11. Feedback inputs may be aggregated, for example, to trigger retraining of the model, as discussed herein.
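For illustration, the feedback input sent on submission could be serialized as one record per candidate, as in the sketch below. The field names and JSON shape are assumptions; the disclosure does not specify a wire format.

```python
# Illustrative shape of the feedback input sent when the user clicks Submit
# (element 1316): one record per candidate, carrying the query, the candidate
# identifier, its source document, and whether its checkbox was marked.
# Field names and the record layout are assumptions, not the patent's format.

from dataclasses import dataclass, asdict
import json

@dataclass
class FeedbackRecord:
    query: str
    candidate_id: str
    source_document: str
    relevant: bool   # True when the user checked the feedback input

def build_feedback(query, selections):
    """selections: mapping of (candidate_id, source_document) -> checkbox state."""
    return [
        FeedbackRecord(query, cid, doc, checked)
        for (cid, doc), checked in selections.items()
    ]

records = build_feedback(
    "How much is package pickup?",
    {("answer-1", "returns-policy.html"): True,
     ("answer-2", "fees.html"): True},
)
print(json.dumps([asdict(r) for r in records], indent=2))
```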
FIG. 14 illustrates an embodiment of a graphical user interface for use in active learning of document rankings for training a machine learning model. The depicted graphical user interface 1400 includes a field 1402 and a field 1404, the field 1402 being customizable with text indicating the action the user is to take (e.g., "please select the following documents relevant to the indicated query"), and the field 1404 being populated with the query for which the user is to label documents for relevance. Optionally, a plurality of candidate documents 1406 may be indicated.
Graphical user interface 1400 includes a plurality of entries 1408A-B and each entry includes a feedback input 1410A-B, respectively. Although two entries are shown, any number of entries may be used (e.g., where "X" is any positive integer). In the depicted embodiment, a query 1404 is provided for which active learning is to be performed, as well as a plurality of candidate documents, e.g., as discussed herein. For example, the candidate document includes a link 1412 to a document (e.g., hosted in storage 109).
As depicted, the example query 1404 is "an operation manual for widget Y?". Candidate document 1 1408A includes link 1412A to a first document. The user may consider candidate document 1 to be relevant (e.g., the most important document) and mark feedback input 1410A (shown as a check box as an example). Candidate document 2 1408B includes a link 1412B to a second document. The user may consider candidate document 2 to be relevant (e.g., relevant independently of candidate document 1) and mark feedback input 1410B (shown as a check box as an example). Feedback input 1410 may be another interface element such as, but not limited to, a yes/no input, a check box, a button, a drop-down menu, and the like.
The user may click on the submit interface element 1416 to cause the feedback input to be sent, for example, as feedback input 1106 in FIG. 11. Feedback inputs may be aggregated, for example, to trigger retraining of the model, as discussed herein.
FIG. 15 illustrates an embodiment of a method of active learning for training a machine learning model. The depicted method comprises: performing a search on the user's data using a machine learning model for the search query to generate results 1501; generating a confidence score for the results of the search 1503; selecting an appropriate subset of the data to provide to the user based on the confidence score 1505; displaying the appropriate subset of the data to the user 1507; receiving an indication from the user of one or more portions of the appropriate subset of the data for a next training iteration of the machine learning model 1509; and performing the next training iteration of the machine learning model using the one or more portions of the appropriate subset of the data 1511.
In certain embodiments, the models discussed herein (e.g., document/paragraph ranking model 212A, question/answer model 212B, and FAQ model 212C) are trained with a training data set. The training data may include questions and corresponding answers from the user's data (e.g., as opposed to public data or data from other businesses). However, the generation of such question-answer pairs may require manual annotation and is therefore expensive in terms of time and cost and/or prone to human error. A model building system (e.g., the model building system depicted in FIG. 11) may use the training data. The model building system may include a training data generation service (e.g., training data generation service 1108 in FIG. 11), for example, to create training data 1110 from the user's data. The training data (e.g., training data 1110 in FIG. 11) may be used by a model build pipeline (e.g., model build pipeline 160 in FIGS. 1 and 11) to create a model and/or by a model management service (e.g., model management service 150 in FIGS. 1 and 11) to refresh a model.
For example, certain embodiments herein remove the human from the generation of training data by removing the human from identifying questions and/or from identifying answers to those questions. These embodiments may include training a language machine learning model to identify (e.g., generate) sets of question-answer pairs from user data (e.g., from its unstructured text data) without requiring manual annotation or other manual involvement.
In one embodiment, the request to build the model (e.g., using model build pipeline 160 in fig. 1 and 11) causes a service (e.g., training data generation service 1108 in fig. 11) to generate training data from the user data. The training data may include questions and their corresponding answers, e.g., candidate questions generated by the service for possible answers in the user data. The completed training data may then be provided to a model building pipeline for building a model specific to the user data.
FIG. 16 illustrates an embodiment of a method for training and using a question generation model. In some embodiments, the training data to be generated is a set of one or more candidate questions generated from user data (e.g., the user's documents, paragraphs, etc.) that contains the answers. The method depicted in FIG. 16 includes training 1601 of the question generation model. In one embodiment, training 1601 of the question generation model includes training a (e.g., language) machine learning model with known question-answer pairs to predict questions from answers 1603. A known question-answer pair may be data that does not include user data (e.g., data that is not from the user); for example, the data may be public data. Examples of known question-answer pairs can be found in a MAchine Reading COmprehension (MARCO) dataset. Known question-answer pairs may be public data, for example, as opposed to the user's private data (e.g., hosted in storage 109 in FIG. 1).
One example of a language machine learning (ML) model is a transformer-based language model that predicts the next word of a text string based on the previous words within the text string. Certain embodiments herein modify a language ML model that predicts each successive next word of a text string so that it instead predicts each next (e.g., successive) word of a question for a given answer, e.g., predicts each successive word of a known question (e.g., a multi-word question) from its known answer (e.g., a multi-word answer). One exemplary language ML model is a transformer model (e.g., a GPT-2 model) that is first trained on (e.g., a very large amount of) data in an unsupervised manner using language modeling as a training signal, and is then fine-tuned on a much smaller set of supervised data (e.g., known questions and their corresponding known answers) to help it solve a specific task.
Referring to FIG. 16, training 1601 includes: training a (e.g., language) machine learning model using known question-answer pairs to predict questions from answers 1603; receiving one or more documents from a user 1605; generating a set of question-answer pairs from the one or more documents from the user using the trained machine learning model 1607; and (optionally) storing the set of question-answer pairs generated from the one or more documents from the user (e.g., in the training data 1110 storage of FIG. 11) 1609.
In some embodiments, the training data (e.g., generated by a machine rather than by a human or using human annotations) is then used to train another machine learning model. For example, a second machine learning model (e.g., document/paragraph ranking model 212A, question/answer model 212B, and FAQ model 212C) is trained using a set of question-answer pairs generated from one or more documents of a user (e.g., from a training data 1110 storage in fig. 11) to determine one or more top-ranked answers 1611 from the user's data for a search query from the user.
After the second machine learning model is trained, it can be used to serve searches: e.g., after receiving a search query 1613 from a user, the search query is executed on the user's data (e.g., documents) using the second machine learning model 1615, and the results of the search query are provided to the user 1617, e.g., one or more of the top-ranked documents, top-ranked paragraphs, or top-ranked questions.
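A compact sketch of this query-time flow (steps 1613-1617) follows; the keyword pre-filter stands in for the index lookup and the overlap scorer stands in for the trained second machine learning model, both of which are simplifications made for the example.

```python
# Sketch of steps 1615/1617: run the user's query against ingested paragraphs
# and return the top-ranked paragraphs according to a scoring model.
# `score` stands in for the second machine learning model; the keyword
# pre-filter stands in for the index lookup.

def search(query, paragraphs, score, top_k=3):
    """paragraphs: list of (doc_id, paragraph_text). Returns the top_k
    paragraphs ordered by the model's relevance score for the query."""
    candidates = [
        (doc_id, text) for doc_id, text in paragraphs
        if any(tok in text.lower() for tok in query.lower().split())
    ]
    ranked = sorted(candidates, key=lambda c: score(query, c[1]), reverse=True)
    return ranked[:top_k]

# Trivial stand-in scorer: fraction of query tokens present in the paragraph.
def overlap_score(query, text):
    tokens = query.lower().split()
    return sum(tok in text.lower() for tok in tokens) / len(tokens)

docs = [
    ("doc-1", "This option is free of charge for returns."),
    ("doc-2", "Pickup is charged when the return is not our mistake."),
]
print(search("is pickup charged", docs, overlap_score))
```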
In certain implementations, the language ML model is also trained to detect an end-of-question (EOQ) token. In some embodiments, the input to the language model used to generate the training data (e.g., to generate questions for known answers) includes a passage with the answer and the question, for example, in the following format: a sequence start indicator (e.g., <bos>), followed by a question start indicator (e.g., <boq>), then the question, and then an end-of-question indicator (e.g., <eoq>).
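A minimal sketch of how such inputs could be serialized is shown below, assuming the passage precedes the <boq> marker; the token spellings and helper functions are illustrative, since the disclosure only names the roles of the indicators.

```python
# Serializing fine-tuning inputs: the passage containing the known answer,
# then the question delimited by <boq>/<eoq>, all preceded by <bos>.
# Token spellings and the passage placement are assumptions for this sketch.

BOS, BOQ, EOQ = "<bos>", "<boq>", "<eoq>"

def format_training_example(passage, question):
    """Serialize one known question-answer pair for language-model fine-tuning."""
    return f"{BOS} {passage} {BOQ} {question} {EOQ}"

def format_generation_prompt(passage):
    """At generation time only the passage and <boq> are supplied; the model
    is expected to continue with the question and stop at <eoq>."""
    return f"{BOS} {passage} {BOQ}"

example = format_training_example(
    "This option is free of charge. We will pick up your return at your chosen address.",
    "How much does package pickup cost?",
)
print(example)
print(format_generation_prompt("Widget Y ships with an operation manual."))
```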
FIG. 17 illustrates a first exemplary set of candidate questions generated by a question generation model trained on known question-answer pairs. In FIG. 17, an exemplary synthetic question generation 1700 includes a known question 1702 (e.g., with its end-of-question token), a paragraph with a known answer 1704 to the known question 1702, and shows candidate questions generated by a model 1706 trained to generate a synthetic question from these two inputs 1702 and 1704 (e.g., along with their end-of-question tokens <eoq>). The model so trained may thus be used with the user's data to generate training data that is specific to the user's (e.g., client's) documents, e.g., rather than training only on other documents.
FIG. 18 illustrates a second exemplary set of candidate questions generated by a question generation model trained on known question-answer pairs. In FIG. 18, an exemplary synthetic question generation 1800 includes a known question 1802 (e.g., with its end-of-question token), a paragraph with a known answer 1804 to the known question 1802, and shows candidate questions generated by a model 1806 trained to generate a synthetic question from these two inputs 1802 and 1804 (e.g., along with their end-of-question tokens <eoq>). The model so trained may thus be used with the user's data to generate training data that is specific to the user's (e.g., client's) documents, e.g., rather than training only on other documents. In one embodiment, the same model is used to generate question 1806 in FIG. 18 and question 1706 in FIG. 17. The trained model may then be used to generate training data that is subsequently used to train second machine learning models (e.g., document/paragraph ranking model 212A, question/answer model 212B, and FAQ model 212C), e.g., with the trained second machine learning models then used as discussed herein.
FIG. 19 illustrates an embodiment of a method for training a question generation model. The depicted method comprises: receiving one or more documents from a user 1901; generating a set of question-answer pairs from the one or more documents from the user using a machine learning model trained to predict questions from answers 1903; and storing the set of question-answer pairs generated from the one or more documents from the user 1905.
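The FIG. 19 flow could be sketched as follows; the generate_question callable is a hypothetical stand-in for the fine-tuned language model, and the JSON-lines output file is an assumed storage format.

```python
# Sketch of the FIG. 19 flow: run a trained question generator over the
# user's paragraphs and persist the resulting synthetic question-answer
# pairs as training data. `generate_question` is a hypothetical stand-in
# for the fine-tuned language model; JSON-lines storage is an assumption.

import json

def build_training_data(paragraphs, generate_question, out_path):
    """paragraphs: list of (doc_id, paragraph_text) taken from the user's
    ingested documents. Writes one question-answer pair per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc_id, text in paragraphs:
            question = generate_question(text)  # model call, stops at <eoq>
            pair = {"document": doc_id, "answer": text, "question": question}
            out.write(json.dumps(pair) + "\n")

# Trivial stand-in generator for illustration only.
build_training_data(
    [("doc-1", "Pickup is free of charge at your chosen address.")],
    generate_question=lambda text: "How much does pickup cost?",
    out_path="synthetic_qa_pairs.jsonl",
)
```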
FIG. 20 illustrates an exemplary provider network (or "service provider system") environment, according to some embodiments. Provider network 2000 may provide resource virtualization to customers via one or more virtualization services 2010 that allow customers to purchase, lease, or otherwise obtain instances 2012 of virtualized resources (including, but not limited to, computing resources and storage resources) implemented on devices within one or more provider networks in one or more data centers. A local Internet Protocol (IP) address 2016 may be associated with a resource instance 2012; the local IP address is the internal network address of the resource instance 2012 on the provider network 2000. In some embodiments, the provider network 2000 may also provide public IP addresses 2014 and/or public IP address ranges (e.g., Internet Protocol version 4 (IPv4) or Internet Protocol version 6 (IPv6) addresses) that customers may obtain from the provider network 2000.
Conventionally, the provider network 2000 may allow a customer of the service provider (e.g., a customer operating one or more client networks 2050A-2050C that include one or more customer devices 2052) to dynamically associate at least some of the public IP addresses 2014 assigned or allocated to the customer with particular resource instances 2012 assigned to the customer via the virtualization service 2010. The provider network 2000 may also allow a customer to remap a public IP address 2014 previously mapped to one virtualized computing resource instance 2012 assigned to the customer to another virtualized computing resource instance 2012 also assigned to the customer. For example, a customer of a service provider (such as an operator of one or more customer networks 2050A-2050C) may implement a customer-specific application using the virtualized computing resource instance 2012 provided by the service provider and the public IP address 2014 and present the customer's application over an intermediate network 2040 such as the internet. Other network entities 2020 on the intermediate network 2040 may then generate traffic to the destination public IP address 2014 published by one or more client networks 2050A-2050C; the traffic is routed to the service provider data center and, at the data center, via the network underlay to the local IP address 2016 of the virtualized computing resource instance 2012 that currently maps to the destination public IP address 2014. Similarly, response traffic from the virtualized computing resource instance 2012 may be routed back over the network underlay onto the intermediate network 2040 to the source entity 2020.
As used herein, a local IP address refers to, for example, an internal or "private" network address of a resource instance in a provider network. The local IP address may be within an address block reserved by Internet Engineering Task Force (IETF) Request for Comments (RFC) 1918 and/or have an address format specified by IETF RFC 4193, and may be mutable within the provider network. Network traffic originating outside the provider network is not routed directly to the local IP address; instead, the traffic uses a public IP address that maps to the local IP address of the resource instance. The provider network may include networking equipment or devices that provide Network Address Translation (NAT) or similar functionality to perform the mapping from public IP addresses to local IP addresses and vice versa.
A public IP address is an Internet-mutable network address assigned to a resource instance either by the service provider or by the customer. Traffic routed to a public IP address is translated, e.g., via 1:1 NAT, and forwarded to the corresponding local IP address of the resource instance.
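As a simple illustration of the 1:1 NAT mapping described above, consider the following sketch; the addresses and table structure are made up for the example and are not part of the disclosure.

```python
# Illustrative 1:1 NAT table mapping public IP addresses to the local IP
# addresses of resource instances. Addresses are fictitious documentation
# values.

PUBLIC_TO_LOCAL = {
    "203.0.113.10": "10.0.0.5",
    "203.0.113.11": "10.0.0.6",
}
LOCAL_TO_PUBLIC = {local: public for public, local in PUBLIC_TO_LOCAL.items()}

def inbound(dest_public_ip):
    """Translate the destination of traffic arriving from the intermediate network."""
    return PUBLIC_TO_LOCAL[dest_public_ip]

def outbound(src_local_ip):
    """Translate the source of response traffic leaving the provider network."""
    return LOCAL_TO_PUBLIC[src_local_ip]

print(inbound("203.0.113.10"))   # 10.0.0.5
print(outbound("10.0.0.5"))      # 203.0.113.10
```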
The provider network infrastructure may assign some public IP addresses to particular resource instances; these public IP addresses may be referred to as standard public IP addresses, or simply standard IP addresses. In some embodiments, the mapping of the standard IP address to the local IP address of the resource instance is a default startup configuration for all resource instance types.
At least some public IP addresses may be assigned to or obtained by customers of provider network 2000; the customer may then assign the public IP address to which it is allocated to the particular resource instance allocated to the customer. These public IP addresses may be referred to as client public IP addresses, or simply client IP addresses. Instead of being assigned to a resource instance by the provider network 2000 as in the case of a standard IP address, a customer IP address may be assigned to a resource instance by a customer, e.g., via an API provided by a service provider. Unlike standard IP addresses, customer IP addresses are assigned to customer accounts and may be remapped by the respective customer to other resource instances as needed or desired. The customer IP address is associated with the customer account, rather than with a particular resource instance, and the customer controls the IP address until the customer chooses to release it. Unlike conventional static IP addresses, the client IP address allows the client to mask resource instance or availability zone failures by remapping the client's public IP address to any resource instance associated with the client account. For example, the client IP address enables the client to resolve a problem with a client resource instance or software by remapping the client IP address to an alternate resource instance.
FIG. 21 is a block diagram of an exemplary provider network that provides a storage service and a hardware virtualization service to customers, according to some embodiments. Hardware virtualization service 2120 provides a plurality of computing resources 2124 (e.g., VMs) to customers. For example, the computing resources 2124 may be rented or leased to customers of provider network 2100 (e.g., a customer implementing customer network 2150). Each computing resource 2124 may be provided with one or more local IP addresses. Provider network 2100 can be configured to route packets from the local IP addresses of the computing resources 2124 to public Internet destinations and from public Internet sources to the local IP addresses of the computing resources 2124.
Provider network 2100 may provide a customer network 2150, e.g., coupled to intermediate network 2140 via local network 2156, with the ability to implement virtual computing systems 2192 via hardware virtualization service 2120 coupled to intermediate network 2140 and to provider network 2100. In some embodiments, hardware virtualization service 2120 may provide one or more APIs 2102 (e.g., web service interfaces) via which customer network 2150 may access functionality provided by the hardware virtualization service 2120, e.g., via a console 2194 (e.g., a web-based application, standalone application, mobile application, etc.). In some embodiments, each virtual computing system 2192 at the customer network 2150 may correspond to a computing resource 2124 at the provider network 2100 that is leased, rented, or otherwise provided to the customer network 2150.
A customer may access the functionality of the storage service 2110 from an instance of the virtual computing system 2192 and/or another customer device 2190 (e.g., via the console 2194), e.g., via one or more APIs 2102, to access data from and store data to storage resources 2118A-2118N of a virtual data store 2116 (e.g., folders or "buckets," virtualized volumes, databases, etc.) provided by the provider network 2100. In some embodiments, a virtualized data storage gateway (not shown) may be provided at customer network 2150, which may locally cache at least some data (e.g., frequently accessed or critical data), and may communicate with storage service 2110 via one or more communication channels to upload new or modified data from the local cache, such that a main storage area for data (virtualized data storage 2116) is maintained. In some embodiments, users via the virtual computing system 2192 and/or on another client device 2190 may install and access virtual data storage 2116 volumes via the storage service 2110 acting as a storage virtualization service, and these volumes may appear to the user as local (virtualized) storage 2198.
Although not shown in fig. 21, one or more virtualization services can also be accessed from resource instances within the provider network 2100 via one or more APIs 2102. For example, a customer, equipment service provider, or other entity may access a virtualization service from within a corresponding virtual network on provider network 2100 via API2102 to request allocation of one or more resource instances within the virtual network or within another virtual network.
Illustrative System
In some embodiments, a system implementing some or all of the techniques described herein may include a general-purpose computer system including, or configured to access, one or more computer-accessible media, such as computer system 2200 shown in fig. 22. In the illustrated embodiment, computer system 2200 includes one or more processors 2210 coupled to a system memory 2220 via an input/output (I/O) interface 2230. Computer system 2200 also includes a network interface 2240 coupled to I/O interface 2230. Although fig. 22 illustrates computer system 2200 as a single computing device, in various embodiments, computer system 2200 may include one computing device or any number of computing devices configured to work together as a single computer system 2200.
In various embodiments, computer system 2200 may be a single-processor system that includes one processor 2210 or a multi-processor system that includes several processors 2210 (e.g., two, four, eight, or another suitable number). Processor 2210 can be any suitable processor capable of executing instructions. For example, in various embodiments, processors 2210 may be general-purpose or embedded processors implementing any of a variety of Instruction Set Architectures (ISAs), such as the x86, ARM, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In a multi-processor system, each of processors 2210 may typically, but need not, implement the same ISA.
System memory 2220 may store instructions and data that are accessible by one or more processors 2210. In various embodiments, system memory 2220 may be implemented using any suitable memory technology, such as random-access memory (RAM), static RAM (SRAM), synchronous dynamic RAM (SDRAM), non-volatile/flash memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions (such as the methods, techniques, and data described above) are shown stored as enterprise search service code 2225 and data 2226 in system memory 2220.
In one embodiment, I/O interface 2230 may be configured to coordinate I/O traffic between processor 2210, system memory 2220, and any peripheral devices in the device, including network interface 2240 or other peripheral interfaces. In some embodiments, I/O interface 2230 may perform any necessary protocol, timing, or other data transformations to convert data signals from one component (e.g., system memory 2220) into a format suitable for use by another component (e.g., processor 2210). In some embodiments, I/O interface 2230 may include devices that support attachment via various types of peripheral buses, such as, for example, a Peripheral Component Interconnect (PCI) bus standard or a variant of the Universal Serial Bus (USB) standard. In some embodiments, the functionality of I/O interface 2230 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments, some or all of the functionality of I/O interface 2230 (such as an interface to system memory 2220) may be incorporated directly into processor 2210.
Network interface 2240 may be configured to allow data to be exchanged between computer system 2200 and other devices 2260 (e.g., such as the other computer systems or devices illustrated in FIG. 1) attached to one or more networks 2250. In various embodiments, network interface 2240 may support communication via any suitable wired or wireless general data network (such as an Ethernet network type, for example). In addition, the network interface 2240 may support communication via telecommunications/telephony networks (such as analog voice networks or digital fiber optic communications networks), via Storage Area Networks (SANs) such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
In some embodiments, computer system 2200 includes one or more offload cards 2270 (including one or more processors 2275 and possibly one or more network interfaces 2240) that are connected using an I/O interface 2230 (e.g., a bus implementing a version of the Peripheral Component Interconnect Express (PCI-E) standard, or another interconnect such as a QuickPath interconnect (QPI) or UltraPath interconnect (UPI)). For example, in some embodiments, computer system 2200 may act as a host electronic device that hosts compute instances (e.g., operating as part of a hardware virtualization service), and the one or more offload cards 2270 execute a virtualization manager that can manage compute instances executing on the host electronic device. As an example, in some embodiments, the one or more offload cards 2270 can perform compute instance management operations, such as pausing and/or un-pausing compute instances, launching and/or terminating compute instances, performing memory transfer/copy operations, and so forth. In some embodiments, these management operations may be performed by the one or more offload cards 2270 in coordination with a hypervisor executed by the other processors 2210A-2210N of the computer system 2200 (e.g., upon a request from the hypervisor). However, in some embodiments, the virtualization manager implemented by the one or more offload cards 2270 can accommodate requests from other entities (e.g., from the compute instances themselves) and may not coordinate with (or service) any separate hypervisor.
In some embodiments, system memory 2220 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above. However, in other embodiments, program instructions and/or data may be received, transmitted or stored on different types of computer-accessible media. Generally speaking, a computer-accessible medium may include non-transitory storage media or memory media such as magnetic media or optical media, e.g., disk or DVD/CD coupled to computer system 2200 via I/O interface 2230. Non-transitory computer-accessible storage media may also include any volatile or non-volatile media, such as RAM (e.g., SDRAM, Double Data Rate (DDR) SDRAM, SRAM, etc.), Read Only Memory (ROM), etc., which may be included in some embodiments of computer system 2200 as system memory 2220 or another type of memory. Further, computer-accessible media may include transmission media or signals, such as electrical, electromagnetic, or digital signals, communicated via communication media (such as a network and/or a wireless link, such as may be implemented via network interface 2240).
The various embodiments discussed or presented herein may be implemented in a wide variety of operating environments, which in some cases may include one or more user computers, computing devices, or processing devices that can be used to operate any of a number of applications. The user or client device may include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless, and handheld devices running mobile software and capable of supporting multiple network connection protocols and messaging protocols. Such a system may also include a plurality of workstations running any of a variety of commercially available operating systems and other known applications for purposes such as development and database management. These devices may also include other electronic devices, such as virtual terminals, thin clients, gaming systems, and/or other devices capable of communicating via a network.
Most embodiments utilize at least one network that will be familiar to those skilled in the art to support communication using any of a number of widely available protocols, such as transmission control protocol/internet protocol (TCP/IP), File Transfer Protocol (FTP), universal plug and play (UPnP), Network File System (NFS), Common Internet File System (CIFS), extensible messaging and presence protocol (XMPP), AppleTalk, and the like. The one or more networks may include, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Virtual Private Network (VPN), the internet, an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network, and any combination thereof.
In embodiments using a web server, the web server can run any of a variety of server or mid-tier applications, including HTTP servers, File Transfer Protocol (FTP) servers, Common Gateway Interface (CGI) servers, data servers, Java servers, business application servers, and the like. The server(s) may also be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more applications that may be implemented in any programming language (such as Java®, C, C#, or C++) or any scripting language (such as Perl, Python, PHP, or TCL), as well as combinations thereof. The server(s) may also include database servers, including but not limited to those commercially available from Oracle(R), Microsoft(R), Sybase(R), IBM(R), and the like. The database servers may be relational or non-relational (e.g., "NoSQL"), distributed or non-distributed, etc.
The environment disclosed herein may include a wide variety of data storage areas as discussed above, as well as other memory and storage media. These may reside in various locations, such as storage media local to (and/or resident in) one or more of the computers, or remote to any or all of the computers across a network. In a particular set of embodiments, the information may reside in a Storage Area Network (SAN) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to a computer, server, or other network device may be stored locally and/or remotely as appropriate. Where the system includes computerized devices, each such device may include hardware elements that may be electrically coupled via a bus, including, for example, at least one Central Processing Unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen, or keypad), and/or at least one output device (e.g., a display device, printer, or speaker). Such a system may also include one or more storage devices, such as hard disk drives, optical storage devices, and solid state storage devices such as Random Access Memory (RAM) or Read Only Memory (ROM), as well as removable media devices, memory cards, flash memory cards, and the like.
Such devices may also include a computer-readable storage medium reader, a communication device (e.g., modem, network card (wireless or wired), infrared communication device, etc.), and working memory, as described above. The computer-readable storage media reader can be connected to or configured to receive computer-readable storage media representing remote, local, fixed, and/or removable storage devices and storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and various devices will also typically include a number of software applications, modules, services or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or web browser. It should be appreciated that alternative embodiments may have numerous variations from the embodiments described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets), or both. In addition, connections to other computing devices, such as network input/output devices, may be employed.
Storage media and computer-readable media for containing code or portions of code may include any suitable media known or used in the art, including storage media and communication media such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage and/or transmission of information such as computer-readable instructions, data structures, program modules or other data, including RAM, ROM, electrically-erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a system device. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments.
In the foregoing description, various embodiments have been described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the described embodiments.
Bracketed text and boxes with dashed borders (e.g., large dashes, small dashes, dot dashes, and dots) are used herein to illustrate optional operations to add additional features to some embodiments. However, this notation should not be taken to mean that these are the only options or optional operations, and/or that in certain embodiments, the boxes with solid line boundaries are not optional.
In various embodiments, reference numerals with suffix letters may be used to indicate that one or more instances of the referenced entity may be present, and when multiple instances are present, each instance need not be identical but may instead share some common features or function in a common manner. Moreover, the use of a particular suffix is not intended to imply that a particular number of that entity is present unless explicitly stated to the contrary. Thus, in various embodiments, two entities using the same or different suffix letters may or may not have the same number of instances.
References to "one embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Moreover, in the various embodiments described above, unless specifically indicated otherwise, disjunctive language such as the phrase "at least one of A, B, or C" is intended to be understood to mean A, B, or C, or any combination thereof (e.g., A, B, and/or C). Thus, disjunctive language is not intended and should not be construed to imply that a given embodiment requires at least one of A, at least one of B, or at least one of C to each be present.
At least some embodiments of the disclosed technology may be described according to the following clauses:
1. a computer-implemented method, comprising:
receiving a document search query;
querying at least one index based on the document search query to identify matching data, the matching data comprising one or more of a set of documents comprising a set of frequently asked questions;
acquiring the identified matching data;
determining one or more of a best matching passage, a best matching document, and a best matching frequently asked question from the set of documents based on one or more calls to one or more machine learning models based on at least the obtained identified matching data and the document search query by:
ranking a set of best matching documents from the identified set of documents using at least a first machine learning model to determine a best matching document,
determining a best matching paragraph set from the identified paragraph set using a second machine learning model,
determining a best matching paragraph from the set of best matching paragraphs using a third machine learning model, and
determining a best matching frequently asked question from the identified set of frequently asked questions using a fourth machine learning model; and
displaying one or more of the best matching passage, the best matching document, and the best matching frequently asked question.
2. The computer-implemented method of example 1, wherein the first machine learning model is a deep cross network model and the second, third, and fourth machine learning models are bi-directional encoder representations from a transformer model.
3. The computer-implemented method of any of examples 1-2, further comprising:
receiving feedback on one or more of the best matching passage, the best matching document, and the returned ones of the best matching frequently asked questions.
4. A computer-implemented method, comprising:
receiving a document search query;
querying at least one index based on the document search query to identify matching data;
acquiring the identified matching data;
determining one or more of a top-ranked paragraph and a top-ranked document from a set of documents based on one or more calls to one or more machine learning models based on at least the obtained identified matching data and the document search query by:
determining a suitable subset of documents from the identified set of documents,
identifying a set of paragraphs from the appropriate subset of documents based on the document search query,
determining an appropriate subset of paragraphs from the identified set of paragraphs,
determining top-ranked paragraphs from the proper subset of best matching paragraphs; and
returning one or more of the top ranked paragraphs and the appropriate subset of documents.
5. The computer-implemented method of example 4, wherein, of the one or more machine learning models, determining the proper subset of documents from the identified set of documents is performed using a first machine learning model, determining the proper subset of paragraphs from the identified set of paragraphs is performed using a second machine learning model, and determining the top-ranked paragraph from the proper subset of best matching paragraphs is performed using a third machine learning model.
6. The computer-implemented method of example 5, wherein the first machine learning model is a deep cross network model and the second and third machine learning models are bi-directional encoder representations from a transformer model.
7. The computer-implemented method of any of examples 4-6, wherein returning one or more of the top-ranked paragraphs and the appropriate subset of documents comprises displaying the one or more of the top-ranked paragraphs and the appropriate subset of documents.
8. The computer-implemented method of any of examples 4-6, wherein the at least one index includes at least one index for paragraphs and at least one index for documents.
9. The computer-implemented method of any of examples 4-8, wherein the document search query includes a question to be answered and a result includes an answer to the question.
10. The computer-implemented method of any of examples 4-9, wherein the returned one or more items comprise the top-ranked paragraph and the appropriate subset of documents.
11. The computer-implemented method of any of examples 4-10, wherein the documents include at least a word processing document, a text file, and a PostScript-based file.
12. The computer-implemented method of any of examples 4-11, wherein the document is accessible to a provider network.
13. The computer-implemented method of any of examples 4-12, wherein the appropriate subset of documents returned is organized as an ordered list of documents.
14. The computer-implemented method of any of examples 4-13, further comprising:
receiving feedback regarding the returned one or more items of the top-ranked paragraph and the appropriate subset of documents.
15. A system, comprising:
a data storage for storing a set of documents; and
a search service implemented by one or more electronic devices, the search service comprising instructions that, when executed, cause the search service to:
a document search query is received that includes a query term,
querying at least one index based on the document search query to identify matching data,
the identified matching data is obtained, and
determining one or more of a top-ranked paragraph and a top-ranked document from the set of documents based on one or more calls to one or more machine learning models based at least on the obtained identified matching data and the document search query to:
determining a suitable subset of documents from the identified set of documents,
identifying a set of paragraphs from the appropriate subset of documents based on the document search query, determining an appropriate subset of paragraphs from the identified set of paragraphs, and
determining the top ranked paragraph from the proper subset of best matching paragraphs, and
returning one or more of the top ranked paragraphs and the appropriate subset of documents.
16. The system of example 15, wherein, of the one or more machine learning models, determining the proper subset of documents from the identified set of documents is performed using a first machine learning model, determining the proper subset of paragraphs from the identified set of paragraphs is performed using a second machine learning model, and determining the top-ranked paragraph from the proper subset of best matching paragraphs is performed using a third machine learning model.
17. The system of example 15, wherein the first machine learning model is a deep cross network model and the second and third machine learning models are bi-directional encoder representations from a transformer model.
18. The system of any of examples 15-17, wherein the document search query includes a question to be answered and the results include an answer to the question.
19. The system of any of examples 15-18, wherein the documents include at least a word processing document, a text file, and a PostScript-based file.
20. The system of any of examples 15-19, wherein returning one or more of the top-ranked paragraphs and the proper subset of documents comprises displaying the one or more of the top-ranked paragraphs and the proper subset of documents.
21. A computer-implemented method, comprising:
receiving an ingestion request for ingesting at least one document for a business;
obtaining the at least one document from at least one data source by:
crawl documents in the at least one data source,
obtaining an access control list of the at least one document;
extracting text from the obtained at least one document;
preprocessing the extracted text to generate a predictable and analyzable preprocessed text;
generating an index entry for the extracted text, the index entry mapping the extracted text to one or more of a plurality of reserved fields; and
storing the extracted text, the index entry, the preprocessed text, and the at least one document in at least one data storage location.
22. The computer-implemented method of example 21, wherein the data types of the one or more reserved fields include a text data type, a date data type, and a numerical data type.
23. The computer-implemented method of any of examples 21-22, wherein the plurality of reserved fields includes default fields for a title and a body.
24. A computer-implemented method, comprising:
receiving an intake request to take a document;
extracting text from the document;
preprocessing the extracted text to generate a predictable and analyzable preprocessed text;
generating an index entry for the extracted text, the index entry mapping the extracted text to a reserved field of a plurality of reserved fields; and
storing the extracted text, the index entry, and the preprocessed text in at least one data storage location.
25. The computer-implemented method of example 24, wherein the data type of the reserved field is selected from a text data type, a date data type, and a numerical data type.
26. The computer-implemented method of any of examples 24-25, wherein the plurality of reserved fields includes a field for a title and a field for a body.
27. The computer-implemented method of any of examples 24-26, wherein at least a subset of the plurality of reserved fields is updatable to change a mapping of the fields.
28. The computer-implemented method of any of examples 24-27, wherein the plurality of reserved fields are user-extensible.
29. The computer-implemented method of any of examples 24-28, wherein the ingestion request is an Application Programming Interface (API) call that includes at least an identifier of an index to be updated.
30. The computer-implemented method of any of examples 24-29, wherein the plurality of reserved fields are configurable for each client of a provider network.
31. The computer-implemented method of any of examples 24-30, further comprising:
storing the document in a storage location.
32. The computer-implemented method of any of examples 24-31, wherein the pre-processing comprises at least one of:
tokenizing said extracted text;
decomposing the extracted paragraphs into overlapping paragraphs based on a sliding window;
normalizing the token; and
removing noise from the extracted text.
33. The computer-implemented method of any of examples 24-32, wherein the document is obtained from a remote location.
34. The computer-implemented method of any of examples 24-33, further comprising:
storing an access control list associated with the obtained document.
35. A system, comprising:
a data store for storing one or more documents;
a search service implemented by one or more electronic devices, the search service comprising instructions that, when executed, cause the search service to:
receiving an ingest request to ingest a document from the data storage device,
extracting text from the document,
preprocessing the extracted text to generate a predictable and analyzable preprocessed text;
generating an index entry for the extracted text, the index entry mapping the extracted text to a reserved field of a plurality of reserved fields, and
Storing the extracted text, the index entry, and the preprocessed text in at least one data storage location.
36. The system of example 35, wherein the data type of the reserved field is selected from a text data type, a date data type, and a numeric data type.
37. The system of any of examples 35-36, wherein at least a subset of the plurality of reserved fields is updatable to change a mapping of the fields.
38. The system of any of examples 35-37, wherein the ingestion request is an Application Programming Interface (API) call that includes at least an identifier of an index to be updated.
39. The system of any of examples 35-38, wherein the plurality of reserved fields are configurable for each client of a provider network.
40. The system of any of examples 35-39, wherein the pre-processing comprises at least one of:
tokenizing said extracted text;
decomposing the extracted paragraphs into overlapping paragraphs based on a sliding window;
normalizing the token; and
removing noise from the extracted text.
41. A computer-implemented method, comprising:
receiving a document search query;
querying at least one index based on the document search query to identify matching data;
acquiring the identified matching data;
determining one or more of top-ranked paragraphs, top-ranked documents, and top-ranked frequently asked questions from a set of documents based on one or more calls to one or more machine learning models based at least on the obtained identified matching data and the document search query;
determining that a confidence value for an aspect of the one or more of the top-ranked paragraphs, top-ranked documents, and top-ranked frequently asked questions exceeds a first confidence threshold in terms of its relevance to the document search query; and
displaying the one or more of the top-ranked paragraph, the proper subset of documents, and the top-ranked frequently asked questions, including emphasis on the aspect of the result that exceeds the first confidence threshold.
42. The computer-implemented method of example 41, wherein the aspect does not include text from the document search query.
43. The computer-implemented method of any of examples 41-42, wherein the confidence value is derived from an output of one or more of the machine learning models.
44. A computer-implemented method, comprising:
receiving a search query;
performing the search query on a plurality of documents to generate search query results, the documents including paragraphs of text;
determining that a confidence value of an aspect of the search query result exceeds a first confidence threshold with respect to its relevance to the search query; and
displaying the search results, including emphasis on the aspect of the results that exceeds the first confidence threshold.
45. The computer-implemented method of example 44, wherein the search query uses one or more machine learning models in the generation of the search query results.
46. The computer-implemented method of example 45, wherein the confidence value is derived from an output of one or more of the machine learning models.
47. The computer-implemented method of any of examples 44-46, wherein the results include one or more of the top-ranked paragraphs and the appropriate subset of documents.
48. The computer-implemented method of any of examples 44-47, wherein the document search query includes a question to be answered and the result includes an answer.
49. The computer-implemented method of any of examples 44-48, further comprising:
determining that a confidence value of the aspect of the one or more of the top-ranked paragraphs, top-ranked documents, and top-ranked frequently asked questions exceeds a second confidence threshold in relation to its relevance to the document search query, wherein the emphasis is a highlight when the first and second confidence thresholds are exceeded.
50. The computer-implemented method of example 49, wherein the emphasis is one or more of the following when only the first confidence threshold is exceeded: bolding, text shading, underlining, changes in font size, changes in font, or style of font.
51. The computer-implemented method of any of examples 44-50, wherein the documents include at least a word processing document, a text file, and a PostScript-based file.
52. The computer-implemented method of any of examples 44-51, further comprising:
receiving feedback regarding the display result.
53. The computer-implemented method of example 52, wherein the feedback is used to retrain one or more machine learning models used in the generation of the search query results.
54. The computer-implemented method of any of examples 44-53, wherein the emphasis is one or more of: highlighting, text shading, bolding, underlining, a change in font size, a change in font, or a style of font.
55. A system, comprising:
a data storage for storing a plurality of documents; and
a search service implemented by one or more electronic devices, the search service comprising instructions that, when executed, cause the search service to:
a search query is received,
performing the search query on a plurality of documents to generate search query results, the documents including paragraphs of text;
determining that a confidence value of an aspect of the search query result exceeds a first confidence threshold in terms of its relevance to the search query, an
Displaying the search results, including emphasis on the aspect of the results that exceeds the first confidence threshold.
56. The system of example 55, wherein the document search query comprises a question to be answered and the results comprise answers.
57. The system of any of examples 55-56, wherein the search service is further to:
determining that the confidence value of the aspect of the one or more of the top-ranked paragraphs, top-ranked documents, and top-ranked frequently asked questions exceeds a second confidence threshold in terms of its relevance to the document search query, wherein the emphasizing is highlighting when the first and second confidence thresholds are exceeded.
58. The system of any of examples 55-57, wherein the emphasis is one or more of the following when only the first confidence threshold is exceeded: bolding, text shading, underlining, changes in font size, changes in font, or style of font.
59. The system of any of examples 55-58, wherein the documents include at least a word processing document, a text file, and a PostScript-based file.
60. The system of any of examples 55-59, wherein the emphasis is one or more of: highlighting, text shading, bolding, underlining, a change in font size, a change in font, or a style of font.
61. A computer-implemented method, comprising:
receiving a search query from a user for data of the user;
performing a search on the data of the user for the search query using a machine learning model to generate results;
generating a confidence score for the results of the search;
selecting an appropriate subset of the data to provide to the user based on the confidence score;
displaying the appropriate subset of the data to the user via a graphical user interface;
receiving, via the graphical user interface from the user, an indication of one or more portions of the appropriate subset of the data for use in a next training iteration of the machine learning model for the search query; and
performing the next training iteration of the machine learning model using the one or more portions of the proper subset of the data.
62. The computer-implemented method of example 61, wherein the proper subset of the data is a plurality of candidate documents for the search query.
63. The computer-implemented method of any of examples 61-62, wherein the appropriate subset of the data is a plurality of candidate answers to the search query.
64. A computer-implemented method, comprising:
performing a search on data of a user for a search query using a machine learning model to generate results;
generating a confidence score for the results of the search;
selecting an appropriate subset of the data to provide to the user based on the confidence score;
displaying the appropriate subset of the data to the user;
receiving, from the user, an indication of one or more portions of the appropriate subset of the data for a next training iteration of the machine learning model; and
performing the next training iteration of the machine learning model using the one or more portions of the appropriate subset of the data.
65. The computer-implemented method of example 64, wherein the appropriate subset of the data is a plurality of candidate documents for the search query.
66. The computer-implemented method of example 65, wherein the displaying the plurality of candidate documents comprises displaying to the user a respective link for each of the plurality of candidate documents.
67. The computer-implemented method of example 65, wherein the displaying the plurality of candidate documents comprises displaying the search query to the user.
68. The computer-implemented method of example 65, wherein the indication from the user regarding the one or more portions is whether the user selected a respective interface element for each document of the plurality of candidate documents.
69. The computer-implemented method of any of examples 64-68, wherein the appropriate subset of the data is a plurality of candidate answers to the search query.
70. The computer-implemented method of example 69, wherein the displaying the plurality of candidate answers includes displaying, for each of the candidate answers, a respective paragraph, wherein an appropriate subset of the respective paragraph is highlighted as a candidate answer.
71. The computer-implemented method of example 69, wherein the displaying the plurality of candidate answers includes displaying the search query to the user.
72. The computer-implemented method of example 69, wherein the indication from the user regarding the one or more portions is whether the user selected a respective interface element for each of the plurality of candidate answers.
73. The computer-implemented method of any of examples 64-72, wherein the displaying the appropriate subset of the data is in response to the confidence score being less than a confidence threshold regarding its relevance to the search query.
74. The computer-implemented method of any of examples 64-73, wherein the displaying the appropriate subset of the data is in response to a difference between a first confidence score of a first portion of the appropriate subset of the data, regarding its relevance to the search query, and a second confidence score of a second portion of the appropriate subset of the data, regarding its relevance to the search query, being less than a confidence difference threshold.
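Examples 64-74 (and the corresponding system in examples 75-80) describe an active-learning loop: candidate results are shown for feedback when the model is unsure, and the user's selections feed the next training iteration. A minimal sketch of the selection criteria in examples 73-74 follows; the thresholds, data shapes, and names are assumptions, not the claimed implementation.

    # Illustrative only: surface candidates for user feedback when the top
    # confidence is low (example 73) or the top two candidates are closer than
    # a confidence difference threshold (example 74).
    from typing import List, Tuple

    def needs_feedback(scored: List[Tuple[str, float]],
                       conf_threshold: float = 0.6,         # assumed value
                       diff_threshold: float = 0.1) -> bool:  # assumed value
        """scored: (candidate, confidence) pairs sorted by confidence, descending."""
        if not scored:
            return False
        top_conf = scored[0][1]
        if top_conf < conf_threshold:
            return True   # the model is unsure about its best candidate
        if len(scored) > 1 and top_conf - scored[1][1] < diff_threshold:
            return True   # the top candidates are too close to call
        return False

    query = "how do I rotate an encryption key"    # hypothetical query
    candidates = [("key-rotation-guide.pdf", 0.58), ("iam-overview.pdf", 0.55)]
    if needs_feedback(candidates):
        selected = {"key-rotation-guide.pdf"}      # stand-in for interface-element clicks
        # Labeled examples for the next training iteration of the ranking model.
        training_examples = [(query, doc, doc in selected) for doc, _ in candidates]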
75. A system, comprising:
a data storage service implemented by a first one or more electronic devices for storing data of a user; and
a model management service implemented by a second one or more electronic devices, the model management service comprising instructions that, when executed, cause the model management service to:
perform a search on the data of the user for a search query using a machine learning model to generate results,
generate a confidence score for the results of the search,
select an appropriate subset of the data to provide to the user based on the confidence score,
display the appropriate subset of the data to the user,
receive, from the user, an indication of one or more portions of the appropriate subset of the data for a next training iteration of the machine learning model, and
perform the next training iteration of the machine learning model using the one or more portions of the appropriate subset of the data.
76. The system of example 75, wherein the appropriate subset of the data is a plurality of candidate documents for the search query.
77. The system of example 76, wherein said displaying the plurality of candidate documents comprises displaying to the user a respective link for each of the plurality of candidate documents.
78. The system of any of examples 75-77, wherein the appropriate subset of the data is a plurality of candidate answers to the search query.
79. The system of any of examples 75-78, wherein the displaying the plurality of candidate answers includes displaying, for each of the candidate answers, a respective paragraph with an appropriate subset of the respective paragraph highlighted as the candidate answer.
80. The system of any of examples 75-79, wherein the displaying the appropriate subset of the data is in response to the confidence score being less than a confidence threshold regarding its relevance to the search query.
81. A computer-implemented method, comprising:
training a language machine learning model on a first one or more documents comprising known question-answer pairs to predict a question from answers in the first one or more documents;
receiving a second one or more documents from a user;
generating a set of question-answer pairs from the second one or more documents from the user using the language machine learning model; and
storing the set of question-answer pairs generated from the second one or more documents from the user.
82. The computer-implemented method of example 81, wherein the training comprises training the language machine learning model to predict each successive word of a known question from a known answer of its known question-answer pair.
83. The computer-implemented method of example 81, further comprising training a machine learning model using the set of question-answer pairs generated from the second one or more documents.
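One way to read the training described in examples 81-82 is as a next-word objective: conditioned on a known answer, the model learns to emit each successive word of the known question, terminated by a question-end token (see also examples 86-87). The sketch below only frames how such training pairs could be constructed; the token name and whitespace tokenization are illustrative assumptions.

    # Illustrative only: build (context, target) pairs for a next-word objective
    # from one known question-answer pair.
    END_OF_QUESTION = "<eoq>"   # hypothetical question-end token

    def teacher_forcing_examples(question: str, answer: str):
        """Yield (context, target) pairs: answer text plus question prefix -> next word."""
        q_tokens = question.lower().split() + [END_OF_QUESTION]
        for i, target in enumerate(q_tokens):
            context = (answer, q_tokens[:i])
            yield context, target

    pair = ("How often is key material rotated?",
            "Automatic key rotation generates new key material every year.")
    for context, target in teacher_forcing_examples(*pair):
        print(context, "->", target)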
84. A computer-implemented method, comprising:
receiving one or more documents from a user;
generating a set of question-answer pairs from the one or more documents from the user using a machine learning model trained to predict questions from answers; and
storing the set of question-answer pairs generated from the one or more documents from the user.
85. The computer-implemented method of example 84, further comprising training the machine learning model using known question-answer pairs to predict the question from the answer.
86. The computer-implemented method of example 85, wherein the training includes training the machine learning model to predict each successive word of a known question from a known answer of the known question-answer pair thereof.
87. The computer-implemented method of example 86, wherein the training includes training the machine learning model to predict a question end token for the known question from the known answer.
88. The computer-implemented method of any of examples 84-87, wherein generating the set of question-answer pairs from the one or more documents from the user using the machine learning model comprises generating a plurality of questions for a single answer of at least one of the set of question-answer pairs from the one or more documents from the user.
89. The computer-implemented method of any of examples 84-88, further comprising training a second machine learning model using the set of question-answer pairs generated from the one or more documents to determine one or more top-ranked answers from the user's data for a search query from the user.
90. The computer-implemented method of any of examples 84-89, further comprising displaying the one or more top-ranked answers to the user.
91. The computer-implemented method of any of examples 84-90, further comprising training a second machine learning model using the set of question-answer pairs generated from the one or more documents to determine one or more top-ranked documents from the user's data for a search query from the user.
92. The computer-implemented method of example 91, further comprising displaying the one or more top-ranked documents to the user.
93. The computer-implemented method of any of examples 84-92, further comprising training a second machine learning model using the set of question-answer pairs generated from the one or more documents to determine one or more top-ranked paragraphs from the user's data for a search query from the user.
94. The computer-implemented method of example 93, further comprising displaying the one or more top-ranked paragraphs to the user.
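At generation time, examples 84-88 can be pictured as a decoding loop that, conditioned on an answer taken from a user's document, emits question words until a question-end token appears, possibly producing several questions per answer. Everything below, including the canned stand-in model, is a hedged illustration rather than the patented method.

    # Illustrative only: greedy decoding of a synthetic question for one answer
    # paragraph, stopping at a question-end token (cf. examples 86-88).
    END_OF_QUESTION = "<eoq>"   # hypothetical question-end token
    MAX_QUESTION_LEN = 32

    def predict_next_token(answer, question_so_far):
        """Stand-in for a trained language model conditioned on the answer."""
        canned = ["how", "often", "is", "key", "material", "rotated", END_OF_QUESTION]
        return canned[min(len(question_so_far), len(canned) - 1)]

    def generate_question(answer):
        tokens = []
        while len(tokens) < MAX_QUESTION_LEN:
            token = predict_next_token(answer, tokens)
            if token == END_OF_QUESTION:
                break
            tokens.append(token)
        return " ".join(tokens)

    answer_paragraph = "Automatic key rotation generates new key material every year."
    qa_pairs = [(generate_question(answer_paragraph), answer_paragraph)]
    print(qa_pairs)  # stored pairs can later train a ranking model (examples 89-94)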
95. A system, comprising:
a document storage service implemented by a first one or more electronic devices for storing one or more documents from a user; and
a training data generation service implemented by a second one or more electronic devices, the training data generation service comprising instructions that, when executed, cause the training data generation service to:
receiving the one or more documents from the user,
generating a set of question-answer pairs from the one or more documents from the user using a machine learning model trained to predict questions from answers, and
storing the set of question-answer pairs generated from the one or more documents from the user.
96. The system of example 95, wherein the training data generation service includes instructions that, when executed, cause the training data generation service to train the machine learning model using known question-answer pairs to predict the question from the answer.
97. The system of example 96, wherein the training data generation service includes instructions that, when executed, cause the training data generation service to train the machine learning model to predict each successive word of a known question from a known answer of its known question-answer pair.
98. The system of example 97, wherein the training data generation service includes instructions that, when executed, cause the training data generation service to train the machine learning model to predict a question end token for the known question from the known answer.
99. The system of any of examples 95-98, wherein the training data generation service generates a plurality of questions for a single answer of at least one of the set of question-answer pairs from the one or more documents from the user.
100. The system of any of examples 95-99, further comprising a model building service implemented by a third one or more electronic devices, the model building service comprising instructions that, when executed, cause the model building service to train a second machine learning model using the set of question-answer pairs generated from the one or more documents to determine one or more top-ranked answers from the user's data for a search query from the user.
It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (15)

1. A computer-implemented method, comprising:
receiving a document search query;
querying at least one index based on the document search query to identify matching data;
acquiring the identified matching data;
determining one or more of a top-ranked paragraph and a top-ranked document from a set of documents based on one or more calls to one or more machine learning models based at least on the obtained identified matching data and the document search query by:
determining an appropriate subset of documents from the identified set of documents,
identifying a set of paragraphs from the appropriate subset of documents based on the document search query,
determining an appropriate subset of paragraphs from the identified set of paragraphs,
determining the top-ranked paragraph from the appropriate subset of paragraphs; and
returning one or more of the top-ranked paragraph and the appropriate subset of documents.
2. The computer-implemented method of claim 1, wherein, of the one or more machine learning models, determining the appropriate subset of documents from the identified set of documents is performed using a first machine learning model, determining the appropriate subset of paragraphs from the identified set of paragraphs is performed using a second machine learning model, and determining the top-ranked paragraph from the appropriate subset of paragraphs is performed using a third machine learning model.
3. The computer-implemented method of claim 2, wherein the first machine learning model is a deep cross network model, and the second and third machine learning models are bidirectional encoder representations from transformers (BERT) models.
4. The computer-implemented method of any of claims 1-3, wherein returning one or more of the top-ranked paragraph and the appropriate subset of documents comprises displaying the one or more of the top-ranked paragraph and the appropriate subset of documents.
5. The computer-implemented method of any of claims 1-4, wherein the at least one index comprises at least one index for paragraphs and at least one index for documents.
6. The computer-implemented method of any of claims 1-5, wherein the document search query includes a question to be answered and a result includes an answer to the question.
7. The computer-implemented method of any of claims 1-6, wherein the returned one or more items include the top-ranked paragraph and the appropriate subset of documents.
8. The computer-implemented method of any of claims 1-7, wherein the documents include at least a word processing document, a text file, and a PostScript-based file.
9. The computer-implemented method of any of claims 1-8, wherein the documents are accessible to a provider network.
10. The computer-implemented method of any of claims 1-9, wherein the appropriate subset of documents returned is organized as an ordered list of documents.
11. The computer-implemented method of any of claims 1-10, further comprising:
receiving feedback regarding the returned one or more of the top-ranked paragraph and the appropriate subset of documents.
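Claims 1-3 describe a staged pipeline: an index lookup identifies matching documents, a first model narrows them to an appropriate subset, a second model narrows the extracted paragraphs, and a third model selects the top-ranked paragraph. The compact Python sketch below is an assumption-laden illustration of that flow; the scorer interfaces, the keyword-style index lookup, and the paragraph splitting are placeholders, not the claimed implementation or any particular API.

    # Illustrative only: a three-stage ranking flow over an in-memory "index".
    from typing import Callable, Dict, List, Optional, Tuple

    def query_documents(query: str,
                        doc_index: Dict[str, str],                 # doc_id -> full text
                        doc_scorer: Callable[[str, str], float],   # e.g., a deep cross network
                        paragraph_scorer: Callable[[str, str], float],  # e.g., a BERT-style ranker
                        answer_scorer: Callable[[str, str], float],     # e.g., a BERT-style reader
                        k_docs: int = 10,
                        k_paragraphs: int = 5) -> Tuple[Optional[str], List[str]]:
        # 1. Query the index to identify matching data (naive keyword match here).
        matches = {d: t for d, t in doc_index.items()
                   if any(w in t.lower() for w in query.lower().split())}
        # 2. First model: keep an appropriate subset of documents.
        top_docs = sorted(matches, key=lambda d: doc_scorer(query, matches[d]),
                          reverse=True)[:k_docs]
        # 3. Identify paragraphs from that subset of documents.
        paragraphs = [p for d in top_docs
                      for p in matches[d].split("\n\n") if p.strip()]
        # 4. Second model: keep an appropriate subset of paragraphs.
        top_paragraphs = sorted(paragraphs, key=lambda p: paragraph_scorer(query, p),
                                reverse=True)[:k_paragraphs]
        # 5. Third model: select the top-ranked paragraph from that subset.
        best = max(top_paragraphs, key=lambda p: answer_scorer(query, p), default=None)
        return best, top_docs

    # Toy usage with a shared overlap-count scorer standing in for all three models.
    overlap = lambda q, t: float(len(set(q.lower().split()) & set(t.lower().split())))
    best, docs = query_documents(
        "rotate a kms key",
        {"kms-guide": "Rotate a KMS key on a schedule.\n\nKey material is replaced yearly."},
        overlap, overlap, overlap)
    print(best, docs)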
12. A system, comprising:
a data storage for storing a set of documents; and
a search service implemented by one or more electronic devices, the search service comprising instructions that, when executed, cause the search service to:
receive a document search query,
query at least one index based on the document search query to identify matching data,
obtain the identified matching data,
determine one or more of a top-ranked paragraph and a top-ranked document from the set of documents based on one or more calls to one or more machine learning models based at least on the obtained identified matching data and the document search query to:
determine an appropriate subset of documents from the identified set of documents,
identify a set of paragraphs from the appropriate subset of documents based on the document search query,
determine an appropriate subset of paragraphs from the identified set of paragraphs, and
determine the top-ranked paragraph from the appropriate subset of paragraphs, and
return one or more of the top-ranked paragraph and the appropriate subset of documents.
13. The system of claim 12, wherein, of the one or more machine learning models, determining the appropriate subset of documents from the identified set of documents is performed using a first machine learning model, determining the appropriate subset of paragraphs from the identified set of paragraphs is performed using a second machine learning model, and determining the top-ranked paragraph from the appropriate subset of paragraphs is performed using a third machine learning model.
14. The system of any of claims 12-13, wherein the document search query includes a question to be answered and a result includes an answer to the question.
15. The system of any of claims 12-14, wherein the documents include at least a word processing document, a text file, and a PostScript-based file.
CN202080094631.8A 2019-11-27 2020-11-24 System, device and method for document query Pending CN115004175A (en)

Applications Claiming Priority (11)

Application Number Priority Date Filing Date Title
US16/698,080 US11475067B2 (en) 2019-11-27 2019-11-27 Systems, apparatuses, and methods to generate synthetic queries from customer data for training of document querying machine learning models
US16/697,964 US11314819B2 (en) 2019-11-27 2019-11-27 Systems, apparatuses, and method for document ingestion
US16/697,979 2019-11-27
US16/697,964 2019-11-27
US16/698,027 US20210158209A1 (en) 2019-11-27 2019-11-27 Systems, apparatuses, and methods of active learning for document querying machine learning models
US16/697,979 US11526557B2 (en) 2019-11-27 2019-11-27 Systems, apparatuses, and methods for providing emphasis in query results
US16/698,080 2019-11-27
US16/697,948 2019-11-27
US16/698,027 2019-11-27
US16/697,948 US11366855B2 (en) 2019-11-27 2019-11-27 Systems, apparatuses, and methods for document querying
PCT/US2020/061947 WO2021108365A1 (en) 2019-11-27 2020-11-24 Systems, apparatuses, and methods for document querying

Publications (1)

Publication Number Publication Date
CN115004175A true CN115004175A (en) 2022-09-02

Family

ID=74046127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080094631.8A Pending CN115004175A (en) 2019-11-27 2020-11-24 System, device and method for document query

Country Status (3)

Country Link
EP (1) EP4062295A1 (en)
CN (1) CN115004175A (en)
WO (1) WO2021108365A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7185001B1 (en) * 2000-10-04 2007-02-27 Torch Concepts Systems and methods for document searching and organizing
US20090327266A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Index Optimization for Ranking Using a Linear Model
CN104216913A * 2013-06-04 2014-12-17 SAP SE Question answering framework
US9330084B1 (en) * 2014-12-10 2016-05-03 International Business Machines Corporation Automatically generating question-answer pairs during content ingestion by a question answering computing system
CN109219811A (en) * 2016-05-23 2019-01-15 微软技术许可有限责任公司 Relevant paragraph searching system
US20190311064A1 (en) * 2018-04-07 2019-10-10 Microsoft Technology Licensing, Llc Intelligent question answering using machine reading comprehension

Also Published As

Publication number Publication date
WO2021108365A1 (en) 2021-06-03
EP4062295A1 (en) 2022-09-28

Similar Documents

Publication Publication Date Title
US11475067B2 (en) Systems, apparatuses, and methods to generate synthetic queries from customer data for training of document querying machine learning models
US11366855B2 (en) Systems, apparatuses, and methods for document querying
US10678835B2 (en) Generation of knowledge graph responsive to query
US11314819B2 (en) Systems, apparatuses, and method for document ingestion
US11321329B1 (en) Systems, apparatuses, and methods for document querying
US11403356B2 (en) Personalizing a search of a search service
US20180293302A1 (en) Natural question generation from query data using natural language processing system
CN105550206B Version control method and device for structured query statements
JP2020515944A (en) System and method for direct in-browser markup of elements in Internet content
US20210342541A1 (en) Stable identification of entity mentions
US20200257679A1 (en) Natural language to structured query generation via paraphrasing
US11514124B2 (en) Personalizing a search query using social media
US10719529B2 (en) Presenting a trusted tag cloud
US20230090050A1 (en) Search architecture for hierarchical data using metadata defined relationships
US20160110459A1 (en) Realtime Ingestion via Multi-Corpus Knowledge Base with Weighting
US20220253719A1 (en) Schema augmentation system for exploratory research
EP3961426A2 (en) Method and apparatus for recommending document, electronic device and medium
US10963686B2 (en) Semantic normalization in document digitization
US11526557B2 (en) Systems, apparatuses, and methods for providing emphasis in query results
US10776408B2 (en) Natural language search using facets
US20210158209A1 (en) Systems, apparatuses, and methods of active learning for document querying machine learning models
US9898467B1 (en) System for data normalization
CN115329753B (en) Intelligent data analysis method and system based on natural language processing
CN115004175A (en) System, device and method for document query
US20170322970A1 (en) Data organizing and display for dynamic collaboration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination