WO2019229769A1

WO2019229769A1 - An auto-disambiguation bot engine for dynamic corpus selection per query

Info

Publication number: WO2019229769A1
Application number: PCT/IN2019/050415
Authority: WO
Inventors: Sanjeev THOTTAPILLY; Animesh SAMUEL
Original assignee: Thottapilly Sanjeev; Samuel Animesh
Priority date: 2018-05-28
Filing date: 2019-05-28
Publication date: 2019-12-05

Abstract

A method for communicating with an auto-disambiguation bot engine for dynamic corpus selection per query, comprising: by a Question Generation model (QGM), generating a query (201), said query associated with a context token (301) and an answer token (302), said context token and said answer token fed with an answer (301) which is an output (210) of a Question Answer model (QA), said answer being selected from a dynamic corpus of documents and from data extracted from said dynamic corpus of documents; and by a Question Answer model (QA), receiving query words (201) from said Question Generation model (QA) and determining a context (202) of said query (201) and applying said determined context to obtain an output (210) answer from corpus, said corpus comprising contextually selected documents and contextually selected extracts, said output answer having the highest probabilistic score of context which matches with said input query (201).

Description

AN AUTO-DISAMBIGUATION BOT ENGINE FOR DYNAMIC CORPUS SEUECTION PER QUERY

FIEUD OF THE INVENTION:

This invention relates to the field of computer networks.

Particularly, this invention relates to a hot engine using a dynamic corpus per query.

BACKGROUND OF THE INVENTION:

Conventionally, the term, ‘hot’, has been used in the context of computational engines. Typically, bots are computer programs which perform automated tasks. In recent times, these messaging systems have enabled bots which are intelligent bots capable of improving computer process and further capable of applying intelligence to defined tasks in order to cause some translational effect associated nodes in the network of the messaging system.

With the evolution of smart computing systems including mobile computing systems and with the interfacing messaging systems on such computing and mobile systems, along with their ubiquitous deployments - users and organizations have been mobile in nature. This means that a user or an aspect of an organization can be accessed from any location and can also be monitored from any location.

With the advent of smart devices and such technological devices, human interaction is moving towards being guided by artificial intelligence and reduction in human labour. With the assistance of such technology, computational prowess can be used to enhance a user’s experience. Furthermore, it is to be needed that data can be embedded or carried in a variety of formats or platforms. Therefore, there is a need for a system, method, and engine which identifies such data irrespective of the formats or platforms.

OBJECTS OF THE INVENTION:

An object of the invention is to provide a system, method, and engine which identifies data irrespective of the formats or platforms.

Another object of the invention is to provide a system, method, and bot engine to accurately identify data irrespective of the formats or platforms.

Yet another object of the invention is to provide a system, method, and bot engine to robustly identify data irrespective of the formats or platforms.

Still another object of the invention is to provide a system, method, and bot engine to quicky identify data irrespective of the formats or platforms.

SUMMARY OF THE INVENTION:

According to this invention, there is provided a method for communicating with an auto-disambiguation bot engine for dynamic corpus selection per query, said method comprising the steps of:

by means of a Question Generation model, at a query-generation node, generating a query, as an output of a beam search, said query being associated with a context token and an answer token, said context token and said answer token being fed with at least an answer which is an output of a Question Answer model, at a question-answer node, said answer being selected from a dynamic corpus of documents and from data extracted from said dynamic corpus of documents; and by means of a Question Answer model, at a question-answer node, receiving query words from said Question Generation model and determining a context of said query and applying said determined context to obtain an output answer from said dynamically formed corpus of documents, said corpus comprising contextually selected documents and contextually selected extracts from said contextually selected documents, said output answer having the highest probabilistic score of context which matches with said input query.

Typically, said Question Generation model comprising a method of generating a query, said method comprising the steps of:

parsing input query, at an input node of a network of nodes;

capturing constituent keywords, at a keyword classification node, from said parsed input query and to classify said captured keywords based on pre-defmed classification parameters;

identifying, at a training corpus node, with respect to said parsed query, word embeddings, character embeddings, and sentence embeddings on a training corpus, each of said embeddings being an embeddings’ matrix of index vectors with respect to feature vectors;

feeding at least a Bidirectional Long-Short Term Memory with word embedding and character embedding and an output layer which is generated is a concatenated matrix of word embeddings and character embeddings in order to correctly classify the context of the query;

extracting context is extracted from probable answers from a dynamically determined / formed corpus; feeding at least a Bidirectional Long-Short Term Memory node with word embedding and character embedding to provide an output layer which is a concatenated matrix of word embeddings and character embeddings in order to correctly classify context of said query, said context being a matrix of contexts comprising weight-assigned vectors for said input query;

extracting Part of Speech from said query;

extracting Question Type, of said query, to check if Question type matches with expected answer types so that it increases a confidence score of an answer for that query;

determining a Self- Attention vector, in order to define where attention is, for each answer selected from said corpus, for that query;

determining an Attention vector, in an answer, for that query, for that context vector, to provide most relevant answer(s), having highest probabilistic score, as an output;

multiplying output of Bidirectional Long-Short Term Memory, for query, by output of said Self- Attention vector to provide a clear sequence of words to help identify context in a query to provide a query-context vector;

multiplying output of Bidirectional Long-Short Term Memory, for answers, is multiplied by output of said Self- Attention vector to provide a clear sequence of words to help identify context in an answer to provide an answer-context vector; feeding at least a Bidirectional Long-Short Term Memory node with a plurality of answer-context vectors per query-context vector, each of said vectors to be matched with said attention vector in order to find a match between an answer and a query, said answer having a highest probabilistic score for that query, in terms of context vector and in terms of attention vector; and

outputting said answer having highest probabilistic score. Typically, said Question Answer model (QA) comprising a method of providing an answer), said method comprising the steps of:

extracting a Context Token at the time of training;

extracting an Answer Token at the time of training;

determining a Self- Attention vector, where attention is, for each context for each set of answers, for each query;

determining an Attention vector, where attention should, in an answer, be for each query, for that context, to provide most relevant answer(s) as an output, wherein output of Answer Token along with output of Self- Attention provides the output of attention;

checking Conditional Encoding for context token with respect to answer token;

providing as an output of Attention along with output of conditional encoding a word-embedding and character-embedding output for every input which is then given to a series of BiLSTMs (Bidirectional Long-Short Term Memory) to check for context matching of answer with respect to query;

feeding the series of Bidirectional Long-Short Term Memory to check for context matching of answer with respect to query; and

by means of Beam Searching, checking with evidence scorer to output a predicted corpus of answers per query based on context, each answer having a probabilistic score per query; and

providing as an output, query words which is most pertinent to a query in terms of context from a dynamically formed corpus.

According to this invention, there is provided an auto-disambiguation bot engine for dynamic corpus selection per query, said bot engine comprising:

a query inputter configured to allow a user to input a query; a data collector configured to collect electronic documents;

extractor configured to extract documents of relevance without losing information and to format these documents;

a crawler configured to crawl content of each of said extracted documents; a query parser configured to parse and correct said input query;

a keyword classifier is configured to capture keywords from said parsed input query to classify keywords based on pre-defmed classification parameters; an NLP extraction module, through an NLP node, configured to act on parsed query to identify features (question type feature, named entity relationship feature, common sense path feature, part-of-speech tags);

embeddings configurator configured to represent captured keywords, from said input query, in numeral format in terms of its constituent extracted features (question type feature, named entity relationship feature, common sense path feature, part-of-speech tags);

a dependence tree configurator configured to define structural relationship between different words in parsed input query in order to output at least question types, at least named entity relationship words, at least common sense paths;

a Question Generation Model configured to generate queries in relation to said input query and said documents to form corpus which is a dynamically formed per query, thereby providing the most accurate results

a Question Answer Model configured to providing as an output, query words which is most pertinent to a query in terms of context from a dynamically formed corpus, said query words being fed to said Question Generation Model for training; and

an output node configured to output an answer from a dynamically formed corpus, the corpus comprising contextually selected documents and contextually selected extracts from these contextually selected documents, the answer having the highest probabilistic score of context which matches with the query input.

Typically, said engine comprising a tagger configured to classify and tag documents, based on pre-defmed parameters, in a relational database.

Typically, said extractor comprising multiple extractors selected from a group of extractors consisting of a PDF extractor for extracting PDF documents, a word extractor for extracting data from word documents, an Excel extractor for extracting content of Excel sheets with multiple sheets which may have different column and row span, a HTML extractor for extracting main content of a web page and also links of links, a plain text extractor for extracting text.

Typically, said crawler engaging with a hierarchy definition mechanism configured to understand and define hierarchy, with respect to content, in each extracted document.

Typically, said extractor comprising a text extractor configured to extract text from said crawled documents.

Typically, said query parser comprising a translator to receive input query in any language and to translate said input query as configured by said engine.

Typically, said query parser comprising an acronyms substitutor configured to identify acronyms from the parsed input query and to fetch substitutive full words correlative to the identified acronyms from a pre-fed acronyms vocabulary database. Typically, said query parser comprising an synonyms substitutor configured to identify synonyms from the parsed input query and to fetch substitutive full words correlative to the identified synonyms from a pre-fed synonyms vocabulary database.

Typically, said embeddings configurator being configured to process word embeddings created on training data for pre-configured languages to represent keywords in a numerical form, characterised in that, said word embeddings being vectors of [m x n] where m represents number of words and n represents the number of features tuned to represent a single word, where a first axis defines an index of the word and a second axis defines its features.

Typically, said embeddings configurator being configured to process character embeddings in terms of vectors of [a x b] where a represents number of characters and b represents the number of features tuned to represent a single character, where a first axis defines an index of the word and a second axis defines its features.

Typically, said embeddings configurator being configured to process sentence embeddings formed by taking representation of a word from a word embedding matrix and representation of a character from a character embedding matrix and combing said word representation and said character representation to provide a concatenated matrix.

Typically, said embeddings configurator being configured to process context cum query embedding formed by taking representation of a word from the word embedding matrix and representation of a character from character embedding matrix and combing both.

Typically, said NLP extraction module engages with a dynamic feature extractor (DFE) configured to extract features from the text based input query.

Typically, said NLP extraction module engages with a dynamic feature extractor configured to extract features from the text based input query, characterised in that, said dynamic feature extractor using embedding, POS tags, dependence tree configurator features together to determine a class label of an extracted keyword from an input query and its confidence score.

Typically, said NLP extraction module comprising a POS (part of speech) tagger configured to find out POS tags for words in parsed input query.

Typically, said dependence tree configurator comprises a focus extractor configured to pin point at least a focus word from the parsed input subject matter.

Typically, said dependence tree configurator comprises a focus extractor configured to pin point at least a focus word from the parsed input subject matter, characterised in that, features used to train are word embeddings, parts of speech tags, Named Entity Relationship.

Typically, said dependence tree configurator comprises an attribute identifier configured to identify attribute words from the parsed input subject matter. Typically, said dependence tree configurator comprises an attention mechanism used to generate a score with a softmax function giving a confidence score of an attribute being present on not in a current input.

Typically, said dependence tree configurator comprises an attribute identifier configured to identify attribute words from the parsed input subject matter, characterised in that, features used to train are word embeddings, parts of speech tags, Named Entity Relationship.

Typically, said dependence tree configurator comprises a dynamic feature extractor configured to extract NER (named entity relationship) words from the parsed input subject matter.

Typically, said dependence tree configurator comprises a dynamic feature extractor configured to extract NER (named entity relationship) words from the parsed input subject matter, characterised in that, features used to train are word embeddings, character embeddings, part-of- speech tags; to determine a class label of the keyword and a confidence score.

Typically, said dependence tree configurator comprises a common sense extractor configured to extract common sense for all words from the parsed input subject matter other than stop words and most frequently used words from a corpus.

Typically, said dependence tree configurator comprises a question type extractor configured to determine type of question that is input. Typically, focus word embeddings, attribute word embeddings, named entity relationship word embeddings, common sense embedding are embeddings where representation is only taken where the feature words are present; rest of everything is masked and has a value 0.0.

Typically, Question type embedding, parts of speech tags embedding, Named Entity Relationship embedding, Common Sense embedding are embeddings where representation is only taken where the feature words are present; rest of everything is masked and has a value 0.0.

Typically, said engine comprising a dis-ambiguity resolver in order to detect ambiguity and to further resolve it based on determined context.

Typically, said engine comprising a relevancy mapper to map relevancy of said input query within a corpus of documents, characterised in that, said input query is ranked and / or mapped for relevancy based on classified set of documents to form a corpus which is a dynamically formed per query.

Typically, said engine comprising a ranking mechanism [comprising a document ranker, a paragraph ranker, and a sentence ranker] configured to rank documents and paragraphs within each document.

Typically, said engine comprising a sentence ranking mechanism configured rank sentences within a paragraph / document based on a similarity matrix.

Typically, said engine comprising an evidence scorer configured to apply evidence-based confidence scoring on question type, answer type, semantic relationships; all based on localized features trained on model in order to rank outputs based on question type, answer type, and question similarity between generated and asked query.

Typically, said engine comprising a training module configured to evaluate accuracy of an output with respect to a query and to receive feedback to tune parameters on accuracy over a validation data set to make the engine, of this invention, understand new features.

Typically, said question-answer (QA) model is fed with word embedding and character embedding of an input query along with extracted common sense feature to calculate the accuracy of the engine such that embedding for each path of commonsense is obtained and self-attention is applied to it, which is further combined with the Bidirectional Long-Short Term Memory output.

Typically, said engine comprising a document ranking model which is trained on transformers with self-attention with positional embedding.

Typically, said question generation model (QGM) is trained on corpus data in order to generate queries and to apply self-attention and attention to corpus data in order to generate other query words.

Typically, said question generation model (QGM) comprising a beam search such that an output generated from Beam Search during query generation would be probabilities of all the vocabulary words in order to obtain a contextually relevant query. BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS:

The invention will now be described in relation to the accompanying drawings, in which:

FIGURE 1 illustrates a schematic of the system of this invention;

FIGURE 2 illustrates a schematic block diagram of the question-answer model; FIGURE 3 illustrates a schematic block diagram of the question generation model; and

FIGURE 4 illustrates a schematic block diagram which combines the question- answer model of Figure 2 and the question generation model of Figure 3 to provide a most-pertinent output (i.e. an answer), in response to a query, from a dynamically formed corpus.

DETAILED DESCRIPTION OF THE ACCOMPANYING DRAWINGS:

According to this invention, there is provided a bot engine using a dynamic corpus per query. Documents may be in web format, in text format, in word format, in pdf format, and the like. A query is input subject matter, output of which is to be searched from the plurality of types of electronic documents.

For the purposes of this invention, the term,“corpus” is defined as a defined pool or resources from which data can be searches and / or fetched. The term“corpus” also covers content (extracted) from the pool of resources.

For the purposes of this invention, the term,“dynamic corpus per query” is a corpus which is dynamically formed per query. For the purposes of this invention, the term,“node” is defined as a connected object or device or application or data in a network. Typically, a node is defined by parameters such as its position, behaviour, and value. A node’s position and value defines the network behaviour. A node’s position defines its relativity with connected nodes and cumulatively defines the network behaviour.

For the purposes of this invention, the term,“network” is defined to mean a collection of nodes connected to each other. This connection may change based on determined / computed context.

In a computer architectural environment, a network of nodes is defined to achieve an output or a purpose. The configuration of these nodes is such that for each and every input, an output that is linearly causal to the input based on the processing parameters is defined. This is static in nature.

However, intelligent systems beg these environments and, hence, networks to be more dynamic in nature. In other words, the same input should not always provide the same output; the same input may lead to a variety of outputs based on context understanding or ‘context’ understanding. Upon understanding this context understanding or‘context’ understanding, the network of nodes shape-shifts and aligns itself to form a variant output even if input is seemingly the same. This is dynamic in nature.

For the purposes of this invention, the term,“context” is defined as a vector which assigns a specific behaviour, weight, direction, and associative capabilities to a node which means that the node’s position in a network is defined, a node’s association with its connected node is defined, a node’s relative position relative to associated nodes is defined, a node’s input is defined (thereby defining behaviour), and a node’s output is defined (thereby defining behaviour) by this context vector. A rule engine may define such context vectors as outputs.

In at least an embodiment of this invention, a“node” is defined by means of at least a context vector.

For the purposes of this invention, the term, “channels” is defined to mean connections between nodes. These channels form pathways which, essentially, lead to shape-shifting of the network so that based on “context”, the network architecture changes and the same set of inputs may cause a different context- specific output.

FIGURE 1 illustrates a schematic of the system (100) of this invention.

In accordance with an embodiment of this invention, there is provided a query inputter (QI) configured to allow a user to input a query to this system. Based on this query and context (which may be based on a previous query or previous output of the system), the system of this invention needs to form a dynamic corpus from which an answer / output / reply / response should be fetched and served.

In accordance with an embodiment of this invention, there is provided a data collector (DC) configured to collect electronic documents (D) and, therefore, in turn, data. The collection is of PDFs, word documents, text documents, note pads, excel sheets, and the like electronic documents. The data collector (DC), of this invention, is configured to collect / extract data, in electronic format or in HTML format or in plain text format from a plurality of sources. Electronic documents (D) can be obtained from the plurality of sources. In at least an embodiment, HTML scrapers are used to scrape HTML content from web pages. Additionally,

1. Documents can be directly uploaded onto the system;

2. Blob storage path can be configured where all electronic documents are present; and

3. Web URL links can be added on the system so that all documents from the link can be downloaded and stored with the system. Recursively, all the documents present on the link or link of links (if configured) can be downloaded.

A tagger (T) is configured to classify and tags these documents based on pre- defrned parameters, these parameters being languages, regions, user type, region, origin, and / or the like. These tagged documents are stored in a blob storage and the path for the stored documents is linked to a bot identity, a region identity, a language identity, a user-type identity, an origin identity, and the like; all in a relational database.

In accordance with another embodiment of this invention, there is provided an extractor (E) configured to extract documents of relevance without losing information and further formats these documents. Multiple extractors (E) are used to extract documents based on the type of the document. In at least an embodiment, a PDF extractor is used to extract PDF documents. PDF documents may have scanned pages, pages, content with tables, content as images, text with hyperlink; all of which are maintained as is by the PDF extractor. In at least an embodiment, a word extractor does the same as the PDF extractor for extracting data from word documents. In at least an embodiment, an Excel extractor is configured to extract the content of Excel sheets with multiple sheets which may have different column and row span. In at least an embodiment, a HTML extractor is used to extract main content of a web page and also links of links can be extracted if configured for the same domain. In at least an embodiment, a plain text extractor considers everything as text.

In accordance with another embodiment of this invention, there is provided a crawler (C) configured to crawl content of each of these extracted documents in order to find pre-defined patterns. These patterns may be acronyms, synonyms, focus words, disambiguities, and the like entities. The crawler is further configured to crawl content and engage with a hierarchy definition mechanism (HDM) configured to understand and define hierarchy in the document. Not only the text is extracted or crawled but also the content is understood at the time of extraction. Text may be classified and defined in hierarchical format as headings, sub- headings, children to sub-headings, and so on and so forth. Text with bullets or steps discussed in the text is understood.

E.g. if a text has a structure comprising a header, a sub-header, and subject matter; then, it is a two-level hierarchy.

Crawlers (C) use the extractor (E) output and maintain hierarchy between the content. This step is the most important step as important things associated with the content can be lost here. Content with heading / highlighted / bold is represented in a heading form. Also, in the same way, sub-headings are represented in a sub-heading form. Bullets and step-wise points are maintained; if ignored, it may appear as plain text and it may lose its meaning. Tables extracted are kept in a tabular form; if ignored, it will mean nothing in a text form and adds up noise to the content. Information regarding hyperlinks are kept as it is. Images are also maintained and kept as is. Keeping all the information, as it is, helps the system to present data in a structurally readable form as it is present in the original document. While extracting all this hierarchical content, Acronyms, Synonyms, Focus Words, Disambiguations are extracted. Keeping the content in its original form helps in finding Acronyms, Synonyms, Focus Words and Disambiguation extraction task. All the data for representing the content as is, and is stored in NoSQL database. NoSQL databases are robust and do not require any schema definition upfront which is feasible for the architecture of this system. The content extracted is to be stored as is and no schema holds true for different forms of documents.

In at least an embodiment, Excel documents are stored in column and row format with the heading and subheading as it is. Multiple sheets in the same Excel can also be extracted keeping the information of the sheet number and name as it is. NoSQL database is used to store the records in the database.

In at least an embodiment of the extractor (E), there is provided a text extractor (TE) configured to extract text from these crawled documents and uploaded on to a platform of this system. The text is crawled for web URLs. Content in the form of tables is also extracted and saved in a tabular format. All this structured information about the text is kept in NoSQL Databases for further use. HTML format is obtained from the crawled websites. A HTML extractor (HE) is used to extract the main content of the web page and also links of links can be extracted if configured for the same domain. Text in header tabs and footer tabs is ignored as it is not useful for answering queries. Metadata information is also stored with the crawled content. All the content here is stored in NoSQL dataset.

In accordance with another embodiment of this invention, there is provided a query parser (QC) configured to parse and correct a query in terms of language, grammar, syntax, and / or the like parameters. A query may be corrected by keeping acronyms untouched. Additionally, non-English words may be used in documents. Further, without changing the meaning of the query, correct words are substituted for misspelled words.

In at least an embodiment, a keyword classifier (KC) is configured to capture constituent keywords, at a keyword classification node, from input subject matter and to classify the keywords based on pre-defined classification parameters. The keyword classifier is an Artificial Neural Networks’ based classifier. Training data is entered into this classifier in order to extract potential contexts from the input data. In at least an embodiment, the system, of this invention, comprises a translator (T) to receive input subject matter in any language and to translate the input subject matter as configured by the system. Multiple languages are supported: local languages like Hindi, Marathi, Gujarati, Punjabi is possible with foreign languages like Spanish, German, French. Language is not a barrier, for this system, as user can switch the language used in a conversation; i.e. previous query asked may be in English and the language for the current query may be other language for e.g. Hindi / Spanish / Arabic.

In at least an embodiment, the query parser (QC) comprises an acronyms substitutor (AS) configured to identify acronyms (at the stage of parsing) in the text based input subject matter and to fetch substitutive full words correlative to the identified acronyms from a pre-fed acronyms vocabulary database. The acronym, therefore, become data resident on an acronym node

e.g. How many SL do I get a year

[Sick Leave]

Depending on use cases, users, while typing or inputting, do not always write readable keywords. Rather, they may use short forms. In order to correctly understand the meaning of a word, in a context, short form needs to be converted in their original form. This acronyms substitutor (AS) reduces the out of vocabulary words and improves the performance of the bot engine.

In at least an embodiment, the query parser (QC) comprises a synonyms substitutor (SS) configured to identify synonyms (at the stage of parsing) of extracted focus words in the text based input subject matter and to fetch corretlative synonyms to the extracted focus words from a pre-fed synonyms vocabulary database. The synonym, therefore, become data resident on a synonym node.

e.g. I am looking for a vehicle loan

[car]

Synonyms from an input are determined and appropriate synonym is chosen from a vocabulary and a user’s input is modified. In order to semantically improve the accuracy, synonyms are substituted from the vocabulary on which the bot engines are trained at the time of training. This phenomenon improves the understanding of a query. More confidence score is obtained in the context classification task by this method.

The bot engine, of this invention, is trained on training data, at a training corpus node, where the query and its context / intent are defined in multiple languages and the system is configured. Word embeddings (WE) is learnt during the training process and multiple languages can be used this way by the user to communicate with the system. A translator (T) module, irrespective of the language used, translates the sentence in an embedding (numerical form of sentence). The system uses these embedding rather than the keywords itself to make predictions. Embeddings are fed to the model to determine context of the query. In at least an embodiment of an embeddings configurator, word embeddings (WE) are created on the training data for pre-configured languages to represent keywords in a numerical form. Word embeddings are vectors of [1 x 300] where 300 represents the number of features tuned to represent a single word. If the vocabulary consists of 50,000 words, then the word embedding will be a representation of [50000 x 300] matrix, where a first axis defines an index of the word and a second axis defines its features. These representations of keywords are in numerical form which is used by the bot engine so that correct context can be identified for the training data set. Representation of words is tuned during the training process so that the bot engine is able to compute / determine a correct context. Confidence score if obtained between 0 to 1 defines the correctness of the context. Training is done on the input data (training data set) in multiple epochs (i.e. multiple times, the same data is fed to the model at different time step) and error is back propagated and the keyword embeddings are tuned to minimize the error in the confidence score.

Each of said embeddings is an embeddings’ matrix of index vectors with respect to feature vectors.

According to a non-limiting exemplary embodiment in respect of a word embedding matrix, if there are 75,000 words in a vocabulary of training set, then word embedding matrix will be of shape [75,000 x No. of dimensions]. No. of dimension defines the word in each dimension (it can also be called as word feature). In at least one embodiment, this dimension is 300. Therefore, the word embedding matrix is [75,000 x 300]. Words which are similar to each other will share similar features and will come close to each other and words which are not similar will fall apart from each.

In at least an embodiment of an embeddings configurator, character embeddings (CE) are vectors of [1 x 8] where 8 represents the features for a single character. [No. of character words x 8] matrix is the representation matrix, where the first axis defines the index of the character and second axis defines its features.

According to a non-limiting exemplary embodiment in respect of a character embedding matrix, if there are 75 different alphabets (A-Z a-z 0-9 , . ! ? ( ) { } $ & 8 # @ : ; ...) in a training set, then character embedding matrix will be of shape [75 x No. of dimensions]. No. of dimension defines the character in each dimension (it can also be called as character feature). In at least one embodiment, this dimension is 8. Therefore, the character embedding matrix is [75 x 8]. Characters which are more often near to each other in a sequence will be close to each other and vice versa.

In at least an embodiment of an embeddings configurator, sentence embedding is formed by taking representation of a word from the word embedding matrix and representation of a character from character embedding matrix and combing both.

In at least an embodiment of an embeddings configurator, context cum query embedding is formed by taking representation of a word from the word embedding matrix and representation of a character from character embedding matrix and combing both.

E.g: Query - Who wrote Harry Potter ? Context - Harry Potter is a British American namesake film series based on the eponymous novels by author J K Rowling

For the above example, embedding of all the words in a context and query is taken and kept in a matrix in a sequence and all the embeddings of character is taken and kept in a matrix in the same sequence in which they appear. Character embedding flows through a CNN and a projected matrix is concatenated with Word embedding. A limit of how many words can be present in a context and query is kept and the matrix is formed for a context and query which are of size [No of words limit x 300]. If the context has more than the limit words then those words are not considered and if the context has less no of words then the remaining space is padded with a special symbol [PAD], same way a query is also represented

(note: [PAD] token has a special word embedding)

Context:

Query:

Words embeddings (WE) as well as Character level embedding (CE) is learnt on a training corpus.

1. Word embeddings (WE) are the vectors of [1 x 300] where 300 represents the features for a single word. [No. of vocab words x 300] matrix is the representation matrix, where the first axis defines the index of the word and second axis defines its features. 2. Character embedding (CE) are the vectors of [1 x 8] where 8 represents the features for a single character. [No. of character words x 8] matrix is the representation matrix, where the first axis defines the index of the character and second axis defines its features.

3. Focus word embeddings (FW) is used. Word Embedding (WE) is substituted where the words are present and all the features in word embedding are 0.0 wherever it is absent.

4. Attribute word embeddings (AT) is used. Word Embedding (WE) is substituted where the words are present and all the features in word embedding are 0.0 wherever it is absent.

5. Named entity relationship word embeddings (NER) is used. Word Embedding (WE) is substituted where the words are present and all the features in word embedding are 0.0 wherever it is absent.

6. Common sense embedding (CS) is used. Word Embedding (WE) is substituted for a complete path.

With the help of CNN (convolution neural network), word embedding (WE) and character embedding (CE) are combined and used as a single feature. Adding character embedding solves the problem on unseen words as the characters in the unseen words are known. BiLSTMs are fed with the word and character embedding and the output layer is generated by the LSTM is combined with Focus word embedding, Attribute embedding and Named entity relationship word embeddings. Focus word embedding, Attribute embedding, Named entity relationship word embeddings, and common sense path embeddings are used as a separate feature and with the help of self attention LSTM output is combined with the masked Focus word embedding, Attribute embedding, Named entity relationship word embeddings and Common sense path embedding. Training is done over multiple epochs and classification error is calculated which is back propagated to the model over a different time step to learn and tune the model to correctly classify the context of said query, said context being a matrix of contexts comprising weight-assigned vectors for said input query. Commonsense (CS) is used to semantically improve the score of context, of answer, having a strong relation with the query and decrease the score where relationship is not present. Based on user feedback, model is fine-tuned with the negative feedback and fine tuning is done to correct the wrong identified contexts at a regular interval.

In at least an embodiment, the system of this invention comprises a dynamic feature extractor (DFE) configured to extract features from the text based input subject matter. In at least a non-limiting exemplary embodiment, the extracted feature list comprises dates, city, country, company name, person’s name, price, mobile number, email identities, domain names, and the like. Dynamic entities like employee identities which could be 16 digit alpha-numeric characters can also can be extracted once defined.

In other words, entities extracted could be:

dates (including tomorrow, yesterday, the day before etc)

« date ranges (since 2012, May 2015, 8/8/16 etc)

cities

countries

continents

company name

* organizations

person name mobile number

codes _* addresses

email ids

domain names / URLs

Features play a vital part in evidence scoring to find exact answers to the text based input subject matter.

Predefined entities as well as use-case defined entities are extracted by this extractor. If a user mentions his name or address or his phone number or an order id etc., this extractor extracts and uses these entities for determining the entities asks the user and the entities given to the user is of the same form. In order to get the details of the user, some verification needs to be done and for that, a user’s detail is asked by the bot engine. In order to verify the data provided by the user, the dynamic feature extractors are trained. Correct features are used to communicate with background services to get details from the database.

The dynamic feature extractor is configured to identity a plurality of different entities trained on a large corpus. According to an exemplary embodiment, up to 128 different entities are trained on a large corpus. User-defined entities can also be trained depending on use-case scenarios. At the time of creation of bot engine, entities are marked on the platform and label for the entities is provided and these marked entities are used to train the model which are known as user defined entities. With the help of fine tuning (type of transfer learning), new entities are trained on the already available multiple (e.g. 128) entities pre-trained model. BIO type of representation is created of the training data, where B is the Begi nni ng part of the entity, I is the Inside part of the entity and O is the non-entity. Model uses embedding + POS tags + DTC features together to determine the class label of the keyword and the confidence score. Two rules are used to determine the entity starting and end are:

1. Entities cannot be more than five words long; and

2. Second prediction pointing as the end of the entity should he always greater than the first prediction.

With these two predictions, a class label is also predicted to determine the class of entity

For example: The Godfather is a 1972 American crime film directed by Francis Ford Coppola and produced by Albert S. Ruddy , based on Mario Puzo 's best - selling novel of the same name .

Output of NER -

(0, 0.99,“Movie”,“The”)

(1, 0.97,“Movie”.“Godfather”)

(2, 0.99,“O”.“is”)

(3, 0.99,“0”,“a”)

(4, 0.99,“Q”,“1972”)

(5, 0.99,“O”,“American”)

(6. 0 95,“Genre”,“crime”)

(7, 0.99,“O”,“film”)

(8, 0.98,“Q”.“directed”)

(9, 0.99 ,“O”,“bv”)

0.96,“Director”,“Francis”)

( 11. 0.96.“Director”.“Ford”)

(12, 0.95,“Director”,“Coppola”)

(13, 0.99,“O”.“and”)

(14, 0 99,“O”,“produced”) (15, 0.99,“O”,“by”)

( 16, 0 98.“Producer”.“Albert”)

(17, 0.91,“Producer”.“S”)

(18, 0.99,“Producer”,“Ruddy”)

(19, 0.99,“O”,

(20, 0.99, “based”)

(21, 0.99,“O”,“on”)

(22, 0.99,“Author”,“Mario”)

(23 0.97,“Author”.“Puzo”)

(24, 0.92,“O”,“‘s”)

(25, 0.99,“O”,“best”)

(26. 0.99,“O”,“-”)

(27, 0.99,“O”,“selling”)

(28, 0.88,“Book”,“novel”)

(29, 0.99,“O”,“of’)

(30, 0.99,“O”,“the”)

(31, 0.99,“O”,“same”)

(32, 0 99,“O”,“name”)

(33, 0.99,“O”,

In at least an embodiment, the keyword classifier comprises an NLP extraction module (NLP), through an NLP node, configured to act on text based input subject matter. From the parsed structured tree network of nodes, the following features are extracted, using standard natural language extraction mechanisms:

1. Synonyms (used by synonym substitutor, defined further ahead)

2. Acronyms (used by acronym substitutor, defined further ahead)

3. POS tags (used by POS tagger, defined further ahead) 4. Named Entities Relationships (used by dependence tree configurator, defined further ahead)

5. Coreference Resolution (used by verifier and selector, defined further ahead)

6. Focus words (used by focus extractor, defined further ahead)

7. Attributes (used by attributes identifier, defined further ahead)

8. Common Sense Extraction (used by a common sense extractor, defined further ahead)

According to one embodiment of the NLP extraction module, there is a POS (part of speech) tagger (PST) configured to find out POS tags for words in the text based input subject matter (e.g - nouns, proper nouns, adjectives, adverbs, verbs, preposition, determiners, and the like).

Finding accurate parts of a speech is one of the features used by an context classifier to define contexts from a training data set. A tag is represented with extracted keywords, from the keyword classifier (KC), in the question. These features are also used in other feature extractions like Named Entity, Focus Words, Attributes, Corefence Resolution, Dependency Parse Tree (i.e. dependence tree network of nodes); all listed above. Named Entity, for example, uses a POS tag as the main features as the representation of keywords with the tags in sequence helps it to point to an entity with its appropriate class.

Some examples are as follows:

Tag - Description

CC - Coordinating conjunction

CD - Cardinal number

DT - Determiner EX - Existential there

FW - Foreign word

IN - Preposition or subordinating conjunction

JJ - Adjective

JJR - Adjective, comparative

JJS - Adjective, superlative

LS - List item marker

MD - Modal

NN - Noun, singular or mass

NNS - Noun, plural

NNP - Proper noun, singular

NNPS - Proper noun, plural

PDT - Predeterminer

POS - Possessive ending

PRP - Personal pronoun

PRP$ - Possessive pronoun

RB - Adverb

RBR - Adverb, comparative

RBS - Adverb, superlative

RPP - article

SYM - Symbol

TO - to

UH - Interjection

VB - Verb, base form

VBD - Verb, past tense

VBG - Verb, gerund or present participle

VBN - Verb, past participle VBP - Verb, non-3rd person singular present

VBZ - Verb, 3rd person singular present

WDT - Wh- determiner

WP - Wh-pronoun

WP$ - Possessive wh-pronoun

WRB - Wh-adverb

In at least an embodiment, a keyword classifier comprises a dependence tree configurator (DTC) configured to define structural relationship between different words in a text (of subject matter).

Some examples are as follows:

apply _relcl

for_prep

Loan_pobj

What_attr the__.de t to_aux Home_compound

A dependency parse tree (i.e. a dependence tree network of nodes) is formed for the input through input node which links keywords together with a relationship; the relationship being subject, object, predicates etc. A channel / links / bond determines the relationship in the keywords and helps the hot engine to understand semantics and context of the associated words. With dependence tree configurator, focus words (FW) and attributes (AT) are extracted which are used by the bot engine to determine the contexts and actions to be taken by the bot engine. Corefence resolution, for example, uses dependence tree configurator, as a key feature while training and determining correctly resolved pronouns with the entity link / channel.

In at least an embodiment, the dependence tree configurator (DTC) comprises a focus extractor (FE) configured to pin point at least a focus word (FW) from the text based subject matter. These focus words help identify context since, more often than not, they have the highest weightage in the text based input subject matter. The focus word, therefore, become data resident on a focus node

e.g. What are the documents required for Home Loan

[Focus Word]

Focus extractors are used as a feature by the bot engine to determine exactly what the input is all about. It is also used to determine the scope of an input. The bot engine, of this invention, needs to determine whether the input is within the scope of this bot engine to answer or out of scope for the bot engine. An attention mechanism is used to focus more on the given keywords in order to determine the focus word. For example, while reading a paragraph or quickly searching for an answer to a query, humans tends to put more focus on some part rather reading the complete paragraph text; this is known as attention (A). In the same way, based on identification of focus words, attention is applied to a neural model network to correctly pin point to a focus word within a question. Features used to train are word embeddings + parts of speech (POS) tags + Named Entity Relationship (NER) + dependence tree configurator (DTC).

In at least an embodiment, the dependence tree configurator (DTC) comprises an attribute identifier (AI) configured to identify attribute words of a text based input subject matter. The attribute identifier considers focus words whilst identifying attributes of these identified focus word(s). The attribute, therefore, become data resident on an attribute node. A correlative link, i.e. a correlative channel, is established between the extracted focus node and the identified attribute node.

e.g. What are the documents required for Home Loan

[Attribute]

Based on the dependence tree configurator (DTC), linkage of the focus entities (i.e. focus entities’ channel) and the attention mechanism, attribute entities are determined which say what attributes are asked about in a given message. Attributes may or may not be present depending on the type of input to the bot engine.

The attention mechanism is used to generate a score with a softmax function (SF) giving the confidence score of an attribute being present on not in a current input. Features used to train are word embeddings + Focus words (FW) + parts of speech (POS) tags + Named Entity Relationship (NER) + dependence tree configurator (DTC).

In at least an embodiment, the dependence tree configurator (DTC) comprises a question type extractor configured to determine type of question that is input. Question type embedding, parts of speech (POS) tags embedding, Named Entity Relationship embedding, Common Sense embedding are the embeddings where representation is only taken where the feature words are present; rest of everything is masked and has a value 0.0.

Sentence - Who wrote Harry Potter ?

POS - [ WP VBD NNP NNP .]

POS Matrix:

Sentence - Who wrote Harry Potter ?

Question Type - WH Type

Question Type Matrix:

Sentence - Who wrote Harry Potter ? NER - [ O O Person Person O ]

Named entity word Matrix:

Sentence - Who wrote Harry Potter ?

Common

Sense - [ O O [is_a Person] [is_a Person] O ]

Common Sense word Matrix:

In accordance with another embodiment of this invention, there is provided a dis- ambiguity resolver (DAR) in order to detect ambiguity and to further resolve it. Context is applied in such cases.

E.g. if a current user query is - What about Delhi

and the previous query was - What is the weather in Mumbai?

Then the current query is resolved into

What is the weather in Delhi?

[Mumbai] A parse tree is formed for the document which links keywords together with a relationship such as subject, object, and / or predicate. Links (bond) determines the relationship in the keywords helps the model to understand the structural relationship between the linked words. With DTC, Focus Words, and Attributes are extracted. Dis-ambiguity resolver uses DTC to correctly resolve disambiguous words with the corresponding headings and subheadings in an unstructured text. Sometimes in a write up pronouns are referred which needs to be disambiguated. Disambiguation DAR steps helps the model understand the context more clearly when it is seen as a single paragraph without the neighbouring text. Without it, multiple paragraphs may have the same meaning and multiple answers may hold true without its heading and subheading.

In accordance with another embodiment of this invention, there is provided a relevancy mapper (RM) to map relevancy of the query (Q) within a corpus of knowledge base of documents. An input query is ranked and / or mapped for relevancy based on classified set of documents. This helps assign the correct classification-based set of documents for the query. Based on the data stored in the NoSQL database, documents are labelled and classified based on user query. Relevant documents are further used in a pipeline to extract the answer out of it. In at least an embodiment, in order to extract answers from an Excel dataset, a user query is matched with the headings and subheadings for an answer. As the data is in structured form, an answer is presented as is. Queries with filter on different columns can also be answered.

According to this invention, this corpus is dynamically formed per query, thereby providing the most accurate results. In accordance with another embodiment of this invention, there is provided a ranking mechanism [comprising a document ranker, a paragraph ranker, and a sentence ranker] configured to rank documents and paragraph within a document. For unstructured documents, training is done on a dataset of query(ies) and the mapped documents as a single training example. For a paragraph ranker, a document model is trained to define ranking of a paragraph within a given set of paragraphs. A paragraph with a high similarity score than the other paragraphs is ranked higher. Highest two ranked paragraphs are considered and all the sentences from highest ranked paragraphs are extracted and fed further in a pipeline to a sentence ranking mechanism. Final output of the sentence ranker is fed to a QA model. Word embeddings is combined with the character embedding on both query and sentences also known as context. Commonsense path embedding, Named entity word embeddings, Question type embedding, and POS tag embedding are the additional features fed to the model.

In accordance with another embodiment of this invention, there is provided a sentence ranking mechanism configured rank sentences within a paragraph / document based on a similarity matrix. Sentence ranking, based on the similarity matrix, on an input query (Q) is formed and top ranked sentences are chosen and fed further in a pipeline to a question-answer (QA) model.

In accordance with another embodiment of this invention, there is provided a question-answer (QA) model which is fed with word embedding and character embedding of an input query and sentences. BiLSTMs are fed with these (word and character) embeddings and the output of BiLSTM is further used. Commonsense is fed having a relation between the word of a query and word in a sentence. Embedding for each path of commonsense is obtained and self-attention is applied to it, which is further combined with the BiLSTMs output. Features like NER, Question type, and POS tags are also used in the same manner and merged with the BiLSTMs output. Error is calculated on the model’s output and model is further trained to obtain the best possible weights to minimize the error cost. Training data, Validation data, and Test data is split in a ratio of 90:5:5. Where 90 % of data is used for training and 5% each is used for validation and test. Test data does not take part in training process and is only used to calculate the accuracy of the system.

In order to train the QA model for dynamic corpus per query model, multiple models are used. Models for document ranking, models for sentence ranking, question generation model, QA model, evidence scorer model and an outputter are the core part of the system. All the hierarchical data is stored in NoSQL database. Mapping of the sources of data is done with bot_id, region_id, language_id, usertype_id and origin_id in relational databases.

Document ranking model is trained on transformers with self-attention with positional embedding. Transformers, with self- attention, achieves better accuracy over a large size text data such as paragraphs. Positional embedding is used to learn the relationship between the two context of the paragraph even if the two related context entities are at a large distance from each other in a paragraph. Training data consist of a query with multiple paragraphs associated with it. Each paragraph holds a rank with it based on the query. Ranker model is trained with an objective of most relevant paragraph to appear at the top. Same way sentence ranker is trained to rank the sentences associated with the query. Document ranker output is fed to the sentence ranker to rank the sentences of the document found by the document ranker. In accordance with another embodiment of this invention, there is provided a question generation model (QGM), at a query-generation node. A model is trained on the span answer data in order to generate queries on unstructured data. This trained model is fed with the paragraph tokens and self-attention and attention is applied to it. The attention output is merged with the answer tokens embedding. Answer tokens embedding is also fed to conditional encoders and outputs of attention and conditional encoders are merged together which flows through a layer of BiLSTM network. With the help of beam search over the BiLSTMs output, query words are generated.

In at least an embodiment of the beam search, an output generated from Beam Search during query generation would be the probabilities of all the vocabulary words. So, based on the probability, n numbers of words are selected and then for each n word, a next word is predicted. At each step, score is multiplied with the probabilities together. The k sequences with the most likely probabilities are selected and all other candidates are pruned. And thus, a question is generated from the context.

For Example:

Who

wrote

Harry

Potter

Harry bank

the

bank

script

Patter

bank

Patter

Harry

book writer

□f

is

was has

the

a

Harry

A Question-Answer (QA) output score gives a probabilistic score of the context being the answer A for Query Q. Number of context is not a fixed number. Output is a normalized (softmax function based) output and its value is between 0 to 1

For the above example context 37 is the correct answer and has a probability of 0.75

In accordance with another embodiment of this invention, there is provided a training module (TM) configured to evaluate accuracy of an output with respect to a query and to receive feedback to tune parameters on accuracy over a validation data set to make the system, of this invention, understand new features.

In a non-limiting exemplary embodiment, a Q&A model is fed with features extracted in a previous step and English words are turned into 300-dimensional vectors. Two words which mean near about same things are near to each other in vector space. This way every word in a corpus has some representation and semantically words which means same come close to each other.

Corpus is broken into training data, validation data, and test data with a ratio of 90%, 5%, and 5% respectively. The training set used to train on validation set is used to determine the accuracy during training and learn different parameters at the time of training. The test set is for the final accuracy of the model.

Hyperparameters are tuned based on the nature of data set. Multiple Bidirectional LSTM (Long Short Term Memory) is used to train the network. Attention mechanism is applied over LSTMs output. Attention mechanism means what word in a sequence should be applied attention to achieve better accuracy.

While training, evaluation is done on validation data set and accuracy is found in the model which is used to tune the parameter at the time of training.

Transfer learning is done to make model understand new features and also reuse previous learning and previous features. With transfer learning, QA model becomes more accurate on the specific domain.

Top candidate answers are chosen from the QA model which is further processed to pick the correct answer.

Find focus words of query and check predicates and pull set of top candidate answers. In accordance with another embodiment of this invention, there is provided an evidence scorer (ES) configured to apply evidence-based scoring on question type, answer type, semantic relationships; all based on localized features trained on model. Evidence scorer is used to rank the output based on question type, answer type and question similarity between generated and asked query.

E.g. Types of Questions

1. Alternate

what is better, tea or coffee

2. Yes/No

Would it be possible to get a discount

3. Prepositional

To whom do I call for customer related queries

4. Indirect

Thank you. Do you know if there is a supermarket nearby

5. Wh type

Why is the price of crude oil decreasing

6. Rhetorical

Is the pope catholic?

7. Interrogative

Is your house ready for visitors

8. Emphatic

Where on earth have I put my wallet?

9. Question Tag Question

You love her, don't you?

Answer types expected -

1. Name of city

2. Name of country

3. Distance

4. Price

5. Description of something

6. Name of a person

7. Time

8. Reason

9. Event

The evidence scorer (ES) is a scorer configured to obtain a confidence score based on evidence present to prove that the answer obtained is correct. In evidence, type of question and type of answer is used with a similarity measure between the generated query out of the answer paragraph and the query asked is obtained. All the scores are combined with a softmax function which gives score to the candidate answer and re-rank the answer in a final step. More closer the generated query on the answer with the asked query increases the overall evidence score of the system. Question type matches with the expected answer types increases the confidence score of the answer.

In at least an embodiment, the system, of this invention, comprises an output node configured to output (OP) an answer from a dynamically formed corpus, the corpus comprising contextually selected documents and contextually selected extracts from these contextually selected documents, the answer having the highest probabilistic score of context which matches with the query input. The output node is used to format the output as is and highlight the precise answer from the paragraph.

The output node (OP) is configured to output a response in at least one of the documents in a highlighted manner with its (document’s) structure maintained. Hence, the output node is a highlighted output node.

E.g. if a response is found from a pdf, then the outputter highlights the answer from the specific document from the specific page.

If multiple queries are input, then the question is broken into multiple parts and multiple answers are presented to the user, possibly from multiple documents, possibly with multiple highlights.

Formatting of the extracted answer is done based on hierarchical data stored in the dataset. Since formatting of the answer is maintained in the dataset, answer is presented in the same format. If the extracted answer is table, then the table structure is maintained. The main answer from a hierarchical answer is highlighted. FIGURE 2 illustrates a schematic block diagram of the question-answer model (QA).

201 - A query is input.

201a, 201b - Words embeddings (WE) as well as Character level embedding

(CE), with respect to a query, is learnt on a training corpus. Word embeddings (WE) are the vectors of [1 x 300] where 300 represents the features for a single word. [No. of vocab words x 300] matrix is the representation matrix, where the first axis defines the index of the word and second axis defines its features. Character embedding (CE) are the vectors of [1 x 8] where 8 represents the features for a single character. [No. of character words x 8] matrix is the representation matrix, where the first axis defines the index of the character and second axis defines its features. Sentence embedding is formed by taking representation of a word from the word embedding matrix and representation of a character from character embedding matrix and combing both.

201c - BiLSTMs (Bidirectional Long-Short Term Memory) are fed with the word embedding (302) and character embedding (303) and an output layer which is generated is a concatenated matrix of word embeddings (302) and character embeddings (303). Training is done over multiple epochs and classification error is calculated which is back propagated to the model over a different time step to learn and tune the model to correctly classify the context of the query. Based on user feedback, model is fine-tuned with negative feedback and fine tuning is done to correct the wrong identified contexts at a regular intervals.

202 - Context is extracted from probable answers from a dynamically determined / formed corpus.

202a, 202b - Words embeddings (WE) as well as Character level embedding

(CE), with respect to a context, is learnt on a training corpus. Word embeddings (WE) are the vectors of [1 x 300] where 300 represents the features for a single word. [No. of vocab words x 300] matrix is the representation matrix, where the first axis defines the index of the word and second axis defines its features. Character embedding (CE) are the vectors of [1 x 8] where 8 represents the features for a single character. [No. of character words x 8] matrix is the representation matrix, where the first axis defines the index of the character and second axis defines its features. Sentence embedding is formed by taking representation of a word from the word embedding matrix and representation of a character from character embedding matrix and combing both.

202c - BiLSTMs (Bidirectional Long-Short Term Memory) are fed with the word embedding (302) and character embedding (303) and an output layer which is generated is a concatenated matrix of word embeddings (302) and character embeddings (303). Training is done over multiple epochs and classification error is calculated which is back propagated to the model over a different time step to learn and tune the model to correctly classify the context of the answer. Based on user feedback, model is fine-tuned with negative feedback and fine tuning is done to correct the wrong identified contexts at a regular intervals.

203 - Part of Speech Tagger (POS), of said query, are extracted and used. Word Embedding (WE) is substituted where the words are present and all the features in word embedding are 0.0 wherever it is absent.

204 - Question Type is extracted and used. If Question type matches with an expected answer types increases the confidence score of the answer. Word Embedding (WE) is substituted where the words are present and all the features in word embedding are 0.0 wherever it is absent.

205 - Named entity word embeddings (NER) are extracted and used. For a corpus of selected answers, and in relation to the query, Word Embedding (WE) is substituted where the words are present and all the features in word embedding are 0.0 wherever it is absent.

206 - Common sense path embedding (CS) are extracted and used. For a corpus of selected answers, and in relation to the query, Word Embedding (WE) is substituted for a complete path to determine a common sense path.

207 - Self- Attention (SA) vector - determines where attention is, for each answer selected from a corpus, be for that query.

208 - Attention (A) vector - determines where attention should, in an answer, be for that query, for that context, to provide most relevant answer(s) as an output (210). Output of BiLSTM (201c), for query, is multiplied by output of Self- Attention (207) to provide a clear sequence of words to help identify context in a query. Additionally, output of BiLSTM (201c), for answers, is multiplied by output of Self- Attention (207) to provide a clear sequence of words to help identify context in an answer to provide answer-context vector.

209 - BiLSTMs (Bidirectional Long-Short Term Memory) are fed with a plurality of answer-context vectors per query-context vector, each of said vectors to be matched with said attention vector in order to find a match between an answer and a query, said answer having a highest probabilistic score for that query, in terms of context vector and in terms of attention vector.

210 - Most relevant output (answer) is delivered.

FIGURE 3 illustrates a schematic block diagram of the question generation model (QGM).

301 - Context Token is extracted at the time of training.

302 - Answer Token is extracted at the time of training.

303 - Self-Attention (SA) - determines where attention is, for each context for each set of answers, for each query. 304 - Attention (A) - determines where attention should, in an answer, be for each query, for that context, to provide most relevant answer(s) as an output (210). Output of Answer Token (302) along with output of Self-Attention (303) provides the output of attention (304).

305 - Conditional Encoding - checks for context token with respect to answer token.

306 - Output of Attention (304) along with output of conditional encoding (305) is used to provide a word-embedding and character-embedding output for every input which is then given to a series of BiLSTMs (Bidirectional Long-Short Term Memory) (307a, 307b, 307c) to check for context matching of answer with respect to query.

307a, 307b, 307c - series of BiLSTMs (Bidirectional Long-Short Term Memory) to check for context matching of answer with respect to query.

308 - Beam Searching - checks with evidence scorer (ES) to output a predicted corpus of answers per query based on context, each answer having a probabilistic score per query.

201 - Outputs query words which is most pertinent to a query in terms of context from a dynamically formed corpus.

Context Tokens, typically, comprise tokens which are, basically "separated by space" on text. Context tokens are paragraph tokens (i.e. all the words separated by space of the paragraph in which answer is present).

For example:

Context :- The kanji that make up Japans name mean sun origin and it is often called the Land of the Rising Sun.

Context Tokens :- [ "The", "kanji", "that", "make", "up", "Japans", "name", "mean”, "sun”, "origin", "and", "it", "is", "often", "called", "the", "Land", "of", "the", "Rising", "Sun", "." ]

Answer Tokens, typically, are actual answer text tokens from one of the Context Tokens. Answer Tokens are present in one of the Context Tokens.

For example:

Answer :- Land of the Rising Sun

Answer Tokens :- ["Land", "of", "the", "Rising", "Sun"]

FIGURE 4 illustrates a schematic block diagram which combines the question- answer model (QA), at a question-answer node, of Figure 2 and the question generation model (QGM) of Figure 3 to provide a most-pertinent output (i.e. an answer), in response to a query, from a dynamically formed corpus.

From a Question Generation model (QGM), of Figure 3, a query (201) is generated as an output of beam search (308) which is associated with context token (301) and answer token (302). Context token (301) and answer token (302) are fed with at least an answer (301) which is an output (210), of the Question Answer model (QA), of Figure 2, for purposes of training. In the Question Answer model (QA), query words (201) are received from the Question Generation model’s (QGM) and context (202) of the query is determined and applied to obtain an output (210) which results in a pertinent answer from a dynamically formed corpus, the corpus comprising contextually selected documents and contextually selected extracts from these contextually selected documents, the answer having the highest probabilistic score of context which matches with the query input.

The TECHNICAL ADVANCEMENT of this invention lies in understanding the user query, understanding its context, and further providing a response to a user query within the natural document which contains the response in a highlighted manner, thereby maintaining structure and format and source.

While this detailed description has disclosed certain specific embodiments for illustrative purposes, various modifications will be apparent to those skilled in the art which do not constitute departures from the spirit and scope of the invention as defined in the following claims, and it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

Claims

CLAIMS,

1. A method for communicating with an auto-disambiguation bot engine for dynamic corpus selection per query, said method comprising the steps of:

- by means of a Question Generation model (QGM), at a query-generation node, generating a query (201), as an output of a beam search (308), said query being associated with a context token (301) and an answer token (302), said context token and said answer token being fed with at least an answer (301) which is an output (210) of a Question Answer model (QA), at a question-answer node, said answer being selected from a dynamic corpus of documents and from data extracted from said dynamic corpus of documents; and

- by means of a Question Answer model (QA), at a question-answer node, receiving query words (201) from said Question Generation model (QA) and determining a context (202) of said query (201) and applying said

determined context to obtain an output (210) answer from said dynamically formed corpus of documents, said corpus comprising contextually selected documents and contextually selected extracts from said contextually selected documents, said output answer having the highest probabilistic score of context which matches with said input query (201).

2. A method for communicating with an auto-disambiguation bot engine for dynamic corpus selection per query as claimed in claim 1 wherein, said Question Generation model (QGM) comprising a method of generating a query (201), said method comprising the steps of:

- parsing (201) input query, at an input node of a network of nodes; - capturing constituent keywords, at a keyword classification node, from said parsed input query and to classify said captured keywords based on pre- defined classification parameters;

- identifying, at a training corpus node, with respect to said parsed query, word embeddings (201a), character embeddings (201b), and sentence embeddings (201c) on a training corpus, each of said embeddings being an embeddings’ matrix of index vectors with respect to feature vectors;

- feeding at least a Bidirectional Long-Short Term Memory (201c) node with word embedding (201a) and character embedding (201b) and an output layer which is generated is a concatenated matrix of word embeddings (201a) and character embeddings (201b) in order to correctly classify the context of the query;

- extracting context (202) is extracted from probable answers from a

dynamically determined / formed corpus;

- feeding at least a Bidirectional Long-Short Term Memory node (202c) node with word embedding (202a) and character embedding (202b) to provide an output layer which is a concatenated matrix of word embeddings (202a) and character embeddings (202b) in order to correctly classify context of said query, said context being a matrix of contexts comprising weight-assigned vectors for said input query;

- extracting Part of Speech (POS) from said query;

- extracting Question Type, of said query, to check if Question type matches with expected answer types so that it increases a confidence score of an answer for that query;

- determining a Self- Attention (SA) vector, in order to define where attention is, for each answer selected from said corpus, for that query; - determining an Attention (A) vector, in an answer, for that query, for that context vector, to provide most relevant answer(s), having highest probabilistic score, as an output (210);

- multiplying output of Bidirectional Long-Short Term Memory (201c), for query, by output of said Self-Attention (207) vector to provide a clear sequence of words to help identify context in a query to provide a query- context vector;

- multiplying output of Bidirectional Long-Short Term Memory (201c), for answers, is multiplied by output of said Self-Attention (207) vector to provide a clear sequence of words to help identify context in an answer to provide an answer-context vector;

- feeding at least a Bidirectional Long-Short Term Memory node (209) with a plurality of answer-context vectors per query-context vector, each of said vectors to be matched with said attention vector in order to find a match between an answer and a query, said answer having a highest probabilistic score for that query, in terms of context vector and in terms of attention vector; and

- outputting said answer (210) having highest probabilistic score.

3. A method for communicating with an auto-disambiguation bot engine for dynamic corpus selection per query as claimed in claim 1 wherein, said

Question Answer model (QA) comprising a method of providing an answer), said method comprising the steps of:

- extracting a Context Token (301) at the time of training;

- extracting an Answer Token (302) at the time of training;

- determining a Self-Attention (303) vector, where attention is, for each context for each set of answers, for each query; - determining an Attention (304) vector, where attention should, in an answer, be for each query, for that context, to provide most relevant answer(s) as an output (210), wherein output of Answer Token (302) along with output of Self- Attention (303) provides the output of attention (304);

- checking Conditional Encoding (305) for context token with respect to answer token;

- providing (306) as an output of Attention (304) along with output of conditional encoding (305) a word-embedding and character-embedding output for every input which is then given to a series of BiLSTMs (Bidirectional Long-Short Term Memory) (307a, 307b, 307c) to check for context matching of answer with respect to query;

- feeding the series of Bidirectional Long-Short Term Memory (307a, 307b, 307c) to check for context matching of answer with respect to query; and

- by means of Beam Searching (308), checking with evidence scorer (ES) to output a predicted corpus of answers per query based on context, each answer having a probabilistic score per query; and

- providing as an output (201), query words which is most pertinent to a query in terms of context from a dynamically formed corpus.

4. An auto-disambiguation bot engine for dynamic corpus selection per

query, said bot engine comprising:

- a query inputter (QI) configured to allow a user to input a query;

- a data collector (DC) configured to collect electronic documents (D);

- extractor (E) configured to extract documents of relevance without losing information and to format these documents;

- a crawler (C) configured to crawl content of each of said extracted

documents; - a query parser (QC) configured to parse and correct said input query;

- a keyword classifier (KC) is configured to capture keywords from said

parsed input query to classify keywords based on pre-defmed classification parameters;

- an NLP extraction module (NLP), through an NLP node, configured to act on parsed query to identify features (question type feature, named entity relationship feature, common sense path feature, part-of-speech tags);

- embeddings configurator configured to represent captured keywords, from said input query, in numeral format in terms of its constituent extracted features (question type feature, named entity relationship feature, common sense path feature, part-of-speech tags);

- a dependence tree configurator (DTC) configured to define structural relationship between different words in parsed input query in order to output at least question types, at least named entity relationship words, at least common sense paths;

- a Question Generation Mode! (QGM) configured to generate queries in relation to said input query and said documents to form corpus which is a dynamically formed per query, thereby providing the most accurate results

- a Question Answer Model (QA) configured to providing as an output

(201 ), query words which is most pertinent to a query in terms of context from a dynamically formed corpus, said query words being fed to said Question Generation Model (QGM) for training; and

- an output node configured to output (OP) an answer from a dynamically formed corpus, the corpus comprising contextually selected documents and contextually selected extracts from these contextually selected documents, the answer having the highest probabilistic score of context which matches with the query input.

5. The auto-disambiguation bot engine as claimed in claim 4 wherein, said engine comprising a tagger (T) configured to classify and tag documents, based on pre-defined parameters, in a relational database.

6. The auto-disambiguation bot engine as claimed in claim 4 wherein, said extractor (E) comprising multiple extractors selected from a group of extractors consisting of a PDF extractor for extracting PDF documents, a word extractor for extracting data from word documents, an Excel extractor for extracting content of Excel sheets with multiple sheets which may have different column and row span, a HTML extractor for extracting main content of a web page and also links of links, a plain text extractor for extracting text.

7. The auto-disambiguation bot engine as claimed in claim 4 wherein, said crawler engaging with a hierarchy definition mechanism (HDM) configured to understand and define hierarchy, with respect to content, in each extracted document.

8. The auto-disambiguation bot engine as claimed in claim 4 wherein, said extractor (E) comprising a text extractor (TE) configured to extract text from said crawled documents.

9. The auto-disambiguation bot engine as claimed in claim 4 wherein, said query parser (QC) comprising a translator (T) to receive input query in any language and to translate said input query as configured by said engine.

10. The auto-disambiguation bot engine as claimed in claim 4 wherein, said query parser (QC) comprising an acronyms substitutor (AS) configured to identify acronyms from the parsed input query and to fetch substitutive full words correlative to the identified acronyms from a pre-fed acronyms vocabulary database.

11. The auto-disambiguation bot engine as claimed in claim 4 wherein, said query parser (QC) comprising an synonyms substitutor (AS) configured to identify synonyms from the parsed input query and to fetch substitutive full words correlative to the identified synonyms from a pre-fed synonyms vocabulary database.

12. The bot engine as claimed in claim 4 wherein, said embeddings configurator being configured to process word embeddings (WE) created on training data for pre-configured languages to represent keywords in a numerical form, characterised in that, said word embeddings being vectors of [m x n] where m represents number of words and n represents the number of features tuned to represent a single word, where a first axis defines an index of the word and a second axis defines its features.

13. The bot engine as claimed in claim 4 wherein, said embeddings configurator being configured to process character embeddings (CE) in terms of vectors of [a x b] where a represents number of characters and b represents the number of features tuned to represent a single character, where a first axis defines an index of the word and a second axis defines its features.

14. The bot engine as claimed in claim 4 wherein, said embeddings configurator being configured to process sentence embeddings formed by taking representation of a word from a word embedding matrix and representation of a character from a character embedding matrix and combing said word representation and said character representation to provide a concatenated matrix.

15. The bot engine as claimed in claim 4 wherein, said embeddings configurator being configured to process context cum query embedding formed by taking representation of a word from the word embedding matrix and representation of a character from character embedding matrix and combing both.

16. The bot engine as claimed in claim 4 wherein, said NLP extraction module (NLP) engages with a dynamic feature extractor (DFE) configured to extract features from the text based input query.

17. The bot engine as claimed in claim 4 wherein, said NLP extraction module (NLP) engages with a dynamic feature extractor (DFE) configured to extract features from the text based input query, characterised in that, said dynamic feature extractor using embedding, POS tags, dependence tree configurator features together to determine a class label of an extracted keyword from an input query and its confidence score.

18. The bot engine as claimed in claim 4 wherein, said NLP extraction module comprising a POS (part of speech) tagger (PST) configured to find out POS tags for words in parsed input query.

19. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises a focus extractor (FE) configured to pin point at least a focus word (FW) from the parsed input subject matter.

20. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises a focus extractor (FE) configured to pin point at least a focus word (FW) from the parsed input subject matter, characterised in that, features used to train are word embeddings, parts of speech (POS) tags, Named Entity Relationship (NER).

21. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises an attribute identifier (AI) configured to identify attribute words from the parsed input subject matter.

22. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises an attention mechanism used to generate a score with a softmax function (SF) giving a confidence score of an attribute being present on not in a current input.

23. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises an attribute identifier (AI) configured to identify attribute words from the parsed input subject matter, characterised in that, features used to train are word embeddings, parts of speech (POS) tags, Named Entity Relationship (NER).

24. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises a dynamic feature extractor (DFE) configured to extract NER (named entity relationship) words from the parsed input subject matter.

25. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises a dynamic feature extractor (DFE) configured to extract NER (named entity relationship) words from the parsed input subject matter, characterised in that, features used to train are word embeddings, character embeddings, part-of-speedi (POS) tags; to determine a class label of the keyword and a confidence score.

26. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises a common sense extractor (CSE) configured to extract common sense for all words from the parsed input subject matter other than stop words and most frequently used words from a corpus.

27. The bot engine as claimed in claim 4 wherein, said dependence tree configurator (DTC) comprises a question type extractor configured to determine type of question that is input.

28. The bot engine as claimed in claim 4 wherein, focus word embeddings, attribute word embeddings, named entity relationship word embeddings, common sense embedding are embeddings where representation is only taken where the feature words are present; rest of everything is masked and has a value 0.0.

29. The bot engine as claimed in claim 4 wherein, Question type embedding, parts of speech (POS) tags embedding, Named Entity Relationship embedding, Common Sense embedding are embeddings where representation is only taken where the feature words are present; rest of everything is masked and has a value 0.0.

30. The bot engine as claimed in claim 4 wherein, said engine comprising a dis- ambiguity resolver (DAR) in order to detect ambiguity and to further resolve it based on determined context.

31. The bot engine as claimed in claim 4 wherein, said engine comprising a relevancy mapper (RM) to map relevancy of said input query (Q) within a corpus of documents, characterised in that, said input query is ranked and / or mapped for relevancy based on classified set of documents to form a corpus which is a dynamically formed per query.

32. The bot engine as claimed in claim 4 wherein, said engine comprising a ranking mechanism [comprising a document ranker, a paragraph ranker, and a sentence ranker] configured to rank documents and paragraphs within each document.

33. The bot engine as claimed in claim 4 wherein, said engine comprising a sentence ranking mechanism configured rank sentences within a paragraph / document based on a similarity matrix.

34. The bot engine as claimed in claim 4 wherein, said engine comprising an evidence scorer (ES) configured to apply evidence-based confidence scoring on question type, answer type, semantic relationships; all based on localized features trained on model in order to rank outputs based on question type, answer type, and question similarity between generated and asked query.

35. The bot engine as claimed in claim 4 wherein, said engine comprising a training module (TM) configured to evaluate accuracy of an output with respect to a query and to receive feedback to tune parameters on accuracy over a validation data set to make the engine, of this invention, understand new features.

36. The bot engine as claimed in claim 4 wherein, said question-answer (QA) model is fed with word embedding and character embedding of an input query along with extracted common sense feature to calculate the accuracy of the engine such that embedding for each path of commonsense is obtained and self-attention is applied to it, which is further combined with the Bidirectional Long-Short Term Memory output.

37. The bot engine as claimed in claim 4 wherein, said engine comprising a document ranking model which is trained on transformers with self- attention with positional embedding.

38. The bot engine as claimed in claim 4 wherein, said question generation model (QGM) is trained on corpus data in order to generate queries and to apply self-attention and attention to corpus data in order to generate other query words.

39. The bot engine as claimed in claim 4 wherein, said question generation model (QGM) comprising a beam search such that an output generated from Beam Search during query generation would be probabilities of all the vocabulary words in order to obtain a contextually relevant query.