WO2011051970A2 - Method and system for obtaining semantically valid chunks for natural language applications - Google Patents

Method and system for obtaining semantically valid chunks for natural language applications

Info

Publication number
WO2011051970A2
Authority
WO
WIPO (PCT)
Prior art keywords
predicates
natural language
objects
query
predicate
Prior art date
Application number
PCT/IN2010/000693
Other languages
French (fr)
Other versions
WO2011051970A3 (en)
Inventor
Shailly Goyal
Shefali Bhat
Shailja Gulati
Chandrasekhar Anantaram
Original Assignee
Tata Consultancy Services Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Ltd.
Publication of WO2011051970A2
Publication of WO2011051970A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present invention relates to the field of natural language question answering systems.
  • the present invention relates to the application of ontology in natural language question answering systems.
  • Natural language (NL) enabled question answering systems for business applications aim at providing appropriate answers to the user queries.
  • query interpretation is a fundamental task.
  • the ambiguity can be either syntactic, (for example, prepositional phrase (PP) attachment), or it can be semantic.
  • NL enabled question answering systems mostly use general purpose NL parsers. Although these parsers give syntactically correct chunks for a sentence, these chunks might not be semantically meaningful in a domain. This can be illustrated with the following queries:
  • the chunks obtained from such general purpose NL parsers may not be helpful in extracting the answer to the user's query.
  • the problem becomes even more severe in case of complex queries involving multiple constraints and nested sub questions.
  • the problem is the requirement of a method to automatically enrich the output of a general purpose NL parser with the domain knowledge in order to obtain syntactically as well as semantically valid chunks for the queries in the domain.
  • US6829603 discloses the processing of natural language questions to obtain an equivalent structured query.
  • the method disclosed in US6829603 cannot interpret real-life natural language questions properly.
  • Some methods which can interpret real-life natural language questions properly depend so much on domain specific rules that porting to other domains becomes an issue.
  • Popescu et al. (Paper: Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability, 2004) adapts the Charniak parser (Charniak et.
  • One more object of the present invention is to provide a method and system for identifying those syntactic parses of the natural language question which are semantically valid in the domain.
  • a system for obtaining semantically valid chunks for natural language application queries comprises: - predicate identifying means for identifying predicates in a natural language query;
  • constraint predicate set forming means for forming constraint predicate sets from the remaining predicates of the query in order to find semantically valid chunk sets from said natural language query;
  • said system includes a pre-defined default operator pre-fixing means adapted to pre-fix said default operator to said bound predicate and objects not having any numerical value in said natural language application.
  • said system includes grouping means for grouping said identified predicates and said identified objects that immediately follow/precede the POS tags of assignment words.
  • said third binding means includes compatibility checking means for checking compatibility of bound string objects with predicates using domain ontology.
  • a method for obtaining semantically valid chunks for natural language application queries comprises the steps of:
  • said method includes the step of pre-fixing said default operator to said bound predicate and objects not having any numerical value in said natural language application.
  • said method includes the step of grouping said identified predicates and said identified objects that immediately follow/precede the POS tags of assignment words.
  • said step of binding string objects (not previously bound to any predicate) to their compatible predicates using domain ontology includes the step of checking compatibility of bound string objects with predicates using domain ontology.
  • FIG. 1 illustrates the overview of the method in accordance with the present invention
  • Figure 2 illustrates the flow diagram of the method of constraint identification in accordance with the present invention.
  • Figure 3 illustrates the flow diagram of the method of semantically valid chunk set formation in accordance with the present invention.
  • the system in accordance with the present invention is robust enough to analyze, understand and comprehend the question posed to it and to come up with the appropriate answer. This requires correct parsing, chunking, constraints formulation and sub query generation. Although most general purpose parsers parse the query correctly, due to lack of domain knowledge, domain relevant chunks are not obtained. Therefore, the method in accordance with the present invention focuses mainly on enriching general purpose parsers with domain knowledge using domain ontology in the form of RDF. Constraints formulation and sub query generation are handled which form the backbone of any robust NL system. Tackling all these issues makes any natural language enabled business application system more robust, and enables it to handle even complex queries easily, efficiently and effectively.
  • An NL based question answering system requires the queries to be analyzed and chunked in an appropriate manner to carry out the correct query generation and answer extraction.
  • An NL query typically has a set of unknown predicates whose values need to be determined based on the constraints imposed by the remaining part of the query other than the predicates.
  • Domain ontology along with a POS tagger is used to identify the constraints in the query. These constraints along with the domain knowledge and the parsed structure of the query are used to find the semantically valid chunk set. These chunks are then converted to a formal query language and the answer is retrieved from the ontology.
  • Figure 1 illustrates the overview of the method in accordance with the present invention.
  • solid arrows represent the process flow for a query
  • dashed arrows represent the information flow.
  • Constraint identification: This involves identifying the correct predicate - object pairs.
  • Semantically valid chunk set identification: This involves identification of the valid constraints for the unknown predicates so that correct interpretation of the given query can be ensured.
  • Semantic web technologies (Antoniou and van Harmelen, 2004) are used to create the domain ontology in RDF (Resource Description Framework) format using the relational data of the business application along with its meta information stored in the seed ontology (Bhat et al., 2007).
  • the ontology D0 of a domain D describes the domain terms and their relationships in the {subject - predicate - object} format. For illustration, {Ritesh - project name - Bechtel} indicates that the predicate 'project name' of the subject 'Ritesh' has object 'Bechtel'.
  • a synonym dictionary having information about the synonyms of the domain terms is also maintained.
  • the domain ontology and synonym dictionary are used to identify the concepts in the user query Q posed in the domain D.
  • the domain ontology D0 is used to further classify the concepts as predicates and objects.
  • Constraint identification involves binding each 'object' in the query with its corresponding 'predicate'. This predicate-object pair is referred to as a 'constraint'.
  • Figure 2 illustrates the flow diagram of the method of constraint identification in accordance with the present invention.
  • Predicates that do not form part of the constraint set are referred to as unknown predicates.
  • the set of unknown predicates is
  • constraint identification (or predicate - object binding) needs special attention due to the following reasons:
  • the objects 'Puneet' and 'Ritesh' are to be attached to the corresponding predicate 'employee name', which is not specified in the query. Hence, the system has to drill and extract the required predicate.
  • Constraint vs. unknown predicate: The issue of unspecified predicates becomes even more severe when a predicate p_i for an object o is present in the query, but the same predicate p_i also happens to be an unknown predicate.
  • the value 'project leader' in O_Q is compatible with the predicate 'role' in P_Q. But this predicate and its value are not to be bound as the predicate 'role' is an unknown predicate, whose value is to be determined.
  • Predicates followed by the respective objects: In questions with multiple constraints, sometimes a predicate and its object may not be given consecutively. Instead, the query may have a predicate list followed by the corresponding object list (or vice versa). There is a need to identify and bind the appropriate predicate - operator - object group from the predicate and the object lists.
  • Step 1 - Binding operator object pairs for numerical/date objects: The first step towards operator object pair binding is the identification of the comparison operators in the query.
  • Step 2: Grouping the predicates and objects that immediately follow/precede the POS tags of the assignment words, such as 'VBZ', 'VBP', 'IN', 'SYM' and the like.
  • the predicates that are immediately followed (or preceded) by any object are grouped. In case there is a list of predicates and a list of objects satisfying the above, then these lists are also grouped. These groups are the possible pairs for predicate object binding.
  • Step 3: From the groups obtained in Step 2, binding the predicates and objects of the same data type: The compatibility for the predicate and the object is also checked using domain ontology. While using a predicate list and its object list, one-on-one binding is done. These compatible predicate object pairs form the constraints of the query.
  • Step 4: The string objects that are not bound to any predicate in Step 3 are bound to their compatible predicates.
  • the compatible predicate for an object is determined using the domain ontology.
  • Step 5: The predicates bound to any object in Step 3 form the constraint predicate set, and the remaining predicates constitute the unknown predicate set.
  • the constraint sets thus obtained are used to find the semantically valid chunk set as discussed below.
  • FIG. 3 illustrates the flow diagram of the method of semantically valid chunk set formation in accordance with the present invention.
  • Semantically valid chunk set identifies the conditions on each unknown predicate in the query, and are constituted from the constraints and unknown predicates. Due to the syntactic ambiguity, more than one syntactic parse might be obtained for an NL query. Such cases may eventually result in more than one semantically viable chunk set.
  • Semantic chunk of a predicate is defined as:
  • the condition 'a' states that there is a semantic chunk for each unknown predicate in the query.
  • the condition 'b' states that each constraint in the query is used in at least one semantic chunk.
  • the semantically viable chunk set which is semantically valid as per the domain ontology is the semantically valid chunk set. These sets are referred to as SVaC sets.
  • the syntactic information of the question is used to obtain semantically viable chunk sets as described below.
  • the main task for identification of semantically viable chunk sets is to identify the conditions for all the unknown predicates.
  • the syntactic information of the query is exploited for this purpose.
  • a dependency based parser for example, Stanford Parser - Klein and Manning, Paper: Fast Exact Inference with a Factored Model for Natural Language Parsing, 2003; Link Parser - Grinberg et. al. Paper: A robust parsing algorithm for link grammars, 1995
  • the process of identifying the appropriate semantic chunks for different categories of queries is explained below.
  • an unknown predicate in the query plays the role of a noun
  • its syntactic modifiers identify the constraints on the predicate.
  • a noun can have either pre-nominal (e.g. adjective) or post-nominal (preposition phrase, relative clause etc.) modifiers.
  • Dependency based parsers provide dependencies between noun and its modifiers. This information along with the phrase structure of the query is used to determine the phrase modifying the unknown predicate. These phrases give the constraints for the unknown predicate.
  • the unknown predicate with its constraint is a candidate semantic chunk.
  • the preposition phrase 'with age > 30 years' is a post-nominal modifier of the noun 'associates'.
  • the constraint corresponding to this preposition phrase is 'age > 30', and hence the corresponding semantic chunk pairs the unknown predicate 'associates' with the constraint 'age > 30'.
  • in a domain, 'who' usually refers to a person, such as 'employee name', 'student name'; 'when' refers to date/time attributes like 'joining date', 'completion time'; and 'where' refers to locations like 'address', 'city'.
  • this information about the wh-words is identified, and stored in the seed ontology.
  • the predicate corresponding to the wh-word is found using the domain ontology, which might be a possible candidate for being an unknown predicate.
  • if the wh-word in the question is compatible with more than one predicate in the domain, then more semantic chunks, one for each compatible predicate, are obtained. Semantic information is used in such cases to resolve the ambiguity regarding the most appropriate predicate.
  • the constraints of the wh-word are determined on the basis of the role of the wh-word in the question as described below.
  • the words in the phrase enclosing the wh-word determine the constraints on the wh-word.
  • the set of these chunks is a semantically viable chunk set only if the chunk set satisfies the conditions (a) and (b) specified in the definition of SVC sets.
  • this chunk set is the semantically valid chunk set.
  • the semantically valid chunk set is found by using the domain specific semantic information as described below.
  • semantic information obtained from the domain ontology is used to determine the semantically valid chunk set.
  • Step 1 - Breadth first search: The system in accordance with the present invention does a BFS on the tables in the ontology to determine if p_i or p_j belongs in the same table as that of p'.
  • BFS Breadth first search
  • Step 2 Depth first search (DFS):
  • the DFS method is involved to resolve the ambiguity regarding the constraint c' if BFS is not able to do so.
  • the depths of the path from p' to p_i and p_j are found using domain ontology.
  • the constraint c' is attached to the predicate with which the distance of p' is minimum; and the corresponding SVC set is the semantically viable chunk set.
  • the semantic chunks of the SVaC set are processed by a query manager module.
  • a formal query is generated from the semantic chunks to extract the answer to the user's question. Since the domain ontology is in RDF format, the queries are typically generated in SPARQL which is a query language for RDF.
  • the query generation starts with formulating SPARQL queries for the semantic chunks which do not contain any sub chunk.
  • the unknown predicate of the semantic chunk forms the 'SELECT' clause, and the constraints form a part of the 'WHERE' clause.
  • the technical advancements of the present invention include realization of a method and system for obtaining semantically valid chunks for natural language applications which:

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A method and system for obtaining semantically valid chunks for natural language applications are disclosed in the present invention. The method includes the following steps: identifying predicates, objects and comparison operators in a natural language query; binding the identified predicates and objects using the identified comparison operator; replacing all occurrences of string comparator operators in said natural language query with corresponding mathematical operators; binding predicates and objects of same data type; checking compatibility of bound predicates and objects using domain ontology; binding string objects to their compatible predicates using domain ontology; forming constraint predicate sets from the remaining predicates of the query in order to find semantically valid chunk sets from said natural language query; syntactically parsing natural language query for absolving ambiguities; determining the depth between any two predicates using domain ontology, thereby providing a syntactically and semantically valid chunk set adapted to be used as a query.

Description

METHOD AND SYSTEM FOR OBTAINING SEMANTICALLY VALID CHUNKS FOR NATURAL LANGUAGE APPLICATIONS
FIELD OF THE INVENTION
The present invention relates to the field of natural language question answering systems.
Particularly, the present invention relates to the application of ontology in natural language question answering systems.
BACKGROUND OF THE INVENTION
Natural language (NL) enabled question answering systems for business applications aim at providing appropriate answers to the user queries. In such systems, query interpretation is a fundamental task. However, due to the innately ambiguous nature of the natural language, interpretation of a user's query is usually not straightforward. The ambiguity can be either syntactic, (for example, prepositional phrase (PP) attachment), or it can be semantic. In order to resolve such ambiguities, NL enabled question answering systems mostly use general purpose NL parsers. Although these parsers give syntactically correct chunks for a sentence, these chunks might not be semantically meaningful in a domain. This can be illustrated with the following queries:
• "Give the employees working in loss making projects". From this query, a human being can easily disambiguate that "loss making" is a modifier of "projects", and "working in loss making projects" is a modifier of "the employees". That is, the correct chunks can be represented as: "[Give [[the employees] [working [in [loss making [projects]]]]]]". However, the chunks obtained from a general purpose NL parser are "[Give [[[the employees] [working [in [loss]]]] [making [projects]]]]", which will be interpreted as "Give the employees who are working in loss and who make projects".
• "Give the projects having costing and billing >$25000 and <$35000, respectively". The general purpose NL chunker may interpret and chunk this query as "[Give [[the projects] [having [[costing] and [billing]] [>$25000] and [<$35000]]], respectively". From these chunks it is not possible to identify that "costing >$25,000" and "billing <$35,000" are the two constraints which modify "the projects".
Thus the chunks obtained from such general purpose NL parsers may not be helpful in extracting the answer to the user's query. The problem becomes even more severe in case of complex queries involving multiple constraints and nested sub questions. Thus, the problem is the requirement of a method to automatically enrich the output of a general purpose NL parser with the domain knowledge in order to obtain syntactically as well as semantically valid chunks for the queries in the domain.
Several attempts have been made to process natural language questions as disclosed in the documents given below.
United States Patent No. US6829603 (Androutsopoulos et. al.) discloses the processing of natural language questions to obtain an equivalent structured query. However, the method disclosed in US6829603 cannot interpret real-life natural language questions properly. Some methods which can interpret real-life natural language questions properly depend so much on domain specific rules that porting to other domains becomes an issue. Popescu et al. (Paper: Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability, 2004) adapts the Charniak parser (Charniak et. al., Paper: A maximum-entropy-inspired parser) for domain-specific question answering by extending the training corpus of the parser with a set of 150 hand-tagged domain specific questions. Further, semantic rules inferred from domain knowledge are used to check and correct preposition attachment and preposition ellipsis errors. Katz et. al. (Paper: Syntactic and semantic decomposition strategies for question answering from multiple resources, (START), 2005) decomposes complex questions syntactically or semantically to obtain sub questions that can be answered from available resources. If these answers are not sufficient to solve the question, semantic information (in the form of rules that map 'key' domain questions to the answers) is used. The main drawback of these approaches is that the creation of domain specific rules is very resource intensive, and hence restricts portability.
Lopez et. al. (Paper: AquaLog: An ontology-driven question answering system for organizational semantic intranets, 2006) tries to transform the NL question to ontology specific triples using syntactic annotations, semantic terms and relations, and question words to interpret the natural language question. If these cannot resolve the ambiguity in the question, domain ontology and/or WordNet are used to make sense of the input query. There have also been some attempts for adapting general purpose natural language POS (Parts of Speech) taggers or parsers for a given domain. Coden et. al. (Paper: Domain-specific language models and lexicons for tagging, 2006) adds a small domain specific POS tagged corpus to a large general English training set to build a POS tagger for the specific domain. Miller et. al. (Paper: Rapid Adaptation of POS Tagging for Domain Specific Uses, 2006) trains a generic domain POS tagger for biomedical texts by extending it with a lexicon that is updated to include domain-specific information based on the morphological rules specific to the domain. Pyysalo et. al. (Paper: Lexical adaptation of link grammar to the biomedical sub language: a comparative evaluation of three approaches, 2006) adapts a general purpose English parser to suit domain specific sentences by adding domain specific terminology to the lexicon of a parser, and by providing the parser with domain specific morphological rules to predict the morpho-syntactic class of unknown words.
None of the abovementioned work and documents disclose methods to enrich a general purpose NL parser with domain knowledge to obtain semantically valid chunks for an input query. Therefore, it is felt that there is a need for a method and system for obtaining semantically valid chunks for natural language applications which:
• can interpret queries easily, efficiently and effectively;
• can parse the query correctly with regard to both syntax and semantics;
• has required domain knowledge; and
• can identify the predicate - object pairs of a query correctly.
OBJECTS OF THE INVENTION
It is an object of the present invention to provide a method and system for obtaining semantically valid chunks for natural language applications which can interpret queries easily, efficiently and effectively.
It is another object of the present invention to provide a method and system for obtaining semantically valid chunks for natural language applications which can parse the query correctly with regard to both syntax and semantics.
It is yet another object of the present invention to provide a method and system for enriching a general purpose natural language parser with the domain knowledge (in the form of domain ontology) so that the semantically valid chunks for natural language query can be obtained.
It is still another object of the present invention to provide a method and system for obtaining the correct predicate - object pairs of a natural language query so that the constraints in the query can be identified.
One more object of the present invention is to provide a method and system for identifying those syntactic parses of the natural language question which are semantically valid in the domain.
SUMMARY OF THE INVENTION
According to this invention, there is provided a system for obtaining semantically valid chunks for natural language application queries, said system comprises:
- predicate identifying means for identifying predicates in a natural language query;
- object identifying means for identifying objects in said natural language query;
- identification means for identifying comparison operators in said natural language query;
- first binding means adapted to bind said identified predicate with said identified objects using said identified comparison operator;
- mathematical operator dictionary means adapted to replace all occurrences of string comparator operators in said natural language query with corresponding mathematical operators;
- second binding means for binding predicates and objects of same data type;
- compatibility checking means for checking compatibility of bound predicates and objects using domain ontology;
- third binding means for binding string objects (not previously bound to any predicate) to their compatible predicates using domain ontology;
- constraint predicate set forming means for forming constraint predicate sets from the remaining predicates of the query in order to find semantically valid chunk sets from said natural language query;
- syntactic parsing means adapted to syntactically parse natural language query for absolving ambiguities; and
- depth determination means, using domain ontology, for determining depth between any two predicates, thereby providing a syntactically and semantically valid chunk set adapted to be used as a query.
Typically, said system includes a pre-defined default operator pre-fixing means adapted to pre-fix said default operator to said bound predicate and objects not having any numerical value in said natural language application.
Typically, said system includes grouping means for grouping said identified predicates and said identified objects that immediately follow/precede the POS tags of assignment words.
Typically, said third binding means includes compatibility checking means for checking compatibility of bound string objects with predicates using domain ontology.
According to this invention, there is provided a method for obtaining semantically valid chunks for natural language application queries, said method comprises the steps of:
- identifying predicates in a natural language query;
- identifying objects in said natural language query;
- identifying comparison operators in said natural language query;
- binding said identified predicate with said identified objects using said identified comparison operator;
- replacing all occurrences of string comparator operators in said natural language query with corresponding mathematical operators;
- grouping predicates and objects;
- binding predicates and objects of same data type;
- checking compatibility of bound predicates and objects using domain ontology;
- binding string objects (not previously bound to any predicate) to their compatible predicates using domain ontology;
- forming constraint predicate sets from the remaining predicates of the query in order to find semantically valid chunk sets from said natural language query;
- syntactically parsing the natural language query for absolving ambiguities; and
- determining depth between any two predicates, using domain ontology, thereby providing a syntactically and semantically valid chunk set adapted to be used as a query.
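The operator-related steps above (identifying comparison operators and replacing string comparators with mathematical operators) can be illustrated with a minimal Python sketch. The contents of `OPERATOR_DICTIONARY` and both function names are assumptions made for illustration; the source does not list the actual dictionary entries or give an implementation.

```python
import re

# Hypothetical operator dictionary: the phrases below are illustrative
# assumptions, not the dictionary described in the source.
OPERATOR_DICTIONARY = {
    "greater than or equal to": ">=",
    "less than or equal to": "<=",
    "greater than": ">",
    "more than": ">",
    "less than": "<",
    "equal to": "=",
}


def replace_string_comparators(query: str) -> str:
    """Replace string comparator phrases with mathematical operators."""
    # Longest phrases first, so "greater than or equal to" is handled
    # before its prefix "greater than".
    for phrase in sorted(OPERATOR_DICTIONARY, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        query = re.sub(pattern, OPERATOR_DICTIONARY[phrase], query,
                       flags=re.IGNORECASE)
    return query


def bind_operator_object_pairs(query: str) -> list[tuple[str, str]]:
    """Return (operator, numeric object) pairs found in the query."""
    normalized = replace_string_comparators(query)
    # An operator, optional whitespace, an optional currency sign, a number.
    return re.findall(r"(>=|<=|>|<|=)\s*\$?(\d+(?:\.\d+)?)", normalized)


pairs = bind_operator_object_pairs(
    "Give the projects having costing and billing >$25000 and <$35000, respectively"
)
```

For the second example query from the background, this yields the pairs `('>', '25000')` and `('<', '35000')`, which the later binding steps would attach to 'costing' and 'billing' one-on-one. Date objects are not handled in this sketch.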
Typically, said method includes the step of pre-fixing said default operator to said bound predicate and objects not having any numerical value in said natural language application.
Typically, said method includes the step of grouping said identified predicates and said identified objects that immediately follow/precede the POS tags of assignment words.
Typically, said step of binding string objects (not previously bound to any predicate) to their compatible predicates using domain ontology includes the step of checking compatibility of bound string objects with predicates using domain ontology.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The method and system for obtaining semantically valid chunks (SVC) for natural language applications will now be described with reference to the accompanying drawings, in which:
Figure 1 illustrates the overview of the method in accordance with the present invention;
Figure 2 illustrates the flow diagram of the method of constraint identification in accordance with the present invention; and
Figure 3 illustrates the flow diagram of the method of semantically valid chunk set formation in accordance with the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The drawings and the description thereof are merely illustrative of a method and system for obtaining semantically valid chunks for natural language applications and only exemplify the system of the invention and in no way limit the scope thereof.
The system in accordance with the present invention is robust enough to analyze, understand and comprehend the question posed to it and to come up with the appropriate answer. This requires correct parsing, chunking, constraints formulation and sub query generation. Although most general purpose parsers parse the query correctly, due to lack of domain knowledge, domain relevant chunks are not obtained. Therefore, the method in accordance with the present invention focuses mainly on enriching general purpose parsers with domain knowledge using domain ontology in the form of RDF. Constraints formulation and sub query generation are handled which form the backbone of any robust NL system. Tackling all these issues makes any natural language enabled business application system more robust, and enables it to handle even complex queries easily, efficiently and effectively.
NL based question answering system requires the queries to be analyzed and chunked in an appropriate manner to carry out the correct query generation and answer extraction. An NL query typically has a set of unknown predicates whose values need to be determined based on the constraints imposed by the remaining part of the query other than the predicates. Domain ontology along with a POS tagger is used to identify the constraints in the query. These constraints along with the domain knowledge and the parsed structure of the query are used to find the semantically valid chunk set. These chunks are then converted to a formal query language and the answer is retrieved from the ontology. Figure 1 illustrates the overview of the method in accordance with the present invention. In Figure 1 , solid arrows represent the process flow for a query, and dashed arrows represent the information flow.
For the appropriate interpretation and analysis of the query, the important areas those are to be analyzed and addressed can be summarized as:
• Constraint identification: This involves identifying the correct predicate - object pairs.
• Semantically valid chunk set identification: This involves identification of the valid constraints for the unknown predicates so that correct interpretation of the given query can be ensured.
• Query generation: In this step, semantically valid chunks are converted to appropriate formal language query using the domain ontology.
Semantic web technologies (Antoniou and van Harmelen, 2004) are used to create the domain ontology in RDF (Resource Description Framework) format using the relational data of the business application along with its meta information stored in the seed ontology (Bhat et al., 2007). The ontology $D_O$ of a domain $D$ describes the domain terms and their relationships in the {subject - predicate - object} format. For illustration, {Ritesh - project name - Bechtel} indicates that the predicate 'project name' of the subject 'Ritesh' has object 'Bechtel'. A synonym dictionary having information about the synonyms of the domain terms is also maintained. The domain ontology and synonym dictionary are used to identify the concepts in the user query $Q$ posed in the domain $D$. The domain ontology $D_O$ is used to further classify the concepts as predicates and objects. For a query $Q$, we denote the set of predicates as $P_Q = \{p_1, p_2, \ldots, p_n \mid \exists (s - p_i - o) \in D_O \text{ and } p_i \text{ is present in the query } Q\}$. The set of objects present in the query $Q$ is $O_Q = \{o_1, o_2, \ldots, o_m \mid (\exists (s - p - o_i) \in D_O \text{ or } o_i \text{ is a numerical/date value}) \text{ and } o_i \text{ is present in the query } Q\}$.
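The identification of the predicate and object sets described above can be illustrated with a minimal Python sketch. The triples, the identifier-style subjects (`emp1`, `emp2`), the synonym entries and the function name `identify_concepts` are all illustrative assumptions; the specification does not prescribe a data layout or a matching strategy, and simple substring matching stands in for whatever concept lookup an actual system would use.

```python
import re

# Toy domain ontology as (subject, predicate, object) triples (illustrative only).
TRIPLES = [
    ("emp1", "employee name", "Ritesh"),
    ("emp1", "project name", "Bechtel"),
    ("emp1", "role", "project leader"),
    ("emp2", "employee name", "Puneet"),
    ("emp2", "project name", "Bechtel"),
    ("emp2", "role", "developer"),
]

# Hypothetical synonym dictionary: surface word -> domain term.
SYNONYMS = {"project": "project name", "designation": "role"}


def identify_concepts(query, triples, synonyms):
    """Return (predicates, objects) of the ontology that occur in the query."""
    text = query.lower()
    words = set(re.findall(r"[a-z]+", text))
    known_predicates = {p for _, p, _ in triples}
    known_objects = {o for _, _, o in triples}

    # Predicates mentioned directly or through a synonym.
    predicates = {p for p in known_predicates if p in text}
    predicates |= {term for word, term in synonyms.items() if word in words}

    # Objects are ontology values present in the query, plus numerical values.
    objects = {o for o in known_objects if o.lower() in text}
    objects |= set(re.findall(r"\d+(?:\.\d+)?", text))
    return predicates, objects


if __name__ == "__main__":
    q = "Give me the role of Puneet in the project having Ritesh as project leader"
    print(identify_concepts(q, TRIPLES, SYNONYMS))
```

Date values are left out of the sketch for brevity; they would be recognised in the same way as numerical values.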
For successful query creation and execution, the identification and formulation of correct constraints is of utmost importance. Constraint identification involves binding each 'object' in the query with its corresponding 'predicate'. This predicate-object pair is referred to as a 'constraint'. Figure 2 illustrates the flow diagram of the method of constraint identification in accordance with the present invention.
A constraint is defined as $c = (p_i, o_j)$, where $p_i \in P_Q$, $o_j \in O_Q$, and $o_j$ is the value for the predicate $p_i$ in $Q$. All the constraints in the query $Q$ are identified, and $C_Q = \{c_1, c_2, \ldots, c_n\}$ denotes the constraint set. A predicate used in any constraint is referred to as a constraint predicate. The set of constraint predicates is $P_Q^C = \{p_i \mid p_i \in P_Q \text{ such that } \exists (p_i, o_j) \in C_Q\}$.
Predicates that do not form part of the constraint set are referred to as unknown predicates. The set of unknown predicates is $P_Q^U = \{p_i \mid p_i \in P_Q \text{ and } \nexists (p_i, o_j) \in C_Q\}$.
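As a small worked illustration of these definitions, the following Python sketch partitions the predicates of a query into constraint predicates and unknown predicates. The function name `split_predicates` and the representation of a constraint as a `(predicate, value)` tuple are assumptions made for the example only.

```python
def split_predicates(predicates, constraints):
    """Partition the query predicates into constraint and unknown predicates.

    predicates  -- set of predicate names found in the query
    constraints -- iterable of (predicate, value) pairs bound for the query
    """
    constraint_predicates = {p for p, _ in constraints if p in predicates}
    unknown_predicates = set(predicates) - constraint_predicates
    return constraint_predicates, unknown_predicates


if __name__ == "__main__":
    # 'What is the role of the associates with age > 30 years?'
    predicates = {"role", "employee name", "age"}
    constraints = [("age", "> 30")]
    print(split_predicates(predicates, constraints))
```

In this example 'age' is the only constraint predicate, so 'role' and 'employee name' remain unknown predicates whose values have to be determined.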
In a natural language query, constraint identification (or predicate - object binding) needs special attention due to the following reasons:
Unspecified predicates: For some (or all) of the objects present in the query, the corresponding predicates might not be explicitly specified. Yet, these predicates are to be identified and bound to their respective objects. For example, in the query 'Give me the role of Puneet in the project having Ritesh as project leader', the predicate set is given by $P_Q$ = {role, project name}, and the object set is given by $O_Q$ = {Puneet, Ritesh, project leader}. Here the objects 'Puneet' and 'Ritesh' are to be attached to the corresponding predicate 'employee name', which is not specified in the query. Hence, the system has to drill into the domain ontology and extract the required predicate.
Constraint vs. unknown predicate: The issue of unspecified predicates becomes even more severe when a predicate p for an object o is present in the query, but the same predicate p also happens to be an unknown predicate. For example, in the query mentioned above, the value 'project leader' in $O_Q$ is compatible with the predicate 'role' in $P_Q$. But this predicate and its value are not to be bound, as the predicate 'role' is an unknown predicate whose value is to be determined.
Numerical object and mathematical operator binding: Many times the query posed might entail numerical value comparison. Hence, such questions involve the usage of comparative operators. These operators can be specified in many ways: as symbols like '<' and '>', in words like 'less than' and 'below', or as assignment words like 'is' and 'as'. Sometimes there might not be any word or operator specified between the predicate and its object. Thus these operators are to be identified and bound with the correct object.
Predicates followed by the respective objects: In questions with multiple constraints, sometimes a predicate and its object may not be given consecutively. Instead, the query may have a predicate list followed by the corresponding object list (or vice versa). There is a need to identify and bind the appropriate predicate - operator - object group from the predicate and the object lists.
The main steps of the process for predicate - object discovery and binding in accordance with the present invention are as follows:
Step 1 - Binding operator object pairs for numerical/date objects: The first step towards operator object pair binding is the identification of the comparison operators in the query. For operator identification, the system in accordance with the present invention maintains a mathematical operator dictionary. All the occurrences of the string comparators in the question are replaced by the corresponding mathematical comparator. Also, if there is any numerical value in the question that is not preceded by any operator, the '=' operator is prefixed by default. Thus, the corresponding operator object pairs are formed.
Step 2 - Grouping the predicates and objects that immediately follow/precede the POS tags of the assignment words, such as 'VBZ', 'VBP', 'IN', 'SYM' and the like. The predicates that are immediately followed (or preceded) by any object are grouped. In case there is a list of predicates and a list of objects satisfying the above, then these lists are also grouped. These groups are the possible pairs for predicate object binding.
Step 3 - From the groups obtained in Step 2, binding the predicates and objects of the same data type: The compatibility for the predicate and the object is also checked using domain ontology. While using a predicate list and its object list, one-on-one binding is done. These compatible predicate object pairs form the constraints of the query.
Step 4 - The string objects that are not bound to any predicate in Step 3 are bound to their compatible predicates. The compatible predicate for an object is determined using the domain ontology.
Step 5 - The predicates bound to any object in Step 3 form the constraint predicate set, and the remaining predicates constitute the unknown predicate set. The constraint sets thus obtained are used to find the semantically valid chunk set as discussed below.
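Step 1 of the process above can be sketched in Python as follows. The entries of `OPERATOR_DICTIONARY` and the function name `normalize_operators` are assumptions chosen for illustration; the specification only states that a mathematical operator dictionary is maintained, not what it contains. Only numerical objects are handled here, and date objects are omitted.

```python
import re

# Hypothetical mathematical operator dictionary: string comparator -> symbol.
OPERATOR_DICTIONARY = {
    "greater than or equal to": ">=",
    "less than or equal to": "<=",
    "greater than": ">",
    "more than": ">",
    "above": ">",
    "less than": "<",
    "below": "<",
}


def normalize_operators(query):
    """Replace string comparators and return the operator-object pairs."""
    # Longest phrases first, so 'less than or equal to' wins over 'less than'.
    for phrase in sorted(OPERATOR_DICTIONARY, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        query = re.sub(pattern, OPERATOR_DICTIONARY[phrase], query, flags=re.IGNORECASE)

    # Pair each numerical value with the operator in front of it;
    # '=' is used by default when no operator precedes the value.
    pairs = []
    for match in re.finditer(r"(<=|>=|<|>|=)?\s*(\d+(?:\.\d+)?)", query):
        pairs.append((match.group(1) or "=", match.group(2)))
    return query, pairs


if __name__ == "__main__":
    print(normalize_operators("employees with age more than 30 and experience 5"))
```

The operator-object pairs produced this way are the inputs to the grouping and binding of Steps 2 to 4, where each pair is attached to a predicate of a compatible data type.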
Figure 3 illustrates the flow diagram of the method of semantically valid chunk set formation in accordance with the present invention. The semantically valid chunk set identifies the conditions on each unknown predicate in the query, and is constituted from the constraints and the unknown predicates. Due to syntactic ambiguity, more than one syntactic parse might be obtained for an NL query. Such cases may eventually result in more than one semantically viable chunk set. A semantically viable chunk set (SVC set) of a query $Q$ corresponding to the $k$th parse is a set $SVC_{Q_k} = \{SC^{p}_{Q_k} \mid p \in P_Q^U\}$, where $SC^{p}_{Q_k}$ is a semantic chunk. The semantic chunk of a predicate $p \in P_Q^U$ is defined as:
$SC^{p}_{Q_k} = (p, c_1, c_2, \ldots, c_r)$
such that $SVC_{Q_k}$ satisfies the following:
a. $\forall p \in P_Q^U,\ \exists\, SC^{p}_{Q_k} \in SVC_{Q_k}$.
b. $\forall c \in C_Q,\ \exists\, SC^{p}_{Q_k} \in SVC_{Q_k}$ such that $SC^{p}_{Q_k} = (p, c_1, c_2, \ldots, c, \ldots, c_r)$.
The condition 'a' states that there is a semantic chunk for each unknown predicate in the query. The condition 'b' states that each constraint in the query is used in at least one semantic chunk. For a query $Q$, the semantically viable chunk set which is semantically valid as per the domain ontology is the semantically valid chunk set, $SVaC_Q$. These sets are referred to as SVaC sets. The syntactic information of the question is used to obtain semantically viable chunk sets as described below.
For a query, the main task for identification of semantically viable chunk sets is to identify the conditions for all the unknown predicates. The syntactic information of the query is exploited for this purpose. A dependency based parser (for example, Stanford Parser - Klein and Manning, Paper: Fast Exact Inference with a Factored Model for Natural Language Parsing, 2003; Link Parser - Grinberg et al., Paper: A robust parsing algorithm for link grammars, 1995) is used to obtain the syntactic structure of the question. The process of identifying the appropriate semantic chunks for different categories of queries is explained below.
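The two conditions can be checked mechanically. The Python sketch below is one possible way to do so; the `SemanticChunk` class, the tuple form of a constraint, and the choice to count constraints that sit inside a nested sub chunk as used are assumptions of this example rather than details given in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticChunk:
    predicate: str
    # Constituents are constraints (tuples) or nested SemanticChunk objects.
    constituents: list = field(default_factory=list)


def constraints_in(chunk):
    """Collect every constraint used by a chunk, including nested sub chunks."""
    found = set()
    for item in chunk.constituents:
        if isinstance(item, SemanticChunk):
            found |= constraints_in(item)
        else:
            found.add(item)
    return found


def is_viable(chunk_set, unknown_predicates, constraints):
    """Check conditions (a) and (b) for a candidate chunk set."""
    covered = {chunk.predicate for chunk in chunk_set}
    used = set()
    for chunk in chunk_set:
        used |= constraints_in(chunk)
    condition_a = set(unknown_predicates) <= covered   # a chunk per unknown predicate
    condition_b = set(constraints) <= used             # every constraint is used
    return condition_a and condition_b


if __name__ == "__main__":
    age = ("age", "> 30")
    employee = SemanticChunk("employee name", [age])
    role = SemanticChunk("role", [employee])
    print(is_viable([role, employee], {"role", "employee name"}, {age}))
```

A candidate set that fails either check is discarded before any domain specific disambiguation is attempted.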
If an unknown predicate in the query plays the role of a noun, its syntactic modifiers identify the constraints on the predicate. A noun can have either pre-nominal (e.g. adjective) or post-nominal (preposition phrase, relative clause etc.) modifiers. Dependency based parsers provide dependencies between a noun and its modifiers. This information along with the phrase structure of the query is used to determine the phrase modifying the unknown predicate. These phrases give the constraints for the unknown predicate. The unknown predicate with its constraint is a candidate semantic chunk. For example, for the question 'What is the role of the associates with age > 30 years?', the preposition phrase 'with age > 30 years' is a post-nominal modifier of the noun 'associates'. The constraint corresponding to this preposition phrase is 'age > 30', and hence the corresponding semantic chunk can be obtained as $SC^{\text{employee name}}_{Q} = (\text{employee name},\ \text{age} > 30)$.
Further, the preposition phrase 'of the associates with age > 30 years' is modifying the noun 'role'. Since the semantic chunk, $SC^{\text{employee name}}_{Q}$, for the phrase 'of the associates with age > 30 years' has already been identified, the semantic chunk for the predicate 'role' is
$SC^{\text{role}}_{Q} = (\text{role},\ SC^{\text{employee name}}_{Q})$
In a domain, 'who' usually refers to a person, such as 'employee name', 'student name'; 'when' refers to date/time attributes like 'joining date', 'completion time'; and 'where' refers to locations like 'address', 'city'. For the given business application, this information about the wh-words is identified, and stored in the seed ontology. In questions involving any of these wh-words, the predicate corresponding to the wh-word is found using the domain ontology, and it is a possible candidate for being an unknown predicate. If the wh-word in the question is compatible with more than one predicate in the domain, then more semantic chunks - corresponding to each compatible predicate - are obtained. Semantic information is used in such cases to resolve the ambiguity regarding the most appropriate predicate. The constraints of the wh-word are determined on the basis of the role of the wh-word in the question as described below.
• If the wh-word is the subject in the question, the corresponding verb phrase determines the constraint on the wh-word.
• In other cases, the words in the phrase enclosing the wh-word determine the constraints on the wh-word. The constraints are determined in the same way when the wh-word is the determiner of an unknown predicate. For example, in the question 'In which project is Ritesh allocated?', the constraint for the unknown predicate 'project name' can be identified as 'employee name = Ritesh'. Thus the semantic chunk is {project name; employee name = Ritesh}.
Using the syntactic information, all possible semantic chunks for a parse structure of the question are determined. The set of these chunks is a semantically viable chunk set only if the chunk set satisfies the conditions (a) and (b) specified in the definition of SVC sets.
If for a query Q, only one semantically viable chunk set is found then this chunk set is the semantically valid chunk set. In other cases, the semantically valid chunk set is found by using the domain specific semantic information as described below.
If more than one semantically viable chunk set is obtained for a question, semantic information obtained from the domain ontology is used to determine the semantically valid chunk set. Let $SVC_{Q_1} = \{SC^{p}_{Q_1} \mid p \in P_Q^U\}$ and $SVC_{Q_2} = \{SC^{p}_{Q_2} \mid p \in P_Q^U\}$ be any two SVC sets for a query $Q$. Since there is more than one SVC set for $Q$, $\exists\, p_i, p_j \in P_Q^U$ and $c' = (p', o') \in C_Q$ such that $c'$ is a constituent of $SC^{p_i}_{Q_1} \in SVC_{Q_1}$ and $SC^{p_j}_{Q_2} \in SVC_{Q_2}$. But, in the valid interpretation of $Q$, $c'$ can specify either the unknown predicate $p_i$ or the unknown predicate $p_j$. Hence, it can be concluded that, in this case, the syntactic information is not sufficient to resolve the ambiguity whether $c'$ is a constraint of $p_i$ or $p_j$. To resolve such ambiguities, the depth between the concerned predicates is used. The number of tables required to be traversed in order to find the relationship between any two predicates is determined through the domain ontology. This is referred to as the depth between two predicates. If for a pair of predicates there exists more than one path, then the one with the minimum depth is chosen. It is observed that the semantic chunk in which the unknown predicate and the constraint predicate pair has the lesser depth is the one which is more likely to be the correct pair. Domain ontology is used to find the depth between two predicates as described below.
Step 1 - Breadth first search (BFS): The system in accordance with the present invention does a BFS on the tables in the ontology to determine if $p_i$ or $p_j$ belongs to the same table as that of $p'$. Without loss of generality, assume that $p_i$ and $p'$ belong to the same table, and $p_j$ does not belong to the table of $p'$. In this case, $SC^{p_i}_{Q_1}$, and consequently $SVC_{Q_1}$, is assumed to be correct, and $SVC_{Q_2}$ is rejected. Thus in this case, $SVaC_Q = SVC_{Q_1}$.
Step 2 - Depth first search (DFS): The DFS method is used to resolve the ambiguity regarding the constraint $c'$ if BFS is not able to do so. The depths of the paths from $p'$ to $p_i$ and $p_j$ are found using the domain ontology. The constraint $c'$ is attached to the predicate with which the distance of $p'$ is minimum, and the corresponding SVC set is the semantically valid chunk set.
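The depth based disambiguation can be sketched in Python as follows. The schema (`TABLE_OF`, `LINKS`), the function names, and the definition of depth as the number of table-to-table hops (0 when both predicates sit in the same table) are assumptions for illustration. The sketch also substitutes a single breadth-first traversal for the two-stage BFS/DFS procedure described above, because a breadth-first traversal returns the minimum depth directly.

```python
from collections import deque

# Hypothetical schema: the table each predicate belongs to, and table links.
TABLE_OF = {
    "employee name": "employee",
    "age": "employee",
    "role": "allocation",
    "project name": "project",
}
LINKS = {
    "employee": {"allocation"},
    "allocation": {"employee", "project"},
    "project": {"allocation"},
}


def depth_between(p1, p2, table_of, links):
    """Minimum number of table hops between two predicates (None if unrelated)."""
    start, goal = table_of[p1], table_of[p2]
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        table, depth = queue.popleft()
        if table == goal:
            return depth
        for neighbour in links.get(table, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


def attach_constraint(constraint_predicate, candidates, table_of, links):
    """Return the candidate unknown predicates closest to the constraint predicate."""
    depths = {p: depth_between(constraint_predicate, p, table_of, links) for p in candidates}
    reachable = {p: d for p, d in depths.items() if d is not None}
    if not reachable:
        return []
    best = min(reachable.values())
    # More than one entry means the ontology cannot resolve the ambiguity.
    return sorted(p for p, d in reachable.items() if d == best)


if __name__ == "__main__":
    print(attach_constraint("age", ["employee name", "project name"], TABLE_OF, LINKS))
```

When the returned list holds more than one predicate, the depths are tied, and every interpretation would have to be answered and left to the user to choose from.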
An advantage of this approach is that depending upon question complexity, the system in accordance with the present invention does a deeper analysis. Domain ontology is used only if a question cannot be resolved using just the syntactic information. If domain information also is not sufficient for question interpretation, then answers for all possible interpretations are found, and the user is left with the option of identifying the correct answer.
The semantic chunks of the SVaC set are processed by a query manager module. In this module, a formal query is generated from the semantic chunks to extract the answer to the user's question. Since the domain ontology is in RDF format, the queries are typically generated in SPARQL which is a query language for RDF.
For a semantically valid chunk set, the query generation starts with formulating SPARQL queries for the semantic chunks which do not contain any sub chunk. The unknown predicate of the semantic chunk forms the 'SELECT' clause, and the constraints form a part of the 'WHERE' clause.
The answers obtained from these independent semantic chunks are substituted in the semantic chunks involving nested sub chunks. Finally, the SPARQL query is generated for these chunks and the answer is returned to the user.
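A minimal Python sketch of query generation for a semantic chunk without sub chunks is given below. The namespace `http://example.org/onto#`, the function name `chunk_to_sparql`, and the assumption that the unknown predicate and all constraint predicates describe the same subject `?s` are hypothetical; a real ontology would define its own IRIs and may require joins across subjects. String values are quoted without escaping, which is adequate only for simple literals.

```python
def chunk_to_sparql(unknown_predicate, constraints, prefix="http://example.org/onto#"):
    """Build a SPARQL SELECT query for one semantic chunk without sub chunks.

    constraints -- list of (predicate, operator, value) tuples
    """
    def var(name):
        return "?" + name.replace(" ", "_")

    def iri(name):
        return "<" + prefix + name.replace(" ", "_") + ">"

    # The unknown predicate forms the SELECT clause.
    lines = [
        "SELECT " + var(unknown_predicate),
        "WHERE {",
        "  ?s " + iri(unknown_predicate) + " " + var(unknown_predicate) + " .",
    ]
    # Each constraint contributes a triple pattern and a FILTER to the WHERE clause.
    filters = []
    for predicate, operator, value in constraints:
        lines.append("  ?s " + iri(predicate) + " " + var(predicate) + " .")
        literal = str(value) if isinstance(value, (int, float)) else '"' + str(value) + '"'
        filters.append("  FILTER (" + var(predicate) + " " + operator + " " + literal + ")")
    lines.extend(filters)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Semantic chunk {project name; employee name = Ritesh}
    print(chunk_to_sparql("project name", [("employee name", "=", "Ritesh")]))
```

For chunks with nested sub chunks, the inner query would be executed first and its answers passed in as ordinary constraint values of the outer chunk.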
TECHNICAL ADVANCEMENTS
The technical advancements of the present invention include realization of a method and system for obtaining semantically valid chunks for natural language applications which:
• can interpret queries easily, efficiently and effectively;
• can parse the query correctly with regard to both syntax and semantics;
• has required domain knowledge; and
• can identify the predicate - object pairs of a query correctly.
While considerable emphasis has been placed herein on the particular features of this invention, it will be appreciated that various modifications can be made, and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other modifications in the nature of the invention or the preferred embodiments will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

Claims

1. A system for obtaining semantically valid chunks for natural language application queries, said system comprising:
- predicate identifying means for identifying predicates in a natural language query;
- object identifying means for identifying objects in said natural language query;
- identification means for identifying comparison operators in said natural language query;
- first binding means adapted to bind said identified predicate with said identified objects using said identified comparison operator;
- mathematical operator dictionary means adapted to replace all occurrences of string comparator operators in said natural language query with corresponding mathematical operators;
- second binding means for binding predicates and objects of same data type;
- compatibility checking means for checking compatibility of bound predicates and objects using domain ontology;
- third binding means for binding string objects (not previously bound to any predicate) to their compatible predicates using domain ontology;
- constraint predicate set forming means for forming constraint predicate sets from the remaining predicates of the query in order to find semantically valid chunk sets from said natural language query;
- syntactic parsing means adapted to syntactically parse natural language query for absolving ambiguities; and
- depth determination means, using domain ontology, for determining depth between any two predicates, thereby providing a syntactically and semantically valid chunk set adapted to be used as a query.
2. A system as claimed in claim 1 wherein, said system includes a predefined default operator pre-fixing means adapted to pre-fix said default operator to said bound predicate and objects not having any numerical value in said natural language application.
3. A system as claimed in claim 1 wherein, said system includes grouping means for grouping said identified predicates and said identified objects that immediately follow/precede the POS tags of assignment words.
4. A system as claimed in claim 1 wherein, said third binding means includes compatibility checking means for checking compatibility of bound string objects with predicates using domain ontology.
5. A method for obtaining semantically valid chunks for natural language application queries, said system comprising the steps of:
- identifying predicates in a natural language query;
- identifying objects in said natural language query;
- identifying comparison operators in said natural language query;
- binding said identified predicate with said identified objects using said identified comparison operator;
- replacing all occurrences of string comparator operators in said natural language query with corresponding mathematical operators;
- grouping predicates and objects;
- binding predicates and objects of same data type;
- checking compatibility of bound predicates and objects using domain ontology;
- binding string objects (not previously bound to any predicate) to their compatible predicates using domain ontology;
- forming constraint predicate sets from the remaining predicates of the query in order to find semantically valid chunk sets from said natural language query;
- syntactically parse natural language query for absolving ambiguities; and
- determining depth between any two predicates, using domain ontology, thereby providing a syntactically and semantically valid chunk set adapted to be used as a query.
6. A method as claimed in claim 5 wherein, said method includes the step of pre-fixing means said default operator to said bound predicate and objects not having any numerical value in said natural language application.
7. A method as claimed in claim 5 wherein, said method includes the step of grouping said identified predicates and said identified objects that immediately follow/precede the POS tags of assignment words.
8. A method as claimed in claim 5 wherein, said step of binding string objects (not previously bound to any predicate) to their compatible predicates using domain ontology includes the step of checking compatibility of bound string objects with predicates using domain ontology.
PCT/IN2010/000693 2009-10-28 2010-10-27 Method and system for obtaining semantically valid chunks for natural language applications WO2011051970A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2501/MUM/2009 2009-10-28
IN2501MU2009 2009-10-28

Publications (2)

Publication Number Publication Date
WO2011051970A2 true WO2011051970A2 (en) 2011-05-05
WO2011051970A3 WO2011051970A3 (en) 2011-07-07

Family

ID=43922729

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2010/000693 WO2011051970A2 (en) 2009-10-28 2010-10-27 Method and system for obtaining semantically valid chunks for natural language applications

Country Status (1)

Country Link
WO (1) WO2011051970A2 (en)

Also Published As

Publication number Publication date
WO2011051970A3 (en) 2011-07-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10826238

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10826238

Country of ref document: EP

Kind code of ref document: A2