WO2016050066A1 - Method and device for parsing a question in a knowledge base (知识库中问句解析的方法及设备) - Google Patents

Method and device for parsing a question in a knowledge base (知识库中问句解析的方法及设备)

Info

Publication number
WO2016050066A1
Authority
WO
WIPO (PCT)
Prior art keywords
resource item
question
value
candidate phrase
phrase
Prior art date
Application number
PCT/CN2015/078362
Other languages
English (en)
French (fr)
Inventor
Jun ZHAO (赵军)
Kang LIU (刘康)
Shizhu HE (何世柱)
Yibo ZHANG (张轶博)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. and Institute of Automation, Chinese Academy of Sciences
Priority to EP15845782.0A (published as EP3179384A4)
Publication of WO2016050066A1
Priority to US15/472,279 (published as US10706084B2)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/243Natural language query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • G06F16/24522Translation of natural language queries to structured queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • Embodiments of the present invention relate to the field of communications, and, more particularly, to a method and apparatus for parsing a question in a knowledge base.
  • A knowledge base (Knowledge Base) is, in knowledge engineering, a structured, easy-to-use, comprehensive, and organized knowledge cluster: a collection of interrelated pieces of knowledge that are stored, organized, managed, and used in computer memory in one or several knowledge representation forms, to meet the needs of problem solving in one or more domains.
  • The construction of knowledge bases has evolved from manual construction by experts, through construction based on group intelligence, to automatic construction over the entire Internet using machine learning and information extraction technology.
  • Early knowledge bases were built manually by experts, for example WordNet, CYC, CCD, HowNet, and the Encyclopedia of China.
  • Traditional manually constructed knowledge bases gradually exposed the shortcomings of small scale, limited knowledge, and slow updates.
  • Moreover, the deterministic knowledge frameworks constructed by experts cannot meet the needs of large-scale computing in the noisy environment of the Internet. This is one of the reasons why the CYC project ultimately failed.
  • Subsequently, a large number of web knowledge bases based on group intelligence emerged, including Wikipedia, Hudong Baike (Interactive Encyclopedia), Baidu Baike, and so on. Based on these web resources, automatic and semi-automatic knowledge base construction methods have been used to build large, usable knowledge bases such as YAGO, DBpedia, and Freebase.
  • Knowledge-base-based question answering (Question Answering) systems can be built on top of such knowledge bases.
  • A knowledge-base-based question answering system may have lower coverage due to the limited size of the knowledge base, but it has a certain reasoning ability and achieves higher accuracy in limited domains.
  • Therefore, a number of knowledge-base-based question answering systems have emerged; some have become independent applications, and some serve as enhancements to existing products, such as Apple's Siri and Google's Knowledge Graph.
  • Question answering means that users do not need to break a problem down into keywords, but ask questions directly in natural language. After the question answering system processes the user's question, it searches the knowledge base or the Internet for the corresponding answer and returns the answer itself to the user, rather than a list of relevant web pages. The question answering system therefore greatly reduces the difficulty of use and is more convenient and efficient than traditional keyword search, semantic search, and other search engine technologies.
  • The Question Answering over Linked Data (QALD) evaluation campaign has promoted the development of question answering systems.
  • Its goal is to convert natural language questions into structured Simple Protocol and RDF Query Language (SPARQL, where RDF is the Resource Description Framework) queries over large-scale structured linked data.
  • Converting a natural language question into structured SPARQL depends on conversion rules for the knowledge base. In current question answering systems, however, the conversion rules are configured manually, which not only costs considerable manpower but also has poor domain scalability.
  • The embodiments of the present invention provide a method for parsing a question based on a knowledge base that does not require manually configured conversion rules and is domain-independent.
  • In a first aspect, a method for parsing a question in a knowledge base is provided, including: receiving a question input by a user; performing phrase detection on the question to determine a first candidate phrase; mapping the first candidate phrase to a first resource item in the knowledge base, the first resource item having semantics consistent with the first candidate phrase; and determining, according to the first candidate phrase and the first resource item, the value of an observed predicate and a possible question parse space.
  • The observed predicate is used to represent a feature of the first candidate phrase, a feature of the first resource item, and the relationship between the first candidate phrase and the first resource item; a point in the possible question parse space is a proposition set, and the truth of the propositions in the proposition set is represented by the values of implicit predicates.
  • A formalized query statement is generated according to the combination of true propositions.
  • Optionally, the uncertainty reasoning is based on a Markov logic network (MLN), where the MLN includes predefined first-order formulas and the weights of the first-order formulas.
  • the method further includes:
  • The first-order formulas include Boolean formulas and weighting formulas, where the weight of a Boolean formula is +∞ and the weight of a weighting formula is a weighting-formula weight; the manually labeled values of the implicit predicates corresponding to the plurality of natural language questions satisfy the Boolean formulas.
  • Constructing an undirected graph according to the values of the observed predicates corresponding to the plurality of natural language questions, the values of the implicit predicates corresponding to the plurality of natural language questions, and the first-order formulas, and determining the weights of the first-order formulas through training, includes:
  • The MLN is represented as M, a first-order formula is represented as $\phi_i$, and the weight of the first-order formula is represented as $w_i$. The confidence of a proposition set y is computed as
$$p(y) = \frac{1}{Z}\exp\Bigg(\sum_{(\phi_i, w_i)\in M}\ \sum_{c\in C^{\phi_i}} w_i\, f_c^{\phi_i}(y)\Bigg)$$
where $C^{\phi_i}$ is the set of sub-formulas (groundings) corresponding to the first-order formula $\phi_i$, c is a sub-formula in the set, $f_c^{\phi_i}$ is a binary function representing whether the sub-formula is true or false under the proposition set y, and Z is a normalization constant.
  • Obtaining the combination of true propositions in a proposition set whose confidence meets a preset condition includes:
  • The features of the first candidate phrase include the position of the first candidate phrase in the question, the part of speech of the head word of the first candidate phrase, and the labels on the dependency path between first candidate phrases.
  • The features of the first resource item include the type of the first resource item, the correlation value between first resource items, and the parameter matching relationship between first resource items.
  • The relationship between the first candidate phrase and the first resource item includes the prior matching score of the first candidate phrase and the first resource item.
  • Determining the value of the observed predicate according to the first candidate phrase and the first resource item includes: calculating the prior matching score between the first candidate phrase and the first resource item, the prior matching score being used to indicate the probability that the first candidate phrase is mapped to the first resource item.
  • Optionally, the formalized query statement is a Simple Protocol and RDF Query Language (SPARQL) statement.
  • Generating a formalized query statement according to the combination of true propositions includes: generating the SPARQL using a SPARQL template according to the combination of true propositions.
  • The SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template.
  • When the question is a Yes/No question, the SPARQL is generated using the ASK WHERE template according to the combination of true propositions; when the question is a Normal question, the SPARQL is generated using the SELECT ?url WHERE template; when the question is a Number question, the SPARQL is generated using the SELECT ?url WHERE template, or, when the SPARQL generated by the SELECT ?url WHERE template cannot obtain a numeric answer, the SPARQL is generated using the SELECT COUNT(?url) WHERE template.
  • Performing phrase detection on the question to determine the first candidate phrase includes: using a word sequence in the question as the first candidate phrase, where the word sequence satisfies:
  • the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj denotes an adjective, nn a noun, rb an adverb, and vb a verb; and
  • the words included in the word sequence are not all stop words.
  • In a second aspect, a device for parsing a question is provided, including:
  • a receiving unit, configured to receive a question input by a user;
  • a phrase detecting unit, configured to perform phrase detection on the question received by the receiving unit to determine a first candidate phrase;
  • a mapping unit, configured to map the first candidate phrase determined by the phrase detecting unit to a first resource item in a knowledge base, where the first resource item has semantics consistent with the first candidate phrase;
  • a first determining unit, configured to determine, according to the first candidate phrase and the first resource item, the value of an observed predicate and a possible question parse space, where the observed predicate is used to represent a feature of the first candidate phrase, a feature of the first resource item, and the relationship between the first candidate phrase and the first resource item, a point in the possible question parse space is a proposition set, and the truth of the propositions in the proposition set is represented by the values of implicit predicates;
  • a second determining unit, configured to perform, for each proposition set in the possible question parse space, uncertainty reasoning according to the value of the observed predicate determined by the first determining unit and the value of the implicit predicate, and calculate the confidence of each proposition set;
  • an acquiring unit, configured to acquire the combination of true propositions in a proposition set whose confidence, as determined by the second determining unit, meets a preset condition, where the true propositions are used to represent a search phrase selected from the first candidate phrases, a search resource item selected from the first resource items, and the features of the search resource item;
  • a generating unit, configured to generate a formalized query statement according to the combination of true propositions.
  • Optionally, the uncertainty reasoning is based on a Markov logic network (MLN), where the MLN includes predefined first-order formulas and the weights of the first-order formulas.
  • The acquiring unit is further configured to acquire a plurality of natural language questions from the knowledge base;
  • the phrase detecting unit is further configured to perform phrase detection on the plurality of natural language questions acquired by the acquiring unit to determine a second candidate phrase;
  • the mapping unit is further configured to map the second candidate phrase to a second resource item in the knowledge base, where the second resource item has semantics consistent with the second candidate phrase;
  • the first determining unit is further configured to determine, according to the second candidate phrase and the second resource item, the values of the observed predicates corresponding to the plurality of natural language questions;
  • the acquiring unit is further configured to acquire the manually labeled values of the implicit predicates corresponding to the plurality of natural language questions;
  • the second determining unit is further configured to construct an undirected graph according to the values of the observed predicates corresponding to the plurality of natural language questions, the values of the implicit predicates corresponding to the plurality of natural language questions, and the first-order formulas, and determine the weights of the first-order formulas through training.
  • The first-order formulas include Boolean formulas and weighting formulas, where the weight of a Boolean formula is +∞ and the weight of a weighting formula is a weighting-formula weight; the manually labeled values of the implicit predicates corresponding to the plurality of natural language questions satisfy the Boolean formulas.
  • The second determining unit is specifically configured to construct an undirected graph according to the values of the observed predicates corresponding to the plurality of natural language questions, the values of the implicit predicates corresponding to the plurality of natural language questions, and the first-order formulas, and determine the weighting-formula weights through training.
  • The second determining unit is specifically configured to construct an undirected graph according to the values of the observed predicates and implicit predicates corresponding to the questions and the first-order formulas, and determine the weights of the first-order formulas using the Margin Infused Relaxed Algorithm (MIRA).
  • The MLN is represented as M, a first-order formula is represented as $\phi_i$, and the weight of the first-order formula is represented as $w_i$. The second determining unit is specifically configured to calculate the confidence of each proposition set y as
$$p(y) = \frac{1}{Z}\exp\Bigg(\sum_{(\phi_i, w_i)\in M}\ \sum_{c\in C^{\phi_i}} w_i\, f_c^{\phi_i}(y)\Bigg)$$
where $C^{\phi_i}$ is the set of sub-formulas corresponding to the first-order formula $\phi_i$, c is a sub-formula in the set, $f_c^{\phi_i}$ is a binary function representing whether the sub-formula is true or false under the proposition set y, and Z is a normalization constant.
  • The acquiring unit is specifically configured to acquire the combination of true propositions, where:
  • the features of the first candidate phrase include the position of the first candidate phrase in the question, the part of speech of the head word of the first candidate phrase, and the labels on the dependency path between first candidate phrases;
  • the features of the first resource item include the type of the first resource item, the correlation value between first resource items, and the parameter matching relationship between first resource items;
  • the relationship between the first candidate phrase and the first resource item includes the prior matching score of the first candidate phrase and the first resource item.
  • The first determining unit is specifically configured to calculate the prior matching score between the first candidate phrase and the first resource item, the prior matching score being used to indicate the probability that the first candidate phrase is mapped to the first resource item.
  • Optionally, the formalized query statement is a Simple Protocol and RDF Query Language (SPARQL) statement.
  • The generating unit is specifically configured to generate the SPARQL using a SPARQL template according to the combination of true propositions.
  • The SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template.
  • The generating unit is specifically configured to: when the question is a Yes/No question, generate the SPARQL using the ASK WHERE template according to the combination of true propositions; when the question is a Normal question, generate the SPARQL using the SELECT ?url WHERE template; and when the question is a Number question, generate the SPARQL using the SELECT ?url WHERE template, or, when the SPARQL generated by the SELECT ?url WHERE template cannot obtain a numeric answer, generate the SPARQL using the SELECT COUNT(?url) WHERE template.
  • The phrase detecting unit is specifically configured to use a word sequence in the question as the first candidate phrase, where the word sequence satisfies:
  • the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj denotes an adjective, nn a noun, rb an adverb, and vb a verb; and
  • the words included in the word sequence are not all stop words.
  • Embodiments of the present invention are based on a predefined uncertainty inference network that can convert a natural language question input by a user into structured SPARQL.
  • The predefined uncertainty inference network can be applied to a knowledge base of any domain and has domain extensibility, so there is no need to manually configure conversion rules for the knowledge base.
  • FIG. 1 is a flow chart of a method for parsing a question in a knowledge base according to an embodiment of the present invention.
  • FIG. 2 is an example of a dependency analysis tree in accordance with an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a method for parsing a question in a knowledge base according to another embodiment of the present invention.
  • FIG. 4 is another example of a resource item query graph in accordance with an embodiment of the present invention.
  • Figure 5 is a flow diagram of a method of determining weighting formula weights in accordance with one embodiment of the present invention.
  • Figure 6 is a block diagram of an apparatus for parsing a question in accordance with one embodiment of the present invention.
  • Figure 7 is a block diagram of an apparatus for parsing a question in accordance with another embodiment of the present invention.
  • A formalized query statement is, for example, a Structured Query Language (SQL) statement or a SPARQL statement.
  • SPARQL represents knowledge in the subject-property-object (SPO) triple format.
  • Converting a natural language question into a formal query depends on the conversion rules for the knowledge base.
  • the conversion rules corresponding to different knowledge bases are also different.
  • In addition, the large amount of ambiguity in natural language questions makes manually configured conversion rules lack robustness.
  • Natural Language Processing (NLP) is a tool used in computer science, artificial intelligence, and linguistics to describe the relationship between machine language and natural language; NLP involves human-computer interaction. NLP tasks include: automatic summarization, coreference resolution, discourse analysis, machine translation, morphological segmentation, named entity recognition (NER), natural language generation, natural language understanding, optical character recognition (OCR), part-of-speech tagging, syntactic analysis (parsing), question answering, relationship extraction, sentence breaking, sentiment analysis, speech recognition, speech segmentation, topic segmentation and recognition, word segmentation, word sense disambiguation, information retrieval (IR), information extraction (IE), speech processing, and so on.
  • Different NLP tools are designed for the different NLP tasks listed above. The Stanford NLP toolkit is employed in embodiments of the invention; for example, its part-of-speech tagging tool can be used to determine the part of speech of each word in a question.
  • Uncertainty reasoning refers to the various reasoning problems other than precise reasoning, including reasoning with incomplete and inexact knowledge, fuzzy knowledge reasoning, and non-monotonic reasoning.
  • Uncertainty reasoning methods are divided into numerical and non-numerical methods; the numerical methods include probability-based methods, which are developed on the basis of probability theory, such as the credibility method, the subjective Bayesian method, and evidence theory.
  • The Markov logic network is a commonly used uncertainty reasoning network. A Markov Logic Network (MLN) is a statistical relational learning framework that combines First-Order Logic (FOL) and Markov networks.
  • The difference between a Markov logic network and traditional first-order logic is that traditional first-order logic requires that there be no conflict among the rules: if a proposition cannot satisfy all rules at the same time, it is false. In a Markov logic network, each rule has a weight, and a proposition is true with a certain probability.
  • First-order logic can also be called predicate logic or first-order predicate logic; it consists of first-order predicate rules.
  • First-order predicate rules are composed of four types of symbols: constants, variables, functions, and predicates.
  • Constants refer to simple objects in the domain; a variable can refer to several objects in the domain; a function represents a mapping from a group of objects to an object; a predicate refers to a relationship among objects in the domain, or a property of an object.
  • Variables and constants can be typed; a typed variable can take values only from the object set of its defined type.
  • A term is any expression representing an object. An atom is a predicate applied to a group of terms.
  • A ground term is a term containing no variables. A ground atom or ground predicate is an atom or predicate whose arguments are all ground terms.
  • Rules are built recursively from atoms using connectives (such as implication and equivalence) and quantifiers (such as the universal and existential quantifiers), and are usually expressed in clause form.
  • A possible world is an assignment of truth values to all possible ground atoms.
  • First-order logic can be viewed as establishing a series of hard rules on the set of possible worlds: if a world violates any one of the rules, the probability of that world's existence is zero.
  • The basic idea of MLN is to relax those hard rules: when a world violates one of the rules, the possibility of that world's existence is reduced, but not to zero. The fewer rules a world violates, the more likely that world is. To this end, each rule is given a specific weight that reflects the binding force of the rule on the possible worlds: the larger the weight of a rule, the greater the difference in probability between two worlds that satisfy and violate that rule, other things being equal.
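  • In the standard MLN formulation (a textbook statement consistent with, but not quoted from, this text), the probability of a possible world x is determined by the weighted count of satisfied rule groundings:
$$P(X = x) = \frac{1}{Z}\exp\Big(\sum_i w_i\, n_i(x)\Big)$$
where $w_i$ is the weight of rule $i$, $n_i(x)$ is the number of true groundings of rule $i$ in the world $x$, and $Z$ is a normalization constant (the partition function).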
  • Markov logic networks can well combine language features and knowledge base constraints.
  • the logic formula in the probability framework is capable of modeling soft rule constraints.
  • a set of weighted formulas in Markov Logic is called a Markov logic network.
  • Each element of the network can be understood as a first-order formula together with a penalty: a possible world in which a grounding of the formula is violated incurs the corresponding penalty.
  • A first-order formula is composed of first-order predicates, logical connectives, and variables.
  • FIG. 1 is a flow chart of a method for parsing a question in a knowledge base according to an embodiment of the present invention.
  • the method shown in Figure 1 includes:
  • the embodiment of the invention uses the observation predicate and the implicit predicate to perform uncertainty reasoning, and can convert the natural language question into a formal query statement. Moreover, in the embodiment of the present invention, the method of uncertainty reasoning can be applied to a knowledge base of any field, and has domain scalability, so that it is not necessary to manually configure a conversion rule for the knowledge base.
  • The question input by the user in 101 is a natural language question, for example, "Give me all actors who were born in Berlin."
  • In 102, word sequences in the question can be identified by phrase detection, and a word sequence in the question can be used as the first candidate phrase.
  • A word sequence is also called an n-gram, and refers to a sequence of n consecutive words.
  • A plurality of first candidate phrases can be determined in 102.
  • Specifically, a word sequence that satisfies the following conditions may be used as the first candidate phrase:
  • the part of speech of the head word of the word sequence is jj or nn or rb or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
  • the words included in the word sequence are not all stop words.
  • The head word may also be referred to as the important word or dominant word, and the part-of-speech tags can be obtained from the part-of-speech tag set.
  • The length of a word sequence refers to the number of words included in the word sequence; for example, the word sequence "born in" has a length of two.
  • Stanford's part-of-speech tagging tool can be used to determine the part of speech of each word.
  • English stop words include "a", "an", "the", "that", and so on; Chinese stop words include the equivalents of "one", "some", "not only", and so on.
  • the determined first candidate phrases include: actors, who, born in, in, Berlin.
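  • As an illustration, the candidate-phrase detection above can be sketched as follows; this is a minimal sketch assuming POS-tagged tokens are available, and the maximum n-gram length, tag prefixes, and stop-word list are illustrative assumptions rather than the patent's exact implementation.

```python
# Minimal sketch of candidate phrase detection over POS-tagged tokens.
# The tag prefixes (jj/nn/rb/vb), max_len, and stop words are illustrative.
STOP_WORDS = {"a", "an", "the", "that", "give", "me", "all", "were", "who"}
CONTENT_TAGS = ("jj", "nn", "rb", "vb")

def candidate_phrases(tagged, max_len=4):
    """tagged: list of (word, pos) pairs; returns candidate word sequences."""
    candidates = []
    for i in range(len(tagged)):
        for j in range(i + 1, min(i + 1 + max_len, len(tagged) + 1)):
            words = [w for w, _ in tagged[i:j]]
            tags = [t.lower() for _, t in tagged[i:j]]
            # condition 1: some word (e.g. the head word) has a content POS tag
            if not any(t.startswith(CONTENT_TAGS) for t in tags):
                continue
            # condition 2: the words are not all stop words
            if all(w.lower() in STOP_WORDS for w in words):
                continue
            candidates.append(" ".join(words))
    return candidates

tagged = [("Give", "VB"), ("me", "PRP"), ("all", "DT"), ("actors", "NNS"),
          ("who", "WP"), ("were", "VBD"), ("born", "VBN"), ("in", "IN"),
          ("Berlin", "NNP")]
print(candidate_phrases(tagged))  # includes "actors", "born in", "Berlin"
```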
  • 103 can be understood as mapping each first candidate phrase to a first resource item in the knowledge base.
  • 103 may also be referred to as phrase mapping.
  • a first candidate phrase may be mapped to a plurality of first resource items.
  • the type of the first resource item may be an entity (Entity) or a class (Class) or a relationship (Relation).
  • When the first candidate phrase is to be mapped to a class, the word2vec tool is used to convert all the words in the first candidate phrase into vector form.
  • The vector form of a class in the knowledge base is the vector form of its label (corresponding to the rdfs:label relationship); the cosine similarity between the vector of the first candidate phrase and the vector of each class is then calculated; finally, the N classes with the largest cosine similarity values are used as the first resource items consistent with the semantics of the first candidate phrase.
  • The word2vec tool is a tool for converting words into vectors; it is open-source code developed and provided by Google. For details, see http://code.google.com/p/word2vec/.
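  • A minimal sketch of this class-mapping step, assuming phrase and label vectors (for example, averaged word2vec word vectors) are already available; the averaging and the top-N cutoff are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def map_phrase_to_classes(phrase_vec, class_label_vecs, n=5):
    """class_label_vecs: {class_id: vector of its rdfs:label}.

    Returns the N classes whose label vectors are most similar to the
    candidate phrase vector, as (class_id, similarity) pairs.
    """
    scored = [(cid, cosine(phrase_vec, vec))
              for cid, vec in class_label_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]
```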
  • When the first candidate phrase is to be mapped to a relation, the relation patterns defined by PATTY and ReVerb are used as resources.
  • Specifically, the alignment between the relations in DBpedia and the relation patterns defined by PATTY and ReVerb is calculated over instances, that is, over the instance pairs in DBpedia that satisfy both the relation and the relation pattern.
  • If the first candidate phrase matches a relation pattern, the relations aligned with that relation pattern are taken as the first resource items consistent with the semantics of the first candidate phrase.
  • the first candidate phrase can be mapped to the first resource item, in particular, each first candidate phrase is mapped to the at least one first resource item. And, the first candidate phrase having the mapping relationship and the first resource item have consistent semantics.
  • For example, the first candidate phrases actors, who, born in, in, and Berlin are mapped to first resource items as shown in Table 2, where the first column is the first candidate phrase, the second column is the first resource item, and the third column is the identifier of the first resource item.
  • the first candidate phrase "in” is mapped to five first resource items.
  • 104 can be understood as a process of feature extraction.
  • The implicit predicates can include the following forms:
  • hasPhrase(p), indicating that the candidate phrase p is selected;
  • hasResource(p, r), indicating that the resource item r is selected and the candidate phrase p is mapped to the resource item r.
  • Here, p is the phrase identifier of a candidate phrase, and r is the identifier of a resource item.
  • The implicit predicates further include a form indicating that a parameter matching relationship rr between two resource items is selected, where rr may be one of: 1_1, 1_2, 2_1, and 2_2. A parameter matching relationship m1_m2 between a resource item p and a resource item r indicates that the m1-th parameter of the resource item p is aligned with the m2-th parameter of the resource item r, where m1 is 1 or 2 and m2 is 1 or 2.
  • For example, dbo:height 1_1 dbr:Michael Jordan indicates that the parameter matching relationship between the resource item dbo:height and the resource item dbr:Michael Jordan is 1_1, that is, the first parameter of the resource item dbo:height is aligned with the first parameter of the resource item dbr:Michael Jordan.
  • A value of 1 for an implicit predicate indicates that the corresponding candidate phrase, resource item, or parameter matching relationship between resource items is selected; a value of 0 indicates that it is not selected.
  • In other words, a value of 1 for the implicit predicate indicates that the corresponding proposition is true, and a value of 0 indicates that the corresponding proposition is false.
  • For example, hasPhrase(11) = 1 indicates that the proposition "candidate phrase 11 is selected" is true, and hasPhrase(11) = 0 indicates that the proposition "candidate phrase 11 is selected" is false.
  • A possible question parse space can be constructed based on the implicit predicates. A point in the possible question parse space represents a proposition set.
  • A proposition set includes a group of propositions, and this group of propositions is represented by the values of a group of implicit predicates. It can be understood that the truth of each proposition in a proposition set is represented by the value of the corresponding implicit predicate.
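  • For illustration, a point in this space can be viewed as a 0/1 assignment to ground implicit predicates; a hypothetical sketch follows (the identifiers and resource items are illustrative):

```python
# One point in the possible question parse space: a truth assignment (0/1)
# to ground implicit predicates. Identifiers and resources are illustrative.
proposition_set = {
    ("hasPhrase", 11): 1,                      # "actors" is selected
    ("hasPhrase", 13): 1,                      # "born in" is selected
    ("hasPhrase", 14): 0,                      # "in" is not selected
    ("hasResource", 11, "dbo:Actor"): 1,       # "actors" -> dbo:Actor
    ("hasResource", 13, "dbo:birthPlace"): 1,  # "born in" -> dbo:birthPlace
}
```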
  • The embodiments of the present invention further define observed predicates for indicating the features of the first candidate phrase, the features of the first resource item, and the relationship between the first candidate phrase and the first resource item.
  • The features of the first candidate phrase include the position of the first candidate phrase in the question, the part of speech of the head word of the first candidate phrase, the labels on the dependency path between first candidate phrases, and the like.
  • The features of the first resource item include the type of the first resource item, the correlation value between first resource items, the parameter matching relationship between first resource items, and the like.
  • The relationship between the first candidate phrase and the first resource item includes the prior matching score of the first candidate phrase and the first resource item.
  • Specifically, determining the values of the observed predicates in 104 includes: determining the position of the first candidate phrase in the question; determining the part of speech of the head word of the first candidate phrase using the Stanford part-of-speech tagging tool; determining the labels on the dependency path between two candidate phrases using the Stanford dependency parser; determining the type of the first resource item from the knowledge base, where the type is entity, class, or relation; and determining the parameter matching relationship between two resource items from the knowledge base, where the parameter matching relationship is one of: 1_1, 1_2, 2_1, and 2_2.
  • It further includes: calculating the similarity coefficient between two resource items as the correlation value between the two first resource items; and calculating the prior matching score between the first candidate phrase and the first resource item, the prior matching score being used to indicate the probability that the first candidate phrase is mapped to the first resource item.
  • Determining the parameter matching relationship between two resource items from the knowledge base includes: determining, from the knowledge base, the parameter matching relationship m1_m2 between a first resource item r1 and a first resource item r2, indicating that the m1-th parameter of the first resource item r1 is aligned with the m2-th parameter of the first resource item r2, where m1 is 1 or 2 and m2 is 1 or 2.
  • The observed predicates may include the following forms:
  • phraseIndex(p, i, j) represents the starting position i and the ending position j of the candidate phrase p in the question.
  • phrasePosTag(p, pt) represents the part of speech pt of the head word of the candidate phrase p. The Stanford part-of-speech tagging tool can be used to determine the part of speech of the head word.
  • phraseDepTag(p, q, dt) represents the label dt on the dependency path between the candidate phrase p and the candidate phrase q.
  • The Stanford dependency parser may be used to build a dependency parse tree of the question, and feature extraction is performed on the dependency parse tree to determine the labels on the dependency path between two candidate phrases.
  • phraseDepOne(p, q) is true when there is exactly one label on the dependency path between the candidate phrase p and the candidate phrase q, and false otherwise. Among the observed predicates, only the groundings of phraseDepOne(p, q) whose result is true are included.
  • hasMeanWord(p, q) is false when the words on the dependency path between the candidate phrase p and the candidate phrase q are all stop words or have the part of speech dt, in, wdt, to, cc, ex, pos, or wp; otherwise it is true.
  • Here, dt is a determiner, in is the preposition in, wdt is a wh-determiner (a question word starting with w), to is the preposition to, cc is a coordinating conjunction, ex is existential there, pos is a possessive ending, and wp is a wh-pronoun.
  • Question words beginning with w are, for example, what and which; conjunctions are, for example, and, but, and or. The symbols for the above parts of speech can be obtained from the part-of-speech tag set.
  • Among the observed predicates, only the groundings of hasMeanWord(p, q) whose result is true are included.
  • resourceType(r, rt) indicates that the resource item r is of type rt, where rt is E, C, or R: E denotes an entity (Entity), C denotes a class (Class), and R denotes a relation (Relation).
  • priorMatchScore(p, r, s) represents the prior matching score s between the candidate phrase p and the resource item r.
  • The score can be based on a link frequency, which refers to the number of times the candidate phrase p is linked to the resource item r divided by the total number of times the candidate phrase p is linked.
  • Alternatively, the prior matching score of the candidate phrase p and the resource item r may be computed as $\alpha \cdot s_1 + (1-\alpha) \cdot s_2$, where $\alpha$ is a weighting coefficient, $s_1$ is the Levenshtein distance between the label of the resource item r and the candidate phrase p, and $s_2$ is the cosine similarity between the vector of the candidate phrase p and the vector of the resource item r.
  • For the Levenshtein distance, see "A guided tour to approximate string matching", published by Navarro in ACM Computing Surveys in 2001; for the calculation of $s_2$, see "Recurrent neural network based language model", published by Mikolov et al. at INTERSPEECH in 2010.
  • Alternatively, the prior matching score of the candidate phrase p and the resource item r may be computed as $\alpha \cdot s_1 + \beta \cdot s_2 + (1-\alpha-\beta) \cdot s_3$, where $\alpha$ and $\beta$ are weighting coefficients, $s_1$ is the Levenshtein distance between the label of the resource item r and the candidate phrase p, $s_2$ is the cosine similarity between the vector of the candidate phrase p and the vector of the resource item r, and $s_3$ is the alignment score between the resource item r and the relation patterns, the relation patterns being those defined by PATTY and ReVerb as described above.
  • For the calculation of $s_3$, see "Natural language questions for the web of data", published by Yahya et al. at EMNLP in 2012.
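  • A minimal sketch of such a prior matching score, assuming a Levenshtein-based string similarity normalized to [0, 1] and precomputed vectors; the normalization and the coefficient value are illustrative assumptions.

```python
import numpy as np

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

def prior_match_score(phrase, label, phrase_vec, resource_vec, alpha=0.5):
    # s1: string similarity derived from the Levenshtein distance (normalized)
    s1 = 1.0 - levenshtein(phrase, label) / max(len(phrase), len(label))
    # s2: cosine similarity of the phrase and resource vectors
    s2 = float(np.dot(phrase_vec, resource_vec) /
               (np.linalg.norm(phrase_vec) * np.linalg.norm(resource_vec)))
    return alpha * s1 + (1 - alpha) * s2
```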
  • hasRelatedness(p, q, s) represents the correlation value s between the resource item p and the resource item q, where s lies in the interval 0 to 1.
  • The correlation value s may be the similarity coefficient of the resource item p and the resource item q; the similarity coefficient may also be referred to as the Jaccard similarity coefficient, the Jaccard coefficient, or the similarity evaluation coefficient.
  • For example, the similarity coefficient of the resource item p and the resource item q can be equal to the Jaccard coefficient of the incoming-link set of the resource item p and the incoming-link set of the resource item q.
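  • A minimal sketch of this relatedness computation, assuming the incoming links of each resource item are available as sets (the link sets below are illustrative):

```python
def jaccard(s, t):
    """Jaccard similarity coefficient of two sets: |s & t| / |s | t|."""
    if not s and not t:
        return 0.0
    return len(s & t) / len(s | t)

# Illustrative incoming-link sets for two resource items
links_p = {"dbr:Berlin", "dbr:Germany", "dbr:Actor", "dbr:Film"}
links_q = {"dbr:Berlin", "dbr:Germany"}
print(jaccard(links_p, links_q))  # 0.5
```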
  • isTypeCompatible(p, q, rr) represents a parameter matching relationship rr between the resource item p and the resource item q, where rr may be one of: 1_1, 1_2, 2_1, and 2_2; the meaning of the parameter matching relationship is as described above, and to avoid repetition, the details are not described here again.
  • hasQueryResult(p, q, o, rr1, rr2) represents a parameter matching relationship among the resource item p, the resource item q, and the resource item o for which the corresponding query returns a result: the resource item p and the resource item q have the parameter matching relationship rr1, and the resource item q and the resource item o have the parameter matching relationship rr2.
  • Among the observed predicates, phraseIndex(p, i, j), phrasePosTag(p, pt), phraseDepTag(p, q, dt), phraseDepOne(p, q), and hasMeanWord(p, q) are used to represent the features of the candidate phrases; resourceType(r, rt), hasRelatedness(p, q, s), isTypeCompatible(p, q, rr), and hasQueryResult(p, q, o, rr1, rr2) are used to represent the features of the resource items; and priorMatchScore(p, r, s) is used to represent the relationship between the candidate phrase and the resource item.
  • In the predicates above, p and q may be phrase identifiers of candidate phrases, and p, q, r, and o may be identifiers of resource items.
  • For the question and candidate phrases above, the values of the corresponding observed predicates can be determined. Expressions in which the value of the observed predicate is 1 include:
  • phraseIndex(11, 3, 3) = 1, where 11 is the phrase identifier of the candidate phrase "actors", as shown in Table 1.
  • phrasePosTag(13, vb) = 1, indicating that the proposition "the head word of the first candidate phrase born in is born, and its part of speech is vb" is true; 13 is the phrase identifier of the candidate phrase "born in", as shown in Table 1.
  • phraseDepTag(13, 15, pobj) = 1, indicating that the proposition "the label on the dependency path between the first candidate phrase born in and the first candidate phrase Berlin is pobj" is true; 13 is the phrase identifier of the candidate phrase "born in", and 15 is the phrase identifier of the candidate phrase "Berlin", as shown in Table 1.
  • the identifier of the resource item may also be represented by the predicate resource.
  • the first candidate phrase and the first resource item determined in 102 and 103 are ambiguous.
  • Embodiments of the present invention eliminate ambiguity of the first candidate phrase and the first resource item by uncertainty reasoning.
  • Uncertainty reasoning performs inference and decision making based on uncertain information.
  • An uncertainty reasoning network can process incomplete and noisy data sets, using probability-measure weights to describe the correlations between data, with the aim of resolving data inconsistency and uncertainty.
  • The model used for the uncertainty reasoning in 105 may be any one of the following: a Bayesian network, a probabilistic relational model, Bayesian logic programs, a relational Markov network, a Markov logic network, or probabilistic soft logic; the invention is not limited thereto.
  • Specifically, the uncertainty reasoning in 105 may be based on a Markov Logic Network (MLN), where the MLN includes predefined first-order formulas and the weights of the first-order formulas. That is to say, the model used for the uncertainty reasoning is the MLN.
  • The first-order formulas may include Boolean formulas and weighting formulas.
  • A Boolean formula has the weight +∞ and can be understood as a first-order logic formula in traditional first-order logic; Boolean formulas are hard constraints, also called hard formulas (hf), all of whose groundings must hold.
  • The weight of a weighting formula is the weighting-formula weight; weighting formulas are soft constraints, also called soft formulas (sf), whose ground atoms may be violated at the cost of a certain penalty.
  • A first-order formula is composed of first-order predicates, logical connectives, and variables; here, the first-order predicates may include the observed predicates and implicit predicates described above.
  • Optionally, in addition to the first-order formulas and their weights, the MLN may also include second-order formulas and their weights, or even higher-order formulas and their weights; the invention is not limited thereto.
  • Optionally, the Boolean formulas are as shown in Table 4, where the symbol "_" represents an arbitrary constant of the logical variable in that position, and |·| indicates the number of true ground atoms of the enclosed formula.
  • hf1 indicates that if a phrase p is selected, the phrase p is mapped to at least one resource item.
  • hf2 indicates that if a mapping of a phrase p to a resource item is selected, then the phrase p must be selected.
  • hf3 indicates that a phrase p can be mapped to only one resource item.
  • hf4 indicates that if a phrase p is not selected, no mapping of the phrase p to a resource item is selected.
  • hf5 indicates that if a mapping of a phrase to a resource item r is selected, the resource item r is related to at least one other resource item.
  • hf6 indicates that two resource items r1 and r2 can have only one parameter matching relationship.
  • hf7 indicates that if there is a parameter matching relationship between two resource items r1 and r2, then at least one mapping of a phrase to the resource item r1 is selected and at least one mapping of a phrase to the resource item r2 is selected.
  • hf8 indicates that any two selected phrases do not overlap, where overlap refers to the positions of the phrases in the question.
  • hf9, hf10, hf11, and hf12 indicate that if the type of a resource item r is entity or class, the resource item r cannot have its second parameter aligned with another resource item.
  • hf13 indicates that the parameter matching relationships between two resource items r1 and r2 must be consistent.
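  • For illustration, two of these hard constraints can be written in first-order notation roughly as follows; this is a reconstruction from the descriptions above, not the patent's verbatim formulas.
$$\forall p, r:\ \mathit{hasResource}(p, r) \Rightarrow \mathit{hasPhrase}(p) \qquad \text{(hf2)}$$
$$\forall p:\ \mathit{hasPhrase}(p) \Rightarrow |\{r : \mathit{hasResource}(p, r)\}| \geq 1 \qquad \text{(hf1)}$$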
  • The weighting formulas are shown in Table 5, where the symbol "+" indicates that a separate weight should be learned for each constant of the logical variable.
  • sf1 and sf2 indicate that the greater the prior matching score s of a phrase p mapped to a resource item r, the greater the probability that the phrase p and the resource item r are selected.
  • sf3 indicates that the part of speech of the head word of a phrase p has some association with the type of the resource item r to which the phrase p is mapped.
  • sf4, sf5, and sf6 indicate that the labels on the dependency path between two phrases p1 and p2 have some association with the parameter matching relationship between the two resource items r1 and r2, where the phrase p1 is mapped to the resource item r1 and the phrase p2 is mapped to the resource item r2.
  • sf7 indicates that the greater the correlation value between two resource items r1 and r2, the greater the possibility that there is a parameter matching relationship between the two resource items r1 and r2.
  • sf8 indicates that if a resource item triple has a query result, then the three resource items should have the corresponding parameter matching relationships.
  • The weighting-formula weights may be set manually, for example as empirical values preset by an administrator or expert of the knowledge base.
  • The weighting-formula weights may also be obtained by training through a learning method.
  • The weighting-formula weights are generally different for different knowledge bases.
  • The Boolean formulas shown in Table 4 can be understood as general rules that all knowledge bases satisfy.
  • The weighting formulas shown in Table 5 can be understood as specific rules whose weights differ across knowledge bases.
  • The Boolean formulas and the weighting formulas may also be collectively referred to as "meta rules", that is, rules that apply to knowledge bases of different fields.
  • 105 may also be referred to as inference, joint inference, or joint disambiguation.
  • Specifically, joint inference can be performed using the thebeast tool; the thebeast tool can be found at https://code.google.com/p/thebeast/.
  • A cutting plane method may be adopted to calculate the confidence of each proposition set according to the values of the observed predicates and the values of the implicit predicates.
  • Confidence may also be called credibility.
  • Specifically, the maximum likelihood estimate of the undirected graphical model can be used to calculate the confidence of each proposition set.
  • The MLN is represented as M, a first-order formula is represented as $\phi_i$, the weight of the first-order formula is represented as $w_i$, and a proposition set is represented as y. The confidence of a proposition set y can be computed as
$$p(y) = \frac{1}{Z}\exp\Bigg(\sum_{(\phi_i, w_i)\in M}\ \sum_{c\in C^{\phi_i}} w_i\, f_c^{\phi_i}(y)\Bigg)$$
where $C^{\phi_i}$ is the set of sub-formulas (groundings) corresponding to the first-order formula $\phi_i$, c is a sub-formula in the set, $f_c^{\phi_i}$ is a binary feature function representing whether the sub-formula is true or false under the proposition set y, and Z is a normalization constant.
  • The binary feature function $f_c^{\phi_i}$ takes the value 1 or 0: under the proposition set y, when the sub-formula c is true, $f_c^{\phi_i}(y)$ is 1; otherwise it is 0.
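  • A minimal sketch of this confidence computation, assuming the ground formulas have already been instantiated; the data structures and the omission of the normalization constant are illustrative simplifications.

```python
import math

def confidence_score(groundings, y):
    """Unnormalized MLN score of a proposition set y.

    groundings: list of (weight, ground_formula) pairs, where ground_formula
    maps a proposition set y (dict of ground atoms -> 0/1) to True/False.
    """
    total = 0.0
    for weight, ground_formula in groundings:
        if ground_formula(y):  # binary feature function: 1 if true under y
            total += weight
    return math.exp(total)    # proportional to the confidence p(y)

# Illustrative grounding: a soft rule rewarding selection of phrase 11
groundings = [(1.5, lambda y: y.get(("hasPhrase", 11), 0) == 1)]
print(confidence_score(groundings, {("hasPhrase", 11): 1}))
```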
  • Optionally, a maximum number of iterations can be set in 105, for example, 100.
  • Each confidence value in the resulting confidence set corresponds to a proposition set.
  • In 106, one or several proposition sets whose confidence satisfies a preset condition may be selected from the multiple proposition sets of the possible question parse space.
  • Specifically, the proposition set with the highest confidence may be determined and the combination of true propositions in it obtained; alternatively, several proposition sets with the highest confidence values may be determined and the combinations of true propositions in them obtained; the invention is not limited thereto.
  • Since the truth of the propositions in a proposition set is represented by the values of the implicit predicates, obtaining the combination of true propositions in 106 can be understood as obtaining the combination of implicit predicates whose value is 1. The true propositions are used to represent the search phrases selected from the first candidate phrases, the search resource items selected from the first resource items, and the features of the search resource items.
  • a formalized query statement can be generated at 107.
  • The formalized query statement can be SQL, or it may be SPARQL; correspondingly, 107 may also be referred to as a process of SPARQL generation.
  • Specifically, 107 may be: generating the SPARQL using a SPARQL template according to the combination of true propositions. The combination of true propositions can be used to construct the triples of the SPARQL, and a SPARQL template is then used to generate the SPARQL.
  • The SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template.
  • When the question is a Yes/No question, the SPARQL is generated using the ASK WHERE template according to the combination of true propositions; when the question is a Normal question, the SPARQL is generated using the SELECT ?url WHERE template; when the question is a Number question, the SPARQL is generated using the SELECT ?url WHERE template, or, when the SPARQL generated by the SELECT ?url WHERE template cannot obtain a numeric answer, the SPARQL is generated using the SELECT COUNT(?url) WHERE template.
  • For the question "Give me all actors who were born in Berlin", the SELECT ?url WHERE template applies, and the generated SPARQL takes the form shown in the sketch below.
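  • A minimal sketch of this template selection, together with a plausible query for the question above; the question-type labels, the triple construction, and the DBpedia identifiers (dbo:Actor, dbo:birthPlace) are illustrative assumptions.

```python
def build_sparql(question_type, triples, numeric_answer_found=True):
    """Assemble a SPARQL string from (subject, predicate, object) triples.

    question_type: "yes_no", "normal", or "number" (labels are illustrative).
    """
    body = " ".join(f"{s} {p} {o} ." for s, p, o in triples)
    if question_type == "yes_no":
        return f"ASK WHERE {{ {body} }}"
    if question_type == "number" and not numeric_answer_found:
        return f"SELECT COUNT(?url) WHERE {{ {body} }}"
    return f"SELECT ?url WHERE {{ {body} }}"

# Plausible query for "Give me all actors who were born in Berlin"
print(build_sparql("normal", [("?url", "rdf:type", "dbo:Actor"),
                              ("?url", "dbo:birthPlace", "dbr:Berlin")]))
```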
  • Optionally, 107 may include: generating a query resource graph according to the combination of true propositions, where the query resource graph includes vertices and edges; a vertex includes a search phrase and the search resource item to which the search phrase in that vertex is mapped.
  • An edge represents the parameter matching relationship between the search resource items of the two vertices it connects; the SPARQL is then generated according to the query resource graph.
  • Specifically, three search resource items connected to one another in the query resource graph may be used as one triple of the SPARQL, where the type of the search resource item located in the middle of the three connected search resource items is relation.
  • the predefined first-order formula used is domain-independent, that is, the predefined Boolean formula and weighting formula can be applied to all knowledge bases and have scalability. That is to say, the method provided by the embodiment of the present invention does not need to manually set the conversion rule.
  • FIG. 3 shows an example of question parsing according to an embodiment of the present invention.
  • the determined first candidate phrase includes: software, developed, developed by, organizations, found in, founded, California, USA.
  • 303 can refer to 103 in the foregoing embodiment. To avoid repetition, details are not described herein again.
  • For example, the first candidate phrase software is mapped to dbo:Software, dbr:Software, and so on; the mappings are not listed here one by one.
  • 305 can refer to 105 and 106 in the foregoing embodiment. To avoid repetition, details are not described herein again.
  • the combination of the true propositions is a combination in which the implied predicate has a value of 1.
  • The resource item query graph may also be referred to as a Semantic Items Query Graph.
  • The vertices in the resource item query graph may include: a search resource item, the type of the search resource item, and the position in the question of the search phrase mapped to the search resource item.
  • An edge in the resource item query graph represents: the parameter matching relationship between the search resource items of the two vertices connected by the edge.
  • The relationships between search resource items in the resource item query graph are binary relationships.
  • Optionally, the vertices in the resource item query graph may include: a search phrase, a search resource item, the type of the search resource item, the mapping of the search phrase to the search resource item, and the position of the search phrase in the question.
  • FIG. 4 is another example of a resource item query graph, including the vertices 311 to 315.
  • The vertex 311 includes: the search resource item dbo:Software, the type (Class) of the search resource item, the search phrase software, and the position 11 of the search phrase in the question.
  • The search phrase software maps to the search resource item dbo:Software.
  • the vertex 312 includes: a search resource item dbo:developer, a search resource item type Relation, a search phrase developed by, and a search phrase position 45 in the question.
  • The search phrase developed by maps to the search resource item dbo:developer.
  • the vertex 313 includes: a search resource item dbo:Company, a search resource item type Class, a search phrase organization, and a position 66 of the search phrase in the question.
  • the search phrase organization maps to the search resource item dbo:Company.
  • the vertex 314 includes: a search resource item dbo:foundationPlace, a search resource item type Relation, a search phrase found in, and a position 78 of the search phrase in the question.
  • the search phrase found in maps to the search resource item dbo:foundationPlace.
  • the vertex 315 includes: a search resource item dbr: California, a search resource item type Entity, a search phrase California, and a position 99 of the search phrase in the question.
  • the search phrase California maps to the search resource item dbr:California.
  • the edge 1_1 between the vertex 311 and the vertex 312 indicates that the parameter matching relationship between the search resource item dbo:Software and the search resource item dbo:developer is 1_1.
  • the edge 2_1 between the vertex 312 and the vertex 313 indicates that the parameter matching relationship between the search resource item dbo:developer and the search resource item dbo:Company is 2_1.
  • the edge 1_1 between the vertex 313 and the vertex 314 indicates that the parameter matching relationship between the search resource item dbo:Company and the search resource item dbo:foundationPlace is 1_1.
  • the edge 1_2 between the vertex 315 and the vertex 314 represents a parameter matching relationship between the search resource item dbr:California and the search resource item dbo:foundationPlace is 1_2.
  • SPARQL generation.
  • the binary relationship in the resource item query graph is converted into a ternary relationship.
  • three search resource items connected to each other in the resource item query graph have a ternary relationship, where the type of the search resource item located in the middle of the three connected search resource items is relation.
  • the natural language question in 301 is a Normal question; using the SELECT ?url WHERE template, the generated SPARQL is:
    SELECT ?url WHERE{
    ?url_answer rdf:type dbo:Software
    ?url_answer dbo:developer ?x1
    ?x1 rdf:type dbo:Company
    ?x1 dbo:foundationPlace dbr:California
    }
  • the natural language question can be converted into SPARQL.
  • the predefined first-order formulas used are domain-independent; that is, the predefined Boolean formulas and weighting formulas can be applied to all knowledge bases and are extensible. That is to say, the method provided by the embodiment of the present invention does not need manually set conversion rules.
  • the predefined Boolean formula and the weighting formula are language-independent, that is, have language extensibility.
  • it can be used in both the English knowledge base and the Chinese knowledge base.
  • the uncertainty inference in 105 may be based on the MLN.
  • the MLN includes a predefined first-order formula and a weight of the first-order formula.
  • the first-order formula may include a Boolean formula and a weighting formula.
  • the weight of a Boolean formula is +∞, and the weight of a weighting formula is a learned weighting formula weight.
  • the weighting formula weights can be obtained through training and learning. It can thus be understood that, before 101, as shown in FIG. 5, the method may further include:
  • the weight of the first-order formula for the knowledge base can be determined by the learning method, and can be used as a conversion rule for the knowledge base. In this way, there is no need to manually set the conversion rules, and the first-order formula of the predefined Markov logic network MLN is extensible and can be applied to any knowledge base.
  • the question answering system knowledge base includes a question library, and the question library includes multiple natural language questions. Then, in 401, multiple natural language questions can be obtained from the question library in the question answering system knowledge base.
  • the embodiment of the present invention does not limit the number of multiple natural language questions. For example, multiple natural language questions can be one thousand natural language questions.
  • 110 natural language questions can be obtained from the training set of the question library Q1 of the Question Answering over Linked Data (QALD) benchmark.
  • for the process of 402, refer to the process of 102 of the foregoing embodiment.
  • for the process of 403, refer to the process of 103 of the foregoing embodiment.
  • for the process of 404, refer to the process of 104 of the foregoing embodiment. To avoid repetition, details are not described herein again.
  • the values of the observed predicates corresponding to the plurality of natural language questions can be determined.
  • the first order formula includes a Boolean formula and a weighting formula.
  • the weight of a Boolean formula is +∞, and the weight of a weighting formula is a learned weighting formula weight.
  • the manually labeled values of the implicit predicates in 405 satisfy the Boolean formulas.
  • the weight of the first-order formula is determined by training, that is, the weighting formula weight is determined by training.
  • the undirected graph may include a Markov Network (MN).
  • a Margin Infused Relaxed Algorithm (MIRA) may be used to determine the weights of the first-order formulas according to the values of the observed predicates corresponding to the plurality of natural language questions, the values of the implicit predicates corresponding to the plurality of natural language questions, and the first-order formulas.
  • the weighting formula weights can be learned using the thebeast tool.
  • the weighting formula weights may be initialized to 0 first, and then the weighting formula weights are updated using MIRA.
  • the maximum number of cycles of training may also be set, for example, the maximum number of cycles of training is 10.
  • the weighting formula weights of sf3 in Table 5 can be as shown in Table 6. It can be seen from Table 6 that when the part of speech of the head word of a candidate phrase is nn, the probability that the candidate phrase maps to a resource item of type E is relatively large.
  • the weighting formula weights of any one of the knowledge bases can be determined, so that the conversion rules for any one of the knowledge bases can be obtained.
  • the method for determining the weight of the first-order formula is a data-driven manner, which can be applied to different knowledge bases.
  • the efficiency of the Q&A analysis of the knowledge base can be improved.
  • structure learning may also be performed on the constructed undirected graph to learn second-order or even higher-order formulas; a new undirected graph may then be constructed according to the learned second-order or higher-order formulas, and the weights corresponding to the second-order or higher-order formulas may be learned.
  • the invention is not limited thereto.
  • FIG. 6 is a block diagram of an apparatus for parsing a question in accordance with one embodiment of the present invention.
  • the apparatus 500 shown in FIG. 6 includes a receiving unit 501, a phrase detecting unit 502, a mapping unit 503, a first determining unit 504, a second determining unit 505, an obtaining unit 506, and a generating unit 507.
  • the receiving unit 501 is configured to receive a question input by the user.
  • the phrase detecting unit 502 is configured to perform phrase detection on the question received by the receiving unit 501 to determine a first candidate phrase.
  • mapping unit 503, configured to map the first candidate phrase determined by the phrase detecting unit 502 to a first resource item in a knowledge base, where the first resource item has semantics consistent with the first candidate phrase.
  • a first determining unit 504, configured to determine, according to the first candidate phrase and the first resource item, the values of observed predicates and a possible question parse space, where the observed predicates are used to represent features of the first candidate phrase, features of the first resource item, and the relationship between the first candidate phrase and the first resource item; a point in the possible question parse space is a proposition set, and the truth of a proposition in the proposition set is characterized by the value of an implicit predicate.
  • a second determining unit 505, configured to: for each proposition set in the possible question parse space, perform uncertainty inference according to the values of the observed predicates and the values of the implicit predicates determined by the first determining unit 504, and calculate the confidence of each proposition set.
  • the obtaining unit 506 is configured to obtain the combination of true propositions in a proposition set whose confidence satisfies a preset condition, where the true propositions are used to represent the search phrase selected from the first candidate phrases, the search resource item selected from the first resource items, and the features of the search resource item.
  • the generating unit 507 is configured to generate a formalized query statement according to the combination of the true propositions acquired by the obtaining unit 506.
  • the embodiment of the invention uses the observation predicate and the implicit predicate to perform uncertainty reasoning, and can convert the natural language question into a formal query statement. Moreover, in the embodiment of the present invention, the method of uncertainty reasoning can be applied to a knowledge base of any field, and has domain scalability, so that it is not necessary to manually configure a conversion rule for the knowledge base.
  • the uncertainty reasoning is based on a Markov logic network MLN, and the MLN includes a predefined first-order formula and a weight of the first-order formula.
  • the obtaining unit 506 is further configured to obtain a plurality of natural language questions from the knowledge base;
  • the phrase detecting unit 502 is further configured to perform phrase detection on the plurality of natural language questions obtained by the obtaining unit 506, to determine second candidate phrases;
  • the mapping unit 503 is further configured to map the second candidate phrase to a second resource item in the knowledge base, where the second resource item and the second candidate phrase have consistent semantics;
  • the first determining unit 504 is further configured to determine, according to the second candidate phrase and the second resource item, a value of an observed predicate corresponding to the plurality of natural language questions;
  • the obtaining unit 506 is further configured to obtain manually labeled values of the implicit predicates corresponding to the plurality of natural language questions;
  • the second determining unit 505 is further configured to: according to a value of the observed predicate corresponding to the plurality of natural language questions, a value of the implicit predicate corresponding to the plurality of natural language questions, and the first-order formula An undirected graph is constructed, and the weight of the first-order formula is determined by training.
  • the first-order formulas include Boolean formulas and weighting formulas,
  • the weight of a Boolean formula is +∞,
  • the weight of a weighting formula is a weighting formula weight,
  • the manually labeled values of the implicit predicates corresponding to the plurality of natural language questions satisfy the Boolean formulas,
  • the second determining unit 505 is specifically configured to: construct an undirected graph according to the values of the observed predicates corresponding to the plurality of natural language questions, the values of the implicit predicates corresponding to the plurality of natural language questions, and the first-order formulas, and determine the weighting formula weights through training.
  • the second determining unit 505 is specifically configured to: construct an undirected graph according to the values of the observed predicates corresponding to the plurality of natural language questions, the values of the implicit predicates corresponding to the plurality of natural language questions, and the first-order formulas, and determine the weights of the first-order formulas by using the Margin Infused Relaxed Algorithm (MIRA).
  • the MLN is denoted M, a first-order formula is denoted φ_i, the weight of the first-order formula is denoted w_i, and a proposition set is denoted y,
  • the second determining unit 505 is specifically configured to calculate the confidence of each proposition set according to
    $$p(y) = \frac{1}{Z}\exp\Bigl(\sum_i w_i \sum_{c \in G^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
    where Z is a normalization constant, $G^{\phi_i}$ is the set of ground sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in that set, and $f_c^{\phi_i}$ is a binary function representing the truth of the first-order formula under the proposition set y.
  • the obtaining unit 506 is specifically configured to: determine a proposition set with the highest value of the confidence, and obtain a combination of true propositions in the proposition set with the highest value of the confidence.
  • the features of the first candidate phrase include the position of the first candidate phrase in the question, the part of speech of the head word of the first candidate phrase, and the labels on the dependency paths between pairs of first candidate phrases,
  • the features of the first resource item include the type of the first resource item, the relatedness values between pairs of first resource items, and the parameter matching relationships between pairs of first resource items,
  • the relationship between the first candidate phrase and the first resource item includes the prior matching score of the first candidate phrase and the first resource item,
  • the first determining unit 504 is specifically configured to:
  • the prior matching score is used to indicate the probability that the first candidate phrase maps to the first resource item.
  • the formal query statement is a Simple Protocol and RDF Query Language (SPARQL) statement.
  • the generating unit 507 is specifically configured to:
  • the SPARQL is generated using a SPARQL template according to the combination of the true propositions.
  • the SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template,
  • the generating unit 507 is specifically configured to:
  • when the question is a Yes/No question, the SPARQL is generated using the ASK WHERE template according to the combination of the true propositions; when the question is a Normal question, the SPARQL is generated using the SELECT ?url WHERE template; when the question is a Number question, the SPARQL is generated using the SELECT ?url WHERE template, or, when the SPARQL generated using the SELECT ?url WHERE template cannot obtain a numeric answer, the SPARQL is generated using the SELECT COUNT(?url) WHERE template.
  • phrase detecting unit 502 is specifically configured to: use a word sequence in the question as the first candidate phrase, where the word sequence satisfies:
  • all consecutive non-stop words in the word sequence begin with a capital letter, or, if not, the length of the word sequence is less than four;
  • the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
  • the words included in the word sequence are not all stop words.
  • the device 500 may be a server of a knowledge base.
  • the device 500 can implement various processes implemented by the device in the embodiments of FIG. 1 to FIG. 5, and details are not described herein again to avoid repetition.
  • FIG. 7 is a block diagram of an apparatus for parsing a question in accordance with another embodiment of the present invention.
  • the apparatus 600 shown in FIG. 7 includes a processor 601, a receiving circuit 602, a transmitting circuit 603, and a memory 604.
  • the receiving circuit 602 is configured to receive a question input by the user.
  • the processor 601 is configured to perform phrase detection on the question received by the receiving circuit 602 to determine a first candidate phrase.
  • the processor 601 is further configured to map the first candidate phrase to a first resource item in the knowledge base, where the first resource item has a consistent semantic with the first candidate phrase.
  • the processor 601 is further configured to determine, according to the first candidate phrase and the first resource item, the values of observed predicates and a possible question parse space, where the observed predicates are used to represent features of the first candidate phrase, features of the first resource item, and the relationship between the first candidate phrase and the first resource item; a point in the possible question parse space is a proposition set, and the truth of a proposition in the proposition set is characterized by the value of an implicit predicate.
  • the processor 601 is further configured to: for each proposition set in the possible question parse space, perform uncertainty inference according to the values of the observed predicates and the values of the implicit predicates, and calculate the confidence of each proposition set.
  • the receiving circuit 602 is further configured to acquire a combination of true propositions in the set of propositions whose confidence meets a preset condition, where the true proposition is used to represent a search phrase selected from the first candidate phrase, a search resource item selected from the first resource item and a feature of the search resource item.
  • the processor 601 is further configured to generate a formalized query statement according to the combination of the true propositions.
  • the embodiment of the present invention uses the observed predicates and the implicit predicates to perform uncertainty inference, and can convert a natural language question into a formal query statement. Moreover, in the embodiment of the present invention, the method of uncertainty inference can be applied to a knowledge base of any field, and has domain scalability, so that it is not necessary to manually configure conversion rules for the knowledge base.
  • the components are coupled together by a bus system 605, which includes, in addition to a data bus, a power bus, a control bus, and a status signal bus; for clarity, the various buses are all labeled as the bus system 605 in FIG. 7.
  • Processor 601 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 601 or by instructions in the form of software.
  • the processor 601 may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present invention may be directly implemented by the hardware decoding processor, or may be performed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a conventional storage medium such as random access memory, flash memory, read only memory, programmable read only memory or electrically erasable programmable memory, registers, and the like.
  • the storage medium is located in the memory 604, and the processor 601 reads the information in the memory 604 and completes the steps of the above method in combination with its hardware.
  • the memory 604 in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read only memory (PROM), an erasable programmable read only memory (Erasable PROM, EPROM), an electrically erasable programmable read only memory (EEPROM), or a flash memory.
  • the volatile memory can be a Random Access Memory (RAM) that acts as an external cache.
  • many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
  • the embodiments described herein can be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof.
  • the processing unit can be implemented in one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
  • a code segment can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
  • a code segment can be combined into another code segment or hardware circuit by transmitting and/or receiving information, data, arguments, parameters or memory contents. Information, arguments, parameters, data, etc. can be communicated, forwarded, or transmitted using any suitable means including memory sharing, messaging, token passing, network transmission, and the like.
  • the techniques described herein can be implemented by modules (eg, procedures, functions, and so on) that perform the functions described herein.
  • the software code can be stored in a memory unit and executed by the processor.
  • the memory unit can be implemented in the processor or external to the processor, in the latter case the memory unit can be communicatively coupled to the processor via various means known in the art.
  • the uncertainty reasoning is based on a Markov logic network MLN, and the MLN includes a predefined first-order formula and a weight of the first-order formula.
  • the memory 604 can be used to store resource items, types of resource items, and the like. Memory 604 can also be used to store the first order formula. The memory 604 can also be used to store SPARQL templates.
  • the receiving circuit 602 is further configured to obtain a plurality of natural language questions from the knowledge base;
  • the processor 601 is further configured to perform phrase detection on the plurality of natural language questions to determine second candidate phrases;
  • the processor 601 is further configured to map the second candidate phrase to a second resource item in the knowledge base, where the second resource item and the second candidate phrase have consistent semantics;
  • the processor 601 is further configured to determine, according to the second candidate phrase and the second resource item, a value of an observed predicate corresponding to the plurality of natural language questions;
  • the receiving circuit 602 is further configured to obtain a manually labeled value of an implicit predicate corresponding to the plurality of natural language questions;
  • the processor 601 is further configured to construct an undirected graph according to the values of the observed predicates corresponding to the plurality of natural language questions, the values of the implicit predicates corresponding to the plurality of natural language questions, and the first-order formulas, and determine the weights of the first-order formulas through training.
  • the first-order formulas include Boolean formulas and weighting formulas,
  • the weight of a Boolean formula is +∞,
  • the weight of a weighting formula is a weighting formula weight,
  • the processor 601 is specifically configured to: according to a value of an observed predicate corresponding to the plurality of natural language questions, a value of an implicit predicate corresponding to the plurality of natural language questions, and the first-order formula, An undirected graph is constructed, and the weighting formula weights are determined by training.
  • the processor 601 is specifically configured to:
  • the MLN is denoted M, a first-order formula is denoted φ_i, the weight of the first-order formula is denoted w_i, and a proposition set is denoted y,
  • the processor 601 is specifically configured to calculate the confidence of each proposition set according to
    $$p(y) = \frac{1}{Z}\exp\Bigl(\sum_i w_i \sum_{c \in G^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
    where Z is a normalization constant, $G^{\phi_i}$ is the set of ground sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in that set, and $f_c^{\phi_i}$ is a binary function representing the truth of the first-order formula under the proposition set y.
  • the receiving circuit 602 is specifically configured to: determine the proposition set with the highest confidence value, and obtain the combination of true propositions in the proposition set with the highest confidence value.
  • the features of the first candidate phrase include the position of the first candidate phrase in the question, the part of speech of the head word of the first candidate phrase, and the labels on the dependency paths between pairs of first candidate phrases,
  • the features of the first resource item include the type of the first resource item, the relatedness values between pairs of first resource items, and the parameter matching relationships between pairs of first resource items,
  • the relationship between the first candidate phrase and the first resource item includes the prior matching score of the first candidate phrase and the first resource item,
  • the processor 601 is specifically configured to:
  • the prior matching score is used to indicate the probability that the first candidate phrase maps to the first resource item.
  • the formal query statement is a Simple Protocol and RDF Query Language (SPARQL) statement.
  • the processor 601 is specifically configured to:
  • the SPARQL is generated using a SPARQL template according to the combination of the true propositions.
  • the SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template,
  • the processor 601 is specifically configured to:
  • when the question is a Yes/No question, the SPARQL is generated using the ASK WHERE template according to the combination of the true propositions; when the question is a Normal question, the SPARQL is generated using the SELECT ?url WHERE template; when the question is a Number question, the SPARQL is generated using the SELECT ?url WHERE template, or, when the SPARQL generated using the SELECT ?url WHERE template cannot obtain a numeric answer, the SPARQL is generated using the SELECT COUNT(?url) WHERE template.
  • the processor 601 is specifically configured to: use a word sequence in the question as the first candidate phrase, where the word sequence satisfies:
  • all consecutive non-stop words in the word sequence begin with a capital letter, or, if not, the length of the word sequence is less than four;
  • the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
  • the words included in the word sequence are not all stop words.
  • the device 600 may be a server of a knowledge base.
  • the device 600 can implement various processes implemented by the device in the embodiments of FIG. 1 to FIG. 5, and details are not described herein again to avoid repetition.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of units is merely a logical function division; in actual implementation, there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Operations Research (AREA)
  • Algebra (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A method for question parsing in a knowledge base, including: receiving a question input by a user (101); performing phrase detection on the question to determine candidate phrases (102); mapping the candidate phrases to resource items in the knowledge base (103); further determining the values of observed predicates and a possible question parse space (104); for each proposition set in the possible question parse space, performing uncertainty inference according to the values of the observed predicates and the implicit predicates to calculate a confidence (105), and obtaining the combination of true propositions in a proposition set whose confidence satisfies a preset condition (106); and generating a formal query statement according to the combination of true propositions (107). Using observed predicates and implicit predicates, the method performs uncertainty inference and can convert a natural language question into a formal query statement. The uncertainty inference method can be applied to knowledge bases in any domain and is domain-extensible, so that no conversion rules need to be manually configured for a knowledge base.

Description

Method and Device for Question Parsing in a Knowledge Base
This application claims priority to Chinese Patent Application No. 201410513189.4, filed with the Chinese Patent Office on September 29, 2014 and entitled "Method and Device for Question Parsing in a Knowledge Base", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present invention relate to the field of communications, and more specifically, to a method and a device for question parsing in a knowledge base.
Background
A knowledge base (Knowledge Base, KB) is a structured, easy-to-operate, easy-to-use, and comprehensively organized knowledge cluster in knowledge engineering. It is a collection of interrelated knowledge pieces that are stored, organized, managed, and used in computer memory by means of one or several knowledge representation methods, to meet the needs of problem solving in one or more domains.
A large number of knowledge resources and knowledge communities have already emerged on the Internet, such as Wikipedia, Baidu Baike, and Hudong Baike. From these knowledge resources, existing research has mined large-scale knowledge bases centered on entities and entity relationships. In addition, there are domain knowledge bases, such as weather knowledge bases and catering knowledge bases.
The construction of knowledge bases has evolved from curation by hand and by collective intelligence to automatic acquisition over the whole Internet using machine learning and information extraction techniques. Early knowledge bases were constructed manually by experts, for example, WordNet, CYC, CCD, HowNet, and the Encyclopedia of China. With the development of information technology, however, traditionally hand-built knowledge bases gradually exposed the drawbacks of small scale, little knowledge, and slow updates; at the same time, the deterministic knowledge frameworks built by experts could not meet the needs of large-scale computation in the noisy environment of the Internet. This is also one of the reasons the CYC project ultimately failed. With the rapid rise of Web 2.0, a large number of web knowledge bases based on collective intelligence appeared, including Wikipedia, Hudong Baike, and Baidu Baike. On the basis of these web resources, many automatic and semi-automatic knowledge base construction methods have been used to build large usable knowledge bases, such as YAGO, DBpedia, and Freebase.
Based on these knowledge bases, knowledge-base-based question answering (Knowledge-base-based Question Answering) systems can be built. Compared with question answering systems based on retrieval techniques, a knowledge-base-based question answering system may have lower question coverage due to the limited scale of the knowledge base, but it possesses a certain inference capability and can reach higher accuracy within a restricted domain. A number of knowledge-base-based question answering systems have therefore emerged, some as stand-alone applications and some as enhancements of existing products, such as Apple's Siri and Google's Knowledge Graph.
A question answering (Question Answering) system does not require the user to decompose a question into keywords; the user asks directly in natural language. After the system processes the user's question, it quickly searches the knowledge base or the Internet for the answer corresponding to the question and returns the answer itself to the user, rather than related web pages. Question answering systems therefore greatly reduce the difficulty of use and are more convenient and efficient than traditional keyword retrieval and semantic search engines.
The Question Answering over Linked Data (QALD) evaluation campaign has promoted the development of question answering systems. Its goal is, for large-scale structured linked data, to convert natural language questions into structured Simple Protocol and RDF (Resource Description Framework) Query Language (SPARQL) statements, thereby establishing a friendly natural language query interface. Converting a natural language question into structured SPARQL depends on conversion rules specific to the knowledge base. In current question answering systems, however, the conversion rules are all configured manually, which not only consumes considerable manpower but also extends poorly across domains.
Summary
Embodiments of the present invention provide a method for question parsing based on a knowledge base that requires no manually configured conversion rules and is domain-independent.
According to a first aspect, a method for question parsing in a knowledge base is provided, including:
receiving a question input by a user;
performing phrase detection on the question to determine first candidate phrases;
mapping the first candidate phrases to first resource items in the knowledge base, where the first resource items have semantics consistent with the first candidate phrases;
determining, according to the first candidate phrases and the first resource items, values of observed predicates and a possible question parse space, where the observed predicates are used to represent features of the first candidate phrases, features of the first resource items, and relationships between the first candidate phrases and the first resource items, a point in the possible question parse space is a proposition set, and the truth of a proposition in the proposition set is characterized by the value of an implicit predicate;
for each proposition set in the possible question parse space, performing uncertainty inference according to the values of the observed predicates and the values of the implicit predicates, and calculating the confidence of each proposition set;
obtaining a combination of true propositions in a proposition set whose confidence satisfies a preset condition, where the true propositions are used to represent search phrases selected from the first candidate phrases, search resource items selected from the first resource items, and features of the search resource items;
generating a formal query statement according to the combination of true propositions.
With reference to the first aspect, in a first possible implementation of the first aspect, the uncertainty inference is based on a Markov logic network (MLN), and the MLN includes predefined first-order formulas and weights of the first-order formulas.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, before the receiving a question input by a user, the method further includes:
obtaining multiple natural language questions from the knowledge base;
performing phrase detection on the multiple natural language questions to determine second candidate phrases of the multiple natural language questions;
mapping the second candidate phrases to second resource items in the knowledge base, where the second resource items have semantics consistent with the second candidate phrases;
determining, according to the second candidate phrases and the second resource items, values of observed predicates corresponding to the multiple natural language questions;
obtaining manually labeled values of implicit predicates corresponding to the multiple natural language questions;
constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determining the weights of the first-order formulas through training.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the first-order formulas include Boolean formulas and weighting formulas, the weight of a Boolean formula is +∞, the weight of a weighting formula is a weighting formula weight, and the manually labeled values of the implicit predicates corresponding to the multiple natural language questions satisfy the Boolean formulas,
and the constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determining the weights of the first-order formulas through training includes:
constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determining the weighting formula weights through training.
With reference to the second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determining the weights of the first-order formulas through training includes:
constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determining the weights of the first-order formulas by using the Margin Infused Relaxed Algorithm (MIRA).
With reference to any one of the foregoing possible implementations of the first aspect, in a fifth possible implementation of the first aspect, the MLN is denoted M, a first-order formula is denoted φ_i, the weight of the first-order formula is denoted w_i, and a proposition set is denoted y,
and the performing, for each proposition set in the question parse space, uncertainty inference according to the values of the observed predicates and the values of the implicit predicates, and calculating the confidence of each proposition set includes:
calculating the confidence of each proposition set according to
$$p(y) = \frac{1}{Z}\exp\Bigl(\sum_i w_i \sum_{c \in G^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
where Z is a normalization constant, $G^{\phi_i}$ is the set of ground sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in the set of sub-formulas of $G^{\phi_i}$, and $f_c^{\phi_i}$ is a binary function, with $f_c^{\phi_i}(y)$ representing the truth of the first-order formula under the proposition set y.
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in a sixth possible implementation of the first aspect, the obtaining a combination of true propositions in a proposition set whose confidence satisfies a preset condition includes:
determining the proposition set with the largest confidence value, and obtaining the combination of true propositions in the proposition set with the largest confidence value.
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in a seventh possible implementation of the first aspect,
the features of the first candidate phrases include the positions of the first candidate phrases in the question, the parts of speech of the head words of the first candidate phrases, and the labels on the dependency paths between pairs of first candidate phrases,
the features of the first resource items include the types of the first resource items, the relatedness values between pairs of first resource items, and the parameter matching relationships between pairs of first resource items,
the relationships between the first candidate phrases and the first resource items include the prior matching scores of the first candidate phrases and the first resource items,
and the determining values of observed predicates according to the first candidate phrases and the first resource items includes:
determining the positions of the first candidate phrases in the question;
determining the parts of speech of the head words of the first candidate phrases by using the Stanford part-of-speech tagging tool;
determining the labels on the dependency paths between pairs of first candidate phrases by using the Stanford dependency parsing tool;
determining the types of the first resource items from the knowledge base, where a type is entity, class, or relation;
determining the parameter matching relationships between pairs of first resource items from the knowledge base;
using the similarity coefficient between each pair of first resource items as the relatedness value between the pair of first resource items;
calculating the prior matching scores between the first candidate phrases and the first resource items, where a prior matching score is used to represent the probability that a first candidate phrase maps to a first resource item.
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in an eighth possible implementation of the first aspect, the formal query statement is a Simple Protocol and RDF Query Language (SPARQL) statement.
With reference to the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, the generating a formal query statement according to the combination of true propositions includes:
generating the SPARQL by using a SPARQL template according to the combination of true propositions.
With reference to the ninth possible implementation of the first aspect, in a tenth possible implementation of the first aspect, the SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template,
and the generating the SPARQL by using a SPARQL template according to the combination of true propositions includes:
when the question is a Yes/No question, generating the SPARQL by using the ASK WHERE template according to the combination of true propositions;
when the question is a Normal question, generating the SPARQL by using the SELECT ?url WHERE template according to the combination of true propositions;
when the question is a Number question, generating the SPARQL by using the SELECT ?url WHERE template according to the combination of true propositions, or, when the SPARQL generated by using the SELECT ?url WHERE template cannot obtain a numeric answer, generating the SPARQL by using the SELECT COUNT(?url) WHERE template.
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in an eleventh possible implementation of the first aspect, the performing phrase detection on the question to determine first candidate phrases includes: using a word sequence in the question as a first candidate phrase, where the word sequence satisfies:
all consecutive non-stop words in the word sequence begin with a capital letter, or, if not all consecutive non-stop words in the word sequence begin with a capital letter, the length of the word sequence is less than four;
the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
the words included in the word sequence are not all stop words.
According to a second aspect, a device for question parsing is provided, including:
a receiving unit, configured to receive a question input by a user;
a phrase detecting unit, configured to perform phrase detection on the question received by the receiving unit, to determine first candidate phrases;
a mapping unit, configured to map the first candidate phrases determined by the phrase detecting unit to first resource items in a knowledge base, where the first resource items have semantics consistent with the first candidate phrases;
a first determining unit, configured to determine, according to the first candidate phrases and the first resource items, values of observed predicates and a possible question parse space, where the observed predicates are used to represent features of the first candidate phrases, features of the first resource items, and relationships between the first candidate phrases and the first resource items, a point in the possible question parse space is a proposition set, and the truth of a proposition in the proposition set is characterized by the value of an implicit predicate;
a second determining unit, configured to: for each proposition set in the possible question parse space, perform uncertainty inference according to the values of the observed predicates and the values of the implicit predicates determined by the first determining unit, and calculate the confidence of each proposition set;
an obtaining unit, configured to obtain a combination of true propositions in a proposition set, determined by the second determining unit, whose confidence satisfies a preset condition, where the true propositions are used to represent search phrases selected from the first candidate phrases, search resource items selected from the first resource items, and features of the search resource items;
a generating unit, configured to generate a formal query statement according to the combination of true propositions.
With reference to the second aspect, in a first possible implementation of the second aspect, the uncertainty inference is based on a Markov logic network (MLN), and the MLN includes predefined first-order formulas and weights of the first-order formulas.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect,
the obtaining unit is further configured to obtain multiple natural language questions from the knowledge base;
the phrase detecting unit is further configured to perform phrase detection on the multiple natural language questions obtained by the obtaining unit, to determine second candidate phrases of the multiple natural language questions;
the mapping unit is further configured to map the second candidate phrases to second resource items in the knowledge base, where the second resource items have semantics consistent with the second candidate phrases;
the first determining unit is further configured to determine, according to the second candidate phrases and the second resource items, values of observed predicates corresponding to the multiple natural language questions;
the obtaining unit is further configured to obtain manually labeled values of implicit predicates corresponding to the multiple natural language questions;
the second determining unit is further configured to construct an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determine the weights of the first-order formulas through training.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the first-order formulas include Boolean formulas and weighting formulas, the weight of a Boolean formula is +∞, the weight of a weighting formula is a weighting formula weight, and the manually labeled values of the implicit predicates corresponding to the multiple natural language questions satisfy the Boolean formulas,
and the second determining unit is specifically configured to: construct an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determine the weighting formula weights through training.
With reference to the second possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the second determining unit is specifically configured to:
construct an undirected graph according to the values of the observed predicates corresponding to the multiple natural language questions, the values of the implicit predicates corresponding to the multiple natural language questions, and the first-order formulas, and determine the weights of the first-order formulas by using the Margin Infused Relaxed Algorithm (MIRA).
With reference to any one of the foregoing possible implementations of the second aspect, in a fifth possible implementation of the second aspect, the MLN is denoted M, a first-order formula is denoted φ_i, the weight of the first-order formula is denoted w_i, and a proposition set is denoted y,
and the second determining unit is specifically configured to:
construct possible worlds according to the values of the observed predicates and the implicit predicates, where a possible world is denoted y; and
calculate the confidence of each proposition set according to
$$p(y) = \frac{1}{Z}\exp\Bigl(\sum_i w_i \sum_{c \in G^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
where Z is a normalization constant, $G^{\phi_i}$ is the set of ground sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in the set of sub-formulas of $G^{\phi_i}$, and $f_c^{\phi_i}$ is a binary function, with $f_c^{\phi_i}(y)$ representing the truth of the first-order formula under the proposition set y.
With reference to the second aspect or any one of the foregoing possible implementations of the second aspect, in a sixth possible implementation of the second aspect, the obtaining unit is specifically configured to:
determine the proposition set with the largest confidence value, and obtain the combination of true propositions in the proposition set with the largest confidence value.
With reference to the second aspect or any one of the foregoing possible implementations of the second aspect, in a seventh possible implementation of the second aspect,
the features of the first candidate phrases include the positions of the first candidate phrases in the question, the parts of speech of the head words of the first candidate phrases, and the labels on the dependency paths between pairs of first candidate phrases,
the features of the first resource items include the types of the first resource items, the relatedness values between pairs of first resource items, and the parameter matching relationships between pairs of first resource items,
the relationships between the first candidate phrases and the first resource items include the prior matching scores of the first candidate phrases and the first resource items,
and the first determining unit is specifically configured to:
determine the positions of the first candidate phrases in the question;
determine the parts of speech of the head words of the first candidate phrases by using the Stanford part-of-speech tagging tool;
determine the labels on the dependency paths between pairs of first candidate phrases by using the Stanford dependency parsing tool;
determine the types of the first resource items from the knowledge base, where a type is entity, class, or relation;
determine the parameter matching relationships between pairs of first resource items from the knowledge base;
use the similarity coefficient between each pair of first resource items as the relatedness value between the pair of first resource items;
calculate the prior matching scores between the first candidate phrases and the first resource items, where a prior matching score is used to represent the probability that a first candidate phrase maps to a first resource item.
With reference to the second aspect or any one of the foregoing possible implementations of the second aspect, in an eighth possible implementation of the second aspect, the formal query statement is a Simple Protocol and RDF Query Language (SPARQL) statement.
With reference to the eighth possible implementation of the second aspect, in a ninth possible implementation of the second aspect, the generating unit is specifically configured to:
generate the SPARQL by using a SPARQL template according to the combination of true propositions.
With reference to the ninth possible implementation of the second aspect, in a tenth possible implementation of the second aspect, the SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template,
and the generating unit is specifically configured to:
when the question is a Yes/No question, generate the SPARQL by using the ASK WHERE template according to the combination of true propositions;
when the question is a Normal question, generate the SPARQL by using the SELECT ?url WHERE template according to the combination of true propositions;
when the question is a Number question, generate the SPARQL by using the SELECT ?url WHERE template according to the combination of true propositions, or, when the SPARQL generated by using the SELECT ?url WHERE template cannot obtain a numeric answer, generate the SPARQL by using the SELECT COUNT(?url) WHERE template.
With reference to the second aspect or any one of the foregoing possible implementations of the second aspect, in an eleventh possible implementation of the second aspect, the phrase detecting unit is specifically configured to:
use a word sequence in the question as a first candidate phrase, where the word sequence satisfies:
all consecutive non-stop words in the word sequence begin with a capital letter, or, if not all consecutive non-stop words in the word sequence begin with a capital letter, the length of the word sequence is less than four;
the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
the words included in the word sequence are not all stop words.
Based on a predefined uncertainty inference network, the embodiments of the present invention can convert a natural language question input by a user into structured SPARQL. In the embodiments of the present invention, the predefined uncertainty inference network can be applied to knowledge bases in any domain and is domain-extensible, so that no conversion rules need to be manually configured for a knowledge base.
Brief Description of Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a flowchart of a method for question parsing in a knowledge base according to an embodiment of the present invention.
FIG. 2 is an example of a dependency parse tree according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a method for question parsing in a knowledge base according to another embodiment of the present invention.
FIG. 4 is another example of a resource item query graph according to an embodiment of the present invention.
FIG. 5 is a flowchart of a method for determining weighting formula weights according to an embodiment of the present invention.
FIG. 6 is a block diagram of a device for question parsing according to an embodiment of the present invention.
FIG. 7 is a block diagram of a device for question parsing according to another embodiment of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In a knowledge base question answering system, a natural language question needs to be converted into a formal query statement, for example, a Structured Query Language (SQL) statement or SPARQL. Generally, SPARQL is expressed in subject-property-object (SPO) triple format.
For example, the SPARQL corresponding to the natural language question "Which software has been developed by organization founded in California, USA?" is:
?url_answer rdf:type dbo:Software
?url_answer db:developer ?x1
?x1 rdf:type dbo:Company
?x1 dbo:foundationPlace dbr:California。
Converting a natural language question into a formal query statement depends on conversion rules specific to the knowledge base; that is, different knowledge bases correspond to different conversion rules. In current question answering systems, however, the conversion rules of each knowledge base must be configured manually: for a given knowledge base, questions are collected manually, their answers are determined, and rules are manually generalized from these questions as the conversion rules. In other words, manually configured conversion rules have no domain extensibility, and the conversion rules configured for one knowledge base cannot be used for another. Moreover, the large amount of ambiguity in natural language questions also causes manually configured conversion rules to lack robustness.
Natural Language Processing (NLP) is a tool in computer science, artificial intelligence, and linguistics for describing the relationship between machine language and natural language. NLP involves human-computer interaction. NLP tasks may include: automatic summarization, coreference resolution, discourse analysis, machine translation, morphological segmentation, named entity recognition (NER), natural language generation, natural language understanding, optical character recognition (OCR), part-of-speech tagging, parsing, question answering, relationship extraction, sentence breaking, sentiment analysis, speech recognition, speech segmentation, topic segmentation and recognition, word segmentation, word sense disambiguation, information retrieval (IR), information extraction (IE), speech processing, and so on.
Specifically, the Stanford Natural Language Processing (NLP) tools are designed for the different NLP tasks above, and the embodiments of the present invention use the Stanford NLP tools. For example, the part-of-speech tagging tool can be used to determine the part of speech of each word in a question.
Uncertainty inference broadly refers to all inference problems other than exact inference, including inference with incomplete or imprecise knowledge, inference with fuzzy knowledge, non-monotonic inference, and so on.
The process of uncertainty inference is essentially a thought process that starts from uncertain initial evidence and, by applying uncertain knowledge, finally derives a conclusion that carries some uncertainty but is reasonable or basically reasonable.
The types of uncertainty inference include numerical and non-numerical methods, where the numerical methods include probability-based methods. Specifically, probability-based methods are developed from theories of probability, such as the certainty-factor method, the subjective Bayes method, and evidence theory.
Among these, the Markov logic network is one of the more commonly used uncertainty inference networks.
A Markov logic network (MLN) is a statistical relational learning framework that combines first-order logic (FOL) and Markov networks. A Markov logic network differs from traditional first-order logic in that traditional first-order logic requires that no conflicts exist among the rules: if a proposition cannot satisfy all rules simultaneously, it is false. In a Markov logic network, each rule has a weight, and a proposition is true with a probability.
First-order logic (FOL), also called predicate logic or first-order predicate logic, consists of a number of first-order predicate rules. First-order predicate rules are composed of four types of symbols: constants, variables, functions, and predicates. A constant denotes a simple object in the domain; a variable can denote a number of objects in the domain; a function represents a mapping from a group of objects to one object; a predicate denotes a relationship among objects in the domain, or an attribute of an object. Variables and constants may have types; a variable of a given type can take values only from the set of objects that defines the type. A term is any expression representing an object. An atom is a predicate applied to a group of terms. A ground term is a term without variables. A ground atom or ground predicate is an atom or predicate all of whose arguments are ground terms. Generally, rules are built recursively from atoms using logical connectors (such as implication and equivalence) and quantifiers (such as the universal and existential quantifiers). In first-order logic, rules are usually expressed in clause form. A possible world is an assignment of truth values to all possible ground atoms. First-order logic can be viewed as establishing a set of hard rules over a set of possible worlds: if a world violates any one of the rules, the probability of that world's existence is zero.
The basic idea of an MLN is to relax those hard rules: when a world violates one of the rules, the possibility of the world's existence decreases, but does not become zero. The fewer rules a world violates, the more likely the world is to exist. To this end, each rule is given a specific weight, which reflects the constraining force on the possible worlds that satisfy the rule: the larger the weight of a rule, the greater the difference between two worlds that satisfy and do not satisfy the rule.
In this way, by designing different first-order logic formulas (higher-order rule templates), a Markov logic network can combine linguistic features and knowledge base constraints well. The logic formulas in this probabilistic framework can model soft rule constraints. A set of weighted formulas in Markov logic is called a Markov logic network.
Specifically, an MLN may include first-order formulas and penalties; a ground atom may violate a corresponding first-order formula at some penalty.
A first-order formula includes first-order predicates, logical connectors, and variables.
FIG. 1 is a flowchart of a method for question parsing in a knowledge base according to an embodiment of the present invention. The method shown in FIG. 1 includes:
101. Receive a question input by a user.
102. Perform phrase detection on the question to determine first candidate phrases.
103. Map the first candidate phrases to first resource items in the knowledge base, where the first resource items have semantics consistent with the first candidate phrases.
104. Determine, according to the first candidate phrases and the first resource items, the values of observed predicates and a possible question parse space, where the observed predicates are used to represent features of the first candidate phrases, features of the first resource items, and relationships between the first candidate phrases and the first resource items; a point in the possible question parse space is a proposition set, and the truth of a proposition in the proposition set is characterized by the value of an implicit predicate.
105. For each proposition set in the possible question parse space, perform uncertainty inference according to the values of the observed predicates and the values of the implicit predicates, and calculate the confidence of each proposition set.
106. Obtain the combination of true propositions in a proposition set whose confidence satisfies a preset condition, where the true propositions are used to represent the search phrases selected from the first candidate phrases, the search resource items selected from the first resource items, and the features of the search resource items.
107. Generate a formal query statement according to the combination of true propositions.
This embodiment of the present invention uses observed predicates and implicit predicates to perform uncertainty inference and can convert a natural language question into a formal query statement. Moreover, in this embodiment, the uncertainty inference method can be applied to knowledge bases in any domain and is domain-extensible, so that no conversion rules need to be manually configured for a knowledge base.
It can be understood that, in this embodiment, the question input by the user in 101 is a natural language question.
For example, the natural language question is "Give me all actors who were born in Berlin.".
Further, in 102, phrase detection can identify token sequences in the question. Optionally, a word sequence in the question can be used as a first candidate phrase. A word sequence, also called a multi-word sequence or n-gram, is a sequence of n consecutive words.
It can be understood that multiple first candidate phrases may be determined in 102.
Optionally, in 102, a word sequence satisfying the following constraints may be used as a first candidate phrase (a minimal code sketch of these filters follows this list):
(1) All consecutive non-stop words in the word sequence begin with a capital letter; or, if not all consecutive non-stop words in the word sequence begin with a capital letter, the length of the word sequence is less than four.
(2) The part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb.
(3) The words included in the word sequence are not all stop words.
In addition, all consecutive non-stop words beginning with a capital letter must be in the same word sequence.
It can be understood that, in this embodiment, the head word may also be called the main word or dominant word, and the notation for parts of speech can be obtained from a part-of-speech tag set.
For example, in "United States Court of Appeals for the District of Columbia Circuit", all consecutive non-stop words begin with a capital letter, so it is one candidate phrase. It can be understood that word sequences in which all consecutive non-stop words begin with a capital letter are generally proper nouns.
The length of a word sequence is the number of words it includes. For example, the length of the word sequence "born in" is 2.
The Stanford part-of-speech tagging tool can be used to determine the part of speech of each word.
For example, English stop words include "a", "an", "the", "that", and so on; Chinese stop words include "一个", "一些", "不但", and so on.
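As a minimal illustration of the three filters above (a sketch only: the stop-word list is a toy one, Penn-Treebank-style tags such as NN or VBD are assumed from the Stanford tagger, and the last word of the sequence is crudely taken as the head word), candidate phrase detection could look like:

STOP_WORDS = {"a", "an", "the", "that", "me", "all"}
ALLOWED_HEAD_POS = {"jj", "nn", "rb", "vb"}

def is_candidate(tokens, pos_tags, head_index):
    # (3) the words must not all be stop words
    non_stop = [t for t in tokens if t.lower() not in STOP_WORDS]
    if not non_stop:
        return False
    # (1) all consecutive non-stop words capitalized, or length < four
    capitalized = all(t[0].isupper() for t in non_stop)
    if not capitalized and len(tokens) >= 4:
        return False
    # (2) head word POS must be jj, nn, rb, or vb
    return pos_tags[head_index][:2].lower() in ALLOWED_HEAD_POS

def candidate_phrases(tokens, pos_tags, max_len=6):
    # enumerate n-grams and keep those passing the filters
    found = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(len(tokens), i + max_len) + 1):
            if is_candidate(tokens[i:j], pos_tags[i:j], j - i - 1):
                found.append((" ".join(tokens[i:j]), i, j - 1))
    return found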
For example, in the question "Give me all actors who were born in Berlin", the determined first candidate phrases include: actors, who, born in, in, Berlin.
Specifically, this can be expressed in the form of Table 1, where the first column of Table 1 is the phrase identifier of the first candidate phrase.
Table 1
11 actors
12 who
13 born in
14 in
15 Berlin
In this embodiment, 103 can be understood as mapping each first candidate phrase to first resource items in the knowledge base; 103 may also be called phrase mapping. Specifically, one first candidate phrase may map to multiple first resource items, and the type of a first resource item may be entity (Entity), class (Class), or relation (Relation).
For example, assume the knowledge base is DBpedia. 103 is specifically as follows.
Mapping first candidate phrases to entities (Entity): considering that the entities in DBpedia come from entity pages in Wikipedia, first collect the anchor texts, redirect pages, and disambiguation pages in Wikipedia, and use them to build a correspondence dictionary between candidate phrases and entities. When a first candidate phrase matches a mention phrase of an entity, that entity is a first resource item semantically consistent with the first candidate phrase.
Mapping first candidate phrases to classes (Class): considering lexical variants, especially synonyms (for example, the phrases film, movie, and show can all map to the class dbo:Film), first use the word2vec tool to convert all the words in the first candidate phrase into vector form; the vector form of a class in the knowledge base is the vector form of its label (corresponding to the rdfs:label relation). Then compute the cosine similarity between the first candidate phrase and each class on the vectors, and finally take the N classes with the largest cosine similarity values as first resource items semantically consistent with the first candidate phrase.
The word2vec tool converts words into vectors. For example, it may be the open code developed and provided by Google; for details, see: http://code.google.com/p/word2vec/.
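A minimal sketch of this class-mapping step (assuming pre-trained word2vec vectors in a dict-like word_vectors and class labels in class_labels; averaging the word vectors is one simple reading of "convert all the words into vector form", not necessarily the exact choice made here):

import numpy as np

def phrase_vector(phrase, word_vectors):
    # average the vectors of the words in the phrase; skip unseen words
    vecs = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def top_n_classes(phrase, class_labels, word_vectors, n=5):
    # rank classes (e.g. "dbo:Film" -> its rdfs:label text) by cosine similarity
    pv = phrase_vector(phrase, word_vectors)
    if pv is None:
        return []
    scored = []
    for cls, label in class_labels.items():
        cv = phrase_vector(label, word_vectors)
        if cv is not None:
            cos = float(pv @ cv / (np.linalg.norm(pv) * np.linalg.norm(cv)))
            scored.append((cls, cos))
    return sorted(scored, key=lambda x: -x[1])[:n]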
Mapping first candidate phrases to relations (Relation): the relation patterns defined by PATTY and ReVerb are used as resources. First compute the instance-level alignment between relations in DBpedia and the relation patterns defined by PATTY and ReVerb, that is, count the instance pairs in DBpedia that satisfy a relation pattern. Then, if a first candidate phrase can match a relation pattern, the relations satisfying the relation pattern are used as first resource items semantically consistent with the first candidate phrase.
For the relation patterns defined by PATTY and ReVerb, see "PATTY: a taxonomy of relational patterns with semantic types" published by Nakashole et al. at EMNLP 2012, and "Identifying relations for open information extraction" published by Fader et al. at EMNLP 2011.
In this way, through 103, the first candidate phrases can be mapped to first resource items; specifically, each first candidate phrase maps to at least one first resource item, and a first candidate phrase and a first resource item that have a mapping relationship have consistent semantics.
If one first candidate phrase maps to multiple first resource items, that first candidate phrase is ambiguous.
For example, for the question "Give me all actors who were born in Berlin", it can be determined in 103 that the first candidate phrases actors, who, born in, in, Berlin map to the first resource items shown in Table 2, where the first column of Table 2 is the first candidate phrase, the second column is the first resource item, and the third column is the identifier of the first resource item. The first candidate phrase "in" maps to five first resource items.
Table 2
actors dbo:Actor 21
who dbo:Person 22
born in dbo:birthPlace 23
in dbo:headquarter 24
in dbo:league 25
in dbo:location 26
in dbo:ground 27
in dbo:locationCity 28
Berlin dbr:Berlin 29
In this embodiment, 104 can be understood as a feature extraction process.
Specifically, this embodiment defines implicit (hidden) predicates. The implicit predicates may include the following forms:
hasphrase(p), indicating that candidate phrase p is selected.
hasResource(p,r), indicating that resource item r is selected and candidate phrase p maps to resource item r.
hasRelation(p,r,rr), indicating that the parameter matching relationship rr between resource item p and resource item r is selected.
It can be understood that p can be the phrase identifier of a candidate phrase, and p and r can be identifiers of resource items. The parameter matching relationship rr can be one of: 1_1, 1_2, 2_1, and 2_2.
Specifically, in this embodiment, the parameter matching relationship rr can be one of 1_1, 1_2, 2_1, and 2_2. A parameter matching relationship m1_m2 between resource item p and resource item r indicates that the m1-th argument of resource item p aligns with the m2-th argument of resource item r, where m1 is 1 or 2 and m2 is 1 or 2.
Table 3 gives concrete examples of the above parameter matching relationships; the third column of Table 3 gives a question that illustrates the parameter matching relationship in the second column.
Table 3
(Table 3 is rendered as an image in the original document; it lists example resource-item pairs for each of the relationships 1_1, 1_2, 2_1, and 2_2.)
For example, "dbo:height 1_1 dbr:Michael Jordan" indicates that the parameter matching relationship between the resource item dbo:height and the resource item dbr:Michael Jordan is 1_1; that is, the first argument of dbo:height aligns with the first argument of dbr:Michael Jordan.
It can be understood that an implicit predicate value of 1 indicates that the corresponding candidate phrase, resource item, or parameter matching relationship between resource items is selected; a value of 0 indicates that it is not selected. In other words, an implicit predicate value of 1 indicates that the corresponding proposition is true, and a value of 0 indicates that it is false.
For example, with reference to Table 1, hasphrase(11)=1 indicates that the proposition "the candidate phrase actors is selected" is true, and hasphrase(11)=0 indicates that the proposition "the candidate phrase actors is selected" is false.
In this way, for the first candidate phrases and first resource items determined in 102 and 103, a possible question parse space can be constructed based on the implicit predicates. Specifically, a point in the possible question parse space represents a proposition set. A proposition set includes a group of propositions, and this group of propositions is represented by the values of a group of implicit predicates. It can be understood that the truth of the propositions in a proposition set is characterized by the values of the corresponding implicit predicates.
Specifically, this embodiment also defines observed predicates to represent the features of the first candidate phrases, the features of the first resource items, and the relationships between the first candidate phrases and the first resource items.
The features of a first candidate phrase include the position of the first candidate phrase in the question, the part of speech of the head word of the first candidate phrase, the labels on the dependency paths between pairs of first candidate phrases, and so on.
The features of a first resource item include the type of the first resource item, the relatedness values between pairs of first resource items, the parameter matching relationships between pairs of first resource items, and so on.
The relationship between a first candidate phrase and a first resource item includes the prior matching score of the first candidate phrase and the first resource item.
It can thus be understood that determining the values of the observed predicates in 104 includes: determining the positions of the first candidate phrases in the question; determining the parts of speech of the head words of the first candidate phrases by using the Stanford part-of-speech tagging tool; determining the labels on the dependency paths between pairs of first candidate phrases by using the Stanford dependency parsing tool; determining the types of the first resource items from the knowledge base, where a type is entity, class, or relation; determining the parameter matching relationships between pairs of first resource items from the knowledge base, where a parameter matching relationship is one of 1_1, 1_2, 2_1, and 2_2; using the similarity coefficient between two first resource items as the relatedness value between the two first resource items; and calculating the prior matching scores between the first candidate phrases and the first resource items, where a prior matching score represents the probability that a first candidate phrase maps to a first resource item.
Specifically, determining the parameter matching relationships between pairs of first resource items from the knowledge base includes: determining, from the knowledge base, a parameter matching relationship m1_m2 between a first resource item r1 and a first resource item r2, indicating that the m1-th argument of the first resource item r1 aligns with the m2-th argument of the first resource item r2, where the first resource items include the first resource item r1 and the first resource item r2, m1 is 1 or 2, and m2 is 1 or 2.
Specifically, the observed predicates may include the following forms:
phraseIndex(p,i,j): the start position i and end position j of candidate phrase p in the question.
phrasePosTag(p,pt): the part of speech pt of the head word of candidate phrase p.
Specifically, the Stanford part-of-speech tagging tool can be used to determine the part of speech of the head word.
phraseDepTag(p,q,dt): the label dt on the dependency path between candidate phrase p and candidate phrase q.
Specifically, the Stanford dependency parser can be used to build the dependency parse trees of the question, and features are extracted from the dependency parse trees to determine the labels on the dependency path between two candidate phrases.
For example, the dependency parse tree of the question "Give me all actors who were born in Berlin." is shown in FIG. 2.
phraseDepOne(p,q): true when there is exactly one label on the dependency path between candidate phrase p and candidate phrase q, and false otherwise.
It can be understood that the predicate phraseDepOne(p,q) among the observed predicates includes only the predicates whose result is true.
hasMeanWord(p,q): false when the words on the dependency path between candidate phrase p and candidate phrase q are all stop words or all have part of speech dt, in, wdt, to, cc, ex, pos, or wp; true otherwise.
Here dt is a determiner, in is the preposition in, wdt is a wh-determiner, to is the preposition to, cc is a conjunction, ex is the existential there, pos is a possessive ending, and wp is a wh-pronoun. Wh-words include what, which, and so on; conjunctions include and, but, or, and so on. The notation for these parts of speech can be obtained from a part-of-speech tag set.
It can be understood that the predicate hasMeanWord(p,q) among the observed predicates includes only the predicates whose result is true.
resourceType(r,rt): the type of resource item r is rt, where rt is E, C, or R; E denotes entity (Entity), C denotes class (Class), and R denotes relation (Relation).
priorMatchScore(p,r,s): the prior matching score s between candidate phrase p and resource item r.
For example, assume the knowledge base is DBpedia.
Specifically, if the type of resource item r is E, first collect the anchor texts, redirect pages, and disambiguation pages in Wikipedia; when candidate phrase p matches a mention phrase of resource item r, the corresponding frequency can be used as the prior matching score, where the corresponding frequency is the number of times candidate phrase p links to resource item r divided by the total number of times candidate phrase p links out.
Specifically, if the type of resource item r is C, the prior matching score of candidate phrase p and resource item r can be γ·s1+(1-γ)·s2, where γ is any value between 0 and 1, for example γ=0.6; s1 is the Levenshtein distance between the label of resource item r and candidate phrase p, and s2 is the cosine similarity between the vector of candidate phrase p and the vector of resource item r. For the Levenshtein distance, see "A guided tour to approximate string matching" published by Navarro in ACM Comput. Surv., 2001; for the computation of s2, see "Recurrent neural network based language model" published by Mikolov et al. at INTERSPEECH 2010.
Specifically, if the type of resource item r is R, the prior matching score of candidate phrase p and resource item r can be α·s1+β·s2+(1-α-β)·s3, where α and β are any values between 0 and 1 with α+β<1, for example α=0.3, β=0.3; s1 is the Levenshtein distance between the label of resource item r and candidate phrase p, s2 is the cosine similarity between the vector of candidate phrase p and the vector of resource item r, and s3 is the Jaccard coefficient of the matching sets of resource item r and the relation patterns. The relation patterns are those defined by PATTY and ReVerb as described above; for the computation of s3, see "Natural language questions for the web of data" published by Yahya et al. at EMNLP 2012.
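As a small illustration of the class-type score γ·s1+(1-γ)·s2 (a sketch only: the text calls s1 a Levenshtein distance, so a normalization into [0, 1] is assumed here to make the two terms commensurable; the cosine similarity s2 is taken as precomputed):

def levenshtein(a, b):
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def class_prior_score(phrase, label, cosine_sim, gamma=0.6):
    # gamma * s1 + (1 - gamma) * s2, with gamma = 0.6 as in the example
    dist = levenshtein(phrase.lower(), label.lower())
    s1 = 1.0 - dist / max(len(phrase), len(label), 1)   # assumed normalization
    return gamma * s1 + (1.0 - gamma) * cosine_sim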
hasRelatedness(p,q,s): the relatedness value s between resource item p and resource item q, where s lies in the interval 0 to 1. Specifically, the relatedness value s can be the similarity coefficient of resource item p and resource item q; this similarity coefficient may also be called the Jaccard similarity coefficient, Jaccard coefficient, or similarity evaluation coefficient.
For example, referring to "Natural language questions for the web of data" published by Yahya et al. at EMNLP 2012, the similarity coefficient of resource item p and resource item q can be equal to the Jaccard coefficient of the in-degree sets of resource item p and resource item q.
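The relatedness value can then be computed directly as a Jaccard coefficient, for example on the in-degree (incoming-link) sets of the two resource items:

def jaccard(set_a, set_b):
    # |A intersect B| / |A union B|, defined as 0 for two empty sets
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

relatedness = jaccard({"dbr:A", "dbr:B"}, {"dbr:B", "dbr:C"})   # = 1/3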
isTypeCompatible(p,q,rr): the parameter matching relationship rr between resource item p and resource item q.
Specifically, in this embodiment, the parameter matching relationship rr can be one of 1_1, 1_2, 2_1, and 2_2, as described above; to avoid repetition, details are not described again here.
hasQueryResult(p,q,o,rr1,rr2): the parameter matching relationships among resource item p, resource item q, and resource item o; specifically, resource item p and resource item q have parameter matching relationship rr1, and resource item q and resource item o have parameter matching relationship rr2.
It can be understood that, among the observed predicates described above, phraseIndex(p,i,j), phrasePosTag(p,pt), phraseDepTag(p,q,dt), phraseDepOne(p,q), and hasMeanWord(p,q) represent the features of the candidate phrases; resourceType(r,rt), hasRelatedness(p,q,s), isTypeCompatible(p,q,rr), and hasQueryResult(p,q,o,rr1,rr2) represent the features of the resource items; and priorMatchScore(p,r,s) represents the relationship between a candidate phrase and a resource item.
Here p and q can be phrase identifiers of candidate phrases, and p, q, r, and o can be identifiers of resource items.
In this way, based on the first candidate phrases and first resource items determined in 102 and 103, the values of the corresponding observed predicates can be determined.
For example, for the question "Give me all actors who were born in Berlin", on the basis of Table 1 and Table 2, the values of the observed predicates can be calculated in 104. Specifically, the expressions in which the value of the observed predicate is 1 include:
phraseIndex(11,3,3)
phraseIndex(12,4,4)
phraseIndex(13,6,7)
phraseIndex(14,7,7)
phraseIndex(15,8,8)
phrasePosTag(11,nn)
phrasePosTag(12,wp)
phrasePosTag(13,vb)
phrasePosTag(14,in)
phrasePosTag(15,nn)
phraseDepTag(11,13,rcmod)
phraseDepTag(12,13,nsubjpass)
phraseDepTag(12,14,nsubjpass)
phraseDepTag(13,15,pobj)
phraseDepTag(14,15,pobj)
phraseDepOne(11,13)
phraseDepOne(12,13)
phraseDepOne(12,14)
phraseDepOne(13,15)
phraseDepOne(14,15)
hasMeanWord(12,14)
resourceType(21,E)
resourceType(22,E)
resourceType(23,R)
resourceType(24,R)
resourceType(25,R)
resourceType(26,R)
resourceType(27,R)
resourceType(28,R)
resourceType(29,E)
priorMatchScore(11,21,1.000000)
priorMatchScore(12,22,1.000000)
priorMatchScore(13,23,1.000000)
priorMatchScore(14,24,1.000000)
priorMatchScore(14,25,1.000000)
priorMatchScore(14,26,1.000000)
priorMatchScore(14,27,1.000000)
priorMatchScore(14,28,1.000000)
priorMatchScore(15,29,1.000000)
hasRelatedness(21,23,1.000000)
hasRelatedness(22,23,1.000000)
hasRelatedness(22,24,0.440524)
hasRelatedness(22,25,0.425840)
hasRelatedness(22,26,0.226393)
hasRelatedness(22,27,0.263207)
hasRelatedness(23,29,0.854583)
hasRelatedness(24,29,0.816012)
hasRelatedness(26,29,0.532818)
hasRelatedness(27,29,0.569732)
hasRelatedness(28,29,0.713400)
isTypeCompatible(21,23,1_1)
isTypeCompatible(22,23,1_1)
isTypeCompatible(22,23,1_2)
isTypeCompatible(22,24,1_2)
isTypeCompatible(22,25,1_1)
isTypeCompatible(22,26,1_1)
isTypeCompatible(22,26,1_2)
isTypeCompatible(22,27,1_2)
isTypeCompatible(23,29,2_1)
isTypeCompatible(24,29,2_1)
isTypeCompatible(26,29,2_1)
isTypeCompatible(27,29,2_1)
isTypeCompatible(28,29,2_1)
hasQueryResult(21,23,29,1_1,2_1)
hasQueryResult(22,23,29,1_1,2_1)
hasQueryResult(22,26,29,1_1,2_1)
It can be understood that an observed predicate value of 1 indicates that the corresponding proposition is true.
For example, the value of phraseIndex(11,3,3) is 1, indicating that the proposition "the start position i and end position j of the first candidate phrase actors in the question are both 3" is true, where 11 is the phrase identifier of the candidate phrase "actors", as shown in Table 1.
The value of phrasePosTag(13,vb) is 1, indicating that the proposition "the head word of the first candidate phrase born in is born, with part of speech vb" is true, where 13 is the phrase identifier of the candidate phrase "born in", as shown in Table 1.
The value of phraseDepTag(13,15,pobj) is 1, indicating that the proposition "the label on the dependency path between the first candidate phrase born in and the first candidate phrase Berlin is pobj" is true, where 13 is the phrase identifier of the candidate phrase "born in" and 15 is the phrase identifier of the candidate phrase "Berlin", as shown in Table 1.
The meanings of the other expressions above in which the observed predicate value is 1 can be understood by reference to the above explanations; to avoid repetition, they are not described again here.
It can be understood that there are also expressions in which the observed predicate value is 0; to save space, they are not listed here.
Optionally, in this embodiment, the predicate resource can also be used to represent the identifier of a resource item.
For example, with reference to Table 2, the values of the following predicates are 1:
resource(21,dbo:Actor)
resource(22,dbo:Person)
resource(23,dbo:birthPlace)
resource(24,dbo:headquarter)
resource(25,dbo:league)
resource(26,dbo:location)
resource(27,dbo:ground)
resource(28,dbo:locationCity)
resource(29,dbr:Berlin)
It can be understood that, in this embodiment, the first candidate phrases and first resource items determined in 102 and 103 are ambiguous. This embodiment eliminates the ambiguity of the first candidate phrases and the first resource items through uncertainty inference.
Uncertainty inference makes inferences and decisions from uncertain information. An uncertainty inference network can handle incomplete and noisy data sets, describing the correlations among data with probability-measure weights, aiming to resolve the inconsistency and uncertainty of the data.
In this embodiment, the model used for the uncertainty inference in 105 can be any of the following: Bayesian networks, probabilistic relational models, Bayesian logic programs, relational Markov networks, Markov logic networks, or probabilistic soft logic. The present invention is not limited in this respect.
Optionally, in this embodiment, the uncertainty inference in 105 is based on a Markov logic network (MLN), where the MLN includes predefined first-order formulas and the weights of the first-order formulas; that is, the model used for uncertainty inference is an MLN.
Optionally, in this embodiment, the first-order formulas may include Boolean formulas and weighting formulas. The weight of a Boolean formula is +∞; Boolean formulas can be understood as first-order logic formulas in first-order logic, representing hard constraints (also called hard formulas, hf), which are conditions that all ground atoms must satisfy. The weight of a weighting formula is a weighting formula weight; weighting formulas are soft constraints (also called soft formulas, sf), which ground atoms may violate at some penalty.
A first-order formula is composed of first-order predicates, logical connectors, and variables, where the first-order predicates can include the observed predicates and implicit predicates described above.
It should be noted that, in this embodiment, the MLN may also include second-order formulas, first-order formulas, weights of the second-order formulas, and weights of the first-order formulas; or the MLN may include formulas of even higher order and their weights. The present invention is not limited in this respect.
Specifically, the Boolean formulas are shown in Table 4, where the symbol "_" denotes an arbitrary constant of a logical variable and |·| denotes the number of ground atoms that are true in a formula.
Table 4
(Table 4 is rendered as an image in the original document; it lists the Boolean formulas hf1 to hf13, whose meanings are explained below.)
Specifically, the meanings in Table 4 are as follows:
hf1: if a phrase p is selected, the phrase p maps to at least one resource item.
hf2: if a mapping from a phrase p to a resource item is selected, the phrase p must be selected.
hf3: a phrase p can map to only one resource item.
hf4: if a phrase p is not selected, no mapping from the phrase p to a resource item is selected.
hf5: if a mapping from a phrase to a resource item r is selected, the resource item r has a relationship with at least one other resource item.
hf6: two resource items r1 and r2 can have only one parameter matching relationship.
hf7: if two resource items r1 and r2 have a parameter matching relationship, at least one phrase-to-r1 mapping is selected and at least one phrase-to-r2 mapping is selected.
hf8: no two selected phrases overlap, where overlap can be characterized by positions in the question.
hf9, hf10, hf11, hf12: if the type of a resource item r is entity or class, the resource item r cannot have a second argument aligned with another resource item.
hf13: the parameter matching relationships between two resource items r1 and r2 must be consistent.
It can be understood that, in Table 4, the logical connector "∧" denotes and, "∨" denotes or, and "!" denotes not.
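Since Table 4 survives only as an image, the following is one plausible first-order rendering of hf1 to hf4, reconstructed from the prose descriptions above (the notation is an assumption, not copied from the patent):

$$\mathrm{hf1}:\ hasPhrase(p) \Rightarrow \exists r\ hasResource(p, r)$$
$$\mathrm{hf2}:\ hasResource(p, r) \Rightarrow hasPhrase(p)$$
$$\mathrm{hf3}:\ \lvert\{\, r : hasResource(p, r) \,\}\rvert \le 1$$
$$\mathrm{hf4}:\ \lnot hasPhrase(p) \Rightarrow \lnot hasResource(p, r)$$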
Specifically, the weighting formulas are shown in Table 5, where the symbol "+" indicates that each constant of the logical variable should be given its own weight.
Table 5
(Table 5 is rendered as an image in the original document; it lists the weighting formulas sf1 to sf8, whose meanings are explained below.)
Specifically, the meanings in Table 5 are as follows:
sf1, sf2: the larger the prior matching score s of a phrase p mapping to a resource item r, the larger the probability that the phrase p and the resource item r are selected.
sf3: the part of speech of the head word of a phrase p has some association with the type of the resource item r to which the phrase p maps.
sf4, sf5, sf6: the labels on the dependency path between two phrases p1 and p2 have some association with the parameter matching relationship between two resource items r1 and r2, where phrase p1 maps to resource item r1 and phrase p2 maps to resource item r2.
sf7: the larger the relatedness value between two resource items r1 and r2, the larger the possibility that the two resource items r1 and r2 have a parameter matching relationship.
sf8: if a triple of resource items has a query result, the three resource items should have corresponding parameter matching relationships.
It should be noted that, in this embodiment, the weighting formula weights may be set manually, for example, as empirical values preset by the administrator or experts of the knowledge base.
In this embodiment, the weighting formula weights may also be obtained through training by a learning method.
It can be understood that the weighting formula weights are generally different for different knowledge bases. In this embodiment, the Boolean formulas shown in Table 4 can be understood as universal rules satisfied by all knowledge bases, and the weighting formulas shown in Table 5 as specific rules whose weighting formula weights differ across knowledge bases.
In this embodiment, the Boolean formulas and weighting formulas may also be collectively called "meta rules"; that is, "meta rules" are rules applicable to knowledge bases in different domains.
In this embodiment, 105 may also be called inference, joint inference, or joint disambiguation. Specifically, the thebeast tool can be used for joint inference. Optionally, for each proposition set in the question parse space, the confidence of each proposition set can be calculated according to the values of the observed predicates and the values of the implicit predicates by using a cutting plane method (cutting plane approach). For the thebeast tool, see: https://code.google.com/p/thebeast/
It can be understood that confidence may also be called credibility, and the confidence of each proposition set can be calculated by maximum likelihood estimation of an undirected graphical model.
Optionally, the MLN is denoted M, a first-order formula is denoted φ_i, the weight of the first-order formula is denoted w_i, and a proposition set is denoted y. Then 105 can be: calculating the confidence of each proposition set according to
$$p(y) = \frac{1}{Z}\exp\Bigl(\sum_i w_i \sum_{c \in G^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
where Z is a normalization constant, $G^{\phi_i}$ is the set of ground sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in the set of sub-formulas of $G^{\phi_i}$, and $f_c^{\phi_i}$ is a binary feature function representing the truth of the first-order formula under the proposition set y.
The binary feature function $f_c^{\phi_i}$ takes the value 1 or 0: under the proposition set y, $f_c^{\phi_i}$ is 1 when the sub-formula c is true, and 0 otherwise.
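A minimal sketch of this confidence computation (hypothetical interfaces: each formula object is assumed to expose ground(), returning its ground sub-formulas, each with holds(y) playing the role of the binary function above; the normalization constant Z is passed in rather than computed):

import math

def confidence(y, formulas, weights, Z=1.0):
    # p(y) = (1/Z) * exp( sum_i w_i * sum_{c in G(phi_i)} f_c(y) )
    total = 0.0
    for phi, w in zip(formulas, weights):
        groundings = phi.ground()
        if math.isinf(w):
            # Boolean (hard) formula with weight +inf: a single violated
            # grounding rules this proposition set out entirely
            if any(not c.holds(y) for c in groundings):
                return 0.0
        else:
            total += w * sum(c.holds(y) for c in groundings)
    return math.exp(total) / Z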
Optionally, a maximum number of iterations can be set in 105, for example, 100.
In this way, after the confidence of each proposition set is calculated in 105, a confidence set corresponding to the possible question parse space can be obtained, and each confidence in the confidence set corresponds to one proposition set.
Further, in 106, one or several proposition sets can be selected from the multiple proposition sets of the possible question parse space, where the confidences of the selected proposition sets satisfy a preset condition.
Optionally, in 106, the proposition set with the largest confidence value can be determined, and the combination of true propositions in that proposition set obtained.
Alternatively, in 106, the several proposition sets with the largest confidence values can be determined, and the combinations of true propositions in those proposition sets obtained. The present invention is not limited in this respect.
Since the truth of the propositions in a proposition set is characterized by the values of the implicit predicates, it can be understood that obtaining the combination of true propositions in 106 means obtaining the combination in which the implicit predicate values are 1, and the true propositions represent the search phrases selected from the first candidate phrases, the search resource items selected from the first resource items, and the features of the search resource items.
For example, for the question "Give me all actors who were born in Berlin.", the determined expressions in which the implicit predicate value is 1 are as follows:
hasphrase(11)
hasphrase(13)
hasphrase(15)
hasResource(11,21)
hasResource(13,23)
hasResource(15,29)
hasRelation(21,23,1_1)
hasRelation(23,29,2_1)
Further, the formal query statement can be generated in 107. Optionally, the formal query statement can be SQL; or, in this embodiment, the formal query statement can be SPARQL, in which case 107 may also be called the SPARQL generation process.
Optionally, 107 can be: generating the SPARQL by using a SPARQL template according to the combination of true propositions.
Specifically, the combination of true propositions can be used to construct the triples of the SPARQL, and the SPARQL is then generated by using a SPARQL template.
Specifically, natural language questions can be divided into three classes: Yes/No, Number, and Normal. Correspondingly, the SPARQL templates include an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template.
When the question is a Yes/No question, the SPARQL is generated by using the ASK WHERE template according to the combination of true propositions.
When the question is a Normal question, the SPARQL is generated by using the SELECT ?url WHERE template according to the combination of true propositions.
When the question is a Number question, the SPARQL is generated by using the SELECT ?url WHERE template according to the combination of true propositions; or, when the SPARQL generated by using the SELECT ?url WHERE template cannot obtain a numeric answer, the SPARQL is generated by using the SELECT COUNT(?url) WHERE template.
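A minimal sketch of this template selection (run_query is a hypothetical query executor used only for the numeric-answer fallback; the template strings follow the three templates named above):

ASK_TEMPLATE = "ASK WHERE {{ {patterns} }}"
SELECT_TEMPLATE = "SELECT ?url WHERE {{ {patterns} }}"
COUNT_TEMPLATE = "SELECT COUNT(?url) WHERE {{ {patterns} }}"

def generate_sparql(question_type, triple_patterns, run_query=None):
    body = " . ".join(triple_patterns)
    if question_type == "Yes/No":
        return ASK_TEMPLATE.format(patterns=body)
    query = SELECT_TEMPLATE.format(patterns=body)
    if question_type == "Number" and run_query is not None:
        # fall back to the COUNT template when no numeric answer comes back
        if not any(isinstance(a, (int, float)) for a in run_query(query)):
            return COUNT_TEMPLATE.format(patterns=body)
    return query

print(generate_sparql("Normal", ["?x rdf:type dbo:Actor", "?x dbo:birthPlace dbr:Berlin"]))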
For example, the question "Give me all actors who were born in Berlin." is a Normal question, and the generated SPARQL is:
SELECT ?url WHERE{
?x rdf:type dbo:Actor.
?x dbo:birthplace dbr:Berlin.
}
Optionally, 107 can include: generating a query resource graph according to the combination of true propositions, where the query resource graph includes vertices and edges; the vertices include the search phrases and the search resource items, in each vertex the search phrase maps to the search resource item of that vertex, and an edge represents the parameter matching relationship between the two search resource items of the two connected vertices; and further generating the SPARQL according to the query resource graph.
Specifically, three search resource items connected to each other in the query resource graph can be used as a triple of the SPARQL, where the type of the search resource item located in the middle of the three connected search resource items is relation.
In this way, in this embodiment, a natural language question can be converted into SPARQL, and the predefined first-order formulas used are domain-independent; that is, the predefined Boolean formulas and weighting formulas can be applied to all knowledge bases and are extensible. In other words, with the method provided by this embodiment of the present invention, no conversion rules need to be set manually.
For example, FIG. 3 shows an example of question parsing according to the present invention.
301. Receive a question input by a user. Assume the question is the natural language question "Which software has been developed by organization founded in California, USA?".
302. Perform phrase detection on the question input in 301 to determine first candidate phrases.
Specifically, for 302, refer to 102 in the foregoing embodiment; to avoid repetition, details are not described again here.
For example, the determined first candidate phrases include: software, developed, developed by, organizations, founded in, founded, California, USA.
303. Perform phrase mapping on the first candidate phrases determined in 302, to map the first candidate phrases to first resource items.
Specifically, for 303, refer to 103 in the foregoing embodiment; to avoid repetition, details are not described again here.
For example, the first candidate phrase software maps to dbo:Software, dbr:Software, and so on; the mappings are not listed here one by one.
304. Determine the values of the observed predicates through feature extraction, and construct the possible question parse space.
Specifically, for 304, refer to 104 in the foregoing embodiment; to avoid repetition, details are not described again here.
It should be noted that the values are not listed here one by one.
305. Calculate the confidence of each proposition set through joint inference, and obtain the combination of true propositions in a proposition set whose confidence satisfies the preset condition.
Specifically, for 305, refer to 105 and 106 in the foregoing embodiment; to avoid repetition, details are not described again here.
The combination of true propositions is the combination in which the implicit predicate values are 1.
For example, the determined expressions in which the implicit predicate value is 1 are:
hasPhrase(software),
hasPhrase(developed by),
hasPhrase(organizations),
hasPhrase(founded in),
hasPhrase(California);
hasResource(software,dbo:Software),
hasResource(developed by,dbo:developer),
hasResource(California,dbr:California),
hasResource(organizations,dbo:Company),
hasResource(founded in,dbo:foundationPlace);
hasRelation(dbo:Software,dbo:developer,1_1),
hasRelation(dbo:developer,dbo:Company,2_1),
hasRelation(dbo:Company,dbo:foundationPlace,1_1),
hasRelation(dbo:foundationPlace,dbr:California,2_1).
306: Generate the resource item query graph.
Specifically, the resource item query graph may also be called a semantic items query graph (Semantic Items Query Graph).
Specifically, a vertex of the resource item query graph may include: a search resource item, the type of the search resource item, and the position in the question of the search phrase that maps to the search resource item.
Specifically, an edge of the resource item query graph represents the argument-matching relation between the two search resource items of the two vertices that the edge connects.
It should be noted that the relations between search resource items in the resource item query graph are binary relations.
Optionally, a vertex of the resource item query graph may include: the search resource item, the type of the search resource item, the search phrase that maps to the search resource item, and the position of that search phrase in the question. FIG. 4 shows another example of a resource item query graph, comprising vertices 311 to 315.
Vertex 311 includes: the search resource item dbo:Software, the resource item type Class, the search phrase software, and the phrase position 11 in the question; the search phrase software maps to the search resource item dbo:Software.
Vertex 312 includes: the search resource item dbo:developer, the resource item type Relation, the search phrase developed by, and the phrase position 45 in the question; the search phrase developed by maps to the search resource item dbo:developer.
Vertex 313 includes: the search resource item dbo:Company, the resource item type Class, the search phrase organizations, and the phrase position 66 in the question; the search phrase organizations maps to the search resource item dbo:Company.
Vertex 314 includes: the search resource item dbo:foundationPlace, the resource item type Relation, the search phrase founded in, and the phrase position 78 in the question; the search phrase founded in maps to the search resource item dbo:foundationPlace.
Vertex 315 includes: the search resource item dbr:California, the resource item type Entity, the search phrase California, and the phrase position 99 in the question; the search phrase California maps to the search resource item dbr:California.
The edge 1_1 between vertex 311 and vertex 312 indicates that the argument-matching relation between the search resource items dbo:Software and dbo:developer is 1_1.
The edge 2_1 between vertex 312 and vertex 313 indicates that the argument-matching relation between the search resource items dbo:developer and dbo:Company is 2_1.
The edge 1_1 between vertex 313 and vertex 314 indicates that the argument-matching relation between the search resource items dbo:Company and dbo:foundationPlace is 1_1.
The edge 1_2 between vertex 315 and vertex 314 indicates that the argument-matching relation between the search resource items dbr:California and dbo:foundationPlace is 1_2.
307: SPARQL generation.
Specifically, the binary relations in the resource item query graph are converted into ternary relations.
That is, three mutually connected search resource items in the resource item query graph form a ternary relation, in which the search resource item in the middle of the three has type Relation.
For example, the natural-language question in 301 is a Normal question; using the SELECT ?url WHERE template, the generated SPARQL is:
SELECT ?url WHERE{
?url rdf:type dbo:Software .
?url dbo:developer ?x1 .
?x1 rdf:type dbo:Company .
?x1 dbo:foundationPlace dbr:California .
}
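The binary-to-ternary conversion of 306 and 307 can be sketched in Python as follows. This is an illustration only: the Vertex/Edge encoding, the ?url/?xN variable-naming convention, and the slot reading of the argument-matching labels (in i_j, the number on the Relation vertex's side selects subject slot 1 or object slot 2) are assumptions consistent with the example above:

from collections import namedtuple

Vertex = namedtuple("Vertex", "resource rtype phrase position")
Edge = namedtuple("Edge", "head tail match")

vertices = {
    311: Vertex("dbo:Software", "Class", "software", "11"),
    312: Vertex("dbo:developer", "Relation", "developed by", "45"),
    313: Vertex("dbo:Company", "Class", "organizations", "66"),
    314: Vertex("dbo:foundationPlace", "Relation", "founded in", "78"),
    315: Vertex("dbr:California", "Entity", "California", "99"),
}
edges = [Edge(311, 312, "1_1"), Edge(312, 313, "2_1"),
         Edge(313, 314, "1_1"), Edge(315, 314, "1_2")]

def graph_to_triples(vertices, edges):
    # Class vertices become typed variables, Entity vertices become IRIs;
    # each Relation vertex plus its two neighbours yields one triple.
    term, triples, n = {}, [], 0
    for vid, v in vertices.items():
        if v.rtype == "Entity":
            term[vid] = v.resource
        elif v.rtype == "Class":
            term[vid] = "?url" if n == 0 else "?x%d" % n
            n += 1
            triples.append("%s rdf:type %s" % (term[vid], v.resource))
    args = {}  # Relation vertex id -> {1: subject term, 2: object term}
    for e in edges:
        i, j = e.match.split("_")
        if vertices[e.head].rtype == "Relation":
            rel, other, slot = e.head, e.tail, int(i)
        else:
            rel, other, slot = e.tail, e.head, int(j)
        args.setdefault(rel, {})[slot] = term[other]
    for rel, a in args.items():
        triples.append("%s %s %s" % (a[1], vertices[rel].resource, a[2]))
    return triples

for t in graph_to_triples(vertices, edges):
    print(t, ".")

Running the sketch prints the four triple patterns of the query above (in a different order).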
In this way, in the embodiments of the present invention, a natural-language question can be converted into SPARQL. Moreover, the predefined first-order formulas used are domain-independent; that is, the predefined Boolean formulas and weighted formulas can be applied to any knowledge base, and are therefore extensible. In other words, with the method provided by the embodiments of the present invention, conversion rules need not be set manually.
Furthermore, it can be understood that in the embodiments of the present invention the predefined Boolean formulas and weighted formulas are language-independent, i.e., they extend across languages; for example, they can be used with an English knowledge base as well as with a Chinese knowledge base.
As described above, in the embodiments of the present invention, the uncertainty inference in 105 may be based on an MLN, where the MLN comprises the predefined first-order formulas and the weights of those formulas.
Optionally, the first-order formulas may comprise Boolean formulas and weighted formulas. The weight of a Boolean formula is +∞, and the weight of a weighted formula is the weighted-formula weight, which may be obtained through training by a learning method. It can then be understood that before 101, as shown in FIG. 5, the method may further include:
401: Acquire multiple natural-language questions from the knowledge base.
402: Perform phrase detection on the multiple natural-language questions to determine the second candidate phrases of the multiple natural-language questions.
403: Map the second candidate phrases to second resource items in the knowledge base, where the second resource items have semantics consistent with the second candidate phrases.
404: Determine, from the second candidate phrases and the second resource items, the values of the observed predicates corresponding to the multiple natural-language questions.
405: Acquire the manually annotated values of the hidden predicates corresponding to the multiple natural-language questions.
406: Construct an undirected graph from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determine the weights of the first-order formulas through training.
In this way, in the embodiments of the present invention, the weights of the first-order formulas for a given knowledge base can be determined by a learning method based on the predefined first-order formulas, and can serve as the conversion rules for that knowledge base. Conversion rules therefore need not be set manually, and the predefined first-order formulas of the Markov logic network MLN are extensible and can be applied to any knowledge base.
Specifically, the knowledge base of a question-answering system includes a question repository containing multiple natural-language questions. In 401, the multiple natural-language questions may thus be acquired from the question repository of the question-answering system's knowledge base. The embodiments of the present invention do not limit the number of natural-language questions; for example, there may be one thousand.
For example, 110 natural-language questions may be acquired from the training set of the question repository Q1 of Question Answering over Linked Data (QALD).
In the embodiments of the present invention, for the process of 402 refer to 102 in the foregoing embodiment, for 403 refer to 103, and for 404 refer to 104; to avoid repetition, the details are not repeated here. In this way, for the multiple natural-language questions of 401, the values of the observed predicates corresponding to each question can be determined.
It can be understood that before 405, the values of the hidden predicates corresponding to each of the multiple natural-language questions must be annotated manually; that is, the values of the hidden predicates acquired in 405 are hand-labeled.
Optionally, the first-order formulas comprise Boolean formulas and weighted formulas; the weight of a Boolean formula is +∞, and the weight of a weighted formula is the weighted-formula weight. The manually annotated hidden-predicate values in 405 then satisfy the Boolean formulas. Correspondingly, in 406, determining the weights of the first-order formulas through training means determining the weighted-formula weights through training. The undirected graph may comprise a Markov network (MN).
Optionally, in 406, the weights of the first-order formulas may be determined from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, by using the Margin Infused Relaxed Algorithm (MIRA).
Specifically, in 406 the thebeast tool may be used to learn the weighted-formula weights. During parameter learning, the weighted-formula weights may first be initialized to 0 and then updated using MIRA. Optionally, a maximum number of training iterations may also be set; for example, 10. (A hedged sketch of the MIRA update follows.)
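The MIRA update itself admits a compact sketch; the following Python is illustrative only (thebeast performs this internally), with hypothetical feature vectors standing in for the counts of true groundings of the weighted formulas:

import numpy as np

def mira_update(w, feat_gold, feat_pred, loss, clip=1.0):
    # Smallest weight change that scores the hand-labeled analysis above
    # the current best analysis by a margin of `loss` (step clipped at `clip`).
    delta = feat_gold - feat_pred
    denom = float(np.dot(delta, delta))
    if denom == 0.0:
        return w
    tau = (loss - float(np.dot(w, delta))) / denom
    tau = min(clip, max(0.0, tau))
    return w + tau * delta

w = np.zeros(3)                   # weighted-formula weights initialized to 0
gold = np.array([1.0, 0.0, 1.0])  # feature counts of the hand-labeled analysis
pred = np.array([0.0, 1.0, 1.0])  # feature counts of the current best analysis
for epoch in range(10):           # maximum of 10 training iterations
    w = mira_update(w, gold, pred, loss=1.0)
print(w)                          # weights move toward the gold analysis

Each pass raises the score of the hand-labeled proposition set relative to the model's current best guess, which is what pushes useful weighted formulas toward larger weights.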
For example, the weighted-formula weights of sf3 from Table 5 may be as shown in Table 6. As can be seen from Table 6, when the part of speech of the head word of a candidate phrase is nn, the candidate phrase is relatively likely to map to a resource item of type E.
Table 6 (weights of the weighted formula sf3; the original table image is not reproduced here)
In this way, through the embodiment shown in FIG. 5, the weighted-formula weights for any knowledge base can be determined, and thus the conversion rules for any knowledge base can be obtained.
It can be understood that, in the embodiments of the present invention, the method of determining the weights of the first-order formulas is data-driven and can be applied to different knowledge bases. It greatly reduces the manual effort required and can improve the efficiency of question parsing over a knowledge base.
It should be understood that, in the embodiments of the present invention, structure learning may also be performed on the constructed undirected graph to learn second-order or even higher-order formulas; a new undirected graph may then be constructed from the learned second-order or higher-order formulas, and the weights corresponding to those formulas learned. The present invention does not limit this.
FIG. 6 is a block diagram of a question-parsing device according to an embodiment of the present invention. The device 500 shown in FIG. 6 includes: a receiving unit 501, a phrase detection unit 502, a mapping unit 503, a first determining unit 504, a second determining unit 505, an acquiring unit 506, and a generating unit 507.
The receiving unit 501 is configured to receive a question input by a user.
The phrase detection unit 502 is configured to perform phrase detection on the question received by the receiving unit 501, so as to determine the first candidate phrases.
The mapping unit 503 is configured to map the first candidate phrases determined by the phrase detection unit 502 to first resource items in a knowledge base, where the first resource items have semantics consistent with the first candidate phrases.
The first determining unit 504 is configured to determine, from the first candidate phrases and the first resource items, the values of the observed predicates and the possible question-analysis space, where the observed predicates indicate the features of the first candidate phrases, the features of the first resource items, and the relations between the first candidate phrases and the first resource items; a point in the possible question-analysis space is a proposition set, and the truth of the propositions in the proposition set is characterized by the values of the hidden predicates.
The second determining unit 505 is configured to perform, for each proposition set in the possible question-analysis space, uncertainty inference according to the values of the observed predicates and the values of the hidden predicates determined by the first determining unit 504, and to compute the confidence of each proposition set.
The acquiring unit 506 is configured to acquire the combination of true propositions in the proposition set whose confidence satisfies a preset condition, where the true propositions indicate the search phrases selected from the first candidate phrases, the search resource items selected from the first resource items, and the features of the search resource items.
The generating unit 507 is configured to generate a formal query statement from the combination of true propositions acquired by the acquiring unit 506.
The embodiments of the present invention perform uncertainty inference using observed predicates and hidden predicates, and can thus convert a natural-language question into a formal query statement. Moreover, in the embodiments of the present invention, the uncertainty-inference method can be applied to knowledge bases in any domain, i.e., it is domain-extensible, so conversion rules need not be configured manually for each knowledge base.
Optionally, as an embodiment, the uncertainty inference is based on a Markov logic network MLN, and the MLN comprises predefined first-order formulas and the weights of the first-order formulas.
Optionally, as another embodiment,
the acquiring unit 506 is further configured to acquire multiple natural-language questions from the knowledge base;
the phrase detection unit 502 is further configured to perform phrase detection on the multiple natural-language questions acquired by the acquiring unit 506, so as to determine the second candidate phrases;
the mapping unit 503 is further configured to map the second candidate phrases to second resource items in the knowledge base, where the second resource items have semantics consistent with the second candidate phrases;
the first determining unit 504 is further configured to determine, from the second candidate phrases and the second resource items, the values of the observed predicates corresponding to the multiple natural-language questions;
the acquiring unit 506 is further configured to acquire the manually annotated values of the hidden predicates corresponding to the multiple natural-language questions;
the second determining unit 505 is further configured to construct an undirected graph from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and to determine the weights of the first-order formulas through training.
Optionally, as another embodiment, the first-order formulas comprise Boolean formulas and weighted formulas, the weight of a Boolean formula is +∞, the weight of a weighted formula is the weighted-formula weight, and the manually annotated values of the hidden predicates corresponding to the multiple natural-language questions satisfy the Boolean formulas; the second determining unit 505 is specifically configured to: construct an undirected graph from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determine the weighted-formula weights through training.
Optionally, as another embodiment, the second determining unit 505 is specifically configured to: construct an undirected graph from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determine the weights of the first-order formulas by using the Margin Infused Relaxed Algorithm MIRA.
Optionally, as another embodiment, the MLN is denoted M, a first-order formula is denoted φi, the weight of the first-order formula is denoted wi, and a proposition set is denoted y; the second determining unit 505 is specifically configured to compute the confidence of each proposition set according to
$$p(y) = \frac{1}{Z}\exp\Bigl(\sum_{(\phi_i,\,w_i)\in M} w_i \sum_{c\in C^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
where Z is a normalization constant, $C^{\phi_i}$ is the set of sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in the set $C^{\phi_i}$, and $f_c^{\phi_i}$ is a binary function that indicates whether the first-order formula is true under the proposition set y.
Optionally, as another embodiment, the acquiring unit 506 is specifically configured to: determine the proposition set with the largest confidence value, and acquire the combination of true propositions in the proposition set with the largest confidence value.
Optionally, as another embodiment,
the features of the first candidate phrases include the positions of the first candidate phrases in the question, the parts of speech of the head words of the first candidate phrases, and the labels on the dependency paths between pairs of the first candidate phrases;
the features of the first resource items include the types of the first resource items, the pairwise correlation values between the first resource items, and the pairwise argument-matching relations between the first resource items;
the relations between the first candidate phrases and the first resource items include the prior matching scores between the first candidate phrases and the first resource items;
the first determining unit 504 is specifically configured to:
determine the positions of the first candidate phrases in the question;
determine the parts of speech of the head words of the first candidate phrases by using the Stanford part-of-speech tagging tool;
determine the labels on the dependency paths between pairs of the first candidate phrases by using the Stanford dependency parsing tool;
determine the types of the first resource items from the knowledge base, where a type is entity, class, or relation;
determine the pairwise argument-matching relations between the first resource items from the knowledge base;
use the pairwise similarity coefficients between the first resource items as the pairwise correlation values between the first resource items;
compute the prior matching scores between the first candidate phrases and the first resource items, where a prior matching score indicates the probability that a first candidate phrase maps to a first resource item (a hedged sketch of such a score follows this list).
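Since the formula of the prior matching score is not spelled out at this point, the following Python stand-in, hedged as such, uses token-level Jaccard overlap between the phrase and the resource item's label as a proxy for the mapping probability:

def prior_match_score(phrase, resource_label):
    # Hypothetical proxy: |tokens(phrase) ∩ tokens(label)| / |tokens(phrase) ∪ tokens(label)|
    a = set(phrase.lower().split())
    b = set(resource_label.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(prior_match_score("California", "California"))        # 1.0
print(prior_match_score("founded in", "foundation place"))  # 0.0; a real score would
# also need stemming or string similarity, which this sketch omits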
Optionally, as another embodiment, the formal query statement is a Simple Protocol and Resource Description Framework query statement, SPARQL.
Optionally, as another embodiment, the generating unit 507 is specifically configured to:
generate the SPARQL from the combination of true propositions by using a SPARQL template.
Optionally, as another embodiment, the SPARQL templates comprise an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template;
the generating unit 507 is specifically configured to:
when the question is a Yes/No question, generate the SPARQL from the combination of true propositions by using the ASK WHERE template;
when the question is a Normal question, generate the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template;
when the question is a Number question, generate the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template, or, when the SPARQL generated with the SELECT ?url WHERE template cannot yield a numeric answer, generate the SPARQL by using the SELECT COUNT(?url) WHERE template.
Optionally, as another embodiment, the phrase detection unit 502 is specifically configured to:
take word sequences in the question as the first candidate phrases, where a word sequence satisfies the following (a hedged sketch of these rules follows this list):
all consecutive non-stop-words in the word sequence begin with a capital letter, or, if not all consecutive non-stop-words in the word sequence begin with a capital letter, the length of the word sequence is less than four;
the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
the words in the word sequence are not all stop words.
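A hedged Python sketch of these three conditions follows; the stop-word list is a toy one, the POS tags are assumed to come from a tagger such as the Stanford tools mentioned above, and treating the last non-stop word as the head word is an assumption of this sketch:

STOP_WORDS = {"the", "in", "by", "of", "a", "an", "which", "has", "been"}

def head_pos(seq):
    # hypothetical head-word choice: the last non-stop word of the sequence
    for word, pos in reversed(seq):
        if word.lower() not in STOP_WORDS:
            return pos[:2].lower()
    return ""

def is_candidate(seq):
    # seq: list of (word, POS) pairs with Penn-style POS tags
    words = [w for w, _ in seq]
    non_stop = [w for w in words if w.lower() not in STOP_WORDS]
    if not non_stop:                                    # rule 3: not all stop words
        return False
    all_capitalized = all(w[0].isupper() for w in non_stop)
    if not all_capitalized and len(words) >= 4:         # rule 1: capitalized or short
        return False
    return head_pos(seq) in {"jj", "nn", "rb", "vb"}    # rule 2: head-word POS

print(is_candidate([("founded", "VBD"), ("in", "IN")]))  # True
print(is_candidate([("the", "DT"), ("in", "IN")]))       # False: all stop words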
Optionally, as another embodiment, the device 500 may be a server of the knowledge base.
The device 500 can implement each process implemented by the device in the embodiments of FIG. 1 to FIG. 5; to avoid repetition, the details are not repeated here.
FIG. 7 is a block diagram of a question-parsing device according to another embodiment of the present invention. The device 600 shown in FIG. 7 includes: a processor 601, a receiving circuit 602, a transmitting circuit 603, and a memory 604.
The receiving circuit 602 is configured to receive a question input by a user.
The processor 601 is configured to perform phrase detection on the question received by the receiving circuit 602, so as to determine the first candidate phrases.
The processor 601 is further configured to map the first candidate phrases to first resource items in a knowledge base, where the first resource items have semantics consistent with the first candidate phrases.
The processor 601 is further configured to determine, from the first candidate phrases and the first resource items, the values of the observed predicates and the possible question-analysis space, where the observed predicates indicate the features of the first candidate phrases, the features of the first resource items, and the relations between the first candidate phrases and the first resource items; a point in the possible question-analysis space is a proposition set, and the truth of the propositions in the proposition set is characterized by the values of the hidden predicates.
The processor 601 is further configured to perform, for each proposition set in the possible question-analysis space, uncertainty inference according to the values of the observed predicates and the values of the hidden predicates, and to compute the confidence of each proposition set.
The receiving circuit 602 is further configured to acquire the combination of true propositions in the proposition set whose confidence satisfies a preset condition, where the true propositions indicate the search phrases selected from the first candidate phrases, the search resource items selected from the first resource items, and the features of the search resource items.
The processor 601 is further configured to generate a formal query statement from the combination of true propositions.
The embodiments of the present invention perform uncertainty inference using observed predicates and hidden predicates, and can thus convert a natural-language question into a formal query statement. Moreover, in the embodiments of the present invention, the uncertainty-inference method can be applied to knowledge bases in any domain, i.e., it is domain-extensible, so conversion rules need not be configured manually for each knowledge base.
The components of the device 600 are coupled together through a bus system 605, which includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity, the various buses are all labeled as the bus system 605 in FIG. 7.
The method disclosed in the foregoing embodiments of the present invention may be applied to, or implemented by, the processor 601. The processor 601 may be an integrated circuit chip with signal-processing capability. During implementation, the steps of the foregoing method may be completed by integrated logic circuits of hardware in the processor 601 or by instructions in the form of software. The processor 601 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of the hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 604; the processor 601 reads the information in the memory 604 and completes the steps of the foregoing method in combination with its hardware.
It can be understood that the memory 604 in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (Static RAM, SRAM), dynamic RAM (Dynamic RAM, DRAM), synchronous DRAM (Synchronous DRAM, SDRAM), double data rate SDRAM (Double Data Rate SDRAM, DDR SDRAM), enhanced SDRAM (Enhanced SDRAM, ESDRAM), synchlink DRAM (Synchlink DRAM, SLDRAM), and direct rambus RAM (Direct Rambus RAM, DR RAM). The memory 604 of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
It can be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, a processing unit may be implemented in one or more application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, micro-controllers, microprocessors, other electronic units for performing the functions described in this application, or a combination thereof.
When the embodiments are implemented in software, firmware, middleware or microcode, or in program code or code segments, they may be stored in a machine-readable medium such as a storage component. A code segment may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or to a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, and the like may be passed, forwarded, or transmitted by any suitable means, including memory sharing, message passing, token passing, and network transmission.
For a software implementation, the techniques described herein may be implemented by modules (e.g., procedures, functions, and so on) that perform the functions described herein. Software code may be stored in a memory unit and executed by a processor. The memory unit may be implemented within the processor or external to the processor; in the latter case, it can be communicatively coupled to the processor via various means known in the art.
Optionally, as an embodiment, the uncertainty inference is based on a Markov logic network MLN, and the MLN comprises predefined first-order formulas and the weights of the first-order formulas.
In the embodiments of the present invention, the memory 604 may be used to store the resource items, the types of the resource items, and the like. The memory 604 may also be used to store the first-order formulas, and may further be used to store the SPARQL templates.
Optionally, as another embodiment,
the receiving circuit 602 is further configured to acquire multiple natural-language questions from the knowledge base;
the processor 601 is further configured to perform phrase detection on the multiple natural-language questions, so as to determine the second candidate phrases;
the processor 601 is further configured to map the second candidate phrases to second resource items in the knowledge base, where the second resource items have semantics consistent with the second candidate phrases;
the processor 601 is further configured to determine, from the second candidate phrases and the second resource items, the values of the observed predicates corresponding to the multiple natural-language questions;
the receiving circuit 602 is further configured to acquire the manually annotated values of the hidden predicates corresponding to the multiple natural-language questions;
the processor 601 is further configured to construct an undirected graph from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and to determine the weights of the first-order formulas through training.
Optionally, as another embodiment, the first-order formulas comprise Boolean formulas and weighted formulas, the weight of a Boolean formula is +∞, the weight of a weighted formula is the weighted-formula weight, and the manually annotated values of the hidden predicates corresponding to the multiple natural-language questions satisfy the Boolean formulas;
the processor 601 is specifically configured to: construct an undirected graph from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determine the weighted-formula weights through training.
Optionally, as another embodiment, the processor 601 is specifically configured to:
construct an undirected graph from the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determine the weights of the first-order formulas by using the Margin Infused Relaxed Algorithm MIRA.
Optionally, as another embodiment, the MLN is denoted M, a first-order formula is denoted φi, the weight of the first-order formula is denoted wi, and a proposition set is denoted y; the processor 601 is specifically configured to compute the confidence of each proposition set according to
$$p(y) = \frac{1}{Z}\exp\Bigl(\sum_{(\phi_i,\,w_i)\in M} w_i \sum_{c\in C^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
where Z is a normalization constant, $C^{\phi_i}$ is the set of sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in the set $C^{\phi_i}$, and $f_c^{\phi_i}$ is a binary function that indicates whether the first-order formula is true under the proposition set y.
Optionally, as another embodiment, the receiving circuit 602 is specifically configured to: determine the proposition set with the largest confidence value, and acquire the combination of true propositions in the proposition set with the largest confidence value.
Optionally, as another embodiment,
the features of the first candidate phrases include the positions of the first candidate phrases in the question, the parts of speech of the head words of the first candidate phrases, and the labels on the dependency paths between pairs of the first candidate phrases;
the features of the first resource items include the types of the first resource items, the pairwise correlation values between the first resource items, and the pairwise argument-matching relations between the first resource items;
the relations between the first candidate phrases and the first resource items include the prior matching scores between the first candidate phrases and the first resource items;
the processor 601 is specifically configured to:
determine the positions of the first candidate phrases in the question;
determine the parts of speech of the head words of the first candidate phrases by using the Stanford part-of-speech tagging tool;
determine the labels on the dependency paths between pairs of the first candidate phrases by using the Stanford dependency parsing tool;
determine the types of the first resource items from the knowledge base, where a type is entity, class, or relation;
determine the pairwise argument-matching relations between the first resource items from the knowledge base;
use the pairwise similarity coefficients between the first resource items as the pairwise correlation values between the first resource items;
compute the prior matching scores between the first candidate phrases and the first resource items, where a prior matching score indicates the probability that a first candidate phrase maps to a first resource item.
Optionally, as another embodiment, the formal query statement is a Simple Protocol and Resource Description Framework query statement, SPARQL.
Optionally, as another embodiment, the processor 601 is specifically configured to:
generate the SPARQL from the combination of true propositions by using a SPARQL template.
Optionally, as another embodiment, the SPARQL templates comprise an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template;
the processor 601 is specifically configured to:
when the question is a Yes/No question, generate the SPARQL from the combination of true propositions by using the ASK WHERE template;
when the question is a Normal question, generate the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template;
when the question is a Number question, generate the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template, or, when the SPARQL generated with the SELECT ?url WHERE template cannot yield a numeric answer, generate the SPARQL by using the SELECT COUNT(?url) WHERE template.
Optionally, as another embodiment, the processor 601 is specifically configured to:
take word sequences in the question as the first candidate phrases, where a word sequence satisfies:
all consecutive non-stop-words in the word sequence begin with a capital letter, or, if not all consecutive non-stop-words in the word sequence begin with a capital letter, the length of the word sequence is less than four;
the part of speech of the head word of the word sequence is jj, nn, rb, or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
the words in the word sequence are not all stop words.
Optionally, as another embodiment, the device 600 may be a server of the knowledge base.
The device 600 can implement each process implemented by the device in the embodiments of FIG. 1 to FIG. 5; to avoid repetition, the details are not repeated here.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
It may be clearly understood by a person skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiments are merely illustrative: the division into units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (24)

  1. A method for parsing a question in a knowledge base, comprising:
    receiving a question input by a user;
    performing phrase detection on the question to determine first candidate phrases;
    mapping the first candidate phrases to first resource items in the knowledge base, wherein the first resource items have semantics consistent with the first candidate phrases;
    determining, according to the first candidate phrases and the first resource items, values of observed predicates and a possible question-analysis space, wherein the observed predicates indicate features of the first candidate phrases, features of the first resource items, and relations between the first candidate phrases and the first resource items, a point in the possible question-analysis space is a proposition set, and truth of propositions in the proposition set is characterized by values of hidden predicates;
    performing, for each proposition set in the possible question-analysis space, uncertainty inference according to the values of the observed predicates and the values of the hidden predicates, and computing a confidence of each proposition set;
    acquiring a combination of true propositions in a proposition set whose confidence satisfies a preset condition, wherein the true propositions indicate search phrases selected from the first candidate phrases, search resource items selected from the first resource items, and features of the search resource items; and
    generating a formal query statement according to the combination of true propositions.
  2. The method according to claim 1, wherein the uncertainty inference is based on a Markov logic network MLN, and the MLN comprises predefined first-order formulas and weights of the first-order formulas.
  3. The method according to claim 2, wherein before the receiving a question input by a user, the method further comprises:
    acquiring multiple natural-language questions from the knowledge base;
    performing phrase detection on the multiple natural-language questions to determine second candidate phrases of the multiple natural-language questions;
    mapping the second candidate phrases to second resource items in the knowledge base, wherein the second resource items have semantics consistent with the second candidate phrases;
    determining, according to the second candidate phrases and the second resource items, values of observed predicates corresponding to the multiple natural-language questions;
    acquiring manually annotated values of hidden predicates corresponding to the multiple natural-language questions; and
    constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determining the weights of the first-order formulas through training.
  4. The method according to claim 3, wherein the first-order formulas comprise Boolean formulas and weighted formulas, a weight of a Boolean formula is +∞, a weight of a weighted formula is a weighted-formula weight, and the manually annotated values of the hidden predicates corresponding to the multiple natural-language questions satisfy the Boolean formulas; and
    the constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determining the weights of the first-order formulas through training comprises:
    constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determining the weighted-formula weight through training.
  5. The method according to claim 3, wherein the constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determining the weights of the first-order formulas through training comprises:
    constructing an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determining the weights of the first-order formulas by using the Margin Infused Relaxed Algorithm MIRA.
  6. The method according to any one of claims 2 to 5, wherein the MLN is denoted M, a first-order formula is denoted φi, the weight of the first-order formula is denoted wi, and the proposition set is denoted y; and
    the performing, for each proposition set in the question-analysis space, uncertainty inference according to the values of the observed predicates and the values of the hidden predicates, and computing the confidence of each proposition set comprises:
    computing the confidence of each proposition set according to
    $$p(y) = \frac{1}{Z}\exp\Bigl(\sum_{(\phi_i,\,w_i)\in M} w_i \sum_{c\in C^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
    wherein Z is a normalization constant, $C^{\phi_i}$ is the set of sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in the set $C^{\phi_i}$, and $f_c^{\phi_i}$ is a binary function that indicates whether the first-order formula is true under the proposition set y.
  7. The method according to any one of claims 1 to 6, wherein the acquiring the combination of true propositions in the proposition set whose confidence satisfies the preset condition comprises:
    determining the proposition set with the largest confidence value, and acquiring the combination of true propositions in the proposition set with the largest confidence value.
  8. The method according to any one of claims 1 to 7, wherein
    the features of the first candidate phrases comprise positions of the first candidate phrases in the question, parts of speech of head words of the first candidate phrases, and labels on dependency paths between pairs of the first candidate phrases;
    the features of the first resource items comprise types of the first resource items, pairwise correlation values between the first resource items, and pairwise argument-matching relations between the first resource items;
    the relations between the first candidate phrases and the first resource items comprise prior matching scores between the first candidate phrases and the first resource items; and
    the determining, according to the first candidate phrases and the first resource items, the values of the observed predicates comprises:
    determining the positions of the first candidate phrases in the question;
    determining the parts of speech of the head words of the first candidate phrases by using the Stanford part-of-speech tagging tool;
    determining the labels on the dependency paths between pairs of the first candidate phrases by using the Stanford dependency parsing tool;
    determining the types of the first resource items from the knowledge base, wherein a type is entity, class, or relation;
    determining the pairwise argument-matching relations between the first resource items from the knowledge base;
    using pairwise similarity coefficients between the first resource items as the pairwise correlation values between the first resource items; and
    computing the prior matching scores between the first candidate phrases and the first resource items, wherein a prior matching score indicates a probability that a first candidate phrase maps to a first resource item.
  9. The method according to any one of claims 1 to 8, wherein the formal query statement is a Simple Protocol and Resource Description Framework query statement, SPARQL.
  10. The method according to claim 9, wherein the generating a formal query statement according to the combination of true propositions comprises:
    generating the SPARQL from the combination of true propositions by using a SPARQL template.
  11. The method according to claim 10, wherein the SPARQL templates comprise an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template; and
    the generating the SPARQL from the combination of true propositions by using a SPARQL template comprises:
    when the question is a Yes/No question, generating the SPARQL from the combination of true propositions by using the ASK WHERE template;
    when the question is a Normal question, generating the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template; and
    when the question is a Number question, generating the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template, or, when the SPARQL generated by using the SELECT ?url WHERE template cannot yield a numeric answer, generating the SPARQL by using the SELECT COUNT(?url) WHERE template.
  12. The method according to any one of claims 1 to 11, wherein the performing phrase detection on the question to determine the first candidate phrases comprises: taking word sequences in the question as the first candidate phrases, wherein a word sequence satisfies:
    all consecutive non-stop-words in the word sequence begin with a capital letter, or, if not all consecutive non-stop-words in the word sequence begin with a capital letter, the length of the word sequence is less than four;
    the part of speech of the head word of the word sequence is jj, nn, rb, or vb, wherein jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb; and
    the words in the word sequence are not all stop words.
  13. A device for parsing a question, comprising:
    a receiving unit, configured to receive a question input by a user;
    a phrase detection unit, configured to perform phrase detection on the question received by the receiving unit, so as to determine first candidate phrases;
    a mapping unit, configured to map the first candidate phrases determined by the phrase detection unit to first resource items in a knowledge base, wherein the first resource items have semantics consistent with the first candidate phrases;
    a first determining unit, configured to determine, according to the first candidate phrases and the first resource items, values of observed predicates and a possible question-analysis space, wherein the observed predicates indicate features of the first candidate phrases, features of the first resource items, and relations between the first candidate phrases and the first resource items, a point in the possible question-analysis space is a proposition set, and truth of propositions in the proposition set is characterized by values of hidden predicates;
    a second determining unit, configured to perform, for each proposition set in the possible question-analysis space, uncertainty inference according to the values of the observed predicates and the values of the hidden predicates determined by the first determining unit, and to compute a confidence of each proposition set;
    an acquiring unit, configured to acquire a combination of true propositions in a proposition set, determined by the second determining unit, whose confidence satisfies a preset condition, wherein the true propositions indicate search phrases selected from the first candidate phrases, search resource items selected from the first resource items, and features of the search resource items; and
    a generating unit, configured to generate a formal query statement according to the combination of true propositions.
  14. The device according to claim 13, wherein the uncertainty inference is based on a Markov logic network MLN, and the MLN comprises predefined first-order formulas and weights of the first-order formulas.
  15. The device according to claim 14, wherein
    the acquiring unit is further configured to acquire multiple natural-language questions from the knowledge base;
    the phrase detection unit is further configured to perform phrase detection on the multiple natural-language questions acquired by the acquiring unit, so as to determine second candidate phrases;
    the mapping unit is further configured to map the second candidate phrases to second resource items in the knowledge base, wherein the second resource items have semantics consistent with the second candidate phrases;
    the first determining unit is further configured to determine, according to the second candidate phrases and the second resource items, values of observed predicates corresponding to the multiple natural-language questions;
    the acquiring unit is further configured to acquire manually annotated values of hidden predicates corresponding to the multiple natural-language questions; and
    the second determining unit is further configured to construct an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and to determine the weights of the first-order formulas through training.
  16. The device according to claim 15, wherein the first-order formulas comprise Boolean formulas and weighted formulas, a weight of a Boolean formula is +∞, a weight of a weighted formula is a weighted-formula weight, and the manually annotated values of the hidden predicates corresponding to the multiple natural-language questions satisfy the Boolean formulas; and
    the second determining unit is specifically configured to: construct an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determine the weighted-formula weight through training.
  17. The device according to claim 15, wherein the second determining unit is specifically configured to:
    construct an undirected graph according to the values of the observed predicates corresponding to the multiple natural-language questions, the values of the hidden predicates corresponding to the multiple natural-language questions, and the first-order formulas, and determine the weights of the first-order formulas by using the Margin Infused Relaxed Algorithm MIRA.
  18. The device according to any one of claims 14 to 17, wherein the MLN is denoted M, a first-order formula is denoted φi, the weight of the first-order formula is denoted wi, and the proposition set is denoted y; and
    the second determining unit is specifically configured to:
    compute the confidence of each proposition set according to
    $$p(y) = \frac{1}{Z}\exp\Bigl(\sum_{(\phi_i,\,w_i)\in M} w_i \sum_{c\in C^{\phi_i}} f_c^{\phi_i}(y)\Bigr)$$
    wherein Z is a normalization constant, $C^{\phi_i}$ is the set of sub-formulas corresponding to the first-order formula $\phi_i$, c is one sub-formula in the set $C^{\phi_i}$, and $f_c^{\phi_i}$ is a binary function that indicates whether the first-order formula is true under the proposition set y.
  19. The device according to any one of claims 13 to 18, wherein the acquiring unit is specifically configured to:
    determine the proposition set with the largest confidence value, and acquire the combination of true propositions in the proposition set with the largest confidence value.
  20. The device according to any one of claims 13 to 18, wherein
    the features of the first candidate phrases comprise positions of the first candidate phrases in the question, parts of speech of head words of the first candidate phrases, and labels on dependency paths between pairs of the first candidate phrases;
    the features of the first resource items comprise types of the first resource items, pairwise correlation values between the first resource items, and pairwise argument-matching relations between the first resource items;
    the relations between the first candidate phrases and the first resource items comprise prior matching scores between the first candidate phrases and the first resource items; and
    the first determining unit is specifically configured to:
    determine the positions of the first candidate phrases in the question;
    determine the parts of speech of the head words of the first candidate phrases by using the Stanford part-of-speech tagging tool;
    determine the labels on the dependency paths between pairs of the first candidate phrases by using the Stanford dependency parsing tool;
    determine the types of the first resource items from the knowledge base, wherein a type is entity, class, or relation;
    determine the pairwise argument-matching relations between the first resource items from the knowledge base;
    use pairwise similarity coefficients between the first resource items as the pairwise correlation values between the first resource items; and
    compute the prior matching scores between the first candidate phrases and the first resource items, wherein a prior matching score indicates a probability that a first candidate phrase maps to a first resource item.
  21. The device according to any one of claims 13 to 20, wherein the formal query statement is a Simple Protocol and Resource Description Framework query statement, SPARQL.
  22. The device according to claim 21, wherein the generating unit is specifically configured to:
    generate the SPARQL from the combination of true propositions by using a SPARQL template.
  23. The device according to claim 22, wherein the SPARQL templates comprise an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template; and
    the generating unit is specifically configured to:
    when the question is a Yes/No question, generate the SPARQL from the combination of true propositions by using the ASK WHERE template;
    when the question is a Normal question, generate the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template; and
    when the question is a Number question, generate the SPARQL from the combination of true propositions by using the SELECT ?url WHERE template, or, when the SPARQL generated by using the SELECT ?url WHERE template cannot yield a numeric answer, generate the SPARQL by using the SELECT COUNT(?url) WHERE template.
  24. The device according to any one of claims 13 to 23, wherein the phrase detection unit is specifically configured to:
    take word sequences in the question as the first candidate phrases, wherein a word sequence satisfies:
    all consecutive non-stop-words in the word sequence begin with a capital letter, or, if not all consecutive non-stop-words in the word sequence begin with a capital letter, the length of the word sequence is less than four;
    the part of speech of the head word of the word sequence is jj, nn, rb, or vb, wherein jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb; and
    the words in the word sequence are not all stop words.
PCT/CN2015/078362 2014-09-29 2015-05-06 Method and device for parsing a question in a knowledge base WO2016050066A1 (zh)
