CN107798126B - Question-answer processing method based on knowledge base - Google Patents

Question-answer processing method based on knowledge base

Info

Publication number
CN107798126B
CN107798126B
Authority
CN
China
Prior art keywords
word
matching degree
information pair
letters
phrases
Prior art date
Legal status
Active
Application number
CN201711111378.9A
Other languages
Chinese (zh)
Other versions
CN107798126A (en)
Inventor
程祥
苏森
朱署光
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201711111378.9A
Publication of CN107798126A
Application granted
Publication of CN107798126B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation


Abstract

The invention provides a question-answer processing method based on a knowledge base, comprising the following steps: determining candidate information pairs according to a target question and a knowledge base, wherein an information pair comprises the information of an entity and the information of the relation between that entity and other entities; querying a trained embedding table to obtain the embedding position data of each letter, each word, and each phrase corresponding to the target question and the candidate information pairs; calculating a matching degree score between the target question and each candidate information pair according to the embedding position data; determining a target information pair among the candidate information pairs according to the matching degree scores; and querying the knowledge base according to the target information pair to obtain the answer corresponding to the question.

Description

Question-answer processing method based on knowledge base
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a question-answer processing method based on a knowledge base.
Background
With the rapid development of internet technology, people are increasingly accustomed to obtaining information through networks. Against this background, question-answering systems have received a great deal of attention and have been extensively studied and applied. A question-answering system based on a knowledge base takes a large-scale, manually or automatically constructed knowledge base as its information source and can answer the knowledge- and fact-based questions that people frequently ask.
To realize question answering based on a knowledge base, either a semantic parsing scheme or a deep learning scheme can be adopted. A semantic parsing scheme converts the question into a logical expression that a machine can understand, and the expression is then used to query the knowledge base for the answer. However, semantic parsing schemes usually extract the features of questions and logical expressions manually in order to rank and return the best logical expressions; they are hard to separate from particular templates and trigger words and are therefore difficult to extend. In contrast, in a deep learning scheme, candidate answers are ranked by abstracting the semantics of the question and the candidate answers into low-dimensional real-valued vectors and computing vector similarity.
In the existing related art, deep learning techniques often process questions only at word granularity and answers only at entity granularity, and this manner of processing leads to situations in which the answer matching degree is not high.
Disclosure of Invention
The invention provides a question-answer processing method based on a knowledge base, which is used to solve the problem of a low answer matching degree.
According to a first aspect of the present invention, there is provided a knowledge-base-based question-answer processing method, including:
determining candidate information pairs according to a target question and a knowledge base, wherein an information pair comprises the information of an entity and the information of the relation between that entity and other entities;
querying a trained embedding table to obtain the embedding position data of each letter, each word, and each phrase corresponding to the target question and the candidate information pairs;
calculating a matching degree score between the target question and each candidate information pair according to the embedding position data;
determining a target information pair among the candidate information pairs according to the matching degree scores of the target question and the candidate information pairs;
and querying the knowledge base according to the target information pair to obtain the answer corresponding to the question.
Optionally, determining candidate information pairs according to the target question and the knowledge base includes:
extracting all word strings in the target question, wherein each word string has a length of at least 1 and at most the length of the question;
determining M of the word strings, wherein M is any integer greater than or equal to 1;
determining K entities in the knowledge base according to the M word strings, wherein K is any integer greater than or equal to 1;
the candidate information pairs comprising the K entities and the relations of the corresponding entities.
Optionally, determining the M word strings includes:
deleting, from all the word strings, the word strings containing query pronouns;
deleting the word strings that are stop words;
retaining the word strings judged to be entity names or parts of entity names;
and selecting the M longest word strings from those remaining.
Optionally, determining the K entities in the knowledge base according to the M word strings includes:
for the entities corresponding to each word string, retaining the L entities that most frequently appear as head entities in the facts of the knowledge base, to obtain the K entities.
Optionally, the embedding table is obtained by training on natural language data within a first preset range;
wherein the training on natural language data within the first preset range includes:
obtaining an untrained embedding table;
extracting natural language sentences within the first preset range;
training the embedding table according to the extracted sentences and the facts in them;
and obtaining a trained embedding table, which contains the letters, words, and phrases of questions and their corresponding embedding position data, as well as the letters, words, and phrases of information pairs and their corresponding embedding position data.
Optionally, training the embedding table according to the extracted sentences and the facts in them includes:
obtaining the content of letters, words, and phrases and the corresponding embedding position data according to the extracted sentences and the facts in them;
and, for the obtained letters, words, and phrases, predicting each one from the surrounding letters, words, and phrases and from the letters, words, and phrases of its lower-layer semantics, so as to train the embedding position data of the obtained letters, words, and phrases.
Optionally, calculating the matching degree score between the target question and each candidate information pair according to the embedding position data includes:
adjusting the embedding position data of the target question and of each candidate information pair according to the letters, words, and phrases of their contexts, to obtain processed embedding position data;
determining attention expressions of the letters, words, and phrases of the candidate information pairs according to the processed embedding position data and a trained attention model;
determining the importance degree of the letters of each candidate information pair relative to each letter in the target question according to the attention expressions of the letters of that candidate information pair; determining the importance degree of the words of each candidate information pair relative to each word in the target question according to the attention expressions of the words of that candidate information pair; and determining the importance degree of the phrases of each candidate information pair relative to each phrase in the target question according to the attention expressions of the phrases of that candidate information pair;
obtaining, according to the processed embedding position data of the letters, words, and phrases in the target question and the importance degrees: the letter matching degree scores of the letters of each candidate information pair relative to the letters in the target question; the word matching degree scores of the words of each candidate information pair relative to the words in the target question; and the phrase matching degree scores of the phrases of each candidate information pair relative to the phrases in the target question;
and obtaining the matching degree score of each candidate information pair according to the letter matching degree scores, the word matching degree scores, and the phrase matching degree scores.
Optionally, obtaining the matching degree score of each candidate information pair according to the letter matching degree scores, the word matching degree scores, and the phrase matching degree scores includes:
for each candidate information pair, determining its matching degree score through the following process:
for the letters in each word in the target question, calculating a first average value of the letter matching degree scores of the letters in that word;
comparing the word matching degree score of each word in the target question with the corresponding first average value, and taking the larger value as the determined matching degree score of the word;
calculating, from the determined matching degree scores of the words, a second average value of the determined word scores within each phrase;
comparing the phrase matching degree score of each phrase in the target question with the corresponding second average value, and taking the larger value as the determined matching degree score of the phrase;
and, according to the determined matching degree scores of the phrases, taking the average of the phrase scores corresponding to the target question as the matching degree score of the candidate information pair.
Optionally, the parameters in the attention model are determined by training on a given question, a first information pair corresponding to the given question, and a second information pair not corresponding to the given question, where the given question and the first information pair are obtained from the knowledge base and determined by manual labeling.
Optionally, the parameters in the attention model are determined by training on a given question, a first information pair corresponding to the given question, a second information pair not corresponding to the given question, and natural language within a second preset range, where the given question and the first information pair are obtained from the knowledge base and determined by manual labeling.
In the question-answer processing method based on a knowledge base provided by the invention, the embedding position data of each letter, each word, and each phrase corresponding to the target question and the candidate information pairs are obtained by querying a trained embedding table; the matching degree score between the target question and each candidate information pair is calculated from the embedding position data, and the target information pair is then determined among the candidate information pairs. Because information at the letter, word, and phrase granularities is used jointly, the association between the question and the entity-relation pairs is better captured, improving the answer matching degree.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a first schematic flowchart of the knowledge-base-based question-answer processing method according to the present invention;
FIG. 2 is a second schematic flowchart of the knowledge-base-based question-answer processing method according to the present invention;
FIG. 3 is a schematic flow chart of step S21 in FIG. 2;
FIG. 4 is a schematic flow chart of step S22 in FIG. 2;
FIG. 5 is a schematic flowchart of step S223 in FIG. 4;
FIG. 6 is a schematic flow chart of step S24 in FIG. 2;
FIG. 7 is a schematic flowchart of step S249 in FIG. 6;
FIG. 8 is a diagram illustrating the relationship between the number of question-answer pairs and the accuracy of the present invention;
FIG. 9 is a diagram illustrating the relationship between the number of sentence-fact pairs and the accuracy according to the present invention;
FIG. 10 is a diagram illustrating the relationship between the number of synonymous question pairs and the accuracy according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Example 1
FIG. 1 is a first schematic flowchart of the knowledge-base-based question-answer processing method according to the present invention. Referring to FIG. 1, the present embodiment provides a question-answer processing method based on a knowledge base, including:
S11: determining candidate information pairs according to the target question and the knowledge base, wherein an information pair comprises the information of an entity and the information of the relation between that entity and other entities.
The knowledge base K consists of a set of entities E, a set of relations R, and a set of facts
F ⊆ E × R × E.
Given a question q, its answer is o. Suppose that
⟨s, r, o⟩ ∈ F
and that the entity s and the relation r are explicitly or implicitly referenced by question q; the goal is then to determine s and r so that F can be queried with ⟨s, r, ?⟩ and the returned results are the answer. Therefore, once the information pair is determined, the corresponding answer can be determined.
S12: querying the trained embedding table to obtain the embedding position data of each letter, each word, and each phrase corresponding to the target question and the candidate information pairs.
The embedding position data can comprise the embedding expressions of the letters, words, and phrases; specifically, it can comprise embedding sequences of the letters, words, and phrases, an embedding sequence containing vector data of the letters, words, and phrases or other data that can characterize the embedding positions. The embedding position data may also comprise embedding vector data, or other data that can characterize the embedding positions, without forming a sequence.
The embedding table may contain the letters, words, and phrases of questions and their corresponding embedding position data, as well as the letters, words, and phrases of information pairs and their corresponding embedding position data.
S13: calculating the matching degree score between the target question and each candidate information pair according to the embedding position data. The matching degree can be understood as any value characterizing how well the target question matches a candidate information pair.
S14: determining a target information pair among the candidate information pairs according to the matching degree scores of the target question and the candidate information pairs. When a uniform scoring criterion is used, the target information pair can be determined by numerical comparison of the scores; for example, the candidate information pair with the highest (or, depending on the criterion, the lowest) score may be taken as the target information pair.
S15: querying the knowledge base according to the target information pair to obtain the answer corresponding to the question.
In the question-answer processing method based on a knowledge base provided in this embodiment, the embedding position data of each letter, each word, and each phrase corresponding to the target question and the candidate information pairs are obtained by querying the trained embedding table; the matching degree score between the target question and each candidate information pair is then calculated from the embedding position data, and the target information pair is determined among the candidate information pairs.
Example 2
FIG. 2 is a second schematic flowchart of the knowledge-base-based question-answer processing method according to the present invention. Referring to FIG. 2, the present embodiment provides a question-answer processing method based on a knowledge base, including:
S21: determining candidate information pairs according to the target question and the knowledge base, wherein an information pair comprises the information of an entity and the information of the relations between that entity and the corresponding entities.
The knowledge base K consists of a set of entities E, a set of relations R, and a set of facts
F ⊆ E × R × E.
Given a question q, its answer is o. Suppose that
⟨s, r, o⟩ ∈ F
and that the entity s and the relation r are explicitly or implicitly referenced by question q; the goal is then to determine s and r so that F can be queried with ⟨s, r, ?⟩ and the returned results are the answer. Therefore, once the information pair is determined, the corresponding answer can be determined.
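To make the above ⟨s, r, ?⟩ query concrete, the following is a minimal Python sketch; the fact set, the entity and relation names, and the function itself are illustrative assumptions, not taken from the patent:

```python
# Minimal sketch of querying a fact set F with <s, r, ?> (illustrative data).
facts = {
    ("Beijing", "capital_of", "China"),
    ("BUPT", "located_in", "Beijing"),
}

def query(facts, s, r):
    """Return every object o such that <s, r, o> is in the fact set."""
    return [o for (s_, r_, o) in facts if s_ == s and r_ == r]

print(query(facts, "Beijing", "capital_of"))  # ['China']
```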
FIG. 3 is a schematic flowchart of step S21 in FIG. 2. Referring to FIG. 3, determining candidate information pairs according to the target question and the knowledge base, i.e. step S21, may include:
S211: extracting all word strings (n-grams) in the target question, where a word string n-gram can be understood as any one or more consecutive words, i.e. a single word or a combination of words, and the length of a word string n-gram is at least 1 and at most the length of the question. If the target question is denoted q, it is regarded as the word sequence
q = {w_1, w_2, ..., w_{n_q^(w)}},
where w denotes a word and n_q^(w) denotes the total number of words in the target question.
S212: determining M word string n-grams among the word strings, where M is any integer greater than or equal to 1.
Determining the M word strings, i.e. step S212, includes:
S2121: deleting, from all the word strings, the word string n-grams containing query pronouns; the query pronouns may be listed as: when, what, where, which, why, how.
S2122: deleting the word string n-grams that are stop words; specifically, such a word string may contain only one word.
S2123: retaining the word string n-grams judged to be entity names or parts of entity names; the other word string n-grams may be deleted.
S2124: selecting the M longest word strings from those remaining. In a specific implementation, all word string n-grams that are subsequences of other word string n-grams can be deleted, and the five longest of the remaining n-grams retained, i.e. M = 5.
S213: determining K entities in the knowledge base according to the M word string n-grams, where K is any integer greater than or equal to 1; the candidate information pairs comprise the K entities and the relations of the corresponding entities.
Determining the K entities in the knowledge base according to the M word strings, i.e. step S213, may include:
for the entities corresponding to each word string n-gram, retaining the L entities that most frequently appear as head entities in the facts of the knowledge base, to obtain the K entities; L can be 2, and K may be the product of L and M, specifically 2 × 5 = 10.
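As an illustration of steps S211 to S213, the following Python sketch enumerates the n-grams of a question and applies the filters described above; the pronoun and stop-word lists, the value of M, and the omitted entity-name check against the knowledge base are assumptions for the example:

```python
QUERY_PRONOUNS = {"when", "what", "where", "which", "why", "how"}
STOP_WORDS = {"the", "a", "an", "of", "is", "in"}

def candidate_ngrams(question, m=5):
    words = question.lower().rstrip("?").split()
    # enumerate every contiguous word string of length 1 .. len(question)
    grams = [tuple(words[i:j])
             for i in range(len(words))
             for j in range(i + 1, len(words) + 1)]
    # S2121: drop n-grams containing query pronouns
    grams = [g for g in grams if not QUERY_PRONOUNS & set(g)]
    # S2122: drop single-word n-grams that are stop words
    grams = [g for g in grams if not (len(g) == 1 and g[0] in STOP_WORDS)]

    def is_subseq(g, h):  # g is a contiguous subsequence of a longer h
        return len(g) < len(h) and any(
            h[i:i + len(g)] == g for i in range(len(h) - len(g) + 1))

    # S2124: drop subsequences of other n-grams, keep the M longest
    grams = [g for g in grams if not any(is_subseq(g, h) for h in grams)]
    return sorted(grams, key=len, reverse=True)[:m]

print(candidate_ngrams("what is the capital of china"))
# [('is', 'the', 'capital', 'of', 'china')]
```

The retained n-grams would then be matched against entity names in the knowledge base, keeping for each the L entities that most frequently appear as head entities.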
S22: training according to natural language data in a first preset range to obtain a trained embedded table; it is also understood that the embedded table is trained from a first predetermined range of natural language data.
FIG. 4 is a schematic flowchart of step S22 in FIG. 2. Referring to FIG. 4, in a specific process, step S22 may include:
S221: obtaining an untrained embedding table.
Specifically, an embedding table H can be established as required. Its structure can be enumerated as containing the letter content and embedding position data of questions, the word content and embedding position data of questions, the phrase content and embedding position data of questions, the letter content and embedding position data of information pairs, the word content and embedding position data of information pairs, and the phrase content and embedding position data of information pairs; the embedding position data may be enumerated as vector data. Each letter, word, and phrase corresponds to a low-dimensional real-valued vector of the same dimension, which can be used to represent its position in semantic space.
S222: extracting natural language sentences within the first preset range.
Specifically, a large number of sentences can be extracted from web pages, and the sentences containing at least one phrase retained. Sentence s and fact f can each be converted, and the embedding table then queried, to obtain the embedding sequences H(s^(c)), H(s^(w)), H(s^(p)), H(f^(c)), H(f^(w)), and H(f^(p)). Taking the letter embedding sequence H(s^(c)) of sentence s as an example, there is
H(s^(c)) = {H(c_1), H(c_2), ..., H(c_{n_s^(c)})},
where w can characterize a word, p can characterize a phrase, and c can characterize a letter; s^(w) can be understood as identifying the words in sentence s, and the other notations can be understood analogously.
A specific implementation of querying the embedding table can be understood with reference to the following process. The process applies equally to the later querying of the embedding table with the target question and candidate information pairs, i.e. the querying in step S23, so the specific questions, information pairs, letters, words, and phrases are not limited here and serve only as an example.
A given question q can be converted into letter, word, and phrase sequences respectively, i.e.
q^(c) = {c_1, c_2, ...}, q^(w) = {w_1, w_2, ...}, and q^(p) = {p_1, p_2, ...}.
To obtain the letter sequence, the words in the question need to be split into letters, with punctuation marks and spaces also regarded as letters. To obtain the word sequence, the content needs to be divided by spaces and punctuation marks, with the words retained and punctuation marks regarded as words. To obtain the phrase sequence, the phrases in the question need to be identified using a phrase table obtained in advance and each phrase treated as an indivisible whole; the content is then divided by spaces and punctuation marks, with the phrases retained and words and punctuation marks regarded as phrases.
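The three conversions just described can be sketched as follows; the regular expressions and the phrase table are assumptions for the example, not the patent's exact tokenization:

```python
import re

PHRASE_TABLE = {"new york"}  # assumed phrase table obtained in advance

def to_letters(q):
    # words are split into letters; punctuation marks and spaces also count
    return list(q)

def to_words(q):
    # divide by spaces, keep the words, treat punctuation marks as words
    return re.findall(r"\w+|[^\w\s]", q)

def to_phrases(q):
    # treat known phrases as indivisible wholes, then divide as for words
    ql = q.lower()
    for ph in PHRASE_TABLE:
        ql = ql.replace(ph, ph.replace(" ", "_"))
    return re.findall(r"\w+|[^\w\s]", ql)

q = "Where is New York?"
print(to_words(q))    # ['Where', 'is', 'New', 'York', '?']
print(to_phrases(q))  # ['where', 'is', 'new_york', '?']
```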
A given information pair a can likewise be converted into letter, word, and phrase sequences, i.e.
a^(c) = {c_1, c_2, ...}, a^(w) = {w_1, w_2, ...}, and a^(p) = {p_1, p_2}.
The conversion is similar to that of a question, with the following differences. When the letter and word sequences are obtained, the name of the entity is obtained using a knowledge base API, identical letters or words appearing in both the entity and the relation are distinguished, and the spaces within the entity, the separators between the entity and the relation, and the separators within the relation are all normalized into spaces. When the phrase sequence is obtained, the entity and the relation are each regarded as a phrase, so an information pair contains exactly 2 phrases.
S223: training an embedding table according to the extracted sentences and facts in the sentences; training the embedding table, as can be understood, determines what letters, words, and phrases can be through learning of sentences in the extracted natural language, and embedding location data corresponding to each letter, word, and phrase, which can include a corresponding vector.
FIG. 5 is a schematic flowchart of step S223 in FIG. 4; referring to fig. 5, step S223 may include:
s2231: obtaining the content of letters, words and phrases and corresponding embedded position data according to the extracted sentences and the facts in the sentences;
s2232: for the obtained letters, words and phrases, respectively predicting the obtained letters, words and phrases according to surrounding letters, words and phrases and letters, words and phrases of lower-layer semantics thereof; to train the obtained embedded position data of letters, words and phrases.
S224: a trained embedding table is obtained, which contains the letters, words, phrases and their corresponding embedding position data of the problem, and the letters, words, phrases and their corresponding embedding position data of the information pairs.
The training process based on surrounding letters, words, and phrases may be illustrated as follows. Take the sentence s and a letter in it as an example, where
s^(c) = {c_1, c_2, ..., c_{n_s^(c)}}.
An objective function can be established that predicts each center letter from the letters in a window around it, of the form
L_{s^(c)} = Σ_t log P(c_t | c_{t-k}, ..., c_{t-1}, c_{t+1}, ..., c_{t+k}),
where the probability P is normalized by a softmax over the letter vocabulary (exp can be understood as the exponential function with base e), c_{<0} and c_{>n_s^(c)} are all the special symbol ⟨pad⟩, and the log term represents the probability of predicting the center letter from its surrounding letters.
Referring to the above example, replacing the letter symbol c with w for words or p for phrases yields the corresponding objective functions for words and phrases. For the word sequence s^(w) and phrase sequence s^(p) of sentence s, and the letter sequence f^(c) and word sequence f^(w) of fact f, the objective functions
L_{s^(w)}, L_{s^(p)}, L_{f^(c)}, and L_{f^(w)}
are established in the same way. In addition, because the phrase sequence of fact f, f^(p) = {p_1, p_2, p_3}, has little context, it needs to be handled separately, so a separate objective function L_{f^(p)} can be established for it.
the obtained letters, words and phrases are respectively processed according to the letters, words and phrases of the lower-layer semantics, and the following steps can be listed:
for a given frequency limit NminIf the frequency N of the word w in the corpus isw<NminAnd it contains the letters of
Figure BDA0001465350650000105
Then order:
Figure BDA0001465350650000106
where GRU denotes a gated recurrent unit, which can extract the semantics of a sequence using the semantics and order of the sequence's units. The GRU computation is:
z = σ(W_zc H(c_t) + W_zh h_{t-1} + b_z)
r = σ(W_rc H(c_t) + W_rh h_{t-1} + b_r)
h̃_t = tanh(W_hc H(c_t) + W_hh (r ⊙ h_{t-1}) + b_h)
h_t = z ⊙ h̃_t + (1 − z) ⊙ h_{t-1}
where z is the update gate, controlling how much of the candidate state and of the old state are used in the updated state; r is the reset gate, controlling how much of the old state is reset when forming the candidate state; h̃_t is the candidate state at time t, and h_t is the state at time t; σ and tanh are the logistic function and the hyperbolic tangent function, respectively; and the W and b are parameter matrices and vectors.
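The GRU computation above can be sketched in Python with numpy as follows; the weight shapes and the random initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRU:
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        w = lambda r, c: rng.normal(0.0, 0.1, (r, c))
        self.Wzc, self.Wzh, self.bz = w(d_h, d_in), w(d_h, d_h), np.zeros(d_h)
        self.Wrc, self.Wrh, self.br = w(d_h, d_in), w(d_h, d_h), np.zeros(d_h)
        self.Whc, self.Whh, self.bh = w(d_h, d_in), w(d_h, d_h), np.zeros(d_h)

    def step(self, x, h_prev):
        z = sigmoid(self.Wzc @ x + self.Wzh @ h_prev + self.bz)  # update gate
        r = sigmoid(self.Wrc @ x + self.Wrh @ h_prev + self.br)  # reset gate
        h_cand = np.tanh(self.Whc @ x + self.Whh @ (r * h_prev) + self.bh)
        return z * h_cand + (1.0 - z) * h_prev

    def run(self, xs):
        h = np.zeros_like(self.bz)
        for x in xs:
            h = self.step(x, h)
        return h  # final state: the embedding of the whole sequence

# e.g. embedding a low-frequency word from its letter embeddings:
# H(w) = GRU(H(c_1), ..., H(c_n))
gru = GRU(d_in=8, d_h=16)
letter_embeddings = [np.ones(8) * i for i in range(1, 5)]
print(gru.run(letter_embeddings).shape)  # (16,)
```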
Similarly, for phrases, and for knowledge base words, entities, and relations whose frequency is below the limit, their embeddings are obtained by the above formula; for example, the letter symbol c can be replaced by w for words or p for phrases to obtain the corresponding versions.
Specifically, the embedding table and the gated recurrent unit may be trained with a batch gradient descent algorithm, based on combining the embeddings of low-frequency content with the embeddings at the next-lower granularity.
S23: querying the trained embedding table to obtain the embedding position data of each letter, each word, and each phrase corresponding to the target question and the candidate information pairs.
The embedding position data can comprise embedding sequences of the letters, words, and phrases, an embedding sequence containing vector data of the letters, words, and phrases or other data that can characterize the embedding positions; the embedding position data may also comprise embedding vector data, or other data that can characterize the embedding positions, without forming a sequence.
The embedding table may contain the letters, words, and phrases of questions and their corresponding embedding position data, as well as the letters, words, and phrases of information pairs and their corresponding embedding position data.
S24: calculating the matching degree score between the target question and each candidate information pair according to the embedding position data. The matching degree can be understood as any value characterizing how well the target question matches a candidate information pair.
FIG. 6 is a schematic flowchart of step S24 in FIG. 2. Referring to FIG. 6, calculating the matching degree score between the target question and each candidate information pair according to the embedding position data, i.e. step S24, may include:
S241: adjusting the embedding position data of the target question and of each candidate information pair according to the letters, words, and phrases of their contexts, to obtain the processed embedding position data. Before this, the target question and the candidate information pairs can be converted into letter, word, and phrase embedding sequences, i.e. H(q^(c)), H(q^(w)), H(q^(p)), H(a^(c)), H(a^(w)), and H(a^(p)); these embedding sequences can be understood as embedding position data.
In a specific implementation, the sequences can be processed with a bidirectional gated recurrent unit, so that the embedding at each position both highlights its own semantics and fuses the semantics of its context. Taking the letter sequence of the question as an example, there are:
m_i^(c,f) = GRU(H(c_i), m_{i-1}^(c,f))
m_i^(c,b) = GRU(H(c_i), m_{i+1}^(c,b))
m_i^(c) = [m_i^(c,f); m_i^(c,b)]
where m_i^(c) can be understood as the processed embedding position data of the i-th letter in the question, and the corresponding n_j^(c) can be understood as the processed embedding position data of the j-th letter in the information pair; m_i^(c,f) denotes the embedding position data of the letter after adjustment using the preceding letters, m_i^(c,b) denotes the embedding position data of the letter after adjustment using the following letters, and fusing the two yields the corresponding m_i^(c).
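The bidirectional processing can be sketched as follows, reusing the GRU class from the previous sketch; the function and variable names are assumptions for illustration:

```python
import numpy as np

def bigru(fwd, bwd, xs):
    h_f, h_b = np.zeros_like(fwd.bz), np.zeros_like(bwd.bz)
    out_f, out_b = [], []
    for x in xs:              # forward pass over the embedding sequence
        h_f = fwd.step(x, h_f)
        out_f.append(h_f)
    for x in reversed(xs):    # backward pass over the embedding sequence
        h_b = bwd.step(x, h_b)
        out_b.append(h_b)
    out_b.reverse()
    # concatenate the two directions at each position to obtain m_i
    return [np.concatenate([f, b]) for f, b in zip(out_f, out_b)]

fwd, bwd = GRU(d_in=8, d_h=16, seed=1), GRU(d_in=8, d_h=16, seed=2)
m = bigru(fwd, bwd, [np.ones(8) * i for i in range(1, 5)])
print(len(m), m[0].shape)  # 4 (32,)
```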
S242: and determining the attention expression of letters, words and phrases of the candidate information pairs according to the processed embedded position data and the trained attention model.
S243: and determining the importance degree of the letters of the candidate information pairs relative to each letter in the target question according to the attention expression of the letters of the candidate information pairs.
S244: the importance degree of the words of each candidate information pair relative to each word in the target question is determined according to the attention expression of the words of each candidate information pair.
S245: and determining the importance degree of the phrases of each candidate information pair relative to each phrase in the target problem according to the attention expression of the phrases of each candidate information pair.
S246: and obtaining letter matching degree scores of the letters of the candidate information pairs relative to the letters in the target problem according to the processed embedded position data of the letters, the words and the phrases in the target problem and the importance degrees.
S247: and obtaining word matching degree scores of the words of the candidate information pairs relative to the words in the target problem according to the processed embedded position data of the letters, the words and the phrases in the target problem and the importance degrees.
S248: and obtaining the phrase matching degree score of the phrases of each candidate information pair relative to the phrases in the target problem according to the processed embedded position data of the letters, the words and the phrases in the target problem and each importance degree.
For the above processes, an understanding can be specifically enumerated:
the score for each letter, word, and phrase of the question q is calculated to indicate how well the question q matches the information pair a at the corresponding granularity and location. First, using the attention mechanism, the attention expression of information pairs a on letter granularity is calculated:
Figure BDA0001465350650000121
Figure BDA0001465350650000122
Figure BDA0001465350650000123
wherein the content of the first and second substances,
Figure BDA0001465350650000124
it can be understood that the attention expression of the jth letter of the information pair relative to the ith letter of the question is on letter granularity; v, W and b are preset constants in the attention model and can be determined through training of the attention model.
Figure BDA0001465350650000131
It can be understood that the information pair is in letter granularity, and the importance degree of the jth letter in all letters of the information pair relative to the ith letter of the question is expressed;
Figure BDA0001465350650000132
can be understood as the expression of the degree of importance of all letters of the information pair relative to the ith letter of the question;
wherein
Figure BDA0001465350650000133
The information pair a is embedded in the letter granularity position after being processed by a bidirectional gating circulation unit, namely the processed embedded position data. The attention mechanism can adjust the attention of the information to the content according to the question content, so that the matching of the information and the content is reflected to the maximum extent. Then, the information is matched with the attention expression of a and the expression of the question q on the corresponding granularity and position, and the score is calculated:
r_i^(c) = cos(m_i^(c), ñ_i^(c)),
where r_i^(c) can be understood as the letter matching degree score of the i-th letter of the question, calculated for example as the similarity between m_i^(c) and ñ_i^(c).
Similarly, the score of every letter, word, and phrase of question q can be calculated. It should be noted that when the scores are calculated at the word and phrase granularities, all of the processed embedding position data are used in the calculation.
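The attention-and-scoring computation at a single granularity can be sketched as follows; the use of cosine similarity for the final score is a reconstruction from context, and the shapes and names are assumptions for illustration:

```python
import numpy as np

def match_scores(M, N, W1, W2, v, b):
    """M: processed question embeddings; N: processed info-pair embeddings."""
    scores = []
    for m in M:                                    # one question position i
        e = np.array([v @ np.tanh(W1 @ m + W2 @ n + b) for n in N])
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                       # importance degrees
        attended = (alpha[:, None] * N).sum(axis=0)
        cos = m @ attended / (np.linalg.norm(m) *
                              np.linalg.norm(attended) + 1e-8)
        scores.append(cos)                         # r_i at this granularity
    return scores

d = 4
rng = np.random.default_rng(0)
M, N = rng.normal(size=(3, d)), rng.normal(size=(5, d))
r = match_scores(M, N, rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                 rng.normal(size=d), np.zeros(d))
print(len(r))  # 3, one score per question position
```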
S249: and obtaining the matching degree score of each candidate information pair according to the letter matching degree score, the word matching degree score and the phrase matching degree score.
The obtaining the matching degree score of each candidate information pair according to the letter matching degree score, the word matching degree score, and the phrase matching degree score, that is, step S249, may include:
for each candidate information pair, respectively determining the matching degree score thereof through the following processes:
s2491: and aiming at each letter in each word in the target problem, calculating a first average value of the letter matching degree scores of the letters in each word according to the letter matching degree score.
S2492: comparing the word matching degree score of each word in the target problem with the corresponding first average value, and taking the larger value as the matching degree score of the determined word;
in a specific example, a question q includes a word wi={cx,...,cyThe average score of letter sequence of the word is calculated first
Figure BDA0001465350650000135
Recalculation
Figure BDA0001465350650000136
Thus in the word wiRetaining a score at the granularity that better supports matching the question q with the information pair a.
In subsequent processes, the above process can be similarly performed on all words of the question q, preserving the matching scores of the words or letter sequences on the different words. A similar process is also performed for all phrases contained in question q, with scores at the granularity that better support matching question q with information pair a being retained on different phrases.
S2493: calculating a second average value of the matching degree scores of the determined words of each word in each phrase according to the matching degree scores of the determined words;
s2494: comparing the phrase matching degree score of each phrase in the target problem with the corresponding second average value, and taking a larger value as the matching degree score of the determined phrase;
s2495: and according to the determined matching degree score of the phrases, taking the average value of the matching degree scores of the phrases corresponding to the target problem as the matching degree score of the candidate information pair. In particular, it can calculate
Figure BDA0001465350650000141
I.e. s (q, a).
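The hierarchical aggregation of steps S2491 to S2495 can be sketched as follows; the span lists aligning letters to words and words to phrases are assumptions for the example:

```python
def aggregate(letter_scores, word_spans, word_scores, phrase_spans, phrase_scores):
    # word_spans: (start, end) letter indices per word;
    # phrase_spans: (start, end) word indices per phrase
    w = [max(ws, sum(letter_scores[a:b]) / (b - a))   # word vs. first average
         for ws, (a, b) in zip(word_scores, word_spans)]
    p = [max(ps, sum(w[a:b]) / (b - a))               # phrase vs. second average
         for ps, (a, b) in zip(phrase_scores, phrase_spans)]
    return sum(p) / len(p)                            # s(q, a)

# toy example: a 6-letter question split into 2 words forming 1 phrase
print(aggregate([0.2, 0.4, 0.6, 0.1, 0.3, 0.5],
                [(0, 3), (3, 6)], [0.5, 0.2],
                [(0, 2)], [0.35]))  # 0.4
```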
S25: determining a target information pair in the candidate information pair according to the matching degree score of the target problem and each candidate information pair; in the case of using a uniform matching degree scoring criterion, a target information pair may be determined in the candidate information pairs by numerical comparison of the scores, and for example, a candidate information pair with the lowest score or the highest score may be used as the target information pair.
S26: and inquiring a knowledge base according to the target information pair to obtain an answer corresponding to the question.
In addition to the above, this embodiment also describes how the attention model is trained.
The parameters in the attention model are determined by training on a given question, a first information pair corresponding to the given question, and a second information pair not corresponding to the given question, where the given question and the first information pair are obtained from the knowledge base and determined by manual labeling.
A large number of relation pairs, each consisting of a question and an information pair, can be extracted from the knowledge base; they may be characterized as the set QA = {(q_1, a_1), (q_2, a_2), ..., (q_n, a_n)}.
Based on this set, an objective function can be established:
L_QA = Σ_{(q,a)∈QA} Σ_{a'∈R_q} l(q, a, a'),
where R_q indicates that, for a question q, information pairs that do not correspond to q are selected, K_QA of them in number; these correspond to the second information pairs, and QA can be understood as providing the first information pairs. With this objective function, the parameters of the attention model are trained using corresponding and non-corresponding information pairs, so that matching parameters for the corresponding-information-pair case are obtained;
where l(q, a, a') = max(γ_QA − s(q, a) + s(q, a'), 0),
which, through the preset parameter γ_QA, pushes the difference s(q, a) − s(q, a') to be larger than γ_QA. This can be understood as a form of hinge loss: when the score difference between the positive and negative samples is greater than or equal to the fixed value, the loss is 0; otherwise, the loss is the difference between the fixed value and the positive-negative score difference.
Specifically, a batch back-propagation algorithm may be used to train the bidirectional gated recurrent unit and the attention model.
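The pairwise hinge objective can be sketched as follows; the toy scoring function is an assumption standing in for the trained matching score s(q, a):

```python
def hinge_loss(score, q, pos, negatives, gamma=0.5):
    """Sum of max(gamma - s(q, a) + s(q, a'), 0) over negative pairs a'."""
    return sum(max(gamma - score(q, pos) + score(q, neg), 0.0)
               for neg in negatives)

# toy scorer: fraction of question words appearing in the information pair
def toy_score(q, a):
    qw = q.split()
    return len(set(qw) & set(a.split())) / len(qw)

print(hinge_loss(toy_score, "capital of china",
                 "china capital_of", ["usa capital_of", "china located_in"]))
# ~0.667: both negatives fall within the margin of the positive pair
```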
In addition, if the amount of data in the knowledge base is insufficient, the parameters in the attention model are determined by training on a given question, a first information pair corresponding to the given question, a second information pair not corresponding to the given question, and natural language within a second preset range, where the given question and the first information pair are obtained from the knowledge base and determined by manual labeling.
In a specific implementation, entity linking can be performed on a large number of sentences extracted from web pages, and data pairs recorded, each comprising the sentence information, the corresponding entity information in the sentence, and the start and end positions of the entity mentions in the sentence.
Sentences that link two entities simultaneously and constitute a fact are selected, and the corresponding data pairs are recorded.
In a multi-task learning mode, the data pairs are used to train the bidirectional gated recurrent unit and the attention model, with an objective function of the form
L_SF = Σ_{(s,f)} Σ_{f'} max(γ_SF − s(s, f) + s(s, f'), 0),
and L_QA and L_SF are optimized alternately,
where s can characterize a sentence, f can characterize a first fact corresponding to the sentence, and f' can characterize a second fact not corresponding to the sentence.
Compared with the prior art, the invention better uncovers the association between the question and entity-relation pairs by comprehensively using the information of the question and of the information pairs at the three granularities of letters, words, and phrases, thereby selecting the entity-relation pair that better matches the meaning of the question and improving question-answering accuracy.
By comparison with methods proposed in the literature and with simplified versions of the method itself, it can be determined that the proposed multi-granularity expression and matching method has a clear advantage in answer accuracy. The data set for experimental training and testing was SimpleQuestions, which contains 100,842 question-fact pairs, divided into 70% for training, 10% for validation, and 20% for testing. The knowledge bases used for answering questions were FB2M and FB5M, which contain 2,150,604 entities, 6,701 relations, and 14,180,937 facts, and 4,904,397 entities, 7,523 relations, and 22,441,880 facts, respectively. The sentences used for training the multi-granularity expressions were extracted from the ClueWeb12 data set; the sentence-fact pairs used for supplementary training were also extracted from ClueWeb12, with entity linking completed through the FACC1 annotation data associated with ClueWeb12. The evaluation criterion is answer accuracy: a question is counted as correctly answered if the head entity and the relation of the knowledge base fact used to answer it are both selected correctly.
Method                FB2M    FB5M
Memory Networks       62.7    63.9
Character-level QA    70.9    70.3
CFO                   N/A     75.7
Ours                  79      78.3
TABLE 1
Table 1 compares the answer accuracy of an embodiment of the present invention with that of prior methods.
As shown in Table 1, the method achieves the best accuracy on knowledge bases of different scales when compared with the existing methods. Among them, Memory Networks (Bordes, A., Usunier, N., Chopra, S., & Weston, J. (2015). Large-scale Simple Question Answering with Memory Networks. CoRR, abs/1506.02075) uses the word granularity of the question and the entity and relation granularity information of the knowledge base; Character-Level QA (Golub, D., & He, X. (2016). Character-Level Question Answering with Attention. arXiv:1604.00727) uses the letter granularity information of the question and the knowledge base; and CFO (Dai, Z., Li, L., & Xu, W. (2016). CFO: Conditional Focused Neural Question Answering with Large-scale Knowledge Bases. arXiv:1606.01994) uses the entity information of the knowledge base.
Multi-grained    Character-level    Word-level    Phrase-level
78.3             74.0               70.3          71.2
TABLE 2
Table 2 compares the answer accuracy of the method with simplified versions of itself.
Table 2 further demonstrates the effect of using information at different granularities on answer accuracy. The last three items in the table are simplified versions of the method that use the letter, word, or phrase granularity information alone; several steps are omitted from the training of the embedding table, and the hierarchical matching process is simplified to single-granularity matching followed by averaging. As shown in Table 2, no single granularity used alone achieves the effect of using the granularities together. Among the single granularities, letter granularity performs best; the reason lies in the SimpleQuestions data set, where most entity mentions directly repeat the name of the entity and the relation types are few, so letter granularity information alone can already achieve a fairly good result. Using word or phrase granularity alone is less effective because, when faced with words and phrases that did not appear in the training set, only synonymous words and phrases can be matched and the matching of similar letter sequences is missing.
FIG. 8 is a diagram of the relationship between the number of question-answer pairs and the accuracy of the present invention; FIG. 9 is a diagram of the relationship between the number of sentence-fact pairs and the accuracy; FIG. 10 is a diagram of the relationship between the number of synonymous question pairs and the accuracy. FIG. 8 shows the accuracy as the number of question-answer pairs used for training is adjusted, and FIGS. 9 and 10 show the change in accuracy as the number of sentence-fact pairs is changed or as the sentence-fact pairs are replaced with synonymous questions. As the figures show, when sentence-fact pairs are used, reducing the number of question-answer pairs initially has no significant effect on the accuracy; the effect becomes significant only when the number falls to 8,000 or below, which shows that with sentence-fact pairs only a small number of question-answer pairs is needed to obtain a good training effect. Changing the number of sentence-fact pairs, or replacing the sentence-fact pairs with synonymous questions and changing the number of synonymous question pairs, shows that the effect brought by synonymous questions is not as pronounced as that of the sentence-fact pairs: synonymous questions can only cope with the phrasing variability of questions, and coping with phrasing variability alone is not enough. Multi-granularity information must be combined with the sentence-fact pairs so that, while phrasing variability is handled, the coverage of answerable questions is also improved, thereby improving the accuracy.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A question-answer processing method based on a knowledge base is characterized by comprising the following steps:
determining a candidate information pair according to a target problem and a knowledge base, wherein the information pair comprises information of an entity and information of the relation between the entity and other entities;
inquiring the embedding position data of each letter, each word and each phrase corresponding to the target problem and the candidate information obtained by the trained embedding table, wherein the embedding position is a low-dimensional real number vector with the same dimension corresponding to any one of the letters, the words and the phrases and is used for representing the position of any one of the letters, the words and the phrases in a semantic space; the embedded position data is data corresponding to the embedded position of any one of the letters, the words and the phrases;
calculating the matching degree score of the target problem and each candidate information pair according to each embedded position data;
determining a target information pair in the candidate information pair according to the matching degree score of the target problem and each candidate information pair;
inquiring a knowledge base according to the target information pair to obtain an answer corresponding to the question;
wherein calculating the matching degree score between the target question and each candidate information pair according to the embedding position data comprises:
adjusting the embedding position data of the target question and the candidate information pairs according to the contextual letters, words and phrases of the target question and the candidate information pairs, to obtain processed embedding position data;
determining attention representations of the letters, words and phrases of the candidate information pairs according to the processed embedding position data and a trained attention model, wherein the attention model adjusts the attention paid to an information pair according to the question, so as to reflect the matching degree between the question and the information pair;
wherein determining the attention representations of the letters, words and phrases of the candidate information pairs according to the processed embedding position data and the trained attention model comprises:
calculating the attention representation of any letter, word or phrase in a candidate information pair by applying a preset attention formula to the processed embedding position data of the corresponding letter, word or phrase in the target question, the processed embedding position data of that letter, word or phrase in the candidate information pair, and a model constant determined by training the attention model;
determining the importance degree of the letters of each candidate information pair relative to each letter in the target question according to the attention representations of the letters of each candidate information pair; determining the importance degree of the words of each candidate information pair relative to each word in the target question according to the attention representations of the words of each candidate information pair; and determining the importance degree of the phrases of each candidate information pair relative to each phrase in the target question according to the attention representations of the phrases of each candidate information pair;
obtaining, according to the processed embedding position data of the letters, words and phrases in the target question and the importance degrees: a letter matching degree score of the letters of each candidate information pair relative to the letters in the target question; a word matching degree score of the words of each candidate information pair relative to the words in the target question; and a phrase matching degree score of the phrases of each candidate information pair relative to the phrases in the target question;
and obtaining the matching degree score of each candidate information pair according to the letter matching degree score, the word matching degree score and the phrase matching degree score.
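The claim fixes the inputs of the attention computation (the processed embedding positions of corresponding units on both sides, plus a model constant learned with the attention model) but does not disclose the formula itself. The following minimal numpy sketch shows one plausible instantiation at a single granularity: a bilinear attention whose softmax-normalized rows serve as importance degrees, with a cosine similarity as the per-unit matching score. All names (`match_scores`, `W`, and so on) are illustrative assumptions, not the patent's disclosure.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def match_scores(question_emb, pair_emb, W):
    """Score one granularity (letters, words, or phrases).

    question_emb : (n_q, d) processed embedding positions of the question units
    pair_emb     : (n_p, d) processed embedding positions of the info-pair units
    W            : (d, d)   model constant learned when training the attention model
    """
    # Attention representation input: bilinear compatibility between every
    # question unit and every information-pair unit.
    att = question_emb @ W @ pair_emb.T            # (n_q, n_p)
    # Importance degree of each info-pair unit relative to each question unit.
    importance = softmax(att, axis=1)              # each row sums to 1
    # Attended representation of the info pair, one row per question unit.
    att_repr = importance @ pair_emb               # (n_q, d)
    # Matching degree score per question unit: cosine similarity between the
    # question unit and the attended info-pair representation.
    num = (question_emb * att_repr).sum(axis=1)
    den = np.linalg.norm(question_emb, axis=1) * np.linalg.norm(att_repr, axis=1)
    return num / np.maximum(den, 1e-9)             # (n_q,)

rng = np.random.default_rng(0)
d = 8
q_words = rng.normal(size=(5, d))   # 5 question words
p_words = rng.normal(size=(3, d))   # 3 info-pair words
W = rng.normal(size=(d, d))
print(match_scores(q_words, p_words, W))
```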
2. The method of claim 1, wherein determining the candidate information pairs according to the target question and the knowledge base comprises:
extracting all word strings in the target question, wherein the length of each word string is greater than or equal to 1 and less than or equal to the length of the question;
determining M word strings among the extracted word strings, wherein M is any integer greater than or equal to 1;
determining K entities in the knowledge base according to the M word strings, wherein K is any integer greater than or equal to 1;
wherein the candidate information pairs comprise the relations between the K entities and corresponding entities.
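As a rough illustration of this pipeline, the sketch below enumerates every word string of a question and looks candidates up in a toy alias index that stands in for the knowledge base's entity names; `ALIAS_INDEX` and all other names are hypothetical, not part of the claimed method.

```python
def extract_word_strings(question_tokens):
    """Enumerate every contiguous word string of length 1..len(question)."""
    n = len(question_tokens)
    return [" ".join(question_tokens[i:j])
            for i in range(n) for j in range(i + 1, n + 1)]

# Toy alias index: surface form -> entity ids (stand-in for the knowledge base).
ALIAS_INDEX = {
    "barack obama": ["m.obama"],
    "obama": ["m.obama", "m.obama_sr"],
}

def candidate_entities(word_strings, k):
    """Collect up to K distinct entities whose aliases match a word string."""
    seen, out = set(), []
    for ws in word_strings:
        for ent in ALIAS_INDEX.get(ws, []):
            if ent not in seen:
                seen.add(ent)
                out.append(ent)
    return out[:k]

tokens = "where was barack obama born".split()
print(candidate_entities(extract_word_strings(tokens), k=5))
```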
3. The method of claim 2, wherein said determining M word strings comprises:
deleting, from all the word strings, those containing interrogative pronouns;
deleting word strings that are stop words;
retaining word strings judged to be entity names or parts of entity names;
and selecting the M longest word strings from the remaining word strings.
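A minimal sketch of these filtering heuristics follows, assuming simple hard-coded interrogative and stop-word sets; the claim does not specify how entity-name membership is judged, so a naive substring test is used here as an assumption.

```python
QUESTION_WORDS = {"who", "what", "where", "when", "which", "why", "how"}
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was"}

def select_m_word_strings(word_strings, entity_names, m):
    """Apply the claim-3 heuristics and keep the M longest survivors."""
    remaining = []
    for ws in word_strings:
        # Retained regardless: the string is an entity name or part of one
        # (substring test is an illustrative assumption).
        if any(ws in name for name in entity_names):
            remaining.append(ws)
            continue
        if any(t in QUESTION_WORDS for t in ws.split()):
            continue                      # deleted: contains an interrogative
        if ws in STOP_WORDS:
            continue                      # deleted: the string is a stop word
        remaining.append(ws)
    return sorted(remaining, key=len, reverse=True)[:m]

names = {"barack obama"}
strings = ["where", "barack obama", "obama", "born", "the", "was barack"]
print(select_m_word_strings(strings, names, m=2))
# -> ['barack obama', 'was barack']
```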
4. The method of claim 2, wherein said determining K entities in said knowledge base from said M word strings comprises:
for the entities corresponding to each word string, retaining the L entities that appear most frequently as head entities in the facts of the knowledge base, so as to obtain the K entities.
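This popularity filter amounts to counting how often each matched entity heads a fact triple and keeping the top L, as in the short sketch below (triple format and names are assumptions):

```python
from collections import Counter

def top_l_head_entities(matched_entities, facts, l):
    """Keep, among the entities matched by a word string, the L that occur
    most often as the head of a (head, relation, tail) fact."""
    head_counts = Counter(head for head, _, _ in facts)
    ranked = sorted(matched_entities, key=lambda e: head_counts[e], reverse=True)
    return ranked[:l]

facts = [("m.obama", "born_in", "m.honolulu"),
         ("m.obama", "spouse", "m.michelle"),
         ("m.obama_sr", "born_in", "m.kenya")]
print(top_l_head_entities(["m.obama", "m.obama_sr"], facts, l=1))  # ['m.obama']
```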
5. The method according to any one of claims 1 to 4, wherein the embedding table is obtained by training on natural language data within a first preset range;
wherein obtaining the embedding table by training on the natural language data within the first preset range comprises:
obtaining an untrained embedding table;
extracting natural language sentences within the first preset range;
training the embedding table according to the extracted sentences and the facts contained in the sentences;
and obtaining a trained embedding table, which contains the letters, words and phrases of questions with their corresponding embedding position data, and the letters, words and phrases of information pairs with their corresponding embedding position data.
6. The method of claim 5, wherein training the embedding table according to the extracted sentences and the facts contained in the sentences comprises:
obtaining letters, words and phrases, and their corresponding embedding position data, from the extracted sentences and the facts in the sentences;
and, for each obtained letter, word and phrase, predicting it from the surrounding letters, words and phrases and from the letters, words and phrases of its lower-level semantics, so as to train the embedding position data of the obtained letters, words and phrases.
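The prediction-from-context training resembles a CBOW-style objective extended to letters, words and phrases. The claims do not give the loss, so the sketch below uses a squared-distance stand-in: each unit's embedding position is nudged toward the mean of its surrounding units and its lower-level units. All vocabulary entries and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["bei", "jing", "beijing", "capital", "china"]
d = 16
emb = {u: rng.normal(scale=0.1, size=d) for u in vocab}  # untrained table

def train_step(target, context, lr=0.05):
    """One CBOW-style step under an assumed squared-distance objective:
    pull the target's embedding position toward the mean of its context,
    and the context positions toward the target."""
    ctx = np.mean([emb[c] for c in context], axis=0)
    grad = emb[target] - ctx          # gradient of 0.5 * ||target - ctx||^2
    emb[target] -= lr * grad
    for c in context:
        emb[c] += lr * grad / len(context)

# The word "beijing" is predicted from surrounding words and from its
# lower-level units ("bei", "jing").
train_step("beijing", ["capital", "china", "bei", "jing"])
print(emb["beijing"][:4])
```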
7. The method of claim 1, wherein obtaining the matching degree score of each candidate information pair according to the letter matching degree score, the word matching degree score and the phrase matching degree score comprises:
for each candidate information pair, determining its matching degree score through the following process:
for each word in the target question, calculating a first average of the letter matching degree scores of the letters in that word;
comparing the word matching degree score of each word in the target question with the corresponding first average, and taking the larger value as the determined matching degree score of that word;
for each phrase, calculating a second average of the determined matching degree scores of the words in that phrase;
comparing the phrase matching degree score of each phrase in the target question with the corresponding second average, and taking the larger value as the determined matching degree score of that phrase;
and taking the average of the determined matching degree scores of the phrases of the target question as the matching degree score of the candidate information pair.
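This hierarchical max/average pooling can be written out directly. The sketch below follows claim 7's procedure step by step on toy inputs; only the index structures `word_to_letters` and `phrase_to_words` (mapping each word to its letters and each phrase to its words) are assumed conveniences.

```python
def pair_matching_score(letter_scores, word_scores, phrase_scores,
                        word_to_letters, phrase_to_words):
    """Combine the three granularities as claim 7 describes."""
    # Word level: max(word score, first average over the word's letters).
    determined_words = []
    for i, ws in enumerate(word_scores):
        letters = word_to_letters[i]
        first_avg = sum(letter_scores[k] for k in letters) / len(letters)
        determined_words.append(max(ws, first_avg))
    # Phrase level: max(phrase score, second average over the phrase's words).
    determined_phrases = []
    for j, ps in enumerate(phrase_scores):
        words = phrase_to_words[j]
        second_avg = sum(determined_words[k] for k in words) / len(words)
        determined_phrases.append(max(ps, second_avg))
    # Pair score: average of the determined phrase scores.
    return sum(determined_phrases) / len(determined_phrases)

# Toy question with two words forming one phrase.
score = pair_matching_score(
    letter_scores=[0.2, 0.4, 0.6, 0.8],
    word_scores=[0.5, 0.1],
    phrase_scores=[0.3],
    word_to_letters=[[0, 1], [2, 3]],
    phrase_to_words=[[0, 1]],
)
print(score)  # max(0.3, mean(max(0.5, 0.3), max(0.1, 0.7))) = 0.6
```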
8. The method of claim 1, wherein the parameters in the attention model are determined by training according to a given question, a first information pair corresponding to the given question, and a second information pair not corresponding to the given question, and wherein the given question and the first information pair are obtained from the knowledge base and determined by manual labeling.
9. The method of claim 1, wherein the parameters in the attention model are determined by training according to a given question, a first information pair corresponding to the given question, a second information pair not corresponding to the given question, and natural language data within a second preset range, and wherein the given question and the first information pair are obtained from the knowledge base and determined by manual labeling.
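Claims 8 and 9 describe training on triples of a question, a corresponding (positive) information pair, and a non-corresponding (negative) one, which is the standard setup for a pairwise ranking objective. The claims do not state the loss; the margin/hinge form below is an assumption, with illustrative names.

```python
def hinge_ranking_loss(score_pos, score_neg, margin=0.5):
    """Margin loss over a (question, correct pair, incorrect pair) triple:
    penalize the model unless the correct pair outscores the incorrect one
    by at least the margin. The hinge form is assumed, not disclosed."""
    return max(0.0, margin - score_pos + score_neg)

print(hinge_ranking_loss(0.9, 0.2))  # 0.0: correct pair already ranked above
print(hinge_ranking_loss(0.4, 0.3))  # 0.4: margin violated, gradient flows
```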
CN201711111378.9A 2017-11-13 2017-11-13 Question-answer processing method based on knowledge base Active CN107798126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711111378.9A CN107798126B (en) 2017-11-13 2017-11-13 Question-answer processing method based on knowledge base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711111378.9A CN107798126B (en) 2017-11-13 2017-11-13 Question-answer processing method based on knowledge base

Publications (2)

Publication Number Publication Date
CN107798126A CN107798126A (en) 2018-03-13
CN107798126B (en) 2021-11-02

Family

ID=61535936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711111378.9A Active CN107798126B (en) 2017-11-13 2017-11-13 Question-answer processing method based on knowledge base

Country Status (1)

Country Link
CN (1) CN107798126B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364066B (en) * 2017-11-30 2019-11-08 中国科学院计算技术研究所 Artificial neural network chip and its application method based on N-GRAM and WFST model
CN109002434A (en) * 2018-05-31 2018-12-14 青岛理工大学 Customer service question and answer matching process, server and storage medium
CN109271504B (en) * 2018-11-07 2021-06-25 爱因互动科技发展(北京)有限公司 Inference dialogue method based on knowledge graph
US11727243B2 (en) 2019-01-30 2023-08-15 Baidu Usa Llc Knowledge-graph-embedding-based question answering
CN110096567B (en) * 2019-03-14 2020-12-25 中国科学院自动化研究所 QA knowledge base reasoning-based multi-round dialogue reply selection method and system
US10991365B2 (en) * 2019-04-08 2021-04-27 Microsoft Technology Licensing, Llc Automated speech recognition confidence classifier

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499277A (en) * 2008-07-25 2009-08-05 中国科学院计算技术研究所 Service intelligent navigation method and system
CN101986293A (en) * 2010-09-03 2011-03-16 百度在线网络技术(北京)有限公司 Method and equipment for displaying search answer information on search interface
CN105786871A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Question-answer search result display method and device based on search terms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhou Botong et al., "LSTM-based Automatic Question Answering over Large-scale Knowledge Bases," Journal of Peking University (Natural Science Edition), 2017, Vol. 54, No. 2. *

Also Published As

Publication number Publication date
CN107798126A (en) 2018-03-13

Similar Documents

Publication Publication Date Title
CN107798126B (en) Question-answer processing method based on knowledge base
US10255275B2 (en) Method and system for generation of candidate translations
JP7164701B2 (en) Computer-readable storage medium storing methods, apparatus, and instructions for matching semantic text data with tags
US6684201B1 (en) Linguistic disambiguation system and method using string-based pattern training to learn to resolve ambiguity sites
CN103154936B (en) For the method and system of robotization text correction
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
Banerjee et al. WikiWrite: Generating Wikipedia Articles Automatically.
CN109615001B (en) Method and device for identifying similar articles
Chen et al. Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features
CN113239666B (en) Text similarity calculation method and system
CN113553806B (en) Text data enhancement method, device, equipment and medium
US20220351634A1 (en) Question answering systems
KR101646461B1 (en) Method for korean dependency parsing using deep learning
CN117251524A (en) Short text classification method based on multi-strategy fusion
Shams Semi-supervised classification for natural language processing
Nallapati et al. Sengen: Sentence generating neural variational topic model
Asadi et al. Real-Time Presentation Tracking Using Semantic Keyword Spotting.
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
JP5542732B2 (en) Data extraction apparatus, data extraction method, and program thereof
CN114896382A (en) Artificial intelligent question-answering model generation method, question-answering method, device and storage medium
CN110442863B (en) Short text semantic similarity calculation method, system and medium thereof
CN114036956A (en) Tourism knowledge semantic analysis method and device
JP5342574B2 (en) Topic modeling apparatus, topic modeling method, and program
CN108829659B (en) Reference identification method, reference identification equipment and computer-storable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant