CN109933777B

CN109933777B - Knowledge base expanding device

Info

Publication number: CN109933777B
Application number: CN201711362321.6A
Authority: CN
Inventors: 陈培华; 朱频频
Original assignee: Shanghai Xiaoi Robot Technology Co Ltd
Current assignee: Shanghai Xiaoi Robot Technology Co Ltd
Priority date: 2017-12-18
Filing date: 2017-12-18
Publication date: 2024-02-06
Anticipated expiration: 2037-12-18
Also published as: CN109933777A

Abstract

A knowledge base expansion apparatus, comprising: a data acquisition module for acquiring a question to be expanded and performing word segmentation on it to obtain a plurality of original words, wherein the question to be expanded is a standard question or an extended question in a knowledge base; a replacement module for performing related word replacement on the plurality of original words to obtain a plurality of extended questions, each formed by combining original words with related words or related words with related words; a judging module for judging the validity of the combination between adjacent words, including the related words, in each extended question; a filtering module for filtering the plurality of extended questions according to the judging result; and an output module for adding the filtered extended questions to the knowledge base as extended questions of the question to be expanded. With this technical scheme, extended questions can be generated automatically and their validity is ensured.

Description

Knowledge base expanding device

Technical Field

The invention relates to the technical field of natural language processing, in particular to a knowledge base expansion device.

Background

In the prior art, a knowledge base for question answering typically includes a plurality of knowledge points, each of which includes a standard question, one or more extended questions corresponding to the standard question, and an answer. To improve question-answering accuracy, each standard question needs to be expanded into as many extended questions as possible. The extended questions in a knowledge base are typically written manually, or are generated using semantic templates and semantic expressions.

However, the number of extended questions that the prior-art generation methods can produce is limited. In addition, the generated extended questions include invalid ones, which occupy system resources and prevent user questions from being matched with the standard questions and extended questions, so that the accuracy of question answering is affected.

Disclosure of Invention

The technical problem solved by the invention is how to generate extended questions automatically while ensuring their validity.

In order to solve the above technical problems, an embodiment of the present invention provides a knowledge base expansion device, including:

the word vector model training module is used for training a word vector model by utilizing a preset original corpus;

the updating module is used for acquiring a plurality of groups of newly added related words by using the trained word vector model and updating a synonym dictionary for replacing the related words; the updating module comprises: the first word vector calculation unit is used for obtaining word vectors of all words in the preset original corpus by using the trained word vector model; the first related word determining unit is used for determining the plurality of groups of newly added related words according to the distances among word vectors;

the data acquisition module is used for acquiring a question to be expanded and performing word segmentation on the question to be expanded to obtain a plurality of original words, wherein the question to be expanded is a standard question or an extended question in a knowledge base;

the replacement module is used for carrying out related word replacement on the plurality of original words so as to obtain a plurality of extended question sentences formed by combining the original words with the related words or combining the related words with the related words;

the judging module is used for judging the combination effectiveness between adjacent words including the related words in each extended question;

the filtering module is used for filtering the plurality of extended questions according to the judging result;

and the output module is used for adding the filtered extended questions to the knowledge base as extended questions of the question to be expanded.

Optionally, the replacing module includes:

the first replacement unit is used for replacing the related words of the plurality of original words by using word classes corresponding to the plurality of original words;

and the second replacing unit is used for replacing the synonyms of the plurality of original words by using a synonym dictionary.

Optionally, the judging module includes:

a combination probability determining unit configured to determine a combination probability between neighboring words including the related word in each extended question;

the effective score calculating unit is used for calculating the effective score of the extended question by using the combined probability;

a storage unit for storing the set threshold value;

and the comparison unit is used for comparing the effective score of the extended question with a set threshold value to obtain the judgment result.

Optionally, the filtering module includes:

and the reservation unit is used for reserving the extended question sentence as an extended question of the to-be-extended question when the judging result shows that the effective score of the extended question sentence reaches the set threshold value.

Optionally, the effective score calculating unit calculates a sum of the combination probabilities as an effective score of the extended question.

Optionally, the combination probability determining unit determines the combination probability between adjacent words including the related word in each extended question using a Chinese language model or a neural network language model.

Optionally, the knowledge base expansion device further includes:

the language model training module is used for training the Chinese language model or the neural network language model by utilizing a preset original corpus.

Optionally, the knowledge base includes a plurality of knowledge points, each knowledge point including a standard question, one or more extended questions, and an answer.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

according to the technical scheme, the problem to be expanded is obtained, and word segmentation is carried out on the problem to be expanded, so that a plurality of original words are obtained; performing related word replacement on the plurality of original words to obtain a plurality of expanded question sentences formed by combining the original words with the related words or combining the related words with the related words; judging the combination effectiveness between adjacent words including the related words in each extended question; and filtering the plurality of expansion questions according to the judging result. In the technical scheme of the invention, each word corresponds to a large number of related words, so that a large number of expanded question sentences can be obtained by replacing related words of the original words in the problem to be expanded; in addition, the validity of the combination between adjacent words comprising the related words in the extended question sentence is judged to filter invalid extended questions, so that the validity of the formed extended questions is ensured; and further, the formed extended questions can be matched with the user questions, and the timeliness and accuracy of the subsequent user questions and answers are improved.

Further, performing related word replacement on the plurality of original words includes: performing related word replacement on the plurality of original words by using the word classes corresponding to the plurality of original words; or performing synonym replacement on the plurality of original words by using a synonym dictionary. In the technical scheme of the invention, related word replacement can be performed by using a word class or a synonym dictionary. Because the word class and the synonym dictionary contain words semantically similar to the original words, replacement with them yields a plurality of extended questions that are semantically similar to the question to be expanded. In addition, the word class and the synonym dictionary can be updated and their vocabulary enlarged, which ensures the number of extended questions obtained.

Further, the judging the combination validity between the adjacent words including the related words in each extended question includes: determining the combination probability between adjacent words including the related words in each extended question; calculating the effective score of the extended question by using the combined probability; and comparing the effective score of the extended question with a set threshold value to obtain the judging result. In the technical scheme of the invention, the combination probability between adjacent words can represent the effectiveness of grammar combination of the adjacent words; the validity score of the extended question can be calculated through the combination probability between adjacent words of the extended question so as to represent the validity of the extended question in grammar, so that the extended question can be judged according to the validity score of the extended question and a set threshold value, and the validity judgment accuracy of the extended question is ensured.

Further, before the question to be expanded is acquired, a word vector model is trained by using a preset original corpus, and a plurality of groups of newly added related words are obtained by using the trained word vector model to update the synonym dictionary used for related word replacement. In the technical scheme of the invention, the groups of related words obtained through the trained word vector model enlarge the vocabulary of the synonym dictionary, so that more extended questions can be obtained when the synonym dictionary is used for related word replacement. In addition, the related words obtained through the word vector model are of higher quality, which improves the quality of the extended questions subsequently obtained with the synonym dictionary.

Drawings

FIG. 1 is a schematic diagram of a knowledge base expansion device according to an embodiment of the present invention;

FIG. 2 is a schematic diagram showing a specific structure of the judging module shown in FIG. 1;

FIG. 3 is a schematic diagram of a portion of another knowledge base expansion device according to an embodiment of the invention;

FIG. 4 is a schematic diagram showing a specific configuration of the update module shown in FIG. 3;

FIG. 5 is a schematic diagram of another embodiment of the update module shown in FIG. 3.

Detailed Description

As described in the Background, the number of extended questions that the prior-art generation methods can produce is limited. In addition, the generated extended questions include invalid ones, which occupy system resources and prevent user questions from being matched with the standard questions and extended questions, so that the accuracy of question answering is affected.

In the technical scheme of the invention, each word corresponds to a large number of related words, so that a large number of expansion questions can be obtained by replacing related words of the original words in the questions to be expanded; in addition, the validity of the combination between adjacent words comprising the related words in the extended question sentence is judged to filter invalid extended questions, so that the validity of the formed extended questions is ensured; and further, the formed extended questions can be matched with the user questions, and the timeliness and accuracy of the subsequent user questions and answers are improved.

In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.

As shown in fig. 1, an embodiment of the present invention provides a knowledge base extension apparatus 60, which may include:

the data acquisition module 601 is configured to acquire a question to be expanded and perform word segmentation on it to obtain a plurality of original words, where the question to be expanded is a standard question or an extended question in a knowledge base;

a replacement module 602, configured to perform related word replacement on the plurality of original words to obtain a plurality of extended question sentences in which the original words and the related words are combined together or the related words and the related words are combined together;

a judging module 603, configured to judge the validity of the combination between the adjacent words including the related word in each extended question;

the filtering module 604 is configured to filter the plurality of extended questions according to a determination result;

and the output module (not shown in the figure) is used for adding the filtered extended questions to the knowledge base as extended questions of the question to be expanded.

Since the subsequent related word replacement operates on words, in a specific implementation the data acquisition module 601 performs word segmentation on the question to be expanded to obtain its plurality of original words. Specifically, the question to be expanded may be text. If the question to be expanded is speech, the speech must first be converted into text, and word segmentation is then performed.

By replacing the plurality of original words with related words, a plurality of extended questions can be obtained. These include extended questions formed by combining original words with related words and extended questions formed by combining related words with related words. Specifically, each original word has corresponding related words, and during related word replacement each original word is replaced with one of its corresponding related words.

For example, original word 1 and original word 2 are obtained after word segmentation of the question to be expanded; original word 1 corresponds to related word 1 and related word 2, and original word 2 corresponds to related word A and related word B. Related word 1 and related word 2 may then replace original word 1, and related word A and related word B may replace original word 2. After related word replacement, the following extended questions are formed: related word 1 and original word 2; related word 2 and original word 2; original word 1 and related word A; original word 1 and related word B; related word 1 and related word A; related word 2 and related word A; related word 1 and related word B; and related word 2 and related word B.
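The enumeration in this example can be sketched as a Cartesian product over the choices available at each word position. The following is a minimal illustration only; the function name `expand_question` and the dictionary layout are assumptions and do not appear in the original disclosure.

```python
from itertools import product

def expand_question(original_words, related_words):
    """Return every word combination that differs from the original question.

    related_words is a hypothetical mapping: original word -> list of related words.
    """
    # Each position may keep its original word or take any of its related words.
    choices = [[w] + list(related_words.get(w, [])) for w in original_words]
    original = tuple(original_words)
    # Drop the combination that is identical to the question to be expanded.
    return [list(combo) for combo in product(*choices) if combo != original]

related = {
    "original word 1": ["related word 1", "related word 2"],
    "original word 2": ["related word A", "related word B"],
}
extended = expand_question(["original word 1", "original word 2"], related)
```

With two related words per original word there are 3 x 3 = 9 combinations, and removing the unchanged original leaves the eight extended questions listed above.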

It may be appreciated that the related word corresponding to an original word may be a word semantically similar to the original word, for example a word whose semantic similarity to the original word is greater than a preset value.

After obtaining the plurality of extended questions, invalid extended questions may exist among the plurality of extended questions. The invalid extension question may be an extension question that does not conform to the grammatical standard. Since the invalid extension questions cannot be matched with the user questions when in use, it is necessary to filter the invalid extension questions and to retain valid extension questions other than the invalid extension questions.

The judging module 603 may obtain a judging result of the validity of the combination between every two adjacent words. And determining whether the extended question is valid according to the judging result of the combined validity between every two adjacent words in the extended question, so that the filtering module 604 can filter invalid extended questions and keep valid extended questions.

In the embodiment of the invention, because each word corresponds to a large number of related words, a large number of extended questions can be obtained by performing related word replacement on the original words in the question to be expanded; in addition, invalid extended questions are filtered out by judging the validity of the combination between adjacent words, including the related words, in each extended question, so that the validity of the formed extended questions is ensured; further, the formed extended questions can be matched with user questions, which improves the timeliness and accuracy of subsequent question answering.

The method and the device can be applied to automatic generation of the extended questions in the robot question-answering system, enrich the question-answering knowledge base of the robot and improve the question-answering effect of the robot.

Whether an extended question is invalid is determined by judging the validity of the combination between adjacent words, including the related words, in the extended question. That is, by judging whether the combination between every two adjacent words in the extended question is valid, it can be determined whether the extended question is valid. If the combination of at least one pair of adjacent words in the extended question is invalid, the extended question is an invalid extended question.

In particular, the validity of a combination between adjacent words may refer to whether the combination between adjacent words meets grammatical criteria.

It should be noted that, the determination of whether the combination between adjacent words meets the grammar standard may be implemented in any manner, which is not limited in this embodiment of the present invention.

The output module is used for outputting the filtered multiple extended questions to join the knowledge base, wherein the filtered multiple extended questions are effective extended questions of the to-be-extended questions.

In this embodiment, after passing through the filtering module 604, a plurality of filtered extended questions are obtained. The filtered multiple extended questions are effective extended questions of the questions to be extended, namely the extended questions which accord with the grammar standard. Therefore, the output module can output the filtered multiple extended questions to the knowledge base. More specifically, the output module adds the filtered multiple expansion questions into knowledge points to which the problem to be expanded belongs.
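As a rough illustration of the output step, a knowledge point can be modeled as a standard question, an answer, and a list of extended questions to which the filtered results are appended. The class name, field names, sample strings, and the duplicate check below are assumptions made for this sketch; the disclosure does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgePoint:
    """Hypothetical in-memory form of one knowledge point."""
    standard_question: str
    answer: str
    extended_questions: List[str] = field(default_factory=list)

def add_extended_questions(point: KnowledgePoint, filtered: List[str]) -> KnowledgePoint:
    """Append filtered extended questions to the knowledge point they belong to."""
    for question in filtered:
        # Skipping duplicates is an assumption of this sketch, not a stated requirement.
        is_duplicate = (question == point.standard_question
                        or question in point.extended_questions)
        if not is_duplicate:
            point.extended_questions.append(question)
    return point

point = KnowledgePoint("how to open an account", "Visit any branch with your ID.")
add_extended_questions(
    point,
    ["how to create an account", "how to open an account", "how to create an account"],
)
```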

In a preferred embodiment of the present invention, the replacement module 602 may include a first replacement unit, configured to perform related word replacement on the plurality of original words by using the word classes corresponding to the plurality of original words.

In this embodiment, related word replacement may be performed on the plurality of original words by using the word classes corresponding to the plurality of original words. Specifically, each word class may include a plurality of words; word classes may be partitioned according to the semantics of words, with a set of semantically related words grouped together to form a word class. In particular, a word class may consist of a word class name and a set of semantically related words. The word class name may be a word that serves as a label for the set of related words, i.e., a representative of the word class. A word class includes at least one word (i.e., the word class name itself). For example, a word class named "mobile" may include the words "mobile," "mobile phone," "telephone," and the like.

Since a word class includes words semantically similar to the original word, after the word class is used to replace the original word with related words, a plurality of extended questions semantically similar to the question to be expanded can be obtained. In addition, the word class can be updated and its vocabulary enlarged periodically, which ensures the number of extended questions obtained.
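A word class lookup of this kind can be sketched as a mapping from a word class name to its member words. The dictionary layout and the function name `related_by_word_class` are illustrative assumptions; only the "mobile" example comes from the description.

```python
# Hypothetical word class table: class name -> set of semantically related words.
WORD_CLASSES = {
    "mobile": {"mobile", "mobile phone", "telephone"},
}

def related_by_word_class(word, word_classes):
    """Return the other members of every word class that contains `word`."""
    related = set()
    for members in word_classes.values():
        if word in members:
            # The word itself is not a replacement candidate.
            related |= members - {word}
    return sorted(related)

candidates = related_by_word_class("telephone", WORD_CLASSES)
```

A synonym dictionary could be queried in the same way, with each synonym group playing the role of a word class.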

Alternatively, the replacement module 602 may include a second replacement unit to perform synonym replacement on the plurality of original terms using a synonym dictionary.

In the technical scheme of the invention, the synonym dictionary can be utilized for replacing related words. The synonym dictionary includes groups of semantically similar words. In practical applications, the synonym dictionary has multiple versions, and embodiments of the present invention do not limit the specific type of synonym dictionary.

Since the synonym dictionary includes words semantically similar to the original word, after the synonym dictionary is used for related word replacement, a plurality of extended questions semantically similar to the question to be expanded can be obtained. In addition, the synonym dictionary can be updated and its vocabulary enlarged periodically, which ensures the number of extended questions obtained.

As shown in fig. 2, the determining module 603 may include: a combination probability determination unit 6031 for determining a combination probability between adjacent words including the related word in each extended question; a valid score calculation unit 6032 for calculating a valid score of an extended question using the combined probability; a storage unit 6033 for storing the set threshold value; and a comparing unit 6034 for comparing the effective score of the extended question with a set threshold value to obtain the judging result.

The embodiment of the invention provides a specific implementation mode for obtaining the judging result. In particular implementations, the combined probability between neighboring words may represent the combined validity of neighboring words that include the related word. Any applicable algorithm or model may be used to determine the combined probability, and embodiments of the present invention are not limited in this regard.

Further, the combination probability between adjacent words including the related word in each extended question may be determined using a Chinese language model or a neural network language model.

In this embodiment, the Chinese language model or the neural network language model may be preconfigured. The segmented extended question is input into the Chinese language model or the neural network language model, which outputs the combination probability between every two adjacent words in the extended question.

In a specific implementation, the effective score calculating unit 6032 calculates the sum of the combination probabilities as the effective score of the extended question.

In particular implementations, the combination probability determination unit 6031 determines the combination probability between adjacent words including the related word in each extended question using a Chinese language model or a neural network language model.

Further, the knowledge base extension device 60 shown in fig. 5 may further include a language model training module for training the Chinese language model or the neural network language model by using a preset original corpus.

In this embodiment, before the question to be expanded is expanded, a preparatory step is required to obtain a trained Chinese language model or neural network language model, which is used to judge the validity of the combination of adjacent words.

Specifically, the preset original corpus may be a large amount of question-answer data. The question-answer data may be collected by a web crawler or may be manually compiled. Provided that the preset original corpus is large enough, the training effect of the Chinese language model or the neural network language model can be ensured, and in turn the accuracy with which the model judges the validity of the combination of adjacent words.
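The disclosure leaves the language model open. As one possible stand-in, the sketch below trains a count-based bigram model on an already-segmented corpus and returns a conditional probability for each adjacent word pair. The bigram formulation, the function names, and the tiny English corpus are all assumptions made for illustration, not the model the device necessarily uses.

```python
from collections import Counter

def train_bigram_model(corpus):
    """corpus: list of segmented sentences, each a list of words."""
    left_counts, pair_counts = Counter(), Counter()
    for sentence in corpus:
        left_counts.update(sentence[:-1])                 # words seen as a left neighbor
        pair_counts.update(zip(sentence, sentence[1:]))   # adjacent word pairs
    return left_counts, pair_counts

def combination_probabilities(words, model):
    """Estimate P(right | left) for every adjacent pair in a segmented question."""
    left_counts, pair_counts = model
    probabilities = []
    for left, right in zip(words, words[1:]):
        seen = left_counts[left]
        # Pairs whose left word never occurred get probability 0 (no smoothing here).
        probabilities.append(pair_counts[(left, right)] / seen if seen else 0.0)
    return probabilities

corpus = [
    ["how", "to", "open", "account"],
    ["how", "to", "close", "account"],
    ["how", "to", "open", "card"],
]
model = train_bigram_model(corpus)
probs = combination_probabilities(["how", "to", "open", "account"], model)
```

A practical system would normally add smoothing so that unseen but plausible pairs do not receive a probability of exactly zero.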

Further, the preset original corpus may be specific to a specific business domain, or may cover multiple business domains.

As previously described, the validity of the combinations between every two adjacent words in an extended question indicates the validity of the extended question. In the present embodiment, the validity of the combination of adjacent words is represented by a combination probability, so the combination probabilities can be used to calculate the effective score of the extended question. Specifically, the effective score may be a weighted average of the combination probabilities, a weighted sum of the combination probabilities, or the product of the combination probabilities, etc.
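The three scoring options named above can be written out as follows. The symbols are assumed notation introduced here for illustration: $p_i$ is the combination probability of the $i$-th adjacent word pair, $n$ is the number of adjacent pairs in the extended question, and $w_i$ is a non-negative weight assigned to the $i$-th pair.

```latex
% p_i : combination probability of the i-th adjacent word pair (assumed notation)
% w_i : non-negative weight of the i-th pair (assumed notation)
\begin{align*}
S_{\text{weighted average}} &= \frac{\sum_{i=1}^{n} w_i\, p_i}{\sum_{i=1}^{n} w_i}, \\
S_{\text{weighted sum}}     &= \sum_{i=1}^{n} w_i\, p_i, \\
S_{\text{product}}          &= \prod_{i=1}^{n} p_i .
\end{align*}
```

Setting every $w_i = 1$ reduces the weighted sum to the plain sum of the combination probabilities.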

Further, the sum of the combined probabilities may be calculated as the effective score of the extended question.

The effective score of an extended question represents its validity. Comparing the effective score with a set threshold yields the judging result of whether the extended question is valid. Specifically, the higher the effective score, the higher the validity of the extended question, and vice versa. If the effective score reaches the set threshold, the extended question is a valid extended question; otherwise, it is an invalid extended question.

It will be appreciated that, in practical applications, the specific value of the threshold may be adaptively configured according to the practical application environment, which is not limited in the embodiment of the present invention.

Further, the filtering module 604 may include a retaining unit, configured to retain the extended question as the extended question of the to-be-extended question when the judging result indicates that the effective score of the extended question reaches the set threshold.

As described above, if the validity score of the extended question reaches the set threshold, it indicates that the extended question is a valid extended question, and the extended question may be retained. The extended question will be an extended question of the question to be extended. The device can put the expansion question sentence and the problem to be expanded into a knowledge base as a knowledge point and is used for matching with the user problem.
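The scoring, comparison, and retention steps can be sketched together as below, using the sum of the combination probabilities as the effective score. The function names, the sample probabilities, and the threshold value are hypothetical and chosen only to make the example concrete.

```python
def effective_score(combination_probabilities):
    """Effective score as the sum of the combination probabilities."""
    return sum(combination_probabilities)

def retain_valid(candidates, threshold):
    """Keep the extended questions whose effective score reaches the threshold.

    candidates: list of (segmented question, list of combination probabilities).
    """
    retained = []
    for words, probabilities in candidates:
        if effective_score(probabilities) >= threshold:
            retained.append(words)
    return retained

candidates = [
    (["how", "to", "open", "account"], [0.5, 0.5, 0.25]),
    (["how", "to", "unlock", "account"], [0.5, 0.0, 0.0]),
]
valid = retain_valid(candidates, threshold=1.0)
```

Because a sum grows with the number of adjacent pairs, a threshold tuned for short questions may be too lenient for long ones; this is one reason the threshold is left configurable.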

Because the extended questions retained in the embodiment of the invention have been screened for validity, when the knowledge points are used for matching against user questions, failures to match can be avoided and the matching accuracy can be improved.

In yet another embodiment of the present invention, the question to be expanded is a standard question or a valid expanded question in a knowledge point. In this embodiment, a plurality of valid extension questions of the standard questions or the valid extension questions can be obtained by extending the standard questions or the valid extension questions in the knowledge points. On one hand, the number of the extended questions in the knowledge points is ensured, and on the other hand, the quality of the extended questions in the knowledge points is ensured; and further, when the knowledge points are utilized to conduct user question answering, the accuracy of answer answering can be improved.

Further, as shown in fig. 3, the knowledge base extension apparatus 60 may further include a word vector model training module 605 and an updating module 606. The word vector model training module 605 is configured to train a word vector model using a preset original corpus; the updating module 606 is configured to obtain multiple groups of newly added related words using the trained word vector model, and to update the synonym dictionary used for related word replacement.

In a specific implementation, the preset original corpus may be preconfigured, for example as a large amount of natural language data. After the word vector model is trained with the preset original corpus, the trained word vector model can produce word vectors of words. That is, a plurality of groups of related words in the preset original corpus are obtained by using the trained word vector model; by comparing them with the synonyms in the synonym dictionary, the groups of newly added related words among them can be determined. Adding the groups of newly added related words to the synonym dictionary expands the synonym dictionary. Further, each group of newly added related words includes a plurality of semantically similar words.

Specifically, the semantic similarity between words can be calculated according to word vectors of the words, and multiple groups of related words can be determined according to the semantic similarity between the words. For example, when the semantic similarity between two words is greater than a preset value, the two words are related words.
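As an illustration of this step, the sketch below uses cosine similarity between word vectors as the semantic similarity measure and reports every pair of words whose similarity exceeds a preset value. Cosine similarity is an assumption here (the description only refers to similarity and distance in general), and the toy two-dimensional vectors are invented for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_pairs(vectors, preset_value):
    """Return word pairs whose similarity is greater than the preset value."""
    words = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(words)
        for b in words[i + 1:]
        if cosine_similarity(vectors[a], vectors[b]) > preset_value
    ]

# Toy vectors; a trained word vector model would supply these in practice.
vectors = {
    "phone": [1.0, 0.0],
    "mobile": [2.0, 0.0],
    "bank": [0.0, 1.0],
}
pairs = related_pairs(vectors, preset_value=0.5)
```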

In the embodiment of the invention, a plurality of groups of related words are obtained through training a word vector model, so that the vocabulary of a synonym dictionary can be expanded; further, when the synonym dictionary is utilized to replace related words, more expanded question sentences can be obtained. In addition, the quality of a plurality of groups of related words obtained through the word vector model is higher, so that the quality of an expanded question sentence obtained by using a synonym dictionary later can be improved.

In this embodiment, the synonym dictionary may be used to perform related word replacement. In order to ensure the richness of the expanded question after the related word replacement is carried out on the original word, the synonym dictionary can be updated and expanded before the related word replacement is carried out on the original word by utilizing the synonym dictionary, so that the richness of the synonym dictionary is improved.

In a specific embodiment, as shown in fig. 4, the updating module 606 may include a first word vector calculating unit 6061, configured to obtain word vectors of all words in the preset original corpus by using the trained word vector model; the first related word determining unit 6062 is configured to determine the plurality of groups of newly added related words according to distances between word vectors.

In the embodiment of the invention, the synonym dictionary is provided with a plurality of groups of synonyms. The word vector model after training can be utilized to obtain the word vectors of a plurality of groups of synonyms and the word vectors of all words in the preset original corpus. For a set of synonyms, the related terms for each term in the set of synonyms may be calculated. That is, the related words of each word are determined according to the distance between the word vectors.

Thus, for the multiple words in a group of synonyms, multiple sets of related words can be obtained. Because these sets of related words may overlap, the sets of related words of all words in each group of synonyms are intersected to determine the groups of newly added related words.
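The intersection step can be sketched as follows. The function name, the `related_of` mapping, and the sample words are hypothetical; removing words already present in the synonym group is an assumption drawn from the earlier statement that newly added related words are found by comparison with the existing synonyms.

```python
def newly_added_related_words(synonym_group, related_of):
    """Intersect the related-word sets of every word in one synonym group.

    related_of: hypothetical mapping word -> set of related words found
    from word-vector distances.
    """
    candidate_sets = [set(related_of.get(word, ())) for word in synonym_group]
    if not candidate_sets:
        return set()
    # A candidate must be related to every word in the synonym group.
    common = set.intersection(*candidate_sets)
    # Words already in the group are not new additions.
    return common - set(synonym_group)

related_of = {
    "phone": {"mobile", "cellphone", "handset"},
    "mobile": {"phone", "cellphone", "car"},
}
new_words = newly_added_related_words(["phone", "mobile"], related_of)
```

Requiring a candidate to be related to every member of the group is stricter than taking the union, which trades recall for fewer spurious additions.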

In the embodiment of the invention, the distance between word vectors can represent the semantic similarity of the words corresponding to the word vectors. After the word vectors of all words in the preset original corpus are obtained by using the trained word vector model, the distance between every two word vectors can be calculated, and the groups of newly added related words are determined according to the distances. Specifically, when the distance between two word vectors reaches a preset value, the words corresponding to the two word vectors are determined to be related words.

In another embodiment, as shown in fig. 5, the updating module 606 may include a second word vector calculating unit 6063 configured to obtain word vectors of all words in each group of synonyms and word vectors of all words in the preset original corpus by using the trained word vector model; a related word calculation unit 6064 for determining related words of all words in each group of synonyms according to the distance between word vectors; the second related word determining unit 6065 is configured to intersect related words of all words in each set of synonyms to determine the multiple sets of newly added related words.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and the scope of protection of the invention shall therefore be defined by the appended claims.

Claims

1. A knowledge base extension apparatus, comprising:

2. The knowledge base extension apparatus of claim 1, wherein said replacement module comprises:

3. The knowledge base extension apparatus of claim 2, wherein said judgment module comprises:

a storage unit for storing the set threshold value;

4. A knowledge base extension apparatus according to claim 3, wherein the filtering module comprises:

5. A knowledge base extension apparatus according to claim 3, wherein the effective score calculating unit calculates a sum of the combination probabilities as an effective score of the extension question.

6. A knowledge base extension apparatus according to claim 3, wherein said combined probability determining unit determines a combined probability between adjacent words including said related word in each extended question using a chinese language model or a neural network language model.

7. The knowledge base extension apparatus of claim 6, further comprising:

8. The knowledge base extension device of claim 1, wherein said knowledge base comprises a plurality of knowledge points, each knowledge point comprising a standard question, one or more extension questions, and an answer.