CN114254090A - Question-answer knowledge base expansion method and device - Google Patents

Publication number
CN114254090A
Authority
CN
China
Prior art keywords
sentence
question
user
word
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111490800.2A
Other languages
Chinese (zh)
Inventor
曹磊
王洪斌
李长林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202111490800.2A
Publication of CN114254090A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a question and answer knowledge base expansion method and device, which are used for improving the quality of a question and answer knowledge base while saving manpower and improving the expansion efficiency. The method comprises the following steps: obtaining user statements from historical dialogue data; determining similarity between the user statement and a standard question sentence based on a first statement feature of the user statement, a word feature of a first word contained in the user statement, a second statement feature of the standard question sentence corresponding to the user statement in a question-and-answer knowledge base, and a word feature of a second word contained in the standard question sentence; and determining whether to add the user statement as a new standard question to the question-answer knowledge base or not based on the similarity between the user statement and the standard question.

Description

Question-answer knowledge base expansion method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for expanding a question and answer knowledge base.
Background
The dialogue system is a relatively mature application of natural language processing. Dialogue based on Frequently Asked Questions (FAQ) is the mainstream mode of such systems, and building a question-answer knowledge base of high-quality questions can effectively improve the dialogue effect.
At present, a question-answer knowledge base is mainly expanded in two ways. In the first, a business expert or an annotator manually writes questions, or manually annotates existing text data to obtain questions, and the resulting questions are added to the question-answer knowledge base. In the second, a deep learning model learns from the existing questions in the question-answer knowledge base to generate new questions, which are then added to the knowledge base. However, the first way depends on human experience: it cannot guarantee the accuracy of the expansion, and it requires a great deal of manpower, so the labor cost is high and the expansion efficiency is low. The second way saves labor, but the generated questions may be incomplete sentences, which degrades the quality of the question-answer knowledge base; it also needs a large number of existing questions as training samples and is therefore not easy to implement. How to improve the quality of the question-answer knowledge base while saving manpower and improving expansion efficiency is thus a problem that currently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a question and answer knowledge base expansion method and device, which are used for improving the quality of the question and answer knowledge base while saving manpower and improving the expansion efficiency.
In a first aspect, the present application provides a method for expanding a question-answer knowledge base, including:
obtaining user statements from historical dialogue data;
determining similarity between the user statement and a standard question sentence based on a first statement feature of the user statement, a word feature of a first word contained in the user statement, a second statement feature of the standard question sentence corresponding to the user statement in a question-and-answer knowledge base, and a word feature of a second word contained in the standard question sentence;
and determining whether to add the user statement as a new standard question to the question-answer knowledge base or not based on the similarity between the user statement and the standard question.
In a second aspect, the present application provides an apparatus for expanding a question-answer knowledge base, including:
the acquisition module is used for acquiring user sentences from historical dialogue data;
a similarity determination module, configured to determine a similarity between the user statement and the standard question sentence based on a first statement feature of the user statement, a word feature of a first word included in the user statement, a second statement feature of a standard question sentence corresponding to the user statement in a question-and-answer knowledge base, and a word feature of a second word included in the standard question sentence;
and the expansion module is used for determining whether to add the user statement into the question-answer knowledge base as a new standard question or not based on the similarity between the user statement and the standard question.
In a third aspect, the present application provides an electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of the first aspect.
It can be seen that, in the embodiments of the present application, whether to add a user sentence to the question-answer knowledge base as a new standard question is determined by analyzing the similarity between the user sentence and the corresponding standard question in the knowledge base, so that every question expanded into the knowledge base is similar to an existing standard question. Because the user sentence is obtained from historical dialogue data, incomplete sentences can be avoided, unlike questions generated by a deep learning model that learns from the existing questions in the knowledge base. This ensures the quality of the questions added to the question-answer knowledge base and therefore improves the quality of the knowledge base itself. On this basis, the similarity between the user sentence and the standard question is analyzed using their features in both the sentence dimension and the word dimension, which improves the accuracy of the similarity analysis result and further improves the quality of the added questions and of the knowledge base. In addition, the whole expansion process requires no manual participation, which saves labor and improves the expansion efficiency of the question-answer knowledge base.
Drawings
The accompanying drawings are included to provide a further understanding of the specification and constitute a part of it. They illustrate embodiments of the specification and, together with the description, serve to explain the specification without limiting it. In the drawings:
Fig. 1 is a schematic flowchart of a method for expanding a question-answer knowledge base according to an embodiment of the present disclosure;
Fig. 2 is a schematic flowchart of a standard question acquisition method according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of an apparatus for expanding a question-answer knowledge base according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the specification are capable of operation in sequences other than those illustrated or described herein. In the present specification and claims, "and/or" indicates at least one of the connected objects, and the character "/" generally indicates that the preceding and following related objects are in an "or" relationship.
As described above, a question-answer knowledge base is currently expanded in two main ways. In the first, a business expert or an annotator manually writes questions, or manually annotates existing text data to obtain questions, and the resulting questions are added to the question-answer knowledge base. In the second, a deep learning model learns from the existing questions in the question-answer knowledge base to generate new questions, which are then added to the knowledge base. However, the first way depends on human experience: it cannot guarantee the accuracy of the expansion, and it requires a great deal of manpower, so the labor cost is high and the expansion efficiency is low. The second way saves labor, but the generated questions may be incomplete sentences, which degrades the quality of the question-answer knowledge base; it also needs a large number of existing questions as training samples and is therefore not easy to implement. How to improve the quality of the question-answer knowledge base while saving manpower and improving expansion efficiency is thus a problem that currently needs to be solved.
Therefore, the embodiments of the present specification aim to provide an expansion scheme for a question-answer knowledge base. The scheme determines whether to add a user sentence to the question-answer knowledge base as a new standard question by analyzing the similarity between the user sentence and the standard question corresponding to it in the knowledge base, which ensures that every question expanded into the knowledge base is similar to an existing standard question. Because the user sentence is obtained from historical dialogue data, incomplete sentences can be avoided, unlike questions generated by a deep learning model that learns from the existing questions in the knowledge base. This ensures the quality of the questions added to the question-answer knowledge base and therefore improves the quality of the knowledge base itself. On this basis, the similarity between the user sentence and the standard question is analyzed using their features in both the sentence dimension and the word dimension, which improves the accuracy of the similarity analysis result and further improves the quality of the added questions and of the knowledge base. In addition, the whole expansion process requires no manual participation, which both saves labor and improves the expansion efficiency of the question-answer knowledge base.
It should be understood that the method for expanding the question-answer knowledge base provided in the embodiments of the present specification may be executed by an electronic device or by software installed in the electronic device, and specifically by a terminal device or a server device.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a method for expanding a question-answer knowledge base provided in an embodiment of the present specification. The method may include:
S102, obtaining user sentences from historical dialogue data.
The historical dialogue data refers to data generated by a user and a responder (such as an intelligent customer service agent or a human agent) in one or more rounds of historical dialogue. It may include not only the user's questions but also the answers the responder gave to the user, and the like.
In a specific application, the historical dialogue data can take various forms. For example, when a user conducts a man-machine dialogue in text, the historical dialogue data is in text form; when the user conducts a man-machine dialogue by voice, the historical dialogue data is in voice form, and so on. Correspondingly, user sentences can be obtained from the historical dialogue data in various ways. If the historical dialogue data is in text form, text processing may be performed on it directly to extract user sentences, such as the user's questions. If the historical dialogue data is in voice form, it may first be converted into text form using Automatic Speech Recognition (ASR), and text processing may then be performed on the converted data to extract user sentences, such as the user's questions.
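The text-form case can be illustrated with a minimal Python sketch. The record layout (a list of dictionaries with `speaker` and `text` keys) and the function name `extract_user_statements` are hypothetical and chosen only for illustration; the specification does not prescribe a data format.

```python
def extract_user_statements(dialogue):
    """Return the non-empty utterances spoken by the user.

    `dialogue` is assumed (hypothetically) to be a list of turns, each a
    dict with a "speaker" key ("user" or anything else) and a "text" key.
    """
    statements = []
    for turn in dialogue:
        text = turn.get("text", "").strip()
        # Keep only user turns; responder turns and empty turns are skipped.
        if turn.get("speaker") == "user" and text:
            statements.append(text)
    return statements
```

For voice-form data, the same function could be applied to the transcript produced by an ASR step, which is not shown here.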
Further, when the historical dialogue data is in voice form, the Voice Activity Detection (VAD) operation performed while converting it into text form introduces certain errors. When user sentences are extracted, such errors may cause a complete word to be split across different sentences because of a pause, so that the semantics of the extracted user sentences are discontinuous.
S104, determining the similarity between the user statement and the standard question based on the first statement feature of the user statement, the word feature of the first word contained in the user statement, the second statement feature of the standard question corresponding to the user statement in the question-answer knowledge base and the word feature of the second word contained in the standard question.
In the embodiments of the present specification, the word features of the words contained in a question characterize the question in the word dimension. The word features of a word may include, for example but not limited to, its part of speech, core score and importance score, where the core score indicates whether the word is a core word of the question it belongs to, and the importance score indicates how important the word is. More specifically, different parts of speech may correspond to different preset values, for example 1 for a noun and 2 for an adjective. The core score may also be set according to actual needs, for example 1 if the word is a core word of the sentence it belongs to and 0 otherwise. The importance score of a word may be represented by its Term Frequency-Inverse Document Frequency (TF-IDF).
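As a rough illustration of these three word features, the following Python sketch builds them for one word. The `POS_VALUES` table only contains the two example values given above, the smoothed IDF variant is an assumption (the specification does not fix a TF-IDF formula), and the function names are hypothetical.

```python
import math
from collections import Counter

# Preset part-of-speech values; only noun -> 1 and adjective -> 2 come from
# the text, any other tag falls back to 0 (an assumption).
POS_VALUES = {"noun": 1, "adjective": 2}


def tf_idf(word, sentence_words, corpus):
    """TF-IDF of `word` in one tokenized sentence, relative to a tokenized corpus."""
    tf = Counter(sentence_words)[word] / len(sentence_words)
    doc_freq = sum(1 for doc in corpus if word in doc)
    # Add-one smoothing (an assumption) avoids division by zero.
    idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0
    return tf * idf


def word_features(word, pos, is_core, sentence_words, corpus):
    """Collect part-of-speech value, core score and importance score."""
    return {
        "pos": POS_VALUES.get(pos, 0),
        "key": 1 if is_core else 0,
        "tfidf": tf_idf(word, sentence_words, corpus),
    }
```

Which words count as core words, and how parts of speech are tagged, is left to whatever tagger or rule set an implementation chooses.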
The sentence features of a question characterize the question in the sentence dimension. Specifically, they may include, for example but not limited to, the sentence pattern and the sentence vector of the question, where the sentence vector is a mathematical representation of the question. More specifically, the sentence vector of a question may be generated from the word vectors and importance scores of the words contained in the question. Illustratively, the sentence vector $senten_{vec}$ of a question is
Figure BDA0003398319480000061
where $tfidf_i$ denotes the importance score of the $i$-th word contained in the question, $w2v_i$ denotes the word vector of the $i$-th word contained in the question, and $N$ denotes the number of words contained in the question.
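One plausible reading of this definition is a TF-IDF-weighted average of the word vectors. The Python sketch below assumes that form, that is, the sum of $tfidf_i \cdot w2v_i$ over the $N$ words divided by $N$; the exact formula in the original figure is not reproduced here and may differ, for example in its normalization.

```python
def sentence_vector(word_vectors, tfidf_scores):
    """TF-IDF-weighted average of word vectors (assumed form, see lead-in).

    word_vectors: list of N equal-length lists of floats (w2v_i).
    tfidf_scores: list of N floats (tfidf_i).
    """
    if not word_vectors or len(word_vectors) != len(tfidf_scores):
        raise ValueError("need one TF-IDF score per word vector")
    n = len(word_vectors)
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for vec, weight in zip(word_vectors, tfidf_scores):
        for j, component in enumerate(vec):
            total[j] += weight * component
    # Dividing by N is an assumption about the normalization.
    return [value / n for value in total]
```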
In practical applications, any technique known to those skilled in the art may be used to extract features from a question and obtain its sentence features and the word features of the words it contains, for example word segmentation, stop-word removal and word vector computation; this is not specifically limited in the embodiments of the present specification.
For convenience of distinction, for the words included in the question, the words included in the user sentence are referred to as first words, and the words included in the standard question are referred to as second words; regarding the sentence characteristics of the question sentences, the sentence characteristics of the user sentences are called first sentence characteristics, and the sentence characteristics of the standard question sentences are called second sentence characteristics; for a sentence vector of a question, the sentence vector of a user sentence is referred to as a first sentence vector, and the sentence vector of a standard question is referred to as a second sentence vector.
In the embodiments of the present specification, the standard question corresponding to the user sentence in the question-answer knowledge base is a question in the knowledge base that is semantically related to the user sentence. For example, if the user sentence is "your name is what", the corresponding standard questions in the question-answer knowledge base may include "what name you call", "your maiden name", "ask how you call", and the like. The standard question can be obtained from the question-answer knowledge base according to the words contained in the user sentence; the specific way of obtaining it is described in detail later.
When determining the similarity between the user sentence and the standard question, note that a sentence is composed of individual words and its semantics are mainly carried by those words. The similarity can therefore be analyzed from the features of the user sentence and the standard question in both the sentence dimension and the word dimension, which improves the accuracy of the similarity analysis result.
And S106, determining whether to add the user statement as a new standard question to the question-answer knowledge base or not based on the similarity between the user statement and the standard question.
If the similarity between the user sentence and the standard question is high, it can be determined that the two are different ways of asking the same question, and the user sentence can be added to the question-answer knowledge base as a new standard question, thereby expanding the knowledge base; otherwise, the user sentence may be discarded.
For example, suppose the user sentence is "your name is what" and the corresponding standard question in the question-answer knowledge base is "what name you call". If the similarity between the two is high, the user sentence can be added to the question-answer knowledge base as a new standard question for asking a name.
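The decision in S106 can be sketched as a simple threshold test. The text only says the similarity must be "high"; the numeric `threshold` of 0.8 and the function name `expand_knowledge_base` below are hypothetical.

```python
def expand_knowledge_base(standard_questions, user_sentence, similarity,
                          threshold=0.8):
    """Add `user_sentence` as a new standard question if it is similar enough.

    `threshold` is a hypothetical cut-off for "high" similarity.
    Returns True if the sentence was added, False if it was discarded.
    """
    if similarity >= threshold:
        standard_questions.append(user_sentence)
        return True
    return False
```

In a real system the threshold would presumably be tuned on labeled examples of sentence pairs.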
In the method for expanding the question-answer knowledge base provided in the embodiments of the present specification, whether to add a user sentence to the question-answer knowledge base as a new standard question is determined by analyzing the similarity between the user sentence and the corresponding standard question in the knowledge base, so that every question expanded into the knowledge base is similar to an existing standard question. On this basis, the similarity is analyzed using the features of the user sentence and the standard question in both the sentence dimension and the word dimension, which improves the accuracy of the similarity analysis result and further improves the quality of the added questions and of the question-answer knowledge base. In addition, the whole expansion process requires no manual participation, which saves labor and improves the expansion efficiency of the question-answer knowledge base.
In S104 above, sentences with similar semantics generally have a certain similarity in the sentence dimension. In addition, sentences are composed of individual words, and the relationships between words affect the relationships between sentences. On this basis, in order to analyze the similarity between the user sentence and the standard question more accurately, in an optional implementation, determining the similarity between the user sentence and the standard question may include the following steps:
Step A1, determining a first similarity between the user sentence and the standard question sentence in the sentence dimension based on the first sentence characteristic and the second sentence characteristic.
In the embodiment of the present specification, the first similarity is used to represent a similarity between a user sentence and a standard question sentence in a sentence dimension.
Optionally, the semantics of a sentence are normally affected by its sentence pattern, so two similar sentences are likely to share the same sentence pattern. For example, two questions asking for a name, such as "what your name is", both have the sentence pattern adjectival possessive pronoun + noun + predicate + interrogative word. In addition, the similarity of two sentences in the sentence dimension can also be measured by the similarity of their sentence vectors and by the edit distance between them. On this basis, in order to accurately analyze the similarity between the user sentence and the standard question in the sentence dimension, Step A1 above may include:
Step A11, based on the first sentence vector and the second sentence vector, determining sentence vector similarity between the user sentence and the standard question sentence.
In this embodiment, the sentence vector similarity between the user sentence and the standard question sentence represents the similarity between the first sentence vector and the second sentence vector. Optionally, it can be expressed by the cosine similarity, i.e.
$$similarity = \cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
where $A$ denotes the first sentence vector and $B$ denotes the second sentence vector.
Of course, various sentence vector similarity calculation schemes known in the art may alternatively be employed to determine the sentence vector similarity between the user sentence and the standard question sentence.
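For the cosine variant of Step A11, a minimal Python sketch is given below. Returning 0.0 when either vector has zero length is a convention adopted here for illustration, not something the specification states.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors A and B."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention (assumption): a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```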
Step A12, comparing the sentence pattern of the user sentence with the sentence pattern of the standard question sentence, and determining the sentence pattern similarity between the user sentence and the standard question sentence based on the obtained comparison result.
Illustratively, if the sentence pattern of the user sentence is the same as that of the standard question sentence, the sentence pattern similarity between the user sentence and the standard question sentence may be determined to be 1; otherwise, it may be determined to be 0.
Step A13, based on the first words and the second words, determining an edit distance between the user sentence and the standard question sentence.
The Edit Distance (ED) between the user sentence and the standard question refers to the minimum number of editing operations required to convert one sentence into another sentence. Generally, editing operations include replacing words in one question with words in another question, inserting words, deleting words, and the like. If the editing distance between the user statement and the standard question is smaller, the similarity between the user statement and the standard question is larger.
In specific implementation, various edit distance algorithms known to those skilled in the art may be used to determine the edit distance between the user sentence and the standard question sentence based on the words (i.e., the first word and the second word) included in each of the two question sentences.
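As one example of such an algorithm, the sketch below computes the Levenshtein distance over word lists, counting each word replacement, insertion or deletion as one operation. The specification leaves the choice of algorithm open, so this is only one possible instance.

```python
def edit_distance(first_words, second_words):
    """Word-level Levenshtein distance between two tokenized sentences."""
    m, n = len(first_words), len(second_words)
    # prev[j] holds the distance between the first i-1 words of
    # first_words and the first j words of second_words.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if first_words[i - 1] == second_words[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete a word
                curr[j - 1] + 1,     # insert a word
                prev[j - 1] + cost,  # replace a word (or keep it)
            )
        prev = curr
    return prev[n]
```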
Step A14, determining a first similarity based on the sentence vector similarity, the sentence pattern similarity and the edit distance.
Optionally, to obtain a more accurate first similarity, the sentence vector similarity, the sentence pattern similarity and the edit distance may be weighted and summed, and the weighted sum determined as the first similarity, i.e. $score_1 = w_1 \cdot similarity + w_2 \cdot score_{pattern} + w_3 \cdot ED$, where $score_1$ denotes the first similarity, $similarity$ denotes the sentence vector similarity between the user sentence and the standard question sentence, $w_1$ denotes the weight corresponding to the sentence vector similarity, $score_{pattern}$ denotes the sentence pattern similarity between the user sentence and the standard question sentence, $w_2$ denotes the weight corresponding to the sentence pattern similarity, $ED$ denotes the edit distance between the user sentence and the standard question sentence, and $w_3$ denotes the weight corresponding to the edit distance. In a specific implementation, the weights corresponding to the sentence vector similarity, the sentence pattern similarity and the edit distance may be set according to actual needs, which is not specifically limited in the embodiments of the present specification.
Of course, alternatively, the first similarity may also be determined based on the sentence vector similarity, the sentence pattern similarity, and the edit distance in other manners, for example, the maximum value of the sentence vector similarity, the sentence pattern similarity, and the edit distance is determined as the first similarity, and the like, which is not specifically limited in this embodiment of the specification.
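Steps A12 and A14 can be sketched together in Python as follows. Representing a sentence pattern as a sequence of part-of-speech tags, and the default weights, are assumptions made for illustration. In particular, the negative default for `w3` is a choice made here so that a larger edit distance lowers the score; the specification only says the weights are set according to actual needs.

```python
def pattern_similarity(user_pattern, standard_pattern):
    """Step A12: 1 if the two sentence patterns are identical, else 0.

    A pattern is assumed (hypothetically) to be a sequence of tags.
    """
    return 1 if tuple(user_pattern) == tuple(standard_pattern) else 0


def first_similarity(vector_sim, pattern_sim, edit_dist,
                     w1=0.5, w2=0.3, w3=-0.2):
    """Step A14: score_1 = w1 * similarity + w2 * score_pattern + w3 * ED.

    The default weights are hypothetical placeholders.
    """
    return w1 * vector_sim + w2 * pattern_sim + w3 * edit_dist
```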
Step A2, determining a second similarity between the user sentence and the standard question sentence in the word dimension based on the word characteristics of the first word and the word characteristics of the second word.
In the embodiment of the present specification, the second similarity is used to represent a similarity between the user sentence and the standard question sentence in the word dimension.
Optionally, the similarity of two sentences in the word dimension is affected both by how important each word is within its sentence and by the association relationships between the words. The similarity of the user sentence and the standard question sentence in the word dimension can therefore be analyzed according to these two influencing factors, so as to ensure the accuracy of the analysis result. Specifically, Step A2 may include:
Step A21, determining the importance degree of the first word in the user sentence based on the part of speech, the core score and the importance score of the first word, and determining the importance degree of the second word in the standard question sentence based on the part of speech, the core score and the importance score of the second word.
For example, for a first word, the part-of-speech score of its part of speech, its core score and its importance score may be weighted and summed, and the weighted sum is determined as the importance degree of the first word in the user sentence, i.e. $W_i = \theta_1 \cdot pos + \theta_2 \cdot key + \theta_3 \cdot tfidf$, where $W_i$ represents the importance of the i-th first word in the user sentence, $pos$ represents the part-of-speech score of the part of speech of the first word, $\theta_1$ represents the weight corresponding to the part of speech, $key$ represents the core score of the first word, $\theta_2$ represents the weight corresponding to the core score, $tfidf$ represents the importance score of the first word, and $\theta_3$ represents the weight corresponding to the importance score.
It should be noted that the part-of-speech scores of different parts-of-speech may be different, and may be specifically set according to actual needs. Secondly, the weights corresponding to the part of speech, the core score and the importance score can be set according to actual needs, for example, the sum of the weights of the part of speech, the core score and the importance score is equal to 1. In addition, the user sentence may include a plurality of first words, and for each first word, the importance degree of the first word in the user sentence may be determined in the above manner.
For the second word, the part of speech, the core score and the importance score of the second word can be weighted and summed by adopting the above method for determining the importance degree of the first word in the user sentence, and the obtained weighted and summed result is determined as the importance degree of the second word in the standard question sentence.
It should also be noted that the standard question sentence may include a plurality of second words, and for each second word, the importance degree of the second word in the standard question sentence can be determined in the above manner.
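The importance computation of Step A21 can be sketched as follows. The part-of-speech score table, the `Word` container and the default weights are hypothetical values chosen for illustration (the weights sum to 1, as suggested above); the same function applies to first words and second words alike.

```python
from dataclasses import dataclass

# Hypothetical part-of-speech scores; the text only says they may differ by part of speech.
POS_SCORES = {"noun": 1.0, "verb": 0.8, "pronoun": 0.4, "particle": 0.1}

@dataclass
class Word:
    text: str
    pos: str      # part of speech
    key: float    # core score
    tfidf: float  # importance score

def importance(word, theta1=0.3, theta2=0.3, theta3=0.4):
    """W_i = theta1*pos + theta2*key + theta3*tfidf."""
    pos_score = POS_SCORES.get(word.pos, 0.0)  # unknown parts of speech score 0
    return theta1 * pos_score + theta2 * word.key + theta3 * word.tfidf
```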
Step A22, determining the association between the first term and the second term.
In the embodiments of the present specification, the association relationship between two words may include, for example, but is not limited to, at least one of the following relationships: complete matching, partial word matching, word vector matching, synonym relationship, inclusion relationship, superior-inferior relationship, belonging to the same entity, and the like. Complete matching means that the two words are literally identical; partial word matching means that the two words partially match literally, such as the word "off" and the word "off"; word vector matching means that the word vectors of the two words are similar; a synonym relationship means that the two words are synonyms, such as the word "name" and the word "name"; an inclusion relationship means that one word includes the other, for example the word "weekend" includes the word "Saturday"; a superior-inferior relationship means that one word is a superordinate or subordinate concept of the other, for example the word "bird" is a superordinate concept of the word "sparrow"; belonging to the same entity means that the two words correspond to the same entity, for example the word "Haihe district" and the word "Chaoyang district" both belong to the entity with the place name "Beijing".
In specific implementation, the association relationship between the first term and the second term may be determined by querying a pre-established association relationship word library, or may be determined by various technical means known to those skilled in the art, which is not specifically limited in this embodiment of the present specification.
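One way to picture the lexicon lookup is the sketch below. The lexicon contents, the relationship labels, the per-relationship weights and the substring test used for partial matching are all assumptions for illustration; a real association relationship word library would be built in advance and would cover the other relationship types as well.

```python
# Hypothetical weight for each association relationship.
RELATION_WEIGHTS = {
    "exact": 1.0,
    "synonym": 0.9,
    "hypernym": 0.6,
    "partial": 0.5,
}

# Hypothetical pre-established lexicon, keyed by unordered word pair.
LEXICON = {
    frozenset({"bird", "sparrow"}): "hypernym",
}

def relation(first_word, second_word):
    """Return the association relationship between two words, or None."""
    if first_word == second_word:
        return "exact"
    found = LEXICON.get(frozenset({first_word, second_word}))
    if found:
        return found
    # Crude stand-in for partial word matching: one word contains the other.
    if first_word in second_word or second_word in first_word:
        return "partial"
    return None

def relation_weight(first_word, second_word):
    """Weight for the pair; 0.0 when no relationship is found."""
    return RELATION_WEIGHTS.get(relation(first_word, second_word), 0.0)
```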
Step A23, determining a second similarity based on the importance degree of the first word in the user sentence, the importance degree of the second word in the standard question sentence and the association relationship between the first word and the second word.
Optionally, a first one-way similarity of the user sentence to the standard question sentence may be determined based on the importance degree of the first word in the user sentence and the association relationship between the first word and the second word, and a second one-way similarity of the standard question sentence to the user sentence may be determined based on the importance degree of the second word in the standard question sentence and the association relationship between the first word and the second word; further, a second similarity is determined based on the first one-way similarity and the second one-way similarity.
More specifically, the first one-way similarity may be determined by the following formula (1):
Formula (1) (image BDA0003398319480000121 in the original publication)
where $score_{query \to question}$ denotes the first one-way similarity, $W_i$ represents the importance of the i-th first word in the user sentence (query), $R_{ij}$ represents the weight corresponding to the association relationship between the i-th first word and the j-th second word, and $i = 1, \dots$
Illustratively, the determination process of the first one-way similarity is explained by taking the user sentence "what your name is what" and the standard question sentence "what your name is called" as an example. The first term, the word characteristic of the first term, the second term, and the word characteristic of the second term are shown in table 1.
TABLE 1
(Table image BDA0003398319480000122 in the original publication)
Based on the data shown in Table 1, the importance of each word in the sentence to which it belongs can be determined as $W_i = \theta_1 \cdot p_i + \theta_2 \cdot k_i + \theta_3 \cdot t_i$, and the first one-way similarity of the user sentence with respect to the standard question sentence can then be determined as $score_{query \to question} = \sum W_i \cdot b_k$.
In a similar manner to the determination of the first one-way similarity, the second one-way similarity may be determined by the following formula (2):
Formula (2) (image BDA0003398319480000123 in the original publication)
where $score_{question \to query}$ denotes the second one-way similarity, $W_j$ represents the importance of the j-th second word in the standard question sentence, $R_{ji}$ represents the weight corresponding to the association relationship between the j-th second word and the i-th first word, $N$ represents the number of first words contained in the user sentence (query), and $M$ represents the number of second words contained in the standard question sentence (question).
In the formula (1) and the formula (2), weights corresponding to different association relations may be set to different values, and may be specifically set according to actual needs, which is not specifically limited in this embodiment of the present specification.
After the first one-way similarity and the second one-way similarity are determined, the two may be weighted and summed, and the weighted sum is determined as the second similarity, i.e. $score_2 = \alpha_1 \cdot score_{query \to question} + \alpha_2 \cdot score_{question \to query}$, where $score_2$ denotes the second similarity, $score_{query \to question}$ represents the first one-way similarity, $\alpha_1$ represents the weight corresponding to the first one-way similarity, $score_{question \to query}$ represents the second one-way similarity, and $\alpha_2$ represents the weight corresponding to the second one-way similarity.
It should also be noted that the weights corresponding to the first one-way similarity and the second one-way similarity respectively may be set according to actual needs, for example, the sum of the two may be set to be equal to 1 as a target.
It can be understood that the second similarity of the user statement and the standard question in the word dimension is analyzed and determined from the two directions of the user statement relative to the standard question and the standard question relative to the user statement, so that errors caused by analyzing the second similarity in a single direction can be avoided, the accuracy of the determined second similarity is further improved, and more powerful data support is provided for the subsequent expansion of the question-answer knowledge base.
It should be noted that, in other alternative schemes, a scheme for analyzing and determining the second similarity from a single direction is also possible, for example, the second similarity is determined only from the direction of the user sentence relative to the standard question, or the second similarity is determined only from the direction of the standard question relative to the user sentence.
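Step A23 can be sketched as follows. Formulas (1) and (2) are not available in this text, so the sketch assumes that each word contributes its importance multiplied by the largest association weight it has with any word of the other sentence; this aggregation, the function names and the default values of $\alpha_1$ and $\alpha_2$ are assumptions and may differ from the published formulas.

```python
def one_way_similarity(importances, weights):
    """Sum over source words of W_i * max_j R_ij (the max over j is an assumption).

    importances: list of W_i for the words of the source sentence.
    weights: weights[i][j] is the association weight between source word i
             and target word j.
    """
    total = 0.0
    for w_i, row in zip(importances, weights):
        total += w_i * (max(row) if row else 0.0)
    return total

def second_similarity(w_query, w_question, r, alpha1=0.5, alpha2=0.5):
    """score_2 = alpha1*score_{query->question} + alpha2*score_{question->query}."""
    # Transpose r so that rows index second words and columns index first words.
    r_t = [list(col) for col in zip(*r)] if r else [[] for _ in w_question]
    forward = one_way_similarity(w_query, r)
    backward = one_way_similarity(w_question, r_t)
    return alpha1 * forward + alpha2 * backward
```

Setting `alpha2=0` or `alpha1=0` gives the single-direction variants mentioned above.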
Step A3, based on the first similarity and the second similarity, determining the similarity between the user sentence and the standard question sentence.
In specific implementation, the first similarity and the second similarity may be weighted and summed, and the weighted sum is determined as the similarity between the user sentence and the standard question sentence. For example, the weights corresponding to the first similarity and the second similarity may both be set to 1, that is, $score_{sum} = score_1 + score_2$, where $score_{sum}$ represents the similarity between the user sentence and the standard question sentence, $score_1$ represents the first similarity, and $score_2$ represents the second similarity.
It should be noted that the above process is described with only one standard question sentence. In practical application, the user sentence may correspond to a plurality of standard question sentences in the question-answer knowledge base, for example the two standard question sentences "what name you call" and "precious family name you call". In this case, the similarity between the user sentence and each standard question sentence may be determined in the above manner.
It can be understood that, in the above embodiment, by analyzing the similarities of the user sentence and the standard question in the sentence dimension and the word dimension, respectively, and determining the similarity between the user sentence and the standard question based on the similarities of the two dimensions, not only the similarity of the user sentence and the standard question in the overall layer is considered, but also the influence of the relationship between the words constituting the user sentence and the standard question on the similarity between the user sentence and the standard question is fully considered, so that the determined similarity between the user sentence and the standard question is more accurate, and a powerful data support is provided for the subsequent determination for expanding the question and answer knowledge base.
Of course, in the above S104, in some alternative implementations, various technical means known to those skilled in the art may also be used to analyze and determine the similarity between the user sentence and the standard question sentence.
In the above S106, in a first optional implementation manner, a threshold may be preset, and whether to add the user sentence to the question-answer knowledge base as a new standard question sentence is determined by comparing the similarity between the user sentence and the standard question sentence with the preset threshold. For example, if the similarity between the user sentence and the standard question sentence is greater than or equal to the preset threshold, it may be determined that the two are similar, and the user sentence may be added to the question-answer knowledge base as a new standard question sentence; if the similarity is smaller than the preset threshold, it may be determined that the two differ greatly, and the user sentence may be discarded.
In a second optional implementation manner, in order to avoid mistakenly discarding user sentences that are in fact similar to the standard question sentence, a plurality of different thresholds may be set, and whether to add the user sentence to the question-answer knowledge base as a new standard question sentence is determined by combining the results of comparing the similarity between the user sentence and the standard question sentence with each threshold. For example, a first preset threshold and a second preset threshold may be set in advance, the second preset threshold being smaller than the first preset threshold, and the similarity between the user sentence and the standard question sentence is compared with each of them. If the similarity is greater than or equal to the first preset threshold, it may be determined that the user sentence is very similar to the standard question sentence, and the user sentence may be added to the question-answer knowledge base as a new standard question sentence. If the similarity is smaller than or equal to the second preset threshold, it may be determined that the user sentence differs greatly from the standard question sentence, and the user sentence may be discarded. If the similarity is smaller than the first preset threshold and larger than the second preset threshold, it may be determined that the user sentence has a certain similarity to the standard question sentence; the user sentence may then be sent to an auditing platform for auditing, and is added to the question-answer knowledge base as a new standard question sentence when the auditing platform determines that it passes the audit. This avoids missing a high-quality question sentence that could have served as a standard question sentence because the user sentence was directly discarded, and also avoids degrading the quality of the question-answer knowledge base by mistakenly adding a low-quality question sentence that should not serve as a standard question sentence.
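The two-threshold decision can be sketched as a small function. The threshold values and the returned labels are hypothetical; only the three-way comparison follows the description above.

```python
def decide(similarity, first_threshold=0.8, second_threshold=0.4):
    """Return 'add', 'discard' or 'review' for a user sentence."""
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")
    if similarity >= first_threshold:
        return "add"      # very similar: add as a new standard question sentence
    if similarity <= second_threshold:
        return "discard"  # large difference: drop the user sentence
    return "review"       # in between: send to the auditing platform
```

The first optional implementation corresponds to keeping only the first comparison and discarding everything below the single threshold.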
In practical application, the user sentence may correspond to a plurality of standard question sentences in the question-answer knowledge base. In this case, the similarity between the user sentence and each standard question sentence may be compared with a preset threshold, and the user sentence is added to the question-answer knowledge base as a new standard question sentence when its similarity to all of the standard question sentences exceeds the preset threshold (for example, the first preset threshold). In a more preferable scheme, to reduce workload, the standard question sentence with the largest similarity to the user sentence may be selected, and only the similarity between the user sentence and the selected standard question sentence is compared with the preset threshold.
Specifically, for the first optional implementation manner, comparing the similarity between the user sentence and the standard question sentence with a preset threshold may include: if the user statement corresponds to a plurality of standard questions in the question-answer knowledge base, selecting the standard question with the maximum similarity between the user statement and the plurality of standard questions, and comparing the similarity between the user statement and the selected standard question with a preset threshold.
For the second optional implementation manner, as an example, comparing the similarity between the user sentence and the standard question sentence with the first preset threshold and the second preset threshold respectively includes: if the user sentence corresponds to a plurality of standard question sentences in the question-answer knowledge base, selecting, from the plurality of standard question sentences, the one with the maximum similarity to the user sentence, and comparing the similarity between the user sentence and the selected standard question sentence with the first preset threshold and the second preset threshold respectively.
Illustratively, when the standard question sentence is selected, the plurality of standard question sentences may be sorted in descending order of their similarity to the user sentence to obtain a sorting result; then, based on the obtained sorting result, the standard question sentence ranked first is selected.
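The selection by sorting can be sketched as below. The function name and the dictionary input are assumptions for illustration; the similarity values would come from the steps described earlier.

```python
def best_match(similarities):
    """Pick the standard question sentence with the highest similarity.

    similarities: dict mapping each standard question sentence to its
                  similarity with the user sentence.
    """
    if not similarities:
        raise ValueError("no standard question sentences to choose from")
    # Sort in descending order of similarity and take the entry ranked first.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0]
```

Only the similarity returned for the selected sentence then needs to be compared with the preset threshold or thresholds.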
As described above, in the embodiments of the present specification, the standard question sentence may be obtained from the question-answer knowledge base according to the words contained in the user sentence. However, because user sentences are obtained from historical dialogue data, they may be short, and a large part of the existing question sentences stored in the question-answer knowledge base are irrelevant to a given user sentence. If the standard question sentence corresponding to the user sentence is searched for directly in the question-answer knowledge base, the question sentences found may be unrelated to the user sentence. This impairs the effect of expanding the question-answer knowledge base, increases the workload of subsequently analyzing the similarity between the user sentence and the standard question sentences, and thus reduces the efficiency of the whole expansion process.
Therefore, in another embodiment of the present specification, before S104, the method for expanding a question-answer knowledge base provided in the embodiment of the present specification further includes recalling and roughly arranging the question sentences in the question-answer knowledge base to accurately obtain the standard question sentences corresponding to the user sentences in the question-answer knowledge base. As shown in fig. 2, a standard question acquiring method provided in an embodiment of the present disclosure may include the following steps:
S202, recalling a plurality of sample question sentences from the question-answer knowledge base based on the first words.
In an alternative implementation, the search keyword may be determined based on the first word, and then a question containing the search keyword is recalled from the question-and-answer knowledge base as a sample question. For example, if the user sentence is "your name is what", the first word contained therein includes "name", it can be presumed that the semantic meaning of the user sentence is query name, so that "name", "last name", "first name", and the like can be used as search keywords, and the question sentence containing these keywords is recalled from the question and answer knowledge base as sample question sentences, such as sample question sentences of "what name you call" and "precious last name you" are obtained.
Of course, in alternative implementations, multiple sample question sentences may also be recalled from the question-answer knowledge base using various recall means known to those skilled in the art.
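The keyword-based recall of S202 can be sketched as follows. The expansion table, the example knowledge base and the substring test are hypothetical; how the search keywords are derived from the first words is left open by the text.

```python
# Hypothetical keyword expansion table.
EXPANSIONS = {"name": ["name", "last name", "first name"]}

def search_keywords(first_words):
    """Collect search keywords for the first words, without duplicates."""
    keywords = []
    for word in first_words:
        for keyword in EXPANSIONS.get(word, []):
            if keyword not in keywords:
                keywords.append(keyword)
    return keywords

def recall(knowledge_base, keywords):
    """Return the question sentences that contain at least one search keyword."""
    return [q for q in knowledge_base if any(k in q for k in keywords)]
```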
S204, respectively aiming at each recalled sample question, determining the number of overlapped words between the sample question and the user sentence based on the first words and the words contained in the sample question.
Alternatively, the number of overlapping words between the sample question sentence and the user sentence may be determined by the following formula (3):
$Num_k = |\{P_1, P_2, \dots, P_n\} \cap \{Q_1, Q_2, \dots, Q_{n_k}\}|$ (3)
where $Num_k$ represents the number of overlapping words between the k-th sample question sentence and the user sentence, $Q_{n_k}$ represents the $n_k$-th word contained in the k-th sample question sentence, and $P_n$ represents the n-th first word contained in the user sentence.
And S206, selecting at least one sample question with the largest number of overlapped words with the user statement from the plurality of sample questions, and determining the sample question as a standard question corresponding to the user statement in the question-answer knowledge base.
The more the number of the overlapping words between the sample question and the user sentence is, the higher the possibility that the sample question is related to the user sentence is, and therefore, one or more sample questions having the largest number of the overlapping words with the user sentence can be selected from the plurality of sample questions and determined as the standard question.
For example, when selecting the sample question sentences, the sample question sentences recalled from the question-answer knowledge base may be sorted in descending order of the number of words they share with the user sentence to obtain a sorting result. Then, based on the obtained sorting result, the sample question sentences ranked in the top N are selected and determined as the standard question sentences corresponding to the user sentence in the question-answer knowledge base, where N is a positive integer greater than or equal to 1 whose specific value may be chosen according to actual needs.
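Steps S204 and S206 can be sketched together: formula (3) is a set intersection, and the selection is a sort followed by taking the top N. The function names and the dictionary input are assumptions for illustration.

```python
def overlap_count(first_words, sample_words):
    """Num_k: size of the intersection of the two word sets, as in formula (3)."""
    return len(set(first_words) & set(sample_words))

def select_standard_questions(first_words, samples, top_n=1):
    """Return the top_n sample question sentences by overlap with the user sentence.

    samples: dict mapping each sample question sentence to the list of words
             it contains.
    """
    ranked = sorted(
        samples,
        key=lambda s: overlap_count(first_words, samples[s]),
        reverse=True,
    )
    return ranked[:top_n]
```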
It can be understood that, in this embodiment, a plurality of sample question sentences are recalled from the question-and-answer knowledge base based on the first word, and then at least one sample question sentence with a larger number of overlaps with the user sentence is selected from the recalled sample question sentences by using a rule that the more the overlapped words between the two sentences are, the larger the correlation between the two sentences is, and is determined as a standard question sentence corresponding to the user sentence in the question-and-answer knowledge base, so that it can be ensured that the determined standard question sentence is related to the user sentence, not only can the effect of subsequently expanding the question-and-answer knowledge base be ensured, but also the huge workload increased by analyzing and determining the similarity between the user sentence and the irrelevant question sentence in the expansion process can be reduced, and thus the efficiency of expanding the question-and-answer knowledge base is further improved.
In addition, in correspondence with the method for expanding the question-answering knowledge base shown in fig. 1, the embodiment of the present specification further provides an expanding device for the question-answering knowledge base. Fig. 3 is a schematic structural diagram of an expansion device 300 for a question-answer knowledge base according to an embodiment of the present disclosure, including:
an obtaining module 310, configured to obtain a user statement from historical dialogue data;
a similarity determining module 320, configured to determine a similarity between the user sentence and the standard question sentence based on the first sentence feature of the user sentence, the word feature of the first word included in the user sentence, the second sentence feature of the standard question sentence corresponding to the user sentence in the question-and-answer knowledge base, and the word feature of the second word included in the standard question sentence;
an expansion module 330, configured to determine whether to add the user statement to the question-and-answer knowledge base as a new standard question based on a similarity between the user statement and the standard question.
The expansion device of the question-answer knowledge base provided in the embodiment of the present specification determines whether to add the user sentence to the question-answer knowledge base as a new standard question sentence by analyzing the similarity between the user sentence and the corresponding standard question sentence in the question-answer knowledge base, which ensures that a question sentence added to the question-answer knowledge base is similar to the existing standard question sentences. Because the user sentence is obtained from historical dialogue data, the problem of disfluent sentences can be avoided, compared with question sentences generated by a deep learning model that learns from the existing question sentences in the question-answer knowledge base; this ensures the quality of the question sentences added to the question-answer knowledge base and improves the quality of the question-answer knowledge base. On this basis, the similarity between the user sentence and the standard question sentence is analyzed based on their features in both the sentence dimension and the word dimension, which improves the accuracy of the similarity analysis result and further improves the quality of the added question sentences and of the question-answer knowledge base. In addition, the whole expansion process requires no manual participation, which both saves labor and improves the efficiency of expanding the question-answer knowledge base.
Optionally, the similarity determining module 320 includes:
a first similarity determination submodule, configured to determine a first similarity between the user sentence and the standard question sentence in a sentence dimension based on the first sentence feature and the second sentence feature;
a second similarity determination submodule, configured to determine a second similarity between the user sentence and the standard question sentence in a term dimension based on the term features of the first term and the term features of the second term;
and the third similarity determining submodule is used for determining the similarity between the user statement and the standard question sentence based on the first similarity and the second similarity.
Optionally, the first sentence feature includes a sentence pattern and a first sentence vector of the user sentence, and the second sentence feature includes a sentence pattern and a second sentence vector of the standard question sentence;
the first similarity determination submodule determines a first similarity in a sentence dimension between the user sentence and the standard question sentence based on the first sentence feature and the second sentence feature, and includes:
determining sentence vector similarity between the user sentence and the standard question sentence based on the first sentence vector and the second sentence vector;
comparing the sentence pattern of the user sentence with the sentence pattern of the standard question sentence, and determining the sentence pattern similarity between the user sentence and the standard question sentence based on the obtained comparison result;
determining an edit distance between the user sentence and the standard question sentence based on the first word and the second word;
and determining the first similarity based on the sentence vector similarity, the sentence pattern similarity and the editing distance.
Optionally, the apparatus further comprises:
a first sentence vector generation module, configured to generate the first sentence vector based on a word vector and an importance score of the first word, where the importance score of the first word is used to represent the importance of the first word;
and a second sentence vector generation module, configured to generate the second sentence vector based on a word vector and an importance score of the second word, where the importance score of the second word is used to represent the importance of the second word.
The first sentence vector is generated by a word vector of the first word and an importance score, and the importance score of the first word is used for representing the importance of the first word;
the second sentence vector is generated by a word vector of the second word and an importance score, and the importance score of the second word is used for representing the importance of the second word.
Optionally, the word features include a part of speech, a core score and an importance score, the core score is used for representing whether the word belongs to the core word in the belonging sentence, and the importance score is used for representing the importance of the word;
the second similarity determination submodule determines a second similarity between the user sentence and the standard question sentence in a word dimension based on the word features of the first word and the word features of the second word, and includes:
determining the importance degree of the first word in the user sentence based on the part of speech, the core score and the importance score of the first word, and determining the importance degree of the second word in the standard question sentence based on the part of speech, the core score and the importance score of the second word;
determining an association between the first term and the second term;
determining the second similarity based on the importance degree of the first word in the user sentence, the importance degree of the second word in the standard question sentence, and the association relationship.
Optionally, the second similarity determination submodule determines the second similarity based on the importance degree of the first word in the user sentence, the importance degree of the second word in the standard question sentence, and the association relationship, and includes:
determining a first one-way similarity of the user statement relative to the standard question statement based on the importance degree of the first word in the user statement and the incidence relation;
determining a second one-way similarity of the standard question relative to the user statement based on the importance degree of the second word in the standard question and the incidence relation;
determining the second similarity based on the first one-way similarity and the second one-way similarity.
Optionally, the association relationship includes at least one of the following relationships: complete matching, partial word matching, word vector matching, synonym relationship, inclusion relationship, superior-inferior relationship, and belonging to the same entity.
Optionally, the expansion module 330 includes:
a comparison submodule, configured to compare similarity between the user statement and the standard question statement with a first preset threshold and a second preset threshold, respectively, where the second preset threshold is smaller than the first preset threshold;
a first adding submodule, configured to add the user statement to the question and answer knowledge base as a new standard question if the similarity between the user statement and the standard question is greater than or equal to the first preset threshold;
a discarding submodule, configured to discard the user statement if the similarity between the user statement and the standard question statement is less than or equal to the second preset threshold;
and the second adding submodule is used for sending the user statement to an auditing platform for auditing if the similarity between the user statement and the standard question is smaller than the first preset threshold and larger than the second preset threshold, and adding the user statement serving as a new standard question to the question and answer knowledge base when the auditing platform determines that the user statement passes the auditing.
Optionally, the comparing sub-module compares the similarity between the user sentence and the standard question sentence with a first preset threshold and a second preset threshold, respectively, and includes:
if the user statement corresponds to a plurality of standard question sentences in the question-answer knowledge base, selecting the standard question sentence with the maximum similarity between the user statement and the standard question sentences;
and comparing the similarity between the user statement and the selected standard question sentence with the first preset threshold and the second preset threshold respectively.
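The three-way decision performed by the expansion module can be sketched as follows. The branch conditions follow the text above; the `Decision` enum, the function name `decide`, and the threshold values used in the usage note are hypothetical, since the specification gives no concrete threshold values.

```python
from enum import Enum
from typing import Sequence


class Decision(Enum):
    ADD = "add"          # add the user statement as a new standard question
    DISCARD = "discard"  # drop the user statement
    AUDIT = "audit"      # send the user statement to the auditing platform


def decide(
    similarities: Sequence[float],
    first_threshold: float,
    second_threshold: float,
) -> Decision:
    """Compare the maximum similarity against the two preset thresholds."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than first threshold")
    if not similarities:
        raise ValueError("at least one similarity is required")
    # If several standard questions correspond to the user statement,
    # only the most similar one is compared with the thresholds.
    best = max(similarities)
    if best >= first_threshold:
        return Decision.ADD
    if best <= second_threshold:
        return Decision.DISCARD
    return Decision.AUDIT
```

With assumed thresholds of 0.9 and 0.5, for example, `decide([0.3, 0.95], 0.9, 0.5)` yields `Decision.ADD`, while `decide([0.7], 0.9, 0.5)` routes the statement to auditing.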
Optionally, the apparatus 300 further comprises:
a recall module, configured to recall a plurality of sample question sentences from the question-and-answer knowledge base based on the first words, before the similarity determination module 320 determines the similarity between the user sentence and the standard question sentence based on the first sentence features of the user sentence, the word features of the first words included in the user sentence, the second sentence features of the standard question sentences corresponding to the user sentence in the question-and-answer knowledge base, and the word features of the second words included in the standard question sentences;
an overlap determination module, configured to determine, for each recalled sample question sentence, the number of overlapping words between that sample question sentence and the user sentence, based on the words contained in the sample question sentence and the first words;
a selection module, configured to select, from the plurality of sample question sentences, at least one sample question sentence having the largest number of overlapping words with the user sentence, and to determine it as the standard question sentence corresponding to the user sentence in the question-answer knowledge base.
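The recall, overlap-counting, and selection steps can be sketched as below. The specification states only that sample question sentences are recalled "based on the first words"; using an inverted index for recall is an assumption, and the names `build_inverted_index`, `select_standard_questions`, and `top_k` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(questions: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each word to the ids of the sample questions that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for qid, words in questions.items():
        for w in words:
            index[w].add(qid)
    return index


def select_standard_questions(
    user_words: List[str],
    questions: Dict[str, List[str]],
    index: Dict[str, Set[str]],
    top_k: int = 1,
) -> List[str]:
    """Recall questions sharing a word with the user sentence, then rank by overlap."""
    user_set = set(user_words)
    # Recall: every sample question that shares at least one word with the user sentence.
    recalled: Set[str] = set()
    for w in user_set:
        recalled |= index.get(w, set())
    # Overlap: number of distinct words shared with the user sentence.
    overlaps = {qid: len(user_set & set(questions[qid])) for qid in recalled}
    # Selection: largest overlap first; ties are broken by id for determinism.
    ranked = sorted(overlaps, key=lambda q: (-overlaps[q], q))
    return ranked[:top_k]
```

Restricting the detailed similarity computation to the selected candidates avoids scoring the user sentence against every standard question sentence in the knowledge base.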
Obviously, the expansion device of the question-answer knowledge base according to the embodiment of the present specification can be used as an execution subject of the expansion method of the question-answer knowledge base shown in fig. 1, and thus the function of the expansion method of the question-answer knowledge base in fig. 1 can be realized. Since the principle is the same, it is not described herein again.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring to fig. 4, at a hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a Random-Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 4, but that does not indicate only one bus or one type of bus.
The memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the expansion device of the question-answer knowledge base at the logical level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:
obtaining user statements from historical dialogue data;
determining similarity between the user statement and a standard question sentence based on a first statement feature of the user statement, a word feature of a first word contained in the user statement, a second statement feature of the standard question sentence corresponding to the user statement in a question-and-answer knowledge base, and a word feature of a second word contained in the standard question sentence;
and determining whether to add the user statement as a new standard question to the question-answer knowledge base or not based on the similarity between the user statement and the standard question.
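The sentence-dimension part of the similarity determination in the operations above (sentence vectors built from word vectors and importance scores, sentence-pattern comparison, and edit distance, as set out in the claims) can be sketched as follows. This is a minimal sketch under stated assumptions: the importance-weighted average, the cosine measure, the normalization of the edit distance by the longer sentence, and the combination weights `(0.6, 0.2, 0.2)` are all illustrative choices that the specification does not prescribe.

```python
import math
from typing import List, Sequence, Tuple


def sentence_vector(word_vectors: List[List[float]], importance: List[float]) -> List[float]:
    """Importance-weighted average of word vectors (the weighting scheme is assumed)."""
    dim = len(word_vectors[0])
    total = sum(importance)
    if total == 0:
        return [0.0] * dim
    return [
        sum(s * v[i] for s, v in zip(importance, word_vectors)) / total
        for i in range(dim)
    ]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance computed over word sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def sentence_level_similarity(
    vec_a: Sequence[float],
    vec_b: Sequence[float],
    pattern_a: str,
    pattern_b: str,
    words_a: Sequence[str],
    words_b: Sequence[str],
    weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),  # assumed weights
) -> float:
    """Combine vector similarity, sentence-pattern similarity, and edit distance."""
    vec_sim = cosine(vec_a, vec_b)
    pattern_sim = 1.0 if pattern_a == pattern_b else 0.0
    longest = max(len(words_a), len(words_b)) or 1
    edit_sim = 1.0 - edit_distance(words_a, words_b) / longest
    w_vec, w_pat, w_edit = weights
    return w_vec * vec_sim + w_pat * pattern_sim + w_edit * edit_sim
```

The resulting sentence-dimension score would then be combined with the word-dimension score to obtain the overall similarity used in the third operation; the specification leaves the form of that combination open.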
The method performed by the question-answer knowledge base expansion device disclosed in the embodiment shown in fig. 1 of the present specification may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present specification. A general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the embodiments of the present specification may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in a memory, and the processor reads information in the memory and completes the steps of the method in combination with its hardware.
It should be understood that the electronic device of the embodiment of the present specification can implement the functions of the device for expanding the question-answering knowledge base in the embodiment shown in fig. 1. Since the principle is the same, the embodiments of the present description are not described herein again.
Of course, besides the software implementation, the electronic device in this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. That is, the execution subject of the processing flow described above is not limited to logic units, and may also be hardware or a logic device.
Embodiments of the present specification also propose a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, are capable of causing the portable electronic device to perform the method of the embodiment shown in fig. 1, and in particular to perform the following:
obtaining user statements from historical dialogue data;
determining similarity between the user statement and a standard question sentence based on a first statement feature of the user statement, a word feature of a first word contained in the user statement, a second statement feature of the standard question sentence corresponding to the user statement in a question-and-answer knowledge base, and a word feature of a second word contained in the standard question sentence;
and determining whether to add the user statement as a new standard question to the question-answer knowledge base or not based on the similarity between the user statement and the standard question.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and for the relevant points reference may be made to the corresponding description of the method embodiment.

Claims (12)

1. A method for expanding a question-answer knowledge base is characterized by comprising the following steps:
obtaining user statements from historical dialogue data;
determining similarity between the user statement and a standard question sentence based on a first statement feature of the user statement, a word feature of a first word contained in the user statement, a second statement feature of the standard question sentence corresponding to the user statement in a question-and-answer knowledge base, and a word feature of a second word contained in the standard question sentence;
and determining whether to add the user statement as a new standard question to the question-answer knowledge base or not based on the similarity between the user statement and the standard question.
2. The method according to claim 1, wherein the determining the similarity between the user sentence and the standard question sentence based on the first sentence characteristic of the user sentence, the word characteristic of the first word included in the user sentence, the second sentence characteristic of the standard question sentence corresponding to the user sentence in the question-and-answer knowledge base, and the word characteristic of the second word included in the standard question sentence comprises:
determining a first similarity between the user sentence and the standard question sentence in a sentence dimension based on the first sentence feature and the second sentence feature;
determining a second similarity between the user sentence and the standard question sentence in a word dimension based on the word features of the first word and the word features of the second word;
and determining the similarity between the user sentence and the standard question sentence based on the first similarity and the second similarity.
3. The method of claim 2, wherein the first sentence features comprise a sentence pattern and a first sentence vector of the user sentence, and the second sentence features comprise a sentence pattern and a second sentence vector of the standard question sentence;
the determining a first similarity in sentence dimension between the user sentence and the standard question sentence based on the first sentence feature and the second sentence feature includes:
determining sentence vector similarity between the user sentence and the standard question sentence based on the first sentence vector and the second sentence vector;
comparing the sentence pattern of the user sentence with the sentence pattern of the standard question sentence, and determining the sentence pattern similarity between the user sentence and the standard question sentence based on the obtained comparison result;
determining an edit distance between the user sentence and the standard question sentence based on the first word and the second word;
and determining the first similarity based on the sentence vector similarity, the sentence pattern similarity and the edit distance.
4. The method according to claim 3, wherein before determining the similarity between the user sentence and the standard question sentence based on the first sentence characteristic of the user sentence, the word characteristic of the first word contained in the user sentence, the second sentence characteristic of the standard question sentence corresponding to the user sentence in the question-and-answer knowledge base, and the word characteristic of the second word contained in the standard question sentence, the method further comprises:
generating the first sentence vector based on the word vector and the importance score of the first word, wherein the importance score of the first word is used for representing the importance of the first word;
and generating the second sentence vector based on the word vector and the importance score of the second word, wherein the importance score of the second word is used for representing the importance of the second word.
5. The method according to claim 2, wherein the word characteristics comprise part of speech, core score and importance score, the core score is used for representing whether the word belongs to the core word in the belonging sentence, and the importance score is used for representing the importance of the word;
the determining a second similarity between the user sentence and the standard question sentence in a word dimension based on the word features of the first word and the word features of the second word comprises:
determining the importance degree of the first word in the user sentence based on the part of speech, the core score and the importance score of the first word, and determining the importance degree of the second word in the standard question sentence based on the part of speech, the core score and the importance score of the second word;
determining an association relationship between the first word and the second word;
determining the second similarity based on the importance degree of the first word in the user sentence, the importance degree of the second word in the standard question sentence, and the association relationship.
6. The method of claim 5, wherein the determining the second similarity based on the importance degree of the first word in the user sentence, the importance degree of the second word in the standard question sentence, and the association relationship comprises:
determining a first one-way similarity of the user sentence relative to the standard question sentence based on the importance degree of the first word in the user sentence and the association relationship;
determining a second one-way similarity of the standard question sentence relative to the user sentence based on the importance degree of the second word in the standard question sentence and the association relationship;
determining the second similarity based on the first one-way similarity and the second one-way similarity.
7. The method according to claim 1, wherein the determining whether to add the user sentence as a new standard question sentence to the question-and-answer knowledge base based on the similarity between the user sentence and the standard question sentence comprises:
comparing the similarity between the user statement and the standard question statement with a first preset threshold and a second preset threshold respectively, wherein the second preset threshold is smaller than the first preset threshold;
if the similarity between the user statement and the standard question is greater than or equal to the first preset threshold, adding the user statement as a new standard question to the question and answer knowledge base;
if the similarity between the user statement and the standard question statement is smaller than or equal to the second preset threshold, discarding the user statement;
and if the similarity between the user statement and the standard question is smaller than the first preset threshold and larger than the second preset threshold, sending the user statement to an auditing platform for auditing, and adding the user statement to the question and answer knowledge base as a new standard question when the auditing platform determines that the user statement passes the auditing.
8. The method of claim 7, wherein comparing the similarity between the user sentence and the standard question sentence with a first preset threshold and a second preset threshold respectively comprises:
if the user statement corresponds to a plurality of standard question sentences in the question-answer knowledge base, selecting the standard question sentence with the maximum similarity between the user statement and the standard question sentences;
and comparing the similarity between the user statement and the selected standard question sentence with the first preset threshold and the second preset threshold respectively.
9. The method according to any one of claims 1 to 8, wherein before determining the similarity between the user sentence and the standard question sentence based on the first sentence characteristic of the user sentence, the word characteristic of the first word contained in the user sentence, the second sentence characteristic of the standard question sentence corresponding to the user sentence in a question-and-answer knowledge base, and the word characteristic of the second word contained in the standard question sentence, the method further comprises:
recalling a plurality of sample question sentences from the question-and-answer knowledge base based on the first words;
respectively aiming at each recalled sample question, determining the number of overlapped words between the sample question and the user sentence based on the words contained in the sample question and the first words;
and selecting at least one sample question with the largest number of overlapped words with the user sentence from the plurality of sample questions, and determining the sample question as a standard question corresponding to the user sentence in the question-answer knowledge base.
10. An apparatus for expanding a knowledge base of questions and answers, comprising:
the acquisition module is used for acquiring user sentences from historical dialogue data;
a similarity determination module, configured to determine a similarity between the user statement and the standard question sentence based on a first statement feature of the user statement, a word feature of a first word included in the user statement, a second statement feature of a standard question sentence corresponding to the user statement in a question-and-answer knowledge base, and a word feature of a second word included in the standard question sentence;
and the expansion module is used for determining whether to add the user statement into the question-answer knowledge base as a new standard question or not based on the similarity between the user statement and the standard question.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 9.
12. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-9.
CN202111490800.2A 2021-12-08 2021-12-08 Question-answer knowledge base expansion method and device Pending CN114254090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111490800.2A CN114254090A (en) 2021-12-08 2021-12-08 Question-answer knowledge base expansion method and device

Publications (1)

Publication Number Publication Date
CN114254090A true CN114254090A (en) 2022-03-29

Family

ID=80791826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111490800.2A Pending CN114254090A (en) 2021-12-08 2021-12-08 Question-answer knowledge base expansion method and device

Country Status (1)

Country Link
CN (1) CN114254090A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062892A (en) * 2018-07-10 2018-12-21 东北大学 A kind of Chinese sentence similarity calculating method based on Word2Vec
CN110347796A (en) * 2019-07-05 2019-10-18 神思电子技术股份有限公司 Short text similarity calculating method under vector semantic tensor space
CN110489538A (en) * 2019-08-27 2019-11-22 腾讯科技(深圳)有限公司 Sentence answer method, device and electronic equipment based on artificial intelligence
CN110555101A (en) * 2019-09-09 2019-12-10 浙江诺诺网络科技有限公司 customer service knowledge base updating method, device, equipment and storage medium
CN111259660A (en) * 2020-01-15 2020-06-09 中国平安人寿保险股份有限公司 Method, device and equipment for extracting keywords based on text pairs and storage medium
CN111898643A (en) * 2020-07-01 2020-11-06 上海依图信息技术有限公司 Semantic matching method and device
CN112328762A (en) * 2020-11-04 2021-02-05 平安科技(深圳)有限公司 Question and answer corpus generation method and device based on text generation model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113092A (en) * 2023-10-24 2023-11-24 北京睿企信息科技有限公司 Question expansion method based on question-answering task model and storage medium
CN117113092B (en) * 2023-10-24 2024-01-23 北京睿企信息科技有限公司 Question expansion method based on question-answering task model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination