CN113722452B - Semantic-based rapid knowledge hit method and device in question-answering system - Google Patents
- Publication number
- CN113722452B (application CN202110807421.5A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- knowledge
- vector
- vectors
- question
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The application discloses a semantic-based rapid knowledge hit method and device in a question-answering system. The method comprises the following steps: preparing a corpus for model training, comprising user questions and the corresponding knowledge in a knowledge base, and labeling whether each user question matches the knowledge; training a model on the labeled corpus as a binary classification task based on the Bert model, setting the model output to the output of the pooled_output layer of the Bert model after training is finished, and saving it as a semantic model; representing the knowledge base as vectors, where the set of semantic vectors forms a semantic vector space; semantically partitioning the semantic vector space with a random forest, generating N binary trees from the same semantic vector space; and converting the user question into a semantic vector and performing the knowledge hit calculation. The method introduces a deep learning model to improve knowledge hit quality and optimizes the matching algorithm to improve knowledge hit speed, so that intelligent customer service can support a very large knowledge base.
Description
Technical Field
The invention relates to the technical field of data identification processing, in particular to a semantic-based rapid knowledge hit method and device in a question-answering system.
Background
In recent years, intelligent customer service has been successfully applied to a wide range of business consultation services, providing a quick and convenient resolution path for enterprises and users. Intelligent customer service automatically identifies a user's question by machine and returns a corresponding solution; in a specific implementation, replying to user questions through intelligent customer service improves response speed and saves labor cost.
As applications in this field have developed, the business scenarios handled by intelligent question-answering systems have become numerous and complex, and the corresponding knowledge bases have grown ever larger. Traditional search-and-match algorithms can no longer meet the requirements in terms of either performance or quality: the knowledge hit rate is poor, and so is the user experience.
Disclosure of Invention
The invention aims to provide a semantic-based rapid knowledge hit method and device in a question-answering system, so as to solve the problems described in the background section above.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first aspect of the present application provides a semantic-based rapid knowledge hit method in a question-answering system, including:
s1, preparing corpus used for model training, wherein the corpus comprises user questions and knowledge in a corresponding knowledge base, and labeling whether the user questions are matched with the knowledge;
s2, training a model based on the Bert model by using labeled corpus according to the two classification tasks, setting the model output as the output of a porous_output layer of the Bert model after training is completed, and storing the model output as a semantic model;
s3, converting the knowledge base of the text representation into a knowledge base of the semantic vector representation, namely a vector knowledge base, wherein a set of semantic vectors contained in the knowledge base is a semantic vector space;
s4, carrying out semantic segmentation on the semantic vector space by adopting a random forest, generating N binary trees by the same semantic vector space, wherein N is a natural number which is more than or equal to 1; wherein, each binary tree corresponds to a knowledge base represented by semantic vectors which are segmented randomly, leaf nodes of each binary tree represent semantic vectors of which the number is not more than K, K is a natural number and satisfies that K is not less than 1 and not more than total vector number/N;
s5, converting the user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
s6, calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge.
Specifically, in the step S4, N balances performance against accuracy and needs to be tuned incrementally according to the observed effect.
Preferably, the step S1 includes the steps of:
s11: collecting user questions and knowledge in a corresponding knowledge base, wherein the user questions comprise positive and negative questions, positive expressions of which are matched with the knowledge, and negative expressions of which are not matched with the knowledge, and the questions which are similar in word and not matched with the knowledge comprise questions with unmatched semanteme;
s12: labeling whether the user question is matched with the knowledge, wherein the labeling format is as follows: user question + knowledge + tag, wherein the tag is matched or not.
Preferably, the step S3 includes the steps of:
s31: converting each piece of knowledge in the knowledge base into digital information by using a vocab dictionary of the Bert model;
s32: inputting the digital information into the semantic model for reasoning, and outputting semantic expression vectors of knowledge;
s33: after all knowledge reasoning is completed, the knowledge base of the text representation is converted into the knowledge base of the semantic vector representation.
Preferably, the step S4 includes the steps of:
s41: randomly selecting one semantic vector V in the vector knowledge base, and calculating cosine similarity between the semantic vectors in all the vector knowledge bases and the randomly selected semantic vector V;
s42: dividing semantic vectors with cosine similarity in the range of (0, 1) into a first subspace, and dividing semantic vectors with cosine similarity in the range of [ -1,0] into a second subspace;
s43: the semantic vector V is taken as a root node, the first subspace is taken as a left subtree, the second subspace is taken as a right subtree, and the semantic vector V, the first subspace and the second subspace form a binary tree;
s44: repeating steps S41-S43 for subspaces on all nodes of the binary tree until the number of semantic vectors in all subspaces is less than or equal to K;
s45: repeating the steps for N times, and projecting the semantic vector space of the vector knowledge base into N binary trees.
Preferably, in the step S5, the number of semantic vectors contained in the N nearest leaf nodes is at most N × K.
Preferably, the step S5 includes the steps of:
s51: converting the text information into digital information by using a vorcab dictionary of the Bert model through a user question;
s52: inputting the digital information of the user question into the semantic model for reasoning, and outputting a semantic vector corresponding to the user question, namely, a semantic vector of the user question;
s53: selecting any one of N binary trees;
s54: calculating cosine similarity between the semantic vector of the question of the user and the binary tree node, wherein the cosine similarity is within the range of (0, 1), taking the left subtree node, or else, taking the right subtree node;
s55: repeating the step S54, searching the binary tree until a leaf node of the binary tree, namely the nearest leaf node, is found;
s56: repeating the steps S53-S55, and finding N nearest leaf nodes in all binary trees;
s57: and performing de-duplication processing on all the semantic vectors of the N nearest leaf nodes to obtain M semantic vectors, wherein the number of all the semantic vectors of the N leaf nodes is less than or equal to N x K.
Preferably, the step S6 includes the steps of:
s61: calculating cosine similarity between M semantic vectors and user question vectors;
s62: sequencing the M semantic vectors according to the descending order of cosine similarity, and returning the semantic vector D with highest similarity;
s63: and comparing the similarity value of the semantic vector D with a preset distance threshold T, and when D > T, indicating hit knowledge.
The second aspect of the present application provides a semantic-based rapid knowledge hit device in a question-answering system, including:
the corpus labeling preparation module is used for preparing corpus trained by the model, comprising user question sentences and knowledge in a corresponding knowledge base, and labeling whether the user question sentences are matched with the knowledge;
the semantic model fine-tuning module is used for training a model on the labeled corpus as a binary classification task based on the Bert model, setting the model output to the output of the pooled_output layer of the Bert model after training is completed, and saving it as the semantic model;
the knowledge base vector representation module is used for converting the knowledge base of the text representation into the knowledge base of the semantic vector representation, namely a vector knowledge base, wherein the set of the semantic vectors contained in the knowledge base vector representation is a semantic vector space;
the binary tree generation module is used for carrying out semantic segmentation on the semantic vector space by adopting a random forest, generating N binary trees from the same semantic vector space, wherein N is a natural number greater than or equal to 1; each binary tree corresponds to one randomly partitioned knowledge base represented by semantic vectors, and each leaf node of a binary tree holds no more than K semantic vectors, where K is a natural number satisfying 1 ≤ K ≤ (total vector count)/N;
the user question searching module is used for converting a user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
the knowledge hit calculation module is used for calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge.
Specifically, N above balances performance against accuracy and needs to be tuned incrementally according to the observed effect.
Preferably, the corpus annotation preparation module includes:
the collecting sub-module is used for collecting user questions and the corresponding knowledge in the knowledge base, wherein the user questions comprise positive and negative examples: positive examples match the knowledge and negative examples do not, including questions that are similar in wording but unmatched in semantics;
the labeling sub-module is used for labeling whether the user question matches the knowledge, in the format: user question + knowledge + tag, where the tag is either matched or not matched.
Preferably, the user question searching module includes:
the user question semantic vector generation sub-module is used for outputting semantic vectors corresponding to the user questions after reasoning the user questions through the semantic model, namely the user question semantic vectors;
the traversing sub-module is used for traversing N binary trees and searching N nearest neighbor leaf nodes matched with the user question semantic vector in all binary trees;
and the deduplication processing sub-module is used for deduplicating all the semantic vectors of the N nearest leaf nodes to obtain M semantic vectors, wherein the total number of semantic vectors across the N leaf nodes is at most N × K.
Preferably, the knowledge hit calculation module includes:
the similarity calculation sub-module is used for calculating cosine similarity between the M semantic vectors and the user question semantic vectors;
the similarity sorting sub-module is used for sorting the M semantic vectors according to the descending order of cosine similarity and returning the semantic vector D with the highest similarity;
the judging submodule is used for judging whether the similarity of the returned semantic vector D exceeds a preset distance threshold T;
the determining submodule is used for determining that the semantic vector is hit knowledge when the similarity of the semantic vector D exceeds a preset distance threshold T.
A third aspect of the present application provides a computer device comprising a memory and a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the steps of the semantic-based fast knowledge hit method in a question-answering system described above when executing the computer readable instructions.
A fourth aspect of the present application provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the steps of the semantic-based rapid knowledge hit method in the question-answering system described above.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the application discloses a semantic-based rapid knowledge hit method and device in a question-answering system, wherein the method comprises the following steps: preparing and labeling corpus; fine-tuning a semantic model; a knowledge base vector representation; creating a knowledge base vector projection index; and (5) calculating the knowledge hit. According to the technical scheme, the deep learning model is introduced to improve the knowledge hit effect, and the matching algorithm is optimized to improve the knowledge hit speed, so that the intelligent customer service can support a larger and larger knowledge base. According to the method provided by the application, under the condition of a small amount of even no marked data, in the intelligent question-answering system facing a huge knowledge base, quicker and more accurate knowledge hit can be realized, and the user experience effect is good.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a flow diagram of a semantic-based rapid knowledge hit method in a question-answering system of the present invention;
FIG. 2 is a schematic diagram of a processing procedure of forming a binary tree by a semantic vector V, a subspace A and a subspace B according to the embodiment of the invention;
FIG. 3 is a schematic diagram of a process for spatially projecting semantic vectors of a vector knowledge base into N binary trees in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a process of traversing N binary trees to find all N leaf nodes in an embodiment of the present invention;
FIG. 5 is a logic diagram of a semantic-based fast knowledge hit method in a question-answering system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a semantic-based rapid knowledge hit device in a question-answering system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a corpus labeling preparation module of a semantic-based rapid knowledge hit device in a question-answering system according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a user question searching module of a semantic-based rapid knowledge hit device in a question and answer system according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a knowledge hit calculation module of a semantic-based fast knowledge hit device in a question-answering system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer and more obvious, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It is noted that the terms "first," "second," and the like in the description and claims of the present invention and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order, and it is to be understood that the data so used may be interchanged where appropriate. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
A method and apparatus for semantic-based fast knowledge hit in a question-answering system of the present application are described below with reference to the accompanying drawings.
Embodiment one:
FIG. 1 is a flow chart of a semantic-based fast knowledge hit method according to the present application, as shown in FIG. 1, comprising the steps of:
step S1, preparing corpus for model training, including user questions and knowledge in a corresponding knowledge base, and marking whether the user questions are matched with the knowledge;
step S2, training a model based on the Bert model by using labeled corpus according to the two classification tasks, setting the model output as the output of a porous_output layer of the Bert model after training is completed, and storing the model output as a semantic model;
step S3, converting the knowledge base of the text representation into a knowledge base of the semantic vector representation, namely a vector knowledge base, wherein the set of the semantic vectors contained in the knowledge base is a semantic vector space;
s4, carrying out semantic segmentation on the semantic vector space by adopting a random forest, generating N binary trees by adopting the same semantic vector space, wherein N is a natural number which is more than or equal to 1, N is a balance value of performance and precision, and the N is required to be adjusted step by step according to an actual effect; wherein, each binary tree corresponds to a knowledge base of semantic representation of random segmentation, leaf nodes of each binary tree represent semantic vectors of not more than K pieces of knowledge, K is a natural number and satisfies 1.ltoreq.K.ltoreq.total vector number/N;
step S5, converting the user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
and S6, calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge.
Specifically, in connection with fig. 2-5, the method comprises the steps of:
the first step: corpus preparation and labeling.
Step 101: preparing a corpus for model training by collecting user questions and the corresponding knowledge in the knowledge base, wherein the user questions comprise positive and negative examples: positive examples match the knowledge and negative examples do not, in particular questions that are similar in wording but unmatched in semantics.
Step 102: the training corpus is labeled in the format: user question + knowledge + tag, where the tag is either matched or not matched.
And a second step of: semantic model fine-tuning.
Step 201: training the model on the corpus labeled in the previous step as a binary classification task, based on the deep pre-trained Bert model.
Step 202: after training is completed, the model output is set to the output of the pooled_output layer of the Bert model and saved as a semantic model.
And a third step of: knowledge base vector representation.
Step 301: each piece of knowledge in the knowledge base is converted to digital information using the vocab dictionary of the Bert model.
Step 302: the digital information is input into the fine-tuned semantic model for inference, and the semantic representation vector of the knowledge is output.
Step 303: after all knowledge reasoning is completed, the knowledge base of the text representation is converted into a knowledge base of semantic vector representation, namely a vector knowledge base, and the set of semantic vectors contained in the knowledge base is a semantic vector space.
Fourth step: vector knowledge base projection index creation.
In many business scenarios, a random forest model is used as a classifier to perform classification and similar processing on large volumes of business data. A random forest is an ensemble model built from decision trees; in practical applications it classifies by the vote of multiple decision trees. Each internal node of a decision tree represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a category; every node except the leaf nodes records relevant information about the current split.
In this embodiment, a random forest is used to semantically partition the semantic vector space: N binary trees are generated from the same semantic vector space, where N is the number of binary trees in the forest and is chosen as a balance between performance and accuracy.
The method specifically comprises the following steps:
step 401: randomly selecting one semantic vector V in a vector knowledge base, and calculating cosine similarity between all semantic vectors in the vector knowledge base and the randomly selected semantic vector V;
step 402: dividing semantic vectors whose cosine similarity falls in (0, 1] into subspace A, and semantic vectors whose cosine similarity falls in [-1, 0] into subspace B;
step 403: referring to fig. 2, a semantic vector V is taken as a root node, a subspace a is taken as a left subtree, a subspace B is taken as a right subtree, and the semantic vector V, the subspace a and the subspace B form a binary tree;
step 404: repeating steps 401-403 for subspaces on all nodes until the number of semantic vectors in all subspaces is less than or equal to K;
step 405: repeating the above steps N times projects the semantic vector space of the vector knowledge base into N binary trees, as shown in FIG. 3.
Fifth step: and (5) calculating the knowledge hit.
Step 501: the user question is converted from text information into digital information using the vocab dictionary of the Bert model.
Step 502: and inputting the digital information of the user question into the semantic model for reasoning, and outputting a semantic vector corresponding to the user question, namely, the semantic vector of the user question.
Step 503: any one of the N binary trees is selected.
Step 504: the cosine similarity between the user question semantic vector and the current binary tree node is calculated; if the cosine similarity falls in (0, 1], the left subtree node is taken, otherwise the right subtree node is taken.
Step 505: step 504 is repeated to search down the binary tree until a leaf node of the binary tree, i.e. the nearest leaf node, is found.
Step 506: steps 503 to 505 are repeated to find the N nearest leaf nodes across all binary trees.
Step 507: the at most N × K semantic vectors contained in the N nearest leaf nodes are deduplicated to obtain M semantic vectors.
Step 508: and calculating cosine similarity between the M semantic vectors and the semantic vectors of the user question.
Step 509: the M semantic vectors are sorted in descending order of cosine similarity, and the semantic vector D with the highest similarity is returned.
Step 510: the similarity of the semantic vector D is compared with a preset distance threshold T, which is determined by repeated testing in the actual service; when the similarity exceeds T, the knowledge is hit.
Embodiment two:
FIG. 6 is a schematic structural diagram of a semantic-based fast knowledge hit apparatus according to the present application, as shown in FIG. 6, the apparatus 100 includes:
the corpus labeling preparation module 110 is configured to prepare a corpus trained by a model, including user question sentences and knowledge in a corresponding knowledge base, and label whether the user question sentences are matched with the knowledge;
the semantic model fine-tuning module 120 is configured to train a model on the labeled corpus as a binary classification task based on the Bert model, set the model output to the output of the pooled_output layer of the Bert model after training is completed, and save it as a semantic model;
a knowledge base vector representation module 130, configured to convert a knowledge base of text representations into a knowledge base of semantic vector representations, i.e. a vector knowledge base, which includes a set of semantic vectors as a semantic vector space;
the binary tree generating module 140 is configured to perform semantic segmentation on the semantic vector space using a random forest, generating N binary trees from the same semantic vector space, where N is a natural number greater than or equal to 1; N balances performance against accuracy and needs to be tuned incrementally according to the observed effect; each binary tree corresponds to one randomly partitioned knowledge base represented by semantic vectors, and each leaf node of a binary tree holds no more than K semantic vectors, where K is a natural number satisfying 1 ≤ K ≤ (total vector count)/N;
the user question searching module 150 is configured to convert a user question into a corresponding semantic vector, that is, a user question semantic vector, traverse the N binary trees using the user question semantic vector, find N nearest leaf nodes, and aggregate and deduplicate semantic vectors contained in the N nearest leaf nodes to obtain M semantic vectors;
the knowledge hit calculation module 160 is configured to calculate the similarity between the M semantic vectors and the semantic vectors of the question of the user, and select the semantic vector with the highest similarity to determine the hit knowledge.
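The random-forest partitioning performed by the binary tree generation module (detailed later as steps S41-S45 / M1-M5) resembles a random-projection tree index. A minimal sketch under that reading; the pivot-selection rule, the dict-based tree representation, and the degenerate-split fallback are assumptions of this illustration, not prescribed by the text:

```python
import math
import random

def cosine(a, b):
    # Cosine similarity between two equal-length, nonzero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_tree(vectors, k, rng):
    """One randomly split binary tree: pick a random pivot vector V, send
    vectors whose cosine similarity with V is positive to the left subtree
    and the rest to the right, then recurse until a node holds at most k
    vectors (a leaf)."""
    if len(vectors) <= k:
        return {"leaf": vectors}
    pivot = rng.choice(vectors)
    left = [v for v in vectors if cosine(pivot, v) > 0]
    right = [v for v in vectors if cosine(pivot, v) <= 0]
    if not left or not right:  # degenerate split: stop early
        return {"leaf": vectors}
    return {"pivot": pivot,
            "left": build_tree(left, k, rng),
            "right": build_tree(right, k, rng)}

def build_forest(vectors, n_trees, k, seed=0):
    # N independent trees built over the same semantic vector space.
    rng = random.Random(seed)
    return [build_tree(vectors, k, rng) for _ in range(n_trees)]

vecs = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
forest = build_forest(vecs, n_trees=3, k=2)
```

Because every tree partitions the same vector set with different random pivots, a query that lands in an unlucky leaf of one tree can still find its true neighbors in another, which is the reason N is a performance/precision trade-off.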
Specifically, referring to fig. 7, the corpus labeling preparation module 110 includes:
a collecting sub-module 111, configured to collect user questions and the corresponding knowledge in the knowledge base, where the user questions include both positive and negative examples: positive examples whose expression matches the knowledge, negative examples whose expression does not match the knowledge, and in particular questions that are similar in wording but different in meaning;
the labeling sub-module 112, configured to label whether a user question matches the knowledge, in the format: user question + knowledge + label, where the label is matched or unmatched.
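A hypothetical pair of labelled records in the "user question + knowledge + label" format; the wording is invented for illustration, only the triple structure comes from the text:

```python
# Hypothetical labelled corpus records; 1 = matched, 0 = unmatched.
# The second record is the kind of near-miss negative the text highlights:
# similar wording to the knowledge, but a different meaning.
corpus = [
    ("How do I reset my password?", "Steps for resetting an account password", 1),
    ("How do I reset my router?", "Steps for resetting an account password", 0),
]
```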
Specifically, referring to fig. 8, the user question searching module 150 includes:
the user question semantic vector generation sub-module 151, configured to run the user question through the semantic model and output the corresponding semantic vector, i.e. the user question semantic vector;
the traversing sub-module 152, configured to traverse the N binary trees and find, across all of them, the N nearest leaf nodes matching the user question semantic vector;
the de-duplication processing sub-module 153, configured to de-duplicate all the semantic vectors of the found N nearest leaf nodes to obtain M semantic vectors, where the total number of semantic vectors in the N leaf nodes is at most N×K.
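The traversal and de-duplication performed by sub-modules 152 and 153 can be sketched as follows. The toy trees are hand-built for illustration (a real index would come from the binary tree generation module), and the dict-based tree layout is an assumption of this sketch:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length, nonzero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search_tree(tree, query):
    # Descend one tree: positive cosine with the pivot -> left subtree,
    # otherwise right, until a leaf (the nearest leaf node) is reached.
    while "leaf" not in tree:
        tree = tree["left"] if cosine(query, tree["pivot"]) > 0 else tree["right"]
    return tree["leaf"]

def candidate_set(forest, query):
    # Union the N nearest leaves and de-duplicate -> M <= N*K candidates.
    seen, merged = set(), []
    for tree in forest:
        for v in search_tree(tree, query):
            if tuple(v) not in seen:
                seen.add(tuple(v))
                merged.append(v)
    return merged

# Two hand-built toy trees over overlapping vectors.
t1 = {"pivot": [1.0, 0.0],
      "left": {"leaf": [[1.0, 0.0], [0.8, 0.2]]},
      "right": {"leaf": [[-1.0, 0.0], [0.0, -1.0]]}}
t2 = {"leaf": [[0.8, 0.2], [0.0, -1.0]]}  # a tree may be a single leaf
cands = candidate_set([t1, t2], [0.9, 0.1])
```

The vector [0.8, 0.2] appears in both leaves but survives de-duplication only once, so M (here 3) is strictly less than the raw leaf total of 4.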
Specifically, referring to fig. 9, the knowledge hit calculation module 160 includes:
the similarity calculation sub-module 161, configured to calculate the cosine similarity between the M semantic vectors and the user question semantic vector;
the similarity sorting sub-module 162, configured to sort the M semantic vectors in descending order of cosine similarity and return the semantic vector D with the highest similarity;
the judging sub-module 163, configured to judge whether the similarity of the returned semantic vector D exceeds the preset distance threshold T;
the determining sub-module 164, configured to determine that the semantic vector D is the hit knowledge when its similarity exceeds the preset distance threshold T.
In another aspect, the present application further provides a computer device, including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor executes the computer readable instructions to implement the steps of the semantic-based fast knowledge hit method in the question-answering system.
In another aspect, the present application further provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the semantic-based fast knowledge hit method in the question-answering system described above.
In summary, the application discloses a semantic-based rapid knowledge hit method and device in a question-answering system, where the method includes: corpus preparation and labeling; semantic model fine-tuning; knowledge base vector representation; creation of a knowledge base vector projection index; and knowledge hit calculation. The technical scheme introduces a deep learning model to improve knowledge hit accuracy and optimizes the matching algorithm to improve knowledge hit speed, so that intelligent customer service can support an ever-larger knowledge base. With the method provided by the application, faster and more accurate knowledge hits can be achieved in an intelligent question-answering system facing a huge knowledge base, even with little or no labeled data, and the user experience is good.
The specific embodiments of the present invention are described above by way of example only, and the present invention is not limited to them. Equivalent modifications and substitutions that occur to those skilled in the art also fall within the scope of the present invention; accordingly, equivalent changes and modifications made without departing from the spirit and scope of the invention are intended to be covered.
Claims (7)
1. A semantic-based rapid knowledge hit method in a question-answering system, comprising:
s1, preparing corpus used for model training, wherein the corpus comprises user questions and knowledge in a corresponding knowledge base, and labeling whether the user questions are matched with the knowledge;
s2, training a model based on the Bert model by using the labeled corpus according to the binary classification task, setting the model output as the output of the pooled_output layer of the Bert model after training is completed, and storing the model as a semantic model;
s3, converting the knowledge base of the text representation into a knowledge base of the semantic vector representation, namely a vector knowledge base, wherein a set of semantic vectors contained in the knowledge base is a semantic vector space;
s4, carrying out semantic segmentation on the semantic vector space by adopting a random forest, generating N binary trees by the same semantic vector space, wherein N is a natural number which is more than or equal to 1; wherein, each binary tree corresponds to a knowledge base represented by semantic vectors which are segmented randomly, leaf nodes of each binary tree represent semantic vectors of which the number is not more than K, K is a natural number and satisfies that K is not less than 1 and not more than total vector number/N;
s5, converting the user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
s6, calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity to determine the hit knowledge;
the step S4 includes the following steps S41 to S45:
s41: randomly selecting one semantic vector V in the vector knowledge base, and calculating cosine similarity between the semantic vectors in all the vector knowledge bases and the randomly selected semantic vector V;
s42: dividing semantic vectors with cosine similarity in the range of (0, 1) into a first subspace, and dividing semantic vectors with cosine similarity in the range of [ -1,0] into a second subspace;
s43: the semantic vector V is taken as a root node, the first subspace is taken as a left subtree, the second subspace is taken as a right subtree, and the semantic vector V, the first subspace and the second subspace form a binary tree;
s44: repeating steps S41-S43 for subspaces on all nodes of the binary tree until the number of semantic vectors in all subspaces is less than or equal to K;
s45: repeating the steps for N times, and projecting the semantic vector space of the vector knowledge base into N binary trees;
wherein, the step S5 comprises the following steps S51 to S57:
s51: converting the user question from text information into digital information by using the vocab dictionary of the Bert model;
s52: inputting the digital information of the user question into the semantic model for reasoning, and outputting a semantic vector corresponding to the user question, namely, a semantic vector of the user question;
s53: selecting any one of N binary trees;
s54: calculating the cosine similarity between the user question semantic vector and the binary tree node; if the cosine similarity is within the range of (0, 1), taking the left subtree node, otherwise taking the right subtree node;
s55: repeating the step S54, searching the binary tree until a leaf node of the binary tree, namely the nearest leaf node, is found;
s56: repeating the steps S53-S55, and finding N nearest leaf nodes in all binary trees;
s57: and performing de-duplication processing on all the semantic vectors of the N nearest leaf nodes to obtain M semantic vectors, wherein the number of all the semantic vectors of the N leaf nodes is less than or equal to N x K.
2. The method for semantic-based rapid knowledge hit in a question-answering system according to claim 1, wherein step S1 includes the steps of:
s11: collecting user questions and the corresponding knowledge in the knowledge base, wherein the user questions comprise positive and negative examples: positive examples whose expression matches the knowledge, negative examples whose expression does not match the knowledge, and in particular questions that are similar in wording but not matched in semantics;
s12: labeling whether the user question is matched with the knowledge, wherein the labeling format is as follows: user question + knowledge + tag, wherein the tag is matched or not.
3. The method for semantic-based rapid knowledge hit in a question-answering system according to claim 1, wherein step S3 includes the steps of:
s31: converting each piece of knowledge in the knowledge base into digital information by using a vocab dictionary of the Bert model;
s32: inputting the digital information into the semantic model for reasoning, and outputting semantic expression vectors of knowledge;
s33: after all knowledge reasoning is completed, the knowledge base of the text representation is converted into the knowledge base of the semantic vector representation.
4. The method for semantic-based rapid knowledge hit in a question-answering system according to claim 1, wherein step S6 includes the steps of:
s61: calculating cosine similarity between M semantic vectors and user question semantic vectors;
s62: sorting the M semantic vectors in descending order of cosine similarity, and returning the semantic vector D with the highest similarity;
s63: comparing the similarity of the semantic vector D with a preset distance threshold T, and when the similarity of D is greater than T, indicating hit knowledge.
5. A semantic-based rapid knowledge hit device in a question-answering system, comprising:
the corpus labeling preparation module is used for preparing corpus trained by the model, comprising user question sentences and knowledge in a corresponding knowledge base, and labeling whether the user question sentences are matched with the knowledge;
the semantic model fine-tuning module, configured to carry out model training based on the Bert model by using the labeled corpus according to the binary classification task, to set the model output as the output of the pooled_output layer of the Bert model after training is finished, and to store the model as the semantic model;
the knowledge base vector representation module is used for converting the knowledge base of the text representation into the knowledge base of the semantic vector representation, namely a vector knowledge base, wherein the set of the semantic vectors contained in the knowledge base vector representation is a semantic vector space;
the binary tree generation module is used for carrying out semantic segmentation on the semantic vector space by adopting a random forest, and N binary trees are generated by the same semantic vector space, wherein N is a natural number which is more than or equal to 1; wherein, each binary tree corresponds to a knowledge base represented by semantic vectors which are segmented randomly, leaf nodes of each binary tree represent semantic vectors of which the number is not more than K, K is a natural number and satisfies that K is not less than 1 and not more than total vector number/N;
the user question searching module is used for converting a user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
the knowledge hit calculation module is used for calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge;
wherein the binary tree generation module is configured to implement the following steps M1 to M5:
m1: randomly selecting one semantic vector V in the vector knowledge base, and calculating cosine similarity between the semantic vectors in all the vector knowledge bases and the randomly selected semantic vector V;
m2: dividing semantic vectors with cosine similarity in the range of (0, 1) into a first subspace, and dividing semantic vectors with cosine similarity in the range of [ -1,0] into a second subspace;
m3: the semantic vector V is taken as a root node, the first subspace is taken as a left subtree, the second subspace is taken as a right subtree, and the semantic vector V, the first subspace and the second subspace form a binary tree;
m4: repeating the steps M1-M3 for subspaces on all nodes of the binary tree until the number of semantic vectors in all subspaces is less than or equal to K;
m5: repeating the steps for N times, and projecting the semantic vector space of the vector knowledge base into N binary trees;
wherein, the user question searching module comprises:
the user question semantic vector generation sub-module is used for outputting semantic vectors corresponding to the user questions after reasoning the user questions through the semantic model, namely the user question semantic vectors;
the traversing sub-module is used for traversing N binary trees and searching N nearest neighbor leaf nodes matched with the user question semantic vector in all binary trees;
and the de-duplication processing sub-module is used for performing de-duplication processing on all the semantic vectors of the N nearest leaf nodes to obtain M semantic vectors, wherein the number of all the semantic vectors of the N leaf nodes is less than or equal to N×K.
6. The semantic-based rapid knowledge hit apparatus in a question-answering system according to claim 5, wherein the corpus annotation preparation module comprises:
the collecting sub-module is used for collecting user questions and the corresponding knowledge in the knowledge base, wherein the user questions comprise positive and negative examples: positive examples whose expression matches the knowledge, negative examples whose expression does not match the knowledge, and in particular questions that are similar in wording but not matched in semantics;
the labeling sub-module is used for labeling whether the question sentence of the user is matched with the knowledge, and the labeling format is as follows: user question + knowledge + tag, wherein the tag is matched or not.
7. The semantic-based rapid knowledge hit apparatus in a question-answering system according to claim 5, wherein the knowledge hit calculation module includes:
the similarity calculation sub-module is used for calculating cosine similarity between the M semantic vectors and the user question semantic vectors;
the similarity sorting sub-module is used for sorting the M semantic vectors according to the descending order of cosine similarity and returning the semantic vector D with the highest similarity;
the judging submodule is used for judging whether the similarity of the returned semantic vector D exceeds a preset distance threshold T;
the determining submodule is used for determining that the semantic vector is hit knowledge when the similarity of the semantic vector D exceeds a preset distance threshold T.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110807421.5A CN113722452B (en) | 2021-07-16 | 2021-07-16 | Semantic-based rapid knowledge hit method and device in question-answering system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113722452A CN113722452A (en) | 2021-11-30 |
CN113722452B true CN113722452B (en) | 2024-01-19 |
Family
ID=78673527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110807421.5A Active CN113722452B (en) | 2021-07-16 | 2021-07-16 | Semantic-based rapid knowledge hit method and device in question-answering system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113722452B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114490975B (en) * | 2021-12-31 | 2023-02-07 | 马上消费金融股份有限公司 | User question labeling method and device |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10437833B1 (en) * | 2016-10-05 | 2019-10-08 | Ontocord, LLC | Scalable natural language processing for large and dynamic text environments |
CN111090735A (en) * | 2019-12-25 | 2020-05-01 | 成都航天科工大数据研究院有限公司 | Intelligent question-answering method based on knowledge graph and performance evaluation method thereof |
CN111444320A (en) * | 2020-06-16 | 2020-07-24 | 太平金融科技服务(上海)有限公司 | Text retrieval method and device, computer equipment and storage medium |
CN111460798A (en) * | 2020-03-02 | 2020-07-28 | 平安科技(深圳)有限公司 | Method and device for pushing similar meaning words, electronic equipment and medium |
CN111581354A (en) * | 2020-05-12 | 2020-08-25 | 金蝶软件(中国)有限公司 | FAQ question similarity calculation method and system |
CN111639165A (en) * | 2020-04-30 | 2020-09-08 | 南京理工大学 | Intelligent question-answer optimization method based on natural language processing and deep learning |
CN111813916A (en) * | 2020-07-21 | 2020-10-23 | 润联软件系统(深圳)有限公司 | Intelligent question and answer method, device, computer equipment and medium |
CN112015915A (en) * | 2020-09-01 | 2020-12-01 | 哈尔滨工业大学 | Question-answering system and device based on knowledge base generated by questions |
CN112084299A (en) * | 2020-08-05 | 2020-12-15 | 山西大学 | Reading comprehension automatic question-answering method based on BERT semantic representation |
CN112100360A (en) * | 2020-10-30 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Dialog response method, device and system based on vector retrieval |
CN112395396A (en) * | 2019-08-12 | 2021-02-23 | 科沃斯商用机器人有限公司 | Question-answer matching and searching method, device, system and storage medium |
CN112417884A (en) * | 2020-11-05 | 2021-02-26 | 广州平云信息科技有限公司 | Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration |
CN112613320A (en) * | 2019-09-19 | 2021-04-06 | 北京国双科技有限公司 | Method and device for acquiring similar sentences, storage medium and electronic equipment |
CN112989005A (en) * | 2021-04-16 | 2021-06-18 | 重庆中国三峡博物馆 | Knowledge graph common sense question-answering method and system based on staged query |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140358720A1 (en) * | 2013-05-31 | 2014-12-04 | Yahoo! Inc. | Method and apparatus to build flowcharts for e-shopping recommendations |
CN104573028B (en) * | 2015-01-14 | 2019-01-25 | 百度在线网络技术(北京)有限公司 | Realize the method and system of intelligent answer |
US20210397926A1 (en) * | 2018-09-29 | 2021-12-23 | VII Philip Alvelda | Data representations and architectures, systems, and methods for multi-sensory fusion, computing, and cross-domain generalization |
Non-Patent Citations (1)
Title |
---|
"一网通办"事项要素库智能化探究;徐智蕴;严洁;贝文馨;;现代信息科技(第09期);全文 * |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Address after: 200435 11th Floor, Building 27, Lane 99, Shouyang Road, Jing'an District, Shanghai; Applicant after: Shanghai Tongban Information Service Co.,Ltd. Address before: 200433 No. 11, Lane 100, Zhengtong Road, Yangpu District, Shanghai; Applicant before: Shanghai Tongban Information Service Co.,Ltd.
| GR01 | Patent grant |