CN113722452B - Semantic-based rapid knowledge hit method and device in question-answering system - Google Patents


Info

Publication number
CN113722452B
CN113722452B (application CN202110807421.5A)
Authority
CN
China
Prior art keywords
semantic
knowledge
vector
vectors
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110807421.5A
Other languages
Chinese (zh)
Other versions
CN113722452A (en)
Inventor
郭大勇
张海龙
兰永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tongban Information Service Co ltd
Original Assignee
Shanghai Tongban Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tongban Information Service Co ltd
Priority claimed from CN202110807421.5A
Publication of CN113722452A
Application granted
Publication of CN113722452B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G06F 16/35 - Clustering; Classification
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a semantic-based rapid knowledge hit method and device for a question-answering system. The method comprises the following steps: preparing a corpus for model training, comprising user questions and the corresponding knowledge in a knowledge base, and labeling whether each user question matches the knowledge; training a model on the labeled corpus as a binary classification task based on the Bert model, then, after training is completed, setting the model output to the pooled_output layer of the Bert model and saving it as a semantic model; representing the knowledge base as vectors, the set of semantic vectors forming a semantic vector space; semantically partitioning the semantic vector space with a random forest, generating N binary trees from the same semantic vector space; and converting the user question into a semantic vector and performing the knowledge hit calculation. The method introduces a deep learning model to improve knowledge hit quality and optimizes the matching algorithm to improve knowledge hit speed, so that intelligent customer service can support a very large knowledge base.

Description

Semantic-based rapid knowledge hit method and device in question-answering system
Technical Field
The invention relates to the technical field of data identification and processing, and in particular to a semantic-based rapid knowledge hit method and device in a question-answering system.
Background
In recent years, intelligent customer service has been successfully applied across business consultation services, providing enterprises and users with a quick and convenient path to resolution. Intelligent customer service automatically identifies a user's problem by machine and returns a corresponding solution; answering user questions this way improves response speed and saves labor cost.
As these services have developed, intelligent question-answering systems face many complex business scenarios and ever-larger knowledge bases. Traditional search-and-match algorithms can no longer meet the requirements in either performance or effectiveness: the knowledge hit rate is poor, and so is the user experience.
Disclosure of Invention
The invention aims to provide a semantic-based rapid knowledge hit method and device for a question-answering system that solve the problems described in the background above.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first aspect of the present application provides a semantic-based rapid knowledge hit method in a question-answering system, including:
s1, preparing corpus used for model training, wherein the corpus comprises user questions and knowledge in a corresponding knowledge base, and labeling whether the user questions are matched with the knowledge;
s2, training a model based on the Bert model by using labeled corpus according to the two classification tasks, setting the model output as the output of a porous_output layer of the Bert model after training is completed, and storing the model output as a semantic model;
s3, converting the knowledge base of the text representation into a knowledge base of the semantic vector representation, namely a vector knowledge base, wherein a set of semantic vectors contained in the knowledge base is a semantic vector space;
s4, carrying out semantic segmentation on the semantic vector space by adopting a random forest, generating N binary trees by the same semantic vector space, wherein N is a natural number which is more than or equal to 1; wherein, each binary tree corresponds to a knowledge base represented by semantic vectors which are segmented randomly, leaf nodes of each binary tree represent semantic vectors of which the number is not more than K, K is a natural number and satisfies that K is not less than 1 and not more than total vector number/N;
s5, converting the user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
s6, calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge.
Specifically, in step S4, N trades off performance against accuracy and should be tuned incrementally based on observed results.
Preferably, the step S1 includes the steps of:
s11: collecting user questions and knowledge in a corresponding knowledge base, wherein the user questions comprise positive and negative questions, positive expressions of which are matched with the knowledge, and negative expressions of which are not matched with the knowledge, and the questions which are similar in word and not matched with the knowledge comprise questions with unmatched semanteme;
s12: labeling whether the user question is matched with the knowledge, wherein the labeling format is as follows: user question + knowledge + tag, wherein the tag is matched or not.
Preferably, the step S3 includes the steps of:
s31: converting each piece of knowledge in the knowledge base into digital information by using a vocab dictionary of the Bert model;
s32: inputting the digital information into the semantic model for reasoning, and outputting semantic expression vectors of knowledge;
s33: after all knowledge reasoning is completed, the knowledge base of the text representation is converted into the knowledge base of the semantic vector representation.
Preferably, the step S4 includes the steps of:
s41: randomly selecting one semantic vector V in the vector knowledge base, and calculating cosine similarity between the semantic vectors in all the vector knowledge bases and the randomly selected semantic vector V;
s42: dividing semantic vectors with cosine similarity in the range of (0, 1) into a first subspace, and dividing semantic vectors with cosine similarity in the range of [ -1,0] into a second subspace;
s43: the semantic vector V is taken as a root node, the first subspace is taken as a left subtree, the second subspace is taken as a right subtree, and the semantic vector V, the first subspace and the second subspace form a binary tree;
s44: repeating steps S41-S43 for subspaces on all nodes of the binary tree until the number of semantic vectors in all subspaces is less than or equal to K;
s45: repeating the steps for N times, and projecting the semantic vector space of the vector knowledge base into N binary trees.
Preferably, in step S5, the number of semantic vectors contained in the N nearest leaf nodes is at most N × K.
Preferably, the step S5 includes the steps of:
s51: converting the text information into digital information by using a vorcab dictionary of the Bert model through a user question;
s52: inputting the digital information of the user question into the semantic model for reasoning, and outputting a semantic vector corresponding to the user question, namely, a semantic vector of the user question;
s53: selecting any one of N binary trees;
s54: calculating cosine similarity between the semantic vector of the question of the user and the binary tree node, wherein the cosine similarity is within the range of (0, 1), taking the left subtree node, or else, taking the right subtree node;
s55: repeating the step S54, searching the binary tree until a leaf node of the binary tree, namely the nearest leaf node, is found;
s56: repeating the steps S53-S55, and finding N nearest leaf nodes in all binary trees;
s57: and performing de-duplication processing on all the semantic vectors of the N nearest leaf nodes to obtain M semantic vectors, wherein the number of all the semantic vectors of the N leaf nodes is less than or equal to N x K.
Preferably, the step S6 includes the steps of:
s61: calculating cosine similarity between M semantic vectors and user question vectors;
s62: sequencing the M semantic vectors according to the descending order of cosine similarity, and returning the semantic vector D with highest similarity;
s63: and comparing the similarity value of the semantic vector D with a preset distance threshold T, and when D > T, indicating hit knowledge.
The second aspect of the present application provides a semantic-based rapid knowledge hit device in a question-answering system, including:
the corpus labeling preparation module, used for preparing the corpus for model training, comprising user questions and the corresponding knowledge in a knowledge base, and labeling whether each user question matches the knowledge;
the semantic model fine-tuning module, used for training a model on the labeled corpus as a binary classification task based on the Bert model, setting the model output to the pooled_output layer of the Bert model after training is completed, and saving it as a semantic model;
the knowledge base vector representation module is used for converting the knowledge base of the text representation into the knowledge base of the semantic vector representation, namely a vector knowledge base, wherein the set of the semantic vectors contained in the knowledge base vector representation is a semantic vector space;
the binary tree generation module, used for semantically partitioning the semantic vector space with a random forest, generating N binary trees from the same semantic vector space, where N is a natural number greater than or equal to 1; each binary tree corresponds to one random partition of the semantic-vector knowledge base, each leaf node holds no more than K semantic vectors, and K is a natural number satisfying 1 ≤ K ≤ (total vector count / N);
the user question searching module is used for converting a user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
the knowledge hit calculation module is used for calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge.
Specifically, N above trades off performance against accuracy and should be tuned incrementally based on observed results.
Preferably, the corpus annotation preparation module includes:
the collecting sub-module, used for collecting user questions and the corresponding knowledge in the knowledge base, where the user questions include positive and negative examples: positive examples match the knowledge and negative examples do not, and the negative examples include questions that are lexically similar to the knowledge but semantically unmatched;
the labeling sub-module, used for labeling whether each user question matches the knowledge, in the format: user question + knowledge + label, where the label is "matched" or "not matched".
Preferably, the user question searching module includes:
the user question semantic vector generation sub-module, used for running the user question through the semantic model and outputting the corresponding semantic vector, namely the user question semantic vector;
the traversing sub-module is used for traversing N binary trees and searching N nearest neighbor leaf nodes matched with the user question semantic vector in all binary trees;
and the de-duplication processing sub-module, used for de-duplicating all semantic vectors in the N nearest leaf nodes to obtain M semantic vectors, where the total number of semantic vectors in the N leaf nodes is at most N × K.
Preferably, the knowledge hit calculation module includes:
the similarity calculation sub-module is used for calculating cosine similarity between the M semantic vectors and the user question semantic vectors;
the similarity sorting sub-module is used for sorting the M semantic vectors according to the descending order of cosine similarity and returning the semantic vector D with the highest similarity;
the judging submodule is used for judging whether the returned semantic vector D exceeds a preset distance threshold T or not;
the determining submodule is used for determining that the semantic vector is hit knowledge when the similarity of the semantic vector D exceeds a preset distance threshold T.
A third aspect of the present application provides a computer device comprising a memory and a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the steps of the semantic-based fast knowledge hit method in a question-answering system described above when executing the computer readable instructions.
A fourth aspect of the present application provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the steps of the semantic-based rapid knowledge hit method in the question-answering system described above.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the application discloses a semantic-based rapid knowledge hit method and device in a question-answering system, wherein the method comprises the following steps: preparing and labeling corpus; fine-tuning a semantic model; a knowledge base vector representation; creating a knowledge base vector projection index; and (5) calculating the knowledge hit. According to the technical scheme, the deep learning model is introduced to improve the knowledge hit effect, and the matching algorithm is optimized to improve the knowledge hit speed, so that the intelligent customer service can support a larger and larger knowledge base. According to the method provided by the application, under the condition of a small amount of even no marked data, in the intelligent question-answering system facing a huge knowledge base, quicker and more accurate knowledge hit can be realized, and the user experience effect is good.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a flow diagram of a semantic-based rapid knowledge hit method in a question-answering system of the present invention;
FIG. 2 is a schematic diagram of a processing procedure of forming a binary tree by a semantic vector V, a subspace A and a subspace B according to the embodiment of the invention;
FIG. 3 is a schematic diagram of a process for spatially projecting semantic vectors of a vector knowledge base into N binary trees in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a process of traversing N binary trees to find the N nearest leaf nodes in an embodiment of the present invention;
FIG. 5 is a logic diagram of a semantic-based fast knowledge hit method in a question-answering system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a semantic-based rapid knowledge hit device in a question-answering system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a corpus labeling preparation module of a semantic-based rapid knowledge hit device in a question-answering system according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a user question searching module of a semantic-based rapid knowledge hit device in a question and answer system according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a knowledge hit calculation module of a semantic-based fast knowledge hit device in a question-answering system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer and more obvious, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It is noted that the terms "first," "second," and the like in the description and claims of the present invention and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order, and it is to be understood that the data so used may be interchanged where appropriate. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
A method and apparatus for semantic-based fast knowledge hit in a question-answering system of the present application are described below with reference to the accompanying drawings.
Embodiment one:
FIG. 1 is a flow chart of a semantic-based fast knowledge hit method according to the present application, as shown in FIG. 1, comprising the steps of:
step S1, preparing corpus for model training, including user questions and knowledge in a corresponding knowledge base, and marking whether the user questions are matched with the knowledge;
step S2, training a model based on the Bert model by using labeled corpus according to the two classification tasks, setting the model output as the output of a porous_output layer of the Bert model after training is completed, and storing the model output as a semantic model;
step S3, converting the knowledge base of the text representation into a knowledge base of the semantic vector representation, namely a vector knowledge base, wherein the set of the semantic vectors contained in the knowledge base is a semantic vector space;
s4, carrying out semantic segmentation on the semantic vector space by adopting a random forest, generating N binary trees by adopting the same semantic vector space, wherein N is a natural number which is more than or equal to 1, N is a balance value of performance and precision, and the N is required to be adjusted step by step according to an actual effect; wherein, each binary tree corresponds to a knowledge base of semantic representation of random segmentation, leaf nodes of each binary tree represent semantic vectors of not more than K pieces of knowledge, K is a natural number and satisfies 1.ltoreq.K.ltoreq.total vector number/N;
step S5, converting the user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
and S6, calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge.
Specifically, in connection with fig. 2-5, the method comprises the steps of:
the first step: corpus preparation and labeling.
Step 101: preparing the corpus for model training by collecting user questions and the corresponding knowledge in the knowledge base, where the user questions include positive and negative examples: positive examples match the knowledge and negative examples do not, in particular questions that are lexically similar to the knowledge but semantically unmatched.
Step 102: the training corpus is labeled in the format: user question + knowledge + label, where the label is "matched" or "not matched".
Second step: semantic model fine-tuning.
Step 201: based on the deep pre-trained Bert model, training the model on the corpus labeled in the previous step as a binary classification task.
Step 202: after training is completed, the model output is set to the pooled_output layer of the Bert model and saved as a semantic model.
Third step: knowledge base vector representation.
Step 301: each piece of knowledge in the knowledge base is converted to digital information using the vocab dictionary of the Bert model.
Step 302: the digital information is input into the fine-tuned semantic model for inference, and the knowledge's semantic expression vector is output.
Step 303: after all knowledge reasoning is completed, the knowledge base of the text representation is converted into a knowledge base of semantic vector representation, namely a vector knowledge base, and the set of semantic vectors contained in the knowledge base is a semantic vector space.
Fourth step: vector knowledge base projection index creation.
In many business scenarios, a random forest model is used as a classifier to classify big data. A random forest is an ensemble model built from decision trees; in practice it classifies by the voting results of multiple decision trees. Each internal node of a decision tree represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a category; every node except the leaves records the relevant information of the current node.
In this embodiment, a random forest is used to semantically partition the semantic vector space, generating N binary trees from the same semantic vector space, where N is a natural number greater than or equal to 1 that trades off performance against accuracy.
The method specifically comprises the following steps:
step 401: randomly selecting one semantic vector V in a vector knowledge base, and calculating cosine similarity between all semantic vectors in the vector knowledge base and the randomly selected semantic vector V;
step 402: dividing semantic vectors whose cosine similarity lies in (0, 1] into subspace A, and semantic vectors whose cosine similarity lies in [-1, 0] into subspace B;
step 403: referring to fig. 2, a semantic vector V is taken as a root node, a subspace a is taken as a left subtree, a subspace B is taken as a right subtree, and the semantic vector V, the subspace a and the subspace B form a binary tree;
step 404: repeating steps 401-403 for subspaces on all nodes until the number of semantic vectors in all subspaces is less than or equal to K;
step 405: repeating the steps for N times, and projecting the semantic vector space of the vector knowledge base into N binary trees, as shown in the figure 3.
Fifth step: knowledge hit calculation.
Step 501: the user question text is converted into digital information using the vocab dictionary of the Bert model.
Step 502: and inputting the digital information of the user question into the semantic model for reasoning, and outputting a semantic vector corresponding to the user question, namely, the semantic vector of the user question.
Step 503: any one of the N binary trees is selected.
Step 504: calculating the cosine similarity between the user question semantic vector and the current binary tree node; if it lies in (0, 1], descend to the left subtree node, otherwise descend to the right subtree node.
Step 505: repeating step 504 to search down the binary tree until a leaf node, i.e. the nearest leaf node, is found.
Step 506: repeating steps 503-505 to find the N nearest leaf nodes across all binary trees.
Step 507: de-duplicating the at most N × K semantic vectors contained in the N nearest leaf nodes to obtain M semantic vectors.
Step 508: and calculating cosine similarity between the M semantic vectors and the semantic vectors of the user question.
Step 509: sorting the M semantic vectors in descending order of cosine similarity and returning the semantic vector D with the highest similarity.
Step 510: comparing the similarity of semantic vector D with a preset distance threshold T, determined through repeated testing in the actual service; when the similarity exceeds T, knowledge is hit.
Embodiment two:
FIG. 6 is a schematic structural diagram of a semantic-based fast knowledge hit apparatus according to the present application, as shown in FIG. 6, the apparatus 100 includes:
the corpus labeling preparation module 110 is configured to prepare a corpus trained by a model, including user question sentences and knowledge in a corresponding knowledge base, and label whether the user question sentences are matched with the knowledge;
the semantic model fine-tuning module 120, configured to train a model on the labeled corpus as a binary classification task based on the Bert model, set the model output to the pooled_output layer of the Bert model after training is completed, and save it as a semantic model;
a knowledge base vector representation module 130, configured to convert a knowledge base of text representations into a knowledge base of semantic vector representations, i.e. a vector knowledge base, which includes a set of semantic vectors as a semantic vector space;
the binary tree generating module 140, configured to semantically partition the semantic vector space with a random forest, generating N binary trees from the same semantic vector space, where N is a natural number greater than or equal to 1 that trades off performance against accuracy and should be tuned incrementally based on observed results; each binary tree corresponds to one random partition of the semantic-vector knowledge base, each leaf node holds no more than K semantic vectors, and K is a natural number satisfying 1 ≤ K ≤ (total vector count / N);
the user question searching module 150 is configured to convert a user question into a corresponding semantic vector, that is, a user question semantic vector, traverse the N binary trees using the user question semantic vector, find N nearest leaf nodes, and aggregate and deduplicate semantic vectors contained in the N nearest leaf nodes to obtain M semantic vectors;
the knowledge hit calculation module 160 is configured to calculate the similarity between the M semantic vectors and the semantic vectors of the question of the user, and select the semantic vector with the highest similarity to determine the hit knowledge.
Specifically, referring to fig. 7, the corpus labeling preparation module 110 includes:
a collecting sub-module 111, configured to collect user questions and the knowledge in the corresponding knowledge base, where the user questions include both positive and negative examples: positive examples whose expression matches the knowledge, and negative examples whose expression does not match the knowledge, especially questions that are similar in wording but do not match in semantics;
the labeling sub-module 112 is configured to label whether the user question matches the knowledge, in the format: user question + knowledge + label, where the label is either matched or not matched.
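The labeling format above can be sketched as follows; the questions, knowledge entries, and the `to_training_example` helper are hypothetical illustrations, not part of the claimed system:

```python
# Hypothetical labeled-corpus entries in the "user question + knowledge +
# label" format above; all texts and labels are illustrative only.
labeled_corpus = [
    # (user question, knowledge entry, label: 1 = matched, 0 = not matched)
    ("How do I reset my password?", "Password reset procedure", 1),
    ("How do I reset my password?", "Account deletion procedure", 0),
    # negative pair that is similar in wording but different in semantics
    ("How do I close my account?", "How do I choose my account type?", 0),
]

def to_training_example(question, knowledge, label):
    """Pack one labeled pair into the sentence-pair form that a two-class
    BERT fine-tuning task typically consumes."""
    return {"text_a": question, "text_b": knowledge, "label": label}

examples = [to_training_example(*row) for row in labeled_corpus]
```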
Specifically, referring to fig. 8, the user question searching module 150 includes:
the user question semantic vector generation sub-module 151 is configured to run the user question through the semantic model and output the corresponding semantic vector, that is, the user question semantic vector;
a traversing submodule 152, configured to traverse the N binary trees, and find N nearest neighbor leaf nodes in all binary trees that match the user question semantic vector;
the deduplication processing sub-module 153 is configured to perform deduplication processing on all the semantic vectors of the found N nearest leaf nodes to obtain M semantic vectors, where the total number of semantic vectors across the N leaf nodes is less than or equal to N×K.
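A minimal sketch of this deduplication step, under the assumption that each semantic vector carries an integer id (the ids and toy vectors here are illustrative only):

```python
def collect_candidates(leaf_nodes):
    """Union the semantic vectors found in the N nearest leaf nodes and
    de-duplicate them, so the resulting M satisfies M <= N * K.  Vectors
    are keyed by an integer id, since raw float vectors are not hashable."""
    seen = {}
    for leaf in leaf_nodes:        # one nearest leaf per binary tree
        for vec_id, vec in leaf:   # each leaf holds at most K (id, vector) pairs
            seen.setdefault(vec_id, vec)
    return list(seen.values())     # the M de-duplicated candidate vectors

# toy example: N = 2 leaves with K = 2 vectors each, one vector shared,
# so M = 3 <= N * K = 4
leaf_a = [(1, [0.1, 0.9]), (2, [0.4, 0.6])]
leaf_b = [(2, [0.4, 0.6]), (3, [0.8, 0.2])]
candidates = collect_candidates([leaf_a, leaf_b])
```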
Specifically, referring to fig. 9, the knowledge hit calculation module 160 includes:
a similarity calculation submodule 161, configured to calculate cosine similarity between M semantic vectors and user question semantic vectors;
the similarity sorting sub-module 162 is configured to sort the M semantic vectors in descending order of cosine similarity, and return the semantic vector D with the highest similarity;
a judging sub-module 163, configured to judge whether the similarity of the returned semantic vector D exceeds a preset distance threshold T;
a determining submodule 164, configured to determine knowledge that the semantic vector D is hit when the similarity of the semantic vector D exceeds a preset distance threshold T.
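The hit calculation performed by sub-modules 161-164 can be sketched as follows; `hit_knowledge` and the toy vectors are hypothetical names used for illustration, not the claimed implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hit_knowledge(question_vec, candidates, threshold):
    """Rank the M candidate vectors by cosine similarity in descending
    order (sub-modules 161-162) and return the best candidate only if it
    clears the distance threshold T (sub-modules 163-164)."""
    ranked = sorted(candidates,
                    key=lambda kv: cosine(question_vec, kv[1]),
                    reverse=True)
    best_id, best_vec = ranked[0]
    best_sim = cosine(question_vec, best_vec)
    return (best_id, best_sim) if best_sim > threshold else (None, best_sim)

# toy example: two candidate knowledge vectors, one clearly closest
question_vec = [1.0, 0.0]
candidates = [("reset_pw", [0.9, 0.1]), ("delete_acct", [0.1, 0.9])]
hit, sim = hit_knowledge(question_vec, candidates, threshold=0.8)
```

When no candidate clears the threshold T, the function returns no hit, matching the judgment performed by sub-module 163.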
In another aspect, the present application further provides a computer device, including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor executes the computer readable instructions to implement the steps of the semantic-based fast knowledge hit method in the question-answering system.
In another aspect, the present application further provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the semantic-based fast knowledge hit method in the question-answering system described above.
In summary, the application discloses a semantic-based rapid knowledge hit method and device in a question-answering system. The method comprises: preparing and labeling the corpus; fine-tuning the semantic model; representing the knowledge base as vectors; creating a knowledge base vector projection index; and calculating the knowledge hit. In this technical scheme, a deep learning model is introduced to improve the quality of knowledge hits, and the matching algorithm is optimized to improve their speed, so that intelligent customer service can support an ever-larger knowledge base. With the method provided by the application, even with little or no labeled data, an intelligent question-answering system facing a huge knowledge base can achieve faster and more accurate knowledge hits, with a good user experience.
The above description of specific embodiments is given by way of example only, and the present invention is not limited to those embodiments. Any equivalent modifications and substitutions that would occur to those skilled in the art are also within the scope of the present invention; accordingly, equivalent changes and modifications made without departing from the spirit and scope of the present invention are intended to be covered thereby.

Claims (7)

1. A semantic-based rapid knowledge hit method in a question-answering system, comprising:
s1, preparing corpus used for model training, wherein the corpus comprises user questions and knowledge in a corresponding knowledge base, and labeling whether the user questions are matched with the knowledge;
s2, training the Bert model on the labeled corpus as a two-class classification task, setting the model output to the pooler_output (pooled output) layer of the Bert model after training is completed, and saving the result as a semantic model;
s3, converting the knowledge base of the text representation into a knowledge base of the semantic vector representation, namely a vector knowledge base, wherein a set of semantic vectors contained in the knowledge base is a semantic vector space;
s4, performing semantic segmentation of the semantic vector space using a random forest, generating N binary trees over the same semantic vector space, where N is a natural number greater than or equal to 1; each binary tree corresponds to a randomly partitioned knowledge base represented by semantic vectors, and each leaf node holds at most K semantic vectors, where K is a natural number satisfying 1 ≤ K ≤ (total vector count)/N;
s5, converting the user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
s6, calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity to determine the hit knowledge;
the step S4 includes the following steps S41 to S45:
s41: randomly selecting one semantic vector V in the vector knowledge base, and calculating cosine similarity between the semantic vectors in all the vector knowledge bases and the randomly selected semantic vector V;
s42: dividing semantic vectors with cosine similarity in the range (0, 1] into a first subspace, and semantic vectors with cosine similarity in the range [-1, 0] into a second subspace;
s43: the semantic vector V is taken as a root node, the first subspace is taken as a left subtree, the second subspace is taken as a right subtree, and the semantic vector V, the first subspace and the second subspace form a binary tree;
s44: repeating steps S41-S43 for subspaces on all nodes of the binary tree until the number of semantic vectors in all subspaces is less than or equal to K;
s45: repeating the steps for N times, and projecting the semantic vector space of the vector knowledge base into N binary trees;
wherein, the step S5 comprises the following steps S51 to S57:
s51: converting the text of the user question into digital information using the vocab dictionary of the Bert model;
s52: inputting the digital information of the user question into the semantic model for reasoning, and outputting a semantic vector corresponding to the user question, namely, a semantic vector of the user question;
s53: selecting any one of N binary trees;
s54: calculating the cosine similarity between the user question semantic vector and the vector at the current binary tree node; if the cosine similarity is within (0, 1], taking the left subtree node, otherwise taking the right subtree node;
s55: repeating the step S54, searching the binary tree until a leaf node of the binary tree, namely the nearest leaf node, is found;
s56: repeating the steps S53-S55, and finding N nearest leaf nodes in all binary trees;
s57: and performing de-duplication processing on all the semantic vectors of the N nearest leaf nodes to obtain M semantic vectors, wherein the number of all the semantic vectors of the N leaf nodes is less than or equal to N x K.
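Steps S41-S45 and S53-S55 can be sketched as a small random-projection forest; `build_tree` and `search_tree` are hypothetical helper names, and the toy 2-dimensional vectors stand in for real semantic vectors:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def build_tree(vectors, K, rng):
    """Steps S41-S45 for one tree: a randomly chosen pivot vector V splits
    the space by the sign of cosine similarity against V; recursion stops
    once every subspace holds at most K vectors."""
    if len(vectors) <= K:
        return {"leaf": vectors}
    pivot = rng.choice(vectors)
    left = [v for v in vectors if cosine(v, pivot) > 0]    # first subspace
    right = [v for v in vectors if cosine(v, pivot) <= 0]  # second subspace
    if not left or not right:  # degenerate split: stop and keep as a leaf
        return {"leaf": vectors}
    return {"pivot": pivot,
            "left": build_tree(left, K, rng),
            "right": build_tree(right, K, rng)}

def search_tree(tree, q):
    """Steps S53-S55: descend left on positive cosine similarity with the
    node's pivot, right otherwise, until a leaf (the nearest leaf node)."""
    while "leaf" not in tree:
        tree = tree["left"] if cosine(q, tree["pivot"]) > 0 else tree["right"]
    return tree["leaf"]

# toy 2-dimensional "semantic vectors" and a single tree with K = 2;
# repeating build_tree N times with different random seeds yields the forest
vectors = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, -0.2]]
tree = build_tree(vectors, 2, random.Random(0))
leaf = search_tree(tree, [1.0, 0.0])
```

In the full method, the N nearest leaf nodes found across the N trees are then unioned and de-duplicated as described in step S57.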
2. The method for semantic-based rapid knowledge hit in a question-answering system according to claim 1, wherein step S1 includes the steps of:
s11: collecting user questions and knowledge in the corresponding knowledge base, where the user questions include both positive and negative examples: positive examples whose expression matches the knowledge, and negative examples whose expression does not match the knowledge, including in particular questions that are similar in wording but do not match in semantics;
s12: labeling whether the user question is matched with the knowledge, wherein the labeling format is as follows: user question + knowledge + tag, wherein the tag is matched or not.
3. The method for semantic-based rapid knowledge hit in a question-answering system according to claim 1, wherein step S3 includes the steps of:
s31: converting each piece of knowledge in the knowledge base into digital information by using a vocab dictionary of the Bert model;
s32: inputting the digital information into the semantic model for reasoning, and outputting semantic expression vectors of knowledge;
s33: after all knowledge reasoning is completed, the knowledge base of the text representation is converted into the knowledge base of the semantic vector representation.
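Steps S31-S33 can be sketched as follows; the stub vocabulary and token ids are illustrative stand-ins for the Bert model's real vocab dictionary, not its actual contents:

```python
# A minimal sketch of steps S31-S33 under stated assumptions: the stub
# vocabulary and its token ids below are invented for illustration.
stub_vocab = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100,
              "how": 2129, "reset": 25141, "password": 20786}

def to_digital_information(text):
    """Convert knowledge text into 'digital information' (token ids) via
    the vocab dictionary, wrapped in the usual [CLS]/[SEP] markers."""
    ids = [stub_vocab.get(tok, stub_vocab["[UNK]"]) for tok in text.lower().split()]
    return [stub_vocab["[CLS]"]] + ids + [stub_vocab["[SEP]"]]

# In a real system these ids would be fed to the fine-tuned BERT and its
# pooled output taken as the knowledge's semantic vector; repeating this
# for every knowledge entry yields the vector knowledge base of step S33.
knowledge_base = ["How reset password", "delete account"]
digital_kb = [to_digital_information(k) for k in knowledge_base]
```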
4. The method for semantic-based rapid knowledge hit in a question-answering system according to claim 1, wherein step S6 includes the steps of:
s61: calculating cosine similarity between M semantic vectors and user question semantic vectors;
s62: sequencing the M semantic vectors according to the descending order of cosine similarity, and returning the semantic vector D with highest similarity;
s63: comparing the similarity of the semantic vector D with a preset distance threshold T; when the similarity of D exceeds T, the knowledge corresponding to D is the hit knowledge.
5. A semantic-based rapid knowledge hit device in a question-answering system, comprising:
the corpus labeling preparation module is used for preparing corpus trained by the model, comprising user question sentences and knowledge in a corresponding knowledge base, and labeling whether the user question sentences are matched with the knowledge;
the semantic model fine-tuning module is used for fine-tuning the Bert model on the labeled corpus as a two-class classification task, setting the model output to the pooler_output (pooled output) layer of the Bert model after training is completed, and saving the result as the semantic model;
the knowledge base vector representation module is used for converting the knowledge base of the text representation into the knowledge base of the semantic vector representation, namely a vector knowledge base, wherein the set of the semantic vectors contained in the knowledge base vector representation is a semantic vector space;
the binary tree generation module is used for performing semantic segmentation of the semantic vector space using a random forest, generating N binary trees over the same semantic vector space, where N is a natural number greater than or equal to 1; each binary tree corresponds to a randomly partitioned knowledge base represented by semantic vectors, and each leaf node holds at most K semantic vectors, where K is a natural number satisfying 1 ≤ K ≤ (total vector count)/N;
the user question searching module is used for converting a user question into a corresponding semantic vector, namely a user question semantic vector, traversing the N binary trees by using the user question semantic vector, searching N nearest leaf nodes, collecting and de-duplicating the semantic vectors contained in the N nearest leaf nodes, and obtaining M semantic vectors;
the knowledge hit calculation module is used for calculating the similarity between M semantic vectors and the semantic vectors of the user question, and selecting the semantic vector with the highest similarity as hit knowledge;
wherein the binary tree generation module is configured to implement the following steps M1 to M5:
m1: randomly selecting one semantic vector V in the vector knowledge base, and calculating cosine similarity between the semantic vectors in all the vector knowledge bases and the randomly selected semantic vector V;
m2: dividing semantic vectors with cosine similarity in the range (0, 1] into a first subspace, and semantic vectors with cosine similarity in the range [-1, 0] into a second subspace;
m3: the semantic vector V is taken as a root node, the first subspace is taken as a left subtree, the second subspace is taken as a right subtree, and the semantic vector V, the first subspace and the second subspace form a binary tree;
m4: repeating the steps M1-M3 for subspaces on all nodes of the binary tree until the number of semantic vectors in all subspaces is less than or equal to K;
m5: repeating the steps for N times, and projecting the semantic vector space of the vector knowledge base into N binary trees;
wherein, the user question searching module comprises:
the user question semantic vector generation sub-module is used for outputting semantic vectors corresponding to the user questions after reasoning the user questions through the semantic model, namely the user question semantic vectors;
the traversing sub-module is used for traversing N binary trees and searching N nearest neighbor leaf nodes matched with the user question semantic vector in all binary trees;
and the de-duplication processing sub-module is used for performing de-duplication processing on all the semantic vectors of the N nearest leaf nodes to obtain M semantic vectors, where the total number of semantic vectors across the N leaf nodes is less than or equal to N×K.
6. The semantic-based rapid knowledge hit apparatus in a question-answering system according to claim 5, wherein the corpus annotation preparation module comprises:
the collecting sub-module is used for collecting user questions and knowledge in the corresponding knowledge base, where the user questions include both positive and negative examples: positive examples whose expression matches the knowledge, and negative examples whose expression does not match the knowledge, including in particular questions that are similar in wording but do not match in semantics;
the labeling sub-module is used for labeling whether the question sentence of the user is matched with the knowledge, and the labeling format is as follows: user question + knowledge + tag, wherein the tag is matched or not.
7. The semantic-based rapid knowledge hit apparatus in a question-answering system according to claim 5, wherein the knowledge hit calculation module includes:
the similarity calculation sub-module is used for calculating cosine similarity between the M semantic vectors and the user question semantic vectors;
the similarity sorting sub-module is used for sorting the M semantic vectors according to the descending order of cosine similarity and returning the semantic vector D with the highest similarity;
the judging submodule is used for judging whether the similarity of the returned semantic vector D exceeds a preset distance threshold T;
the determining submodule is used for determining that the semantic vector is hit knowledge when the similarity of the semantic vector D exceeds a preset distance threshold T.
CN202110807421.5A 2021-07-16 2021-07-16 Semantic-based rapid knowledge hit method and device in question-answering system Active CN113722452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110807421.5A CN113722452B (en) 2021-07-16 2021-07-16 Semantic-based rapid knowledge hit method and device in question-answering system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110807421.5A CN113722452B (en) 2021-07-16 2021-07-16 Semantic-based rapid knowledge hit method and device in question-answering system

Publications (2)

Publication Number Publication Date
CN113722452A CN113722452A (en) 2021-11-30
CN113722452B true CN113722452B (en) 2024-01-19

Family

ID=78673527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110807421.5A Active CN113722452B (en) 2021-07-16 2021-07-16 Semantic-based rapid knowledge hit method and device in question-answering system

Country Status (1)

Country Link
CN (1) CN113722452B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114490975B (en) * 2021-12-31 2023-02-07 马上消费金融股份有限公司 User question labeling method and device

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437833B1 (en) * 2016-10-05 2019-10-08 Ontocord, LLC Scalable natural language processing for large and dynamic text environments
CN111090735A (en) * 2019-12-25 2020-05-01 成都航天科工大数据研究院有限公司 Intelligent question-answering method based on knowledge graph and performance evaluation method thereof
CN111444320A (en) * 2020-06-16 2020-07-24 太平金融科技服务(上海)有限公司 Text retrieval method and device, computer equipment and storage medium
CN111460798A (en) * 2020-03-02 2020-07-28 平安科技(深圳)有限公司 Method and device for pushing similar meaning words, electronic equipment and medium
CN111581354A (en) * 2020-05-12 2020-08-25 金蝶软件(中国)有限公司 FAQ question similarity calculation method and system
CN111639165A (en) * 2020-04-30 2020-09-08 南京理工大学 Intelligent question-answer optimization method based on natural language processing and deep learning
CN111813916A (en) * 2020-07-21 2020-10-23 润联软件系统(深圳)有限公司 Intelligent question and answer method, device, computer equipment and medium
CN112015915A (en) * 2020-09-01 2020-12-01 哈尔滨工业大学 Question-answering system and device based on knowledge base generated by questions
CN112084299A (en) * 2020-08-05 2020-12-15 山西大学 Reading comprehension automatic question-answering method based on BERT semantic representation
CN112100360A (en) * 2020-10-30 2020-12-18 北京淇瑀信息科技有限公司 Dialog response method, device and system based on vector retrieval
CN112395396A (en) * 2019-08-12 2021-02-23 科沃斯商用机器人有限公司 Question-answer matching and searching method, device, system and storage medium
CN112417884A (en) * 2020-11-05 2021-02-26 广州平云信息科技有限公司 Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration
CN112613320A (en) * 2019-09-19 2021-04-06 北京国双科技有限公司 Method and device for acquiring similar sentences, storage medium and electronic equipment
CN112989005A (en) * 2021-04-16 2021-06-18 重庆中国三峡博物馆 Knowledge graph common sense question-answering method and system based on staged query

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140358720A1 (en) * 2013-05-31 2014-12-04 Yahoo! Inc. Method and apparatus to build flowcharts for e-shopping recommendations
CN104573028B (en) * 2015-01-14 2019-01-25 百度在线网络技术(北京)有限公司 Realize the method and system of intelligent answer
US20210397926A1 (en) * 2018-09-29 2021-12-23 VII Philip Alvelda Data representations and architectures, systems, and methods for multi-sensory fusion, computing, and cross-domain generalization

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437833B1 (en) * 2016-10-05 2019-10-08 Ontocord, LLC Scalable natural language processing for large and dynamic text environments
CN112395396A (en) * 2019-08-12 2021-02-23 科沃斯商用机器人有限公司 Question-answer matching and searching method, device, system and storage medium
CN112613320A (en) * 2019-09-19 2021-04-06 北京国双科技有限公司 Method and device for acquiring similar sentences, storage medium and electronic equipment
CN111090735A (en) * 2019-12-25 2020-05-01 成都航天科工大数据研究院有限公司 Intelligent question-answering method based on knowledge graph and performance evaluation method thereof
CN111460798A (en) * 2020-03-02 2020-07-28 平安科技(深圳)有限公司 Method and device for pushing similar meaning words, electronic equipment and medium
CN111639165A (en) * 2020-04-30 2020-09-08 南京理工大学 Intelligent question-answer optimization method based on natural language processing and deep learning
CN111581354A (en) * 2020-05-12 2020-08-25 金蝶软件(中国)有限公司 FAQ question similarity calculation method and system
CN111444320A (en) * 2020-06-16 2020-07-24 太平金融科技服务(上海)有限公司 Text retrieval method and device, computer equipment and storage medium
CN111813916A (en) * 2020-07-21 2020-10-23 润联软件系统(深圳)有限公司 Intelligent question and answer method, device, computer equipment and medium
CN112084299A (en) * 2020-08-05 2020-12-15 山西大学 Reading comprehension automatic question-answering method based on BERT semantic representation
CN112015915A (en) * 2020-09-01 2020-12-01 哈尔滨工业大学 Question-answering system and device based on knowledge base generated by questions
CN112100360A (en) * 2020-10-30 2020-12-18 北京淇瑀信息科技有限公司 Dialog response method, device and system based on vector retrieval
CN112417884A (en) * 2020-11-05 2021-02-26 广州平云信息科技有限公司 Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration
CN112989005A (en) * 2021-04-16 2021-06-18 重庆中国三峡博物馆 Knowledge graph common sense question-answering method and system based on staged query

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the intelligentization of the "一网通办" (One-Stop Government Service) item element library; 徐智蕴; 严洁; 贝文馨; Modern Information Technology (Issue 09); full text *

Also Published As

Publication number Publication date
CN113722452A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN110188168B (en) Semantic relation recognition method and device
CN106570708B (en) Management method and system of intelligent customer service knowledge base
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN109145290B (en) Semantic similarity calculation method based on word vector and self-attention mechanism
CN111460149B (en) Text classification method, related device and readable storage medium
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN111339277A (en) Question-answer interaction method and device based on machine learning
CN111026886A (en) Multi-round dialogue processing method for professional scene
CN113704386A (en) Text recommendation method and device based on deep learning and related media
CN113342958B (en) Question-answer matching method, text matching model training method and related equipment
CN111078835A (en) Resume evaluation method and device, computer equipment and storage medium
CN112784590A (en) Text processing method and device
CN111666376A (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN109145083A (en) A kind of candidate answers choosing method based on deep learning
CN113722452B (en) Semantic-based rapid knowledge hit method and device in question-answering system
CN116932730B (en) Document question-answering method and related equipment based on multi-way tree and large-scale language model
CN111309926B (en) Entity linking method and device and electronic equipment
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113159187A (en) Classification model training method and device, and target text determining method and device
CN113761887A (en) Matching method and device based on text processing, computer equipment and storage medium
CN115203589A (en) Vector searching method and system based on Trans-dssm model
CN112579666A (en) Intelligent question-answering system and method and related equipment
CN111813941A (en) Text classification method, device, equipment and medium combining RPA and AI
CN112613320A (en) Method and device for acquiring similar sentences, storage medium and electronic equipment
CN117313748B (en) Multi-feature fusion semantic understanding method and device for government affair question and answer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 200435 11th Floor, Building 27, Lane 99, Shouyang Road, Jing'an District, Shanghai

Applicant after: Shanghai Tongban Information Service Co.,Ltd.

Address before: 200433 No. 11, Lane 100, Zhengtong Road, Yangpu District, Shanghai

Applicant before: Shanghai Tongban Information Service Co.,Ltd.

GR01 Patent grant